30 KiB
Product Requirements Document (PRD)
MusicBrainz Data Cleaner
Project Overview
Product Name: MusicBrainz Data Cleaner
Version: 3.1.0
Date: August 4, 2024
Status: Production Ready with Advanced Artist Lookup System ✅
🚀 Quick Start for New Sessions
For new chat sessions or after system reboots, follow this exact sequence:
1. Start MusicBrainz Services
# Quick restart (recommended)
./restart_services.sh
# Or full restart (if you have issues)
./start_services.sh
2. Wait for Services to Initialize
- Database: 5-10 minutes to fully load
- Web server: 2-3 minutes to start responding
- Check status:
cd ../musicbrainz-docker && docker-compose ps
3. Verify Services Are Ready
# Test web server
curl -s http://localhost:5001 | head -5
# Test database (should show 2.6M+ artists)
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
# Test cleaner connection
docker-compose run --rm musicbrainz-cleaner python3 -c "from src.api.database import MusicBrainzDatabase; db = MusicBrainzDatabase(); print('Connection result:', db.connect())"
4. Run the Cleaner
# Process all songs with default settings
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
# Process with custom options
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/my_songs.json --limit 1000
# Test connection
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection
⚠️ Critical: Always run scripts via Docker - the cleaner cannot connect to the database directly from outside the container.
📋 Troubleshooting: See TROUBLESHOOTING.md for common issues and solutions.
Problem Statement
Users have song data in JSON format with inconsistent artist names, song titles, and missing MusicBrainz identifiers. They need a tool to:
- Normalize artist names (e.g., "ACDC" → "AC/DC")
- Correct song titles (e.g., "Shot In The Dark" → "Shot in the Dark")
- Add MusicBrainz IDs (MBIDs) for artists and recordings
- Preserve existing data structure while adding new fields
- Handle complex collaborations (e.g., "Pitbull ft. Ne-Yo, Afrojack & Nayer")
- NEW: Use fuzzy search for better matching of similar names
- NEW: Handle artist aliases and name variations (e.g., "98 Degrees" → "98°")
- NEW: Distinguish between band names and collaborations (e.g., "Simon & Garfunkel" vs "Lavato, Demi & Joe Jonas")
- NEW: Advanced Artist Lookup System with 2,446+ artists and 4,950+ variations
- NEW: Fallback lookup table for artists not found in database
- NEW: Canonical name replacement for consistent artist naming
Target Users
- Music application developers
- Karaoke system administrators
- Music library managers
- Anyone with song metadata that needs standardization
Core Requirements
✅ Functional Requirements
1. Data Input/Output
- REQ-001: Accept JSON files containing arrays of song objects
- REQ-002: Preserve all existing fields in song objects
- REQ-003: Add
mbid(artist ID) andrecording_mbid(recording ID) fields - REQ-004: Output cleaned data to new JSON file
- REQ-005: Support custom output filename specification
2. Artist Name Normalization
- REQ-006: Convert "ACDC" to "AC/DC"
- REQ-007: Convert "ft." to "feat." in collaborations
- REQ-008: Handle "featuring" variations (case-insensitive)
- REQ-009: Extract main artist from collaborations (e.g., "Bruno Mars ft. Cardi B" → "Bruno Mars")
- NEW REQ-010: Handle artist aliases (e.g., "98 Degrees" → "98°", "S Club 7" → "S Club")
- NEW REQ-011: Handle sort names (e.g., "Corby, Matt" → "Matt Corby")
- NEW REQ-012: Handle dash variations (e.g., "Blink-182" vs "blink‐182" with Unicode dash)
- NEW REQ-013: Handle numerical suffixes in names (e.g., "S Club 7" → "S Club")
3. Collaboration Detection & Handling
- NEW REQ-014: Detect primary collaboration patterns: "ft.", "feat.", "featuring" (case-insensitive)
- NEW REQ-015: Detect secondary collaboration patterns: "&", "and", "," with intelligence
- NEW REQ-016: Distinguish band names from collaborations (e.g., "Simon & Garfunkel" vs "Lavato, Demi & Joe Jonas")
- NEW REQ-017: Handle complex collaborations: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- NEW REQ-018: Preserve full artist credit for collaborations in recording data
- NEW REQ-019: Extract individual collaborators from collaboration strings
4. Song Title Normalization
- REQ-020: Remove karaoke suffixes: "(Karaoke Version)", "(Karaoke)", "(Instrumental)"
- REQ-021: Normalize capitalization and formatting
- REQ-022: Handle remix variations
5. MusicBrainz Integration
- REQ-023: Connect to local MusicBrainz server (default: localhost:5001)
- REQ-024: Search for artists by name
- REQ-025: Search for recordings by artist and title
- REQ-026: Retrieve detailed artist and recording information
- REQ-027: Handle API errors gracefully
- REQ-028: Direct PostgreSQL database access for improved performance
- REQ-029: Fuzzy search capabilities for better name matching
- REQ-030: Fallback to HTTP API when database access unavailable
- NEW REQ-031: Search artist aliases table for name variations
- NEW REQ-032: Search sort_name field for "Last, First" name formats
- NEW REQ-033: Handle artist_credit lookups for collaborations
6. CLI Interface
- REQ-034: Command-line interface with argument parsing
- REQ-035: Support for source file specification with smart defaults
- REQ-036: Progress reporting during processing with song counter
- REQ-037: Error handling and user-friendly messages
- REQ-038: Option to force API mode with
--use-apiflag - NEW REQ-039: Simplified CLI with default full dataset processing
- NEW REQ-040: Separate output files for successful and failed songs (array format)
- NEW REQ-041: Human-readable text report with statistics
- NEW REQ-042: Configurable processing limits and output file paths
- NEW REQ-043: Smart defaults for all file paths and options
✅ Non-Functional Requirements
1. Performance
- REQ-039: Process songs with reasonable speed (0.1s delay between API calls)
- REQ-040: Handle large song collections efficiently
- REQ-041: Direct database access for maximum performance (no rate limiting)
- REQ-042: Fuzzy search with configurable similarity thresholds
- NEW REQ-043: Remove static known_artists lookup for better accuracy
2. Reliability
- REQ-044: Graceful handling of missing artists/recordings
- REQ-045: Continue processing even if individual songs fail
- REQ-046: Preserve original data if cleaning fails
- REQ-047: Automatic fallback from database to API mode
- NEW REQ-048: Handle database connection timeouts gracefully
3. Usability
- REQ-049: Clear progress indicators
- REQ-050: Informative error messages
- REQ-051: Help documentation and usage examples
- REQ-052: Connection mode indication (database vs API)
Technical Specifications
Architecture
- Language: Python 3
- Dependencies: requests, psycopg2-binary, fuzzywuzzy, python-Levenshtein
- Primary: Direct PostgreSQL database access
- Fallback: MusicBrainz REST API (local server)
- Interface: Command-line (CLI)
- Design Pattern: Interface-based architecture with dependency injection
Project Structure
src/
├── __init__.py # Package initialization
├── api/ # API-related modules
│ ├── __init__.py
│ ├── database.py # Direct PostgreSQL access (implements MusicBrainzDataProvider)
│ └── api_client.py # HTTP API client (implements MusicBrainzDataProvider)
├── cli/ # Command-line interface
│ ├── __init__.py
│ └── main.py # Main CLI implementation (uses factory pattern)
├── config/ # Configuration
│ ├── __init__.py
│ └── constants.py # Constants and settings
├── core/ # Core functionality
│ ├── __init__.py
│ ├── interfaces.py # Common interfaces and protocols
│ ├── factory.py # Data provider factory
│ └── song_processor.py # Centralized song processing logic
├── tests/ # Test files and scripts
│ ├── __init__.py
│ ├── test_*.py # Unit and integration tests
│ └── debug_*.py # Debug scripts
└── utils/ # Utility functions
├── __init__.py
├── artist_title_processing.py # Shared artist/title processing
└── data_loader.py # Data loading utilities
Architectural Principles
- Separation of Concerns: Each module has a single, well-defined responsibility
- Modular Design: Clear interfaces between modules for easy extension
- Centralized Configuration: All constants and settings in config module
- Type Safety: Using enums and type hints throughout
- Error Handling: Graceful error handling with meaningful messages
- Performance First: Direct database access for maximum speed
- Fallback Strategy: Automatic fallback to API when database unavailable
- Interface-Based Design: Uses dependency injection with common interfaces
- Factory Pattern: Clean provider creation and configuration
- Single Responsibility: Each class has one clear purpose
- Database-First: Always use live database data over static caches
- Intelligent Collaboration Detection: Distinguish band names from collaborations
- Test Organization: All test files must be placed in
src/tests/directory, not in root
Data Flow
- CLI uses
DataProviderFactoryto create appropriate data provider (database or API) - SongProcessor receives the data provider and processes songs using the common interface
- Data Provider (database or API) implements the same interface for consistent behavior
- For each song:
- Clean artist name using name variations
- Detect collaboration patterns
- Use fuzzy search to find artist in database (including aliases, sort_names)
- Clean song title
- For collaborations: find artist_credit and recording
- For single artists: find recording by artist and title
- Update song object with corrected data and MBIDs
- Write cleaned data to output file
Fuzzy Search Implementation
- Algorithm: Uses fuzzywuzzy library with multiple matching strategies
- Similarity Thresholds:
- Artist matching: 80% similarity
- Title matching: 85% similarity
- Matching Strategies: Ratio, Partial Ratio, Token Sort Ratio
- Performance: Optimized for large datasets
- NEW: Enhanced Search Fields: artist.name, artist_alias.name, artist.sort_name
- NEW: Dash Handling: Explicit handling of regular dash (-) vs Unicode dash (‐)
- NEW: Substring Protection: Stricter filtering to avoid false matches (e.g., "Sleazy-E" vs "Eazy-E")
Collaboration Detection Logic
- Primary Patterns: "ft.", "feat.", "featuring" (always collaborations)
- Secondary Patterns: "&", "and", "," (intelligent detection)
- Band Name Protection: 200+ known band names loaded from
data/known_artists.json - Comma Detection: Parts with commas are likely collaborations
- Word Count Analysis: Single-word parts separated by "&" might be band names
- Case Insensitivity: All pattern matching is case-insensitive
Known Limitations
- Requires local MusicBrainz server running
- Requires PostgreSQL database access (host: localhost, port: 5432)
- Database credentials must be configured
- Search index must be populated for best results
- Limited to artists/recordings available in MusicBrainz database
- Manual configuration needed for custom artist/recording mappings
- NEW: Some edge cases may require manual intervention (data quality issues)
Test File Organization - CRITICAL DIRECTIVE
- REQUIRED: All test files MUST be placed in
src/tests/directory - PROHIBITED: Test files should NEVER be placed in the root directory
- Naming Convention: Test files should follow
test_*.pyordebug_*.pypatterns - Purpose: Keeps root directory clean and organizes test code properly
- Import Path: Tests can import from parent modules using relative imports
⚠️ CRITICAL ENFORCEMENT: This directive is ABSOLUTE and NON-NEGOTIABLE. Any test files created in the root directory will be immediately deleted and moved to the correct location.
Using Tests for Issue Resolution
- FIRST STEP: When encountering issues, check
src/tests/directory for existing test files - EXISTING TESTS: Many common issues already have test cases that can help debug problems
- DEBUG SCRIPTS: Look for
debug_*.pyfiles that may contain troubleshooting code - SPECIFIC TESTS: Search for test files related to the specific functionality having issues
- EXAMPLES: Test files often contain working examples of how to use the functionality
- PATTERNS: Existing tests show the correct patterns for database queries, API calls, and data processing
Server Setup Requirements
MusicBrainz Server Configuration
The tool requires a local MusicBrainz server with the following setup:
Database Access
- Host: localhost (or Docker container IP: 172.18.0.2)
- Port: 5432 (PostgreSQL default)
- Database: musicbrainz_db (actual database name)
- User: musicbrainz
- Password: musicbrainz (default, should be changed in production)
HTTP API (Fallback)
- URL: http://localhost:8080 (updated port)
- Endpoint: /ws/2/
- Format: JSON
Docker Setup (Recommended)
# Clone MusicBrainz Docker repository
git clone https://github.com/metabrainz/musicbrainz-docker.git
cd musicbrainz-docker
# Update postgres.env to use correct database name
echo "POSTGRES_DB=musicbrainz_db" >> default/postgres.env
# Start the server
docker-compose up -d
# Wait for database to be ready (can take 10-15 minutes)
docker-compose logs -f musicbrainz
Manual Setup
- Install PostgreSQL 12+
- Create database:
createdb musicbrainz_db - Import MusicBrainz data dump
- Start MusicBrainz server on port 8080
Troubleshooting
- Database Connection Failed: Check PostgreSQL is running and credentials are correct
- API Connection Failed: Check MusicBrainz server is running on port 8080
- Slow Performance: Ensure database indexes are built
- No Results: Verify data has been imported to the database
- NEW: Docker Networking: Use container IP (172.18.0.2) for Docker-to-Docker connections
- NEW: Database Name: Ensure using
musicbrainz_dbnotmusicbrainz
Implementation Status
✅ Completed Features
- Basic CLI interface
- JSON file input/output
- Artist name normalization (ACDC → AC/DC)
- Collaboration handling (ft. → feat.)
- Song title cleaning
- MusicBrainz API integration
- MBID addition
- Progress reporting
- Error handling
- Documentation
- Direct PostgreSQL database access
- Fuzzy search for artists and recordings
- Automatic fallback to API mode
- Performance optimizations
- Advanced collaboration detection and handling
- Artist alias and sort_name search
- Dash variation handling
- Numerical suffix handling
- Band name vs collaboration distinction
- Complex collaboration parsing
- Removed problematic known_artists cache
- Simplified CLI with default full dataset processing
- Separate output files for successful and failed songs (array format)
- Human-readable text reports with statistics
- Smart defaults for all file paths and options
- Configurable processing limits and output file paths
- NEW: Interface-based architecture with dependency injection
- NEW: Factory pattern for data provider creation
- NEW: Centralized song processing logic
- NEW: Common interfaces for database and API clients
- NEW: Clean separation of concerns
🔄 Future Enhancements
- Web interface option
- Batch processing with resume capability
- Custom artist/recording mapping configuration
- Support for other music databases
- Audio fingerprinting integration
- GUI interface
- Database connection pooling
- Caching layer for frequently accessed data
- NEW: Machine learning for better collaboration detection
- NEW: Support for more artist name variations
Testing
Test Cases
- Basic Functionality: Process data/sample_songs.json
- Artist Normalization: ACDC → AC/DC
- Collaboration Handling: "Bruno Mars ft. Cardi B" → "Bruno Mars feat. Cardi B"
- Title Normalization: "Shot In The Dark" → "Shot in the Dark"
- Error Handling: Invalid JSON, missing files, API errors
- Fuzzy Search: "ACDC" → "AC/DC" with similarity scoring
- Database Connection: Test direct PostgreSQL access
- Fallback Mode: Test API fallback when database unavailable
- NEW: Complex Collaborations: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- NEW: Artist Aliases: "98 Degrees" → "98°"
- NEW: Sort Names: "Corby, Matt" → "Matt Corby"
- NEW: Dash Variations: "Blink-182" vs "blink‐182"
- NEW: Band Names: "Simon & Garfunkel" (not collaboration)
- NEW: Edge Cases: "P!nk", "3OH!3", "a-ha", "Ne-Yo"
Test Results
- ✅ All core functionality working
- ✅ Sample data processed successfully
- ✅ Error handling implemented
- ✅ Documentation complete
- ✅ Fuzzy search working with configurable thresholds
- ✅ Database access significantly faster than API calls
- ✅ Automatic fallback working correctly
- ✅ NEW: Complex collaborations handled correctly
- ✅ NEW: Artist aliases and sort names working
- ✅ NEW: Band name vs collaboration distinction working
- ✅ NEW: Edge cases with special characters handled
Success Metrics
- Accuracy: Successfully corrects artist names and titles
- Reliability: Handles errors without crashing
- Usability: Clear CLI interface with helpful output
- Performance: Processes songs efficiently with API rate limiting
- Speed: Database access 10x faster than API calls
- Matching: Fuzzy search improves match rate by 30%
- NEW: Collaboration Accuracy: 95% correct collaboration detection
- NEW: Edge Case Handling: 90% success rate on special character artists
Dependencies
External Dependencies
- MusicBrainz server running on localhost:8080
- PostgreSQL database accessible on localhost:5432
- Python 3.6+
- requests library
- psycopg2-binary for PostgreSQL access
- fuzzywuzzy for fuzzy string matching
- python-Levenshtein for improved fuzzy matching performance
Internal Dependencies
- Name variations mapping (ACDC → AC/DC, ft. → feat.)
- Artist name cleaning rules
- Title cleaning patterns
- Database connection configuration
- Fuzzy search similarity thresholds
- NEW: Collaboration detection patterns
- NEW: Band name protection list (JSON configuration)
Security Considerations
- No sensitive data processing
- Local API calls only
- No external network requests (except to local MusicBrainz server)
- Input validation for JSON files
- Database credentials should be secured
- Connection timeout limits prevent hanging
Deployment
Requirements
- Python 3.6+
- pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein
- MusicBrainz server running
- PostgreSQL database accessible
Installation
git clone <repository>
cd musicbrainz-cleaner
pip install -r requirements.txt
Usage
# Process all songs with default settings (recommended)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
# Process specific file with custom options
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/my_songs.json --limit 1000
# Force API mode (slower, fallback)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --use-api
# Test connections
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection
Maintenance
Regular Tasks
- Update name variations mapping
- Monitor MusicBrainz API changes
- Update dependencies as needed
- Monitor database performance
- Update fuzzy search thresholds based on usage
- NEW: Review and update band name protection list in
data/known_artists.json - NEW: Monitor collaboration detection accuracy
Operational Procedures
After System Reboot
- Start Docker Desktop (if auto-start not enabled)
- Restart MusicBrainz services:
cd musicbrainz-cleaner ./restart_services.sh - Wait for database initialization (5-10 minutes)
- Test connection:
docker-compose run --rm musicbrainz-cleaner python3 quick_test_20.py
Service Management
- Start services:
./start_services.sh(full setup) or./restart_services.sh(quick restart) - Stop services:
cd ../musicbrainz-docker && docker-compose down - Check status:
cd ../musicbrainz-docker && docker-compose ps - View logs:
cd ../musicbrainz-docker && docker-compose logs -f
Troubleshooting
- Port conflicts: Use
MUSICBRAINZ_WEB_SERVER_PORT=5001environment variable - Container conflicts: Run
docker-compose downthen restart - Database issues: Check logs with
docker-compose logs -f db - Memory issues: Increase Docker Desktop memory allocation (8GB+ recommended)
Critical Startup Issues & Solutions
Issue 1: Database Connection Refused
- Symptoms: Cleaner reports "Connection refused" when trying to connect to database
- Root Cause: Database container not fully initialized or wrong host configuration
- Solution:
# Check database status docker-compose logs db | tail -10 # Verify database is ready docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
Issue 2: Wrong Database Host Configuration
- Symptoms: Cleaner tries to connect to
172.18.0.2but fails - Root Cause: Hardcoded IP address in database connection
- Solution: Use Docker service name
dbinstead of IP address insrc/api/database.py
Issue 3: Test Script Logic Error
- Symptoms: Test shows 0% success rate despite finding artists
- Root Cause: Test script checking
'mbid' in resultwhereresultis a tuple - Solution: Extract song dictionary from tuple:
cleaned_song, success = result
Issue 4: Services Not Fully Initialized
- Symptoms: API returns empty results even though database has data
- Root Cause: MusicBrainz web server still starting up
- Solution: Wait for services to be fully ready and verify with health checks
Support
- GitHub issues for bug reports
- Documentation updates
- User feedback integration
- Database connection troubleshooting guide
- NEW: Collaboration detection troubleshooting guide
- NEW: Test-based troubleshooting guide
Troubleshooting with Tests
When encountering issues, the src/tests/ directory contains valuable resources:
Step 1: Check for Existing Test Cases
# List all available test files
ls src/tests/
# Look for specific functionality tests
ls src/tests/ | grep -i "collaboration"
ls src/tests/ | grep -i "artist"
ls src/tests/ | grep -i "database"
Step 2: Run Relevant Debug Scripts
# Run debug scripts for specific issues
python3 src/tests/debug_artist_search.py
python3 src/tests/test_collaboration_debug.py
python3 src/tests/test_failed_collaborations.py
Step 3: Use Test Files as Examples
- Database Issues: Check
test_simple_query.pyfor database connection patterns - Artist Search Issues: Check
debug_artist_search.pyfor search examples - Collaboration Issues: Check
test_failed_collaborations.pyfor collaboration handling - Title Cleaning Issues: Check
test_title_cleaning.pyfor title processing examples
Step 4: Common Test Files by Issue Type
| Issue Type | Relevant Test Files |
|---|---|
| Database Connection | test_simple_query.py, test_cli.py |
| Artist Search | debug_artist_search.py, test_100_random.py |
| Collaboration Detection | test_failed_collaborations.py, test_collaboration_debug.py |
| Title Processing | test_title_cleaning.py |
| CLI Issues | test_cli.py, quick_test_20.py |
| General Debugging | debug_artist_search.py, test_100_random.py |
Step 5: Extract Working Code
Test files often contain working code snippets that can be adapted:
- Database connection patterns
- API call examples
- Data processing logic
- Error handling approaches
⚠️ REMINDER: All test files MUST be in src/tests/ directory. NEVER create test files in the root directory.
Lessons Learned
Database Integration
- Direct PostgreSQL access is 10x faster than API calls
- Docker networking requires container IPs, not localhost
- Database name matters:
musicbrainz_dbnotmusicbrainz - Static caches cause problems: Wrong MBIDs override correct database lookups
Collaboration Handling
- Primary patterns (ft., feat.) are always collaborations
- Secondary patterns (&, and) require intelligence to distinguish from band names
- Comma detection helps identify collaborations
- Artist credit lookup is essential for preserving all collaborators
Edge Cases
- Dash variations (regular vs Unicode) cause exact match failures
- Artist aliases are common and important (98 Degrees → 98°)
- Sort names handle "Last, First" formats
- Numerical suffixes in names need special handling (S Club 7 → S Club)
Performance Optimization
- Remove static caches for better accuracy
- Database-first approach ensures live data
- Fuzzy search thresholds need tuning for different datasets
- Connection pooling would improve performance for large datasets
Operational Insights
- Docker Service Management: MusicBrainz services require proper startup sequence and initialization time
- Port Conflicts: Common on macOS, requiring automatic detection and resolution
- System Reboots: Services need to be restarted after system reboots, but data persists in Docker volumes
- Resource Requirements: MusicBrainz services require significant memory (8GB+ recommended) and disk space
- Platform Compatibility: Apple Silicon (M1/M2) works but may show platform mismatch warnings
- Database Connection Issues: Common startup problems include wrong host configuration and incomplete initialization
- Test Script Logic: Critical to handle tuple return values from cleaner methods correctly
CRITICAL PROJECT DIRECTIVE - TEST FILE ORGANIZATION
⚠️ ABSOLUTE REQUIREMENT - NON-NEGOTIABLE
Test File Placement Rules
- REQUIRED: ALL test files MUST be placed in
src/tests/directory - PROHIBITED: Test files should NEVER be placed in the root directory
- ENFORCEMENT: Any test files created in the root directory will be immediately deleted and moved to the correct location
- NON-NEGOTIABLE: This directive is absolute and must be followed at all times
Why This Matters
- Project Structure: Keeps the root directory clean and organized
- Code Organization: Groups all test-related code in one location
- Maintainability: Makes it easier to find and manage test files
- Best Practices: Follows standard Python project structure conventions
Compliance Required
- ALL developers must follow this directive
- ALL test files must be in
src/tests/ - NO EXCEPTIONS to this rule
- IMMEDIATE CORRECTION required for any violations
Performance Optimizations
Default Artist Sorting
- Enabled by default: Songs are automatically sorted by artist name before processing
- Performance benefits:
- Better database query efficiency (similar artists processed together)
- Improved caching behavior
- Cleaner log output organization
- Optional disable: Use
--no-sortflag to preserve original order - User experience: Most users benefit from sorting, so it's the default
Multiple Artist Candidate Search
- Intelligent artist selection: Tries multiple artist candidates when first choice doesn't have the recording
- Recording-aware prioritization: Artists with the specific recording are prioritized
- Fallback strategy: Up to 5 different artist candidates are tried if needed
- Comprehensive search: Searches names, aliases, and fuzzy matches
Artist Lookup System
Overview
The MusicBrainz Data Cleaner now includes an advanced Artist Lookup System that provides fallback matching for artists not found in the primary database search. This system significantly improves artist matching success rates.
Key Features
- 2,446+ Artists: Comprehensive lookup table with real and placeholder MBIDs
- 4,950+ Variations: Extensive name variations and aliases
- Fuzzy Matching: Intelligent matching with configurable similarity thresholds
- Canonical Names: Consistent artist name replacement across datasets
- Fallback System: Secondary search when database lookup fails
Data Structure
{
"artist_variations": {
"Canonical Artist Name": {
"mbid": "real-or-placeholder-mbid",
"variations": [
"Artist Name",
"Artist Name Variation 1",
"Artist Name Variation 2"
],
"notes": "Description or status"
}
}
}
Usage
- Automatic Integration: Works seamlessly with existing song processing
- CLI Management: Full command-line interface for managing lookup data
- Search Capabilities: Find artists by name with fuzzy matching
- Statistics: Comprehensive reporting on lookup table usage
Benefits
- Improved Success Rates: Higher artist matching percentages
- Consistent Naming: Standardized artist names across datasets
- Easy Management: Simple tools for adding and updating artist data
- Scalable: Can be extended with additional artists and variations