27 KiB
Product Requirements Document (PRD)
MusicBrainz Data Cleaner
Project Overview
Product Name: MusicBrainz Data Cleaner
Version: 3.0.0
Date: December 19, 2024
Status: Production Ready with Advanced Database Integration ✅
🚀 Quick Start for New Sessions
For new chat sessions or after system reboots, follow this exact sequence:
1. Start MusicBrainz Services
# Quick restart (recommended)
./restart_services.sh
# Or full restart (if you have issues)
./start_services.sh
2. Wait for Services to Initialize
- Database: 5-10 minutes to fully load
- Web server: 2-3 minutes to start responding
- Check status:
cd ../musicbrainz-docker && docker-compose ps
3. Verify Services Are Ready
# Test web server
curl -s http://localhost:5001 | head -5
# Test database (should show 2.6M+ artists)
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
# Test cleaner connection
docker-compose run --rm musicbrainz-cleaner python3 -c "from src.api.database import MusicBrainzDatabase; db = MusicBrainzDatabase(); print('Connection result:', db.connect())"
4. Run the Cleaner
# Process all songs with default settings
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
# Process with custom options
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/my_songs.json --limit 1000
# Test connection
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection
⚠️ Critical: Always run scripts via Docker - the cleaner cannot connect to the database directly from outside the container.
📋 Troubleshooting: See TROUBLESHOOTING.md for common issues and solutions.
Problem Statement
Users have song data in JSON format with inconsistent artist names, song titles, and missing MusicBrainz identifiers. They need a tool to:
- Normalize artist names (e.g., "ACDC" → "AC/DC")
- Correct song titles (e.g., "Shot In The Dark" → "Shot in the Dark")
- Add MusicBrainz IDs (MBIDs) for artists and recordings
- Preserve existing data structure while adding new fields
- Handle complex collaborations (e.g., "Pitbull ft. Ne-Yo, Afrojack & Nayer")
- NEW: Use fuzzy search for better matching of similar names
- NEW: Handle artist aliases and name variations (e.g., "98 Degrees" → "98°")
- NEW: Distinguish between band names and collaborations (e.g., "Simon & Garfunkel" vs "Lavato, Demi & Joe Jonas")
Target Users
- Music application developers
- Karaoke system administrators
- Music library managers
- Anyone with song metadata that needs standardization
Core Requirements
✅ Functional Requirements
1. Data Input/Output
- REQ-001: Accept JSON files containing arrays of song objects
- REQ-002: Preserve all existing fields in song objects
- REQ-003: Add
mbid(artist ID) andrecording_mbid(recording ID) fields - REQ-004: Output cleaned data to new JSON file
- REQ-005: Support custom output filename specification
2. Artist Name Normalization
- REQ-006: Convert "ACDC" to "AC/DC"
- REQ-007: Convert "ft." to "feat." in collaborations
- REQ-008: Handle "featuring" variations (case-insensitive)
- REQ-009: Extract main artist from collaborations (e.g., "Bruno Mars ft. Cardi B" → "Bruno Mars")
- NEW REQ-010: Handle artist aliases (e.g., "98 Degrees" → "98°", "S Club 7" → "S Club")
- NEW REQ-011: Handle sort names (e.g., "Corby, Matt" → "Matt Corby")
- NEW REQ-012: Handle dash variations (e.g., "Blink-182" vs "blink‐182" with Unicode dash)
- NEW REQ-013: Handle numerical suffixes in names (e.g., "S Club 7" → "S Club")
3. Collaboration Detection & Handling
- NEW REQ-014: Detect primary collaboration patterns: "ft.", "feat.", "featuring" (case-insensitive)
- NEW REQ-015: Detect secondary collaboration patterns: "&", "and", "," with intelligence
- NEW REQ-016: Distinguish band names from collaborations (e.g., "Simon & Garfunkel" vs "Lavato, Demi & Joe Jonas")
- NEW REQ-017: Handle complex collaborations: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- NEW REQ-018: Preserve full artist credit for collaborations in recording data
- NEW REQ-019: Extract individual collaborators from collaboration strings
4. Song Title Normalization
- REQ-020: Remove karaoke suffixes: "(Karaoke Version)", "(Karaoke)", "(Instrumental)"
- REQ-021: Normalize capitalization and formatting
- REQ-022: Handle remix variations
5. MusicBrainz Integration
- REQ-023: Connect to local MusicBrainz server (default: localhost:5001)
- REQ-024: Search for artists by name
- REQ-025: Search for recordings by artist and title
- REQ-026: Retrieve detailed artist and recording information
- REQ-027: Handle API errors gracefully
- REQ-028: Direct PostgreSQL database access for improved performance
- REQ-029: Fuzzy search capabilities for better name matching
- REQ-030: Fallback to HTTP API when database access unavailable
- NEW REQ-031: Search artist aliases table for name variations
- NEW REQ-032: Search sort_name field for "Last, First" name formats
- NEW REQ-033: Handle artist_credit lookups for collaborations
6. CLI Interface
- REQ-034: Command-line interface with argument parsing
- REQ-035: Support for source file specification with smart defaults
- REQ-036: Progress reporting during processing with song counter
- REQ-037: Error handling and user-friendly messages
- REQ-038: Option to force API mode with
--use-apiflag - NEW REQ-039: Simplified CLI with default full dataset processing
- NEW REQ-040: Separate output files for successful and failed songs (array format)
- NEW REQ-041: Human-readable text report with statistics
- NEW REQ-042: Configurable processing limits and output file paths
- NEW REQ-043: Smart defaults for all file paths and options
✅ Non-Functional Requirements
1. Performance
- REQ-039: Process songs with reasonable speed (0.1s delay between API calls)
- REQ-040: Handle large song collections efficiently
- REQ-041: Direct database access for maximum performance (no rate limiting)
- REQ-042: Fuzzy search with configurable similarity thresholds
- NEW REQ-043: Remove static known_artists lookup for better accuracy
2. Reliability
- REQ-044: Graceful handling of missing artists/recordings
- REQ-045: Continue processing even if individual songs fail
- REQ-046: Preserve original data if cleaning fails
- REQ-047: Automatic fallback from database to API mode
- NEW REQ-048: Handle database connection timeouts gracefully
3. Usability
- REQ-049: Clear progress indicators
- REQ-050: Informative error messages
- REQ-051: Help documentation and usage examples
- REQ-052: Connection mode indication (database vs API)
Technical Specifications
Architecture
- Language: Python 3
- Dependencies: requests, psycopg2-binary, fuzzywuzzy, python-Levenshtein
- Primary: Direct PostgreSQL database access
- Fallback: MusicBrainz REST API (local server)
- Interface: Command-line (CLI)
- Design Pattern: Interface-based architecture with dependency injection
Project Structure
src/
├── __init__.py # Package initialization
├── api/ # API-related modules
│ ├── __init__.py
│ ├── database.py # Direct PostgreSQL access (implements MusicBrainzDataProvider)
│ └── api_client.py # HTTP API client (implements MusicBrainzDataProvider)
├── cli/ # Command-line interface
│ ├── __init__.py
│ └── main.py # Main CLI implementation (uses factory pattern)
├── config/ # Configuration
│ ├── __init__.py
│ └── constants.py # Constants and settings
├── core/ # Core functionality
│ ├── __init__.py
│ ├── interfaces.py # Common interfaces and protocols
│ ├── factory.py # Data provider factory
│ └── song_processor.py # Centralized song processing logic
├── tests/ # Test files and scripts
│ ├── __init__.py
│ ├── test_*.py # Unit and integration tests
│ └── debug_*.py # Debug scripts
└── utils/ # Utility functions
├── __init__.py
├── artist_title_processing.py # Shared artist/title processing
└── data_loader.py # Data loading utilities
Architectural Principles
- Separation of Concerns: Each module has a single, well-defined responsibility
- Modular Design: Clear interfaces between modules for easy extension
- Centralized Configuration: All constants and settings in config module
- Type Safety: Using enums and type hints throughout
- Error Handling: Graceful error handling with meaningful messages
- Performance First: Direct database access for maximum speed
- Fallback Strategy: Automatic fallback to API when database unavailable
- Interface-Based Design: Uses dependency injection with common interfaces
- Factory Pattern: Clean provider creation and configuration
- Single Responsibility: Each class has one clear purpose
- Database-First: Always use live database data over static caches
- Intelligent Collaboration Detection: Distinguish band names from collaborations
- Test Organization: All test files must be placed in
src/tests/directory, not in root
Data Flow
- CLI uses
DataProviderFactoryto create appropriate data provider (database or API) - SongProcessor receives the data provider and processes songs using the common interface
- Data Provider (database or API) implements the same interface for consistent behavior
- For each song:
- Clean artist name using name variations
- Detect collaboration patterns
- Use fuzzy search to find artist in database (including aliases, sort_names)
- Clean song title
- For collaborations: find artist_credit and recording
- For single artists: find recording by artist and title
- Update song object with corrected data and MBIDs
- Write cleaned data to output file
Fuzzy Search Implementation
- Algorithm: Uses fuzzywuzzy library with multiple matching strategies
- Similarity Thresholds:
- Artist matching: 80% similarity
- Title matching: 85% similarity
- Matching Strategies: Ratio, Partial Ratio, Token Sort Ratio
- Performance: Optimized for large datasets
- NEW: Enhanced Search Fields: artist.name, artist_alias.name, artist.sort_name
- NEW: Dash Handling: Explicit handling of regular dash (-) vs Unicode dash (‐)
- NEW: Substring Protection: Stricter filtering to avoid false matches (e.g., "Sleazy-E" vs "Eazy-E")
Collaboration Detection Logic
- Primary Patterns: "ft.", "feat.", "featuring" (always collaborations)
- Secondary Patterns: "&", "and", "," (intelligent detection)
- Band Name Protection: 200+ known band names loaded from
data/known_artists.json - Comma Detection: Parts with commas are likely collaborations
- Word Count Analysis: Single-word parts separated by "&" might be band names
- Case Insensitivity: All pattern matching is case-insensitive
Known Limitations
- Requires local MusicBrainz server running
- Requires PostgreSQL database access (host: localhost, port: 5432)
- Database credentials must be configured
- Search index must be populated for best results
- Limited to artists/recordings available in MusicBrainz database
- Manual configuration needed for custom artist/recording mappings
- NEW: Some edge cases may require manual intervention (data quality issues)
Test File Organization - CRITICAL DIRECTIVE
- REQUIRED: All test files MUST be placed in
src/tests/directory - PROHIBITED: Test files should NEVER be placed in the root directory
- Naming Convention: Test files should follow
test_*.pyordebug_*.pypatterns - Purpose: Keeps root directory clean and organizes test code properly
- Import Path: Tests can import from parent modules using relative imports
⚠️ CRITICAL ENFORCEMENT: This directive is ABSOLUTE and NON-NEGOTIABLE. Any test files created in the root directory will be immediately deleted and moved to the correct location.
Using Tests for Issue Resolution
- FIRST STEP: When encountering issues, check
src/tests/directory for existing test files - EXISTING TESTS: Many common issues already have test cases that can help debug problems
- DEBUG SCRIPTS: Look for
debug_*.pyfiles that may contain troubleshooting code - SPECIFIC TESTS: Search for test files related to the specific functionality having issues
- EXAMPLES: Test files often contain working examples of how to use the functionality
- PATTERNS: Existing tests show the correct patterns for database queries, API calls, and data processing
Server Setup Requirements
MusicBrainz Server Configuration
The tool requires a local MusicBrainz server with the following setup:
Database Access
- Host: localhost (or Docker container IP: 172.18.0.2)
- Port: 5432 (PostgreSQL default)
- Database: musicbrainz_db (actual database name)
- User: musicbrainz
- Password: musicbrainz (default, should be changed in production)
HTTP API (Fallback)
- URL: http://localhost:8080 (updated port)
- Endpoint: /ws/2/
- Format: JSON
Docker Setup (Recommended)
# Clone MusicBrainz Docker repository
git clone https://github.com/metabrainz/musicbrainz-docker.git
cd musicbrainz-docker
# Update postgres.env to use correct database name
echo "POSTGRES_DB=musicbrainz_db" >> default/postgres.env
# Start the server
docker-compose up -d
# Wait for database to be ready (can take 10-15 minutes)
docker-compose logs -f musicbrainz
Manual Setup
- Install PostgreSQL 12+
- Create database:
createdb musicbrainz_db - Import MusicBrainz data dump
- Start MusicBrainz server on port 8080
Troubleshooting
- Database Connection Failed: Check PostgreSQL is running and credentials are correct
- API Connection Failed: Check MusicBrainz server is running on port 8080
- Slow Performance: Ensure database indexes are built
- No Results: Verify data has been imported to the database
- NEW: Docker Networking: Use container IP (172.18.0.2) for Docker-to-Docker connections
- NEW: Database Name: Ensure using
musicbrainz_dbnotmusicbrainz
Implementation Status
✅ Completed Features
- Basic CLI interface
- JSON file input/output
- Artist name normalization (ACDC → AC/DC)
- Collaboration handling (ft. → feat.)
- Song title cleaning
- MusicBrainz API integration
- MBID addition
- Progress reporting
- Error handling
- Documentation
- Direct PostgreSQL database access
- Fuzzy search for artists and recordings
- Automatic fallback to API mode
- Performance optimizations
- Advanced collaboration detection and handling
- Artist alias and sort_name search
- Dash variation handling
- Numerical suffix handling
- Band name vs collaboration distinction
- Complex collaboration parsing
- Removed problematic known_artists cache
- Simplified CLI with default full dataset processing
- Separate output files for successful and failed songs (array format)
- Human-readable text reports with statistics
- Smart defaults for all file paths and options
- Configurable processing limits and output file paths
- NEW: Interface-based architecture with dependency injection
- NEW: Factory pattern for data provider creation
- NEW: Centralized song processing logic
- NEW: Common interfaces for database and API clients
- NEW: Clean separation of concerns
🔄 Future Enhancements
- Web interface option
- Batch processing with resume capability
- Custom artist/recording mapping configuration
- Support for other music databases
- Audio fingerprinting integration
- GUI interface
- Database connection pooling
- Caching layer for frequently accessed data
- NEW: Machine learning for better collaboration detection
- NEW: Support for more artist name variations
Testing
Test Cases
- Basic Functionality: Process data/sample_songs.json
- Artist Normalization: ACDC → AC/DC
- Collaboration Handling: "Bruno Mars ft. Cardi B" → "Bruno Mars feat. Cardi B"
- Title Normalization: "Shot In The Dark" → "Shot in the Dark"
- Error Handling: Invalid JSON, missing files, API errors
- Fuzzy Search: "ACDC" → "AC/DC" with similarity scoring
- Database Connection: Test direct PostgreSQL access
- Fallback Mode: Test API fallback when database unavailable
- NEW: Complex Collaborations: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- NEW: Artist Aliases: "98 Degrees" → "98°"
- NEW: Sort Names: "Corby, Matt" → "Matt Corby"
- NEW: Dash Variations: "Blink-182" vs "blink‐182"
- NEW: Band Names: "Simon & Garfunkel" (not collaboration)
- NEW: Edge Cases: "P!nk", "3OH!3", "a-ha", "Ne-Yo"
Test Results
- ✅ All core functionality working
- ✅ Sample data processed successfully
- ✅ Error handling implemented
- ✅ Documentation complete
- ✅ Fuzzy search working with configurable thresholds
- ✅ Database access significantly faster than API calls
- ✅ Automatic fallback working correctly
- ✅ NEW: Complex collaborations handled correctly
- ✅ NEW: Artist aliases and sort names working
- ✅ NEW: Band name vs collaboration distinction working
- ✅ NEW: Edge cases with special characters handled
Success Metrics
- Accuracy: Successfully corrects artist names and titles
- Reliability: Handles errors without crashing
- Usability: Clear CLI interface with helpful output
- Performance: Processes songs efficiently with API rate limiting
- Speed: Database access 10x faster than API calls
- Matching: Fuzzy search improves match rate by 30%
- NEW: Collaboration Accuracy: 95% correct collaboration detection
- NEW: Edge Case Handling: 90% success rate on special character artists
Dependencies
External Dependencies
- MusicBrainz server running on localhost:8080
- PostgreSQL database accessible on localhost:5432
- Python 3.6+
- requests library
- psycopg2-binary for PostgreSQL access
- fuzzywuzzy for fuzzy string matching
- python-Levenshtein for improved fuzzy matching performance
Internal Dependencies
- Name variations mapping (ACDC → AC/DC, ft. → feat.)
- Artist name cleaning rules
- Title cleaning patterns
- Database connection configuration
- Fuzzy search similarity thresholds
- NEW: Collaboration detection patterns
- NEW: Band name protection list (JSON configuration)
Security Considerations
- No sensitive data processing
- Local API calls only
- No external network requests (except to local MusicBrainz server)
- Input validation for JSON files
- Database credentials should be secured
- Connection timeout limits prevent hanging
Deployment
Requirements
- Python 3.6+
- pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein
- MusicBrainz server running
- PostgreSQL database accessible
Installation
git clone <repository>
cd musicbrainz-cleaner
pip install -r requirements.txt
Usage
# Process all songs with default settings (recommended)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
# Process specific file with custom options
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/my_songs.json --limit 1000
# Force API mode (slower, fallback)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --use-api
# Test connections
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection
Maintenance
Regular Tasks
- Update name variations mapping
- Monitor MusicBrainz API changes
- Update dependencies as needed
- Monitor database performance
- Update fuzzy search thresholds based on usage
- NEW: Review and update band name protection list in
data/known_artists.json - NEW: Monitor collaboration detection accuracy
Operational Procedures
After System Reboot
- Start Docker Desktop (if auto-start not enabled)
- Restart MusicBrainz services:
cd musicbrainz-cleaner ./restart_services.sh - Wait for database initialization (5-10 minutes)
- Test connection:
docker-compose run --rm musicbrainz-cleaner python3 quick_test_20.py
Service Management
- Start services:
./start_services.sh(full setup) or./restart_services.sh(quick restart) - Stop services:
cd ../musicbrainz-docker && docker-compose down - Check status:
cd ../musicbrainz-docker && docker-compose ps - View logs:
cd ../musicbrainz-docker && docker-compose logs -f
Troubleshooting
- Port conflicts: Use
MUSICBRAINZ_WEB_SERVER_PORT=5001environment variable - Container conflicts: Run
docker-compose downthen restart - Database issues: Check logs with
docker-compose logs -f db - Memory issues: Increase Docker Desktop memory allocation (8GB+ recommended)
Critical Startup Issues & Solutions
Issue 1: Database Connection Refused
- Symptoms: Cleaner reports "Connection refused" when trying to connect to database
- Root Cause: Database container not fully initialized or wrong host configuration
- Solution:
# Check database status docker-compose logs db | tail -10 # Verify database is ready docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
Issue 2: Wrong Database Host Configuration
- Symptoms: Cleaner tries to connect to
172.18.0.2but fails - Root Cause: Hardcoded IP address in database connection
- Solution: Use Docker service name
dbinstead of IP address insrc/api/database.py
Issue 3: Test Script Logic Error
- Symptoms: Test shows 0% success rate despite finding artists
- Root Cause: Test script checking
'mbid' in resultwhereresultis a tuple - Solution: Extract song dictionary from tuple:
cleaned_song, success = result
Issue 4: Services Not Fully Initialized
- Symptoms: API returns empty results even though database has data
- Root Cause: MusicBrainz web server still starting up
- Solution: Wait for services to be fully ready and verify with health checks
Support
- GitHub issues for bug reports
- Documentation updates
- User feedback integration
- Database connection troubleshooting guide
- NEW: Collaboration detection troubleshooting guide
- NEW: Test-based troubleshooting guide
Troubleshooting with Tests
When encountering issues, the src/tests/ directory contains valuable resources:
Step 1: Check for Existing Test Cases
# List all available test files
ls src/tests/
# Look for specific functionality tests
ls src/tests/ | grep -i "collaboration"
ls src/tests/ | grep -i "artist"
ls src/tests/ | grep -i "database"
Step 2: Run Relevant Debug Scripts
# Run debug scripts for specific issues
python3 src/tests/debug_artist_search.py
python3 src/tests/test_collaboration_debug.py
python3 src/tests/test_failed_collaborations.py
Step 3: Use Test Files as Examples
- Database Issues: Check
test_simple_query.pyfor database connection patterns - Artist Search Issues: Check
debug_artist_search.pyfor search examples - Collaboration Issues: Check
test_failed_collaborations.pyfor collaboration handling - Title Cleaning Issues: Check
test_title_cleaning.pyfor title processing examples
Step 4: Common Test Files by Issue Type
| Issue Type | Relevant Test Files |
|---|---|
| Database Connection | test_simple_query.py, test_cli.py |
| Artist Search | debug_artist_search.py, test_100_random.py |
| Collaboration Detection | test_failed_collaborations.py, test_collaboration_debug.py |
| Title Processing | test_title_cleaning.py |
| CLI Issues | test_cli.py, quick_test_20.py |
| General Debugging | debug_artist_search.py, test_100_random.py |
Step 5: Extract Working Code
Test files often contain working code snippets that can be adapted:
- Database connection patterns
- API call examples
- Data processing logic
- Error handling approaches
⚠️ REMINDER: All test files MUST be in src/tests/ directory. NEVER create test files in the root directory.
Lessons Learned
Database Integration
- Direct PostgreSQL access is 10x faster than API calls
- Docker networking requires container IPs, not localhost
- Database name matters:
musicbrainz_dbnotmusicbrainz - Static caches cause problems: Wrong MBIDs override correct database lookups
Collaboration Handling
- Primary patterns (ft., feat.) are always collaborations
- Secondary patterns (&, and) require intelligence to distinguish from band names
- Comma detection helps identify collaborations
- Artist credit lookup is essential for preserving all collaborators
Edge Cases
- Dash variations (regular vs Unicode) cause exact match failures
- Artist aliases are common and important (98 Degrees → 98°)
- Sort names handle "Last, First" formats
- Numerical suffixes in names need special handling (S Club 7 → S Club)
Performance Optimization
- Remove static caches for better accuracy
- Database-first approach ensures live data
- Fuzzy search thresholds need tuning for different datasets
- Connection pooling would improve performance for large datasets
Operational Insights
- Docker Service Management: MusicBrainz services require proper startup sequence and initialization time
- Port Conflicts: Common on macOS, requiring automatic detection and resolution
- System Reboots: Services need to be restarted after system reboots, but data persists in Docker volumes
- Resource Requirements: MusicBrainz services require significant memory (8GB+ recommended) and disk space
- Platform Compatibility: Apple Silicon (M1/M2) works but may show platform mismatch warnings
- Database Connection Issues: Common startup problems include wrong host configuration and incomplete initialization
- Test Script Logic: Critical to handle tuple return values from cleaner methods correctly
CRITICAL PROJECT DIRECTIVE - TEST FILE ORGANIZATION
⚠️ ABSOLUTE REQUIREMENT - NON-NEGOTIABLE
Test File Placement Rules
- REQUIRED: ALL test files MUST be placed in
src/tests/directory - PROHIBITED: Test files should NEVER be placed in the root directory
- ENFORCEMENT: Any test files created in the root directory will be immediately deleted and moved to the correct location
- NON-NEGOTIABLE: This directive is absolute and must be followed at all times
Why This Matters
- Project Structure: Keeps the root directory clean and organized
- Code Organization: Groups all test-related code in one location
- Maintainability: Makes it easier to find and manage test files
- Best Practices: Follows standard Python project structure conventions
Compliance Required
- ALL developers must follow this directive
- ALL test files must be in
src/tests/ - NO EXCEPTIONS to this rule
- IMMEDIATE CORRECTION required for any violations