# 🎵 MusicBrainz Data Cleaner v3.0

A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!**

## 🚀 Quick Start for New Sessions

**If you're starting fresh or after a reboot, follow this exact sequence:**

### 1. Start MusicBrainz Services

```bash
# Quick restart (recommended)
./restart_services.sh

# Or full restart (if you have issues)
./start_services.sh
```

### 2. Wait for Services to Initialize

- **Database**: 5-10 minutes to fully load
- **Web server**: 2-3 minutes to start responding
- **Check status**: `cd ../musicbrainz-docker && docker-compose ps`

### 3. Verify Services Are Ready

```bash
# Test web server
curl -s http://localhost:5001 | head -5

# Test database (should show 2.6M+ artists)
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"

# Test cleaner connection
docker-compose run --rm musicbrainz-cleaner python3 -c "from src.api.database import MusicBrainzDatabase; db = MusicBrainzDatabase(); print('Connection result:', db.connect())"
```

### 4. Run Tests

```bash
# Test 100 random songs
docker-compose run --rm musicbrainz-cleaner python3 test_100_random.py

# Or other test scripts
docker-compose run --rm musicbrainz-cleaner python3 [script_name].py
```

**⚠️ Important**: Always run scripts via Docker - the cleaner cannot connect to the database directly from outside the container.

**📋 Troubleshooting**: See `TROUBLESHOOTING.md` for common issues and solutions.
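
The wait-and-verify steps above can also be scripted. The following is a minimal sketch, assuming only that the web server answers HTTP on `http://localhost:5001` as in the `curl` check. The helper names (`wait_until_ready`, `_http_probe`), the timeout, and the polling interval are illustrative and not part of the cleaner.

```python
import http.client
import time
import urllib.error
import urllib.request


def _http_probe(url):
    """Return True if the server answers with any non-5xx HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status < 500
    except urllib.error.HTTPError as exc:
        # The server responded, but with an error status.
        return exc.code < 500
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply: not ready yet.
        return False


def wait_until_ready(url, timeout_s=600.0, interval_s=5.0, probe=None):
    """Poll `url` until it answers or `timeout_s` elapses.

    `probe` is a hypothetical hook (a callable taking the URL) so the
    check can be swapped out; it defaults to a plain HTTP GET.
    """
    if probe is None:
        probe = _http_probe
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example (assumed URL from the Quick Start):
# wait_until_ready("http://localhost:5001")
```

The default timeout of 600 seconds mirrors the upper end of the 5-10 minute database load time quoted above; adjust it to your machine.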
## ✨ What's New in v3.0

- **🚀 Direct Database Access**: Connect directly to PostgreSQL for 10x faster performance
- **🎯 Advanced Fuzzy Search**: Intelligent matching for similar artist names and song titles
- **🔄 Automatic Fallback**: Falls back to API mode if database access fails
- **⚡ No Rate Limiting**: Database queries don't have API rate limits
- **📊 Similarity Scoring**: See how well matches are scored
- **🆕 Collaboration Detection**: Intelligently handle complex collaborations like "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- **🆕 Artist Aliases**: Handle name variations like "98 Degrees" → "98°" and "S Club 7" → "S Club"
- **🆕 Sort Names**: Handle "Last, First" formats like "Corby, Matt" → "Matt Corby"
- **🆕 Edge Case Handling**: Support for artists with hyphens, exclamation marks, numbers, and special characters
- **🆕 Band Name Protection**: Distinguish between band names (Simon & Garfunkel) and collaborations (Lavato, Demi & Joe Jonas)

## ✨ What It Does

**Before:**

```json
{
  "artist": "ACDC",
  "title": "Shot In The Dark",
  "favorite": true
}
```

**After:**

```json
{
  "artist": "AC/DC",
  "title": "Shot in the Dark",
  "favorite": true,
  "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
  "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}
```

## 🚀 Quick Start

### Option 1: Automated Setup (Recommended)

1. **Start MusicBrainz services**:
   ```bash
   ./start_services.sh
   ```
   This script will:
   - Check for Docker and port conflicts
   - Start all MusicBrainz services
   - Wait for database initialization
   - Create environment configuration
   - Test the connection

2. **Run the cleaner**:
   ```bash
   docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --input data/songs.json --output cleaned_songs.json
   ```

### Option 2: Manual Setup

1. **Start MusicBrainz services manually**:
   ```bash
   cd ../musicbrainz-docker
   MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
   ```
   Wait 5-10 minutes for database initialization.

2. **Create environment configuration**:
   ```bash
   # Create .env file in musicbrainz-cleaner directory
   cat > .env << EOF
   DB_HOST=172.18.0.2
   DB_PORT=5432
   DB_NAME=musicbrainz_db
   DB_USER=musicbrainz
   DB_PASSWORD=musicbrainz
   MUSICBRAINZ_WEB_SERVER_PORT=5001
   EOF
   ```

3. **Run the cleaner**:
   ```bash
   docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --input data/songs.json --output cleaned_songs.json
   ```

### For detailed setup instructions, see [SETUP.md](SETUP.md)

## 🔄 After System Reboot

After restarting your Mac, you'll need to restart the MusicBrainz services:

### Quick Restart (Recommended)

```bash
# If Docker Desktop is already running
./restart_services.sh

# Or manually
cd ../musicbrainz-docker && MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
```

### Full Restart (If you have issues)

```bash
# Complete setup including Docker checks
./start_services.sh
```

### Auto-start Setup (Optional)

1. **Enable Docker Desktop auto-start**:
   - Open Docker Desktop
   - Go to Settings → General
   - Check "Start Docker Desktop when you log in"

2. **Then just run**: `./restart_services.sh` after each reboot

**Note**: Your data is preserved in Docker volumes, so you don't need to reconfigure anything after a reboot.
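
The `.env` values from Manual Setup map directly onto database connection settings. The sketch below assumes the cleaner reads those same variable names from the process environment; the README does not show how `src/api/database.py` actually loads them, so `load_db_settings` is a hypothetical helper. Its default host `db` follows the Docker service name recommended in Issue 2 below.

```python
import os


def load_db_settings(env=None):
    """Build PostgreSQL connection settings from environment variables.

    Hypothetical helper: variable names come from the .env example,
    defaults are assumptions chosen to match this README.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("DB_HOST", "db"),  # Docker service name by default
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env.get("DB_NAME", "musicbrainz_db"),
        "user": env.get("DB_USER", "musicbrainz"),
        "password": env.get("DB_PASSWORD", "musicbrainz"),
    }
```

The keys are chosen to match the keyword arguments of `psycopg2.connect`, so the result could be passed as `psycopg2.connect(**load_db_settings())`.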
## 🚨 Common Startup Issues & Fixes

### Issue 1: Database Connection Refused

**Problem**: Cleaner can't connect to database with error "Connection refused"

**Root Cause**: Database container not fully initialized or wrong host configuration

**Fix**:

```bash
# Wait for database to be ready (check logs)
cd ../musicbrainz-docker && docker-compose logs db | tail -10

# Verify database is accepting connections
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
```

### Issue 2: Wrong Database Host Configuration

**Problem**: Cleaner tries to connect to `172.18.0.2` but can't reach it

**Root Cause**: Hardcoded IP address in database connection

**Fix**: Use Docker service name `db` instead of IP address

```python
# In src/api/database.py, change:
host='172.18.0.2'  # ❌ Wrong
host='db'          # ✅ Correct
```

### Issue 3: Test Script Logic Error

**Problem**: Test shows 0% success rate despite finding artists

**Root Cause**: Test script checking `'mbid' in result` where `result` is a tuple `(song_dict, success_boolean)`

**Fix**: Extract song dictionary from tuple

```python
# Wrong:
artist_found = 'mbid' in result

# Correct:
cleaned_song, success = result
artist_found = 'mbid' in cleaned_song
```

### Issue 4: Services Not Fully Initialized

**Problem**: API returns empty results even though database has data

**Root Cause**: MusicBrainz web server still starting up

**Fix**: Wait for services to be fully ready

```bash
# Check if web server is responding
curl -s http://localhost:5001 | head -5

# Wait for database to be ready
docker-compose logs db | grep "database system is ready"
```

### Issue 5: Port Conflicts

**Problem**: Port 5000 already in use

**Root Cause**: Another service using the port

**Fix**: Use alternative port

```bash
MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
```

### Issue 6: Container Name Conflicts

**Problem**: "Container name already in use" error

**Root Cause**: Previous containers not properly cleaned up

**Fix**:
Remove conflicting containers

```bash
docker-compose down
# Replace <container_name> with the name reported in the error message
docker rm -f <container_name>
```

## 🔧 Startup Checklist

Before running tests, verify:

1. ✅ Docker Desktop is running
2. ✅ All containers are up: `docker-compose ps`
3. ✅ Database is ready: `docker-compose logs db | grep "ready"`
4. ✅ Web server responds: `curl -s http://localhost:5001`
5. ✅ Database has data: `docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"`
6. ✅ Cleaner can connect: Test database connection in cleaner

## 📋 Requirements

- **Python 3.6+**
- **MusicBrainz Server** running on localhost:8080 (or port 5001 when started with `MUSICBRAINZ_WEB_SERVER_PORT=5001`, as in the Quick Start)
- **PostgreSQL Database** accessible on localhost:5432
- **Dependencies**: `requests`, `psycopg2-binary`, `fuzzywuzzy`, `python-Levenshtein`

## 🔧 Server Configuration

### Database Access

- **Host**: localhost (or, from inside Docker, the service name `db` or the container IP 172.18.0.2)
- **Port**: 5432 (PostgreSQL default)
- **Database**: musicbrainz_db (actual database name)
- **User**: musicbrainz
- **Password**: musicbrainz (default, should be changed in production)

### HTTP API (Fallback)

- **URL**: http://localhost:8080 (or port 5001 when started as in the Quick Start)
- **Endpoint**: /ws/2/
- **Format**: JSON

### Troubleshooting

- **Database Connection Failed**: Check PostgreSQL is running and credentials are correct
- **API Connection Failed**: Check MusicBrainz server is running on port 8080
- **Slow Performance**: Ensure database indexes are built
- **No Results**: Verify data has been imported to the database
- **NEW**: **Docker Networking**: Use the service name `db` (or the container IP 172.18.0.2) for Docker-to-Docker connections
- **NEW**: **Database Name**: Ensure using `musicbrainz_db` not `musicbrainz`

## 🧪 Testing

### Test File Organization

- **REQUIRED**: All test files must be placed in `src/tests/` directory
- **PROHIBITED**: Test files should not be placed in the root directory
- **Naming Convention**: Test files should follow `test_*.py` or `debug_*.py` patterns
- **Purpose**: Keeps root directory clean and organizes test code properly

### Running Tests

```bash
# Run all tests
python3 src/tests/run_tests.py

# Run specific test categories
python3 src/tests/run_tests.py --unit          # Unit tests only
python3 src/tests/run_tests.py --integration   # Integration tests only

# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli

# List all available tests
python3 src/tests/run_tests.py --list
```

#### Test Categories

- **Unit Tests**: Test individual components in isolation
- **Integration Tests**: Test interactions between components and database
- **Debug Tests**: Debug scripts and troubleshooting tools

## 📁 Project Structure

```
musicbrainz-cleaner/
├── src/
│   ├── api/                   # Database and API access
│   ├── cli/                   # Command-line interface
│   ├── config/                # Configuration and constants
│   ├── core/                  # Core functionality
│   ├── tests/                 # Test files (REQUIRED location)
│   └── utils/                 # Utility functions
├── data/                      # Data files and output
│   ├── known_artists.json     # Name variations (ACDC → AC/DC)
│   ├── known_recordings.json  # Known recording MBIDs
│   └── songs.json             # Source songs file
└── docker-compose.yml         # Docker configuration
```

### Data Files

The tool uses external JSON files for name variations:

- **`data/known_artists.json`**: Contains name variations (ACDC → AC/DC, ft. → feat.)
- **`data/known_recordings.json`**: Contains known recording MBIDs for common songs

These files can be easily updated without touching the code, making it simple to add new name variations.

## 🎯 Features

### ✅ Artist Name Fixes

- `ACDC` → `AC/DC`
- `Bruno Mars ft. Cardi B` → `Bruno Mars feat.
Cardi B`
- `featuring` → `feat.`
- `98 Degrees` → `98°` (artist aliases)
- `S Club 7` → `S Club` (numerical suffixes)
- `Corby, Matt` → `Matt Corby` (sort names)

### ✅ Collaboration Detection

- **Primary Patterns**: "ft.", "feat.", "featuring" (always collaborations)
- **Secondary Patterns**: "&", "and", "," (intelligent detection)
- **Band Name Protection**: 200+ known band names from `data/known_artists.json`
- **Complex Collaborations**: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- **Case Insensitive**: "Featuring" → "featuring"

### ✅ Song Title Fixes

- `Shot In The Dark` → `Shot in the Dark`
- Removes `(Karaoke Version)`, `(Instrumental)` suffixes
- Normalizes capitalization and formatting

### ✅ Added Data

- **`mbid`**: Official MusicBrainz Artist ID
- **`recording_mbid`**: Official MusicBrainz Recording ID

### ✅ Preserves Your Data

- Keeps all your existing fields (guid, path, disabled, favorite, etc.)
- Only adds new fields, never removes existing ones

### 🆕 Advanced Fuzzy Search

- **Intelligent Matching**: Finds similar names even with typos or variations
- **Similarity Scoring**: Shows how well each match scores (0.0 to 1.0)
- **Configurable Thresholds**: Adjust matching sensitivity
- **Multiple Algorithms**: Uses ratio, partial ratio, and token sort matching
- **Enhanced Search Fields**: artist.name, artist_alias.name, artist.sort_name
- **Dash Handling**: Regular dash (-) vs Unicode dash (‐)
- **Substring Protection**: Avoids false matches like "Sleazy-E" vs "Eazy-E"

### 🆕 Edge Case Support

- **Hyphenated Artists**: "Blink-182", "Ne-Yo", "G-Eazy"
- **Exclamation Marks**: "P!nk", "Panic!
At The Disco", "3OH!3"
- **Numbers**: "98 Degrees", "S Club 7", "3 Doors Down"
- **Special Characters**: "a-ha", "The B-52s", "Salt-N-Pepa"

### 🆕 Simplified Processing

- **Default Behavior**: Process all songs by default (no special flags needed)
- **Separate Output Files**: Successful and failed songs saved to different files
- **Progress Tracking**: Real-time progress with song counter and status
- **Smart Defaults**: Sensible defaults for all file paths and options
- **Detailed Reporting**: Comprehensive statistics and processing report
- **Batch Processing**: Efficient handling of large song collections

## 📖 Usage Examples

### Basic Usage (Default)

```bash
# Process all songs with default settings (data/songs.json)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
# Output: data/songs-success.json and data/songs-failure.json
```

### Custom Source File

```bash
# Process specific file
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/my_songs.json
# Output: data/my_songs-success.json and data/my_songs-failure.json
```

### Custom Output Files

```bash
# Specify custom output files
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json --output-success cleaned.json --output-failure failed.json
```

### Limit Processing

```bash
# Process only first 1000 songs
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --limit 1000
```

### Force API Mode

```bash
# Use HTTP API instead of database (slower but works without PostgreSQL)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --use-api
```

### Test Connections

```bash
# Test database connection
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection

# Test with API mode
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection --use-api
```

### Help

```bash
# Show usage information
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --help
```

## 📁 Data Files

### Input Format

Your JSON file should contain an array of song objects:

```json
[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  },
  {
    "artist": "Bruno Mars ft. Cardi B",
    "title": "Finesse Remix",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
  }
]
```

## 📤 Output Format

The tool creates **three output files**:

### 1. Successful Songs (`source-success.json`)

Array of successfully processed songs with MBIDs added:

```json
[
  {
    "artist": "AC/DC",
    "title": "Shot in the Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
    "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
  },
  {
    "artist": "Bruno Mars feat. Cardi B",
    "title": "Finesse (remix)",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
    "mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
    "recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
  }
]
```

### 2. Failed Songs (`source-failure.json`)

Array of songs that couldn't be processed (same format as source):

```json
[
  {
    "artist": "Unknown Artist",
    "title": "Unknown Song",
    "disabled": false,
    "favorite": false,
    "guid": "12345678-1234-1234-1234-123456789012",
    "path": "z://MP4\\Unknown Artist - Unknown Song.mp4"
  }
]
```

### 3. Processing Report (`processing_report_YYYYMMDD_HHMMSS.txt`)

Human-readable text report with statistics and failed song list:

```
MusicBrainz Data Cleaner - Processing Report
==================================================

Source File: data/songs.json
Processing Date: 2024-12-19 14:30:22
Processing Time: 15263.3 seconds

SUMMARY
--------------------
Total Songs Processed: 49,170
Successful Songs: 40,692
Failed Songs: 8,478
Success Rate: 82.8%

DETAILED STATISTICS
--------------------
Artists Found: 44,526/49,170 (90.6%)
Recordings Found: 40,998/49,170 (83.4%)
Processing Speed: 3.2 songs/second

OUTPUT FILES
--------------------
Successful Songs: data/songs-success.json
Failed Songs: data/songs-failure.json
Report File: data/processing_report_20241219_143022.txt

FAILED SONGS (First 50)
--------------------
1. Unknown Artist - Unknown Song
2. Invalid Artist - Invalid Title
3. Test Artist - Test Song
...
```

## 🎬 Example Run

### Basic Processing

```bash
$ docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main

🚀 Starting song processing...
📊 Total songs to process: 49,170
Using database connection
==================================================
[1 of 49,170] ✅ PASS: ACDC - Shot In The Dark
[2 of 49,170] ❌ FAIL: Unknown Artist - Unknown Song
[3 of 49,170] ✅ PASS: Bruno Mars feat. Cardi B - Finesse (remix)
[4 of 49,170] ✅ PASS: Taylor Swift - Love Story
...
📈 Progress: 100/49,170 (0.2%) - Success: 85.0% - Rate: 3.2 songs/sec
📈 Progress: 200/49,170 (0.4%) - Success: 87.5% - Rate: 3.1 songs/sec
...
==================================================
🎉 Processing completed!
📊 Final Results:
⏱️ Total processing time: 15263.3 seconds
🚀 Average speed: 3.2 songs/second
✅ Artists found: 44,526/49,170 (90.6%)
✅ Recordings found: 40,998/49,170 (83.4%)
❌ Failed songs: 8,478 (17.2%)
📄 Files saved:
✅ Successful songs: data/songs-success.json
❌ Failed songs: data/songs-failure.json
📋 Text report: data/processing_report_20241219_143022.txt
📊 JSON report: data/processing_report_20241219_143022.json
```

### Limited Processing

```bash
$ docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --limit 1000

⚠️ Limiting processing to first 1000 songs
🚀 Starting song processing...
📊 Total songs to process: 1,000
Using database connection
==================================================
[1 of 1,000] ✅ PASS: ACDC - Shot In The Dark
[2 of 1,000] ❌ FAIL: Unknown Artist - Unknown Song
...
==================================================
🎉 Processing completed!
📊 Final Results:
⏱️ Total processing time: 312.5 seconds
🚀 Average speed: 3.2 songs/second
✅ Artists found: 856/1,000 (85.6%)
✅ Recordings found: 789/1,000 (78.9%)
❌ Failed songs: 211 (21.1%)
📄 Files saved:
✅ Successful songs: data/songs-success.json
❌ Failed songs: data/songs-failure.json
📋 Text report: data/processing_report_20241219_143022.txt
📊 JSON report: data/processing_report_20241219_143022.json
```

## 🔧 Troubleshooting

### "Could not find artist"

- The artist might not be in the MusicBrainz database
- Try checking the spelling or using a different variation
- The search index might still be building (wait a few minutes)
- Check fuzzy search similarity score - lower threshold if needed
- **NEW**: Check for artist aliases (e.g., "98 Degrees" → "98°")
- **NEW**: Check for sort names (e.g., "Corby, Matt" → "Matt Corby")

### "Could not find recording"

- The song might not be in the database
- The title might not match exactly
- Try a simpler title (remove extra words)
- Check fuzzy search similarity score - lower threshold
if needed
- **NEW**: For collaborations, check if it's stored under the main artist

### Connection errors

- **Database**: Make sure PostgreSQL is running and accessible
- **API**: Make sure your MusicBrainz server is running on `http://localhost:8080`
- Check that Docker containers are up and running
- Verify the server is accessible in your browser
- **NEW**: For Docker, use the service name `db` (or the container IP 172.18.0.2) instead of localhost

### JSON errors

- Make sure your input file is valid JSON
- Check that it contains an array of objects
- Verify all required fields are present

### Performance issues

- Use database mode instead of API mode for better performance
- Ensure database indexes are built for faster queries
- Check fuzzy search thresholds - higher thresholds mean fewer but more accurate matches

### Collaboration detection issues

- **NEW**: Check if it's a band name vs collaboration (e.g., "Simon & Garfunkel" vs "Lavato, Demi & Joe Jonas")
- **NEW**: Verify the collaboration pattern is supported (ft., feat., featuring, &, and, ,)
- **NEW**: Check case sensitivity - patterns are case-insensitive

## 🎯 Use Cases

- **Karaoke Systems**: Clean up song metadata for better search and organization
- **Music Libraries**: Standardize artist names and add official IDs
- **Music Apps**: Ensure consistent data across your application
- **Data Migration**: Clean up legacy music data when moving to new systems
- **Fuzzy Matching**: Handle typos and variations in artist/song names
- **NEW**: **Collaboration Handling**: Process complex artist collaborations
- **NEW**: **Edge Cases**: Handle artists with special characters and unusual names

## 📚 What are MBIDs?

**MBID** stands for **MusicBrainz Identifier**. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.
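
MBIDs are UUIDs written in the canonical 36-character form, so a value can be sanity-checked before it is stored or compared. The following is a minimal sketch using Python's standard `uuid` module; `is_valid_mbid` is an illustrative helper and not part of the cleaner.

```python
import uuid


def is_valid_mbid(value):
    """Return True if `value` is a canonically formatted UUID string."""
    if not isinstance(value, str):
        return False
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # uuid.UUID also accepts braces, "urn:uuid:" prefixes, and missing
    # hyphens, so compare against the canonical hyphenated form.
    return str(parsed) == value.lower()
```

For example, `is_valid_mbid("66c662b6-6e2f-4930-8610-912e24c63ed1")` accepts the AC/DC artist ID shown earlier, while a bare name such as `"ACDC"` is rejected.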
**Benefits:**

- **Permanent**: Never change, even if names change
- **Universal**: Used across many music applications
- **Reliable**: Official identifiers from the MusicBrainz database
- **Linked Data**: Connect to other music databases and services

## 🆕 Performance Comparison

| Method | Speed | Rate Limiting | Fuzzy Search | Setup Complexity |
|--------|-------|---------------|--------------|------------------|
| **Database** | ⚡ 10x faster | ❌ None | ✅ Yes | 🔧 Medium |
| **API** | 🐌 Slower | ⏱️ Yes (0.1s delay) | ❌ No | ✅ Easy |

## 🆕 Collaboration Detection Examples

| Input | Type | Detection | Output |
|-------|------|-----------|---------|
| `Bruno Mars ft. Cardi B` | Collaboration | ✅ Primary pattern | `Bruno Mars feat. Cardi B` |
| `Pitbull ft. Ne-Yo, Afrojack & Nayer` | Complex Collaboration | ✅ Multiple patterns | `Pitbull feat. Ne-Yo, Afrojack & Nayer` |
| `Simon & Garfunkel` | Band Name | ❌ Protected | `Simon & Garfunkel` |
| `Lavato, Demi & Joe Jonas` | Collaboration | ✅ Comma detection | `Lavato, Demi & Joe Jonas` |
| `Hall & Oates` | Band Name | ❌ Protected | `Hall & Oates` |

## 🆕 Edge Case Examples

| Input | Type | Handling | Output |
|-------|------|----------|---------|
| `ACDC` | Name Variation | ✅ Alias lookup | `AC/DC` |
| `98 Degrees` | Artist Alias | ✅ Alias search | `98°` |
| `S Club 7` | Numerical Suffix | ✅ Suffix removal | `S Club` |
| `Corby, Matt` | Sort Name | ✅ Sort name search | `Matt Corby` |
| `Blink-182` | Dash Variation | ✅ Unicode dash handling | `blink‐182` |
| `P!nk` | Special Characters | ✅ Direct search | `P!nk` |
| `3OH!3` | Numbers + Special | ✅ Direct search | `3OH!3` |

## 🤝 Contributing

Found a bug or have a feature request?

1. Check the existing issues
2. Create a new issue with details
3. Include sample data if possible

## 📄 License

This tool is provided as-is for educational and personal use.
## 🔗 Related Links

- [MusicBrainz](https://musicbrainz.org/) - The open music encyclopedia
- [MusicBrainz API](https://musicbrainz.org/doc/Development) - API documentation
- [MusicBrainz Docker](https://github.com/metabrainz/musicbrainz-docker) - Docker setup
- [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy string matching library

## 📝 Lessons Learned

### Database Integration

- **Direct PostgreSQL access is 10x faster** than API calls
- **Docker networking** requires the service name `db` or a container IP, not localhost
- **Database name matters**: `musicbrainz_db` not `musicbrainz`
- **Static caches cause problems**: Wrong MBIDs override correct database lookups

### Collaboration Handling

- **Primary patterns** (ft., feat.) are always collaborations
- **Secondary patterns** (&, and) require intelligence to distinguish from band names
- **Comma detection** helps identify collaborations
- **Artist credit lookup** is essential for preserving all collaborators

### Edge Cases

- **Dash variations** (regular vs Unicode) cause exact match failures
- **Artist aliases** are common and important (98 Degrees → 98°)
- **Sort names** handle "Last, First" formats
- **Numerical suffixes** in names need special handling (S Club 7 → S Club)

### Performance Optimization

- **Remove static caches** for better accuracy
- **Database-first approach** ensures live data
- **Fuzzy search thresholds** need tuning for different datasets
- **Connection pooling** would improve performance for large datasets

### CLI Design

- **Simplified interface** with smart defaults reduces complexity
- **Array format consistency** makes output files easier to work with
- **Human-readable reports** improve user experience
- **Test file organization** keeps project structure clean

---

**Happy cleaning! 🎵✨**