# 🎵 MusicBrainz Data Cleaner v3.0

A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!**

## ✨ What's New in v3.0

- **🚀 Direct Database Access**: Connect directly to PostgreSQL for 10x faster performance
- **🎯 Advanced Fuzzy Search**: Intelligent matching for similar artist names and song titles
- **🔄 Automatic Fallback**: Falls back to API mode if database access fails
- **⚡ No Rate Limiting**: Database queries don't have API rate limits
- **📊 Similarity Scoring**: See how well matches are scored
- **🆕 Collaboration Detection**: Intelligently handle complex collaborations like "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- **🆕 Artist Aliases**: Handle name variations like "98 Degrees" → "98°" and "S Club 7" → "S Club"
- **🆕 Sort Names**: Handle "Last, First" formats like "Corby, Matt" → "Matt Corby"
- **🆕 Edge Case Handling**: Support for artists with hyphens, exclamation marks, numbers, and special characters
- **🆕 Band Name Protection**: Distinguish between band names (Simon & Garfunkel) and collaborations (Lavato, Demi & Joe Jonas)

## ✨ What It Does

**Before:**

```json
{
  "artist": "ACDC",
  "title": "Shot In The Dark",
  "favorite": true
}
```

**After:**

```json
{
  "artist": "AC/DC",
  "title": "Shot in the Dark",
  "favorite": true,
  "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
  "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}
```

## 🚀 Quick Start

### 1. Install Dependencies

```bash
pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein
```

### 2. Set Up MusicBrainz Server

#### Option A: Docker (Recommended)

```bash
# Clone MusicBrainz Docker repository
git clone https://github.com/metabrainz/musicbrainz-docker.git
cd musicbrainz-docker

# Update postgres.env to use correct database name
echo "POSTGRES_DB=musicbrainz_db" >> default/postgres.env

# Start the server
docker-compose up -d

# Wait for database to be ready (can take 10-15 minutes)
docker-compose logs -f musicbrainz
```

#### Option B: Manual Setup

1. Install PostgreSQL 12+
2. Create database: `createdb musicbrainz_db`
3. Import MusicBrainz data dump
4. Start MusicBrainz server on port 8080

### 3. Test Connection

```bash
python musicbrainz_cleaner.py --test-connection
```

### 4. Run the Cleaner

```bash
# Use database access (recommended, faster)
python musicbrainz_cleaner.py your_songs.json

# Force API mode (slower, fallback)
python musicbrainz_cleaner.py your_songs.json --use-api
```

That's it! Your cleaned data will be saved to `your_songs_cleaned.json`

## 📋 Requirements

- **Python 3.6+**
- **MusicBrainz Server** running on localhost:8080
- **PostgreSQL Database** accessible on localhost:5432
- **Dependencies**: `requests`, `psycopg2-binary`, `fuzzywuzzy`, `python-Levenshtein`

## 🔧 Server Configuration

### Database Access

- **Host**: localhost (or Docker container IP: 172.18.0.2)
- **Port**: 5432 (PostgreSQL default)
- **Database**: musicbrainz_db (actual database name)
- **User**: musicbrainz
- **Password**: musicbrainz (default, should be changed in production)

### HTTP API (Fallback)

- **URL**: http://localhost:8080
- **Endpoint**: /ws/2/
- **Format**: JSON

### Troubleshooting

- **Database Connection Failed**: Check PostgreSQL is running and credentials are correct
- **API Connection Failed**: Check MusicBrainz server is running on port 8080
- **Slow Performance**: Ensure database indexes are built
- **No Results**: Verify data has been imported to the database
- **NEW**: **Docker Networking**: Use container IP (172.18.0.2) for Docker-to-Docker connections
- **NEW**: **Database Name**: Ensure using `musicbrainz_db` not `musicbrainz`

## 🧪 Testing

Run the test suite to verify everything works correctly:

```bash
# Run all tests
python3 src/tests/run_tests.py

# Run specific test categories
python3 src/tests/run_tests.py --unit          # Unit tests only
python3 src/tests/run_tests.py --integration   # Integration tests only

# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli

# List all available tests
python3 src/tests/run_tests.py --list
```

#### Test Categories

- **Unit Tests**: Test individual components in isolation
- **Integration Tests**: Test interactions between components and database
- **Debug Tests**: Debug scripts and troubleshooting tools

## 📁 Data Files

The tool uses external JSON files for name variations:

- **`data/known_artists.json`**: Contains name variations (ACDC → AC/DC, ft. → feat.)
- **`data/known_recordings.json`**: Contains known recording MBIDs for common songs

These files can be easily updated without touching the code, making it simple to add new name variations.

## 🎯 Features

### ✅ Artist Name Fixes

- `ACDC` → `AC/DC`
- `Bruno Mars ft. Cardi B` → `Bruno Mars feat. Cardi B`
- `featuring` → `feat.`
- `98 Degrees` → `98°` (artist aliases)
- `S Club 7` → `S Club` (numerical suffixes)
- `Corby, Matt` → `Matt Corby` (sort names)

### ✅ Collaboration Detection

- **Primary Patterns**: "ft.", "feat.", "featuring" (always collaborations)
- **Secondary Patterns**: "&", "and", "," (intelligent detection)
- **Band Name Protection**: 200+ known band names from `data/known_artists.json`
- **Complex Collaborations**: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- **Case Insensitive**: "Featuring" → "featuring"

### ✅ Song Title Fixes

- `Shot In The Dark` → `Shot in the Dark`
- Removes `(Karaoke Version)`, `(Instrumental)` suffixes
- Normalizes capitalization and formatting

### ✅ Added Data

- **`mbid`**: Official MusicBrainz Artist ID
- **`recording_mbid`**: Official MusicBrainz Recording ID

### ✅ Preserves Your Data

- Keeps all your existing fields (guid, path, disabled, favorite, etc.)
- Only adds new fields, never removes existing ones

### 🆕 Advanced Fuzzy Search

- **Intelligent Matching**: Finds similar names even with typos or variations
- **Similarity Scoring**: Shows how well each match scores (0.0 to 1.0)
- **Configurable Thresholds**: Adjust matching sensitivity
- **Multiple Algorithms**: Uses ratio, partial ratio, and token sort matching
- **Enhanced Search Fields**: artist.name, artist_alias.name, artist.sort_name
- **Dash Handling**: Regular dash (-) vs Unicode dash (‐)
- **Substring Protection**: Avoids false matches like "Sleazy-E" vs "Eazy-E"

### 🆕 Edge Case Support

- **Hyphenated Artists**: "Blink-182", "Ne-Yo", "G-Eazy"
- **Exclamation Marks**: "P!nk", "Panic! At The Disco", "3OH!3"
- **Numbers**: "98 Degrees", "S Club 7", "3 Doors Down"
- **Special Characters**: "a-ha", "The B-52s", "Salt-N-Pepa"

## 📖 Usage Examples

### Basic Usage

```bash
# Clean your songs and save to auto-generated filename
python musicbrainz_cleaner.py my_songs.json
# Output: my_songs_cleaned.json
```

### Custom Output File

```bash
# Specify your own output filename
python musicbrainz_cleaner.py my_songs.json cleaned_songs.json
```

### Force API Mode

```bash
# Use HTTP API instead of database (slower but works without PostgreSQL)
python musicbrainz_cleaner.py my_songs.json --use-api
```

### Test Connections

```bash
# Test database connection
python musicbrainz_cleaner.py --test-connection

# Test with API mode
python musicbrainz_cleaner.py --test-connection --use-api
```

### Help

```bash
# Show usage information
python musicbrainz_cleaner.py --help
```

## 📁 Data Files

### Input Format

Your JSON file should contain an array of song objects:

```json
[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  },
  {
    "artist": "Bruno Mars ft. Cardi B",
    "title": "Finesse Remix",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
  }
]
```

## 📤 Output Format

The tool will update your objects with corrected data:

```json
[
  {
    "artist": "AC/DC",
    "title": "Shot in the Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
    "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
  },
  {
    "artist": "Bruno Mars feat. Cardi B",
    "title": "Finesse (remix)",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
    "mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
    "recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
  }
]
```

## 🎬 Example Run

```bash
$ python musicbrainz_cleaner.py data/sample_songs.json
Processing 3 songs...
Using database connection
==================================================

[1/3] Processing: ACDC - Shot In The Dark
🎯 Fuzzy match found: ACDC → AC/DC (score: 0.85)
✅ Found artist: AC/DC (MBID: 66c662b6-6e2f-4930-8610-912e24c63ed1)
🎯 Fuzzy match found: Shot In The Dark → Shot in the Dark (score: 0.92)
✅ Found recording: Shot in the Dark (MBID: cf8b5cd0-d97c-413d-882f-fc422a2e57db)
✅ Updated to: AC/DC - Shot in the Dark

[2/3] Processing: Bruno Mars ft. Cardi B - Finesse Remix
🎯 Fuzzy match found: Bruno Mars → Bruno Mars (score: 1.00)
✅ Found artist: Bruno Mars (MBID: afb680f2-b6eb-4cd7-a70b-a63b25c763d5)
🎯 Fuzzy match found: Finesse Remix → Finesse (remix) (score: 0.88)
✅ Found recording: Finesse (remix) (MBID: 8ed14014-547a-4128-ab81-c2dca7ae198e)
✅ Updated to: Bruno Mars feat. Cardi B - Finesse (remix)

[3/3] Processing: Taylor Swift - Love Story
🎯 Fuzzy match found: Taylor Swift → Taylor Swift (score: 1.00)
✅ Found artist: Taylor Swift (MBID: 20244d07-534f-4eff-b4d4-930878889970)
🎯 Fuzzy match found: Love Story → Love Story (score: 1.00)
✅ Found recording: Love Story (MBID: d783e6c5-761f-4fc3-bfcf-6089cdfc8f96)
✅ Updated to: Taylor Swift - Love Story

==================================================
✅ Processing complete!
📁 Output saved to: data/sample_songs_cleaned.json
```

## 🔧 Troubleshooting

### "Could not find artist"

- The artist might not be in the MusicBrainz database
- Try checking the spelling or using a different variation
- The search index might still be building (wait a few minutes)
- Check fuzzy search similarity score - lower threshold if needed
- **NEW**: Check for artist aliases (e.g., "98 Degrees" → "98°")
- **NEW**: Check for sort names (e.g., "Corby, Matt" → "Matt Corby")

### "Could not find recording"

- The song might not be in the database
- The title might not match exactly
- Try a simpler title (remove extra words)
- Check fuzzy search similarity score - lower threshold if needed
- **NEW**: For collaborations, check if it's stored under the main artist

### Connection errors

- **Database**: Make sure PostgreSQL is running and accessible
- **API**: Make sure your MusicBrainz server is running on `http://localhost:8080`
- Check that Docker containers are up and running
- Verify the server is accessible in your browser
- **NEW**: For Docker, use container IP (172.18.0.2) instead of localhost

### JSON errors

- Make sure your input file is valid JSON
- Check that it contains an array of objects
- Verify all required fields are present

### Performance issues

- Use database mode instead of API mode for better performance
- Ensure database indexes are built for faster queries
- Check fuzzy search thresholds - higher thresholds mean fewer but more accurate matches

### Collaboration detection issues

- **NEW**: Check if it's a band name vs collaboration (e.g., "Simon & Garfunkel" vs "Lavato, Demi & Joe Jonas")
- **NEW**: Verify the collaboration pattern is supported (ft., feat., featuring, &, and, ,)
- **NEW**: Check case sensitivity - patterns are case-insensitive

## 🎯 Use Cases

- **Karaoke Systems**: Clean up song metadata for better search and organization
- **Music Libraries**: Standardize artist names and add official IDs
- **Music Apps**: Ensure consistent data across your application
- **Data Migration**: Clean up legacy music data when moving to new systems
- **Fuzzy Matching**: Handle typos and variations in artist/song names
- **NEW**: **Collaboration Handling**: Process complex artist collaborations
- **NEW**: **Edge Cases**: Handle artists with special characters and unusual names

## 📚 What are MBIDs?

**MBID** stands for **MusicBrainz Identifier**. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.

**Benefits:**

- **Permanent**: Never change, even if names change
- **Universal**: Used across many music applications
- **Reliable**: Official identifiers from the MusicBrainz database
- **Linked Data**: Connect to other music databases and services

## 🆕 Performance Comparison

| Method | Speed | Rate Limiting | Fuzzy Search | Setup Complexity |
|--------|-------|---------------|--------------|------------------|
| **Database** | ⚡ 10x faster | ❌ None | ✅ Yes | 🔧 Medium |
| **API** | 🐌 Slower | ⏱️ Yes (0.1s delay) | ❌ No | ✅ Easy |

## 🆕 Collaboration Detection Examples

| Input | Type | Detection | Output |
|-------|------|-----------|---------|
| `Bruno Mars ft. Cardi B` | Collaboration | ✅ Primary pattern | `Bruno Mars feat. Cardi B` |
| `Pitbull ft. Ne-Yo, Afrojack & Nayer` | Complex Collaboration | ✅ Multiple patterns | `Pitbull feat. Ne-Yo, Afrojack & Nayer` |
| `Simon & Garfunkel` | Band Name | ❌ Protected | `Simon & Garfunkel` |
| `Lavato, Demi & Joe Jonas` | Collaboration | ✅ Comma detection | `Lavato, Demi & Joe Jonas` |
| `Hall & Oates` | Band Name | ❌ Protected | `Hall & Oates` |

## 🆕 Edge Case Examples

| Input | Type | Handling | Output |
|-------|------|----------|---------|
| `ACDC` | Name Variation | ✅ Alias lookup | `AC/DC` |
| `98 Degrees` | Artist Alias | ✅ Alias search | `98°` |
| `S Club 7` | Numerical Suffix | ✅ Suffix removal | `S Club` |
| `Corby, Matt` | Sort Name | ✅ Sort name search | `Matt Corby` |
| `Blink-182` | Dash Variation | ✅ Unicode dash handling | `blink‐182` |
| `P!nk` | Special Characters | ✅ Direct search | `P!nk` |
| `3OH!3` | Numbers + Special | ✅ Direct search | `3OH!3` |

## 🤝 Contributing

Found a bug or have a feature request?

1. Check the existing issues
2. Create a new issue with details
3. Include sample data if possible

## 📄 License

This tool is provided as-is for educational and personal use.

## 🔗 Related Links

- [MusicBrainz](https://musicbrainz.org/) - The open music encyclopedia
- [MusicBrainz API](https://musicbrainz.org/doc/Development) - API documentation
- [MusicBrainz Docker](https://github.com/metabrainz/musicbrainz-docker) - Docker setup
- [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy string matching library

## 📝 Lessons Learned

### Database Integration

- **Direct PostgreSQL access is 10x faster** than API calls
- **Docker networking** requires container IPs, not localhost
- **Database name matters**: `musicbrainz_db` not `musicbrainz`
- **Static caches cause problems**: Wrong MBIDs override correct database lookups

### Collaboration Handling

- **Primary patterns** (ft., feat.) are always collaborations
- **Secondary patterns** (&, and) require intelligence to distinguish from band names
- **Comma detection** helps identify collaborations
- **Artist credit lookup** is essential for preserving all collaborators

### Edge Cases

- **Dash variations** (regular vs Unicode) cause exact match failures
- **Artist aliases** are common and important (98 Degrees → 98°)
- **Sort names** handle "Last, First" formats
- **Numerical suffixes** in names need special handling (S Club 7 → S Club)

### Performance Optimization

- **Remove static caches** for better accuracy
- **Database-first approach** ensures live data
- **Fuzzy search thresholds** need tuning for different datasets
- **Connection pooling** would improve performance for large datasets

---

**Happy cleaning! 🎵✨**
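
The collaboration rules summarized under Lessons Learned (primary patterns always split, secondary patterns split only when the name is not a protected band) can be illustrated with a small sketch. This is not the tool's actual implementation: the function name `split_artists`, the `KNOWN_BANDS` set, and the regular expressions are hypothetical, and the real band list lives in `data/known_artists.json`.

```python
import re

# Hypothetical stand-in for the band names loaded from data/known_artists.json.
KNOWN_BANDS = {"simon & garfunkel", "hall & oates"}

# Primary markers ("ft.", "feat.", "featuring") always signal a collaboration.
PRIMARY = re.compile(r"\s+(?:featuring|feat\.?|ft\.?)\s+", re.IGNORECASE)
# Secondary separators (",", "&", "and") are ambiguous on their own.
SECONDARY = re.compile(r"\s*(?:,|&|\band\b)\s*", re.IGNORECASE)


def split_artists(credit, known_bands=KNOWN_BANDS):
    """Return (main_artist, featured_artists) for a raw artist credit."""
    credit = credit.strip()
    # Band name protection: never split a known band name.
    if credit.lower() in known_bands:
        return credit, []
    # Primary pattern: the text before the marker is the main artist.
    parts = PRIMARY.split(credit, maxsplit=1)
    if len(parts) == 2:
        main, rest = parts
        featured = [p for p in SECONDARY.split(rest) if p]
        return main.strip(), featured
    # Secondary patterns only: the first name is treated as the main artist.
    pieces = [p for p in SECONDARY.split(credit) if p]
    if not pieces:
        return credit, []
    return pieces[0], pieces[1:]


print(split_artists("Pitbull ft. Ne-Yo, Afrojack & Nayer"))
print(split_artists("Simon & Garfunkel"))
```

The sketch deliberately ignores sort names: an input such as "Lavato, Demi & Joe Jonas" would be split on the comma, so a real implementation would need the sort-name and artist-credit lookups described above before falling back to separator splitting.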