# 🎵 MusicBrainz Data Cleaner v2.0

A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with direct database access and fuzzy search for maximum performance and accuracy!**

## ✨ What's New in v2.0

- **🚀 Direct Database Access**: Connect directly to PostgreSQL for 10x faster performance
- **🎯 Fuzzy Search**: Intelligent matching for similar artist names and song titles
- **🔄 Automatic Fallback**: Falls back to API mode if database access fails
- **⚡ No Rate Limiting**: Database queries don't have API rate limits
- **📊 Similarity Scoring**: See how well matches are scored

## ✨ What It Does

**Before:**

```json
{
  "artist": "ACDC",
  "title": "Shot In The Dark",
  "favorite": true
}
```

**After:**

```json
{
  "artist": "AC/DC",
  "title": "Shot in the Dark",
  "favorite": true,
  "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
  "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}
```

## 🚀 Quick Start

### 1. Install Dependencies

```bash
pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein
```

### 2. Set Up MusicBrainz Server

#### Option A: Docker (Recommended)

```bash
# Clone MusicBrainz Docker repository
git clone https://github.com/metabrainz/musicbrainz-docker.git
cd musicbrainz-docker

# Start the server
docker-compose up -d

# Wait for database to be ready (can take 10-15 minutes)
docker-compose logs -f musicbrainz
```

#### Option B: Manual Setup

1. Install PostgreSQL 12+
2. Create database: `createdb musicbrainz`
3. Import MusicBrainz data dump
4. Start MusicBrainz server on port 5001

### 3. Test Connection

```bash
python musicbrainz_cleaner.py --test-connection
```

### 4. Run the Cleaner

```bash
# Use database access (recommended, faster)
python musicbrainz_cleaner.py your_songs.json

# Force API mode (slower, fallback)
python musicbrainz_cleaner.py your_songs.json --use-api
```

That's it!
Your cleaned data will be saved to `your_songs_cleaned.json`.

## 📋 Requirements

- **Python 3.6+**
- **MusicBrainz Server** running on localhost:5001
- **PostgreSQL Database** accessible on localhost:5432
- **Dependencies**: `requests`, `psycopg2-binary`, `fuzzywuzzy`, `python-Levenshtein`

## 🔧 Server Configuration

### Database Access

- **Host**: localhost
- **Port**: 5432 (PostgreSQL default)
- **Database**: musicbrainz
- **User**: musicbrainz
- **Password**: musicbrainz (default, should be changed in production)

### HTTP API (Fallback)

- **URL**: http://localhost:5001
- **Endpoint**: /ws/2/
- **Format**: JSON

### Troubleshooting

- **Database Connection Failed**: Check PostgreSQL is running and credentials are correct
- **API Connection Failed**: Check MusicBrainz server is running on port 5001
- **Slow Performance**: Ensure database indexes are built
- **No Results**: Verify data has been imported to the database

## 🧪 Testing

Run the test suite to verify everything works correctly:

```bash
# Run all tests
python3 src/tests/run_tests.py

# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli
```

## 📁 Data Files

The tool uses external JSON files for known artist and recording data:

- **`data/known_artists.json`**: Contains known artist MBIDs for common artists
- **`data/known_recordings.json`**: Contains known recording MBIDs for common songs

These files can be easily updated without touching the code, making it simple to add new artists and recordings.

## 🎯 Features

### ✅ Artist Name Fixes

- `ACDC` → `AC/DC`
- `Bruno Mars ft. Cardi B` → `Bruno Mars feat.
Cardi B`
- `featuring` → `feat.`

### ✅ Song Title Fixes

- `Shot In The Dark` → `Shot in the Dark`
- Removes `(Karaoke Version)`, `(Instrumental)` suffixes
- Normalizes capitalization and formatting

### ✅ Added Data

- **`mbid`**: Official MusicBrainz Artist ID
- **`recording_mbid`**: Official MusicBrainz Recording ID

### ✅ Preserves Your Data

- Keeps all your existing fields (guid, path, disabled, favorite, etc.)
- Only adds new fields, never removes existing ones

### 🆕 Fuzzy Search

- **Intelligent Matching**: Finds similar names even with typos or variations
- **Similarity Scoring**: Shows how well each match scores (0.0 to 1.0)
- **Configurable Thresholds**: Adjust matching sensitivity
- **Multiple Algorithms**: Uses ratio, partial ratio, and token sort matching

## 📖 Usage Examples

### Basic Usage

```bash
# Clean your songs and save to auto-generated filename
python musicbrainz_cleaner.py my_songs.json
# Output: my_songs_cleaned.json
```

### Custom Output File

```bash
# Specify your own output filename
python musicbrainz_cleaner.py my_songs.json cleaned_songs.json
```

### Force API Mode

```bash
# Use HTTP API instead of database (slower but works without PostgreSQL)
python musicbrainz_cleaner.py my_songs.json --use-api
```

### Test Connections

```bash
# Test database connection
python musicbrainz_cleaner.py --test-connection

# Test with API mode
python musicbrainz_cleaner.py --test-connection --use-api
```

### Help

```bash
# Show usage information
python musicbrainz_cleaner.py --help
```

## 📁 Data Files

### Input Format

Your JSON file should contain an array of song objects:

```json
[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  },
  {
    "artist": "Bruno Mars ft. Cardi B",
    "title": "Finesse Remix",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
  }
]
```

## 📤 Output Format

The tool will update your objects with corrected data:

```json
[
  {
    "artist": "AC/DC",
    "title": "Shot in the Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
    "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
  },
  {
    "artist": "Bruno Mars feat. Cardi B",
    "title": "Finesse (remix)",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
    "mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
    "recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
  }
]
```

## 🎬 Example Run

```bash
$ python musicbrainz_cleaner.py data/sample_songs.json
Processing 3 songs...
Using database connection
==================================================

[1/3] Processing: ACDC - Shot In The Dark
  🎯 Fuzzy match found: ACDC → AC/DC (score: 0.85)
  ✅ Found artist: AC/DC (MBID: 66c662b6-6e2f-4930-8610-912e24c63ed1)
  🎯 Fuzzy match found: Shot In The Dark → Shot in the Dark (score: 0.92)
  ✅ Found recording: Shot in the Dark (MBID: cf8b5cd0-d97c-413d-882f-fc422a2e57db)
  ✅ Updated to: AC/DC - Shot in the Dark

[2/3] Processing: Bruno Mars ft. Cardi B - Finesse Remix
  🎯 Fuzzy match found: Bruno Mars → Bruno Mars (score: 1.00)
  ✅ Found artist: Bruno Mars (MBID: afb680f2-b6eb-4cd7-a70b-a63b25c763d5)
  🎯 Fuzzy match found: Finesse Remix → Finesse (remix) (score: 0.88)
  ✅ Found recording: Finesse (remix) (MBID: 8ed14014-547a-4128-ab81-c2dca7ae198e)
  ✅ Updated to: Bruno Mars feat. Cardi B - Finesse (remix)

[3/3] Processing: Taylor Swift - Love Story
  🎯 Fuzzy match found: Taylor Swift → Taylor Swift (score: 1.00)
  ✅ Found artist: Taylor Swift (MBID: 20244d07-534f-4eff-b4d4-930878889970)
  🎯 Fuzzy match found: Love Story → Love Story (score: 1.00)
  ✅ Found recording: Love Story (MBID: d783e6c5-761f-4fc3-bfcf-6089cdfc8f96)
  ✅ Updated to: Taylor Swift - Love Story

==================================================
✅ Processing complete!
📁 Output saved to: data/sample_songs_cleaned.json
```

## 🔧 Troubleshooting

### "Could not find artist"

- The artist might not be in the MusicBrainz database
- Try checking the spelling or using a different variation
- The search index might still be building (wait a few minutes)
- **NEW**: Check fuzzy search similarity score - lower threshold if needed

### "Could not find recording"

- The song might not be in the database
- The title might not match exactly
- Try a simpler title (remove extra words)
- **NEW**: Check fuzzy search similarity score - lower threshold if needed

### Connection errors

- **Database**: Make sure PostgreSQL is running and accessible
- **API**: Make sure your MusicBrainz server is running on `http://localhost:5001`
- Check that Docker containers are up and running
- Verify the server is accessible in your browser

### JSON errors

- Make sure your input file is valid JSON
- Check that it contains an array of objects
- Verify all required fields are present

### Performance issues

- **NEW**: Use database mode instead of API mode for better performance
- **NEW**: Ensure database indexes are built for faster queries
- **NEW**: Check fuzzy search thresholds - higher thresholds mean fewer but more accurate matches

## 🎯 Use Cases

- **Karaoke Systems**: Clean up song metadata for better search and organization
- **Music Libraries**: Standardize artist names and add official IDs
- **Music Apps**: Ensure consistent data across your application
- **Data Migration**: Clean up legacy
music data when moving to new systems
- **Fuzzy Matching**: Handle typos and variations in artist/song names

## 📚 What are MBIDs?

**MBID** stands for **MusicBrainz Identifier**. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.

**Benefits:**

- **Permanent**: Never change, even if names change
- **Universal**: Used across many music applications
- **Reliable**: Official identifiers from the MusicBrainz database
- **Linked Data**: Connect to other music databases and services

## 🆕 Performance Comparison

| Method | Speed | Rate Limiting | Fuzzy Search | Setup Complexity |
|--------|-------|---------------|--------------|------------------|
| **Database** | ⚡ 10x faster | ❌ None | ✅ Yes | 🔧 Medium |
| **API** | 🐌 Slower | ⏱️ Yes (0.1s delay) | ❌ No | ✅ Easy |

## 🤝 Contributing

Found a bug or have a feature request?

1. Check the existing issues
2. Create a new issue with details
3. Include sample data if possible

## 📄 License

This tool is provided as-is for educational and personal use.

## 🔗 Related Links

- [MusicBrainz](https://musicbrainz.org/) - The open music encyclopedia
- [MusicBrainz API](https://musicbrainz.org/doc/Development) - API documentation
- [MusicBrainz Docker](https://github.com/metabrainz/musicbrainz-docker) - Docker setup
- [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy string matching library

---

**Happy cleaning! 🎵✨**
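
For readers curious how the similarity scores and thresholds described under Fuzzy Search might behave, here is a minimal sketch. It is not the tool's actual code: it swaps in Python's standard-library `difflib` for FuzzyWuzzy so it runs without extra dependencies, and the names `similarity`, `best_match`, and the `0.8` default threshold are hypothetical choices made for illustration only.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def _ratio(a: str, b: str) -> float:
    """Plain sequence similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def similarity(a: str, b: str) -> float:
    """Hypothetical scorer: best of a plain ratio and a token-sorted ratio.

    Both strings are lowercased first, so capitalization differences
    (e.g. "Shot In The Dark" vs "Shot in the Dark") do not lower the score.
    """
    a_norm, b_norm = a.lower().strip(), b.lower().strip()
    plain = _ratio(a_norm, b_norm)
    # Token sort: compare the words in sorted order so word order is ignored.
    token_sorted = _ratio(
        " ".join(sorted(a_norm.split())),
        " ".join(sorted(b_norm.split())),
    )
    return max(plain, token_sorted)


def best_match(
    query: str,
    candidates: Iterable[str],
    threshold: float = 0.8,  # assumed default, not taken from the tool
) -> Optional[Tuple[str, float]]:
    """Return (candidate, score) for the best match, or None below threshold."""
    scored = [(name, similarity(query, name)) for name in candidates]
    if not scored:
        return None
    name, score = max(scored, key=lambda pair: pair[1])
    return (name, score) if score >= threshold else None


if __name__ == "__main__":
    artists = ["AC/DC", "Bruno Mars", "Taylor Swift"]
    print(best_match("ACDC", artists))
    print(best_match("Unknown Band", artists))
```

Raising `threshold` in this sketch returns fewer but closer matches, which mirrors the trade-off noted in the troubleshooting section. Scores from `difflib` will not be identical to those FuzzyWuzzy reports.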