🎵 MusicBrainz Data Cleaner v3.0
A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. Now with advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!
✨ What's New in v3.0
- 🚀 Direct Database Access: Connect directly to PostgreSQL for 10x faster performance
- 🎯 Advanced Fuzzy Search: Intelligent matching for similar artist names and song titles
- 🔄 Automatic Fallback: Falls back to API mode if database access fails
- ⚡ No Rate Limiting: Database queries don't have API rate limits
- 📊 Similarity Scoring: See the similarity score behind each match
- 🆕 Collaboration Detection: Intelligently handle complex collaborations like "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- 🆕 Artist Aliases: Handle name variations like "98 Degrees" → "98°" and "S Club 7" → "S Club"
- 🆕 Sort Names: Handle "Last, First" formats like "Corby, Matt" → "Matt Corby"
- 🆕 Edge Case Handling: Support for artists with hyphens, exclamation marks, numbers, and special characters
- 🆕 Band Name Protection: Distinguish between band names (Simon & Garfunkel) and collaborations (Lovato, Demi & Joe Jonas)
✨ What It Does
Before:
{
"artist": "ACDC",
"title": "Shot In The Dark",
"favorite": true
}
After:
{
"artist": "AC/DC",
"title": "Shot in the Dark",
"favorite": true,
"mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
"recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}
🚀 Quick Start
1. Install Dependencies
pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein
2. Set Up MusicBrainz Server
Option A: Docker (Recommended)
# Clone MusicBrainz Docker repository
git clone https://github.com/metabrainz/musicbrainz-docker.git
cd musicbrainz-docker
# Update postgres.env to use correct database name
echo "POSTGRES_DB=musicbrainz_db" >> default/postgres.env
# Start the server
docker-compose up -d
# Wait for database to be ready (can take 10-15 minutes)
docker-compose logs -f musicbrainz
Option B: Manual Setup
- Install PostgreSQL 12+
- Create database: createdb musicbrainz_db
- Import MusicBrainz data dump
- Start MusicBrainz server on port 8080
3. Test Connection
python musicbrainz_cleaner.py --test-connection
4. Run the Cleaner
# Use database access (recommended, faster)
python musicbrainz_cleaner.py your_songs.json
# Force API mode (slower, fallback)
python musicbrainz_cleaner.py your_songs.json --use-api
That's it! Your cleaned data will be saved to your_songs_cleaned.json
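The default output name follows the pattern shown above (input name plus a _cleaned suffix). A minimal sketch of that naming rule, assuming a hypothetical helper called default_output_path that is not taken from the tool's source:

```python
from pathlib import Path


def default_output_path(input_path: str) -> Path:
    """Return '<stem>_cleaned<suffix>' next to the input file (assumed rule)."""
    source = Path(input_path)
    return source.with_name(f"{source.stem}_cleaned{source.suffix}")


print(default_output_path("your_songs.json"))  # your_songs_cleaned.json
```

Passing a second filename on the command line overrides this default.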
📋 Requirements
- Python 3.6+
- MusicBrainz Server running on localhost:8080
- PostgreSQL Database accessible on localhost:5432
- Dependencies: requests, psycopg2-binary, fuzzywuzzy, python-Levenshtein
🔧 Server Configuration
Database Access
- Host: localhost (or Docker container IP: 172.18.0.2)
- Port: 5432 (PostgreSQL default)
- Database: musicbrainz_db (actual database name)
- User: musicbrainz
- Password: musicbrainz (default, should be changed in production)
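One way to keep these settings out of the code is to read them from environment variables and fall back to the defaults listed above. This is only a sketch: the MB_DB_* variable names and the database_params helper are hypothetical and not part of the tool.

```python
import os


def database_params(env=None):
    """Build PostgreSQL connection keyword arguments from the environment.

    The MB_DB_* names are illustrative assumptions; the defaults mirror
    the values documented in this README.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("MB_DB_HOST", "localhost"),
        "port": int(env.get("MB_DB_PORT", "5432")),
        "dbname": env.get("MB_DB_NAME", "musicbrainz_db"),
        "user": env.get("MB_DB_USER", "musicbrainz"),
        "password": env.get("MB_DB_PASSWORD", "musicbrainz"),
    }
```

The resulting dictionary can be passed straight to psycopg2, for example psycopg2.connect(**database_params()), since host, port, dbname, user, and password are all keywords that psycopg2.connect accepts.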
HTTP API (Fallback)
- URL: http://localhost:8080
- Endpoint: /ws/2/
- Format: JSON
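For reference, a search request against the local server in API mode might be built like this. The sketch assumes the standard MusicBrainz search parameters (query, fmt, limit); the artist_search_url helper is hypothetical and may not match how the tool builds its requests.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/ws/2/"


def artist_search_url(name: str, limit: int = 5) -> str:
    """Build an artist search URL that asks for a JSON response."""
    # Note: a double quote inside `name` would need escaping for the
    # search syntax; this sketch does not handle that case.
    query = urlencode({"query": f'artist:"{name}"', "fmt": "json", "limit": limit})
    return f"{BASE_URL}artist?{query}"


print(artist_search_url("AC/DC"))
```

The URL can then be fetched with requests.get(url) and decoded with .json().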
Troubleshooting
- Database Connection Failed: Check PostgreSQL is running and credentials are correct
- API Connection Failed: Check MusicBrainz server is running on port 8080
- Slow Performance: Ensure database indexes are built
- No Results: Verify data has been imported to the database
- NEW: Docker Networking: Use the container IP (172.18.0.2 in this setup; yours may differ) for Docker-to-Docker connections
- NEW: Database Name: Ensure you are using musicbrainz_db, not musicbrainz
🧪 Testing
Run the test suite to verify everything works correctly:
# Run all tests
python3 src/tests/run_tests.py
# Run specific test categories
python3 src/tests/run_tests.py --unit # Unit tests only
python3 src/tests/run_tests.py --integration # Integration tests only
# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli
# List all available tests
python3 src/tests/run_tests.py --list
Test Categories
- Unit Tests: Test individual components in isolation
- Integration Tests: Test interactions between components and database
- Debug Tests: Debug scripts and troubleshooting tools
📁 Data Files
The tool uses external JSON files for name variations:
- data/known_artists.json: Contains name variations (ACDC → AC/DC, ft. → feat.)
- data/known_recordings.json: Contains known recording MBIDs for common songs
These files can be easily updated without touching the code, making it simple to add new name variations.
🎯 Features
✅ Artist Name Fixes
- ACDC → AC/DC
- Bruno Mars ft. Cardi B → Bruno Mars feat. Cardi B
- featuring → feat.
- 98 Degrees → 98° (artist aliases)
- S Club 7 → S Club (numerical suffixes)
- Corby, Matt → Matt Corby (sort names)
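Two of these fixes are purely textual and can be illustrated without a database. The sketch below is an assumption about how such rules could look, not the tool's actual code; normalize_featuring and flip_sort_name are hypothetical names.

```python
import re

# Matches "ft", "ft.", "feat", "feat." or "featuring" between two names.
FEAT_PATTERN = re.compile(r"\s+(?:ft\.?|feat\.?|featuring)\s+", re.IGNORECASE)


def normalize_featuring(artist: str) -> str:
    """Rewrite any featuring marker to the canonical ' feat. ' form."""
    return FEAT_PATTERN.sub(" feat. ", artist)


def flip_sort_name(artist: str) -> str:
    """Turn 'Last, First' into 'First Last' when there is exactly one comma."""
    parts = [part.strip() for part in artist.split(",")]
    if len(parts) == 2 and all(parts):
        return f"{parts[1]} {parts[0]}"
    return artist


print(normalize_featuring("Bruno Mars ft. Cardi B"))  # Bruno Mars feat. Cardi B
print(flip_sort_name("Corby, Matt"))                  # Matt Corby
```

A blind flip like this mishandles names that combine a sort name with a collaboration, so a lookup against artist.sort_name in the database is the safer route for anything beyond simple cases.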
✅ Collaboration Detection
- Primary Patterns: "ft.", "feat.", "featuring" (always collaborations)
- Secondary Patterns: "&", "and", "," (intelligent detection)
- Band Name Protection: "Simon & Garfunkel" (not collaboration)
- Complex Collaborations: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- Case Insensitive: "Featuring" → "featuring"
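The two-tier idea above can be sketched roughly as follows. The KNOWN_BANDS allow-list and the classify_artist function are hypothetical stand-ins for illustration; the tool itself may decide band names differently, for example by checking the database.

```python
import re

# Primary patterns always indicate a collaboration.
PRIMARY = re.compile(r"\s(?:ft\.?|feat\.?|featuring)\s", re.IGNORECASE)
# Secondary patterns are ambiguous: they also appear in band names.
SECONDARY = re.compile(r"\s(?:&|and)\s|,", re.IGNORECASE)
# Hypothetical allow-list used only for this sketch.
KNOWN_BANDS = {"simon & garfunkel", "hall & oates"}


def classify_artist(name: str) -> str:
    """Classify an artist string using primary, then secondary patterns."""
    if PRIMARY.search(name):
        return "collaboration"
    if name.strip().lower() in KNOWN_BANDS:
        return "band"
    if SECONDARY.search(name):
        return "possible collaboration"
    return "single artist"


print(classify_artist("Pitbull ft. Ne-Yo, Afrojack & Nayer"))  # collaboration
print(classify_artist("Simon & Garfunkel"))                    # band
```

Anything flagged as a possible collaboration still needs confirmation, since a static list cannot cover every band whose name contains "&", "and", or a comma.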
✅ Song Title Fixes
- Shot In The Dark → Shot in the Dark
- Removes (Karaoke Version), (Instrumental) suffixes
- Normalizes capitalization and formatting
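Suffix removal can be pictured as a small regular-expression pass over the title. This is a sketch under the assumption that only the two suffixes named above are stripped; strip_title_suffixes is a hypothetical helper, not the tool's function.

```python
import re

# Matches one trailing "(Karaoke Version)" or "(Instrumental)" suffix.
SUFFIX = re.compile(r"\s*\((?:Karaoke Version|Instrumental)\)\s*$", re.IGNORECASE)


def strip_title_suffixes(title: str) -> str:
    """Remove trailing suffixes repeatedly until none remain."""
    previous = None
    while previous != title:
        previous, title = title, SUFFIX.sub("", title)
    return title


print(strip_title_suffixes("Shot In The Dark (Karaoke Version)"))  # Shot In The Dark
```

Other parenthesised text, such as (remix), is left alone because it is not in the pattern.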
✅ Added Data
- mbid: Official MusicBrainz Artist ID
- recording_mbid: Official MusicBrainz Recording ID
✅ Preserves Your Data
- Keeps all your existing fields (guid, path, disabled, favorite, etc.)
- Only adds new fields, never removes existing ones
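In code terms, preserving data means copying the original object and overwriting or adding only the cleaned fields. A minimal sketch, where apply_match is a hypothetical name rather than the tool's own function:

```python
def apply_match(song: dict, artist: str, title: str,
                mbid: str, recording_mbid: str) -> dict:
    """Return a copy of `song` with cleaned names and MBIDs added.

    Every other key (guid, path, favorite, ...) is carried over untouched.
    """
    updated = dict(song)  # shallow copy so the input is not mutated
    updated.update({
        "artist": artist,
        "title": title,
        "mbid": mbid,
        "recording_mbid": recording_mbid,
    })
    return updated


song = {"artist": "ACDC", "title": "Shot In The Dark", "favorite": True}
print(apply_match(song, "AC/DC", "Shot in the Dark",
                  "66c662b6-6e2f-4930-8610-912e24c63ed1",
                  "cf8b5cd0-d97c-413d-882f-fc422a2e57db"))
```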
🆕 Advanced Fuzzy Search
- Intelligent Matching: Finds similar names even with typos or variations
- Similarity Scoring: Shows how well each match scores (0.0 to 1.0)
- Configurable Thresholds: Adjust matching sensitivity
- Multiple Algorithms: Uses ratio, partial ratio, and token sort matching
- Enhanced Search Fields: artist.name, artist_alias.name, artist.sort_name
- Dash Handling: Regular dash (-) vs Unicode dash (‐)
- Substring Protection: Avoids false matches like "Sleazy-E" vs "Eazy-E"
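To show how scoring and thresholds interact, here is a small sketch that uses difflib from the standard library as a stand-in for fuzzywuzzy, so it runs without extra dependencies. The function names and the 0.8 default threshold are assumptions, and the scores will not match fuzzywuzzy's exactly.

```python
from difflib import SequenceMatcher

# Map common Unicode dashes to the ASCII hyphen before comparing.
DASHES = str.maketrans({"\u2010": "-", "\u2011": "-", "\u2013": "-", "\u2014": "-"})


def normalize(text: str) -> str:
    """Lowercase, unify dashes, and collapse whitespace."""
    return " ".join(text.translate(DASHES).lower().split())


def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def token_sort(text: str) -> str:
    """Sort words so that word order does not affect the score."""
    return " ".join(sorted(text.split()))


def similarity(query: str, candidate: str) -> float:
    """Best of plain ratio and token-sort ratio, from 0.0 to 1.0."""
    a, b = normalize(query), normalize(candidate)
    return max(ratio(a, b), ratio(token_sort(a), token_sort(b)))


def best_match(query: str, candidates, threshold: float = 0.8):
    """Return (name, score); name is None when nothing reaches the threshold."""
    scored = [(similarity(query, c), c) for c in candidates]
    score, name = max(scored, default=(0.0, None))
    return (name, score) if score >= threshold else (None, score)


print(best_match("Blink-182", ["blink\u2010182", "Eazy-E"]))
```

Partial (substring) scoring is deliberately left out of this sketch, because it would rate "Sleazy-E" against "Eazy-E" as a perfect match. Full-string ratios still score that pair as fairly similar, which is why the threshold matters.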
🆕 Edge Case Support
- Hyphenated Artists: "Blink-182", "Ne-Yo", "G-Eazy"
- Exclamation Marks: "P!nk", "Panic! At The Disco", "3OH!3"
- Numbers: "98 Degrees", "S Club 7", "3 Doors Down"
- Special Characters: "a-ha", "The B-52s", "Salt-N-Pepa"
📖 Usage Examples
Basic Usage
# Clean your songs and save to auto-generated filename
python musicbrainz_cleaner.py my_songs.json
# Output: my_songs_cleaned.json
Custom Output File
# Specify your own output filename
python musicbrainz_cleaner.py my_songs.json cleaned_songs.json
Force API Mode
# Use HTTP API instead of database (slower but works without PostgreSQL)
python musicbrainz_cleaner.py my_songs.json --use-api
Test Connections
# Test database connection
python musicbrainz_cleaner.py --test-connection
# Test with API mode
python musicbrainz_cleaner.py --test-connection --use-api
Help
# Show usage information
python musicbrainz_cleaner.py --help
📁 Data Files
Input Format
Your JSON file should contain an array of song objects:
[
{
"artist": "ACDC",
"title": "Shot In The Dark",
"disabled": false,
"favorite": true,
"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
"path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
},
{
"artist": "Bruno Mars ft. Cardi B",
"title": "Finesse Remix",
"disabled": false,
"favorite": false,
"guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
"path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
}
]
📤 Output Format
The tool will update your objects with corrected data:
[
{
"artist": "AC/DC",
"title": "Shot in the Dark",
"disabled": false,
"favorite": true,
"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
"path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
"mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
"recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
},
{
"artist": "Bruno Mars feat. Cardi B",
"title": "Finesse (remix)",
"disabled": false,
"favorite": false,
"guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
"path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
"mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
"recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
}
]
🎬 Example Run
$ python musicbrainz_cleaner.py data/sample_songs.json
Processing 3 songs...
Using database connection
==================================================
[1/3] Processing: ACDC - Shot In The Dark
🎯 Fuzzy match found: ACDC → AC/DC (score: 0.85)
✅ Found artist: AC/DC (MBID: 66c662b6-6e2f-4930-8610-912e24c63ed1)
🎯 Fuzzy match found: Shot In The Dark → Shot in the Dark (score: 0.92)
✅ Found recording: Shot in the Dark (MBID: cf8b5cd0-d97c-413d-882f-fc422a2e57db)
✅ Updated to: AC/DC - Shot in the Dark
[2/3] Processing: Bruno Mars ft. Cardi B - Finesse Remix
🎯 Fuzzy match found: Bruno Mars → Bruno Mars (score: 1.00)
✅ Found artist: Bruno Mars (MBID: afb680f2-b6eb-4cd7-a70b-a63b25c763d5)
🎯 Fuzzy match found: Finesse Remix → Finesse (remix) (score: 0.88)
✅ Found recording: Finesse (remix) (MBID: 8ed14014-547a-4128-ab81-c2dca7ae198e)
✅ Updated to: Bruno Mars feat. Cardi B - Finesse (remix)
[3/3] Processing: Taylor Swift - Love Story
🎯 Fuzzy match found: Taylor Swift → Taylor Swift (score: 1.00)
✅ Found artist: Taylor Swift (MBID: 20244d07-534f-4eff-b4d4-930878889970)
🎯 Fuzzy match found: Love Story → Love Story (score: 1.00)
✅ Found recording: Love Story (MBID: d783e6c5-761f-4fc3-bfcf-6089cdfc8f96)
✅ Updated to: Taylor Swift - Love Story
==================================================
✅ Processing complete!
📁 Output saved to: data/sample_songs_cleaned.json
🔧 Troubleshooting
"Could not find artist"
- The artist might not be in the MusicBrainz database
- Try checking the spelling or using a different variation
- The search index might still be building (wait a few minutes)
- Check fuzzy search similarity score - lower threshold if needed
- NEW: Check for artist aliases (e.g., "98 Degrees" → "98°")
- NEW: Check for sort names (e.g., "Corby, Matt" → "Matt Corby")
"Could not find recording"
- The song might not be in the database
- The title might not match exactly
- Try a simpler title (remove extra words)
- Check fuzzy search similarity score - lower threshold if needed
- NEW: For collaborations, check if it's stored under the main artist
Connection errors
- Database: Make sure PostgreSQL is running and accessible
- API: Make sure your MusicBrainz server is running on http://localhost:8080
- Check that Docker containers are up and running
- Verify the server is accessible in your browser
- NEW: For Docker, use container IP (172.18.0.2) instead of localhost
JSON errors
- Make sure your input file is valid JSON
- Check that it contains an array of objects
- Verify all required fields are present
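These checks can be run before handing a file to the cleaner. The sketch below assumes that artist and title are the required fields, which this README implies but does not state outright; validate_songs is a hypothetical helper.

```python
import json

# Assumption: these are the fields the cleaner needs on every song.
REQUIRED_FIELDS = ("artist", "title")


def validate_songs(raw: str) -> list:
    """Parse JSON text and check that it is an array of song objects."""
    songs = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(songs, list):
        raise ValueError("top-level JSON value must be an array")
    for index, song in enumerate(songs):
        if not isinstance(song, dict):
            raise ValueError(f"item {index} is not an object")
        missing = [field for field in REQUIRED_FIELDS if field not in song]
        if missing:
            raise ValueError(f"item {index} is missing: {', '.join(missing)}")
    return songs
```

For a file on disk, read it first, for example validate_songs(open("my_songs.json", encoding="utf-8").read()).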
Performance issues
- Use database mode instead of API mode for better performance
- Ensure database indexes are built for faster queries
- Check fuzzy search thresholds - higher thresholds mean fewer but more accurate matches
Collaboration detection issues
- NEW: Check if it's a band name vs collaboration (e.g., "Simon & Garfunkel" vs "Lovato, Demi & Joe Jonas")
- NEW: Verify the collaboration pattern is supported (ft., feat., featuring, &, and, ,)
- NEW: Check case sensitivity - patterns are case-insensitive
🎯 Use Cases
- Karaoke Systems: Clean up song metadata for better search and organization
- Music Libraries: Standardize artist names and add official IDs
- Music Apps: Ensure consistent data across your application
- Data Migration: Clean up legacy music data when moving to new systems
- Fuzzy Matching: Handle typos and variations in artist/song names
- NEW: Collaboration Handling: Process complex artist collaborations
- NEW: Edge Cases: Handle artists with special characters and unusual names
📚 What are MBIDs?
MBID stands for MusicBrainz Identifier. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.
Benefits:
- Permanent: Never change, even if names change
- Universal: Used across many music applications
- Reliable: Official identifiers from the MusicBrainz database
- Linked Data: Connect to other music databases and services
🆕 Performance Comparison
| Method | Speed | Rate Limiting | Fuzzy Search | Setup Complexity |
|---|---|---|---|---|
| Database | ⚡ 10x faster | ❌ None | ✅ Yes | 🔧 Medium |
| API | 🐌 Slower | ⏱️ Yes (0.1s delay) | ❌ No | ✅ Easy |
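The API delay in the table amounts to spacing requests at least 0.1 seconds apart. A minimal sketch of such a throttle, with a hypothetical Throttle class whose clock and sleep functions are injectable so it can be exercised without real waiting:

```python
import time


class Throttle:
    """Ensure consecutive calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval=0.1, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() before each HTTP request keeps API mode within the delay, while database mode can skip it entirely.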
🆕 Collaboration Detection Examples
| Input | Type | Detection | Output |
|---|---|---|---|
| Bruno Mars ft. Cardi B | Collaboration | ✅ Primary pattern | Bruno Mars feat. Cardi B |
| Pitbull ft. Ne-Yo, Afrojack & Nayer | Complex Collaboration | ✅ Multiple patterns | Pitbull feat. Ne-Yo, Afrojack & Nayer |
| Simon & Garfunkel | Band Name | ❌ Protected | Simon & Garfunkel |
| Lovato, Demi & Joe Jonas | Collaboration | ✅ Comma detection | Lovato, Demi & Joe Jonas |
| Hall & Oates | Band Name | ❌ Protected | Hall & Oates |
🆕 Edge Case Examples
| Input | Type | Handling | Output |
|---|---|---|---|
| ACDC | Name Variation | ✅ Alias lookup | AC/DC |
| 98 Degrees | Artist Alias | ✅ Alias search | 98° |
| S Club 7 | Numerical Suffix | ✅ Suffix removal | S Club |
| Corby, Matt | Sort Name | ✅ Sort name search | Matt Corby |
| Blink-182 | Dash Variation | ✅ Unicode dash handling | blink‐182 |
| P!nk | Special Characters | ✅ Direct search | P!nk |
| 3OH!3 | Numbers + Special | ✅ Direct search | 3OH!3 |
🤝 Contributing
Found a bug or have a feature request?
- Check the existing issues
- Create a new issue with details
- Include sample data if possible
📄 License
This tool is provided as-is for educational and personal use.
🔗 Related Links
- MusicBrainz - The open music encyclopedia
- MusicBrainz API - API documentation
- MusicBrainz Docker - Docker setup
- FuzzyWuzzy - Fuzzy string matching library
📝 Lessons Learned
Database Integration
- Direct PostgreSQL access is 10x faster than API calls
- Docker networking requires container IPs, not localhost
- Database name matters: musicbrainz_db, not musicbrainz
- Static caches cause problems: Wrong MBIDs override correct database lookups
Collaboration Handling
- Primary patterns (ft., feat.) are always collaborations
- Secondary patterns (&, and) require intelligence to distinguish from band names
- Comma detection helps identify collaborations
- Artist credit lookup is essential for preserving all collaborators
Edge Cases
- Dash variations (regular vs Unicode) cause exact match failures
- Artist aliases are common and important (98 Degrees → 98°)
- Sort names handle "Last, First" formats
- Numerical suffixes in names need special handling (S Club 7 → S Club)
Performance Optimization
- Remove static caches for better accuracy
- Database-first approach ensures live data
- Fuzzy search thresholds need tuning for different datasets
- Connection pooling would improve performance for large datasets
Happy cleaning! 🎵✨