🎵 MusicBrainz Data Cleaner v3.0
A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. Now with advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!
🚀 Quick Start for New Sessions
If you're starting fresh or after a reboot, follow this exact sequence:
1. Start MusicBrainz Services
# Quick restart (recommended)
./restart_services.sh
# Or full restart (if you have issues)
./start_services.sh
2. Wait for Services to Initialize
- Database: 5-10 minutes to fully load
- Web server: 2-3 minutes to start responding
- Check status:
cd ../musicbrainz-docker && docker-compose ps
3. Verify Services Are Ready
# Test web server
curl -s http://localhost:5001 | head -5
# Test database (should show 2.6M+ artists)
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
# Test cleaner connection
docker-compose run --rm musicbrainz-cleaner python3 -c "from src.api.database import MusicBrainzDatabase; db = MusicBrainzDatabase(); print('Connection result:', db.connect())"
4. Run Tests
# Test 100 random songs
docker-compose run --rm musicbrainz-cleaner python3 test_100_random.py
# Or other test scripts
docker-compose run --rm musicbrainz-cleaner python3 [script_name].py
⚠️ Important: Always run scripts via Docker - the cleaner cannot connect to the database directly from outside the container.
📋 Troubleshooting: See TROUBLESHOOTING.md for common issues and solutions.
✨ What's New in v3.0
- 🚀 Direct Database Access: Connect directly to PostgreSQL for 10x faster performance
- 🎯 Advanced Fuzzy Search: Intelligent matching for similar artist names and song titles
- 🔄 Automatic Fallback: Falls back to API mode if database access fails
- ⚡ No Rate Limiting: Database queries don't have API rate limits
- 📊 Similarity Scoring: See how well matches are scored
- 🆕 Collaboration Detection: Intelligently handle complex collaborations like "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- 🆕 Artist Aliases: Handle name variations like "98 Degrees" → "98°" and "S Club 7" → "S Club"
- 🆕 Sort Names: Handle "Last, First" formats like "Corby, Matt" → "Matt Corby"
- 🆕 Edge Case Handling: Support for artists with hyphens, exclamation marks, numbers, and special characters
- 🆕 Band Name Protection: Distinguish between band names (Simon & Garfunkel) and collaborations (Lavato, Demi & Joe Jonas)
✨ What It Does
Before:
{
"artist": "ACDC",
"title": "Shot In The Dark",
"favorite": true
}
After:
{
"artist": "AC/DC",
"title": "Shot in the Dark",
"favorite": true,
"mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
"recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}
🚀 Quick Start
Option 1: Automated Setup (Recommended)
1. Start MusicBrainz services:
./start_services.sh
This script will:
- Check for Docker and port conflicts
- Start all MusicBrainz services
- Wait for database initialization
- Create environment configuration
- Test the connection
2. Run the cleaner:
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --input data/songs.json --output cleaned_songs.json
Option 2: Manual Setup
1. Start MusicBrainz services manually:
cd ../musicbrainz-docker
MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
Wait 5-10 minutes for database initialization.
2. Create environment configuration:
# Create .env file in musicbrainz-cleaner directory
cat > .env << EOF
DB_HOST=172.18.0.2
DB_PORT=5432
DB_NAME=musicbrainz_db
DB_USER=musicbrainz
DB_PASSWORD=musicbrainz
MUSICBRAINZ_WEB_SERVER_PORT=5001
EOF
3. Run the cleaner:
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --input data/songs.json --output cleaned_songs.json
For detailed setup instructions, see SETUP.md
🔄 After System Reboot
After restarting your Mac, you'll need to restart the MusicBrainz services:
Quick Restart (Recommended)
# If Docker Desktop is already running
./restart_services.sh
# Or manually
cd ../musicbrainz-docker && MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
Full Restart (If you have issues)
# Complete setup including Docker checks
./start_services.sh
Auto-start Setup (Optional)
1. Enable Docker Desktop auto-start:
- Open Docker Desktop
- Go to Settings → General
- Check "Start Docker Desktop when you log in"
2. Then just run ./restart_services.sh after each reboot.
Note: Your data is preserved in Docker volumes, so you don't need to reconfigure anything after a reboot.
🚨 Common Startup Issues & Fixes
Issue 1: Database Connection Refused
Problem: Cleaner can't connect to database with error "Connection refused"
Root Cause: Database container not fully initialized or wrong host configuration
Fix: Wait for the database to finish initializing, then verify it accepts connections
# Wait for database to be ready (check logs)
cd ../musicbrainz-docker && docker-compose logs db | tail -10
# Verify database is accepting connections
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
Issue 2: Wrong Database Host Configuration
Problem: Cleaner tries to connect to 172.18.0.2 but can't reach it
Root Cause: Hardcoded IP address in database connection
Fix: Use Docker service name db instead of IP address
# In src/api/database.py, change:
host='172.18.0.2' # ❌ Wrong
host='db' # ✅ Correct
Issue 3: Test Script Logic Error
Problem: Test shows 0% success rate despite finding artists
Root Cause: Test script checking 'mbid' in result where result is a tuple (song_dict, success_boolean)
Fix: Extract song dictionary from tuple
# Wrong:
artist_found = 'mbid' in result
# Correct:
cleaned_song, success = result
artist_found = 'mbid' in cleaned_song
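The same unpacking applies when tallying results over a whole batch. A minimal sketch, assuming each result is a (song_dict, success) tuple as described above; `count_artists_found` and the sample data are hypothetical names used only for illustration:

```python
def count_artists_found(results):
    """Count results whose cleaned song dict carries an artist 'mbid'."""
    found = 0
    for cleaned_song, _success in results:  # unpack the tuple first
        if 'mbid' in cleaned_song:
            found += 1
    return found


# Hypothetical sample results, not real tool output
sample_results = [
    ({'artist': 'AC/DC', 'mbid': '66c662b6-6e2f-4930-8610-912e24c63ed1'}, True),
    ({'artist': 'Unknown Artist'}, False),
]
print(count_artists_found(sample_results))
```

Checking `'mbid' in result` on the tuple itself compares the string against the two tuple elements, which is why the original test always reported 0%.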
Issue 4: Services Not Fully Initialized
Problem: API returns empty results even though database has data
Root Cause: MusicBrainz web server still starting up
Fix: Wait for services to be fully ready
# Check if web server is responding
curl -s http://localhost:5001 | head -5
# Wait for database to be ready
docker-compose logs db | grep "database system is ready"
Issue 5: Port Conflicts
Problem: Port 5000 already in use
Root Cause: Another service using the port
Fix: Use alternative port
MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
Issue 6: Container Name Conflicts
Problem: "Container name already in use" error
Root Cause: Previous containers not properly cleaned up
Fix: Remove conflicting containers
docker-compose down
docker rm -f <container_name>
🔧 Startup Checklist
Before running tests, verify:
- ✅ Docker Desktop is running
- ✅ All containers are up: docker-compose ps
- ✅ Database is ready: docker-compose logs db | grep "ready"
- ✅ Web server responds: curl -s http://localhost:5001
- ✅ Database has data: docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
- ✅ Cleaner can connect: Test database connection in cleaner
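Because the services can take several minutes to come up, the web server check can be polled rather than run by hand. This is a rough sketch using only the standard library; `web_server_responds` and `wait_until` are hypothetical helpers, not part of the cleaner, and the URL assumes the port 5001 setup used above:

```python
import time
import urllib.error
import urllib.request


def web_server_responds(url="http://localhost:5001", timeout=5):
    """Return True if an HTTP request to the web server gets any response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # nothing listening yet, or the request timed out


def wait_until(check, timeout_s=600, interval_s=10, sleep=time.sleep):
    """Call check() until it returns True or roughly timeout_s has been slept."""
    waited = 0
    while True:
        if check():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Example (performs real HTTP requests): wait_until(web_server_responds)
```

The default timeout of 600 seconds mirrors the 5-10 minute database load time mentioned earlier; adjust it to your machine.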
📋 Requirements
- Python 3.6+
- MusicBrainz Server running on localhost:8080
- PostgreSQL Database accessible on localhost:5432
- Dependencies: requests, psycopg2-binary, fuzzywuzzy, python-Levenshtein
🔧 Server Configuration
Database Access
- Host: localhost (or Docker container IP: 172.18.0.2)
- Port: 5432 (PostgreSQL default)
- Database: musicbrainz_db (actual database name)
- User: musicbrainz
- Password: musicbrainz (default, should be changed in production)
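One way to keep these settings out of the code is to read them from the environment. The sketch below assumes the variable names from the .env file in the manual setup (DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD); whether the cleaner itself reads them this way is an assumption, and `load_db_config` is a hypothetical helper:

```python
import os


def load_db_config(env=None):
    """Build connection settings from environment variables.

    Falls back to the defaults listed in the Database Access section.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env.get("DB_NAME", "musicbrainz_db"),
        "user": env.get("DB_USER", "musicbrainz"),
        "password": env.get("DB_PASSWORD", "musicbrainz"),
    }


print(load_db_config({"DB_HOST": "db"})["host"])
```

The keys match the keyword arguments accepted by `psycopg2.connect`, so the dictionary could be passed as `psycopg2.connect(**load_db_config())`.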
HTTP API (Fallback)
- URL: http://localhost:8080
- Endpoint: /ws/2/
- Format: JSON
Troubleshooting
- Database Connection Failed: Check PostgreSQL is running and credentials are correct
- API Connection Failed: Check MusicBrainz server is running on port 8080
- Slow Performance: Ensure database indexes are built
- No Results: Verify data has been imported to the database
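- NEW: Docker Networking: For container-to-container connections, use the Docker service name db rather than localhost; a hardcoded container IP such as 172.18.0.2 can stop working when Docker reassigns addresses (see Issue 2 above)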
- NEW: Database Name: Ensure you are using musicbrainz_db, not musicbrainz
🧪 Testing
Test File Organization
- REQUIRED: All test files must be placed in the src/tests/ directory
- PROHIBITED: Test files should not be placed in the root directory
- Naming Convention: Test files should follow test_*.py or debug_*.py patterns
- Purpose: Keeps root directory clean and organizes test code properly
Running Tests
# Run all tests
python3 src/tests/run_tests.py
# Run specific test categories
python3 src/tests/run_tests.py --unit # Unit tests only
python3 src/tests/run_tests.py --integration # Integration tests only
# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli
# List all available tests
python3 src/tests/run_tests.py --list
Test Categories
- Unit Tests: Test individual components in isolation
- Integration Tests: Test interactions between components and database
- Debug Tests: Debug scripts and troubleshooting tools
📁 Project Structure
musicbrainz-cleaner/
├── src/
│ ├── api/ # Database and API access
│ ├── cli/ # Command-line interface
│ ├── config/ # Configuration and constants
│ ├── core/ # Core functionality
│ ├── tests/ # Test files (REQUIRED location)
│ └── utils/ # Utility functions
├── data/ # Data files and output
│ ├── known_artists.json # Name variations (ACDC → AC/DC)
│ ├── known_recordings.json # Known recording MBIDs
│ └── songs.json # Source songs file
└── docker-compose.yml # Docker configuration
Data Files
The tool uses external JSON files for name variations:
- data/known_artists.json: Contains name variations (ACDC → AC/DC, ft. → feat.)
- data/known_recordings.json: Contains known recording MBIDs for common songs
These files can be easily updated without touching the code, making it simple to add new name variations.
🎯 Features
✅ Artist Name Fixes
- ACDC → AC/DC
- Bruno Mars ft. Cardi B → Bruno Mars feat. Cardi B
- featuring → feat.
- 98 Degrees → 98° (artist aliases)
- S Club 7 → S Club (numerical suffixes)
- Corby, Matt → Matt Corby (sort names)
✅ Collaboration Detection
- Primary Patterns: "ft.", "feat.", "featuring" (always collaborations)
- Secondary Patterns: "&", "and", "," (intelligent detection)
- Band Name Protection: 200+ known band names from data/known_artists.json
- Complex Collaborations: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- Case Insensitive: "Featuring" → "featuring"
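The detection rules above can be sketched roughly as follows. This is an illustration, not the tool's actual code: `KNOWN_BANDS` is a two-entry stand-in for the band list in data/known_artists.json, and `split_collaboration` is a hypothetical function name.

```python
import re

# Hypothetical stand-in for the band names kept in data/known_artists.json
KNOWN_BANDS = {"simon & garfunkel", "hall & oates"}

# Primary patterns always mark a collaboration
PRIMARY = re.compile(r"\s+(?:ft\.?|feat\.?|featuring)\s+", re.IGNORECASE)
# Secondary patterns only count when the name is not a protected band
SECONDARY = re.compile(r"\s*,\s*|\s+&\s+|\s+and\s+", re.IGNORECASE)


def split_collaboration(artist):
    """Return (main_artist, collaborators); the list is empty for solo acts and bands."""
    name = artist.strip()
    if name.lower() in KNOWN_BANDS:
        return name, []  # band name protection
    parts = PRIMARY.split(name, maxsplit=1)
    if len(parts) == 2:
        main, rest = parts
        return main, [p for p in SECONDARY.split(rest) if p]
    pieces = [p for p in SECONDARY.split(name) if p]
    if len(pieces) > 1:
        return pieces[0], pieces[1:]
    return name, []


print(split_collaboration("Pitbull ft. Ne-Yo, Afrojack & Nayer"))
```

Note that this sketch treats every comma as a separator, so a band such as "Earth, Wind & Fire" or a sort name such as "Corby, Matt" would be split unless it is handled first; that is exactly the case the band list and sort-name lookup exist to cover.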
✅ Song Title Fixes
- Shot In The Dark → Shot in the Dark
- Removes (Karaoke Version), (Instrumental) suffixes
- Normalizes capitalization and formatting
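Suffix removal of this kind is usually a small regular expression. The following is a sketch under the assumption that only the two suffixes named above are stripped; the real list may be longer, and `strip_title_suffix` is a hypothetical name:

```python
import re

# Suffixes named above; the exact list the tool strips is an assumption
SUFFIX = re.compile(r"\s*\((?:Karaoke Version|Instrumental)\)\s*$", re.IGNORECASE)


def strip_title_suffix(title):
    """Remove trailing '(Karaoke Version)' / '(Instrumental)' markers, repeatedly."""
    previous = None
    while previous != title:  # loop handles stacked suffixes
        previous = title
        title = SUFFIX.sub("", title)
    return title.strip()


print(strip_title_suffix("Shot In The Dark (Karaoke Version)"))
```

Other parenthesized text, such as "(remix)", is left alone because the pattern only matches the listed suffixes at the end of the title.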
✅ Added Data
- mbid: Official MusicBrainz Artist ID
- recording_mbid: Official MusicBrainz Recording ID
✅ Preserves Your Data
- Keeps all your existing fields (guid, path, disabled, favorite, etc.)
- Only adds new fields, never removes existing ones
🆕 Advanced Fuzzy Search
- Intelligent Matching: Finds similar names even with typos or variations
- Similarity Scoring: Shows how well each match scores (0.0 to 1.0)
- Configurable Thresholds: Adjust matching sensitivity
- Multiple Algorithms: Uses ratio, partial ratio, and token sort matching
- Enhanced Search Fields: artist.name, artist_alias.name, artist.sort_name
- Dash Handling: Regular dash (-) vs Unicode dash (‐)
- Substring Protection: Avoids false matches like "Sleazy-E" vs "Eazy-E"
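To make the scoring idea concrete, here is a small sketch that uses Python's standard-library `difflib` as a stand-in for the fuzzywuzzy library listed in the requirements. It covers a plain ratio, a token-sorted ratio, and dash normalization; it does not implement partial ratio or the substring protection described above. The function names and the 0.8 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Map Unicode hyphen (U+2010), non-breaking hyphen (U+2011), and en dash (U+2013) to '-'
DASHES = str.maketrans({"\u2010": "-", "\u2011": "-", "\u2013": "-"})


def normalize(name):
    """Lowercase, unify dash variants, and collapse whitespace."""
    return " ".join(name.translate(DASHES).lower().split())


def similarity(a, b):
    """Best of a plain ratio and a token-sorted ratio, in the range 0.0 to 1.0."""
    a, b = normalize(a), normalize(b)
    plain = SequenceMatcher(None, a, b).ratio()
    token_sorted = SequenceMatcher(
        None, " ".join(sorted(a.split())), " ".join(sorted(b.split()))
    ).ratio()
    return max(plain, token_sorted)


def best_match(query, candidates, threshold=0.8):
    """Return (candidate, score) for the top match at or above threshold, else None."""
    scored = [(c, similarity(query, c)) for c in candidates]
    if not scored:
        return None
    top = max(scored, key=lambda pair: pair[1])
    return top if top[1] >= threshold else None


print(best_match("Tailor Swift", ["Taylor Swift", "Eazy-E"]))
```

Raising the threshold trades recall for precision: fewer candidates pass, but those that do are closer to the query.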
🆕 Edge Case Support
- Hyphenated Artists: "Blink-182", "Ne-Yo", "G-Eazy"
- Exclamation Marks: "P!nk", "Panic! At The Disco", "3OH!3"
- Numbers: "98 Degrees", "S Club 7", "3 Doors Down"
- Special Characters: "a-ha", "The B-52s", "Salt-N-Pepa"
🆕 Simplified Processing
- Default Behavior: Process all songs by default (no special flags needed)
- Separate Output Files: Successful and failed songs saved to different files
- Progress Tracking: Real-time progress with song counter and status
- Smart Defaults: Sensible defaults for all file paths and options
- Detailed Reporting: Comprehensive statistics and processing report
- Batch Processing: Efficient handling of large song collections
📖 Usage Examples
Basic Usage (Default)
# Process all songs with default settings (data/songs.json)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
# Output: data/songs-success.json and data/songs-failure.json
Custom Source File
# Process specific file
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/my_songs.json
# Output: data/my_songs-success.json and data/my_songs-failure.json
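The default output names follow the source file name. A minimal sketch of that naming rule, inferred from the two examples above; `default_output_paths` is a hypothetical helper, not the CLI's actual function:

```python
from pathlib import Path


def default_output_paths(source):
    """Derive '<stem>-success.json' and '<stem>-failure.json' next to the source file."""
    src = Path(source)
    success = src.with_name(f"{src.stem}-success.json")
    failure = src.with_name(f"{src.stem}-failure.json")
    return success, failure


success_path, failure_path = default_output_paths("data/my_songs.json")
print(success_path.as_posix(), failure_path.as_posix())
```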
Custom Output Files
# Specify custom output files
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json --output-success cleaned.json --output-failure failed.json
Limit Processing
# Process only first 1000 songs
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --limit 1000
Force API Mode
# Use HTTP API instead of database (slower but works without PostgreSQL)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --use-api
Test Connections
# Test database connection
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection
# Test with API mode
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection --use-api
Help
# Show usage information
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --help
📁 Data Files
Input Format
Your JSON file should contain an array of song objects:
[
{
"artist": "ACDC",
"title": "Shot In The Dark",
"disabled": false,
"favorite": true,
"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
"path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
},
{
"artist": "Bruno Mars ft. Cardi B",
"title": "Finesse Remix",
"disabled": false,
"favorite": false,
"guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
"path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
}
]
📤 Output Format
The tool creates three output files:
1. Successful Songs (source-success.json)
Array of successfully processed songs with MBIDs added:
[
{
"artist": "AC/DC",
"title": "Shot in the Dark",
"disabled": false,
"favorite": true,
"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
"path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
"mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
"recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
},
{
"artist": "Bruno Mars feat. Cardi B",
"title": "Finesse (remix)",
"disabled": false,
"favorite": false,
"guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
"path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
"mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
"recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
}
]
2. Failed Songs (source-failure.json)
Array of songs that couldn't be processed (same format as source):
[
{
"artist": "Unknown Artist",
"title": "Unknown Song",
"disabled": false,
"favorite": false,
"guid": "12345678-1234-1234-1234-123456789012",
"path": "z://MP4\\Unknown Artist - Unknown Song.mp4"
}
]
3. Processing Report (processing_report_YYYYMMDD_HHMMSS.txt)
Human-readable text report with statistics and failed song list:
MusicBrainz Data Cleaner - Processing Report
==================================================
Source File: data/songs.json
Processing Date: 2024-12-19 14:30:22
Processing Time: 15263.3 seconds
SUMMARY
--------------------
Total Songs Processed: 49,170
Successful Songs: 40,692
Failed Songs: 8,478
Success Rate: 82.8%
DETAILED STATISTICS
--------------------
Artists Found: 44,526/49,170 (90.6%)
Recordings Found: 40,998/49,170 (83.4%)
Processing Speed: 3.2 songs/second
OUTPUT FILES
--------------------
Successful Songs: data/songs-success.json
Failed Songs: data/songs-failure.json
Report File: data/processing_report_20241219_143022.txt
FAILED SONGS (First 50)
--------------------
1. Unknown Artist - Unknown Song
2. Invalid Artist - Invalid Title
3. Test Artist - Test Song
...
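The summary figures in the report follow directly from three inputs: total songs, successful songs, and elapsed time. A short sketch of that arithmetic, using the numbers from the sample report (`summarize` is a hypothetical helper):

```python
def summarize(total, successful, elapsed_seconds):
    """Return failed count, success rate (%), and processing speed (songs/second)."""
    failed = total - successful
    success_rate = 100.0 * successful / total if total else 0.0
    speed = total / elapsed_seconds if elapsed_seconds else 0.0
    return {
        "failed": failed,
        "success_rate": round(success_rate, 1),
        "songs_per_second": round(speed, 1),
    }


# Figures from the sample report above
print(summarize(49170, 40692, 15263.3))
# {'failed': 8478, 'success_rate': 82.8, 'songs_per_second': 3.2}
```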
🎬 Example Run
Basic Processing
$ docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
🚀 Starting song processing...
📊 Total songs to process: 49,170
Using database connection
==================================================
[1 of 49,170] ✅ PASS: ACDC - Shot In The Dark
[2 of 49,170] ❌ FAIL: Unknown Artist - Unknown Song
[3 of 49,170] ✅ PASS: Bruno Mars feat. Cardi B - Finesse (remix)
[4 of 49,170] ✅ PASS: Taylor Swift - Love Story
...
📈 Progress: 100/49,170 (0.2%) - Success: 85.0% - Rate: 3.2 songs/sec
📈 Progress: 200/49,170 (0.4%) - Success: 87.5% - Rate: 3.1 songs/sec
...
==================================================
🎉 Processing completed!
📊 Final Results:
⏱️ Total processing time: 15263.3 seconds
🚀 Average speed: 3.2 songs/second
✅ Artists found: 44,526/49,170 (90.6%)
✅ Recordings found: 40,998/49,170 (83.4%)
❌ Failed songs: 8,478 (17.2%)
📄 Files saved:
✅ Successful songs: data/songs-success.json
❌ Failed songs: data/songs-failure.json
📋 Text report: data/processing_report_20241219_143022.txt
📊 JSON report: data/processing_report_20241219_143022.json
Limited Processing
$ docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --limit 1000
⚠️ Limiting processing to first 1000 songs
🚀 Starting song processing...
📊 Total songs to process: 1,000
Using database connection
==================================================
[1 of 1,000] ✅ PASS: ACDC - Shot In The Dark
[2 of 1,000] ❌ FAIL: Unknown Artist - Unknown Song
...
==================================================
🎉 Processing completed!
📊 Final Results:
⏱️ Total processing time: 312.5 seconds
🚀 Average speed: 3.2 songs/second
✅ Artists found: 856/1,000 (85.6%)
✅ Recordings found: 789/1,000 (78.9%)
❌ Failed songs: 211 (21.1%)
📄 Files saved:
✅ Successful songs: data/songs-success.json
❌ Failed songs: data/songs-failure.json
📋 Text report: data/processing_report_20241219_143022.txt
📊 JSON report: data/processing_report_20241219_143022.json
🔧 Troubleshooting
"Could not find artist"
- The artist might not be in the MusicBrainz database
- Try checking the spelling or using a different variation
- The search index might still be building (wait a few minutes)
- Check fuzzy search similarity score - lower threshold if needed
- NEW: Check for artist aliases (e.g., "98 Degrees" → "98°")
- NEW: Check for sort names (e.g., "Corby, Matt" → "Matt Corby")
"Could not find recording"
- The song might not be in the database
- The title might not match exactly
- Try a simpler title (remove extra words)
- Check fuzzy search similarity score - lower threshold if needed
- NEW: For collaborations, check if it's stored under the main artist
Connection errors
- Database: Make sure PostgreSQL is running and accessible
- API: Make sure your MusicBrainz server is running on http://localhost:8080
- Check that Docker containers are up and running
- Verify the server is accessible in your browser
- NEW: For Docker, use the service name db (or the container IP, e.g. 172.18.0.2) instead of localhost
JSON errors
- Make sure your input file is valid JSON
- Check that it contains an array of objects
- Verify all required fields are present
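These three checks can be run before handing a file to the cleaner. The sketch below assumes "artist" and "title" are the minimum required fields, which the README does not state explicitly; `validate_songs` is a hypothetical helper:

```python
import json

REQUIRED_FIELDS = ("artist", "title")  # assumed minimum; adjust to your data


def validate_songs(text):
    """Parse JSON text and return a list of problems (empty if the input looks usable)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(data, list):
        return ["top-level value must be an array of song objects"]
    problems = []
    for index, song in enumerate(data):
        if not isinstance(song, dict):
            problems.append(f"item {index} is not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in song:
                problems.append(f"item {index} is missing '{field}'")
    return problems


print(validate_songs('[{"artist": "ACDC"}]'))
```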
Performance issues
- Use database mode instead of API mode for better performance
- Ensure database indexes are built for faster queries
- Check fuzzy search thresholds - higher thresholds mean fewer but more accurate matches
Collaboration detection issues
- NEW: Check if it's a band name vs collaboration (e.g., "Simon & Garfunkel" vs "Lavato, Demi & Joe Jonas")
- NEW: Verify the collaboration pattern is supported (ft., feat., featuring, &, and, ,)
- NEW: Check case sensitivity - patterns are case-insensitive
🎯 Use Cases
- Karaoke Systems: Clean up song metadata for better search and organization
- Music Libraries: Standardize artist names and add official IDs
- Music Apps: Ensure consistent data across your application
- Data Migration: Clean up legacy music data when moving to new systems
- Fuzzy Matching: Handle typos and variations in artist/song names
- NEW: Collaboration Handling: Process complex artist collaborations
- NEW: Edge Cases: Handle artists with special characters and unusual names
📚 What are MBIDs?
MBID stands for MusicBrainz Identifier. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.
Benefits:
- Permanent: Never change, even if names change
- Universal: Used across many music applications
- Reliable: Official identifiers from the MusicBrainz database
- Linked Data: Connect to other music databases and services
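MBIDs are 36-character UUIDs, so a value can be sanity-checked with the standard library before it is stored or compared. A small sketch; `is_valid_mbid` is a hypothetical helper, and it only checks the format, not whether the ID exists in MusicBrainz:

```python
import uuid


def is_valid_mbid(value):
    """Return True if value is a canonical, hyphenated UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


print(is_valid_mbid("66c662b6-6e2f-4930-8610-912e24c63ed1"))
```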
🆕 Performance Comparison
| Method | Speed | Rate Limiting | Fuzzy Search | Setup Complexity |
|---|---|---|---|---|
| Database | ⚡ 10x faster | ❌ None | ✅ Yes | 🔧 Medium |
| API | 🐌 Slower | ⏱️ Yes (0.1s delay) | ❌ No | ✅ Easy |
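The two modes in the table combine into the automatic fallback described earlier: try the database, and switch to the rate-limited API only when the connection fails. The sketch below shows the pattern only. `db_lookup` and `api_lookup` are hypothetical callables, and a real driver raises its own exception types (`psycopg2.OperationalError`, for example), which would need to be caught instead of `ConnectionError`.

```python
import time


def lookup_with_fallback(query, db_lookup, api_lookup, api_delay_s=0.1, sleep=time.sleep):
    """Try the database first; on a connection failure, fall back to the API."""
    try:
        return db_lookup(query), "database"
    except ConnectionError:
        sleep(api_delay_s)  # pause before the API call (0.1s delay from the table)
        return api_lookup(query), "api"


def broken_db(query):
    """Simulates a database that refuses connections."""
    raise ConnectionError("connection refused")


print(lookup_with_fallback("ACDC", broken_db, lambda q: "AC/DC"))
```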
🆕 Collaboration Detection Examples
| Input | Type | Detection | Output |
|---|---|---|---|
| Bruno Mars ft. Cardi B | Collaboration | ✅ Primary pattern | Bruno Mars feat. Cardi B |
| Pitbull ft. Ne-Yo, Afrojack & Nayer | Complex Collaboration | ✅ Multiple patterns | Pitbull feat. Ne-Yo, Afrojack & Nayer |
| Simon & Garfunkel | Band Name | ❌ Protected | Simon & Garfunkel |
| Lavato, Demi & Joe Jonas | Collaboration | ✅ Comma detection | Lavato, Demi & Joe Jonas |
| Hall & Oates | Band Name | ❌ Protected | Hall & Oates |
🆕 Edge Case Examples
| Input | Type | Handling | Output |
|---|---|---|---|
| ACDC | Name Variation | ✅ Alias lookup | AC/DC |
| 98 Degrees | Artist Alias | ✅ Alias search | 98° |
| S Club 7 | Numerical Suffix | ✅ Suffix removal | S Club |
| Corby, Matt | Sort Name | ✅ Sort name search | Matt Corby |
| Blink-182 | Dash Variation | ✅ Unicode dash handling | blink‐182 |
| P!nk | Special Characters | ✅ Direct search | P!nk |
| 3OH!3 | Numbers + Special | ✅ Direct search | 3OH!3 |
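For the sort-name row, one cautious approach is to generate the flipped form as an extra search candidate rather than rewriting the name outright, since a comma can also appear in a band name. A sketch under that assumption; `sort_name_candidates` is a hypothetical helper:

```python
def sort_name_candidates(name):
    """Return names to search for: the original, plus 'First Last' if it looks like 'Last, First'."""
    candidates = [name]
    parts = [p.strip() for p in name.split(",")]
    if len(parts) == 2 and all(parts):
        last, first = parts
        candidates.append(f"{first} {last}")
    return candidates


print(sort_name_candidates("Corby, Matt"))
```

Each candidate would then be looked up in turn, and the first one that matches an artist name, alias, or sort name wins.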
🤝 Contributing
Found a bug or have a feature request?
- Check the existing issues
- Create a new issue with details
- Include sample data if possible
📄 License
This tool is provided as-is for educational and personal use.
🔗 Related Links
- MusicBrainz - The open music encyclopedia
- MusicBrainz API - API documentation
- MusicBrainz Docker - Docker setup
- FuzzyWuzzy - Fuzzy string matching library
📝 Lessons Learned
Database Integration
- Direct PostgreSQL access is 10x faster than API calls
- Docker networking: localhost does not reach other containers; use the Docker service name (db) or, failing that, the container IP
- Database name matters: musicbrainz_db, not musicbrainz
- Static caches cause problems: Wrong MBIDs override correct database lookups
Collaboration Handling
- Primary patterns (ft., feat.) are always collaborations
- Secondary patterns (&, and) require intelligence to distinguish from band names
- Comma detection helps identify collaborations
- Artist credit lookup is essential for preserving all collaborators
Edge Cases
- Dash variations (regular vs Unicode) cause exact match failures
- Artist aliases are common and important (98 Degrees → 98°)
- Sort names handle "Last, First" formats
- Numerical suffixes in names need special handling (S Club 7 → S Club)
Performance Optimization
- Remove static caches for better accuracy
- Database-first approach ensures live data
- Fuzzy search thresholds need tuning for different datasets
- Connection pooling would improve performance for large datasets
CLI Design
- Simplified interface with smart defaults reduces complexity
- Array format consistency makes output files easier to work with
- Human-readable reports improve user experience
- Test file organization keeps project structure clean
Happy cleaning! 🎵✨