# 🎵 MusicBrainz Data Cleaner v3.0
A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!**
## ✨ What's New in v3.0
- **🚀 Direct Database Access**: Connect directly to PostgreSQL for 10x faster performance
- **🎯 Advanced Fuzzy Search**: Intelligent matching for similar artist names and song titles
- **🔄 Automatic Fallback**: Falls back to API mode if database access fails
- **⚡ No Rate Limiting**: Database queries don't have API rate limits
- **📊 Similarity Scoring**: Shows a similarity score for each match so you can judge its quality
- **🆕 Collaboration Detection**: Intelligently handle complex collaborations like "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- **🆕 Artist Aliases**: Handle name variations like "98 Degrees" → "98°" and "S Club 7" → "S Club"
- **🆕 Sort Names**: Handle "Last, First" formats like "Corby, Matt" → "Matt Corby"
- **🆕 Edge Case Handling**: Support for artists with hyphens, exclamation marks, numbers, and special characters
- **🆕 Band Name Protection**: Distinguish between band names (Simon & Garfunkel) and collaborations (Lovato, Demi & Joe Jonas)
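The automatic fallback mentioned in the list above could be structured roughly like this. It is a minimal sketch, not the tool's actual code: `lookup_with_fallback`, `broken_db`, and `fake_api` are hypothetical names, and the two lookups are passed in as plain callables so the example runs without a server.

```python
from typing import Callable, Optional

Lookup = Callable[[str], Optional[str]]


def lookup_with_fallback(name: str, db_lookup: Lookup, api_lookup: Lookup) -> Optional[str]:
    """Try the database first; fall back to the API only if the database call raises."""
    try:
        return db_lookup(name)
    except Exception as exc:  # assumption: a real tool would catch its driver's error type
        print(f"Database lookup failed ({exc}); falling back to API")
        return api_lookup(name)


def broken_db(name: str) -> Optional[str]:
    # Stand-in for a database that is unreachable
    raise ConnectionError("database unreachable")


def fake_api(name: str) -> Optional[str]:
    # Stand-in for an HTTP lookup; returns the AC/DC MBID used elsewhere in this README
    return {"AC/DC": "66c662b6-6e2f-4930-8610-912e24c63ed1"}.get(name)


print(lookup_with_fallback("AC/DC", broken_db, fake_api))
```

A lookup that returns `None` (artist not found) does not trigger the fallback here; only an exception does.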
## ✨ What It Does
**Before:**
```json
{
"artist": "ACDC",
"title": "Shot In The Dark",
"favorite": true
}
```
**After:**
```json
{
"artist": "AC/DC",
"title": "Shot in the Dark",
"favorite": true,
"mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
"recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}
```
## 🚀 Quick Start
### 1. Install Dependencies
```bash
pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein
```
### 2. Set Up MusicBrainz Server
#### Option A: Docker (Recommended)
```bash
# Clone MusicBrainz Docker repository
git clone https://github.com/metabrainz/musicbrainz-docker.git
cd musicbrainz-docker
# Update postgres.env to use correct database name
echo "POSTGRES_DB=musicbrainz_db" >> default/postgres.env
# Start the server
docker-compose up -d
# Wait for database to be ready (can take 10-15 minutes)
docker-compose logs -f musicbrainz
```
#### Option B: Manual Setup
1. Install PostgreSQL 12+
2. Create database: `createdb musicbrainz_db`
3. Import MusicBrainz data dump
4. Start MusicBrainz server on port 8080
### 3. Test Connection
```bash
python musicbrainz_cleaner.py --test-connection
```
### 4. Run the Cleaner
```bash
# Use database access (recommended, faster)
python musicbrainz_cleaner.py your_songs.json
# Force API mode (slower, fallback)
python musicbrainz_cleaner.py your_songs.json --use-api
```
That's it! Your cleaned data is saved to `your_songs_cleaned.json`.
## 📋 Requirements
- **Python 3.6+**
- **MusicBrainz Server** running on localhost:8080
- **PostgreSQL Database** accessible on localhost:5432
- **Dependencies**: `requests`, `psycopg2-binary`, `fuzzywuzzy`, `python-Levenshtein`
## 🔧 Server Configuration
### Database Access
- **Host**: localhost (or Docker container IP: 172.18.0.2)
- **Port**: 5432 (PostgreSQL default)
- **Database**: musicbrainz_db (actual database name)
- **User**: musicbrainz
- **Password**: musicbrainz (default, should be changed in production)
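One way to keep these settings out of the code is to read them from the environment and fall back to the defaults listed above. This is only a sketch: the `MB_DB_*` variable names and the `db_params` helper are assumptions, not options the tool is known to support.

```python
import os
from typing import Dict, Mapping


def db_params(env: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Build connection keyword arguments, using the README defaults when a variable is unset."""
    return {
        "host": env.get("MB_DB_HOST", "localhost"),
        "port": int(env.get("MB_DB_PORT", "5432")),
        "dbname": env.get("MB_DB_NAME", "musicbrainz_db"),
        "user": env.get("MB_DB_USER", "musicbrainz"),
        "password": env.get("MB_DB_PASSWORD", "musicbrainz"),
    }


params = db_params({})  # empty mapping, so every default applies
print({key: value for key, value in params.items() if key != "password"})
# With psycopg2 installed, these could be passed as psycopg2.connect(**params)
```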
### HTTP API (Fallback)
- **URL**: http://localhost:8080
- **Endpoint**: /ws/2/
- **Format**: JSON
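For reference, a search request against the fallback API could be built as follows. The sketch assumes the local server exposes the standard MusicBrainz `/ws/2/` search parameters (`query`, `fmt`, `limit`); `artist_search_url` is a hypothetical helper, and the example only builds the URL without sending it.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080"


def artist_search_url(artist: str, limit: int = 5) -> str:
    """Build a /ws/2/ artist search URL that asks for a JSON response."""
    query = urlencode({"query": f'artist:"{artist}"', "fmt": "json", "limit": limit})
    return f"{BASE_URL}/ws/2/artist?{query}"


print(artist_search_url("AC/DC"))
```

The resulting URL can be fetched with `requests.get(url)`, which is already among the dependencies.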
### Troubleshooting
- **Database Connection Failed**: Check PostgreSQL is running and credentials are correct
- **API Connection Failed**: Check MusicBrainz server is running on port 8080
- **Slow Performance**: Ensure database indexes are built
- **No Results**: Verify data has been imported to the database
- **NEW**: **Docker Networking**: Use container IP (172.18.0.2) for Docker-to-Docker connections
- **NEW**: **Database Name**: Ensure using `musicbrainz_db` not `musicbrainz`
## 🧪 Testing
Run the test suite to verify everything works correctly:
```bash
# Run all tests
python3 src/tests/run_tests.py
# Run specific test categories
python3 src/tests/run_tests.py --unit # Unit tests only
python3 src/tests/run_tests.py --integration # Integration tests only
# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli
# List all available tests
python3 src/tests/run_tests.py --list
```
### Test Categories
- **Unit Tests**: Test individual components in isolation
- **Integration Tests**: Test interactions between components and database
- **Debug Tests**: Debug scripts and troubleshooting tools
## 📁 Data Files
The tool uses external JSON files for name variations:
- **`data/known_artists.json`**: Contains name variations (ACDC → AC/DC, ft. → feat.)
- **`data/known_recordings.json`**: Contains known recording MBIDs for common songs
These files can be easily updated without touching the code, making it simple to add new name variations.
## 🎯 Features
### ✅ Artist Name Fixes
- `ACDC` → `AC/DC`
- `Bruno Mars ft. Cardi B` → `Bruno Mars feat. Cardi B`
- `featuring` → `feat.`
- `98 Degrees` → `98°` (artist aliases)
- `S Club 7` → `S Club` (numerical suffixes)
- `Corby, Matt` → `Matt Corby` (sort names)
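The first three fixes are plain string normalization and can be illustrated without a database. The sketch below is an assumption about how such a step might look: `KNOWN_VARIATIONS` is a hypothetical in-memory stand-in (the layout of `data/known_artists.json` is not shown in this README), and `normalize_artist` is not the tool's own function.

```python
import re

# Hypothetical stand-in for the variations stored in data/known_artists.json
KNOWN_VARIATIONS = {"acdc": "AC/DC"}

# "ft", "ft.", "feat", "feat." or "featuring" as a whole word, any case
FEAT_PATTERN = re.compile(r"\b(?:featuring|feat|ft)\b\.?", re.IGNORECASE)


def normalize_artist(artist: str) -> str:
    """Apply a known-variation lookup, then rewrite featuring markers to 'feat.'."""
    artist = KNOWN_VARIATIONS.get(artist.strip().lower(), artist.strip())
    return FEAT_PATTERN.sub("feat.", artist)


for raw in ("ACDC", "Bruno Mars ft. Cardi B", "Bruno Mars Featuring Cardi B"):
    print(raw, "->", normalize_artist(raw))
```

Aliases and sort names (`98 Degrees`, `Corby, Matt`) need a database lookup and are not covered by this string-only step.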
### ✅ Collaboration Detection
- **Primary Patterns**: "ft.", "feat.", "featuring" (always collaborations)
- **Secondary Patterns**: "&", "and", "," (intelligent detection)
- **Band Name Protection**: 200+ known band names from `data/known_artists.json`
- **Complex Collaborations**: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- **Case Insensitive**: "Featuring" → "featuring"
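The interplay of primary patterns, secondary patterns, and band name protection can be sketched as follows. This is a simplified illustration under stated assumptions: `PROTECTED_BANDS` is a two-entry stand-in for the list in `data/known_artists.json`, and `split_artists` is a hypothetical helper, not the tool's implementation.

```python
import re
from typing import List, Tuple

# Hypothetical stand-in for the protected band names in data/known_artists.json
PROTECTED_BANDS = {"simon & garfunkel", "hall & oates"}

PRIMARY = re.compile(r"\s+(?:featuring|feat\.?|ft\.?)\s+", re.IGNORECASE)
SECONDARY = re.compile(r"\s*,\s*|\s+&\s+|\s+and\s+", re.IGNORECASE)


def split_artists(credit: str) -> Tuple[str, List[str]]:
    """Return (main artist, collaborators) for an artist credit string."""
    credit = credit.strip()
    if credit.lower() in PROTECTED_BANDS:
        return credit, []  # band name: never split
    parts = PRIMARY.split(credit, maxsplit=1)
    if len(parts) == 2:  # primary pattern: always a collaboration
        main, rest = parts
        return main, [p for p in SECONDARY.split(rest) if p]
    parts = [p for p in SECONDARY.split(credit) if p]
    return (parts[0], parts[1:]) if parts else ("", [])


print(split_artists("Pitbull ft. Ne-Yo, Afrojack & Nayer"))
print(split_artists("Simon & Garfunkel"))
```

In this sketch any unprotected name containing `&`, `and`, or a comma is split, so a band such as "Earth, Wind & Fire" would have to be in the protected set to survive intact.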
### ✅ Song Title Fixes
- `Shot In The Dark` → `Shot in the Dark`
- Removes `(Karaoke Version)`, `(Instrumental)` suffixes
- Normalizes capitalization and formatting
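Suffix removal is a small regular-expression job. The sketch below covers only the two suffixes named above; `strip_title_suffix` is a hypothetical helper and the real tool may recognize more variants.

```python
import re

# Trailing "(Karaoke Version)" or "(Instrumental)", any case, with optional surrounding spaces
SUFFIX_PATTERN = re.compile(r"\s*\((?:karaoke version|instrumental)\)\s*$", re.IGNORECASE)


def strip_title_suffix(title: str) -> str:
    """Remove known trailing suffixes, repeating in case several are stacked."""
    previous = None
    while previous != title:
        previous = title
        title = SUFFIX_PATTERN.sub("", title)
    return title.strip()


print(strip_title_suffix("Shot In The Dark (Karaoke Version)"))
```

Parenthesized text that is part of the real title, such as `Finesse (remix)`, is left untouched because it is not in the suffix list.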
### ✅ Added Data
- **`mbid`**: Official MusicBrainz Artist ID
- **`recording_mbid`**: Official MusicBrainz Recording ID
### ✅ Preserves Your Data
- Keeps all your existing fields (guid, path, disabled, favorite, etc.)
- Only adds new fields, never removes existing ones
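In dictionary terms, preserving data means copying the song object and layering the corrections on top. A minimal sketch, with `apply_cleaning` as a hypothetical helper name:

```python
from typing import Any, Dict


def apply_cleaning(song: Dict[str, Any], updates: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the song with updates applied; no existing key is dropped."""
    cleaned = dict(song)     # shallow copy keeps guid, path, favorite, ...
    cleaned.update(updates)  # corrects artist/title and adds the MBID fields
    return cleaned


song = {"artist": "ACDC", "title": "Shot In The Dark", "favorite": True}
updates = {
    "artist": "AC/DC",
    "title": "Shot in the Dark",
    "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
    "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db",
}
print(apply_cleaning(song, updates))
```

The values of `artist` and `title` are overwritten with the corrected forms, while every other original key keeps its value.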
### 🆕 Advanced Fuzzy Search
- **Intelligent Matching**: Finds similar names even with typos or variations
- **Similarity Scoring**: Shows how well each match scores (0.0 to 1.0)
- **Configurable Thresholds**: Adjust matching sensitivity
- **Multiple Algorithms**: Uses ratio, partial ratio, and token sort matching
- **Enhanced Search Fields**: artist.name, artist_alias.name, artist.sort_name
- **Dash Handling**: ASCII hyphen (-) vs Unicode hyphen (‐, U+2010)
- **Substring Protection**: Avoids false matches like "Sleazy-E" vs "Eazy-E"
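To show how a 0.0 to 1.0 score and a threshold work together, here is a sketch that swaps in the standard library's `difflib.SequenceMatcher` for `fuzzywuzzy`, so it runs without extra packages. The scores it produces will differ from the tool's, and `similarity` and `best_match` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query: str, candidates: Iterable[str],
               threshold: float = 0.8) -> Optional[Tuple[str, float]]:
    """Return the highest-scoring candidate at or above the threshold, else None."""
    scored = [(name, similarity(query, name)) for name in candidates]
    best = max(scored, key=lambda pair: pair[1], default=None)
    if best is None or best[1] < threshold:
        return None
    return best


print(best_match("ACDC", ["AC/DC", "ABBA"]))
print(round(similarity("Sleazy-E", "Eazy-E"), 2))
```

With this measure "Sleazy-E" and "Eazy-E" score about 0.86, above the 0.8 threshold, which illustrates why a plain ratio is not enough on its own and some form of substring protection is needed.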
### 🆕 Edge Case Support
- **Hyphenated Artists**: "Blink-182", "Ne-Yo", "G-Eazy"
- **Exclamation Marks**: "P!nk", "Panic! At The Disco", "3OH!3"
- **Numbers**: "98 Degrees", "S Club 7", "3 Doors Down"
- **Special Characters**: "a-ha", "The B-52s", "Salt-N-Pepa"
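Hyphenated names are the ones most affected by dash variants, since an exact comparison fails when one side uses a Unicode hyphen. One possible pre-processing step is to map dash-like characters to the ASCII hyphen before comparing. The character list below is an assumption, not the tool's actual table.

```python
# Unicode dash-like characters mapped to the ASCII hyphen-minus
DASHES = {
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
}
DASH_TABLE = str.maketrans(DASHES)


def normalize_dashes(name: str) -> str:
    """Replace Unicode dash variants with '-' so exact comparisons line up."""
    return name.translate(DASH_TABLE)


print(normalize_dashes("blink\u2010182") == "blink-182")
```

Applying the same normalization to both the input and the database value keeps the comparison symmetric.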
## 📖 Usage Examples
### Basic Usage
```bash
# Clean your songs and save to auto-generated filename
python musicbrainz_cleaner.py my_songs.json
# Output: my_songs_cleaned.json
```
### Custom Output File
```bash
# Specify your own output filename
python musicbrainz_cleaner.py my_songs.json cleaned_songs.json
```
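The naming rule for the output file (an explicit name if given, otherwise `_cleaned` inserted before the extension) can be expressed with `pathlib`. `output_path` is a hypothetical helper written for illustration.

```python
from pathlib import Path
from typing import Optional


def output_path(input_file: str, output_file: Optional[str] = None) -> Path:
    """Use the explicit output name if given, else derive '<stem>_cleaned<suffix>'."""
    if output_file:
        return Path(output_file)
    source = Path(input_file)
    return source.with_name(f"{source.stem}_cleaned{source.suffix}")


print(output_path("my_songs.json"))
print(output_path("my_songs.json", "cleaned_songs.json"))
```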
### Force API Mode
```bash
# Use HTTP API instead of database (slower but works without PostgreSQL)
python musicbrainz_cleaner.py my_songs.json --use-api
```
### Test Connections
```bash
# Test database connection
python musicbrainz_cleaner.py --test-connection
# Test with API mode
python musicbrainz_cleaner.py --test-connection --use-api
```
### Help
```bash
# Show usage information
python musicbrainz_cleaner.py --help
```
## 📥 Input Format
Your JSON file should contain an array of song objects:
```json
[
{
"artist": "ACDC",
"title": "Shot In The Dark",
"disabled": false,
"favorite": true,
"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
"path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
},
{
"artist": "Bruno Mars ft. Cardi B",
"title": "Finesse Remix",
"disabled": false,
"favorite": false,
"guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
"path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
}
]
```
## 📤 Output Format
The tool will update your objects with corrected data:
```json
[
{
"artist": "AC/DC",
"title": "Shot in the Dark",
"disabled": false,
"favorite": true,
"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
"path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
"mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
"recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
},
{
"artist": "Bruno Mars feat. Cardi B",
"title": "Finesse (remix)",
"disabled": false,
"favorite": false,
"guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
"path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
"mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
"recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
}
]
```
## 🎬 Example Run
```bash
$ python musicbrainz_cleaner.py data/sample_songs.json
Processing 3 songs...
Using database connection
==================================================
[1/3] Processing: ACDC - Shot In The Dark
🎯 Fuzzy match found: ACDC → AC/DC (score: 0.85)
✅ Found artist: AC/DC (MBID: 66c662b6-6e2f-4930-8610-912e24c63ed1)
🎯 Fuzzy match found: Shot In The Dark → Shot in the Dark (score: 0.92)
✅ Found recording: Shot in the Dark (MBID: cf8b5cd0-d97c-413d-882f-fc422a2e57db)
✅ Updated to: AC/DC - Shot in the Dark
[2/3] Processing: Bruno Mars ft. Cardi B - Finesse Remix
🎯 Fuzzy match found: Bruno Mars → Bruno Mars (score: 1.00)
✅ Found artist: Bruno Mars (MBID: afb680f2-b6eb-4cd7-a70b-a63b25c763d5)
🎯 Fuzzy match found: Finesse Remix → Finesse (remix) (score: 0.88)
✅ Found recording: Finesse (remix) (MBID: 8ed14014-547a-4128-ab81-c2dca7ae198e)
✅ Updated to: Bruno Mars feat. Cardi B - Finesse (remix)
[3/3] Processing: Taylor Swift - Love Story
🎯 Fuzzy match found: Taylor Swift → Taylor Swift (score: 1.00)
✅ Found artist: Taylor Swift (MBID: 20244d07-534f-4eff-b4d4-930878889970)
🎯 Fuzzy match found: Love Story → Love Story (score: 1.00)
✅ Found recording: Love Story (MBID: d783e6c5-761f-4fc3-bfcf-6089cdfc8f96)
✅ Updated to: Taylor Swift - Love Story
==================================================
✅ Processing complete!
📁 Output saved to: data/sample_songs_cleaned.json
```
## 🔧 Troubleshooting
### "Could not find artist"
- The artist might not be in the MusicBrainz database
- Try checking the spelling or using a different variation
- The search index might still be building (wait a few minutes)
- Check the fuzzy search similarity score and lower the threshold if needed
- **NEW**: Check for artist aliases (e.g., "98 Degrees" → "98°")
- **NEW**: Check for sort names (e.g., "Corby, Matt" → "Matt Corby")
### "Could not find recording"
- The song might not be in the database
- The title might not match exactly
- Try a simpler title (remove extra words)
- Check the fuzzy search similarity score and lower the threshold if needed
- **NEW**: For collaborations, check if it's stored under the main artist
### Connection errors
- **Database**: Make sure PostgreSQL is running and accessible
- **API**: Make sure your MusicBrainz server is running on `http://localhost:8080`
- Check that Docker containers are up and running
- Verify the server is accessible in your browser
- **NEW**: When the cleaner itself runs in a container, use the database container's IP (172.18.0.2) instead of localhost
### JSON errors
- Make sure your input file is valid JSON
- Check that it contains an array of objects
- Verify all required fields are present
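These three checks can be run on a file before handing it to the cleaner. The sketch assumes that `artist` and `title` are the required fields, which matches the examples in this README but is not confirmed by the tool; `load_songs` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List

REQUIRED_FIELDS = ("artist", "title")  # assumption: the two fields the cleaner reads


def load_songs(text: str) -> List[Dict[str, Any]]:
    """Parse JSON text and verify it is an array of objects with the required fields."""
    data = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of song objects")
    for index, song in enumerate(data):
        if not isinstance(song, dict):
            raise ValueError(f"item {index} is not an object")
        missing = [field for field in REQUIRED_FIELDS if field not in song]
        if missing:
            raise ValueError(f"item {index} is missing: {', '.join(missing)}")
    return data


songs = load_songs('[{"artist": "ACDC", "title": "Shot In The Dark"}]')
print(len(songs), "song(s) look valid")
```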
### Performance issues
- Use database mode instead of API mode for better performance
- Ensure database indexes are built for faster queries
- Check fuzzy search thresholds - higher thresholds mean fewer but more accurate matches
### Collaboration detection issues
- **NEW**: Check if it's a band name vs collaboration (e.g., "Simon & Garfunkel" vs "Lovato, Demi & Joe Jonas")
- **NEW**: Verify the collaboration pattern is supported (ft., feat., featuring, &, and, ,)
- **NEW**: Letter case is not the cause: patterns are matched case-insensitively ("Featuring" and "featuring" behave the same)
## 🎯 Use Cases
- **Karaoke Systems**: Clean up song metadata for better search and organization
- **Music Libraries**: Standardize artist names and add official IDs
- **Music Apps**: Ensure consistent data across your application
- **Data Migration**: Clean up legacy music data when moving to new systems
- **Fuzzy Matching**: Handle typos and variations in artist/song names
- **NEW**: **Collaboration Handling**: Process complex artist collaborations
- **NEW**: **Edge Cases**: Handle artists with special characters and unusual names
## 📚 What are MBIDs?
**MBID** stands for **MusicBrainz Identifier**. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.
**Benefits:**
- **Permanent**: Never change, even if names change
- **Universal**: Used across many music applications
- **Reliable**: Official identifiers from the MusicBrainz database
- **Linked Data**: Connect to other music databases and services
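MBIDs are UUIDs in the usual 36-character hyphenated form, so their shape can be checked with Python's `uuid` module. The helper below is a hypothetical convenience; it verifies only the format, not that the ID exists in MusicBrainz.

```python
import uuid


def is_valid_mbid(value: str) -> bool:
    """True if value is a canonical hyphenated UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


print(is_valid_mbid("66c662b6-6e2f-4930-8610-912e24c63ed1"))
print(is_valid_mbid("not-an-mbid"))
```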
## 🆕 Performance Comparison
| Method | Speed | Rate Limiting | Fuzzy Search | Setup Complexity |
|--------|-------|---------------|--------------|------------------|
| **Database** | ⚡ 10x faster | ❌ None | ✅ Yes | 🔧 Medium |
| **API** | 🐌 Slower | ⏱️ Yes (0.1s delay) | ❌ No | ✅ Easy |
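The 0.1s delay in API mode amounts to enforcing a minimum gap between requests. A sketch of such a throttle follows; `Throttle` is a hypothetical class, and the injectable `clock` and `sleep` arguments exist only to make it easy to test.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Ensure at least `delay` seconds pass between consecutive calls to wait()."""

    def __init__(self, delay: float = 0.1,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the time slept."""
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.delay - (now - self._last))
        if pause:
            self._sleep(pause)
        self._last = self._clock()
        return pause


throttle = Throttle(delay=0.1)
for _ in range(3):
    throttle.wait()  # call this before each API request
print("3 requests paced")
```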
## 🆕 Collaboration Detection Examples
| Input | Type | Detection | Output |
|-------|------|-----------|---------|
| `Bruno Mars ft. Cardi B` | Collaboration | ✅ Primary pattern | `Bruno Mars feat. Cardi B` |
| `Pitbull ft. Ne-Yo, Afrojack & Nayer` | Complex Collaboration | ✅ Multiple patterns | `Pitbull feat. Ne-Yo, Afrojack & Nayer` |
| `Simon & Garfunkel` | Band Name | ❌ Protected | `Simon & Garfunkel` |
| `Lovato, Demi & Joe Jonas` | Collaboration | ✅ Comma detection | `Lovato, Demi & Joe Jonas` |
| `Hall & Oates` | Band Name | ❌ Protected | `Hall & Oates` |
## 🆕 Edge Case Examples
| Input | Type | Handling | Output |
|-------|------|----------|---------|
| `ACDC` | Name Variation | ✅ Alias lookup | `AC/DC` |
| `98 Degrees` | Artist Alias | ✅ Alias search | `98°` |
| `S Club 7` | Numerical Suffix | ✅ Suffix removal | `S Club` |
| `Corby, Matt` | Sort Name | ✅ Sort name search | `Matt Corby` |
| `Blink-182` | Dash Variation | ✅ Unicode dash handling | `blink‐182` |
| `P!nk` | Special Characters | ✅ Direct search | `P!nk` |
| `3OH!3` | Numbers + Special | ✅ Direct search | `3OH!3` |
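The sort name row relies on a lookup against `artist.sort_name` in the database. As a rough string-only approximation, a "Last, First" name can also be flipped directly; `flip_sort_name` is a hypothetical helper and deliberately leaves anything with more than one comma or an ampersand alone.

```python
def flip_sort_name(name: str) -> str:
    """Turn 'Last, First' into 'First Last'; leave other shapes unchanged."""
    parts = [part.strip() for part in name.split(",")]
    if len(parts) == 2 and all(parts) and "&" not in name:
        last, first = parts
        return f"{first} {last}"
    return name


print(flip_sort_name("Corby, Matt"))
print(flip_sort_name("P!nk"))
```

A two-part credit that is really a pair of artists would be flipped as well, which is why a database lookup is the safer route.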
## 🤝 Contributing
Found a bug or have a feature request?
1. Check the existing issues
2. Create a new issue with details
3. Include sample data if possible
## 📄 License
This tool is provided as-is for educational and personal use.
## 🔗 Related Links
- [MusicBrainz](https://musicbrainz.org/) - The open music encyclopedia
- [MusicBrainz API](https://musicbrainz.org/doc/Development) - API documentation
- [MusicBrainz Docker](https://github.com/metabrainz/musicbrainz-docker) - Docker setup
- [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy string matching library
## 📝 Lessons Learned
### Database Integration
- **Direct PostgreSQL access is 10x faster** than API calls
- **Docker networking**: inside another container, `localhost` points at that container itself, so use the database container's IP (172.18.0.2 in this setup)
- **Database name matters**: `musicbrainz_db` not `musicbrainz`
- **Static caches cause problems**: Wrong MBIDs override correct database lookups
### Collaboration Handling
- **Primary patterns** (ft., feat.) are always collaborations
- **Secondary patterns** (&, and) require intelligence to distinguish from band names
- **Comma detection** helps identify collaborations
- **Artist credit lookup** is essential for preserving all collaborators
### Edge Cases
- **Dash variations** (regular vs Unicode) cause exact match failures
- **Artist aliases** are common and important (98 Degrees → 98°)
- **Sort names** handle "Last, First" formats
- **Numerical suffixes** in names need special handling (S Club 7 → S Club)
### Performance Optimization
- **Remove static caches** for better accuracy
- **Database-first approach** ensures live data
- **Fuzzy search thresholds** need tuning for different datasets
- **Connection pooling** would improve performance for large datasets
---
**Happy cleaning! 🎵✨**