# 🎵 MusicBrainz Data Cleaner v2.0

A command-line tool that cleans and normalizes your song data against the MusicBrainz database. **Now with direct database access and fuzzy search for better performance and accuracy!**

## ✨ What's New in v2.0

- **🚀 Direct Database Access**: Connects directly to PostgreSQL for 10x faster performance than API mode
- **🎯 Fuzzy Search**: Intelligent matching for similar artist names and song titles
- **🔄 Automatic Fallback**: Falls back to API mode if database access fails
- **⚡ No Rate Limiting**: Database queries are not subject to API rate limits
- **📊 Similarity Scoring**: Shows how closely each match scored

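The automatic fallback can be pictured as a try-the-database-first wrapper. The sketch below is only an illustration: `lookup_with_fallback`, `db_lookup`, and `api_lookup` are hypothetical names, not functions from this tool, and a real implementation would more likely catch `psycopg2.OperationalError` than the built-in `ConnectionError` used here.

```python
from typing import Callable, Optional


def lookup_with_fallback(
    name: str,
    db_lookup: Callable[[str], Optional[str]],
    api_lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try the database first; use the API only if the database call fails."""
    try:
        return db_lookup(name)
    except ConnectionError as exc:
        # Database unreachable: report it and fall back to the slower API path.
        print(f"Database unavailable ({exc}); falling back to API mode")
        return api_lookup(name)
```

Passing the two lookups in as callables keeps the fallback decision in one place and makes it easy to exercise with stubs.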
## ✨ What It Does

**Before:**
```json
{
  "artist": "ACDC",
  "title": "Shot In The Dark",
  "favorite": true
}
```

**After:**
```json
{
  "artist": "AC/DC",
  "title": "Shot in the Dark",
  "favorite": true,
  "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
  "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}
```

## 🚀 Quick Start

### 1. Install Dependencies
```bash
pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein
```

### 2. Set Up MusicBrainz Server

#### Option A: Docker (Recommended)
```bash
# Clone the MusicBrainz Docker repository
git clone https://github.com/metabrainz/musicbrainz-docker.git
cd musicbrainz-docker

# Start the server
docker-compose up -d

# Follow the logs until the database is ready (can take 10-15 minutes)
docker-compose logs -f musicbrainz
```

#### Option B: Manual Setup
1. Install PostgreSQL 12+
2. Create the database: `createdb musicbrainz`
3. Import a MusicBrainz data dump
4. Start the MusicBrainz server on port 5001

### 3. Test Connection
```bash
python musicbrainz_cleaner.py --test-connection
```

### 4. Run the Cleaner
```bash
# Use database access (recommended, faster)
python musicbrainz_cleaner.py your_songs.json

# Force API mode (slower, fallback)
python musicbrainz_cleaner.py your_songs.json --use-api
```

That's it! Your cleaned data will be saved to `your_songs_cleaned.json`.

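The default output name follows a simple pattern: the input file name with `_cleaned` added before the extension. A minimal sketch of that rule, assuming it applies to any input path (`default_output_path` is a hypothetical helper, not necessarily how the tool computes it):

```python
from pathlib import Path


def default_output_path(input_path: str) -> Path:
    """Insert '_cleaned' before the file extension, keeping the directory."""
    p = Path(input_path)
    return p.with_name(f"{p.stem}_cleaned{p.suffix}")
```

For example, `default_output_path("data/sample_songs.json")` gives `data/sample_songs_cleaned.json`.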
## 📋 Requirements

- **Python 3.6+**
- **MusicBrainz Server** running on localhost:5001
- **PostgreSQL Database** accessible on localhost:5432
- **Dependencies**: `requests`, `psycopg2-binary`, `fuzzywuzzy`, `python-Levenshtein`

## 🔧 Server Configuration

### Database Access
- **Host**: localhost
- **Port**: 5432 (PostgreSQL default)
- **Database**: musicbrainz
- **User**: musicbrainz
- **Password**: musicbrainz (default; change it in production)

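Since the default password should not be used in production, one option is to read the connection settings from the environment and fall back to the defaults listed above. This is a sketch only: the `MB_DB_*` variable names and the `db_settings` helper are assumptions for illustration and are not options the tool is known to support.

```python
import os

# Defaults taken from the "Database Access" list above.
DEFAULTS = {
    "host": "localhost",
    "port": "5432",
    "dbname": "musicbrainz",
    "user": "musicbrainz",
    "password": "musicbrainz",
}


def db_settings(env=None) -> dict:
    """Build connection settings, letting MB_DB_* variables override defaults."""
    env = os.environ if env is None else env
    settings = {
        key: env.get(f"MB_DB_{key.upper()}", default)
        for key, default in DEFAULTS.items()
    }
    settings["port"] = int(settings["port"])  # the port is numeric
    return settings
```

The resulting dictionary uses keyword names that `psycopg2.connect` accepts, so it could be passed as `psycopg2.connect(**db_settings())`.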
### HTTP API (Fallback)
- **URL**: http://localhost:5001
- **Endpoint**: /ws/2/
- **Format**: JSON

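To make the API settings concrete, here is a sketch of how an artist search URL against the local server could be assembled. It uses the MusicBrainz web service search syntax (`query` and `fmt=json`); the `artist_search_url` helper and its `limit` default are illustrative assumptions, and the sketch does not escape quotes inside the artist name.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:5001/ws/2"  # URL + endpoint from the list above


def artist_search_url(name: str, limit: int = 5) -> str:
    """Build a JSON artist-search URL for the local MusicBrainz server."""
    params = {"query": f'artist:"{name}"', "fmt": "json", "limit": limit}
    return f"{BASE_URL}/artist/?{urlencode(params)}"
```

The URL can then be fetched with `requests.get(url)`, which is already among the listed dependencies.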
### Troubleshooting
- **Database Connection Failed**: Check that PostgreSQL is running and the credentials are correct
- **API Connection Failed**: Check that the MusicBrainz server is running on port 5001
- **Slow Performance**: Ensure the database indexes are built
- **No Results**: Verify that data has been imported into the database

## 🧪 Testing

Run the test suite to verify everything works correctly:

```bash
# Run all tests
python3 src/tests/run_tests.py

# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli
```

## 📁 Data Files

The tool uses external JSON files for known artist and recording data:

- **`data/known_artists.json`**: Contains known artist MBIDs for common artists
- **`data/known_recordings.json`**: Contains known recording MBIDs for common songs

These files can be updated without touching the code, which makes it simple to add new artists and recordings.

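The schema of these files is not shown here, so the following is only a sketch under a stated assumption: that `known_artists.json` is a flat JSON object mapping an artist name to its MBID. Both helper names are hypothetical.

```python
import json
from typing import Dict, Optional


def load_known_artists(path: str) -> Dict[str, str]:
    """Load a name -> MBID mapping (assumed schema) with lower-cased keys."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return {name.lower(): mbid for name, mbid in data.items()}


def lookup_known_artist(known: Dict[str, str], name: str) -> Optional[str]:
    """Return the MBID for a known artist, ignoring case, or None."""
    return known.get(name.strip().lower())
```

Checking a small local mapping first would avoid a database or API round trip for common artists.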
## 🎯 Features

### ✅ Artist Name Fixes
- `ACDC` → `AC/DC`
- `Bruno Mars ft. Cardi B` → `Bruno Mars feat. Cardi B`
- `featuring` → `feat.`

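The featuring-credit fixes above are rule-based and can be sketched with a regular expression. This is an illustration, not the tool's actual code: `normalize_artist` and the tiny `KNOWN_FIXES` table are hypothetical, and in the tool itself the `ACDC` → `AC/DC` correction may come from the MusicBrainz lookup instead.

```python
import re

# Matches "ft", "ft.", "feat", "feat." or "featuring" as a separate word.
_FEAT = re.compile(r"\s+(?:ft\.?|feat\.?|featuring)\s+", re.IGNORECASE)

# Hypothetical table of hard-coded corrections.
KNOWN_FIXES = {"acdc": "AC/DC"}


def normalize_artist(name: str) -> str:
    """Standardize featuring credits, then apply known spelling fixes."""
    name = _FEAT.sub(" feat. ", name.strip())
    return KNOWN_FIXES.get(name.lower(), name)
```

Requiring whitespace on both sides of the keyword keeps names such as `Daft Punk` from being altered.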
### ✅ Song Title Fixes
- `Shot In The Dark` → `Shot in the Dark`
- Removes `(Karaoke Version)`, `(Instrumental)` suffixes
- Normalizes capitalization and formatting

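Suffix removal is the most mechanical of these fixes. A minimal sketch, assuming only the two suffixes named above need handling (`strip_title_suffix` is a hypothetical helper; capitalization fixes are left to the MusicBrainz match):

```python
import re

_SUFFIX = re.compile(
    r"\s*\((?:karaoke version|instrumental)\)\s*$", re.IGNORECASE
)


def strip_title_suffix(title: str) -> str:
    """Remove trailing '(Karaoke Version)' / '(Instrumental)' markers."""
    previous = None
    while previous != title:  # repeat in case several suffixes are stacked
        previous, title = title, _SUFFIX.sub("", title)
    return title.strip()
```

Because the pattern is anchored to the end of the string, parenthesized text that is part of the real title, such as `Finesse (remix)`, is left alone.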
### ✅ Added Data
- **`mbid`**: Official MusicBrainz Artist ID
- **`recording_mbid`**: Official MusicBrainz Recording ID

### ✅ Preserves Your Data
- Keeps all your existing fields (guid, path, disabled, favorite, etc.)
- Only adds new fields, never removes existing ones

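The "only adds, never removes" behavior amounts to copying the original object and overwriting or adding a handful of keys. A sketch of that idea, with `apply_match` as a hypothetical function name:

```python
def apply_match(song: dict, artist: str, title: str,
                mbid: str, recording_mbid: str) -> dict:
    """Return a copy of the song with corrected names and added MBIDs."""
    cleaned = dict(song)  # copy first so guid, path, favorite, ... survive
    cleaned.update(
        artist=artist,
        title=title,
        mbid=mbid,
        recording_mbid=recording_mbid,
    )
    return cleaned
```

Working on a copy also leaves the input object untouched, which is convenient when comparing before and after.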
### 🆕 Fuzzy Search
- **Intelligent Matching**: Finds similar names even with typos or variations
- **Similarity Scoring**: Shows how well each match scores (0.0 to 1.0)
- **Configurable Thresholds**: Adjust matching sensitivity
- **Multiple Algorithms**: Uses ratio, partial ratio, and token sort matching

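To show how scoring and thresholds interact, the sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for FuzzyWuzzy, so it runs without extra packages. It implements only a plain ratio and a token-sort ratio (not partial ratio), and `best_match` with its `0.8` default threshold is an assumption rather than the tool's real setting.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def ratio(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_sort_ratio(a: str, b: str) -> float:
    """Similarity after sorting the words, so word order is ignored."""
    def sort_tokens(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return SequenceMatcher(None, sort_tokens(a), sort_tokens(b)).ratio()


def best_match(query: str, candidates: Iterable[str],
               threshold: float = 0.8) -> Tuple[Optional[str], float]:
    """Return (name, score) for the best candidate, or (None, score)."""
    scored = [(max(ratio(query, c), token_sort_ratio(query, c)), c)
              for c in candidates]
    if not scored:
        return None, 0.0
    score, name = max(scored)
    return (name, score) if score >= threshold else (None, score)
```

With this sketch, `best_match("ACDC", ["AC/DC", "Accept"])` selects `AC/DC`. Raising `threshold` rejects more borderline candidates, while lowering it accepts looser matches.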
## 📖 Usage Examples

### Basic Usage
```bash
# Clean your songs and save to auto-generated filename
python musicbrainz_cleaner.py my_songs.json
# Output: my_songs_cleaned.json
```

### Custom Output File
```bash
# Specify your own output filename
python musicbrainz_cleaner.py my_songs.json cleaned_songs.json
```

### Force API Mode
```bash
# Use HTTP API instead of database (slower but works without PostgreSQL)
python musicbrainz_cleaner.py my_songs.json --use-api
```

### Test Connections
```bash
# Test database connection
python musicbrainz_cleaner.py --test-connection

# Test with API mode
python musicbrainz_cleaner.py --test-connection --use-api
```

### Help
```bash
# Show usage information
python musicbrainz_cleaner.py --help
```

## 📁 Data Files

### Input Format
Your JSON file should contain an array of song objects:

```json
[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  },
  {
    "artist": "Bruno Mars ft. Cardi B",
    "title": "Finesse Remix",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
  }
]
```

## 📤 Output Format

The tool updates each object with the corrected artist and title and adds the MusicBrainz IDs:

```json
[
  {
    "artist": "AC/DC",
    "title": "Shot in the Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
    "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
  },
  {
    "artist": "Bruno Mars feat. Cardi B",
    "title": "Finesse (remix)",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
    "mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
    "recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
  }
]
```

## 🎬 Example Run

```bash
$ python musicbrainz_cleaner.py data/sample_songs.json

Processing 3 songs...
Using database connection
==================================================

[1/3] Processing: ACDC - Shot In The Dark
🎯 Fuzzy match found: ACDC → AC/DC (score: 0.85)
✅ Found artist: AC/DC (MBID: 66c662b6-6e2f-4930-8610-912e24c63ed1)
🎯 Fuzzy match found: Shot In The Dark → Shot in the Dark (score: 0.92)
✅ Found recording: Shot in the Dark (MBID: cf8b5cd0-d97c-413d-882f-fc422a2e57db)
✅ Updated to: AC/DC - Shot in the Dark

[2/3] Processing: Bruno Mars ft. Cardi B - Finesse Remix
🎯 Fuzzy match found: Bruno Mars → Bruno Mars (score: 1.00)
✅ Found artist: Bruno Mars (MBID: afb680f2-b6eb-4cd7-a70b-a63b25c763d5)
🎯 Fuzzy match found: Finesse Remix → Finesse (remix) (score: 0.88)
✅ Found recording: Finesse (remix) (MBID: 8ed14014-547a-4128-ab81-c2dca7ae198e)
✅ Updated to: Bruno Mars feat. Cardi B - Finesse (remix)

[3/3] Processing: Taylor Swift - Love Story
🎯 Fuzzy match found: Taylor Swift → Taylor Swift (score: 1.00)
✅ Found artist: Taylor Swift (MBID: 20244d07-534f-4eff-b4d4-930878889970)
🎯 Fuzzy match found: Love Story → Love Story (score: 1.00)
✅ Found recording: Love Story (MBID: d783e6c5-761f-4fc3-bfcf-6089cdfc8f96)
✅ Updated to: Taylor Swift - Love Story

==================================================
✅ Processing complete!
📁 Output saved to: data/sample_songs_cleaned.json
```

## 🔧 Troubleshooting

### "Could not find artist"
- The artist might not be in the MusicBrainz database
- Try checking the spelling or using a different variation
- The search index might still be building (wait a few minutes)
- **NEW**: Check the fuzzy search similarity score and lower the threshold if needed

### "Could not find recording"
- The song might not be in the database
- The title might not match exactly
- Try a simpler title (remove extra words)
- **NEW**: Check the fuzzy search similarity score and lower the threshold if needed

### Connection errors
- **Database**: Make sure PostgreSQL is running and accessible
- **API**: Make sure your MusicBrainz server is running on `http://localhost:5001`
- Check that the Docker containers are up and running
- Verify that the server is reachable from your browser

### JSON errors
- Make sure your input file is valid JSON
- Check that it contains an array of objects
- Verify all required fields are present

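The three JSON checks above can be run before handing a file to the cleaner. The sketch below is a standalone pre-flight check, not part of the tool; it assumes `artist` and `title` are the required fields, which is inferred from the examples in this README rather than documented.

```python
import json

REQUIRED_FIELDS = ("artist", "title")  # assumption based on the examples


def load_songs(text: str) -> list:
    """Parse JSON text and verify it is an array of song objects."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc
    if not isinstance(data, list):
        raise ValueError("Top-level JSON value must be an array of song objects")
    for index, song in enumerate(data):
        if not isinstance(song, dict):
            raise ValueError(f"Item {index} is not an object")
        missing = [field for field in REQUIRED_FIELDS if field not in song]
        if missing:
            raise ValueError(f"Item {index} is missing: {', '.join(missing)}")
    return data
```

Each failure raises a `ValueError` that names the offending item, which is usually enough to locate the problem in a large file.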
### Performance issues
- **NEW**: Use database mode instead of API mode for better performance
- **NEW**: Ensure the database indexes are built for faster queries
- **NEW**: Check the fuzzy search thresholds: higher thresholds mean fewer but more accurate matches

## 🎯 Use Cases

- **Karaoke Systems**: Clean up song metadata for better search and organization
- **Music Libraries**: Standardize artist names and add official IDs
- **Music Apps**: Ensure consistent data across your application
- **Data Migration**: Clean up legacy music data when moving to new systems
- **Fuzzy Matching**: Handle typos and variations in artist/song names

## 📚 What are MBIDs?

**MBID** stands for **MusicBrainz Identifier**. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.

**Benefits:**
- **Permanent**: Never change, even if names change
- **Universal**: Used across many music applications
- **Reliable**: Official identifiers from the MusicBrainz database
- **Linked Data**: Connect to other music databases and services

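MBIDs are written as UUIDs, as the 36-character values in the examples above show, so a quick sanity check on stored IDs is possible with the standard library. `is_valid_mbid` below is a hypothetical helper; it only checks the format and cannot tell whether the ID actually exists in MusicBrainz.

```python
import uuid


def is_valid_mbid(value: str) -> bool:
    """Return True if value is a UUID in canonical hyphenated form."""
    try:
        # Compare against the canonical string to reject unhyphenated
        # or brace-wrapped variants that uuid.UUID would otherwise accept.
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False
```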
## 🆕 Performance Comparison

| Method | Speed | Rate Limiting | Fuzzy Search | Setup Complexity |
|--------|-------|---------------|--------------|------------------|
| **Database** | ⚡ 10x faster | ❌ None | ✅ Yes | 🔧 Medium |
| **API** | 🐌 Slower | ⏱️ Yes (0.1s delay) | ❌ No | ✅ Easy |

## 🤝 Contributing

Found a bug or have a feature request?

1. Check the existing issues
2. Create a new issue with details
3. Include sample data if possible

## 📄 License

This tool is provided as-is for educational and personal use.

## 🔗 Related Links

- [MusicBrainz](https://musicbrainz.org/) - The open music encyclopedia
- [MusicBrainz API](https://musicbrainz.org/doc/Development) - API documentation
- [MusicBrainz Docker](https://github.com/metabrainz/musicbrainz-docker) - Docker setup
- [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy string matching library

---

**Happy cleaning! 🎵✨**