🎵 MusicBrainz Data Cleaner v3.0

A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. Now with advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!

What's New in v3.0

  • 🚀 Direct Database Access: Connect directly to PostgreSQL for 10x faster performance
  • 🎯 Advanced Fuzzy Search: Intelligent matching for similar artist names and song titles
  • 🔄 Automatic Fallback: Falls back to API mode if database access fails
  • No Rate Limiting: Database queries don't have API rate limits
  • 📊 Similarity Scoring: See how well matches are scored
  • 🆕 Collaboration Detection: Intelligently handle complex collaborations like "Pitbull ft. Ne-Yo, Afrojack & Nayer"
  • 🆕 Artist Aliases: Handle name variations like "98 Degrees" → "98°" and "S Club 7" → "S Club"
  • 🆕 Sort Names: Handle "Last, First" formats like "Corby, Matt" → "Matt Corby"
  • 🆕 Edge Case Handling: Support for artists with hyphens, exclamation marks, numbers, and special characters
  • 🆕 Band Name Protection: Distinguish between band names (Simon & Garfunkel) and collaborations (Lavato, Demi & Joe Jonas)

What It Does

Before:

{
  "artist": "ACDC",
  "title": "Shot In The Dark",
  "favorite": true
}

After:

{
  "artist": "AC/DC",
  "title": "Shot in the Dark",
  "favorite": true,
  "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
  "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}

🚀 Quick Start

1. Install Dependencies

pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein

2. Set Up MusicBrainz Server

Option A: Docker Setup

# Clone MusicBrainz Docker repository
git clone https://github.com/metabrainz/musicbrainz-docker.git
cd musicbrainz-docker

# Update postgres.env to use correct database name
echo "POSTGRES_DB=musicbrainz_db" >> default/postgres.env

# Start the server
docker-compose up -d

# Wait for database to be ready (can take 10-15 minutes)
docker-compose logs -f musicbrainz

Option B: Manual Setup

  1. Install PostgreSQL 12+
  2. Create database: createdb musicbrainz_db
  3. Import MusicBrainz data dump
  4. Start MusicBrainz server on port 8080

3. Test Connection

python musicbrainz_cleaner.py --test-connection

4. Run the Cleaner

# Use database access (recommended, faster)
python musicbrainz_cleaner.py your_songs.json

# Force API mode (slower, fallback)
python musicbrainz_cleaner.py your_songs.json --use-api

That's it! Your cleaned data will be saved to your_songs_cleaned.json
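
The output name follows the input name with a `_cleaned` suffix. A minimal sketch of that naming rule, assuming the suffix is simply inserted before the file extension (the helper name `default_output_path` is hypothetical, not part of the tool):

```python
from pathlib import Path


def default_output_path(input_path: str) -> Path:
    """Return '<name>_cleaned<ext>' in the same directory as the input file."""
    source = Path(input_path)
    return source.with_name(f"{source.stem}_cleaned{source.suffix}")


print(default_output_path("your_songs.json"))  # your_songs_cleaned.json
```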

📋 Requirements

  • Python 3.6+
  • MusicBrainz Server running on localhost:8080
  • PostgreSQL Database accessible on localhost:5432
  • Dependencies: requests, psycopg2-binary, fuzzywuzzy, python-Levenshtein

🔧 Server Configuration

Database Access

  • Host: localhost (or Docker container IP: 172.18.0.2)
  • Port: 5432 (PostgreSQL default)
  • Database: musicbrainz_db (actual database name)
  • User: musicbrainz
  • Password: musicbrainz (default, should be changed in production)
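
Since the default password should not be used in production, one option is to read the settings from the environment and fall back to the defaults listed above. This is only a sketch: the `MB_DB_*` variable names and the `db_settings` helper are hypothetical, and the tool itself may load its configuration differently.

```python
import os

# Defaults copied from the list above. The MB_DB_* names are made up for
# this illustration and are not read by the tool itself.
DEFAULTS = {
    "host": "localhost",
    "port": "5432",
    "dbname": "musicbrainz_db",
    "user": "musicbrainz",
    "password": "musicbrainz",
}


def db_settings(env=None):
    """Build connection settings, letting environment variables override defaults."""
    env = os.environ if env is None else env
    settings = {
        key: env.get(f"MB_DB_{key.upper()}", default)
        for key, default in DEFAULTS.items()
    }
    settings["port"] = int(settings["port"])
    return settings


# Docker-to-Docker example: override only the host.
print(db_settings({"MB_DB_HOST": "172.18.0.2"})["host"])
```

The resulting keys (`host`, `port`, `dbname`, `user`, `password`) match the keyword arguments accepted by `psycopg2.connect`, so the dictionary could be passed as `psycopg2.connect(**db_settings())`.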

HTTP API (Fallback)

  • URL: http://localhost:8080 (local MusicBrainz server)

Troubleshooting

  • Database Connection Failed: Check PostgreSQL is running and credentials are correct
  • API Connection Failed: Check MusicBrainz server is running on port 8080
  • Slow Performance: Ensure database indexes are built
  • No Results: Verify data has been imported to the database
  • NEW: Docker Networking: Use container IP (172.18.0.2) for Docker-to-Docker connections
  • NEW: Database Name: Make sure you connect to musicbrainz_db, not musicbrainz

🧪 Testing

Run the test suite to verify everything works correctly:

# Run all tests
python3 src/tests/run_tests.py

# Run specific test categories
python3 src/tests/run_tests.py --unit          # Unit tests only
python3 src/tests/run_tests.py --integration   # Integration tests only

# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli

# List all available tests
python3 src/tests/run_tests.py --list

Test Categories

  • Unit Tests: Test individual components in isolation
  • Integration Tests: Test interactions between components and database
  • Debug Tests: Debug scripts and troubleshooting tools

📁 Data Files

The tool uses external JSON files for name variations:

  • data/known_artists.json: Contains name variations (ACDC → AC/DC, ft. → feat.)
  • data/known_recordings.json: Contains known recording MBIDs for common songs

These files can be easily updated without touching the code, making it simple to add new name variations.

🎯 Features

Artist Name Fixes

  • ACDC → AC/DC
  • Bruno Mars ft. Cardi B → Bruno Mars feat. Cardi B
  • featuring → feat.
  • 98 Degrees → 98° (artist aliases)
  • S Club 7 → S Club (numerical suffixes)
  • Corby, Matt → Matt Corby (sort names)
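
The sort-name case can be illustrated with a small string transformation. This is a simplified sketch, not the tool's implementation: it only flips a single "Last, First" pair and skips anything that looks like a collaboration, whereas the tool is described as searching the database's sort names. The function name `flip_sort_name` is hypothetical.

```python
import re

# Names containing these markers are treated as collaborations and left alone.
_COLLAB_HINT = re.compile(r"\b(?:ft|feat|featuring|and)\b|&", re.IGNORECASE)


def flip_sort_name(name: str) -> str:
    """Turn 'Last, First' into 'First Last'; return other names unchanged."""
    if _COLLAB_HINT.search(name):
        return name
    parts = [part.strip() for part in name.split(",")]
    if len(parts) == 2 and all(parts):
        return f"{parts[1]} {parts[0]}"
    return name


print(flip_sort_name("Corby, Matt"))  # Matt Corby
```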

Collaboration Detection

  • Primary Patterns: "ft.", "feat.", "featuring" (always collaborations)
  • Secondary Patterns: "&", "and", "," (intelligent detection)
  • Band Name Protection: "Simon & Garfunkel" (not collaboration)
  • Complex Collaborations: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
  • Case Insensitive: "Featuring" → "featuring"
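
The pattern tiers above can be sketched with two regular expressions. This is an illustration under stated assumptions rather than the tool's actual logic: band-name protection is modelled here with a hypothetical allow-list (`KNOWN_BANDS`), while the real tool may rely on database lookups instead.

```python
import re

# Hypothetical allow-list used only for this example.
KNOWN_BANDS = frozenset({"simon & garfunkel", "hall & oates"})

# Primary patterns always mark a collaboration.
PRIMARY = re.compile(r"\s+(?:ft\.?|feat\.?|featuring)\s+", re.IGNORECASE)
# Secondary patterns separate the individual collaborators.
SECONDARY = re.compile(r"\s*(?:,|&|\band\b)\s*", re.IGNORECASE)


def split_artists(credit, known_bands=frozenset()):
    """Return (main_artist, [featured artists]) for an artist credit string."""
    credit = credit.strip()
    if credit.lower() in known_bands:
        return credit, []  # protected band name, never split
    pieces = PRIMARY.split(credit, maxsplit=1)
    if len(pieces) == 2:
        main, rest = pieces
    elif " & " in credit:
        main, rest = credit.split(" & ", 1)
    else:
        return credit, []
    featured = [name.strip() for name in SECONDARY.split(rest) if name.strip()]
    return main.strip(), featured


print(split_artists("Pitbull ft. Ne-Yo, Afrojack & Nayer"))
print(split_artists("Simon & Garfunkel", KNOWN_BANDS))
```

Without the allow-list, "Simon & Garfunkel" would be split into two artists, which is why the secondary patterns need extra context before they are trusted.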

Song Title Fixes

  • Shot In The Dark → Shot in the Dark
  • Removes (Karaoke Version), (Instrumental) suffixes
  • Normalizes capitalization and formatting
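
Suffix removal can be sketched with a regular expression anchored at the end of the title. The suffix list below contains only the two examples named above; the helper name `strip_title_suffix` is hypothetical, and the tool may recognize more suffixes than this.

```python
import re

# Matches a trailing "(Karaoke Version)" or "(Instrumental)", in any letter case.
_SUFFIX = re.compile(r"\s*\((?:karaoke version|instrumental)\)\s*$", re.IGNORECASE)


def strip_title_suffix(title: str) -> str:
    """Remove trailing karaoke/instrumental markers, including stacked ones."""
    cleaned = title
    while True:
        shorter = _SUFFIX.sub("", cleaned)
        if shorter == cleaned:
            return cleaned.strip()
        cleaned = shorter


print(strip_title_suffix("Shot In The Dark (Karaoke Version)"))  # Shot In The Dark
```

Other parenthesized parts, such as "(remix)", are left in place because they belong to the title.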

Added Data

  • mbid: Official MusicBrainz Artist ID
  • recording_mbid: Official MusicBrainz Recording ID

Preserves Your Data

  • Keeps all your existing fields (guid, path, disabled, favorite, etc.)
  • Only adds new fields, never removes existing ones

Fuzzy Search

  • Intelligent Matching: Finds similar names even with typos or variations
  • Similarity Scoring: Shows how well each match scores (0.0 to 1.0)
  • Configurable Thresholds: Adjust matching sensitivity
  • Multiple Algorithms: Uses ratio, partial ratio, and token sort matching
  • Enhanced Search Fields: artist.name, artist_alias.name, artist.sort_name
  • Dash Handling: Regular dash (-) vs Unicode dash (‐)
  • Substring Protection: Avoids false matches like "Sleazy-E" vs "Eazy-E"
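
To show what the three scoring styles measure, here is a sketch that uses Python's standard `difflib` as a stand-in for fuzzywuzzy, so it runs without extra dependencies. The scores are on a 0.0 to 1.0 scale as described above, and the function bodies are simplified approximations, not fuzzywuzzy's algorithms.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Whole-string similarity, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_sort_ratio(a: str, b: str) -> float:
    """Similarity after sorting words, so word order does not matter."""
    def normalize(text):
        return " ".join(sorted(text.lower().split()))
    return ratio(normalize(a), normalize(b))


def partial_ratio(a: str, b: str) -> float:
    """Best similarity between the shorter string and any same-length window of the longer."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    if not short:
        return 0.0
    windows = range(len(long_) - len(short) + 1)
    return max(
        SequenceMatcher(None, short, long_[i:i + len(short)]).ratio()
        for i in windows
    )


print(round(ratio("ACDC", "AC/DC"), 2))             # 0.89
print(token_sort_ratio("Matt Corby", "Corby Matt"))  # 1.0
print(partial_ratio("Eazy-E", "Sleazy-E"))           # 1.0
```

The last line shows why substring protection matters: a partial match scores "Eazy-E" and "Sleazy-E" as identical, so a partial score alone should not be accepted as a match.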

🆕 Edge Case Support

  • Hyphenated Artists: "Blink-182", "Ne-Yo", "G-Eazy"
  • Exclamation Marks: "P!nk", "Panic! At The Disco", "3OH!3"
  • Numbers: "98 Degrees", "S Club 7", "3 Doors Down"
  • Special Characters: "a-ha", "The B-52s", "Salt-N-Pepa"
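
Hyphenated names are a common source of exact-match failures because the stored name may use a Unicode hyphen while the input uses the ASCII one. A minimal sketch of normalizing both sides before comparing; the character list is an assumption and may not cover every dash the tool handles.

```python
# Map common Unicode dash characters to the ASCII hyphen-minus.
_DASH_TABLE = str.maketrans({
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
})


def normalize_dashes(name: str) -> str:
    """Replace Unicode dash variants with a plain ASCII hyphen."""
    return name.translate(_DASH_TABLE)


def same_artist_name(a: str, b: str) -> bool:
    """Compare two names while ignoring dash variants and letter case."""
    return normalize_dashes(a).casefold() == normalize_dashes(b).casefold()


print(same_artist_name("Blink-182", "blink\u2010182"))  # True
```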

📖 Usage Examples

Basic Usage

# Clean your songs and save to auto-generated filename
python musicbrainz_cleaner.py my_songs.json
# Output: my_songs_cleaned.json

Custom Output File

# Specify your own output filename
python musicbrainz_cleaner.py my_songs.json cleaned_songs.json

Force API Mode

# Use HTTP API instead of database (slower but works without PostgreSQL)
python musicbrainz_cleaner.py my_songs.json --use-api

Test Connections

# Test database connection
python musicbrainz_cleaner.py --test-connection

# Test with API mode
python musicbrainz_cleaner.py --test-connection --use-api

Help

# Show usage information
python musicbrainz_cleaner.py --help

📁 Data Files

Input Format

Your JSON file should contain an array of song objects:

[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  },
  {
    "artist": "Bruno Mars ft. Cardi B",
    "title": "Finesse Remix",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
  }
]

📤 Output Format

The tool will update your objects with corrected data:

[
  {
    "artist": "AC/DC",
    "title": "Shot in the Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
    "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
  },
  {
    "artist": "Bruno Mars feat. Cardi B",
    "title": "Finesse (remix)",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
    "mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
    "recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
  }
]
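
The relationship between input and output objects can be sketched as a copy-and-update step: corrected names and MBIDs are written, and every other field is carried over untouched. The function name `apply_match` is hypothetical and only illustrates the behavior described here.

```python
def apply_match(song, artist, title, mbid, recording_mbid):
    """Return a copy of `song` with corrected names and MBIDs; other fields are kept."""
    updated = dict(song)  # shallow copy so the input object is not modified
    updated.update(
        artist=artist,
        title=title,
        mbid=mbid,
        recording_mbid=recording_mbid,
    )
    return updated


original = {"artist": "ACDC", "title": "Shot In The Dark", "favorite": True}
cleaned = apply_match(
    original,
    artist="AC/DC",
    title="Shot in the Dark",
    mbid="66c662b6-6e2f-4930-8610-912e24c63ed1",
    recording_mbid="cf8b5cd0-d97c-413d-882f-fc422a2e57db",
)
print(cleaned["artist"], cleaned["favorite"])  # AC/DC True
```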

🎬 Example Run

$ python musicbrainz_cleaner.py data/sample_songs.json

Processing 3 songs...
Using database connection
==================================================

[1/3] Processing: ACDC - Shot In The Dark
  🎯 Fuzzy match found: ACDC → AC/DC (score: 0.85)
  ✅ Found artist: AC/DC (MBID: 66c662b6-6e2f-4930-8610-912e24c63ed1)
  🎯 Fuzzy match found: Shot In The Dark → Shot in the Dark (score: 0.92)
  ✅ Found recording: Shot in the Dark (MBID: cf8b5cd0-d97c-413d-882f-fc422a2e57db)
  ✅ Updated to: AC/DC - Shot in the Dark

[2/3] Processing: Bruno Mars ft. Cardi B - Finesse Remix
  🎯 Fuzzy match found: Bruno Mars → Bruno Mars (score: 1.00)
  ✅ Found artist: Bruno Mars (MBID: afb680f2-b6eb-4cd7-a70b-a63b25c763d5)
  🎯 Fuzzy match found: Finesse Remix → Finesse (remix) (score: 0.88)
  ✅ Found recording: Finesse (remix) (MBID: 8ed14014-547a-4128-ab81-c2dca7ae198e)
  ✅ Updated to: Bruno Mars feat. Cardi B - Finesse (remix)

[3/3] Processing: Taylor Swift - Love Story
  🎯 Fuzzy match found: Taylor Swift → Taylor Swift (score: 1.00)
  ✅ Found artist: Taylor Swift (MBID: 20244d07-534f-4eff-b4d4-930878889970)
  🎯 Fuzzy match found: Love Story → Love Story (score: 1.00)
  ✅ Found recording: Love Story (MBID: d783e6c5-761f-4fc3-bfcf-6089cdfc8f96)
  ✅ Updated to: Taylor Swift - Love Story

==================================================
✅ Processing complete!
📁 Output saved to: data/sample_songs_cleaned.json

🔧 Troubleshooting

"Could not find artist"

  • The artist might not be in the MusicBrainz database
  • Try checking the spelling or using a different variation
  • The search index might still be building (wait a few minutes)
  • Check fuzzy search similarity score - lower threshold if needed
  • NEW: Check for artist aliases (e.g., "98 Degrees" → "98°")
  • NEW: Check for sort names (e.g., "Corby, Matt" → "Matt Corby")

"Could not find recording"

  • The song might not be in the database
  • The title might not match exactly
  • Try a simpler title (remove extra words)
  • Check fuzzy search similarity score - lower threshold if needed
  • NEW: For collaborations, check if it's stored under the main artist

Connection errors

  • Database: Make sure PostgreSQL is running and accessible
  • API: Make sure your MusicBrainz server is running on http://localhost:8080
  • Check that Docker containers are up and running
  • Verify the server is accessible in your browser
  • NEW: For Docker, use container IP (172.18.0.2) instead of localhost

JSON errors

  • Make sure your input file is valid JSON
  • Check that it contains an array of objects
  • Verify all required fields are present
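
These checks can be run before handing a file to the cleaner. The sketch below assumes that `artist` and `title` are the required fields, which matches the examples in this README but is not confirmed as the tool's exact rule; `load_songs` is a hypothetical helper.

```python
import json

REQUIRED_FIELDS = ("artist", "title")  # assumption based on the examples


def load_songs(text: str) -> list:
    """Parse JSON text and check that it is an array of song objects."""
    data = json.loads(text)  # raises json.JSONDecodeError (a ValueError) if invalid
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of song objects")
    for index, song in enumerate(data):
        if not isinstance(song, dict):
            raise ValueError(f"item {index} is not an object")
        missing = [field for field in REQUIRED_FIELDS if field not in song]
        if missing:
            raise ValueError(f"item {index} is missing: {', '.join(missing)}")
    return data


songs = load_songs('[{"artist": "ACDC", "title": "Shot In The Dark"}]')
print(len(songs))  # 1
```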

Performance issues

  • Use database mode instead of API mode for better performance
  • Ensure database indexes are built for faster queries
  • Check fuzzy search thresholds - higher thresholds mean fewer but more accurate matches
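
The effect of the threshold can be seen with the standard library's `difflib.get_close_matches`, used here as a stand-in for the tool's fuzzy search. The candidate list and the `best_match` helper are made up for the example.

```python
from difflib import get_close_matches


def best_match(query, candidates, threshold=0.8):
    """Return the closest candidate scoring at least `threshold`, or None."""
    matches = get_close_matches(query, candidates, n=1, cutoff=threshold)
    return matches[0] if matches else None


artists = ["AC/DC", "ABBA", "Accept"]
print(best_match("ACDC", artists, threshold=0.8))   # AC/DC
print(best_match("ACDC", artists, threshold=0.95))  # None
```

At 0.8 the near match is accepted; at 0.95 the same lookup returns nothing, which is the trade-off between coverage and accuracy.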

Collaboration detection issues

  • NEW: Check if it's a band name vs collaboration (e.g., "Simon & Garfunkel" vs "Lavato, Demi & Joe Jonas")
  • NEW: Verify the collaboration pattern is supported (ft., feat., featuring, &, and, ,)
  • NEW: Check case sensitivity - patterns are case-insensitive

🎯 Use Cases

  • Karaoke Systems: Clean up song metadata for better search and organization
  • Music Libraries: Standardize artist names and add official IDs
  • Music Apps: Ensure consistent data across your application
  • Data Migration: Clean up legacy music data when moving to new systems
  • Fuzzy Matching: Handle typos and variations in artist/song names
  • NEW: Collaboration Handling: Process complex artist collaborations
  • NEW: Edge Cases: Handle artists with special characters and unusual names

📚 What are MBIDs?

MBID stands for MusicBrainz Identifier. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.

Benefits:

  • Permanent: Never change, even if names change
  • Universal: Used across many music applications
  • Reliable: Official identifiers from the MusicBrainz database
  • Linked Data: Connect to other music databases and services
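
MBIDs are written as 36-character UUIDs, so their format can be checked with Python's `uuid` module. This sketch only validates the format; it does not confirm that the ID exists in MusicBrainz. The helper name `is_valid_mbid` is hypothetical.

```python
import uuid


def is_valid_mbid(value) -> bool:
    """Return True if `value` is a UUID string in canonical hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


print(is_valid_mbid("66c662b6-6e2f-4930-8610-912e24c63ed1"))  # True
print(is_valid_mbid("not-an-mbid"))                           # False
```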

🆕 Performance Comparison

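| Method   | Speed      | Rate Limiting        | Fuzzy Search | Setup Complexity |
|----------|------------|----------------------|--------------|------------------|
| Database | 10x faster | None                 | Yes          | 🔧 Medium        |
| API      | 🐌 Slower  | ⏱️ Yes (0.1s delay) | No           | Easy             |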
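
The automatic fallback between the two methods can be sketched as a try/except wrapper. The lookup callables here are placeholders and `lookup_with_fallback` is a hypothetical name; a real implementation would also need to catch the database driver's own connection exceptions rather than only `OSError`.

```python
def lookup_with_fallback(query, primary, fallback):
    """Try the primary lookup; on a connection error, use the fallback.

    Returns (result, source), where source names the path that answered.
    """
    try:
        return primary(query), "database"
    except OSError:  # ConnectionError is a subclass of OSError
        return fallback(query), "api"


def broken_database(query):
    # Simulates an unreachable PostgreSQL server.
    raise ConnectionError("database unavailable")


def fake_api(query):
    # Placeholder for an HTTP lookup.
    return query.upper()


print(lookup_with_fallback("acdc", broken_database, fake_api))  # ('ACDC', 'api')
```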

🆕 Collaboration Detection Examples

| Input                               | Type                  | Detection         | Output                                |
|-------------------------------------|-----------------------|-------------------|---------------------------------------|
| Bruno Mars ft. Cardi B              | Collaboration         | Primary pattern   | Bruno Mars feat. Cardi B              |
| Pitbull ft. Ne-Yo, Afrojack & Nayer | Complex Collaboration | Multiple patterns | Pitbull feat. Ne-Yo, Afrojack & Nayer |
| Simon & Garfunkel                   | Band Name             | Protected         | Simon & Garfunkel                     |
| Lavato, Demi & Joe Jonas            | Collaboration         | Comma detection   | Lavato, Demi & Joe Jonas              |
| Hall & Oates                        | Band Name             | Protected         | Hall & Oates                          |

🆕 Edge Case Examples

| Input       | Type               | Handling              | Output     |
|-------------|--------------------|-----------------------|------------|
| ACDC        | Name Variation     | Alias lookup          | AC/DC      |
| 98 Degrees  | Artist Alias       | Alias search          | 98°        |
| S Club 7    | Numerical Suffix   | Suffix removal        | S Club     |
| Corby, Matt | Sort Name          | Sort name search      | Matt Corby |
| Blink-182   | Dash Variation     | Unicode dash handling | blink‐182  |
| P!nk        | Special Characters | Direct search         | P!nk       |
| 3OH!3       | Numbers + Special  | Direct search         | 3OH!3      |

🤝 Contributing

Found a bug or have a feature request?

  1. Check the existing issues
  2. Create a new issue with details
  3. Include sample data if possible

📄 License

This tool is provided as-is for educational and personal use.

📝 Lessons Learned

Database Integration

  • Direct PostgreSQL access is 10x faster than API calls
  • Docker networking requires container IPs, not localhost
  • Database name matters: musicbrainz_db not musicbrainz
  • Static caches cause problems: Wrong MBIDs override correct database lookups

Collaboration Handling

  • Primary patterns (ft., feat.) are always collaborations
  • Secondary patterns (&, and) require intelligence to distinguish from band names
  • Comma detection helps identify collaborations
  • Artist credit lookup is essential for preserving all collaborators

Edge Cases

  • Dash variations (regular vs Unicode) cause exact match failures
  • Artist aliases are common and important (98 Degrees → 98°)
  • Sort names handle "Last, First" formats
  • Numerical suffixes in names need special handling (S Club 7 → S Club)

Performance Optimization

  • Remove static caches for better accuracy
  • Database-first approach ensures live data
  • Fuzzy search thresholds need tuning for different datasets
  • Connection pooling would improve performance for large datasets

Happy cleaning! 🎵