Matt Bruce 20817a3373 Signed-off-by: Matt Bruce <mbrucedogs@gmail.com>

2025-07-31 16:01:35 -05:00

🎵 MusicBrainz Data Cleaner v3.0

A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. Now with advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!

✨ What's New in v3.0

🚀 Direct Database Access: Connect directly to PostgreSQL for 10x faster performance
🎯 Advanced Fuzzy Search: Intelligent matching for similar artist names and song titles
🔄 Automatic Fallback: Falls back to API mode if database access fails
⚡ No Rate Limiting: Database queries don't have API rate limits
📊 Similarity Scoring: See a similarity score for every fuzzy match
🆕 Collaboration Detection: Intelligently handle complex collaborations like "Pitbull ft. Ne-Yo, Afrojack & Nayer"
🆕 Artist Aliases: Handle name variations like "98 Degrees" → "98°" and "S Club 7" → "S Club"
🆕 Sort Names: Handle "Last, First" formats like "Corby, Matt" → "Matt Corby"
🆕 Edge Case Handling: Support for artists with hyphens, exclamation marks, numbers, and special characters
🆕 Band Name Protection: Distinguish between band names (Simon & Garfunkel) and collaborations (Lovato, Demi & Joe Jonas)

✨ What It Does

Before:

{
  "artist": "ACDC",
  "title": "Shot In The Dark",
  "favorite": true
}

After:

{
  "artist": "AC/DC",
  "title": "Shot in the Dark",
  "favorite": true,
  "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
  "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}

🚀 Quick Start

1. Install Dependencies

pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein

2. Set Up MusicBrainz Server

Option A: Docker (Recommended)

# Clone MusicBrainz Docker repository
git clone https://github.com/metabrainz/musicbrainz-docker.git
cd musicbrainz-docker

# Update postgres.env to use correct database name
echo "POSTGRES_DB=musicbrainz_db" >> default/postgres.env

# Start the server
docker-compose up -d

# Wait for database to be ready (can take 10-15 minutes)
docker-compose logs -f musicbrainz

Option B: Manual Setup

Install PostgreSQL 12+
Create database: createdb musicbrainz_db
Import MusicBrainz data dump
Start MusicBrainz server on port 8080

3. Test Connection

python musicbrainz_cleaner.py --test-connection

4. Run the Cleaner

# Use database access (recommended, faster)
python musicbrainz_cleaner.py your_songs.json

# Force API mode (slower, fallback)
python musicbrainz_cleaner.py your_songs.json --use-api

That's it! Your cleaned data will be saved to your_songs_cleaned.json

📋 Requirements

Python 3.6+
MusicBrainz Server running on localhost:8080
PostgreSQL Database accessible on localhost:5432
Dependencies: requests, psycopg2-binary, fuzzywuzzy, python-Levenshtein

🔧 Server Configuration

Database Access

Host: localhost (or Docker container IP: 172.18.0.2)
Port: 5432 (PostgreSQL default)
Database: musicbrainz_db (actual database name)
User: musicbrainz
Password: musicbrainz (default, should be changed in production)
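
As a rough illustration, the settings above map directly onto keyword arguments accepted by psycopg2.connect. The helper names db_settings and connect and the MB_DB_* environment variables are assumptions made for this sketch; they are not part of the tool.

```python
import os

# Defaults mirror the settings listed above; the MB_DB_* variable names
# are hypothetical and only serve this sketch.
DEFAULTS = {
    "host": "localhost",
    "port": "5432",
    "dbname": "musicbrainz_db",
    "user": "musicbrainz",
    "password": "musicbrainz",
}


def db_settings(env=None):
    """Return connection settings, letting environment variables override defaults."""
    env = os.environ if env is None else env
    settings = {
        key: env.get(f"MB_DB_{key.upper()}", value)
        for key, value in DEFAULTS.items()
    }
    settings["port"] = int(settings["port"])
    return settings


def connect(env=None):
    """Open a connection; psycopg2 is imported lazily so db_settings works without it."""
    import psycopg2

    return psycopg2.connect(**db_settings(env))
```

Reading the password from the environment is one simple way to avoid keeping the default in code once it has been changed.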

HTTP API (Fallback)

URL: http://localhost:8080
Endpoint: /ws/2/
Format: JSON
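
For orientation, here is a minimal sketch of how an artist search URL against the /ws/2/ endpoint can be built. The function name artist_search_url and the limit default are assumptions for illustration; the tool's own request code may look different.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080"  # local server from the settings above


def artist_search_url(name, limit=5, base_url=BASE_URL):
    """Build a /ws/2/ artist search URL that asks for a JSON response."""
    # Quote the name so the search treats it as a phrase; escape embedded quotes.
    phrase = name.replace('"', '\\"')
    params = {"query": f'artist:"{phrase}"', "fmt": "json", "limit": limit}
    return f"{base_url}/ws/2/artist?{urlencode(params)}"
```

The resulting URL can be passed to requests.get, which is already among the listed dependencies.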

Troubleshooting

Database Connection Failed: Check PostgreSQL is running and credentials are correct
API Connection Failed: Check MusicBrainz server is running on port 8080
Slow Performance: Ensure database indexes are built
No Results: Verify data has been imported to the database
NEW: Docker Networking: Use the container IP (172.18.0.2 in this setup) for Docker-to-Docker connections
NEW: Database Name: Use musicbrainz_db, not musicbrainz

🧪 Testing

Run the test suite to verify everything works correctly:

# Run all tests
python3 src/tests/run_tests.py

# Run specific test categories
python3 src/tests/run_tests.py --unit          # Unit tests only
python3 src/tests/run_tests.py --integration   # Integration tests only

# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli

# List all available tests
python3 src/tests/run_tests.py --list

Test Categories

Unit Tests: Test individual components in isolation
Integration Tests: Test interactions between components and database
Debug Tests: Debug scripts and troubleshooting tools

📁 Data Files

The tool reads name variations and known recordings from external JSON files:

data/known_artists.json: Contains name variations (ACDC → AC/DC, ft. → feat.)
data/known_recordings.json: Contains known recording MBIDs for common songs

These files can be easily updated without touching the code, making it simple to add new name variations.

🎯 Features

✅ Artist Name Fixes

ACDC → AC/DC
Bruno Mars ft. Cardi B → Bruno Mars feat. Cardi B
featuring → feat.
98 Degrees → 98° (artist aliases)
S Club 7 → S Club (numerical suffixes)
Corby, Matt → Matt Corby (sort names)
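
The tool resolves sort names by searching the database, but the idea behind the "Last, First" fix can be sketched with plain string handling. The function flip_sort_name is hypothetical and deliberately conservative: it leaves anything that looks like a collaboration untouched.

```python
import re

# Any of these markers suggests a collaboration rather than a single sort name.
COLLAB_MARKER = re.compile(r"&|\b(?:ft|feat|featuring|and)\b", re.IGNORECASE)


def flip_sort_name(name):
    """Turn "Last, First" into "First Last"; return everything else unchanged."""
    parts = [part.strip() for part in name.split(",")]
    if len(parts) != 2 or not all(parts) or COLLAB_MARKER.search(name):
        return name
    last, first = parts
    return f"{first} {last}"
```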

✅ Collaboration Detection

Primary Patterns: "ft.", "feat.", "featuring" (always collaborations)
Secondary Patterns: "&", "and", "," (intelligent detection)
Band Name Protection: "Simon & Garfunkel" (not collaboration)
Complex Collaborations: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
Case Insensitive: "Featuring" → "featuring"
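
The rules above can be sketched roughly as follows. This is a simplified illustration, not the tool's implementation: the PROTECTED_BANDS set and the function name split_collaboration are assumptions, and the real detection is described as more intelligent than a fixed list.

```python
import re

# Primary markers always signal a collaboration.
PRIMARY = re.compile(r"\s+(?:ft\.?|feat\.?|featuring)\s+", re.IGNORECASE)
# Secondary markers are ambiguous, so known band names are checked first.
SECONDARY = re.compile(r"\s*,\s*|\s+&\s+|\s+and\s+", re.IGNORECASE)
# Hypothetical protection list; the real tool may decide this differently.
PROTECTED_BANDS = {"simon & garfunkel", "hall & oates"}


def split_collaboration(artist):
    """Return (main_artist, collaborators); collaborators is empty for solo acts and bands."""
    artist = artist.strip()
    if artist.lower() in PROTECTED_BANDS:
        return artist, []
    primary = PRIMARY.split(artist, maxsplit=1)
    if len(primary) == 2:
        main, rest = primary
        return main.strip(), [p.strip() for p in SECONDARY.split(rest) if p.strip()]
    parts = [p.strip() for p in SECONDARY.split(artist) if p.strip()]
    return (parts[0], parts[1:]) if parts else ("", [])
```

Note that a sort name such as "Corby, Matt" also contains a comma, so in this sketch sort names would have to be resolved before the split is attempted.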

✅ Song Title Fixes

Shot In The Dark → Shot in the Dark
Removes (Karaoke Version), (Instrumental) suffixes
Normalizes capitalization and formatting
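
Suffix removal is the easiest of these fixes to picture. The sketch below strips the two tags named above from the end of a title; the function name strip_title_suffixes is an assumption, and capitalization is left alone because the canonical casing comes from the MusicBrainz lookup.

```python
import re

# Trailing parenthesised tags named above; extend the alternation as needed.
SUFFIX = re.compile(r"\s*\((?:karaoke version|instrumental)\)\s*$", re.IGNORECASE)


def strip_title_suffixes(title):
    """Remove trailing (Karaoke Version) / (Instrumental) tags until none remain."""
    previous = None
    while previous != title:
        previous, title = title, SUFFIX.sub("", title)
    return title.strip()
```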

✅ Added Data

mbid: Official MusicBrainz Artist ID
recording_mbid: Official MusicBrainz Recording ID

✅ Preserves Your Data

Keeps all your existing fields (guid, path, disabled, favorite, etc.)
Only adds new fields, never removes existing ones
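
One way to get this behaviour is to copy the original object and only overwrite or add specific keys. The function apply_cleaning below is a hypothetical sketch of that idea, not the tool's code.

```python
def apply_cleaning(song, artist, title, mbid=None, recording_mbid=None):
    """Return a cleaned copy of song: existing keys stay, IDs are added only when found."""
    cleaned = dict(song)  # shallow copy keeps guid, path, favorite, ... untouched
    cleaned["artist"] = artist
    cleaned["title"] = title
    if mbid:
        cleaned["mbid"] = mbid
    if recording_mbid:
        cleaned["recording_mbid"] = recording_mbid
    return cleaned
```

Because the input dictionary is never mutated, the original record stays available for comparison or logging.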

🆕 Advanced Fuzzy Search

Intelligent Matching: Finds similar names even with typos or variations
Similarity Scoring: Reports a similarity score for each match (0.0 to 1.0)
Configurable Thresholds: Adjust matching sensitivity
Multiple Algorithms: Uses ratio, partial ratio, and token sort matching
Enhanced Search Fields: artist.name, artist_alias.name, artist.sort_name
Dash Handling: Matches the regular hyphen (-) and the Unicode hyphen (‐) interchangeably
Substring Protection: Avoids false matches like "Sleazy-E" vs "Eazy-E"
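
To make scoring and thresholds concrete, here is a small sketch that uses difflib from the standard library as a stand-in for fuzzywuzzy, so it runs without extra packages. The function names and the 0.8 default threshold are assumptions, and difflib's ratio will not produce the same numbers as the tool.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query, candidates, threshold=0.8):
    """Return (candidate, score) for the best match at or above threshold, else None."""
    scored = [(name, similarity(query, name)) for name in candidates]
    name, score = max(scored, key=lambda pair: pair[1], default=(None, 0.0))
    return (name, score) if name is not None and score >= threshold else None
```

With this measure, "ACDC" against "AC/DC" and "Sleazy-E" against "Eazy-E" both score above 0.8, which shows why a threshold alone is not enough and a separate substring rule is useful.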

🆕 Edge Case Support

Hyphenated Artists: "Blink-182", "Ne-Yo", "G-Eazy"
Exclamation Marks: "P!nk", "Panic! At The Disco", "3OH!3"
Numbers: "98 Degrees", "S Club 7", "3 Doors Down"
Special Characters: "a-ha", "The B-52s", "Salt-N-Pepa"

📖 Usage Examples

Basic Usage

# Clean your songs and save to auto-generated filename
python musicbrainz_cleaner.py my_songs.json
# Output: my_songs_cleaned.json
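
The auto-generated name follows a simple pattern that can be reproduced with pathlib. The helper default_output_path is a hypothetical name used for this sketch.

```python
from pathlib import Path


def default_output_path(input_path):
    """Derive the output name by inserting "_cleaned" before the file extension."""
    path = Path(input_path)
    return path.with_name(f"{path.stem}_cleaned{path.suffix}")
```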

Custom Output File

# Specify your own output filename
python musicbrainz_cleaner.py my_songs.json cleaned_songs.json

Force API Mode

# Use HTTP API instead of database (slower but works without PostgreSQL)
python musicbrainz_cleaner.py my_songs.json --use-api

Test Connections

# Test database connection
python musicbrainz_cleaner.py --test-connection

# Test with API mode
python musicbrainz_cleaner.py --test-connection --use-api

Help

# Show usage information
python musicbrainz_cleaner.py --help

📁 Data Files

Input Format

Your JSON file should contain an array of song objects:

[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  },
  {
    "artist": "Bruno Mars ft. Cardi B",
    "title": "Finesse Remix",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
  }
]

📤 Output Format

The tool will update your objects with corrected data:

[
  {
    "artist": "AC/DC",
    "title": "Shot in the Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
    "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
  },
  {
    "artist": "Bruno Mars feat. Cardi B",
    "title": "Finesse (remix)",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
    "mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
    "recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
  }
]

🎬 Example Run

$ python musicbrainz_cleaner.py data/sample_songs.json

Processing 3 songs...
Using database connection
==================================================

[1/3] Processing: ACDC - Shot In The Dark
  🎯 Fuzzy match found: ACDC → AC/DC (score: 0.85)
  ✅ Found artist: AC/DC (MBID: 66c662b6-6e2f-4930-8610-912e24c63ed1)
  🎯 Fuzzy match found: Shot In The Dark → Shot in the Dark (score: 0.92)
  ✅ Found recording: Shot in the Dark (MBID: cf8b5cd0-d97c-413d-882f-fc422a2e57db)
  ✅ Updated to: AC/DC - Shot in the Dark

[2/3] Processing: Bruno Mars ft. Cardi B - Finesse Remix
  🎯 Fuzzy match found: Bruno Mars → Bruno Mars (score: 1.00)
  ✅ Found artist: Bruno Mars (MBID: afb680f2-b6eb-4cd7-a70b-a63b25c763d5)
  🎯 Fuzzy match found: Finesse Remix → Finesse (remix) (score: 0.88)
  ✅ Found recording: Finesse (remix) (MBID: 8ed14014-547a-4128-ab81-c2dca7ae198e)
  ✅ Updated to: Bruno Mars feat. Cardi B - Finesse (remix)

[3/3] Processing: Taylor Swift - Love Story
  🎯 Fuzzy match found: Taylor Swift → Taylor Swift (score: 1.00)
  ✅ Found artist: Taylor Swift (MBID: 20244d07-534f-4eff-b4d4-930878889970)
  🎯 Fuzzy match found: Love Story → Love Story (score: 1.00)
  ✅ Found recording: Love Story (MBID: d783e6c5-761f-4fc3-bfcf-6089cdfc8f96)
  ✅ Updated to: Taylor Swift - Love Story

==================================================
✅ Processing complete!
📁 Output saved to: data/sample_songs_cleaned.json

🔧 Troubleshooting

"Could not find artist"

The artist might not be in the MusicBrainz database
Try checking the spelling or using a different variation
The search index might still be building (wait a few minutes)
Check the fuzzy search similarity score and lower the threshold if needed
NEW: Check for artist aliases (e.g., "98 Degrees" → "98°")
NEW: Check for sort names (e.g., "Corby, Matt" → "Matt Corby")

"Could not find recording"

The song might not be in the database
The title might not match exactly
Try a simpler title (remove extra words)
Check the fuzzy search similarity score and lower the threshold if needed
NEW: For collaborations, check if it's stored under the main artist

Connection errors

Database: Make sure PostgreSQL is running and accessible
API: Make sure your MusicBrainz server is running on http://localhost:8080
Check that Docker containers are up and running
Verify the server is accessible in your browser
NEW: For Docker, use the container IP (172.18.0.2 in this setup) instead of localhost
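
Before digging into credentials, it can help to confirm that the ports are reachable at all. The sketch below only checks that a TCP connection can be opened; port_is_open is a hypothetical helper, and it does not replace the tool's own --test-connection option.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_is_open("localhost", 5432) checks PostgreSQL and port_is_open("localhost", 8080) checks the MusicBrainz server.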

JSON errors

Make sure your input file is valid JSON
Check that it contains an array of objects
Verify all required fields are present
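
These checks can be automated with a few lines. The sketch assumes that artist and title are the required fields, which the input format suggests but this README does not state outright; validate_songs is a hypothetical helper.

```python
import json

REQUIRED_FIELDS = ("artist", "title")  # assumed minimum; adjust to your data


def validate_songs(text):
    """Parse JSON text and return a list of problems; an empty list means it looks usable."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(data, list):
        return ["top-level value must be an array of song objects"]
    problems = []
    for index, song in enumerate(data):
        if not isinstance(song, dict):
            problems.append(f"item {index} is not an object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in song]
        if missing:
            problems.append(f"item {index} is missing: {', '.join(missing)}")
    return problems
```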

Performance issues

Use database mode instead of API mode for better performance
Ensure database indexes are built for faster queries
Check fuzzy search thresholds - higher thresholds mean fewer but more accurate matches

Collaboration detection issues

NEW: Check if it's a band name vs collaboration (e.g., "Simon & Garfunkel" vs "Lovato, Demi & Joe Jonas")
NEW: Verify the collaboration pattern is supported (ft., feat., featuring, &, and, ,)
NEW: Letter case should not matter - patterns are matched case-insensitively

🎯 Use Cases

Karaoke Systems: Clean up song metadata for better search and organization
Music Libraries: Standardize artist names and add official IDs
Music Apps: Ensure consistent data across your application
Data Migration: Clean up legacy music data when moving to new systems
Fuzzy Matching: Handle typos and variations in artist/song names
NEW: Collaboration Handling: Process complex artist collaborations
NEW: Edge Cases: Handle artists with special characters and unusual names

📚 What are MBIDs?

MBID stands for MusicBrainz Identifier. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.

Benefits:

Permanent: Never change, even if names change
Universal: Used across many music applications
Reliable: Official identifiers from the MusicBrainz database
Linked Data: Connect to other music databases and services

🆕 Performance Comparison

| Method | Speed | Rate Limiting | Fuzzy Search | Setup Complexity |
|---|---|---|---|---|
| Database | ⚡ 10x faster | ❌ None | ✅ Yes | 🔧 Medium |
| API | 🐌 Slower | ⏱️ Yes (0.1s delay) | ❌ No | ✅ Easy |
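
The 0.1s delay in API mode amounts to a minimum interval between requests. A minimal sketch of such a throttle is shown below; the Throttle class is an assumption for illustration, with the clock and sleep functions injectable so it can be tested without waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls, e.g. 0.1 s between API requests."""

    def __init__(self, min_interval=0.1, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() right before each HTTP request keeps the delay in one place, and requests that are already spaced far enough apart are not slowed down.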

🆕 Collaboration Detection Examples

| Input | Type | Detection | Output |
|---|---|---|---|
| `Bruno Mars ft. Cardi B` | Collaboration | ✅ Primary pattern | `Bruno Mars feat. Cardi B` |
| `Pitbull ft. Ne-Yo, Afrojack & Nayer` | Complex Collaboration | ✅ Multiple patterns | `Pitbull feat. Ne-Yo, Afrojack & Nayer` |
| `Simon & Garfunkel` | Band Name | ❌ Protected | `Simon & Garfunkel` |
| `Lovato, Demi & Joe Jonas` | Collaboration | ✅ Comma detection | `Lovato, Demi & Joe Jonas` |
| `Hall & Oates` | Band Name | ❌ Protected | `Hall & Oates` |

🆕 Edge Case Examples

| Input | Type | Handling | Output |
|---|---|---|---|
| `ACDC` | Name Variation | ✅ Alias lookup | `AC/DC` |
| `98 Degrees` | Artist Alias | ✅ Alias search | `98°` |
| `S Club 7` | Numerical Suffix | ✅ Suffix removal | `S Club` |
| `Corby, Matt` | Sort Name | ✅ Sort name search | `Matt Corby` |
| `Blink-182` | Dash Variation | ✅ Unicode dash handling | `blink‐182` |
| `P!nk` | Special Characters | ✅ Direct search | `P!nk` |
| `3OH!3` | Numbers + Special | ✅ Direct search | `3OH!3` |
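
The dash variation row is worth a closer look, because the two spellings look identical on screen but fail an exact comparison. One possible approach, sketched here with a hypothetical normalize_dashes helper, is to map Unicode dashes to the ASCII hyphen before comparing; the tool itself may handle this differently.

```python
# Map common Unicode dash characters to the ASCII hyphen-minus before comparing.
DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")


def normalize_dashes(text):
    """Replace Unicode hyphens and dashes with "-" so both spellings compare equal."""
    return text.translate(DASHES)
```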

🤝 Contributing

Found a bug or have a feature request?

Check the existing issues
Create a new issue with details
Include sample data if possible

📄 License

This tool is provided as-is for educational and personal use.

🔗 Related Links

MusicBrainz - The open music encyclopedia
MusicBrainz API - API documentation
MusicBrainz Docker - Docker setup
FuzzyWuzzy - Fuzzy string matching library

📝 Lessons Learned

Database Integration

Direct PostgreSQL access is 10x faster than API calls
Docker-to-Docker connections require container IPs, not localhost
Database name matters: musicbrainz_db not musicbrainz
Static caches cause problems: Wrong MBIDs override correct database lookups

Collaboration Handling

Primary patterns (ft., feat.) are always collaborations
Secondary patterns (&, and) need extra logic to tell collaborations apart from band names
Comma detection helps identify collaborations
Artist credit lookup is essential for preserving all collaborators

Edge Cases

Dash variations (regular vs Unicode) cause exact match failures
Artist aliases are common and important (98 Degrees → 98°)
Sort names handle "Last, First" formats
Numerical suffixes in names need special handling (S Club 7 → S Club)

Performance Optimization

Remove static caches for better accuracy
Database-first approach ensures live data
Fuzzy search thresholds need tuning for different datasets
Connection pooling would improve performance for large datasets

Happy cleaning! 🎵✨
