Signed-off-by: Matt Bruce <mbrucedogs@gmail.com> 2025-07-31 11:18:07 -05:00

🎵 MusicBrainz Data Cleaner v2.0

A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. Now with direct database access and fuzzy search for maximum performance and accuracy!

What's New in v2.0

  • 🚀 Direct Database Access: Connect directly to PostgreSQL for 10x faster performance
  • 🎯 Fuzzy Search: Intelligent matching for similar artist names and song titles
  • 🔄 Automatic Fallback: Falls back to API mode if database access fails
  • No Rate Limiting: Database queries don't have API rate limits
  • 📊 Similarity Scoring: See how closely each match scores (0.0 to 1.0)

What It Does

Before:

{
  "artist": "ACDC",
  "title": "Shot In The Dark",
  "favorite": true
}

After:

{
  "artist": "AC/DC",
  "title": "Shot in the Dark",
  "favorite": true,
  "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
  "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}

🚀 Quick Start

1. Install Dependencies

pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein

2. Set Up MusicBrainz Server

Option A: Docker Setup

# Clone MusicBrainz Docker repository
git clone https://github.com/metabrainz/musicbrainz-docker.git
cd musicbrainz-docker

# Start the server
docker-compose up -d

# Wait for database to be ready (can take 10-15 minutes)
docker-compose logs -f musicbrainz

Option B: Manual Setup

  1. Install PostgreSQL 12+
  2. Create database: createdb musicbrainz
  3. Import MusicBrainz data dump
  4. Start MusicBrainz server on port 5001

3. Test Connection

python musicbrainz_cleaner.py --test-connection

4. Run the Cleaner

# Use database access (recommended, faster)
python musicbrainz_cleaner.py your_songs.json

# Force API mode (slower, fallback)
python musicbrainz_cleaner.py your_songs.json --use-api

That's it! Your cleaned data will be saved to your_songs_cleaned.json

📋 Requirements

  • Python 3.6+
  • MusicBrainz Server running on localhost:5001
  • PostgreSQL Database accessible on localhost:5432
  • Dependencies: requests, psycopg2-binary, fuzzywuzzy, python-Levenshtein

🔧 Server Configuration

Database Access

  • Host: localhost
  • Port: 5432 (PostgreSQL default)
  • Database: musicbrainz
  • User: musicbrainz
  • Password: musicbrainz (default, should be changed in production)
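The settings above could be collected into one parameter dictionary before connecting. The sketch below is an assumption about how that might look, not the tool's actual code: the `MB_DB_*` environment variable names and the `db_params` and `port_open` helpers are hypothetical. The dictionary keys (`host`, `port`, `dbname`, `user`, `password`) are keyword names that `psycopg2.connect` accepts.

```python
import os
import socket

# Defaults taken from the settings listed above.
DEFAULTS = {
    "host": "localhost",
    "port": "5432",
    "dbname": "musicbrainz",
    "user": "musicbrainz",
    "password": "musicbrainz",
}


def db_params(env=None):
    """Build connection parameters, letting MB_DB_* variables override defaults.

    The MB_DB_* names are hypothetical; they are not defined by the tool.
    """
    env = os.environ if env is None else env
    params = {key: env.get(f"MB_DB_{key.upper()}", default)
              for key, default in DEFAULTS.items()}
    params["port"] = int(params["port"])
    return params


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    params = db_params()
    print("PostgreSQL reachable:", port_open(params["host"], params["port"]))
    # With psycopg2 installed, the same dictionary could be passed on:
    #   import psycopg2
    #   conn = psycopg2.connect(**params)
```

Reading the password from the environment is one way to avoid shipping the default `musicbrainz` password in production.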

HTTP API (Fallback)

  • URL: http://localhost:5001

Troubleshooting

  • Database Connection Failed: Check PostgreSQL is running and credentials are correct
  • API Connection Failed: Check MusicBrainz server is running on port 5001
  • Slow Performance: Ensure database indexes are built
  • No Results: Verify data has been imported to the database

🧪 Testing

Run the test suite to verify everything works correctly:

# Run all tests
python3 src/tests/run_tests.py

# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli

📁 Data Files

The tool uses external JSON files for known artist and recording data:

  • data/known_artists.json: Contains known artist MBIDs for common artists
  • data/known_recordings.json: Contains known recording MBIDs for common songs

These files can be easily updated without touching the code, making it simple to add new artists and recordings.

🎯 Features

Artist Name Fixes

  • ACDC → AC/DC
  • Bruno Mars ft. Cardi B → Bruno Mars feat. Cardi B
  • featuring → feat.

Song Title Fixes

  • Shot In The Dark → Shot in the Dark
  • Removes (Karaoke Version), (Instrumental) suffixes
  • Normalizes capitalization and formatting
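The rule-based parts of these fixes (suffix removal and the `feat.` rewrite) can be sketched with regular expressions. This is a minimal illustration under assumptions, not the tool's own code: `strip_suffixes` and `normalize_feat` are hypothetical names. Corrections such as ACDC → AC/DC or title capitalization are left out here because they depend on the MusicBrainz lookup rather than on a fixed pattern.

```python
import re

# Matches a trailing "(Karaoke Version)" or "(Instrumental)" suffix.
_SUFFIX = re.compile(r"\s*\((?:karaoke version|instrumental)\)\s*$", re.IGNORECASE)
# Matches "featuring", "ft", "ft.", "feat" or "feat." when followed by whitespace.
_FEAT = re.compile(r"\b(?:featuring|ft\.?|feat\.?)(?=\s)", re.IGNORECASE)


def strip_suffixes(title):
    """Remove trailing karaoke/instrumental markers, repeating if stacked."""
    previous = None
    while previous != title:
        previous = title
        title = _SUFFIX.sub("", title)
    return title.strip()


def normalize_feat(artist):
    """Rewrite featuring markers to the 'feat.' form."""
    return _FEAT.sub("feat.", artist)


if __name__ == "__main__":
    print(strip_suffixes("Shot In The Dark (Karaoke Version)"))
    print(normalize_feat("Bruno Mars ft. Cardi B"))
```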

Added Data

  • mbid: Official MusicBrainz Artist ID
  • recording_mbid: Official MusicBrainz Recording ID

Preserves Your Data

  • Keeps all your existing fields (guid, path, disabled, favorite, etc.)
  • Only adds new fields, never removes existing ones

Fuzzy Search

  • Intelligent Matching: Finds similar names even with typos or variations
  • Similarity Scoring: Shows how well each match scores (0.0 to 1.0)
  • Configurable Thresholds: Adjust matching sensitivity
  • Multiple Algorithms: Uses ratio, partial ratio, and token sort matching
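To make the scoring and threshold ideas concrete, here is a small sketch of threshold-based matching. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for the fuzzywuzzy scorers, so the numbers will not match the tool's output exactly, and it covers only plain ratio and token-sort matching (partial ratio is omitted). The `best_match` helper and its default threshold of 0.8 are assumptions for illustration.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Plain similarity of two strings, case-insensitive, in 0.0..1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_sort_ratio(a, b):
    """Similarity after sorting words, so word order does not matter."""
    def key(text):
        return " ".join(sorted(text.lower().split()))
    return SequenceMatcher(None, key(a), key(b)).ratio()


def best_match(query, candidates, threshold=0.8):
    """Return (candidate, score) for the best score >= threshold, else None."""
    best = None
    for candidate in candidates:
        score = max(ratio(query, candidate), token_sort_ratio(query, candidate))
        if score >= threshold and (best is None or score > best[1]):
            best = (candidate, score)
    return best


if __name__ == "__main__":
    print(best_match("ACDC", ["ABBA", "AC/DC", "Accept"]))
```

Raising the threshold rejects weaker candidates, which trades recall for precision; lowering it does the opposite.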

📖 Usage Examples

Basic Usage

# Clean your songs and save to auto-generated filename
python musicbrainz_cleaner.py my_songs.json
# Output: my_songs_cleaned.json

Custom Output File

# Specify your own output filename
python musicbrainz_cleaner.py my_songs.json cleaned_songs.json
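The output naming described above (an explicit filename if given, otherwise `_cleaned` appended to the input name) could be derived as follows. This is a sketch of the naming rule only; `output_path` is a hypothetical helper, not a function the tool is known to expose.

```python
from pathlib import Path


def output_path(input_file, output_file=None):
    """Return the explicit output path, or '<name>_cleaned<ext>' next to the input."""
    if output_file:
        return Path(output_file)
    source = Path(input_file)
    return source.with_name(f"{source.stem}_cleaned{source.suffix}")


if __name__ == "__main__":
    print(output_path("my_songs.json"))
    print(output_path("my_songs.json", "cleaned_songs.json"))
```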

Force API Mode

# Use HTTP API instead of database (slower but works without PostgreSQL)
python musicbrainz_cleaner.py my_songs.json --use-api
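The choice between database mode, forced API mode, and automatic fallback might be structured like the sketch below. It is an assumption about the control flow, not the tool's implementation: `choose_backend` is a hypothetical name, and `connect_db` and `connect_api` stand for caller-supplied functions that return a client or raise on failure.

```python
def choose_backend(connect_db, connect_api, use_api=False):
    """Return (mode, client), preferring the database unless use_api is set.

    connect_db and connect_api are caller-supplied callables (hypothetical
    here) that return a client object or raise an exception on failure.
    """
    if not use_api:
        try:
            return "database", connect_db()
        except Exception as error:  # broad on purpose: any failure triggers fallback
            print(f"Database unavailable ({error}); falling back to API mode")
    return "api", connect_api()
```

Passing the connectors in as callables keeps the selection logic testable without a running PostgreSQL or MusicBrainz server.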

Test Connections

# Test database connection
python musicbrainz_cleaner.py --test-connection

# Test with API mode
python musicbrainz_cleaner.py --test-connection --use-api

Help

# Show usage information
python musicbrainz_cleaner.py --help

📁 Data Files

Input Format

Your JSON file should contain an array of song objects:

[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  },
  {
    "artist": "Bruno Mars ft. Cardi B",
    "title": "Finesse Remix",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
  }
]

📤 Output Format

The tool will update your objects with corrected data:

[
  {
    "artist": "AC/DC",
    "title": "Shot in the Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
    "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
  },
  {
    "artist": "Bruno Mars feat. Cardi B",
    "title": "Finesse (remix)",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
    "mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
    "recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
  }
]
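The field-preserving update shown in the output above can be sketched as a copy-then-update step. The `apply_match` helper is a hypothetical name used for illustration; the point is only that existing keys such as `guid` and `path` are carried over while `artist`, `title`, `mbid`, and `recording_mbid` are set.

```python
def apply_match(song, artist, title, mbid, recording_mbid):
    """Return a copy of song with corrected names and MBIDs added.

    Every other key (guid, path, favorite, ...) is carried over untouched.
    """
    cleaned = dict(song)  # shallow copy, so the input is not mutated
    cleaned.update(
        artist=artist,
        title=title,
        mbid=mbid,
        recording_mbid=recording_mbid,
    )
    return cleaned


if __name__ == "__main__":
    original = {"artist": "ACDC", "title": "Shot In The Dark", "favorite": True}
    print(apply_match(
        original,
        artist="AC/DC",
        title="Shot in the Dark",
        mbid="66c662b6-6e2f-4930-8610-912e24c63ed1",
        recording_mbid="cf8b5cd0-d97c-413d-882f-fc422a2e57db",
    ))
```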

🎬 Example Run

$ python musicbrainz_cleaner.py data/sample_songs.json

Processing 3 songs...
Using database connection
==================================================

[1/3] Processing: ACDC - Shot In The Dark
  🎯 Fuzzy match found: ACDC → AC/DC (score: 0.85)
  ✅ Found artist: AC/DC (MBID: 66c662b6-6e2f-4930-8610-912e24c63ed1)
  🎯 Fuzzy match found: Shot In The Dark → Shot in the Dark (score: 0.92)
  ✅ Found recording: Shot in the Dark (MBID: cf8b5cd0-d97c-413d-882f-fc422a2e57db)
  ✅ Updated to: AC/DC - Shot in the Dark

[2/3] Processing: Bruno Mars ft. Cardi B - Finesse Remix
  🎯 Fuzzy match found: Bruno Mars → Bruno Mars (score: 1.00)
  ✅ Found artist: Bruno Mars (MBID: afb680f2-b6eb-4cd7-a70b-a63b25c763d5)
  🎯 Fuzzy match found: Finesse Remix → Finesse (remix) (score: 0.88)
  ✅ Found recording: Finesse (remix) (MBID: 8ed14014-547a-4128-ab81-c2dca7ae198e)
  ✅ Updated to: Bruno Mars feat. Cardi B - Finesse (remix)

[3/3] Processing: Taylor Swift - Love Story
  🎯 Fuzzy match found: Taylor Swift → Taylor Swift (score: 1.00)
  ✅ Found artist: Taylor Swift (MBID: 20244d07-534f-4eff-b4d4-930878889970)
  🎯 Fuzzy match found: Love Story → Love Story (score: 1.00)
  ✅ Found recording: Love Story (MBID: d783e6c5-761f-4fc3-bfcf-6089cdfc8f96)
  ✅ Updated to: Taylor Swift - Love Story

==================================================
✅ Processing complete!
📁 Output saved to: data/sample_songs_cleaned.json

🔧 Troubleshooting

"Could not find artist"

  • The artist might not be in the MusicBrainz database
  • Try checking the spelling or using a different variation
  • The search index might still be building (wait a few minutes)
  • NEW: Check the fuzzy search similarity score and lower the threshold if needed

"Could not find recording"

  • The song might not be in the database
  • The title might not match exactly
  • Try a simpler title (remove extra words)
  • NEW: Check the fuzzy search similarity score and lower the threshold if needed

Connection errors

  • Database: Make sure PostgreSQL is running and accessible
  • API: Make sure your MusicBrainz server is running on http://localhost:5001
  • Check that Docker containers are up and running
  • Verify the server is accessible in your browser

JSON errors

  • Make sure your input file is valid JSON
  • Check that it contains an array of objects
  • Verify all required fields are present
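These checks can be run before invoking the cleaner. The sketch below is one way to do so, under the assumption that `artist` and `title` are the required fields; the README does not list them explicitly, so `REQUIRED_FIELDS` and the `validate_songs` helper are hypothetical.

```python
import json

REQUIRED_FIELDS = ("artist", "title")  # assumed; the tool may require others


def validate_songs(text):
    """Parse JSON text and return a list of problems (empty when valid)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as error:
        return [f"invalid JSON: {error}"]
    if not isinstance(data, list):
        return ["top-level value must be an array of song objects"]
    problems = []
    for index, song in enumerate(data):
        if not isinstance(song, dict):
            problems.append(f"item {index}: expected an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in song:
                problems.append(f"item {index}: missing '{field}'")
    return problems


if __name__ == "__main__":
    print(validate_songs('[{"artist": "ACDC"}]'))
```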

Performance issues

  • NEW: Use database mode instead of API mode for better performance
  • NEW: Ensure database indexes are built for faster queries
  • NEW: Check fuzzy search thresholds: higher thresholds mean fewer but more accurate matches

🎯 Use Cases

  • Karaoke Systems: Clean up song metadata for better search and organization
  • Music Libraries: Standardize artist names and add official IDs
  • Music Apps: Ensure consistent data across your application
  • Data Migration: Clean up legacy music data when moving to new systems
  • Fuzzy Matching: Handle typos and variations in artist/song names

📚 What are MBIDs?

MBID stands for MusicBrainz Identifier. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.

Benefits:

  • Permanent: Never change, even if names change
  • Universal: Used across many music applications
  • Reliable: Official identifiers from the MusicBrainz database
  • Linked Data: Connect to other music databases and services
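MBIDs are UUIDs written in the 36-character hyphenated form, so a basic format check is possible with the standard library. The sketch below only validates the shape of the string; it does not confirm that the ID exists in MusicBrainz. `is_valid_mbid` is a hypothetical helper name.

```python
import uuid


def is_valid_mbid(value):
    """Return True if value is a UUID in canonical 8-4-4-4-12 hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print(is_valid_mbid("66c662b6-6e2f-4930-8610-912e24c63ed1"))
    print(is_valid_mbid("not-an-mbid"))
```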

🆕 Performance Comparison

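| Method   | Speed      | Rate Limiting        | Fuzzy Search | Setup Complexity |
|----------|------------|----------------------|--------------|------------------|
| Database | 10x faster | None                 | Yes          | 🔧 Medium        |
| API      | 🐌 Slower  | ⏱️ Yes (0.1s delay)  | No           | Easy             |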
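The 0.1 s delay listed for API mode amounts to a minimum interval between requests. A throttle of that kind could look like the sketch below; it is an assumption about the mechanism, not the tool's code, and the `Throttle` class is a hypothetical name. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (0.1 s by default)."""

    def __init__(self, min_interval=0.1, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep just long enough to keep calls min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` before each API request spaces the requests out, which is why database mode, with no such delay, processes large files faster.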

🤝 Contributing

Found a bug or have a feature request?

  1. Check the existing issues
  2. Create a new issue with details
  3. Include sample data if possible

📄 License

This tool is provided as-is for educational and personal use.


Happy cleaning! 🎵