MusicBrainz Data Cleaner - CLI Commands Reference

Overview

The MusicBrainz Data Cleaner is a command-line tool that reads a JSON file of songs and normalizes each song's metadata against the MusicBrainz database. It is built around interfaces and dependency injection, so the same processing logic runs against either the database or the HTTP API. It writes separate output files for successful and failed songs, along with a detailed processing report.

Basic Command Structure

docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main [options]

Command Options

Main Options

| Option | Type | Description | Default | Example |
|--------|------|-------------|---------|---------|
| --source | string | Source JSON file path | data/songs.json | --source data/my_songs.json |
| --output-success | string | Output file for successful songs | source-success.json | --output-success cleaned.json |
| --output-failure | string | Output file for failed songs | source-failure.json | --output-failure failed.json |
| --limit | number | Process only first N songs | None (all songs) | --limit 1000 |
| --use-api | flag | Force use of HTTP API instead of database | Database mode | --use-api |
| --test-connection | flag | Test connection to MusicBrainz server | None | --test-connection |
| --help | flag | Show help information | None | --help |
| --version | flag | Show version information | None | --version |

Command Examples

Basic Usage (Default)

# Process all songs with default settings
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
# Output: data/songs-success.json and data/songs-failure.json

Custom Source File

# Process specific file
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/my_songs.json
# Output: data/my_songs-success.json and data/my_songs-failure.json
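
The default output names in the two examples above appear to follow one pattern: the source file's stem plus `-success.json` or `-failure.json`, written next to the source. The following Python sketch restates that pattern as inferred from these examples only; the helper name `default_output_paths` is hypothetical and is not part of the tool.

```python
from pathlib import Path


def default_output_paths(source: str) -> tuple[str, str]:
    """Derive default output paths from a source path (hypothetical helper)."""
    src = Path(source)
    # Keep the directory, swap the file name for "<stem>-success.json" etc.
    success = src.with_name(f"{src.stem}-success.json")
    failure = src.with_name(f"{src.stem}-failure.json")
    return success.as_posix(), failure.as_posix()


print(default_output_paths("data/my_songs.json"))
```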

Custom Output Files

# Specify custom output files
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json --output-success cleaned.json --output-failure failed.json

Limited Processing

# Process only first 1000 songs
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --limit 1000

Force API Mode

# Use HTTP API instead of database (slower but works without PostgreSQL)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --use-api

Test Connection

# Test database connection
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection

# Test API connection
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection --use-api

Help and Information

# Show help information
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --help

# Show version information
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --version

Input File Format

The input file must be a valid JSON file containing an array of song objects:

[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  }
]

Required Fields

  • artist: The artist name (string)
  • title: The song title (string)

Optional Fields

Any additional fields will be preserved in the output:

  • disabled: Boolean flag
  • favorite: Boolean flag
  • guid: Unique identifier
  • path: File path
  • Any other custom fields
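
Before a long run it can help to check a file against the rules above. The following is a minimal sketch of such a check, assuming only what this section states (a top-level array, with string `artist` and `title` on every item). The function name `validate_songs` is hypothetical, and the tool's own validation may differ.

```python
import json


def validate_songs(raw: str) -> list[dict]:
    """Parse JSON text and check the structure described above."""
    songs = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(songs, list):
        raise ValueError("Source file should contain a JSON array of songs")
    for index, song in enumerate(songs):
        if not isinstance(song, dict):
            raise ValueError(f"Item {index} is not a JSON object")
        for field in ("artist", "title"):
            if not isinstance(song.get(field), str):
                raise ValueError(f"Item {index} is missing string field '{field}'")
    return songs


sample = '[{"artist": "ACDC", "title": "Shot In The Dark", "favorite": true}]'
print(len(validate_songs(sample)), "valid song(s)")
```

Extra fields such as `favorite` pass through untouched, which matches the statement that additional fields are preserved.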

Output Files

The tool creates three output files (the console summary under Processing Output also lists a JSON copy of the report):

1. Successful Songs (source-success.json)

Array of successfully processed songs with MBIDs added:

[
  {
    "artist": "AC/DC",
    "title": "Shot in the Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
    "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
  }
]

2. Failed Songs (source-failure.json)

Array of songs that couldn't be processed (same format as source):

[
  {
    "artist": "Unknown Artist",
    "title": "Unknown Song",
    "disabled": false,
    "favorite": false,
    "guid": "12345678-1234-1234-1234-123456789012",
    "path": "z://MP4\\Unknown Artist - Unknown Song.mp4"
  }
]

3. Processing Report (processing_report_YYYYMMDD_HHMMSS.txt)

Human-readable text report with statistics and failed song list:

MusicBrainz Data Cleaner - Processing Report
==================================================

Source File: data/songs.json
Processing Date: 2024-12-19 14:30:22
Processing Time: 15263.3 seconds

SUMMARY
--------------------
Total Songs Processed: 49,170
Successful Songs: 40,692
Failed Songs: 8,478
Success Rate: 82.8%

DETAILED STATISTICS
--------------------
Artists Found: 44,526/49,170 (90.6%)
Recordings Found: 40,998/49,170 (83.4%)
Processing Speed: 3.2 songs/second

OUTPUT FILES
--------------------
Successful Songs: data/songs-success.json
Failed Songs: data/songs-failure.json
Report File: data/processing_report_20241219_143022.txt

FAILED SONGS (First 50)
--------------------
  1. Unknown Artist - Unknown Song
  2. Invalid Artist - Invalid Title
  3. Test Artist - Test Song
...
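
The percentages and speed in the sample report can be reproduced from its raw counts. The sketch below shows that arithmetic. It assumes the tool rounds to one decimal place, which matches the sample but is not confirmed by the source.

```python
def percent(part: int, total: int) -> str:
    """Format a ratio as a percentage with one decimal place."""
    return f"{100 * part / total:.1f}%"


# Figures copied from the sample report above.
total, successful, failed = 49_170, 40_692, 8_478
elapsed_seconds = 15_263.3

print("Success rate:", percent(successful, total))
print("Failure rate:", percent(failed, total))
print(f"Speed: {total / elapsed_seconds:.1f} songs/second")
```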

Added Fields (Successful Songs Only)

  • mbid: MusicBrainz Artist ID (string)
  • recording_mbid: MusicBrainz Recording ID (string)
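
Because fields such as `guid` are carried through unchanged, the success file can be compared with the source to see which names were normalized. The following sketch does that under two assumptions: every song has a `guid`, and the helper name `changed_names` is hypothetical.

```python
def changed_names(source_songs, cleaned_songs):
    """Yield (guid, before, after) for songs whose artist or title changed."""
    before_by_guid = {song["guid"]: song for song in source_songs}
    for cleaned in cleaned_songs:
        original = before_by_guid.get(cleaned.get("guid"))
        if original is None:
            continue  # no matching source entry for this guid
        before = (original["artist"], original["title"])
        after = (cleaned["artist"], cleaned["title"])
        if before != after:
            yield cleaned["guid"], before, after


# Values taken from the input and output samples in this document.
source = [{"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
           "artist": "ACDC", "title": "Shot In The Dark"}]
cleaned = [{"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
            "artist": "AC/DC", "title": "Shot in the Dark",
            "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1"}]

for guid, before, after in changed_names(source, cleaned):
    print(guid, before, "->", after)
```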

Processing Output

Progress Indicators

🚀 Starting song processing...
📊 Total songs to process: 49,170
Using database connection
==================================================

[1 of 49,170] ✅ PASS: ACDC - Shot In The Dark
[2 of 49,170] ❌ FAIL: Unknown Artist - Unknown Song
[3 of 49,170] ✅ PASS: Bruno Mars feat. Cardi B - Finesse (remix)

  📈 Progress: 100/49,170 (0.2%) - Success: 85.0% - Rate: 3.2 songs/sec

==================================================
🎉 Processing completed!
📊 Final Results:
  ⏱️  Total processing time: 15263.3 seconds
  🚀 Average speed: 3.2 songs/second
  ✅ Artists found: 44,526/49,170 (90.6%)
  ✅ Recordings found: 40,998/49,170 (83.4%)
  ❌ Failed songs: 8,478 (17.2%)
📄 Files saved:
  ✅ Successful songs: data/songs-success.json
  ❌ Failed songs: data/songs-failure.json
  📋 Text report: data/processing_report_20241219_143022.txt
  📊 JSON report: data/processing_report_20241219_143022.json

Status Indicators

| Symbol | Meaning | Description |
|--------|---------|-------------|
| ✅ | Success | Song processed successfully with MBIDs found |
| ❌ | Failure | Song processing failed (no MBIDs found) |
| 📈 | Progress | Progress update with statistics |
| 🚀 | Start | Processing started |
| 🎉 | Complete | Processing completed successfully |

Error Messages and Exit Codes

Exit Codes

| Code | Meaning | Description |
|------|---------|-------------|
| 0 | Success | Processing completed successfully |
| 1 | Error | General error occurred |
| 2 | Usage Error | Invalid command line arguments |
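
A wrapper script can branch on these exit codes. The sketch below maps a return code to the meaning listed above. The helper names are hypothetical, and the demo command is a stand-in so the example runs without Docker; substitute the `docker-compose run` invocation from this document in real use.

```python
import subprocess
import sys

# Meanings copied from the exit-code table above.
EXIT_MEANINGS = {0: "Success", 1: "Error", 2: "Usage Error"}


def describe_exit(code: int) -> str:
    """Map an exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"Undocumented exit code {code}")


def run_and_describe(command: list[str]) -> str:
    """Run a command and describe how it ended."""
    result = subprocess.run(command)
    return describe_exit(result.returncode)


if __name__ == "__main__":
    # Stand-in command that exits with code 2.
    print(run_and_describe([sys.executable, "-c", "raise SystemExit(2)"]))
```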

Common Error Messages

File Not Found

Error: Source file does not exist: data/songs.json

Invalid JSON

Error: Invalid JSON in file 'songs.json'

Invalid Input Format

Error: Source file should contain a JSON array of songs

Connection Error

❌ Connection to MusicBrainz database failed

Missing Dependencies

ModuleNotFoundError: No module named 'requests'

Architecture Overview

Interface-Based Design

The tool uses a clean interface-based architecture:

  • MusicBrainzDataProvider Interface: Common protocol for data access
  • DataProviderFactory: Creates appropriate provider (database or API)
  • SongProcessor: Centralized processing logic using the interface
  • Dependency Injection: CLI depends on interfaces, not concrete classes

Data Flow

  1. CLI uses DataProviderFactory to create data provider
  2. Factory returns either database or API implementation
  3. SongProcessor processes songs using the common interface
  4. Same logic works regardless of provider type
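
The data flow above can be sketched in a few lines of Python. The class names `MusicBrainzDataProvider`, `DataProviderFactory` and `SongProcessor` come from this document. Everything else is an assumption made for illustration: the method `find_artist_mbid`, the `create` signature, and the in-memory stand-ins for the two providers.

```python
from typing import Optional, Protocol


class MusicBrainzDataProvider(Protocol):
    """Common interface; the method name below is hypothetical."""

    def find_artist_mbid(self, artist: str) -> Optional[str]: ...


class DatabaseProvider:
    """Stand-in for the PostgreSQL-backed provider (in-memory for the sketch)."""

    def __init__(self) -> None:
        self._artists = {"ac/dc": "66c662b6-6e2f-4930-8610-912e24c63ed1"}

    def find_artist_mbid(self, artist: str) -> Optional[str]:
        return self._artists.get(artist.lower())


class ApiProvider:
    """Stand-in for the HTTP API provider; always misses in this sketch."""

    def find_artist_mbid(self, artist: str) -> Optional[str]:
        return None


class DataProviderFactory:
    @staticmethod
    def create(use_api: bool = False) -> MusicBrainzDataProvider:
        return ApiProvider() if use_api else DatabaseProvider()


class SongProcessor:
    """Depends only on the interface, never on a concrete provider."""

    def __init__(self, provider: MusicBrainzDataProvider) -> None:
        self._provider = provider

    def process(self, song: dict) -> Optional[dict]:
        mbid = self._provider.find_artist_mbid(song["artist"])
        if mbid is None:
            return None  # caller would route this song to the failure file
        return {**song, "mbid": mbid}


processor = SongProcessor(DataProviderFactory.create(use_api=False))
print(processor.process({"artist": "AC/DC", "title": "Shot in the Dark"}))
```

Swapping `use_api=True` changes only which provider the factory returns; `SongProcessor` is untouched, which is the point of step 4.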

Environment Configuration

Docker Environment

The tool runs in a Docker container with the following configuration:

| Setting | Default | Description |
|---------|---------|-------------|
| Database Host | db | PostgreSQL database container |
| Database Port | 5432 | PostgreSQL port |
| Database Name | musicbrainz_db | MusicBrainz database name |
| API URL | http://localhost:5001 | MusicBrainz web server URL |

Environment Variables

# Database configuration
DB_HOST=db
DB_PORT=5432
DB_NAME=musicbrainz_db
DB_USER=musicbrainz
DB_PASSWORD=musicbrainz

# Web server configuration
MUSICBRAINZ_WEB_SERVER_PORT=5001
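
As an illustration of how these variables might be consumed, the sketch below reads them with the defaults listed above and builds a PostgreSQL connection URL. The helper names `load_db_settings` and `connection_url` are hypothetical, and the source does not say how the tool itself reads its configuration.

```python
import os

# Defaults mirror the values listed above.
DEFAULTS = {
    "DB_HOST": "db",
    "DB_PORT": "5432",
    "DB_NAME": "musicbrainz_db",
    "DB_USER": "musicbrainz",
    "DB_PASSWORD": "musicbrainz",
}


def load_db_settings(env=None) -> dict:
    """Read database settings from the environment, falling back to defaults."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings["DB_PORT"] = int(settings["DB_PORT"])  # ports are numeric
    return settings


def connection_url(settings: dict) -> str:
    """Build a PostgreSQL connection URL from the settings."""
    return (
        f"postgresql://{settings['DB_USER']}:{settings['DB_PASSWORD']}"
        f"@{settings['DB_HOST']}:{settings['DB_PORT']}/{settings['DB_NAME']}"
    )


# An empty mapping forces the defaults, regardless of the real environment.
print(connection_url(load_db_settings({})))
```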

Troubleshooting Commands

Check MusicBrainz Server Status

# Test if web server is running
curl -I http://localhost:5001

# Test database connection
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"

Validate JSON File

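# Check if JSON is valid (output suppressed; errors are still shown)
python3 -m json.tool data/songs.json > /dev/null && echo "Valid JSON"

# Check that the top-level value is an array
python3 -c "import json; data=json.load(open('data/songs.json')); assert isinstance(data, list), 'expected a JSON array'; print('Valid JSON array with', len(data), 'items')"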

Test Tool Connection

# Test database connection
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection

# Test API connection
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection --use-api

Advanced Usage

Batch Processing

To process multiple files, you can use shell scripting:

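# Process all JSON files in data directory, skipping outputs from earlier runs
for file in data/*.json; do
    case "$file" in *-success.json|*-failure.json|*processing_report_*) continue ;; esac
    docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source "$file"
done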

Large Files

For large files, the tool processes songs efficiently with:

  • Direct database access for maximum speed
  • Progress tracking every 100 songs
  • Memory-efficient processing
  • No rate limiting with database access

Custom Processing

# Process only the first 1000 songs (useful for testing)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json --limit 1000

# Process with custom output files
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json --output-success my_cleaned.json --output-failure my_failed.json

Command Line Shortcuts

Common Aliases

Add these to your shell profile for convenience:

# Add to ~/.bashrc or ~/.zshrc
alias mbclean='docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main'
alias mbclean-help='docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --help'
alias mbclean-test='docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection'

Usage with Aliases

# Using alias
mbclean --source data/songs.json

# Show help
mbclean-help

# Test connection
mbclean-test

Integration Examples

With Git

# Process files and commit changes
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json
git add data/songs-success.json data/songs-failure.json
git commit -m "Clean song metadata with MusicBrainz IDs"

With Cron Jobs

# Add to crontab to process files daily at 02:00 (-T disables TTY allocation, which cron does not provide)
0 2 * * * cd /path/to/musicbrainz-cleaner && docker-compose run --rm -T musicbrainz-cleaner python3 -m src.cli.main --source /path/to/songs.json

With Shell Scripts

#!/bin/bash
# clean_songs.sh
INPUT_FILE="$1"
OUTPUT_SUCCESS="${INPUT_FILE%.json}-success.json"
OUTPUT_FAILURE="${INPUT_FILE%.json}-failure.json"

docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main \
    --source "$INPUT_FILE" \
    --output-success "$OUTPUT_SUCCESS" \
    --output-failure "$OUTPUT_FAILURE"

if [ $? -eq 0 ]; then
    echo "Successfully processed $INPUT_FILE"
    echo "Successful songs: $OUTPUT_SUCCESS"
    echo "Failed songs: $OUTPUT_FAILURE"
else
    echo "Error processing $INPUT_FILE"
    exit 1
fi

Command Reference Summary

| Command | Description |
|---------|-------------|
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main | Process all songs with defaults |
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source file.json | Process specific file |
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --limit 1000 | Process first 1000 songs |
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection | Test database connection |
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --use-api | Force API mode |
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --help | Show help |
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --version | Show version |

Artist Lookup System Commands

The MusicBrainz Data Cleaner includes an Artist Lookup System with its own CLI for managing artist data.

Artist Lookup CLI Structure

docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.artist_lookup_cli [command] [options]

Available Commands

Search for Artists

# Search for an artist in the lookup table
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.artist_lookup_cli search "Queen"

# Search with custom similarity threshold (0.0 to 1.0)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.artist_lookup_cli search "Destiny's Child" --min-score 0.8

View Statistics

# Show lookup table statistics
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.artist_lookup_cli stats

List All Artists

# List all artists in the lookup table
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.artist_lookup_cli list

Add New Artists

# Add a new artist with variations
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.artist_lookup_cli add \
  --canonical-name "New Artist" \
  --mbid "12345678-1234-1234-1234-123456789abc" \
  --variations "Artist, The Artist, Artist Band" \
  --notes "Description of the artist"

Artist Lookup Command Reference

| Command | Description |
|---------|-------------|
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.artist_lookup_cli search "Artist Name" | Search for artist with fuzzy matching |
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.artist_lookup_cli search "Artist Name" --min-score 0.8 | Search with custom similarity threshold |
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.artist_lookup_cli stats | Show lookup table statistics |
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.artist_lookup_cli list | List all artists in lookup table |
| docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.artist_lookup_cli add --canonical-name "Name" --mbid "MBID" --variations "var1, var2" | Add new artist to lookup table |

Artist Lookup Features

  • 2,446+ Artists: Comprehensive lookup table
  • 4,950+ Variations: Extensive name variations and aliases
  • Fuzzy Matching: Intelligent matching with configurable thresholds
  • Canonical Names: Consistent artist name replacement
  • Automatic Integration: Works seamlessly with song processing
  • CLI Management: Full command-line interface for data management
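
To make the idea of fuzzy matching with a score threshold concrete, here is a small sketch. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the source does not say which algorithm the tool uses. The three-entry table, the `search` helper and the 0.8 default are all assumptions for illustration.

```python
from difflib import SequenceMatcher

# Tiny illustrative lookup table: canonical name -> known variations.
LOOKUP = {
    "AC/DC": ["ACDC", "AC DC"],
    "Queen": ["Queen Band"],
    "Destiny's Child": ["Destinys Child"],
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(query: str, min_score: float = 0.8):
    """Return (canonical_name, score) for the best match at or above min_score."""
    best = None
    for canonical, variations in LOOKUP.items():
        # Score the query against the canonical name and every variation.
        score = max(similarity(query, name) for name in [canonical, *variations])
        if score >= min_score and (best is None or score > best[1]):
            best = (canonical, score)
    return best


print(search("acdc"))
print(search("Quen"))
print(search("Unknown Artist"))
```

Raising `min_score` trades recall for precision: a misspelling such as "Quen" still matches "Queen" at 0.8 but would be rejected at a stricter threshold like 0.95.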