# 🎵 MusicBrainz Data Cleaner v3.0

A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with interface-based architecture, advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!**

## 🚀 Quick Start for New Sessions

**If you're starting fresh or after a reboot, follow this exact sequence:**

### 1. Start MusicBrainz Services
```bash
# Quick restart (recommended)
./restart_services.sh

# Or full restart (if you have issues)
./start_services.sh
```

### 2. Wait for Services to Initialize
- **Database**: 5-10 minutes to fully load
- **Web server**: 2-3 minutes to start responding
- **Check status**: `cd ../musicbrainz-docker && docker-compose ps`

### 3. Verify Services Are Ready
```bash
# Test web server
curl -s http://localhost:5001 | head -5

# Test database (should show 2.6M+ artists)
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"

# Test cleaner connection
docker-compose run --rm musicbrainz-cleaner python3 -c "from src.api.database import MusicBrainzDatabase; db = MusicBrainzDatabase(); print('Connection result:', db.connect())"
```
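
The readiness checks above can also be scripted. The following is a minimal Python sketch and not part of the cleaner itself: `web_server_ready` and `wait_until_ready` are hypothetical helper names, and the 60 attempts with a 10-second delay are assumptions chosen to cover the 5-10 minute database load time.

```python
import http.client
import time
import urllib.error
import urllib.request


def web_server_ready(url, timeout=5.0):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(probe, attempts=60, delay=10.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay)  # pause before the next check
    return False
```

For example, `wait_until_ready(lambda: web_server_ready("http://localhost:5001"))` would block until the web server answers or roughly ten minutes have passed.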

### 4. Run Tests
```bash
# Test 100 random songs
docker-compose run --rm musicbrainz-cleaner python3 test_100_random.py

# Or other test scripts
docker-compose run --rm musicbrainz-cleaner python3 [script_name].py
```

**⚠️ Important**: Always run scripts via Docker - the cleaner cannot connect to the database directly from outside the container.

**📋 Troubleshooting**: See `TROUBLESHOOTING.md` for common issues and solutions.

## ✨ What's New in v3.0

- **🏗️ Interface-Based Architecture**: Clean dependency injection with common interfaces
- **🏭 Factory Pattern**: Smart data provider creation and configuration
- **🚀 Direct Database Access**: Connect directly to PostgreSQL for 10x faster performance
- **🎯 Advanced Fuzzy Search**: Intelligent matching for similar artist names and song titles
- **🔄 Automatic Fallback**: Falls back to API mode if database access fails
- **⚡ No Rate Limiting**: Database queries don't have API rate limits
- **📊 Similarity Scoring**: See how well matches are scored
- **🆕 Collaboration Detection**: Intelligently handle complex collaborations like "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- **🆕 Artist Aliases**: Handle name variations like "98 Degrees" → "98°" and "S Club 7" → "S Club"
- **🆕 Sort Names**: Handle "Last, First" formats like "Corby, Matt" → "Matt Corby"
- **🆕 Edge Case Handling**: Support for artists with hyphens, exclamation marks, numbers, and special characters
- **🆕 Band Name Protection**: Distinguish between band names (Simon & Garfunkel) and collaborations (Lovato, Demi & Joe Jonas)

## ✨ What It Does

**Before:**
```json
{
  "artist": "ACDC",
  "title": "Shot In The Dark",
  "favorite": true
}
```

**After:**
```json
{
  "artist": "AC/DC",
  "title": "Shot in the Dark",
  "favorite": true,
  "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
  "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}
```

## 🚀 Quick Start

### Option 1: Automated Setup (Recommended)

1. **Start MusicBrainz services**:
   ```bash
   ./start_services.sh
   ```
   This script will:
   - Check for Docker and port conflicts
   - Start all MusicBrainz services
   - Wait for database initialization
   - Create environment configuration
   - Test the connection

2. **Run the cleaner**:
   ```bash
   # Flags match the Usage Examples section (--source, --output-success)
   docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json --output-success cleaned_songs.json
   ```

### Option 2: Manual Setup

1. **Start MusicBrainz services manually**:
   ```bash
   cd ../musicbrainz-docker
   MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
   ```
   Wait 5-10 minutes for database initialization.

2. **Create environment configuration**:
   ```bash
   # Create .env file in musicbrainz-cleaner directory
   # DB_HOST uses the Docker service name rather than a hardcoded container IP
   # (see "Issue 2: Wrong Database Host Configuration")
   cat > .env << EOF
   DB_HOST=db
   DB_PORT=5432
   DB_NAME=musicbrainz_db
   DB_USER=musicbrainz
   DB_PASSWORD=musicbrainz
   MUSICBRAINZ_WEB_SERVER_PORT=5001
   EOF
   ```

3. **Run the cleaner**:
   ```bash
   # Flags match the Usage Examples section (--source, --output-success)
   docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json --output-success cleaned_songs.json
   ```

### For detailed setup instructions, see [SETUP.md](SETUP.md)

## 🔄 After System Reboot

After restarting your Mac, you'll need to restart the MusicBrainz services:

### Quick Restart (Recommended)
```bash
# If Docker Desktop is already running
./restart_services.sh

# Or manually
cd ../musicbrainz-docker && MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
```

### Full Restart (If you have issues)
```bash
# Complete setup including Docker checks
./start_services.sh
```

### Auto-start Setup (Optional)
1. **Enable Docker Desktop auto-start**:
   - Open Docker Desktop
   - Go to Settings → General
   - Check "Start Docker Desktop when you log in"

2. **Then just run**: `./restart_services.sh` after each reboot

**Note**: Your data is preserved in Docker volumes, so you don't need to reconfigure anything after a reboot.

## 🚨 Common Startup Issues & Fixes

### Issue 1: Database Connection Refused
**Problem**: Cleaner can't connect to database with error "Connection refused"
**Root Cause**: Database container not fully initialized or wrong host configuration
**Fix**:
```bash
# Wait for database to be ready (check logs)
cd ../musicbrainz-docker && docker-compose logs db | tail -10

# Verify database is accepting connections
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
```

### Issue 2: Wrong Database Host Configuration
**Problem**: Cleaner tries to connect to `172.18.0.2` but can't reach it
**Root Cause**: Hardcoded IP address in database connection
**Fix**: Use Docker service name `db` instead of IP address
```python
# In src/api/database.py, change:
host='172.18.0.2'  # ❌ Wrong
host='db'          # ✅ Correct
```
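
One way to avoid a hardcoded address altogether is to read the host from the environment and fall back to the service name. This is a minimal sketch, assuming the cleaner reads the same `DB_*` variables that appear in the `.env` file; `database_settings` is a hypothetical helper and not a function known to exist in `src/api/database.py`.

```python
import os


def database_settings(env=os.environ):
    """Collect connection settings, preferring environment variables over defaults."""
    return {
        "host": env.get("DB_HOST", "db"),              # Docker service name by default
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env.get("DB_NAME", "musicbrainz_db"),
        "user": env.get("DB_USER", "musicbrainz"),
        "password": env.get("DB_PASSWORD", "musicbrainz"),
    }
```

The dictionary keys are the keyword names accepted by `psycopg2.connect`, so the result could be passed as `psycopg2.connect(**database_settings())`.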

### Issue 3: Test Script Logic Error
**Problem**: Test shows 0% success rate despite finding artists
**Root Cause**: Test script checking `'mbid' in result` where `result` is a tuple `(song_dict, success_boolean)`
**Fix**: Extract song dictionary from tuple
```python
# Wrong:
artist_found = 'mbid' in result

# Correct:
cleaned_song, success = result
artist_found = 'mbid' in cleaned_song
```

### Issue 4: Services Not Fully Initialized
**Problem**: API returns empty results even though database has data
**Root Cause**: MusicBrainz web server still starting up
**Fix**: Wait for services to be fully ready
```bash
# Check if web server is responding
curl -s http://localhost:5001 | head -5

# Wait for database to be ready
docker-compose logs db | grep "database system is ready"
```

### Issue 5: Port Conflicts
**Problem**: Port 5000 already in use
**Root Cause**: Another service using the port
**Fix**: Use alternative port
```bash
MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
```

### Issue 6: Container Name Conflicts
**Problem**: "Container name already in use" error
**Root Cause**: Previous containers not properly cleaned up
**Fix**: Remove conflicting containers
```bash
docker-compose down
docker rm -f <container_name>
```

## 🔧 Startup Checklist

Before running tests, verify:
1. ✅ Docker Desktop is running
2. ✅ All containers are up: `docker-compose ps`
3. ✅ Database is ready: `docker-compose logs db | grep "ready"`
4. ✅ Web server responds: `curl -s http://localhost:5001`
5. ✅ Database has data: `docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"`
6. ✅ Cleaner can connect: Test database connection in cleaner

## 📋 Requirements

- **Python 3.6+**
- **MusicBrainz Server** running on localhost:5001 (the port set via `MUSICBRAINZ_WEB_SERVER_PORT` in this guide)
- **PostgreSQL Database** accessible on port 5432
- **Dependencies**: `requests`, `psycopg2-binary`, `fuzzywuzzy`, `python-Levenshtein`

## 🔧 Server Configuration

### Database Access
- **Host**: localhost (or the Docker service name `db` for container-to-container connections)
- **Port**: 5432 (PostgreSQL default)
- **Database**: musicbrainz_db (actual database name)
- **User**: musicbrainz
- **Password**: musicbrainz (default, should be changed in production)

### HTTP API (Fallback)
- **URL**: http://localhost:5001
- **Endpoint**: /ws/2/
- **Format**: JSON

### Troubleshooting
- **Database Connection Failed**: Check PostgreSQL is running and credentials are correct
- **API Connection Failed**: Check MusicBrainz server is running on port 5001
- **Slow Performance**: Ensure database indexes are built
- **No Results**: Verify data has been imported to the database
- **NEW**: **Docker Networking**: Use the Docker service name (`db`) for Docker-to-Docker connections, not a hardcoded container IP such as 172.18.0.2
- **NEW**: **Database Name**: Ensure using `musicbrainz_db` not `musicbrainz`

## 🧪 Testing

### Test File Organization
- **REQUIRED**: All test files must be placed in `src/tests/` directory
- **PROHIBITED**: Test files should not be placed in the root directory
- **Naming Convention**: Test files should follow `test_*.py` or `debug_*.py` patterns
- **Purpose**: Keeps root directory clean and organizes test code properly

### Running Tests
```bash
# Run all tests
python3 src/tests/run_tests.py

# Run specific test categories
python3 src/tests/run_tests.py --unit          # Unit tests only
python3 src/tests/run_tests.py --integration   # Integration tests only

# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli

# List all available tests
python3 src/tests/run_tests.py --list
```

#### Test Categories
- **Unit Tests**: Test individual components in isolation
- **Integration Tests**: Test interactions between components and database
- **Debug Tests**: Debug scripts and troubleshooting tools

## 📁 Project Structure

```
musicbrainz-cleaner/
├── src/
│   ├── api/                              # Database and API access
│   │   ├── database.py                   # Direct PostgreSQL access (implements MusicBrainzDataProvider)
│   │   └── api_client.py                 # HTTP API client (implements MusicBrainzDataProvider)
│   ├── cli/                              # Command-line interface
│   │   └── main.py                       # Main CLI implementation (uses factory pattern)
│   ├── config/                           # Configuration and constants
│   ├── core/                             # Core functionality
│   │   ├── interfaces.py                 # Common interfaces and protocols
│   │   ├── factory.py                    # Data provider factory
│   │   └── song_processor.py             # Centralized song processing logic
│   ├── tests/                            # Test files (REQUIRED location)
│   └── utils/                            # Utility functions
│       ├── artist_title_processing.py    # Shared artist/title processing
│       └── data_loader.py                # Data loading utilities
├── data/                                 # Data files and output
│   ├── known_artists.json                # Name variations (ACDC → AC/DC)
│   ├── known_recordings.json             # Known recording MBIDs
│   └── songs.json                        # Source songs file
└── docker-compose.yml                    # Docker configuration
```
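
To illustrate how the interface and factory pieces might fit together, here is a small self-contained sketch. Only the name `MusicBrainzDataProvider` comes from the tree above; the method names, the two fake provider classes, and `create_provider` are assumptions made for illustration and may not match what `interfaces.py` and `factory.py` actually contain.

```python
from abc import ABC, abstractmethod
from typing import Optional


class MusicBrainzDataProvider(ABC):
    """Common interface shared by the database and API back ends (assumed shape)."""

    @abstractmethod
    def connect(self) -> bool:
        """Return True when the back end is reachable."""

    @abstractmethod
    def find_artist_mbid(self, name: str) -> Optional[str]:
        """Return an artist MBID, or None when nothing matches."""


class FakeDatabaseProvider(MusicBrainzDataProvider):
    """Stand-in for the PostgreSQL provider; reachability is set by the caller."""

    def __init__(self, reachable: bool) -> None:
        self.reachable = reachable

    def connect(self) -> bool:
        return self.reachable

    def find_artist_mbid(self, name: str) -> Optional[str]:
        return "66c662b6-6e2f-4930-8610-912e24c63ed1" if name == "AC/DC" else None


class FakeApiProvider(MusicBrainzDataProvider):
    """Stand-in for the HTTP API provider."""

    def connect(self) -> bool:
        return True

    def find_artist_mbid(self, name: str) -> Optional[str]:
        return None


def create_provider(database, api, use_api=False):
    """Prefer the database; fall back to the API when forced or unreachable."""
    if not use_api and database.connect():
        return database
    return api
```

Because callers only depend on the shared interface, the rest of the processing code would not need to know which back end was selected.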

### Data Files

The tool uses external JSON files for name variations:

- **`data/known_artists.json`**: Contains name variations (ACDC → AC/DC, ft. → feat.)
- **`data/known_recordings.json`**: Contains known recording MBIDs for common songs

These files can be easily updated without touching the code, making it simple to add new name variations.

## 🎯 Features

### ✅ Artist Name Fixes
- `ACDC` → `AC/DC`
- `Bruno Mars ft. Cardi B` → `Bruno Mars feat. Cardi B`
- `featuring` → `feat.`
- `98 Degrees` → `98°` (artist aliases)
- `S Club 7` → `S Club` (numerical suffixes)
- `Corby, Matt` → `Matt Corby` (sort names)
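
As an illustration of the sort-name case, the sketch below flips a "Last, First" string into a display-order candidate. It is a naive string transformation under stated assumptions, not the tool's method: `flip_sort_name` is a hypothetical helper, and a band name that contains a comma would be mangled, so the result is best treated as an extra search candidate rather than a final answer.

```python
def flip_sort_name(name):
    """Turn 'Last, First' into 'First Last'; leave anything else unchanged."""
    parts = [part.strip() for part in name.split(",")]
    # Only flip when there is exactly one comma and both sides are non-empty.
    if len(parts) == 2 and all(parts):
        return "{} {}".format(parts[1], parts[0])
    return name
```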

### ✅ Collaboration Detection
- **Primary Patterns**: "ft.", "feat.", "featuring" (always collaborations)
- **Secondary Patterns**: "&", "and", "," (intelligent detection)
- **Band Name Protection**: 200+ known band names from `data/known_artists.json`
- **Complex Collaborations**: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- **Case Insensitive**: "Featuring" → "featuring"
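
The rules above can be sketched roughly as follows. This is an illustrative approximation, not the project's implementation: `split_collaboration` is a hypothetical function, the two-entry `KNOWN_BANDS` set stands in for the 200+ names in `data/known_artists.json`, and sort-name inputs such as "Last, First & Someone" are not handled here.

```python
import re

# Primary patterns always signal a collaboration.
PRIMARY = re.compile(r"\s+(?:ft\.?|feat\.?|featuring)\s+", re.IGNORECASE)
# Secondary patterns are ambiguous and need the band-name check first.
SECONDARY = re.compile(r"\s*(?:,|&|\band\b)\s*", re.IGNORECASE)
# Tiny stand-in for the protected band names (assumption for this sketch).
KNOWN_BANDS = {"simon & garfunkel", "hall & oates"}


def split_collaboration(artist):
    """Return (main_artist, collaborators) for an artist credit string."""
    text = artist.strip()
    if text.lower() in KNOWN_BANDS:
        return text, []  # protected band name, never split
    match = PRIMARY.search(text)
    if match:
        main = text[:match.start()].strip()
        rest = text[match.end():]
        guests = [g.strip() for g in SECONDARY.split(rest) if g.strip()]
        return main, guests
    pieces = [p.strip() for p in SECONDARY.split(text) if p.strip()]
    if len(pieces) > 1:
        return pieces[0], pieces[1:]
    return text, []
```

Checking the protected names before any splitting is what keeps "Simon & Garfunkel" intact while "Pitbull ft. Ne-Yo, Afrojack & Nayer" is broken into its four artists.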

### ✅ Song Title Fixes
- `Shot In The Dark` → `Shot in the Dark`
- Removes `(Karaoke Version)`, `(Instrumental)` suffixes
- Normalizes capitalization and formatting
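
Suffix removal of this kind is commonly done with a regular expression. The sketch below is an assumption about how it could work, not the tool's actual code: `strip_title_suffixes` is a hypothetical name, and only the two suffixes listed above are covered. Capitalization fixes are not attempted here.

```python
import re

# Matches a trailing "(Karaoke Version)" or "(Instrumental)", in any letter case.
SUFFIX = re.compile(r"\s*\((?:karaoke version|instrumental)\)\s*$", re.IGNORECASE)


def strip_title_suffixes(title):
    """Remove known trailing suffixes, repeating until none are left."""
    cleaned = title.strip()
    previous = None
    while cleaned != previous:
        previous = cleaned
        cleaned = SUFFIX.sub("", cleaned).strip()
    return cleaned
```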

### ✅ Added Data
- **`mbid`**: Official MusicBrainz Artist ID
- **`recording_mbid`**: Official MusicBrainz Recording ID

### ✅ Preserves Your Data
- Keeps all your existing fields (guid, path, disabled, favorite, etc.)
- Only adds new fields, never removes existing ones

### 🆕 Advanced Fuzzy Search
- **Intelligent Matching**: Finds similar names even with typos or variations
- **Similarity Scoring**: Shows how well each match scores (0.0 to 1.0)
- **Configurable Thresholds**: Adjust matching sensitivity
- **Multiple Algorithms**: Uses ratio, partial ratio, and token sort matching
- **Enhanced Search Fields**: artist.name, artist_alias.name, artist.sort_name
- **Dash Handling**: Matches names across the regular hyphen (-) and the Unicode hyphen (‐)
- **Substring Protection**: Avoids false matches like "Sleazy-E" vs "Eazy-E"
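
To make the scoring idea concrete, here is a minimal sketch of threshold-based matching on a 0.0 to 1.0 scale. It deliberately swaps in the standard library's `difflib.SequenceMatcher` in place of FuzzyWuzzy so that it runs without extra dependencies; the function names, the dash table, and the 0.8 threshold are all assumptions for illustration.

```python
import difflib

# Unicode dash variants folded into the ASCII hyphen before comparison.
DASHES = {"\u2010": "-", "\u2011": "-", "\u2013": "-", "\u2014": "-"}


def normalize(name):
    """Lowercase, trim, and unify dash characters."""
    text = name.strip().lower()
    for dash, replacement in DASHES.items():
        text = text.replace(dash, replacement)
    return text


def similarity(left, right):
    """Return a similarity score between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, normalize(left), normalize(right)).ratio()


def best_match(query, candidates, threshold=0.8):
    """Return (name, score) for the best candidate at or above the threshold."""
    scored = [(similarity(query, candidate), candidate) for candidate in candidates]
    if not scored:
        return None
    score, name = max(scored)
    return (name, score) if score >= threshold else None
```

A plain ratio like this would still rate "Sleazy-E" fairly close to "Eazy-E", which is why an additional substring check of the kind listed above is useful on top of the score.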

### 🆕 Edge Case Support
- **Hyphenated Artists**: "Blink-182", "Ne-Yo", "G-Eazy"
- **Exclamation Marks**: "P!nk", "Panic! At The Disco", "3OH!3"
- **Numbers**: "98 Degrees", "S Club 7", "3 Doors Down"
- **Special Characters**: "a-ha", "The B-52s", "Salt-N-Pepa"

### 🆕 Simplified Processing
- **Default Behavior**: Process all songs by default (no special flags needed)
- **Separate Output Files**: Successful and failed songs saved to different files
- **Progress Tracking**: Real-time progress with song counter and status
- **Smart Defaults**: Sensible defaults for all file paths and options
- **Detailed Reporting**: Comprehensive statistics and processing report
- **Batch Processing**: Efficient handling of large song collections

## 📖 Usage Examples

### Basic Usage (Default)
```bash
# Process all songs with default settings (data/songs.json)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
# Output: data/songs-success.json and data/songs-failure.json
```

### Custom Source File
```bash
# Process specific file
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/my_songs.json
# Output: data/my_songs-success.json and data/my_songs-failure.json
```
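
The default output names in the two examples above follow a simple pattern: the source file's stem plus `-success.json` or `-failure.json`, in the same directory. A minimal sketch of that naming rule, assuming it holds for any source path (`default_output_paths` is a hypothetical helper, not part of the CLI):

```python
from pathlib import Path


def default_output_paths(source):
    """Derive the success and failure file paths from the source path."""
    src = Path(source)
    success = src.with_name(src.stem + "-success.json")
    failure = src.with_name(src.stem + "-failure.json")
    return success, failure
```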

### Custom Output Files
```bash
# Specify custom output files
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json --output-success cleaned.json --output-failure failed.json
```

### Limit Processing
```bash
# Process only first 1000 songs
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --limit 1000
```

### Force API Mode
```bash
# Use HTTP API instead of database (slower but works without PostgreSQL)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --use-api
```

### Test Connections
```bash
# Test database connection
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection

# Test with API mode
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection --use-api
```

### Help
```bash
# Show usage information
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --help
```

## 📁 Data Files

### Input Format
Your JSON file should contain an array of song objects:

```json
[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  },
  {
    "artist": "Bruno Mars ft. Cardi B",
    "title": "Finesse Remix",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
  }
]
```
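
Before a long run it can help to check that a file has this shape. The sketch below is one way to do that with the standard library; `validate_songs` is a hypothetical helper, and treating `artist` and `title` as the required fields is an assumption, since the README does not list which fields are mandatory.

```python
import json

# Assumed minimum fields for a song object (not confirmed by the tool).
REQUIRED_FIELDS = ("artist", "title")


def validate_songs(text):
    """Parse JSON text and return (valid_songs, problems)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [], ["invalid JSON: {}".format(exc)]
    if not isinstance(data, list):
        return [], ["top-level value must be an array"]
    songs, problems = [], []
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append("item {} is not an object".format(index))
            continue
        missing = [field for field in REQUIRED_FIELDS if not item.get(field)]
        if missing:
            problems.append("item {} is missing: {}".format(index, ", ".join(missing)))
        else:
            songs.append(item)
    return songs, problems
```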

## 📤 Output Format

The tool creates **three main output files** (the example run later in this README also lists a JSON copy of the processing report):

### 1. Successful Songs (`source-success.json`)
Array of successfully processed songs with MBIDs added:

```json
[
  {
    "artist": "AC/DC",
    "title": "Shot in the Dark",
    "disabled": false,
    "favorite": true,
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
    "recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
  },
  {
    "artist": "Bruno Mars feat. Cardi B",
    "title": "Finesse (remix)",
    "disabled": false,
    "favorite": false,
    "guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
    "path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
    "mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
    "recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
  }
]
```

### 2. Failed Songs (`source-failure.json`)
Array of songs that couldn't be processed (same format as source):

```json
[
  {
    "artist": "Unknown Artist",
    "title": "Unknown Song",
    "disabled": false,
    "favorite": false,
    "guid": "12345678-1234-1234-1234-123456789012",
    "path": "z://MP4\\Unknown Artist - Unknown Song.mp4"
  }
]
```

### 3. Processing Report (`processing_report_YYYYMMDD_HHMMSS.txt`)
Human-readable text report with statistics and failed song list:

```
MusicBrainz Data Cleaner - Processing Report
==================================================

Source File: data/songs.json
Processing Date: 2024-12-19 14:30:22
Processing Time: 15263.3 seconds

SUMMARY
--------------------
Total Songs Processed: 49,170
Successful Songs: 40,692
Failed Songs: 8,478
Success Rate: 82.8%

DETAILED STATISTICS
--------------------
Artists Found: 44,526/49,170 (90.6%)
Recordings Found: 40,998/49,170 (83.4%)
Processing Speed: 3.2 songs/second

OUTPUT FILES
--------------------
Successful Songs: data/songs-success.json
Failed Songs: data/songs-failure.json
Report File: data/processing_report_20241219_143022.txt

FAILED SONGS (First 50)
--------------------
1. Unknown Artist - Unknown Song
2. Invalid Artist - Invalid Title
3. Test Artist - Test Song
...
```
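
The summary figures in the sample report follow directly from the raw counts. As a quick worked check, the sketch below recomputes them; `percent` and `summarize` are hypothetical helpers written for this example, and rounding to one decimal place is an assumption based on how the report prints its values.

```python
def percent(part, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100.0 * part / total, 1) if total else 0.0


def summarize(total, successful, seconds):
    """Recompute the headline numbers shown in the processing report."""
    failed = total - successful
    return {
        "failed": failed,
        "success_rate": percent(successful, total),
        "failure_rate": percent(failed, total),
        "songs_per_second": round(total / seconds, 1) if seconds else 0.0,
    }


# Counts taken from the sample report above.
report = summarize(total=49170, successful=40692, seconds=15263.3)
print(report)
```

With the sample counts this reproduces 8,478 failed songs, an 82.8% success rate, and about 3.2 songs per second.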

## 🎬 Example Run

### Basic Processing
```bash
$ docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main

🚀 Starting song processing...
📊 Total songs to process: 49,170
Using database connection
==================================================

[1 of 49,170] ✅ PASS: ACDC - Shot In The Dark
[2 of 49,170] ❌ FAIL: Unknown Artist - Unknown Song
[3 of 49,170] ✅ PASS: Bruno Mars feat. Cardi B - Finesse (remix)
[4 of 49,170] ✅ PASS: Taylor Swift - Love Story
...

📈 Progress: 100/49,170 (0.2%) - Success: 85.0% - Rate: 3.2 songs/sec
📈 Progress: 200/49,170 (0.4%) - Success: 87.5% - Rate: 3.1 songs/sec
...

==================================================
🎉 Processing completed!
📊 Final Results:
⏱️ Total processing time: 15263.3 seconds
🚀 Average speed: 3.2 songs/second
✅ Artists found: 44,526/49,170 (90.6%)
✅ Recordings found: 40,998/49,170 (83.4%)
❌ Failed songs: 8,478 (17.2%)
📄 Files saved:
✅ Successful songs: data/songs-success.json
❌ Failed songs: data/songs-failure.json
📋 Text report: data/processing_report_20241219_143022.txt
📊 JSON report: data/processing_report_20241219_143022.json
```

### Limited Processing
```bash
$ docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --limit 1000

⚠️ Limiting processing to first 1000 songs
🚀 Starting song processing...
📊 Total songs to process: 1,000
Using database connection
==================================================

[1 of 1,000] ✅ PASS: ACDC - Shot In The Dark
[2 of 1,000] ❌ FAIL: Unknown Artist - Unknown Song
...

==================================================
🎉 Processing completed!
📊 Final Results:
⏱️ Total processing time: 312.5 seconds
🚀 Average speed: 3.2 songs/second
✅ Artists found: 856/1,000 (85.6%)
✅ Recordings found: 789/1,000 (78.9%)
❌ Failed songs: 211 (21.1%)
📄 Files saved:
✅ Successful songs: data/songs-success.json
❌ Failed songs: data/songs-failure.json
📋 Text report: data/processing_report_20241219_143022.txt
📊 JSON report: data/processing_report_20241219_143022.json
```

## 🔧 Troubleshooting

### "Could not find artist"
- The artist might not be in the MusicBrainz database
- Try checking the spelling or using a different variation
- The search index might still be building (wait a few minutes)
- Check the fuzzy search similarity score and lower the threshold if needed
- **NEW**: Check for artist aliases (e.g., "98 Degrees" → "98°")
- **NEW**: Check for sort names (e.g., "Corby, Matt" → "Matt Corby")

### "Could not find recording"
- The song might not be in the database
- The title might not match exactly
- Try a simpler title (remove extra words)
- Check the fuzzy search similarity score and lower the threshold if needed
- **NEW**: For collaborations, check if it's stored under the main artist

### Connection errors
- **Database**: Make sure PostgreSQL is running and accessible
- **API**: Make sure your MusicBrainz server is running on `http://localhost:5001`
- Check that Docker containers are up and running
- Verify the server is accessible in your browser
- **NEW**: For Docker, use the service name `db` instead of localhost or a hardcoded container IP

### JSON errors
- Make sure your input file is valid JSON
- Check that it contains an array of objects
- Verify all required fields are present

### Performance issues
- Use database mode instead of API mode for better performance
- Ensure database indexes are built for faster queries
- Check fuzzy search thresholds: higher thresholds mean fewer but more accurate matches

### Collaboration detection issues
- **NEW**: Check if it's a band name vs collaboration (e.g., "Simon & Garfunkel" vs "Lovato, Demi & Joe Jonas")
- **NEW**: Verify the collaboration pattern is supported (ft., feat., featuring, &, and, ,)
- **NEW**: Letter case should not matter, since patterns are case-insensitive

### Using Tests for Troubleshooting
- **FIRST STEP**: Check `src/tests/` directory for existing test files that might help
- **DEBUG SCRIPTS**: Run `python3 src/tests/debug_artist_search.py` for artist search issues
- **COLLABORATION ISSUES**: Check `src/tests/test_failed_collaborations.py` for collaboration examples
- **DATABASE ISSUES**: Look at `src/tests/test_simple_query.py` for database connection patterns
- **WORKING EXAMPLES**: Test files often contain working code that can be adapted for your issue

## 🎯 Use Cases

- **Karaoke Systems**: Clean up song metadata for better search and organization
- **Music Libraries**: Standardize artist names and add official IDs
- **Music Apps**: Ensure consistent data across your application
- **Data Migration**: Clean up legacy music data when moving to new systems
- **Fuzzy Matching**: Handle typos and variations in artist/song names
- **NEW**: **Collaboration Handling**: Process complex artist collaborations
- **NEW**: **Edge Cases**: Handle artists with special characters and unusual names

## 📚 What are MBIDs?

**MBID** stands for **MusicBrainz Identifier**. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.

**Benefits:**
- **Permanent**: Never change, even if names change
- **Universal**: Used across many music applications
- **Reliable**: Official identifiers from the MusicBrainz database
- **Linked Data**: Connect to other music databases and services

## 🆕 Performance Comparison

| Method | Speed | Rate Limiting | Fuzzy Search | Setup Complexity |
|--------|-------|---------------|--------------|------------------|
| **Database** | ⚡ 10x faster | ❌ None | ✅ Yes | 🔧 Medium |
| **API** | 🐌 Slower | ⏱️ Yes (0.1s delay) | ❌ No | ✅ Easy |

## 🆕 Collaboration Detection Examples

| Input | Type | Detection | Output |
|-------|------|-----------|---------|
| `Bruno Mars ft. Cardi B` | Collaboration | ✅ Primary pattern | `Bruno Mars feat. Cardi B` |
| `Pitbull ft. Ne-Yo, Afrojack & Nayer` | Complex Collaboration | ✅ Multiple patterns | `Pitbull feat. Ne-Yo, Afrojack & Nayer` |
| `Simon & Garfunkel` | Band Name | ❌ Protected | `Simon & Garfunkel` |
| `Lovato, Demi & Joe Jonas` | Collaboration | ✅ Comma detection | `Lovato, Demi & Joe Jonas` |
| `Hall & Oates` | Band Name | ❌ Protected | `Hall & Oates` |

## 🆕 Edge Case Examples

| Input | Type | Handling | Output |
|-------|------|----------|---------|
| `ACDC` | Name Variation | ✅ Alias lookup | `AC/DC` |
| `98 Degrees` | Artist Alias | ✅ Alias search | `98°` |
| `S Club 7` | Numerical Suffix | ✅ Suffix removal | `S Club` |
| `Corby, Matt` | Sort Name | ✅ Sort name search | `Matt Corby` |
| `Blink-182` | Dash Variation | ✅ Unicode dash handling | `blink‐182` |
| `P!nk` | Special Characters | ✅ Direct search | `P!nk` |
| `3OH!3` | Numbers + Special | ✅ Direct search | `3OH!3` |

## 🤝 Contributing

Found a bug or have a feature request?

1. Check the existing issues
2. Create a new issue with details
3. Include sample data if possible

## 📄 License

This tool is provided as-is for educational and personal use.

## 🔗 Related Links

- [MusicBrainz](https://musicbrainz.org/) - The open music encyclopedia
- [MusicBrainz API](https://musicbrainz.org/doc/Development) - API documentation
- [MusicBrainz Docker](https://github.com/metabrainz/musicbrainz-docker) - Docker setup
- [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy string matching library

## 📝 Lessons Learned

### Database Integration
- **Direct PostgreSQL access is 10x faster** than API calls
- **Docker networking** requires the Docker service name (`db`), not localhost or a hardcoded container IP
- **Database name matters**: `musicbrainz_db` not `musicbrainz`
- **Static caches cause problems**: Wrong MBIDs override correct database lookups

### Collaboration Handling
- **Primary patterns** (ft., feat.) are always collaborations
- **Secondary patterns** (&, and) require intelligence to distinguish from band names
- **Comma detection** helps identify collaborations
- **Artist credit lookup** is essential for preserving all collaborators

### Edge Cases
- **Dash variations** (regular vs Unicode) cause exact match failures
- **Artist aliases** are common and important (98 Degrees → 98°)
- **Sort names** handle "Last, First" formats
- **Numerical suffixes** in names need special handling (S Club 7 → S Club)

### Performance Optimization
- **Remove static caches** for better accuracy
- **Database-first approach** ensures live data
- **Fuzzy search thresholds** need tuning for different datasets
- **Connection pooling** would improve performance for large datasets

### CLI Design
- **Simplified interface** with smart defaults reduces complexity
- **Array format consistency** makes output files easier to work with
- **Human-readable reports** improve user experience
- **Test file organization** keeps project structure clean

---

**Happy cleaning! 🎵✨**