# 🎵 MusicBrainz Data Cleaner v3.0
A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!**
## 🚀 Quick Start for New Sessions
**If you're starting fresh or after a reboot, follow this exact sequence:**
### 1. Start MusicBrainz Services
```bash
# Quick restart (recommended)
./restart_services.sh
# Or full restart (if you have issues)
./start_services.sh
```
### 2. Wait for Services to Initialize
- **Database**: 5-10 minutes to fully load
- **Web server**: 2-3 minutes to start responding
- **Check status**: `cd ../musicbrainz-docker && docker-compose ps`
### 3. Verify Services Are Ready
```bash
# Test web server
curl -s http://localhost:5001 | head -5
# Test database (should show 2.6M+ artists)
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
# Test cleaner connection
docker-compose run --rm musicbrainz-cleaner python3 -c "from src.api.database import MusicBrainzDatabase; db = MusicBrainzDatabase(); print('Connection result:', db.connect())"
```
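Because the database and web server can take several minutes to come up, it can help to poll instead of retrying by hand. The following is a minimal sketch, not part of the cleaner: `wait_until` and `web_server_ready` are hypothetical helper names, and the URL is the `localhost:5001` address used in the checks above.

```python
import time
import urllib.error
import urllib.request


def wait_until(check, timeout=600.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def web_server_ready(url="http://localhost:5001", timeout=5.0):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


# Example (not executed here): wait up to 10 minutes for the web server.
# ready = wait_until(web_server_ready, timeout=600, interval=5)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.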
### 4. Run Tests
```bash
# Test 100 random songs
docker-compose run --rm musicbrainz-cleaner python3 test_100_random.py
# Or other test scripts
docker-compose run --rm musicbrainz-cleaner python3 [script_name].py
```
**⚠️ Important**: Always run scripts via Docker - the cleaner cannot connect to the database directly from outside the container.
**📋 Troubleshooting**: See `TROUBLESHOOTING.md` for common issues and solutions.
## ✨ What's New in v3.0
- **🚀 Direct Database Access**: Connect directly to PostgreSQL for 10x faster performance
- **🎯 Advanced Fuzzy Search**: Intelligent matching for similar artist names and song titles
- **🔄 Automatic Fallback**: Falls back to API mode if database access fails
- **⚡ No Rate Limiting**: Database queries don't have API rate limits
- **📊 Similarity Scoring**: See how well matches are scored
- **🆕 Collaboration Detection**: Intelligently handle complex collaborations like "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- **🆕 Artist Aliases**: Handle name variations like "98 Degrees" → "98°" and "S Club 7" → "S Club"
- **🆕 Sort Names**: Handle "Last, First" formats like "Corby, Matt" → "Matt Corby"
- **🆕 Edge Case Handling**: Support for artists with hyphens, exclamation marks, numbers, and special characters
- **🆕 Band Name Protection**: Distinguish between band names (Simon & Garfunkel) and collaborations (Lovato, Demi & Joe Jonas)
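The automatic fallback from database mode to API mode can be pictured roughly as follows. This is an illustrative sketch only: `lookup_with_fallback` and the backend callables are hypothetical, and real code would catch the driver-specific exceptions (for example those raised by `psycopg2` or `requests`) rather than the built-in `ConnectionError` used here.

```python
def lookup_with_fallback(artist, backends):
    """Try each (name, lookup) pair in order.

    Returns (backend_name, result) for the first backend that does not
    raise ConnectionError. Raises ConnectionError if every backend fails.
    """
    errors = []
    for name, lookup in backends:
        try:
            return name, lookup(artist)
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    raise ConnectionError("all backends failed: " + "; ".join(errors))


# Hypothetical backends for illustration only.
def database_lookup(artist):
    raise ConnectionError("connection refused")


def api_lookup(artist):
    return {"artist": artist, "source": "api"}


backend, result = lookup_with_fallback(
    "AC/DC", [("database", database_lookup), ("api", api_lookup)]
)
print(backend, result)
```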
## ✨ What It Does
**Before:**
```json
{
"artist": "ACDC",
"title": "Shot In The Dark",
"favorite": true
}
```
**After:**
```json
{
"artist": "AC/DC",
"title": "Shot in the Dark",
"favorite": true,
"mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
"recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
}
```
## 🚀 Quick Start
### Option 1: Automated Setup (Recommended)
1. **Start MusicBrainz services**:
```bash
./start_services.sh
```
This script will:
- Check for Docker and port conflicts
- Start all MusicBrainz services
- Wait for database initialization
- Create environment configuration
- Test the connection
2. **Run the cleaner**:
```bash
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json --output-success cleaned_songs.json
```
### Option 2: Manual Setup
1. **Start MusicBrainz services manually**:
```bash
cd ../musicbrainz-docker
MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
```
Wait 5-10 minutes for database initialization.
2. **Create environment configuration**:
```bash
# Create .env file in musicbrainz-cleaner directory
cat > .env << EOF
DB_HOST=db
DB_PORT=5432
DB_NAME=musicbrainz_db
DB_USER=musicbrainz
DB_PASSWORD=musicbrainz
MUSICBRAINZ_WEB_SERVER_PORT=5001
EOF
```
3. **Run the cleaner**:
```bash
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json --output-success cleaned_songs.json
```
For detailed setup instructions, see [SETUP.md](SETUP.md).
## 🔄 After System Reboot
After restarting your Mac, you'll need to restart the MusicBrainz services:
### Quick Restart (Recommended)
```bash
# If Docker Desktop is already running
./restart_services.sh
# Or manually
cd ../musicbrainz-docker && MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
```
### Full Restart (If you have issues)
```bash
# Complete setup including Docker checks
./start_services.sh
```
### Auto-start Setup (Optional)
1. **Enable Docker Desktop auto-start**:
- Open Docker Desktop
- Go to Settings → General
- Check "Start Docker Desktop when you log in"
2. **Then just run**: `./restart_services.sh` after each reboot
**Note**: Your data is preserved in Docker volumes, so you don't need to reconfigure anything after a reboot.
## 🚨 Common Startup Issues & Fixes
### Issue 1: Database Connection Refused
**Problem**: Cleaner can't connect to database with error "Connection refused"
**Root Cause**: Database container not fully initialized or wrong host configuration
**Fix**:
```bash
# Wait for database to be ready (check logs)
cd ../musicbrainz-docker && docker-compose logs db | tail -10
# Verify database is accepting connections
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
```
### Issue 2: Wrong Database Host Configuration
**Problem**: Cleaner tries to connect to `172.18.0.2` but can't reach it
**Root Cause**: Hardcoded IP address in database connection
**Fix**: Use Docker service name `db` instead of IP address
```python
# In src/api/database.py, change:
host='172.18.0.2' # ❌ Wrong
host='db' # ✅ Correct
```
### Issue 3: Test Script Logic Error
**Problem**: Test shows 0% success rate despite finding artists
**Root Cause**: Test script checking `'mbid' in result` where `result` is a tuple `(song_dict, success_boolean)`
**Fix**: Extract song dictionary from tuple
```python
# Wrong:
artist_found = 'mbid' in result
# Correct:
cleaned_song, success = result
artist_found = 'mbid' in cleaned_song
```
### Issue 4: Services Not Fully Initialized
**Problem**: API returns empty results even though database has data
**Root Cause**: MusicBrainz web server still starting up
**Fix**: Wait for services to be fully ready
```bash
# Check if web server is responding
curl -s http://localhost:5001 | head -5
# Wait for database to be ready
docker-compose logs db | grep "database system is ready"
```
### Issue 5: Port Conflicts
**Problem**: Port 5000 already in use
**Root Cause**: Another service using the port
**Fix**: Use alternative port
```bash
MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
```
### Issue 6: Container Name Conflicts
**Problem**: "Container name already in use" error
**Root Cause**: Previous containers not properly cleaned up
**Fix**: Remove conflicting containers
```bash
docker-compose down
docker rm -f <container_name>
```
## 🔧 Startup Checklist
Before running tests, verify:
1. Docker Desktop is running
2. All containers are up: `docker-compose ps`
3. Database is ready: `docker-compose logs db | grep "ready"`
4. Web server responds: `curl -s http://localhost:5001`
5. Database has data: `docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"`
6. Cleaner can connect: `docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection`
## 📋 Requirements
- **Python 3.6+**
- **MusicBrainz Server** running on localhost:8080
- **PostgreSQL Database** accessible on localhost:5432
- **Dependencies**: `requests`, `psycopg2-binary`, `fuzzywuzzy`, `python-Levenshtein`
## 🔧 Server Configuration
### Database Access
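- **Host**: `db` (the Docker service name) from inside a container; avoid hardcoding the container IP (for example 172.18.0.2), which can change when containers are recreated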
- **Port**: 5432 (PostgreSQL default)
- **Database**: musicbrainz_db (actual database name)
- **User**: musicbrainz
- **Password**: musicbrainz (default, should be changed in production)
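One way to keep these settings out of the code is to read them from the environment, using the variable names from the `.env` file shown in the manual setup. The sketch below is an assumption about how that could look, not the cleaner's actual configuration code; `database_settings` is a hypothetical helper, and the default host `db` follows the fix described under Issue 2.

```python
import os


def database_settings(env=None):
    """Build connection keyword arguments from DB_* environment variables.

    Defaults mirror the values listed above; the host default `db` is the
    Docker service name and is an assumption about the compose network.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("DB_HOST", "db"),
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env.get("DB_NAME", "musicbrainz_db"),
        "user": env.get("DB_USER", "musicbrainz"),
        "password": env.get("DB_PASSWORD", "musicbrainz"),
    }


print(database_settings({"DB_HOST": "localhost", "DB_PORT": "5433"}))
```

The returned keys match keyword arguments accepted by `psycopg2.connect`, so the dictionary could be passed as `psycopg2.connect(**database_settings())`.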
### HTTP API (Fallback)
- **URL**: http://localhost:8080
- **Endpoint**: /ws/2/
- **Format**: JSON
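For reference, a search request against the `/ws/2/` endpoint can be assembled as below. This is a sketch under assumptions: `artist_search_url` is a hypothetical helper, the base URL is the one listed above, and the `artist:"..."` query form follows the MusicBrainz search syntax. It only builds the URL and does not contact the server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def artist_search_url(name, base_url="http://localhost:8080", limit=5):
    """Return a /ws/2/artist search URL that asks for JSON results."""
    # Escape double quotes so the name stays inside the quoted phrase.
    query = 'artist:"{}"'.format(name.replace('"', '\\"'))
    params = urlencode({"query": query, "fmt": "json", "limit": limit})
    return f"{base_url.rstrip('/')}/ws/2/artist?{params}"


url = artist_search_url("AC/DC")
print(url)
print(parse_qs(urlsplit(url).query))
```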
### Troubleshooting
- **Database Connection Failed**: Check PostgreSQL is running and credentials are correct
- **API Connection Failed**: Check MusicBrainz server is running on port 8080
- **Slow Performance**: Ensure database indexes are built
- **No Results**: Verify data has been imported to the database
- **NEW**: **Docker Networking**: Use the Docker service name `db` for container-to-container connections; a hardcoded container IP such as 172.18.0.2 can change when containers are recreated
- **NEW**: **Database Name**: Ensure using `musicbrainz_db` not `musicbrainz`
## 🧪 Testing
### Test File Organization
- **REQUIRED**: All test files must be placed in `src/tests/` directory
- **PROHIBITED**: Test files should not be placed in the root directory
- **Naming Convention**: Test files should follow `test_*.py` or `debug_*.py` patterns
- **Purpose**: Keeps root directory clean and organizes test code properly
### Running Tests
```bash
# Run all tests
python3 src/tests/run_tests.py
# Run specific test categories
python3 src/tests/run_tests.py --unit # Unit tests only
python3 src/tests/run_tests.py --integration # Integration tests only
# Run specific test module
python3 src/tests/run_tests.py test_data_loader
python3 src/tests/run_tests.py test_cli
# List all available tests
python3 src/tests/run_tests.py --list
```
#### Test Categories
- **Unit Tests**: Test individual components in isolation
- **Integration Tests**: Test interactions between components and database
- **Debug Tests**: Debug scripts and troubleshooting tools
## 📁 Project Structure
```
musicbrainz-cleaner/
├── src/
│ ├── api/ # Database and API access
│ ├── cli/ # Command-line interface
│ ├── config/ # Configuration and constants
│ ├── core/ # Core functionality
│ ├── tests/ # Test files (REQUIRED location)
│ └── utils/ # Utility functions
├── data/ # Data files and output
│ ├── known_artists.json # Name variations (ACDC → AC/DC)
│ ├── known_recordings.json # Known recording MBIDs
│ └── songs.json # Source songs file
└── docker-compose.yml # Docker configuration
```
### Data Files
The tool uses external JSON files for name variations:
- **`data/known_artists.json`**: Contains name variations (ACDC → AC/DC, ft. → feat.)
- **`data/known_recordings.json`**: Contains known recording MBIDs for common songs
These files can be easily updated without touching the code, making it simple to add new name variations.
## 🎯 Features
### ✅ Artist Name Fixes
- `ACDC` → `AC/DC`
- `Bruno Mars ft. Cardi B` → `Bruno Mars feat. Cardi B`
- `featuring` → `feat.`
- `98 Degrees` → `98°` (artist aliases)
- `S Club 7` → `S Club` (numerical suffixes)
- `Corby, Matt` → `Matt Corby` (sort names)
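Two of these fixes are simple enough to sketch as string transformations. The helpers below are hypothetical and only approximate what the cleaner does; in particular, `flip_sort_name` is meant for a single artist name (after any collaboration has been split apart), since a comma can also separate collaborators.

```python
import re

# Matches "ft", "ft.", "feat", "feat." or "featuring" surrounded by spaces.
FEATURING = re.compile(r"\s+(?:ft\.?|feat\.?|featuring)\s+", re.IGNORECASE)


def normalize_featuring(artist):
    """Rewrite every featuring marker as ' feat. '."""
    return FEATURING.sub(" feat. ", artist)


def flip_sort_name(artist):
    """Turn 'Last, First' into 'First Last' when there is exactly one comma."""
    parts = [part.strip() for part in artist.split(",")]
    if len(parts) == 2 and all(parts):
        return f"{parts[1]} {parts[0]}"
    return artist


print(normalize_featuring("Bruno Mars ft. Cardi B"))
print(flip_sort_name("Corby, Matt"))
```

Alias fixes such as `98 Degrees` → `98°` cannot be derived this way; they depend on lookups against alias data.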
### ✅ Collaboration Detection
- **Primary Patterns**: "ft.", "feat.", "featuring" (always collaborations)
- **Secondary Patterns**: "&", "and", "," (intelligent detection)
- **Band Name Protection**: 200+ known band names from `data/known_artists.json`
- **Complex Collaborations**: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
- **Case Insensitive**: "Featuring" matches "featuring"
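The interplay of primary patterns, secondary patterns, and band name protection can be sketched as follows. This is a simplified illustration, not the cleaner's implementation: `split_collaboration` is a hypothetical function, and the two-entry `KNOWN_BANDS` set stands in for the list loaded from `data/known_artists.json`. Sort-name collaborations need extra handling that is not shown here.

```python
import re

# Primary markers always indicate a collaboration.
PRIMARY = re.compile(r"\s+(?:ft\.?|feat\.?|featuring)\s+", re.IGNORECASE)
# Secondary separators are only trusted when the name is not a known band.
SECONDARY = re.compile(r"\s*(?:,|&|\band\b)\s*", re.IGNORECASE)

# Stand-in for the band names stored in data/known_artists.json.
KNOWN_BANDS = {"simon & garfunkel", "hall & oates"}


def split_collaboration(artist, known_bands=KNOWN_BANDS):
    """Return (main_artist, [collaborators])."""
    name = artist.strip()
    if name.lower() in known_bands:
        return name, []  # protected band name, never split
    parts = PRIMARY.split(name, maxsplit=1)
    if len(parts) == 2:
        main, rest = parts
        return main.strip(), [p for p in SECONDARY.split(rest) if p]
    pieces = [p for p in SECONDARY.split(name) if p]
    if not pieces:
        return name, []
    return pieces[0], pieces[1:]


print(split_collaboration("Pitbull ft. Ne-Yo, Afrojack & Nayer"))
print(split_collaboration("Simon & Garfunkel"))
```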
### ✅ Song Title Fixes
- `Shot In The Dark` → `Shot in the Dark`
- Removes `(Karaoke Version)`, `(Instrumental)` suffixes
- Normalizes capitalization and formatting
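Suffix removal can be illustrated with a small regular expression. The sketch below assumes only the two suffixes named above; `strip_version_suffix` is a hypothetical helper. Capitalization fixes such as `Shot in the Dark` are not covered here, since they depend on the title stored in MusicBrainz.

```python
import re

# Trailing "(Karaoke Version)" or "(Instrumental)", in any letter case.
SUFFIX = re.compile(r"\s*\((?:karaoke version|instrumental)\)\s*$", re.IGNORECASE)


def strip_version_suffix(title):
    """Remove trailing version suffixes, repeating until none are left."""
    previous = None
    while previous != title:
        previous = title
        title = SUFFIX.sub("", title)
    return title.strip()


print(strip_version_suffix("Shot In The Dark (Karaoke Version)"))
```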
### ✅ Added Data
- **`mbid`**: Official MusicBrainz Artist ID
- **`recording_mbid`**: Official MusicBrainz Recording ID
### ✅ Preserves Your Data
- Keeps all your existing fields (guid, path, disabled, favorite, etc.)
- Only adds new fields, never removes existing ones
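In code terms, this amounts to copying the song dictionary and adding keys to the copy. A minimal sketch, with `add_mbids` as a hypothetical helper name:

```python
def add_mbids(song, artist_mbid=None, recording_mbid=None):
    """Return a copy of song with MBID fields added; no key is removed."""
    cleaned = dict(song)  # shallow copy, so the input is left untouched
    if artist_mbid:
        cleaned["mbid"] = artist_mbid
    if recording_mbid:
        cleaned["recording_mbid"] = recording_mbid
    return cleaned


song = {"artist": "AC/DC", "title": "Shot in the Dark", "favorite": True}
print(add_mbids(song, artist_mbid="66c662b6-6e2f-4930-8610-912e24c63ed1"))
```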
### 🆕 Advanced Fuzzy Search
- **Intelligent Matching**: Finds similar names even with typos or variations
- **Similarity Scoring**: Shows how well each match scores (0.0 to 1.0)
- **Configurable Thresholds**: Adjust matching sensitivity
- **Multiple Algorithms**: Uses ratio, partial ratio, and token sort matching
- **Enhanced Search Fields**: artist.name, artist_alias.name, artist.sort_name
- **Dash Handling**: Regular dash (-) vs Unicode hyphen (‐, U+2010)
- **Substring Protection**: Avoids false matches like "Sleazy-E" vs "Eazy-E"
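The ideas of dash normalization, a 0.0 to 1.0 similarity score, and a configurable threshold can be sketched with the standard library. Note that this example swaps in `difflib.SequenceMatcher` for the `fuzzywuzzy` scorers the tool depends on, so the scores will differ from the tool's; the function names and the `0.9` threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Unicode dash variants mapped to the plain ASCII hyphen.
DASHES = {"\u2010": "-", "\u2011": "-", "\u2013": "-", "\u2014": "-"}


def normalize(name):
    """Lowercase the name and replace Unicode dashes with '-'."""
    for dash, plain in DASHES.items():
        name = name.replace(dash, plain)
    return name.casefold().strip()


def similarity(a, b):
    """Whole-string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query, candidates, threshold=0.9):
    """Return (candidate, score) for the best match, or None below threshold."""
    scored = [(similarity(query, c), c) for c in candidates]
    if not scored:
        return None
    score, candidate = max(scored, key=lambda pair: pair[0])
    return (candidate, score) if score >= threshold else None


print(best_match("Blink-182", ["blink\u2010182", "P!nk"]))
print(best_match("Sleazy-E", ["Eazy-E"]))
```

Scoring the whole string rather than the best-matching substring is what keeps "Sleazy-E" from being accepted as "Eazy-E" at this threshold.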
### 🆕 Edge Case Support
- **Hyphenated Artists**: "Blink-182", "Ne-Yo", "G-Eazy"
- **Exclamation Marks**: "P!nk", "Panic! At The Disco", "3OH!3"
- **Numbers**: "98 Degrees", "S Club 7", "3 Doors Down"
- **Special Characters**: "a-ha", "The B-52s", "Salt-N-Pepa"
### 🆕 Simplified Processing
- **Default Behavior**: Process all songs by default (no special flags needed)
- **Separate Output Files**: Successful and failed songs saved to different files
- **Progress Tracking**: Real-time progress with song counter and status
- **Smart Defaults**: Sensible defaults for all file paths and options
- **Detailed Reporting**: Comprehensive statistics and processing report
- **Batch Processing**: Efficient handling of large song collections
## 📖 Usage Examples
### Basic Usage (Default)
```bash
# Process all songs with default settings (data/songs.json)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
# Output: data/songs-success.json and data/songs-failure.json
```
### Custom Source File
```bash
# Process specific file
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/my_songs.json
# Output: data/my_songs-success.json and data/my_songs-failure.json
```
### Custom Output Files
```bash
# Specify custom output files
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/songs.json --output-success cleaned.json --output-failure failed.json
```
### Limit Processing
```bash
# Process only first 1000 songs
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --limit 1000
```
### Force API Mode
```bash
# Use HTTP API instead of database (slower but works without PostgreSQL)
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --use-api
```
### Test Connections
```bash
# Test database connection
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection
# Test with API mode
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection --use-api
```
### Help
```bash
# Show usage information
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --help
```
## 📁 Data Files
### Input Format
Your JSON file should contain an array of song objects:
```json
[
{
"artist": "ACDC",
"title": "Shot In The Dark",
"disabled": false,
"favorite": true,
"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
"path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"
},
{
"artist": "Bruno Mars ft. Cardi B",
"title": "Finesse Remix",
"disabled": false,
"favorite": false,
"guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
"path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4"
}
]
```
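A quick pre-flight check of an input file might look like the sketch below. Treating `artist` and `title` as the required fields is an assumption based on the examples in this README, and `parse_songs` is a hypothetical helper rather than part of the tool.

```python
import json

# Assumed minimum fields, based on the examples in this README.
REQUIRED_FIELDS = ("artist", "title")


def parse_songs(text):
    """Parse JSON text and check it is an array of song objects."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of song objects")
    for index, song in enumerate(data):
        if not isinstance(song, dict):
            raise ValueError(f"item {index} is not an object")
        missing = [field for field in REQUIRED_FIELDS if field not in song]
        if missing:
            raise ValueError(f"item {index} is missing: {', '.join(missing)}")
    return data


songs = parse_songs('[{"artist": "ACDC", "title": "Shot In The Dark"}]')
print(len(songs), "song(s) loaded")
```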
## 📤 Output Format
The tool creates **three output files**:
### 1. Successful Songs (`source-success.json`)
Array of successfully processed songs with MBIDs added:
```json
[
{
"artist": "AC/DC",
"title": "Shot in the Dark",
"disabled": false,
"favorite": true,
"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
"path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
"mbid": "66c662b6-6e2f-4930-8610-912e24c63ed1",
"recording_mbid": "cf8b5cd0-d97c-413d-882f-fc422a2e57db"
},
{
"artist": "Bruno Mars feat. Cardi B",
"title": "Finesse (remix)",
"disabled": false,
"favorite": false,
"guid": "946a1077-ab9e-300c-3a72-b1e141e9706f",
"path": "z://MP4\\Bruno Mars ft. Cardi B - Finesse Remix (Karaoke Version).mp4",
"mbid": "afb680f2-b6eb-4cd7-a70b-a63b25c763d5",
"recording_mbid": "8ed14014-547a-4128-ab81-c2dca7ae198e"
}
]
```
### 2. Failed Songs (`source-failure.json`)
Array of songs that couldn't be processed (same format as source):
```json
[
{
"artist": "Unknown Artist",
"title": "Unknown Song",
"disabled": false,
"favorite": false,
"guid": "12345678-1234-1234-1234-123456789012",
"path": "z://MP4\\Unknown Artist - Unknown Song.mp4"
}
]
```
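The default file names follow the source file name, as in `data/my_songs.json` producing `data/my_songs-success.json` and `data/my_songs-failure.json`. A sketch of that naming rule and of splitting results, assuming each result is a `(song_dict, success_boolean)` tuple as described under Issue 3; both function names are hypothetical:

```python
from pathlib import PurePosixPath


def default_output_paths(source):
    """Derive the success and failure file paths from the source path."""
    path = PurePosixPath(source)
    return (
        str(path.with_name(f"{path.stem}-success.json")),
        str(path.with_name(f"{path.stem}-failure.json")),
    )


def split_results(results):
    """Split (song, success) tuples into two lists of songs."""
    successes = [song for song, ok in results if ok]
    failures = [song for song, ok in results if not ok]
    return successes, failures


print(default_output_paths("data/my_songs.json"))
```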
### 3. Processing Report (`processing_report_YYYYMMDD_HHMMSS.txt`)
Human-readable text report with statistics and failed song list:
```
MusicBrainz Data Cleaner - Processing Report
==================================================
Source File: data/songs.json
Processing Date: 2024-12-19 14:30:22
Processing Time: 15263.3 seconds
SUMMARY
--------------------
Total Songs Processed: 49,170
Successful Songs: 40,692
Failed Songs: 8,478
Success Rate: 82.8%
DETAILED STATISTICS
--------------------
Artists Found: 44,526/49,170 (90.6%)
Recordings Found: 40,998/49,170 (83.4%)
Processing Speed: 3.2 songs/second
OUTPUT FILES
--------------------
Successful Songs: data/songs-success.json
Failed Songs: data/songs-failure.json
Report File: data/processing_report_20241219_143022.txt
FAILED SONGS (First 50)
--------------------
1. Unknown Artist - Unknown Song
2. Invalid Artist - Invalid Title
3. Test Artist - Test Song
...
```
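The summary figures in the report follow directly from three inputs: total songs, successful songs, and elapsed seconds. The sketch below reproduces the numbers from the sample report; `summarize` is a hypothetical helper, not the tool's reporting code.

```python
def summarize(total, successful, seconds):
    """Compute the failed count, success rate (%) and processing speed."""
    failed = total - successful
    return {
        "failed": failed,
        "success_rate": round(100 * successful / total, 1) if total else 0.0,
        "songs_per_second": round(total / seconds, 1) if seconds else 0.0,
    }


# Figures taken from the sample report above.
print(summarize(total=49170, successful=40692, seconds=15263.3))
```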
## 🎬 Example Run
### Basic Processing
```bash
$ docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
🚀 Starting song processing...
📊 Total songs to process: 49,170
Using database connection
==================================================
[1 of 49,170] ✅ PASS: ACDC - Shot In The Dark
[2 of 49,170] ❌ FAIL: Unknown Artist - Unknown Song
[3 of 49,170] ✅ PASS: Bruno Mars feat. Cardi B - Finesse (remix)
[4 of 49,170] ✅ PASS: Taylor Swift - Love Story
...
📈 Progress: 100/49,170 (0.2%) - Success: 85.0% - Rate: 3.2 songs/sec
📈 Progress: 200/49,170 (0.4%) - Success: 87.5% - Rate: 3.1 songs/sec
...
==================================================
🎉 Processing completed!
📊 Final Results:
⏱️ Total processing time: 15263.3 seconds
🚀 Average speed: 3.2 songs/second
✅ Artists found: 44,526/49,170 (90.6%)
✅ Recordings found: 40,998/49,170 (83.4%)
❌ Failed songs: 8,478 (17.2%)
📄 Files saved:
✅ Successful songs: data/songs-success.json
❌ Failed songs: data/songs-failure.json
📋 Text report: data/processing_report_20241219_143022.txt
📊 JSON report: data/processing_report_20241219_143022.json
```
### Limited Processing
```bash
$ docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --limit 1000
⚠️ Limiting processing to first 1000 songs
🚀 Starting song processing...
📊 Total songs to process: 1,000
Using database connection
==================================================
[1 of 1,000] ✅ PASS: ACDC - Shot In The Dark
[2 of 1,000] ❌ FAIL: Unknown Artist - Unknown Song
...
==================================================
🎉 Processing completed!
📊 Final Results:
⏱️ Total processing time: 312.5 seconds
🚀 Average speed: 3.2 songs/second
✅ Artists found: 856/1,000 (85.6%)
✅ Recordings found: 789/1,000 (78.9%)
❌ Failed songs: 211 (21.1%)
📄 Files saved:
✅ Successful songs: data/songs-success.json
❌ Failed songs: data/songs-failure.json
📋 Text report: data/processing_report_20241219_143022.txt
📊 JSON report: data/processing_report_20241219_143022.json
```
## 🔧 Troubleshooting
### "Could not find artist"
- The artist might not be in the MusicBrainz database
- Try checking the spelling or using a different variation
- The search index might still be building (wait a few minutes)
- Check fuzzy search similarity score - lower threshold if needed
- **NEW**: Check for artist aliases (e.g., "98 Degrees" → "98°")
- **NEW**: Check for sort names (e.g., "Corby, Matt" → "Matt Corby")
### "Could not find recording"
- The song might not be in the database
- The title might not match exactly
- Try a simpler title (remove extra words)
- Check fuzzy search similarity score - lower threshold if needed
- **NEW**: For collaborations, check if it's stored under the main artist
### Connection errors
- **Database**: Make sure PostgreSQL is running and accessible
- **API**: Make sure your MusicBrainz server is running on `http://localhost:8080`
- Check that Docker containers are up and running
- Verify the server is accessible in your browser
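- **NEW**: For Docker, connect through the Docker network (service name `db`) instead of localhost; a hardcoded container IP such as 172.18.0.2 can stop working when containers are recreated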
### JSON errors
- Make sure your input file is valid JSON
- Check that it contains an array of objects
- Verify all required fields are present
### Performance issues
- Use database mode instead of API mode for better performance
- Ensure database indexes are built for faster queries
- Check fuzzy search thresholds - higher thresholds mean fewer but more accurate matches
### Collaboration detection issues
- **NEW**: Check if it's a band name vs collaboration (e.g., "Simon & Garfunkel" vs "Lovato, Demi & Joe Jonas")
- **NEW**: Verify the collaboration pattern is supported (`ft.`, `feat.`, `featuring`, `&`, `and`, and commas)
- **NEW**: Check case sensitivity - patterns are case-insensitive
## 🎯 Use Cases
- **Karaoke Systems**: Clean up song metadata for better search and organization
- **Music Libraries**: Standardize artist names and add official IDs
- **Music Apps**: Ensure consistent data across your application
- **Data Migration**: Clean up legacy music data when moving to new systems
- **Fuzzy Matching**: Handle typos and variations in artist/song names
- **NEW**: **Collaboration Handling**: Process complex artist collaborations
- **NEW**: **Edge Cases**: Handle artists with special characters and unusual names
## 📚 What are MBIDs?
**MBID** stands for **MusicBrainz Identifier**. These are unique, permanent IDs assigned to artists, recordings, and other music entities in the MusicBrainz database.
**Benefits:**
- **Permanent**: Never change, even if names change
- **Universal**: Used across many music applications
- **Reliable**: Official identifiers from the MusicBrainz database
- **Linked Data**: Connect to other music databases and services
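MBIDs are UUIDs, so their format can be checked with Python's standard `uuid` module. The sketch below only validates the shape of the string; it does not confirm that the ID exists in MusicBrainz. `is_mbid` is a hypothetical helper.

```python
import uuid


def is_mbid(value):
    """Return True if value is a UUID in canonical hyphenated form."""
    try:
        # uuid.UUID also accepts other spellings (no hyphens, braces),
        # so compare against the canonical string to reject those.
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


print(is_mbid("66c662b6-6e2f-4930-8610-912e24c63ed1"))
print(is_mbid("not-an-mbid"))
```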
## 🆕 Performance Comparison
| Method | Speed | Rate Limiting | Fuzzy Search | Setup Complexity |
|--------|-------|---------------|--------------|------------------|
| **Database** | 10x faster | None | Yes | 🔧 Medium |
| **API** | 🐌 Slower | Yes (0.1s delay) | No | Easy |
## 🆕 Collaboration Detection Examples
| Input | Type | Detection | Output |
|-------|------|-----------|---------|
| `Bruno Mars ft. Cardi B` | Collaboration | Primary pattern | `Bruno Mars feat. Cardi B` |
| `Pitbull ft. Ne-Yo, Afrojack & Nayer` | Complex Collaboration | Multiple patterns | `Pitbull feat. Ne-Yo, Afrojack & Nayer` |
| `Simon & Garfunkel` | Band Name | Protected | `Simon & Garfunkel` |
| `Lovato, Demi & Joe Jonas` | Collaboration | Comma detection | `Lovato, Demi & Joe Jonas` |
| `Hall & Oates` | Band Name | Protected | `Hall & Oates` |
## 🆕 Edge Case Examples
| Input | Type | Handling | Output |
|-------|------|----------|---------|
| `ACDC` | Name Variation | Alias lookup | `AC/DC` |
| `98 Degrees` | Artist Alias | Alias search | `98°` |
| `S Club 7` | Numerical Suffix | Suffix removal | `S Club` |
| `Corby, Matt` | Sort Name | Sort name search | `Matt Corby` |
| `Blink-182` | Dash Variation | Unicode dash handling | `blink‐182` |
| `P!nk` | Special Characters | Direct search | `P!nk` |
| `3OH!3` | Numbers + Special | Direct search | `3OH!3` |
## 🤝 Contributing
Found a bug or have a feature request?
1. Check the existing issues
2. Create a new issue with details
3. Include sample data if possible
## 📄 License
This tool is provided as-is for educational and personal use.
## 🔗 Related Links
- [MusicBrainz](https://musicbrainz.org/) - The open music encyclopedia
- [MusicBrainz API](https://musicbrainz.org/doc/Development) - API documentation
- [MusicBrainz Docker](https://github.com/metabrainz/musicbrainz-docker) - Docker setup
- [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy string matching library
## 📝 Lessons Learned
### Database Integration
- **Direct PostgreSQL access is 10x faster** than API calls
- **Docker networking**: containers reach the database through the Docker network (service name `db`), not localhost
- **Database name matters**: `musicbrainz_db` not `musicbrainz`
- **Static caches cause problems**: Wrong MBIDs override correct database lookups
### Collaboration Handling
- **Primary patterns** (ft., feat.) are always collaborations
- **Secondary patterns** (&, and) require intelligence to distinguish from band names
- **Comma detection** helps identify collaborations
- **Artist credit lookup** is essential for preserving all collaborators
### Edge Cases
- **Dash variations** (regular vs Unicode) cause exact match failures
- **Artist aliases** are common and important (98 Degrees → 98°)
- **Sort names** handle "Last, First" formats
- **Numerical suffixes** in names need special handling (S Club 7 → S Club)
### Performance Optimization
- **Remove static caches** for better accuracy
- **Database-first approach** ensures live data
- **Fuzzy search thresholds** need tuning for different datasets
- **Connection pooling** would improve performance for large datasets
### CLI Design
- **Simplified interface** with smart defaults reduces complexity
- **Array format consistency** makes output files easier to work with
- **Human-readable reports** improve user experience
- **Test file organization** keeps project structure clean
---
**Happy cleaning! 🎵✨**