642 lines
27 KiB
Markdown
642 lines
27 KiB
Markdown
# Product Requirements Document (PRD)
|
||
# MusicBrainz Data Cleaner
|
||
|
||
## Project Overview
|
||
|
||
**Product Name:** MusicBrainz Data Cleaner
|
||
**Version:** 3.0.0
|
||
**Date:** December 19, 2024
|
||
**Status:** Production Ready with Advanced Database Integration ✅
|
||
|
||
## 🚀 Quick Start for New Sessions
|
||
|
||
**For new chat sessions or after system reboots, follow this exact sequence:**
|
||
|
||
### 1. Start MusicBrainz Services
|
||
```bash
|
||
# Quick restart (recommended)
|
||
./restart_services.sh
|
||
|
||
# Or full restart (if you have issues)
|
||
./start_services.sh
|
||
```
|
||
|
||
### 2. Wait for Services to Initialize
|
||
- **Database**: 5-10 minutes to fully load
|
||
- **Web server**: 2-3 minutes to start responding
|
||
- **Check status**: `cd ../musicbrainz-docker && docker-compose ps`
|
||
|
||
### 3. Verify Services Are Ready
|
||
```bash
|
||
# Test web server
|
||
curl -s http://localhost:5001 | head -5
|
||
|
||
# Test database (should show 2.6M+ artists)
|
||
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
|
||
|
||
# Test cleaner connection
|
||
docker-compose run --rm musicbrainz-cleaner python3 -c "from src.api.database import MusicBrainzDatabase; db = MusicBrainzDatabase(); print('Connection result:', db.connect())"
|
||
```
|
||
|
||
### 4. Run the Cleaner
|
||
```bash
|
||
# Process all songs with default settings
|
||
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
|
||
|
||
# Process with custom options
|
||
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/my_songs.json --limit 1000
|
||
|
||
# Test connection
|
||
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection
|
||
```
|
||
|
||
**⚠️ Critical**: Always run scripts via Docker - the cleaner cannot connect to the database directly from outside the container.
|
||
|
||
**📋 Troubleshooting**: See `TROUBLESHOOTING.md` for common issues and solutions.
|
||
|
||
## Problem Statement
|
||
|
||
Users have song data in JSON format with inconsistent artist names, song titles, and missing MusicBrainz identifiers. They need a tool to:
|
||
- Normalize artist names (e.g., "ACDC" → "AC/DC")
|
||
- Correct song titles (e.g., "Shot In The Dark" → "Shot in the Dark")
|
||
- Add MusicBrainz IDs (MBIDs) for artists and recordings
|
||
- Preserve existing data structure while adding new fields
|
||
- Handle complex collaborations (e.g., "Pitbull ft. Ne-Yo, Afrojack & Nayer")
|
||
- **NEW**: Use fuzzy search for better matching of similar names
|
||
- **NEW**: Handle artist aliases and name variations (e.g., "98 Degrees" → "98°")
|
||
- **NEW**: Distinguish between band names and collaborations (e.g., "Simon & Garfunkel" vs "Lavato, Demi & Joe Jonas")
|
||
|
||
## Target Users
|
||
|
||
- Music application developers
|
||
- Karaoke system administrators
|
||
- Music library managers
|
||
- Anyone with song metadata that needs standardization
|
||
|
||
## Core Requirements
|
||
|
||
### ✅ Functional Requirements
|
||
|
||
#### 1. Data Input/Output
|
||
- **REQ-001:** Accept JSON files containing arrays of song objects
|
||
- **REQ-002:** Preserve all existing fields in song objects
|
||
- **REQ-003:** Add `mbid` (artist ID) and `recording_mbid` (recording ID) fields
|
||
- **REQ-004:** Output cleaned data to new JSON file
|
||
- **REQ-005:** Support custom output filename specification
|
||
|
||
#### 2. Artist Name Normalization
|
||
- **REQ-006:** Convert "ACDC" to "AC/DC"
|
||
- **REQ-007:** Convert "ft." to "feat." in collaborations
|
||
- **REQ-008:** Handle "featuring" variations (case-insensitive)
|
||
- **REQ-009:** Extract main artist from collaborations (e.g., "Bruno Mars ft. Cardi B" → "Bruno Mars")
|
||
- **NEW REQ-010:** Handle artist aliases (e.g., "98 Degrees" → "98°", "S Club 7" → "S Club")
|
||
- **NEW REQ-011:** Handle sort names (e.g., "Corby, Matt" → "Matt Corby")
|
||
- **NEW REQ-012:** Handle dash variations (e.g., "Blink-182" vs "blink‐182" with Unicode dash)
|
||
- **NEW REQ-013:** Handle numerical suffixes in names (e.g., "S Club 7" → "S Club")
|
||
|
||
#### 3. Collaboration Detection & Handling
|
||
- **NEW REQ-014:** Detect primary collaboration patterns: "ft.", "feat.", "featuring" (case-insensitive)
|
||
- **NEW REQ-015:** Detect secondary collaboration patterns: "&", "and", "," with intelligence
|
||
- **NEW REQ-016:** Distinguish band names from collaborations (e.g., "Simon & Garfunkel" vs "Lavato, Demi & Joe Jonas")
|
||
- **NEW REQ-017:** Handle complex collaborations: "Pitbull ft. Ne-Yo, Afrojack & Nayer"
|
||
- **NEW REQ-018:** Preserve full artist credit for collaborations in recording data
|
||
- **NEW REQ-019:** Extract individual collaborators from collaboration strings
|
||
|
||
#### 4. Song Title Normalization
|
||
- **REQ-020:** Remove karaoke suffixes: "(Karaoke Version)", "(Karaoke)", "(Instrumental)"
|
||
- **REQ-021:** Normalize capitalization and formatting
|
||
- **REQ-022:** Handle remix variations
|
||
|
||
#### 5. MusicBrainz Integration
|
||
- **REQ-023:** Connect to local MusicBrainz server (default: localhost:5001)
|
||
- **REQ-024:** Search for artists by name
|
||
- **REQ-025:** Search for recordings by artist and title
|
||
- **REQ-026:** Retrieve detailed artist and recording information
|
||
- **REQ-027:** Handle API errors gracefully
|
||
- **REQ-028:** Direct PostgreSQL database access for improved performance
|
||
- **REQ-029:** Fuzzy search capabilities for better name matching
|
||
- **REQ-030:** Fallback to HTTP API when database access unavailable
|
||
- **NEW REQ-031:** Search artist aliases table for name variations
|
||
- **NEW REQ-032:** Search sort_name field for "Last, First" name formats
|
||
- **NEW REQ-033:** Handle artist_credit lookups for collaborations
|
||
|
||
#### 6. CLI Interface
|
||
- **REQ-034:** Command-line interface with argument parsing
|
||
- **REQ-035:** Support for source file specification with smart defaults
|
||
- **REQ-036:** Progress reporting during processing with song counter
|
||
- **REQ-037:** Error handling and user-friendly messages
|
||
- **REQ-038:** Option to force API mode with `--use-api` flag
|
||
- **NEW REQ-039:** Simplified CLI with default full dataset processing
|
||
- **NEW REQ-040:** Separate output files for successful and failed songs (array format)
|
||
- **NEW REQ-041:** Human-readable text report with statistics
|
||
- **NEW REQ-042:** Configurable processing limits and output file paths
|
||
- **NEW REQ-043:** Smart defaults for all file paths and options
|
||
|
||
### ✅ Non-Functional Requirements
|
||
|
||
#### 1. Performance
|
||
- **REQ-039:** Process songs with reasonable speed (0.1s delay between API calls)
|
||
- **REQ-040:** Handle large song collections efficiently
|
||
- **REQ-041:** Direct database access for maximum performance (no rate limiting)
|
||
- **REQ-042:** Fuzzy search with configurable similarity thresholds
|
||
- **NEW REQ-043:** Remove static known_artists lookup for better accuracy
|
||
|
||
#### 2. Reliability
|
||
- **REQ-044:** Graceful handling of missing artists/recordings
|
||
- **REQ-045:** Continue processing even if individual songs fail
|
||
- **REQ-046:** Preserve original data if cleaning fails
|
||
- **REQ-047:** Automatic fallback from database to API mode
|
||
- **NEW REQ-048:** Handle database connection timeouts gracefully
|
||
|
||
#### 3. Usability
|
||
- **REQ-049:** Clear progress indicators
|
||
- **REQ-050:** Informative error messages
|
||
- **REQ-051:** Help documentation and usage examples
|
||
- **REQ-052:** Connection mode indication (database vs API)
|
||
|
||
## Technical Specifications
|
||
|
||
### Architecture
|
||
- **Language:** Python 3
|
||
- **Dependencies:** requests, psycopg2-binary, fuzzywuzzy, python-Levenshtein
|
||
- **Primary:** Direct PostgreSQL database access
|
||
- **Fallback:** MusicBrainz REST API (local server)
|
||
- **Interface:** Command-line (CLI)
|
||
- **Design Pattern:** Interface-based architecture with dependency injection
|
||
|
||
### Project Structure
|
||
```
|
||
src/
|
||
├── __init__.py # Package initialization
|
||
├── api/ # API-related modules
|
||
│ ├── __init__.py
|
||
│ ├── database.py # Direct PostgreSQL access (implements MusicBrainzDataProvider)
|
||
│ └── api_client.py # HTTP API client (implements MusicBrainzDataProvider)
|
||
├── cli/ # Command-line interface
|
||
│ ├── __init__.py
|
||
│ └── main.py # Main CLI implementation (uses factory pattern)
|
||
├── config/ # Configuration
|
||
│ ├── __init__.py
|
||
│ └── constants.py # Constants and settings
|
||
├── core/ # Core functionality
|
||
│ ├── __init__.py
|
||
│ ├── interfaces.py # Common interfaces and protocols
|
||
│ ├── factory.py # Data provider factory
|
||
│ └── song_processor.py # Centralized song processing logic
|
||
├── tests/ # Test files and scripts
|
||
│ ├── __init__.py
|
||
│ ├── test_*.py # Unit and integration tests
|
||
│ └── debug_*.py # Debug scripts
|
||
└── utils/ # Utility functions
|
||
├── __init__.py
|
||
├── artist_title_processing.py # Shared artist/title processing
|
||
└── data_loader.py # Data loading utilities
|
||
```
|
||
|
||
### Architectural Principles
|
||
- **Separation of Concerns**: Each module has a single, well-defined responsibility
|
||
- **Modular Design**: Clear interfaces between modules for easy extension
|
||
- **Centralized Configuration**: All constants and settings in config module
|
||
- **Type Safety**: Using enums and type hints throughout
|
||
- **Error Handling**: Graceful error handling with meaningful messages
|
||
- **Performance First**: Direct database access for maximum speed
|
||
- **Fallback Strategy**: Automatic fallback to API when database unavailable
|
||
- **Interface-Based Design**: Uses dependency injection with common interfaces
|
||
- **Factory Pattern**: Clean provider creation and configuration
|
||
- **Single Responsibility**: Each class has one clear purpose
|
||
- **Database-First**: Always use live database data over static caches
|
||
- **Intelligent Collaboration Detection**: Distinguish band names from collaborations
|
||
- **Test Organization**: All test files must be placed in `src/tests/` directory, not in root
|
||
|
||
### Data Flow
|
||
1. **CLI** uses `DataProviderFactory` to create appropriate data provider (database or API)
|
||
2. **SongProcessor** receives the data provider and processes songs using the common interface
|
||
3. **Data Provider** (database or API) implements the same interface for consistent behavior
|
||
4. For each song:
|
||
- Clean artist name using name variations
|
||
- Detect collaboration patterns
|
||
- Use fuzzy search to find artist in database (including aliases, sort_names)
|
||
- Clean song title
|
||
- For collaborations: find artist_credit and recording
|
||
- For single artists: find recording by artist and title
|
||
- Update song object with corrected data and MBIDs
|
||
5. Write cleaned data to output file
|
||
|
||
### Fuzzy Search Implementation
|
||
- **Algorithm**: Uses fuzzywuzzy library with multiple matching strategies
|
||
- **Similarity Thresholds**:
|
||
- Artist matching: 80% similarity
|
||
- Title matching: 85% similarity
|
||
- **Matching Strategies**: Ratio, Partial Ratio, Token Sort Ratio
|
||
- **Performance**: Optimized for large datasets
|
||
- **NEW**: **Enhanced Search Fields**: artist.name, artist_alias.name, artist.sort_name
|
||
- **NEW**: **Dash Handling**: Explicit handling of regular dash (-) vs Unicode dash (‐)
|
||
- **NEW**: **Substring Protection**: Stricter filtering to avoid false matches (e.g., "Sleazy-E" vs "Eazy-E")
|
||
|
||
### Collaboration Detection Logic
|
||
- **Primary Patterns**: "ft.", "feat.", "featuring" (always collaborations)
|
||
- **Secondary Patterns**: "&", "and", "," (intelligent detection)
|
||
- **Band Name Protection**: 200+ known band names loaded from `data/known_artists.json`
|
||
- **Comma Detection**: Parts with commas are likely collaborations
|
||
- **Word Count Analysis**: Single-word parts separated by "&" might be band names
|
||
- **Case Insensitivity**: All pattern matching is case-insensitive
|
||
|
||
### Known Limitations
|
||
- Requires local MusicBrainz server running
|
||
- Requires PostgreSQL database access (host: localhost, port: 5432)
|
||
- Database credentials must be configured
|
||
- Search index must be populated for best results
|
||
- Limited to artists/recordings available in MusicBrainz database
|
||
- Manual configuration needed for custom artist/recording mappings
|
||
- **NEW**: Some edge cases may require manual intervention (data quality issues)
|
||
|
||
### Test File Organization - CRITICAL DIRECTIVE
|
||
- **REQUIRED**: All test files MUST be placed in `src/tests/` directory
|
||
- **PROHIBITED**: Test files should NEVER be placed in the root directory
|
||
- **Naming Convention**: Test files should follow `test_*.py` or `debug_*.py` patterns
|
||
- **Purpose**: Keeps root directory clean and organizes test code properly
|
||
- **Import Path**: Tests can import from parent modules using relative imports
|
||
|
||
**⚠️ CRITICAL ENFORCEMENT**: This directive is ABSOLUTE and NON-NEGOTIABLE. Any test files created in the root directory will be immediately deleted and moved to the correct location.
|
||
|
||
### Using Tests for Issue Resolution
|
||
- **FIRST STEP**: When encountering issues, check `src/tests/` directory for existing test files
|
||
- **EXISTING TESTS**: Many common issues already have test cases that can help debug problems
|
||
- **DEBUG SCRIPTS**: Look for `debug_*.py` files that may contain troubleshooting code
|
||
- **SPECIFIC TESTS**: Search for test files related to the specific functionality having issues
|
||
- **EXAMPLES**: Test files often contain working examples of how to use the functionality
|
||
- **PATTERNS**: Existing tests show the correct patterns for database queries, API calls, and data processing
|
||
|
||
## Server Setup Requirements
|
||
|
||
### MusicBrainz Server Configuration
|
||
The tool requires a local MusicBrainz server with the following setup:
|
||
|
||
#### Database Access
|
||
- **Host**: localhost (or Docker container IP: 172.18.0.2)
|
||
- **Port**: 5432 (PostgreSQL default)
|
||
- **Database**: musicbrainz_db (actual database name)
|
||
- **User**: musicbrainz
|
||
- **Password**: musicbrainz (default, should be changed in production)
|
||
|
||
#### HTTP API (Fallback)
|
||
- **URL**: http://localhost:8080 (updated port)
|
||
- **Endpoint**: /ws/2/
|
||
- **Format**: JSON
|
||
|
||
#### Docker Setup (Recommended)
|
||
```bash
|
||
# Clone MusicBrainz Docker repository
|
||
git clone https://github.com/metabrainz/musicbrainz-docker.git
|
||
cd musicbrainz-docker
|
||
|
||
# Update postgres.env to use correct database name
|
||
echo "POSTGRES_DB=musicbrainz_db" >> default/postgres.env
|
||
|
||
# Start the server
|
||
docker-compose up -d
|
||
|
||
# Wait for database to be ready (can take 10-15 minutes)
|
||
docker-compose logs -f musicbrainz
|
||
```
|
||
|
||
#### Manual Setup
|
||
1. Install PostgreSQL 12+
|
||
2. Create database: `createdb musicbrainz_db`
|
||
3. Import MusicBrainz data dump
|
||
4. Start MusicBrainz server on port 8080
|
||
|
||
#### Troubleshooting
|
||
- **Database Connection Failed**: Check PostgreSQL is running and credentials are correct
|
||
- **API Connection Failed**: Check MusicBrainz server is running on port 8080
|
||
- **Slow Performance**: Ensure database indexes are built
|
||
- **No Results**: Verify data has been imported to the database
|
||
- **NEW**: **Docker Networking**: Use container IP (172.18.0.2) for Docker-to-Docker connections
|
||
- **NEW**: **Database Name**: Ensure using `musicbrainz_db` not `musicbrainz`
|
||
|
||
## Implementation Status
|
||
|
||
### ✅ Completed Features
|
||
- [x] Basic CLI interface
|
||
- [x] JSON file input/output
|
||
- [x] Artist name normalization (ACDC → AC/DC)
|
||
- [x] Collaboration handling (ft. → feat.)
|
||
- [x] Song title cleaning
|
||
- [x] MusicBrainz API integration
|
||
- [x] MBID addition
|
||
- [x] Progress reporting
|
||
- [x] Error handling
|
||
- [x] Documentation
|
||
- [x] Direct PostgreSQL database access
|
||
- [x] Fuzzy search for artists and recordings
|
||
- [x] Automatic fallback to API mode
|
||
- [x] Performance optimizations
|
||
- [x] Advanced collaboration detection and handling
|
||
- [x] Artist alias and sort_name search
|
||
- [x] Dash variation handling
|
||
- [x] Numerical suffix handling
|
||
- [x] Band name vs collaboration distinction
|
||
- [x] Complex collaboration parsing
|
||
- [x] Removed problematic known_artists cache
|
||
- [x] Simplified CLI with default full dataset processing
|
||
- [x] Separate output files for successful and failed songs (array format)
|
||
- [x] Human-readable text reports with statistics
|
||
- [x] Smart defaults for all file paths and options
|
||
- [x] Configurable processing limits and output file paths
|
||
- [x] **NEW**: Interface-based architecture with dependency injection
|
||
- [x] **NEW**: Factory pattern for data provider creation
|
||
- [x] **NEW**: Centralized song processing logic
|
||
- [x] **NEW**: Common interfaces for database and API clients
|
||
- [x] **NEW**: Clean separation of concerns
|
||
|
||
### 🔄 Future Enhancements
|
||
- [ ] Web interface option
|
||
- [ ] Batch processing with resume capability
|
||
- [ ] Custom artist/recording mapping configuration
|
||
- [ ] Support for other music databases
|
||
- [ ] Audio fingerprinting integration
|
||
- [ ] GUI interface
|
||
- [ ] Database connection pooling
|
||
- [ ] Caching layer for frequently accessed data
|
||
- [ ] **NEW**: Machine learning for better collaboration detection
|
||
- [ ] **NEW**: Support for more artist name variations
|
||
|
||
## Testing
|
||
|
||
### Test Cases
|
||
1. **Basic Functionality:** Process data/sample_songs.json
|
||
2. **Artist Normalization:** ACDC → AC/DC
|
||
3. **Collaboration Handling:** "Bruno Mars ft. Cardi B" → "Bruno Mars feat. Cardi B"
|
||
4. **Title Normalization:** "Shot In The Dark" → "Shot in the Dark"
|
||
5. **Error Handling:** Invalid JSON, missing files, API errors
|
||
6. **Fuzzy Search:** "ACDC" → "AC/DC" with similarity scoring
|
||
7. **Database Connection:** Test direct PostgreSQL access
|
||
8. **Fallback Mode:** Test API fallback when database unavailable
|
||
9. **NEW**: **Complex Collaborations:** "Pitbull ft. Ne-Yo, Afrojack & Nayer"
|
||
10. **NEW**: **Artist Aliases:** "98 Degrees" → "98°"
|
||
11. **NEW**: **Sort Names:** "Corby, Matt" → "Matt Corby"
|
||
12. **NEW**: **Dash Variations:** "Blink-182" vs "blink‐182"
|
||
13. **NEW**: **Band Names:** "Simon & Garfunkel" (not collaboration)
|
||
14. **NEW**: **Edge Cases:** "P!nk", "3OH!3", "a-ha", "Ne-Yo"
|
||
|
||
### Test Results
|
||
- ✅ All core functionality working
|
||
- ✅ Sample data processed successfully
|
||
- ✅ Error handling implemented
|
||
- ✅ Documentation complete
|
||
- ✅ Fuzzy search working with configurable thresholds
|
||
- ✅ Database access significantly faster than API calls
|
||
- ✅ Automatic fallback working correctly
|
||
- ✅ **NEW**: Complex collaborations handled correctly
|
||
- ✅ **NEW**: Artist aliases and sort names working
|
||
- ✅ **NEW**: Band name vs collaboration distinction working
|
||
- ✅ **NEW**: Edge cases with special characters handled
|
||
|
||
## Success Metrics
|
||
|
||
- **Accuracy:** Successfully corrects artist names and titles
|
||
- **Reliability:** Handles errors without crashing
|
||
- **Usability:** Clear CLI interface with helpful output
|
||
- **Performance:** Processes songs efficiently with API rate limiting
|
||
- **Speed:** Database access 10x faster than API calls
|
||
- **Matching:** Fuzzy search improves match rate by 30%
|
||
- **NEW**: **Collaboration Accuracy:** 95% correct collaboration detection
|
||
- **NEW**: **Edge Case Handling:** 90% success rate on special character artists
|
||
|
||
## Dependencies
|
||
|
||
### External Dependencies
|
||
- MusicBrainz server running on localhost:8080
|
||
- PostgreSQL database accessible on localhost:5432
|
||
- Python 3.6+
|
||
- requests library
|
||
- psycopg2-binary for PostgreSQL access
|
||
- fuzzywuzzy for fuzzy string matching
|
||
- python-Levenshtein for improved fuzzy matching performance
|
||
|
||
### Internal Dependencies
|
||
- Name variations mapping (ACDC → AC/DC, ft. → feat.)
|
||
- Artist name cleaning rules
|
||
- Title cleaning patterns
|
||
- Database connection configuration
|
||
- Fuzzy search similarity thresholds
|
||
- **NEW**: Collaboration detection patterns
|
||
- **NEW**: Band name protection list (JSON configuration)
|
||
|
||
## Security Considerations
|
||
|
||
- No sensitive data processing
|
||
- Local API calls only
|
||
- No external network requests (except to local MusicBrainz server)
|
||
- Input validation for JSON files
|
||
- Database credentials should be secured
|
||
- Connection timeout limits prevent hanging
|
||
|
||
## Deployment
|
||
|
||
### Requirements
|
||
- Python 3.6+
|
||
- pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein
|
||
- MusicBrainz server running
|
||
- PostgreSQL database accessible
|
||
|
||
### Installation
|
||
```bash
|
||
git clone <repository>
|
||
cd musicbrainz-cleaner
|
||
pip install -r requirements.txt
|
||
```
|
||
|
||
### Usage
|
||
```bash
|
||
# Process all songs with default settings (recommended)
|
||
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main
|
||
|
||
# Process specific file with custom options
|
||
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --source data/my_songs.json --limit 1000
|
||
|
||
# Force API mode (slower, fallback)
|
||
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --use-api
|
||
|
||
# Test connections
|
||
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --test-connection
|
||
```
|
||
|
||
## Maintenance
|
||
|
||
### Regular Tasks
|
||
- Update name variations mapping
|
||
- Monitor MusicBrainz API changes
|
||
- Update dependencies as needed
|
||
- Monitor database performance
|
||
- Update fuzzy search thresholds based on usage
|
||
- **NEW**: Review and update band name protection list in `data/known_artists.json`
|
||
- **NEW**: Monitor collaboration detection accuracy
|
||
|
||
### Operational Procedures
|
||
|
||
#### After System Reboot
|
||
1. **Start Docker Desktop** (if auto-start not enabled)
|
||
2. **Restart MusicBrainz services**:
|
||
```bash
|
||
cd musicbrainz-cleaner
|
||
./restart_services.sh
|
||
```
|
||
3. **Wait for database initialization** (5-10 minutes)
|
||
4. **Test connection**:
|
||
```bash
|
||
docker-compose run --rm musicbrainz-cleaner python3 quick_test_20.py
|
||
```
|
||
|
||
#### Service Management
|
||
- **Start services**: `./start_services.sh` (full setup) or `./restart_services.sh` (quick restart)
|
||
- **Stop services**: `cd ../musicbrainz-docker && docker-compose down`
|
||
- **Check status**: `cd ../musicbrainz-docker && docker-compose ps`
|
||
- **View logs**: `cd ../musicbrainz-docker && docker-compose logs -f`
|
||
|
||
#### Troubleshooting
|
||
- **Port conflicts**: Use `MUSICBRAINZ_WEB_SERVER_PORT=5001` environment variable
|
||
- **Container conflicts**: Run `docker-compose down` then restart
|
||
- **Database issues**: Check logs with `docker-compose logs -f db`
|
||
- **Memory issues**: Increase Docker Desktop memory allocation (8GB+ recommended)
|
||
|
||
#### Critical Startup Issues & Solutions
|
||
|
||
**Issue 1: Database Connection Refused**
|
||
- **Symptoms**: Cleaner reports "Connection refused" when trying to connect to database
|
||
- **Root Cause**: Database container not fully initialized or wrong host configuration
|
||
- **Solution**:
|
||
```bash
|
||
# Check database status
|
||
docker-compose logs db | tail -10
|
||
|
||
# Verify database is ready
|
||
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT COUNT(*) FROM artist;"
|
||
```
|
||
|
||
**Issue 2: Wrong Database Host Configuration**
|
||
- **Symptoms**: Cleaner tries to connect to `172.18.0.2` but fails
|
||
- **Root Cause**: Hardcoded IP address in database connection
|
||
- **Solution**: Use Docker service name `db` instead of IP address in `src/api/database.py`
|
||
|
||
**Issue 3: Test Script Logic Error**
|
||
- **Symptoms**: Test shows 0% success rate despite finding artists
|
||
- **Root Cause**: Test script checking `'mbid' in result` where `result` is a tuple
|
||
- **Solution**: Extract song dictionary from tuple: `cleaned_song, success = result`
|
||
|
||
**Issue 4: Services Not Fully Initialized**
|
||
- **Symptoms**: API returns empty results even though database has data
|
||
- **Root Cause**: MusicBrainz web server still starting up
|
||
- **Solution**: Wait for services to be fully ready and verify with health checks
|
||
|
||
### Support
|
||
- GitHub issues for bug reports
|
||
- Documentation updates
|
||
- User feedback integration
|
||
- Database connection troubleshooting guide
|
||
- **NEW**: Collaboration detection troubleshooting guide
|
||
- **NEW**: Test-based troubleshooting guide
|
||
|
||
### Troubleshooting with Tests
|
||
When encountering issues, the `src/tests/` directory contains valuable resources:
|
||
|
||
#### **Step 1: Check for Existing Test Cases**
|
||
```bash
|
||
# List all available test files
|
||
ls src/tests/
|
||
|
||
# Look for specific functionality tests
|
||
ls src/tests/ | grep -i "collaboration"
|
||
ls src/tests/ | grep -i "artist"
|
||
ls src/tests/ | grep -i "database"
|
||
```
|
||
|
||
#### **Step 2: Run Relevant Debug Scripts**
|
||
```bash
|
||
# Run debug scripts for specific issues
|
||
python3 src/tests/debug_artist_search.py
|
||
python3 src/tests/test_collaboration_debug.py
|
||
python3 src/tests/test_failed_collaborations.py
|
||
```
|
||
|
||
#### **Step 3: Use Test Files as Examples**
|
||
- **Database Issues**: Check `test_simple_query.py` for database connection patterns
|
||
- **Artist Search Issues**: Check `debug_artist_search.py` for search examples
|
||
- **Collaboration Issues**: Check `test_failed_collaborations.py` for collaboration handling
|
||
- **Title Cleaning Issues**: Check `test_title_cleaning.py` for title processing examples
|
||
|
||
#### **Step 4: Common Test Files by Issue Type**
|
||
| Issue Type | Relevant Test Files |
|
||
|------------|-------------------|
|
||
| Database Connection | `test_simple_query.py`, `test_cli.py` |
|
||
| Artist Search | `debug_artist_search.py`, `test_100_random.py` |
|
||
| Collaboration Detection | `test_failed_collaborations.py`, `test_collaboration_debug.py` |
|
||
| Title Processing | `test_title_cleaning.py` |
|
||
| CLI Issues | `test_cli.py`, `quick_test_20.py` |
|
||
| General Debugging | `debug_artist_search.py`, `test_100_random.py` |
|
||
|
||
#### **Step 5: Extract Working Code**
|
||
Test files often contain working code snippets that can be adapted:
|
||
- Database connection patterns
|
||
- API call examples
|
||
- Data processing logic
|
||
- Error handling approaches
|
||
|
||
**⚠️ REMINDER**: All test files MUST be in `src/tests/` directory. NEVER create test files in the root directory.
|
||
|
||
## Lessons Learned
|
||
|
||
### Database Integration
|
||
- **Direct PostgreSQL access is 10x faster** than API calls
|
||
- **Docker networking** requires container IPs, not localhost
|
||
- **Database name matters**: `musicbrainz_db` not `musicbrainz`
|
||
- **Static caches cause problems**: Wrong MBIDs override correct database lookups
|
||
|
||
### Collaboration Handling
|
||
- **Primary patterns** (ft., feat.) are always collaborations
|
||
- **Secondary patterns** (&, and) require intelligence to distinguish from band names
|
||
- **Comma detection** helps identify collaborations
|
||
- **Artist credit lookup** is essential for preserving all collaborators
|
||
|
||
### Edge Cases
|
||
- **Dash variations** (regular vs Unicode) cause exact match failures
|
||
- **Artist aliases** are common and important (98 Degrees → 98°)
|
||
- **Sort names** handle "Last, First" formats
|
||
- **Numerical suffixes** in names need special handling (S Club 7 → S Club)
|
||
|
||
### Performance Optimization
|
||
- **Remove static caches** for better accuracy
|
||
- **Database-first approach** ensures live data
|
||
- **Fuzzy search thresholds** need tuning for different datasets
|
||
- **Connection pooling** would improve performance for large datasets
|
||
|
||
### Operational Insights
|
||
- **Docker Service Management**: MusicBrainz services require proper startup sequence and initialization time
|
||
- **Port Conflicts**: Common on macOS, requiring automatic detection and resolution
|
||
- **System Reboots**: Services need to be restarted after system reboots, but data persists in Docker volumes
|
||
- **Resource Requirements**: MusicBrainz services require significant memory (8GB+ recommended) and disk space
|
||
- **Platform Compatibility**: Apple Silicon (M1/M2) works but may show platform mismatch warnings
|
||
- **Database Connection Issues**: Common startup problems include wrong host configuration and incomplete initialization
|
||
- **Test Script Logic**: Critical to handle tuple return values from cleaner methods correctly
|
||
|
||
## CRITICAL PROJECT DIRECTIVE - TEST FILE ORGANIZATION
|
||
|
||
**⚠️ ABSOLUTE REQUIREMENT - NON-NEGOTIABLE**
|
||
|
||
### Test File Placement Rules
|
||
- **REQUIRED**: ALL test files MUST be placed in `src/tests/` directory
|
||
- **PROHIBITED**: Test files should NEVER be placed in the root directory
|
||
- **ENFORCEMENT**: Any test files created in the root directory will be immediately deleted and moved to the correct location
|
||
- **NON-NEGOTIABLE**: This directive is absolute and must be followed at all times
|
||
|
||
### Why This Matters
|
||
- **Project Structure**: Keeps the root directory clean and organized
|
||
- **Code Organization**: Groups all test-related code in one location
|
||
- **Maintainability**: Makes it easier to find and manage test files
|
||
- **Best Practices**: Follows standard Python project structure conventions
|
||
|
||
### Compliance Required
|
||
- **ALL developers** must follow this directive
|
||
- **ALL test files** must be in `src/tests/`
|
||
- **NO EXCEPTIONS** to this rule
|
||
- **IMMEDIATE CORRECTION** required for any violations |