Signed-off-by: Matt Bruce <mbrucedogs@gmail.com>

Matt Bruce 2025-07-31 18:07:18 -05:00
parent 504820c8a1
commit 4bf359ee5d
8 changed files with 715 additions and 71 deletions

36
PRD.md

@ -374,6 +374,33 @@ python musicbrainz_cleaner.py --test-connection
- **NEW**: Review and update band name protection list in `data/known_artists.json`
- **NEW**: Monitor collaboration detection accuracy
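To make the role of the protection list concrete, here is a minimal sketch of how a protected band name could bypass collaboration splitting. The function name `split_collaboration`, the regular expression, and the in-memory `protected` set are assumptions for illustration; the cleaner's real detection logic is not shown in this commit.

```python
import re
from typing import List, Set, Tuple

# Assumed collaboration markers; the real cleaner may recognize more.
COLLAB_PATTERN = re.compile(r"\s+(?:featuring|feat\.?|ft\.?)\s+", re.IGNORECASE)


def split_collaboration(artist: str, protected: Set[str]) -> Tuple[str, List[str]]:
    """Return (main_artist, collaborators), leaving protected names intact."""
    protected_folded = {name.casefold() for name in protected}
    if artist.casefold() in protected_folded:
        # A known band name: "featuring" is part of the name, not a credit.
        return artist, []
    parts = COLLAB_PATTERN.split(artist)
    return parts[0], parts[1:]


# Stand-in for the names loaded from data/known_artists.json.
protected_names = {"The Tamperer featuring Maya"}
print(split_collaboration("The Tamperer featuring Maya", protected_names))
print(split_collaboration("Artist A feat. Artist B", protected_names))
```

Without the protected entry, the first name would be split into a main artist and a collaborator, which is the misdetection the list is meant to prevent.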
### Operational Procedures
#### After System Reboot
1. **Start Docker Desktop** (if auto-start not enabled)
2. **Restart MusicBrainz services**:
```bash
cd musicbrainz-cleaner
./restart_services.sh
```
3. **Wait for database initialization** (5-10 minutes)
4. **Test connection**:
```bash
docker-compose run --rm musicbrainz-cleaner python3 quick_test_20.py
```
#### Service Management
- **Start services**: `./start_services.sh` (full setup) or `./restart_services.sh` (quick restart)
- **Stop services**: `cd ../musicbrainz-docker && docker-compose down`
- **Check status**: `cd ../musicbrainz-docker && docker-compose ps`
- **View logs**: `cd ../musicbrainz-docker && docker-compose logs -f`
#### Troubleshooting
- **Port conflicts**: Use `MUSICBRAINZ_WEB_SERVER_PORT=5001` environment variable
- **Container conflicts**: Run `docker-compose down` from `../musicbrainz-docker`, then restart
- **Database issues**: Check logs with `docker-compose logs -f db` (also from `../musicbrainz-docker`)
- **Memory issues**: Increase Docker Desktop memory allocation (8GB+ recommended)
### Support
- GitHub issues for bug reports
- Documentation updates
@ -405,4 +432,11 @@ python musicbrainz_cleaner.py --test-connection
- **Remove static caches** for better accuracy
- **Database-first approach** ensures live data
- **Fuzzy search thresholds** need tuning for different datasets
- **Connection pooling** would improve performance for large datasets
### Operational Insights
- **Docker Service Management**: MusicBrainz services require proper startup sequence and initialization time
- **Port Conflicts**: Common on macOS, requiring automatic detection and resolution
- **System Reboots**: Services need to be restarted after system reboots, but data persists in Docker volumes
- **Resource Requirements**: MusicBrainz services require significant memory (8GB+ recommended) and disk space
- **Platform Compatibility**: Apple Silicon (M1/M2) works but may show platform mismatch warnings
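The port-conflict insight above can be sketched in Python as well. This is only an illustration of the detection idea: `is_port_free` and `pick_web_port` are hypothetical helpers, and probing with a bind on `127.0.0.1` is an assumption (the shell script in this commit uses `lsof` instead).

```python
import socket
from typing import Iterable


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": something else holds the port.
            return False
    return True


def pick_web_port(candidates: Iterable[int] = (5000, 5001)) -> int:
    """Return the first free candidate port, mirroring the 5000 -> 5001 fallback."""
    for port in candidates:
        if is_port_free(port):
            return port
    raise RuntimeError("no candidate port is free")


if __name__ == "__main__":
    print(f"MUSICBRAINZ_WEB_SERVER_PORT={pick_web_port()}")
```

A bind probe is inherently racy (the port can be taken between the check and `docker-compose up`), so it is a convenience check rather than a guarantee.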

103
README.md

@ -39,50 +39,81 @@ A powerful command-line tool that cleans and normalizes your song data using the
## 🚀 Quick Start
### 1. Install Dependencies ### Option 1: Automated Setup (Recommended)
1. **Start MusicBrainz services**:
```bash
./start_services.sh
```
This script will:
- Check for Docker and port conflicts
- Start all MusicBrainz services
- Wait for database initialization
- Create environment configuration
- Test the connection
2. **Run the cleaner**:
```bash
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --input data/songs.json --output cleaned_songs.json
```
### Option 2: Manual Setup
1. **Start MusicBrainz services manually**:
```bash
cd ../musicbrainz-docker
MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
```
Wait 5-10 minutes for database initialization.
2. **Create environment configuration**:
```bash
# Create .env file in musicbrainz-cleaner directory
cat > .env << EOF
DB_HOST=172.18.0.2
DB_PORT=5432
DB_NAME=musicbrainz_db
DB_USER=musicbrainz
DB_PASSWORD=musicbrainz
MUSICBRAINZ_WEB_SERVER_PORT=5001
EOF
```
3. **Run the cleaner**:
```bash
docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --input data/songs.json --output cleaned_songs.json
```
### For detailed setup instructions, see [SETUP.md](SETUP.md)
## 🔄 After System Reboot
After restarting your Mac, you'll need to restart the MusicBrainz services:
### Quick Restart (Recommended)
```bash ```bash
pip install requests psycopg2-binary fuzzywuzzy python-Levenshtein # If Docker Desktop is already running
./restart_services.sh
# Or manually
cd ../musicbrainz-docker && MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
``` ```
### 2. Set Up MusicBrainz Server ### Full Restart (If you have issues)
#### Option A: Docker (Recommended)
```bash ```bash
# Clone MusicBrainz Docker repository # Complete setup including Docker checks
git clone https://github.com/metabrainz/musicbrainz-docker.git ./start_services.sh
cd musicbrainz-docker
# Update postgres.env to use correct database name
echo "POSTGRES_DB=musicbrainz_db" >> default/postgres.env
# Start the server
docker-compose up -d
# Wait for database to be ready (can take 10-15 minutes)
docker-compose logs -f musicbrainz
``` ```
#### Option B: Manual Setup ### Auto-start Setup (Optional)
1. Install PostgreSQL 12+ 1. **Enable Docker Desktop auto-start**:
2. Create database: `createdb musicbrainz_db` - Open Docker Desktop
3. Import MusicBrainz data dump - Go to Settings → General
4. Start MusicBrainz server on port 8080 - Check "Start Docker Desktop when you log in"
### 3. Test Connection 2. **Then just run**: `./restart_services.sh` after each reboot
```bash
python musicbrainz_cleaner.py --test-connection
```
### 4. Run the Cleaner **Note**: Your data is preserved in Docker volumes, so you don't need to reconfigure anything after a reboot.
```bash
# Use database access (recommended, faster)
python musicbrainz_cleaner.py your_songs.json
# Force API mode (slower, fallback)
python musicbrainz_cleaner.py your_songs.json --use-api
```
That's it! Your cleaned data will be saved to `your_songs_cleaned.json`
## 📋 Requirements

266
SETUP.md Normal file

@ -0,0 +1,266 @@
# MusicBrainz Cleaner Setup Guide
This guide will help you set up the MusicBrainz database and Docker services needed to run the cleaner.
## Prerequisites
- Docker Desktop installed and running
- At least 8GB of available RAM
- At least 10GB of free disk space
- Git (to clone the repositories)
## Step 1: Clone the MusicBrainz Server Repository
```bash
# Clone the main MusicBrainz server repository (if not already done)
git clone https://github.com/metabrainz/musicbrainz-server.git
cd musicbrainz-server
```
## Step 2: Start the MusicBrainz Docker Services
The MusicBrainz server uses Docker Compose to run multiple services including PostgreSQL, Solr search, Redis, and the web server.
```bash
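# Navigate to the musicbrainz-docker directory (a separate checkout of
# https://github.com/metabrainz/musicbrainz-docker, expected next to musicbrainz-cleaner)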
cd musicbrainz-docker
# Check if port 5000 is available (common conflict on macOS)
lsof -i :5000
# If port 5000 is in use, use port 5001 instead
MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
# Or if port 5000 is free, use the default
docker-compose up -d
```
### Troubleshooting Port Conflicts
If you get a port conflict error:
```bash
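# Kill any process using port 5000 (on recent macOS this is often the AirPlay
# Receiver, which restarts itself, so switching ports is usually simpler)
lsof -ti:5000 | xargs kill -9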
# Or use a different port
MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
```
### Troubleshooting Container Conflicts
If you get container name conflicts:
```bash
# Remove existing containers
docker-compose down
# Force remove conflicting containers
docker rm -f musicbrainz-docker-db-1
# Start fresh
docker-compose up -d
```
## Step 3: Wait for Services to Start
The services take time to initialize, especially the database:
```bash
# Check service status
docker-compose ps
# Wait for all services to be healthy (this can take 5-10 minutes)
docker-compose logs -f db
```
**Important**: Wait until you see database initialization complete messages before proceeding.
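The waiting step can also be expressed as a bounded polling loop, similar in spirit to the `pg_isready` loop in `start_services.sh` (60 attempts, 10 seconds apart). The sketch below is an assumption-laden illustration: `wait_until_ready` is a hypothetical helper, and the `check` callable stands in for whatever readiness probe you use.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    max_attempts: int = 60,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() until it returns True or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        if check():
            return True
        if attempt < max_attempts:
            # Pause between attempts, but not after the final failure.
            sleep(delay_seconds)
    return False


if __name__ == "__main__":
    # Trivial probe for demonstration; replace with a real database check.
    print(wait_until_ready(lambda: True))
```

Passing `sleep` as a parameter keeps the loop testable without real delays.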
## Step 4: Verify Services Are Running
```bash
# Check all containers are running
docker-compose ps
# Test the web interface (if using port 5001)
curl http://localhost:5001
# Or if using default port 5000
curl http://localhost:5000
```
## Step 5: Set Environment Variables
Create a `.env` file in the `musicbrainz-cleaner` directory:
```bash
cd ../musicbrainz-cleaner
# Create .env file
cat > .env << EOF
# Database connection (default Docker setup)
DB_HOST=172.18.0.2
DB_PORT=5432
DB_NAME=musicbrainz_db
DB_USER=musicbrainz
DB_PASSWORD=musicbrainz
# MusicBrainz web server
MUSICBRAINZ_WEB_SERVER_PORT=5001
EOF
```
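**Note**: If you used the default port 5000, change `MUSICBRAINZ_WEB_SERVER_PORT=5001` to `MUSICBRAINZ_WEB_SERVER_PORT=5000`. `DB_HOST=172.18.0.2` is an address assigned by Docker's bridge network and can differ between machines; if the connection test fails, confirm the `db` container's address with `docker inspect musicbrainz-docker-db-1`.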
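For reference, a minimal sketch of reading a `.env` file of this shape and turning it into a PostgreSQL connection string. `parse_env` and `build_dsn` are hypothetical helpers; how the cleaner itself loads these values is not shown in this commit, so treat this as an assumption about one reasonable approach.

```python
from typing import Dict

SAMPLE_ENV = """\
# Database connection (default Docker setup)
DB_HOST=172.18.0.2
DB_PORT=5432
DB_NAME=musicbrainz_db
DB_USER=musicbrainz
DB_PASSWORD=musicbrainz
# MusicBrainz web server
MUSICBRAINZ_WEB_SERVER_PORT=5001
"""


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def build_dsn(env: Dict[str, str]) -> str:
    """Build a libpq-style keyword/value connection string."""
    return (
        f"host={env['DB_HOST']} port={env['DB_PORT']} dbname={env['DB_NAME']} "
        f"user={env['DB_USER']} password={env['DB_PASSWORD']}"
    )


if __name__ == "__main__":
    print(build_dsn(parse_env(SAMPLE_ENV)))
```

The resulting string is in the keyword/value format that `psycopg2.connect` accepts, assuming the values contain no spaces or quotes.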
## Step 6: Test the Connection
```bash
# Run a simple test to verify everything is working
docker-compose run --rm musicbrainz-cleaner python3 quick_test_20.py
```
## Service Details
The Docker Compose setup includes:
- **PostgreSQL Database** (`db`): Main MusicBrainz database
- **Solr Search** (`search`): Full-text search engine
- **Redis** (`redis`): Caching and session storage
- **Message Queue** (`mq`): Background job processing
- **MusicBrainz Web Server** (`musicbrainz`): Main web application
- **Indexer** (`indexer`): Search index maintenance
## Ports Used
- **5000/5001**: MusicBrainz web server (configurable)
- **5432**: PostgreSQL database (internal)
- **8983**: Solr search (internal)
- **6379**: Redis (internal)
- **5672**: Message queue (internal)
## Stopping Services
```bash
# Stop all services
cd ../musicbrainz-docker
docker-compose down
# To also remove volumes (WARNING: this deletes all data)
docker-compose down -v
```
## Restarting Services
```bash
# Restart all services
docker-compose restart
# Or restart specific service
docker-compose restart db
```
## Monitoring Services
```bash
# View logs for all services
docker-compose logs -f
# View logs for specific service
docker-compose logs -f db
docker-compose logs -f musicbrainz
# Check resource usage
docker stats
```
## Troubleshooting
### Database Connection Issues
```bash
# Check if database is running
docker-compose ps db
# Check database logs
docker-compose logs db
# Test database connection
docker-compose exec db psql -U musicbrainz -d musicbrainz_db -c "SELECT 1;"
```
### Memory Issues
If you encounter memory issues:
```bash
# Increase Docker memory limit in Docker Desktop settings
# Recommended: 8GB minimum, 16GB preferred
# Check current memory usage
docker stats
```
### Platform Issues (Apple Silicon)
If you're on Apple Silicon (M1/M2) and see platform warnings:
```bash
# The services will still work, but you may see warnings about platform mismatch
# This is normal and doesn't affect functionality
```
## Performance Tips
1. **Allocate sufficient memory** to Docker Desktop (8GB+ recommended)
2. **Use SSD storage** for better database performance
3. **Close other resource-intensive applications** while running the services
4. **Wait for full initialization** before running tests
## Next Steps
Once the services are running successfully:
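1. Run the quick test: `docker-compose run --rm musicbrainz-cleaner python3 quick_test_20.py`
2. Run larger tests: `docker-compose run --rm musicbrainz-cleaner python3 bulk_test_1000.py`
3. Use the cleaner on your own data: `docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --input your_file.json --output cleaned.json`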
## 🔄 After System Reboot
After restarting your Mac, you'll need to restart the MusicBrainz services:
### Quick Restart (Recommended)
```bash
# Navigate to your musicbrainz-cleaner directory (the author's path is shown; adjust for your machine)
cd /Users/mattbruce/Documents/Projects/musicbrainz-server/musicbrainz-cleaner
# If Docker Desktop is already running
./restart_services.sh
# Or manually
cd ../musicbrainz-docker && MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
```
### Full Restart (If you have issues)
```bash
# Complete setup including Docker checks
./start_services.sh
```
### Auto-start Setup (Optional)
1. **Enable Docker Desktop auto-start**:
- Open Docker Desktop
- Go to Settings → General
- Check "Start Docker Desktop when you log in"
2. **Then just run**: `./restart_services.sh` after each reboot
**Note**: Your data is preserved in Docker volumes, so you don't need to reconfigure anything after a reboot.
## Support
If you encounter issues:
1. Check the logs: `docker-compose logs -f`
2. Verify Docker has sufficient resources
3. Ensure all prerequisites are met
4. Try restarting the services: `docker-compose restart`


@ -222,6 +222,7 @@
"The Proclaimers", "The Proclaimers",
"The Stanley Brothers", "The Stanley Brothers",
"The Statler Brothers", "The Statler Brothers",
"The Tamperer featuring Maya",
"The Walker Brothers", "The Walker Brothers",
"The Wilburn Brothers", "The Wilburn Brothers",
"Thompson Twins", "Thompson Twins",

108
quick_test_20.py Normal file

@ -0,0 +1,108 @@
#!/usr/bin/env python3
"""
Quick test script for 20 random songs
Simple single-threaded approach
"""
import json
import random
import sys
import time
from pathlib import Path

# Add the src directory to the path (/app is the project root inside the container)
sys.path.insert(0, '/app')
from src.cli.main import MusicBrainzCleaner


def main():
    print('🚀 Starting quick test with 20 random songs...')
    # Load songs
    input_file = Path('data/songs.json')
    if not input_file.exists():
        print('❌ songs.json not found')
        return
    with open(input_file, 'r', encoding='utf-8') as f:
        all_songs = json.load(f)
    print(f'📊 Total songs available: {len(all_songs):,}')
    if not all_songs:
        print('❌ songs.json contains no songs')
        return
    # Take up to 20 random songs (random.sample raises ValueError on a shorter list)
    total = min(20, len(all_songs))
    sample_songs = random.sample(all_songs, total)
    print(f'🎯 Testing {total} random songs...')
    # Initialize cleaner
    cleaner = MusicBrainzCleaner()
    # Process songs
    found_artists = 0
    found_recordings = 0
    failed_songs = []
    start_time = time.time()
    for i, song in enumerate(sample_songs, 1):
        print(f' [{i:2d}/{total}] Processing: "{song.get("artist", "Unknown")}" - "{song.get("title", "Unknown")}"')
        try:
            # clean_song returns (cleaned_song, success) and updates its argument
            # in place, so pass a copy to keep the original for the report
            result, _success = cleaner.clean_song(dict(song))
            artist_found = 'mbid' in result
            recording_found = 'recording_mbid' in result
            # Count each lookup separately so the two success rates can differ
            if artist_found:
                found_artists += 1
            if recording_found:
                found_recordings += 1
            if artist_found and recording_found:
                print(' ✅ Found both artist and recording')
            else:
                failed_songs.append({
                    'original': song,
                    'cleaned': result,
                    'artist_found': artist_found,
                    'recording_found': recording_found,
                    'artist_name': song.get('artist', 'Unknown'),
                    'title': song.get('title', 'Unknown')
                })
                print(f' ❌ Artist: {artist_found}, Recording: {recording_found}')
        except Exception as e:
            print(f' 💥 Error: {e}')
            failed_songs.append({
                'original': song,
                'cleaned': {'error': str(e)},
                'artist_found': False,
                'recording_found': False,
                'artist_name': song.get('artist', 'Unknown'),
                'title': song.get('title', 'Unknown'),
                'error': str(e)
            })
    end_time = time.time()
    processing_time = end_time - start_time
    # Calculate success rates
    artist_success_rate = found_artists / total * 100
    recording_success_rate = found_recordings / total * 100
    failed_rate = len(failed_songs) / total * 100
    print('\n📊 Final Results:')
    print(f' ⏱️ Processing time: {processing_time:.2f} seconds')
    print(f' 🚀 Speed: {total/processing_time:.1f} songs/second')
    print(f' ✅ Artists found: {found_artists}/{total} ({artist_success_rate:.1f}%)')
    print(f' ✅ Recordings found: {found_recordings}/{total} ({recording_success_rate:.1f}%)')
    print(f' ❌ Failed songs: {len(failed_songs)} ({failed_rate:.1f}%)')
    # Show failed songs
    if failed_songs:
        print('\n🔍 Failed songs:')
        for i, failed in enumerate(failed_songs, 1):
            print(f' [{i}] "{failed["artist_name"]}" - "{failed["title"]}"')
            print(f' Artist found: {failed["artist_found"]}, Recording found: {failed["recording_found"]}')
            if 'error' in failed:
                print(f' Error: {failed["error"]}')
    else:
        print('\n🎉 All songs processed successfully!')


if __name__ == '__main__':
    main()

19
restart_services.sh Executable file

@ -0,0 +1,19 @@
#!/bin/bash
# Quick restart script for after Mac reboots
# This assumes Docker Desktop is already running
echo "🔄 Restarting MusicBrainz services..."
# Navigate to musicbrainz-docker
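cd ../musicbrainz-docker || { echo "❌ ../musicbrainz-docker not found; run this script from the musicbrainz-cleaner directory"; exit 1; }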
# Start services
MUSICBRAINZ_WEB_SERVER_PORT=5001 docker-compose up -d
echo "✅ Services started!"
echo "⏳ Database may take 5-10 minutes to fully initialize"
echo ""
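# The script's cd does not change the caller's shell, so the hints below
# assume you are still in the musicbrainz-cleaner directory
echo "📊 Check status: cd ../musicbrainz-docker && docker-compose ps"
echo "📋 View logs: cd ../musicbrainz-docker && docker-compose logs -f db"
echo "🧪 Test when ready: docker-compose run --rm musicbrainz-cleaner python3 quick_test_20.py"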


@ -276,8 +276,12 @@ class MusicBrainzCleaner:
return collaborators return collaborators
def clean_song(self, song: Dict[str, Any]) -> Dict[str, Any]: def clean_song(self, song: Dict[str, Any]) -> Tuple[Dict[str, Any], bool]:
print(f"Processing: {song.get('artist', 'Unknown')} - {song.get('title', 'Unknown')}") """
Clean a single song and return (cleaned_song, success_status)
"""
original_artist = song.get('artist', '')
original_title = song.get('title', '')
# Find artist MBID # Find artist MBID
artist_mbid = self.find_artist_mbid(song.get('artist', '')) artist_mbid = self.find_artist_mbid(song.get('artist', ''))
@ -289,13 +293,11 @@ class MusicBrainzCleaner:
has_collaboration = len(collaborators) > 0 has_collaboration = len(collaborators) > 0
if artist_mbid is None and has_collaboration: if artist_mbid is None and has_collaboration:
print(f" 🎯 Collaboration detected: {song.get('artist')}")
# Try to find recording using artist credit approach # Try to find recording using artist credit approach
if self.use_database: if self.use_database:
result = self.db.find_artist_credit(song.get('artist', ''), song.get('title', '')) result = self.db.find_artist_credit(song.get('artist', ''), song.get('title', ''))
if result: if result:
artist_credit_id, artist_string, recording_mbid = result artist_credit_id, artist_string, recording_mbid = result
print(f" ✅ Found recording: {song.get('title')} (MBID: {recording_mbid})")
# Update with the correct artist credit # Update with the correct artist credit
song['artist'] = artist_string song['artist'] = artist_string
@ -309,11 +311,9 @@ class MusicBrainzCleaner:
if artist_result and isinstance(artist_result, tuple) and len(artist_result) >= 2: if artist_result and isinstance(artist_result, tuple) and len(artist_result) >= 2:
song['mbid'] = artist_result[1] # Set the main artist's MBID song['mbid'] = artist_result[1] # Set the main artist's MBID
print(f" ✅ Updated to: {song['artist']} - {song.get('title')}") return song, True
return song
else: else:
print(f" ❌ Could not find recording: {song.get('title')}") return song, False
return song
else: else:
# Fallback to API method # Fallback to API method
recording_mbid = self.find_recording_mbid(None, song.get('title', '')) recording_mbid = self.find_recording_mbid(None, song.get('title', ''))
@ -323,37 +323,29 @@ class MusicBrainzCleaner:
artist_string = self._build_artist_string(recording_info['artist-credit']) artist_string = self._build_artist_string(recording_info['artist-credit'])
if artist_string: if artist_string:
song['artist'] = artist_string song['artist'] = artist_string
print(f" ✅ Updated to: {song['artist']} - {recording_info['title']}")
song['title'] = recording_info['title'] song['title'] = recording_info['title']
song['recording_mbid'] = recording_mbid song['recording_mbid'] = recording_mbid
return song return song, True
else: return song, False
print(f" ❌ Could not find recording: {song.get('title')}")
return song
# Regular case (non-collaboration or collaboration not found) # Regular case (non-collaboration or collaboration not found)
if not artist_mbid: if not artist_mbid:
print(f" ❌ Could not find artist: {song.get('artist')}") return song, False
return song
# Get artist info # Get artist info
artist_info = self.get_artist_info(artist_mbid) artist_info = self.get_artist_info(artist_mbid)
if artist_info: if artist_info:
print(f" ✅ Found artist: {artist_info['name']} (MBID: {artist_mbid})")
song['artist'] = artist_info['name'] song['artist'] = artist_info['name']
song['mbid'] = artist_mbid song['mbid'] = artist_mbid
# Find recording MBID # Find recording MBID
recording_mbid = self.find_recording_mbid(artist_mbid, song.get('title', '')) recording_mbid = self.find_recording_mbid(artist_mbid, song.get('title', ''))
if not recording_mbid: if not recording_mbid:
print(f" ❌ Could not find recording: {song.get('title')}") return song, False
return song
# Get recording info # Get recording info
recording_info = self.get_recording_info(recording_mbid) recording_info = self.get_recording_info(recording_mbid)
if recording_info: if recording_info:
print(f" ✅ Found recording: {recording_info['title']} (MBID: {recording_mbid})")
# Update artist string if there are multiple artists, but preserve the artist MBID # Update artist string if there are multiple artists, but preserve the artist MBID
if self.use_database and recording_info.get('artist_credit'): if self.use_database and recording_info.get('artist_credit'):
song['artist'] = recording_info['artist_credit'] song['artist'] = recording_info['artist_credit']
@ -370,11 +362,11 @@ class MusicBrainzCleaner:
song['title'] = recording_info['title'] song['title'] = recording_info['title']
song['recording_mbid'] = recording_mbid song['recording_mbid'] = recording_mbid
return song, True
print(f" ✅ Updated to: {song['artist']} - {song['title']}") return song, False
return song
def clean_songs_file(self, input_file: Path, output_file: Optional[Path] = None, limit: Optional[int] = None) -> Path: def clean_songs_file(self, input_file: Path, output_file: Optional[Path] = None, limit: Optional[int] = None) -> Tuple[Path, List[Dict]]:
try: try:
# Read input file # Read input file
with open(input_file, 'r', encoding='utf-8') as f: with open(input_file, 'r', encoding='utf-8') as f:
@ -382,7 +374,7 @@ class MusicBrainzCleaner:
if not isinstance(songs, list): if not isinstance(songs, list):
print("Error: Input file should contain a JSON array of songs") print("Error: Input file should contain a JSON array of songs")
return input_file return input_file, []
# Apply limit if specified # Apply limit if specified
if limit is not None: if limit is not None:
@ -399,11 +391,31 @@ class MusicBrainzCleaner:
# Clean each song # Clean each song
cleaned_songs = [] cleaned_songs = []
failed_songs = []
success_count = 0
fail_count = 0
for i, song in enumerate(songs, 1): for i, song in enumerate(songs, 1):
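print(f"\n[{i}/{len(songs)}]", end=" ") cleaned_song, success = self.clean_song(dict(song))  # pass a copy: clean_song updates its argument in place, and the failure report below reads the original artist/title from `song`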
cleaned_song = self.clean_song(song)
cleaned_songs.append(cleaned_song) cleaned_songs.append(cleaned_song)
if success:
success_count += 1
print(f"[{i}/{len(songs)}] ✅ PASS")
else:
fail_count += 1
print(f"[{i}/{len(songs)}] ❌ FAIL")
# Store failed song info for report
failed_songs.append({
'index': i,
'original_artist': song.get('artist', ''),
'original_title': song.get('title', ''),
'cleaned_artist': cleaned_song.get('artist', ''),
'cleaned_title': cleaned_song.get('title', ''),
'has_mbid': 'mbid' in cleaned_song,
'has_recording_mbid': 'recording_mbid' in cleaned_song
})
# Only add delay for API calls, not database queries # Only add delay for API calls, not database queries
if not self.use_database: if not self.use_database:
time.sleep(API_REQUEST_DELAY) time.sleep(API_REQUEST_DELAY)
@ -412,21 +424,37 @@ class MusicBrainzCleaner:
with open(output_file, 'w', encoding='utf-8') as f: with open(output_file, 'w', encoding='utf-8') as f:
json.dump(cleaned_songs, f, indent=2, ensure_ascii=False) json.dump(cleaned_songs, f, indent=2, ensure_ascii=False)
print(f"\n{PROGRESS_SEPARATOR}") # Generate failure report
print(SUCCESS_MESSAGES['processing_complete']) report_file = input_file.parent / f"{input_file.stem}_failure_report.json"
print(SUCCESS_MESSAGES['output_saved'].format(file_path=output_file)) with open(report_file, 'w', encoding='utf-8') as f:
json.dump({
'summary': {
'total_songs': len(songs),
'successful': success_count,
'failed': fail_count,
'success_rate': f"{(success_count / len(songs) * 100 if songs else 0.0):.1f}%"  # guard against an empty input list
},
'failed_songs': failed_songs
}, f, indent=2, ensure_ascii=False)
return output_file print(f"\n{PROGRESS_SEPARATOR}")
print(f"✅ SUCCESS: {success_count} songs")
print(f"❌ FAILED: {fail_count} songs")
print(f"📊 SUCCESS RATE: {(success_count / len(songs) * 100 if songs else 0.0):.1f}%")
print(f"💾 CLEANED DATA: {output_file}")
print(f"📋 FAILURE REPORT: {report_file}")
return output_file, failed_songs
except FileNotFoundError: except FileNotFoundError:
print(f"Error: File '{input_file}' not found") print(f"Error: File '{input_file}' not found")
return input_file return input_file, []
except json.JSONDecodeError: except json.JSONDecodeError:
print(f"Error: Invalid JSON in file '{input_file}'") print(f"Error: Invalid JSON in file '{input_file}'")
return input_file return input_file, []
except Exception as e: except Exception as e:
print(f"Error processing file: {e}") print(f"Error processing file: {e}")
return input_file return input_file, []
finally: finally:
# Clean up database connection # Clean up database connection
if self.use_database and hasattr(self, 'db'): if self.use_database and hasattr(self, 'db'):
@ -601,7 +629,7 @@ def main() -> int:
# Process the file # Process the file
cleaner = MusicBrainzCleaner(use_database=use_database) cleaner = MusicBrainzCleaner(use_database=use_database)
result_path = cleaner.clean_songs_file(input_file, output_file, limit) result_path, failed_songs = cleaner.clean_songs_file(input_file, output_file, limit)
return ExitCode.SUCCESS return ExitCode.SUCCESS
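The new `(cleaned_song, success)` contract has one subtlety worth illustrating: in the diff above, `clean_song` assigns to `song['artist']`, `song['title']`, and the MBID keys on the dictionary it receives, so the caller's object changes too. The sketch below shows one way to keep the original values for a failure report by taking a snapshot first. `clean_song_stub` and `run_batch` are hypothetical stand-ins, not the project's code, and the MBID value is a placeholder.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

Song = Dict[str, Any]


def clean_song_stub(song: Song) -> Tuple[Song, bool]:
    """Hypothetical stand-in for clean_song: updates `song` in place.

    It simulates the case where the artist resolves but the recording
    does not, so the call reports failure after changing the artist.
    """
    song["artist"] = song["artist"].title()   # pretend canonical name
    song["mbid"] = "artist-mbid-placeholder"  # placeholder, not a real MBID
    return song, False


def run_batch(songs: List[Song],
              clean: Callable[[Song], Tuple[Song, bool]] = clean_song_stub) -> Dict[str, Any]:
    """Clean every song and build a summary shaped like the failure report."""
    failed: List[Dict[str, Any]] = []
    for index, song in enumerate(songs, 1):
        original = copy.deepcopy(song)  # snapshot before the in-place update
        result, ok = clean(song)
        if not ok:
            failed.append({
                "index": index,
                "original_artist": original.get("artist", ""),
                "cleaned_artist": result.get("artist", ""),
                "has_mbid": "mbid" in result,
                "has_recording_mbid": "recording_mbid" in result,
            })
    total = len(songs)
    successful = total - len(failed)
    rate = successful / total * 100 if total else 0.0  # avoid dividing by zero
    return {
        "summary": {"total_songs": total, "successful": successful,
                    "failed": len(failed), "success_rate": f"{rate:.1f}%"},
        "failed_songs": failed,
    }


if __name__ == "__main__":
    print(run_batch([{"artist": "the proclaimers", "title": "unknown b-side"}]))
```

Without the snapshot, `original_artist` and `cleaned_artist` would hold the same value whenever the artist lookup succeeds but the recording lookup fails.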

157
start_services.sh Executable file

@ -0,0 +1,157 @@
#!/bin/bash
# MusicBrainz Cleaner - Quick Start Script
# This script automates the startup of MusicBrainz services
set -e
echo "🚀 Starting MusicBrainz services..."
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Function to print colored output
print_status() {
echo -e "${BLUE}[INFO]${NC} $1"
}
print_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
print_warning() {
echo -e "${YELLOW}[WARNING]${NC} $1"
}
print_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
# Check if Docker is running
if ! docker info > /dev/null 2>&1; then
print_error "Docker is not running. Please start Docker Desktop first."
exit 1
fi
print_success "Docker is running"
# Check if we're in the right directory
if [ ! -f "docker-compose.yml" ]; then
print_error "This script must be run from the musicbrainz-cleaner directory"
exit 1
fi
# Check if musicbrainz-docker directory exists
if [ ! -d "../musicbrainz-docker" ]; then
print_error "musicbrainz-docker directory not found. It is expected next to musicbrainz-cleaner (../musicbrainz-docker)."
exit 1
fi
# Navigate to musicbrainz-docker
cd ../musicbrainz-docker
# Stop any existing containers before the port check, so that a MusicBrainz
# web container left running on port 5000 is not mistaken for a conflict
print_status "Stopping existing containers..."
docker-compose down > /dev/null 2>&1 || true
# Remove any conflicting containers
print_status "Cleaning up conflicting containers..."
docker rm -f musicbrainz-docker-db-1 > /dev/null 2>&1 || true
print_status "Checking for port conflicts..."
# Check if port 5000 is available
if lsof -i :5000 > /dev/null 2>&1; then
    print_warning "Port 5000 is in use. Using port 5001 instead."
    PORT=5001
else
    print_success "Port 5000 is available"
    PORT=5000
fi
# Start services
print_status "Starting MusicBrainz services on port $PORT..."
MUSICBRAINZ_WEB_SERVER_PORT=$PORT docker-compose up -d
print_success "Services started successfully!"
# Wait for database to be ready
print_status "Waiting for database to initialize (this may take 5-10 minutes)..."
print_status "You can monitor progress with: docker-compose logs -f db"
# Check if database is ready
attempts=0
max_attempts=60
while [ $attempts -lt $max_attempts ]; do
if docker-compose exec -T db pg_isready -U musicbrainz > /dev/null 2>&1; then
print_success "Database is ready!"
break
fi
attempts=$((attempts + 1))
print_status "Waiting for database... (attempt $attempts/$max_attempts)"
sleep 10
done
if [ $attempts -eq $max_attempts ]; then
print_warning "Database may still be initializing. You can check status with: docker-compose logs db"
fi
# Create .env file in musicbrainz-cleaner directory
cd ../musicbrainz-cleaner
print_status "Creating environment configuration..."
cat > .env << EOF
# Database connection (default Docker setup)
DB_HOST=172.18.0.2
DB_PORT=5432
DB_NAME=musicbrainz_db
DB_USER=musicbrainz
DB_PASSWORD=musicbrainz
# MusicBrainz web server
MUSICBRAINZ_WEB_SERVER_PORT=$PORT
EOF
print_success "Environment configuration created"
# Test connection
print_status "Testing connection..."
if docker-compose run --rm musicbrainz-cleaner python3 -c "
import sys
sys.path.insert(0, '/app')
from src.api.database import MusicBrainzDatabase
try:
    db = MusicBrainzDatabase()
    print('✅ Database connection successful')
except Exception as e:
    print(f'❌ Database connection failed: {e}')
    sys.exit(1)
" 2>/dev/null; then
print_success "Connection test passed!"
else
print_warning "Connection test failed. Services may still be initializing."
fi
echo ""
print_success "MusicBrainz services are now running!"
echo ""
echo "📊 Service Status:"
echo " - Web Server: http://localhost:$PORT"
echo " - Database: PostgreSQL (internal)"
echo " - Search: Solr (internal)"
echo ""
echo "🧪 Next steps:"
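echo " 1. Run quick test: docker-compose run --rm musicbrainz-cleaner python3 quick_test_20.py"
echo " 2. Run larger test: docker-compose run --rm musicbrainz-cleaner python3 bulk_test_1000.py"
echo " 3. Use cleaner: docker-compose run --rm musicbrainz-cleaner python3 -m src.cli.main --input your_file.json --output cleaned.json"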
echo ""
echo "📋 Useful commands:"
echo " - View logs: cd ../musicbrainz-docker && docker-compose logs -f"
echo " - Stop services: cd ../musicbrainz-docker && docker-compose down"
echo " - Check status: cd ../musicbrainz-docker && docker-compose ps"
echo ""