Signed-off-by: Matt Bruce <mbrucedogs@gmail.com>

This commit is contained in:
Matt Bruce 2025-08-01 10:01:41 -05:00
parent 0f33590eca
commit b794d9dc1c
9 changed files with 252 additions and 157 deletions

View File

@ -2,7 +2,7 @@
## Overview
The MusicBrainz Data Cleaner is a command-line interface (CLI) tool that processes JSON song data files and cleans/normalizes the metadata using the MusicBrainz database. The tool creates separate output files for successful and failed songs, along with detailed processing reports.
The MusicBrainz Data Cleaner is a command-line interface (CLI) tool that processes JSON song data files and cleans/normalizes the metadata using the MusicBrainz database. The tool uses an interface-based architecture with dependency injection for clean, maintainable code. It creates separate output files for successful and failed songs, along with detailed processing reports.
## Basic Command Structure
@ -277,6 +277,24 @@ Error: Source file should contain a JSON array of songs
ModuleNotFoundError: No module named 'requests'
```
## Architecture Overview
### Interface-Based Design
The tool uses a clean interface-based architecture:
- **`MusicBrainzDataProvider` Interface**: Common protocol for data access
- **`DataProviderFactory`**: Creates appropriate provider (database or API)
- **`SongProcessor`**: Centralized processing logic using the interface
- **Dependency Injection**: CLI depends on interfaces, not concrete classes
### Data Flow
1. **CLI** uses `DataProviderFactory` to create data provider
2. **Factory** returns either database or API implementation
3. **SongProcessor** processes songs using the common interface
4. **Same logic** works regardless of provider type
## Environment Configuration
### Docker Environment

60
PRD.md
View File

@ -162,6 +162,7 @@ Users have song data in JSON format with inconsistent artist names, song titles,
- **Primary:** Direct PostgreSQL database access
- **Fallback:** MusicBrainz REST API (local server)
- **Interface:** Command-line (CLI)
- **Design Pattern:** Interface-based architecture with dependency injection
### Project Structure
```
@ -169,20 +170,27 @@ src/
├── __init__.py # Package initialization
├── api/ # API-related modules
│ ├── __init__.py
│ ├── database.py # Direct PostgreSQL access with fuzzy search
│ └── api_client.py # Legacy HTTP API client (fallback)
│ ├── database.py # Direct PostgreSQL access (implements MusicBrainzDataProvider)
│ └── api_client.py # HTTP API client (implements MusicBrainzDataProvider)
├── cli/ # Command-line interface
│ ├── __init__.py
│ └── main.py # Main CLI implementation
│ └── main.py # Main CLI implementation (uses factory pattern)
├── config/ # Configuration
│ ├── __init__.py
│ └── constants.py # Constants and settings
├── core/ # Core functionality
│ ├── __init__.py
│ ├── interfaces.py # Common interfaces and protocols
│ ├── factory.py # Data provider factory
│ └── song_processor.py # Centralized song processing logic
├── tests/ # Test files and scripts
│ ├── __init__.py
│ ├── test_*.py # Unit and integration tests
│ └── debug_*.py # Debug scripts
└── utils/ # Utility functions
├── __init__.py
├── artist_title_processing.py # Shared artist/title processing
└── data_loader.py # Data loading utilities
```
### Architectural Principles
@ -193,13 +201,18 @@ src/
- **Error Handling**: Graceful error handling with meaningful messages
- **Performance First**: Direct database access for maximum speed
- **Fallback Strategy**: Automatic fallback to API when database unavailable
- **NEW**: **Database-First**: Always use live database data over static caches
- **NEW**: **Intelligent Collaboration Detection**: Distinguish band names from collaborations
- **NEW**: **Test Organization**: All test files must be placed in `src/tests/` directory, not in root
- **Interface-Based Design**: Uses dependency injection with common interfaces
- **Factory Pattern**: Clean provider creation and configuration
- **Single Responsibility**: Each class has one clear purpose
- **Database-First**: Always use live database data over static caches
- **Intelligent Collaboration Detection**: Distinguish band names from collaborations
- **Test Organization**: All test files must be placed in `src/tests/` directory, not in root
### Data Flow
1. Read JSON input file
2. For each song:
1. **CLI** uses `DataProviderFactory` to create appropriate data provider (database or API)
2. **SongProcessor** receives the data provider and processes songs using the common interface
3. **Data Provider** (database or API) implements the same interface for consistent behavior
4. For each song:
- Clean artist name using name variations
- Detect collaboration patterns
- Use fuzzy search to find artist in database (including aliases, sort_names)
@ -207,7 +220,7 @@ src/
- For collaborations: find artist_credit and recording
- For single artists: find recording by artist and title
- Update song object with corrected data and MBIDs
3. Write cleaned data to output file
5. Write cleaned data to output file
### Fuzzy Search Implementation
- **Algorithm**: Uses fuzzywuzzy library with multiple matching strategies
@ -316,18 +329,23 @@ docker-compose logs -f musicbrainz
- [x] Fuzzy search for artists and recordings
- [x] Automatic fallback to API mode
- [x] Performance optimizations
- [x] **NEW**: Advanced collaboration detection and handling
- [x] **NEW**: Artist alias and sort_name search
- [x] **NEW**: Dash variation handling
- [x] **NEW**: Numerical suffix handling
- [x] **NEW**: Band name vs collaboration distinction
- [x] **NEW**: Complex collaboration parsing
- [x] **NEW**: Removed problematic known_artists cache
- [x] **NEW**: Simplified CLI with default full dataset processing
- [x] **NEW**: Separate output files for successful and failed songs (array format)
- [x] **NEW**: Human-readable text reports with statistics
- [x] **NEW**: Smart defaults for all file paths and options
- [x] **NEW**: Configurable processing limits and output file paths
- [x] Advanced collaboration detection and handling
- [x] Artist alias and sort_name search
- [x] Dash variation handling
- [x] Numerical suffix handling
- [x] Band name vs collaboration distinction
- [x] Complex collaboration parsing
- [x] Removed problematic known_artists cache
- [x] Simplified CLI with default full dataset processing
- [x] Separate output files for successful and failed songs (array format)
- [x] Human-readable text reports with statistics
- [x] Smart defaults for all file paths and options
- [x] Configurable processing limits and output file paths
- [x] **NEW**: Interface-based architecture with dependency injection
- [x] **NEW**: Factory pattern for data provider creation
- [x] **NEW**: Centralized song processing logic
- [x] **NEW**: Common interfaces for database and API clients
- [x] **NEW**: Clean separation of concerns
### 🔄 Future Enhancements
- [ ] Web interface option

View File

@ -1,6 +1,6 @@
# 🎵 MusicBrainz Data Cleaner v3.0
A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!**
A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with interface-based architecture, advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!**
## 🚀 Quick Start for New Sessions
@ -47,6 +47,8 @@ docker-compose run --rm musicbrainz-cleaner python3 [script_name].py
## ✨ What's New in v3.0
- **🏗️ Interface-Based Architecture**: Clean dependency injection with common interfaces
- **🏭 Factory Pattern**: Smart data provider creation and configuration
- **🚀 Direct Database Access**: Connect directly to PostgreSQL for 10x faster performance
- **🎯 Advanced Fuzzy Search**: Intelligent matching for similar artist names and song titles
- **🔄 Automatic Fallback**: Falls back to API mode if database access fails
@ -299,11 +301,19 @@ python3 src/tests/run_tests.py --list
musicbrainz-cleaner/
├── src/
│ ├── api/ # Database and API access
│ │ ├── database.py # Direct PostgreSQL access (implements MusicBrainzDataProvider)
│ │ └── api_client.py # HTTP API client (implements MusicBrainzDataProvider)
│ ├── cli/ # Command-line interface
│ │ └── main.py # Main CLI implementation (uses factory pattern)
│ ├── config/ # Configuration and constants
│ ├── core/ # Core functionality
│ │ ├── interfaces.py # Common interfaces and protocols
│ │ ├── factory.py # Data provider factory
│ │ └── song_processor.py # Centralized song processing logic
│ ├── tests/ # Test files (REQUIRED location)
│ └── utils/ # Utility functions
│ ├── artist_title_processing.py # Shared artist/title processing
│ └── data_loader.py # Data loading utilities
├── data/ # Data files and output
│ ├── known_artists.json # Name variations (ACDC → AC/DC)
│ ├── known_recordings.json # Known recording MBIDs

View File

@ -1,19 +1,20 @@
"""
Legacy HTTP API client for MusicBrainz Data Cleaner.
Used as fallback when direct database access is not available.
HTTP API client for MusicBrainz Data Cleaner.
Implements the MusicBrainzDataProvider interface for API-based access.
"""
import requests
import time
from typing import Dict, Optional, Any
from typing import Dict, Optional, Any, Tuple
from ..config.constants import (
DEFAULT_MUSICBRAINZ_URL, API_REQUEST_DELAY, REQUEST_TIMEOUT,
SUCCESS_MESSAGES, ERROR_MESSAGES
)
from ..core.interfaces import MusicBrainzDataProvider
class MusicBrainzAPIClient:
"""Legacy HTTP API client for MusicBrainz (fallback option)."""
class MusicBrainzAPIClient(MusicBrainzDataProvider):
"""HTTP API client for MusicBrainz implementing the data provider interface."""
def __init__(self, base_url: str = DEFAULT_MUSICBRAINZ_URL):
self.base_url = base_url
@ -30,8 +31,8 @@ class MusicBrainzAPIClient:
print(f"API connection test failed: {e}")
return False
def search_artist(self, artist_name: str) -> Optional[Dict[str, Any]]:
"""Search for artist by name using API."""
def fuzzy_search_artist(self, artist_name: str) -> Optional[Tuple[str, str, float]]:
"""Search for artist by name using API. Returns (artist_name, mbid, score)."""
try:
url = f"{self.base_url}/ws/2/artist/?query=name:{artist_name}&fmt=json"
response = self.session.get(url)
@ -39,14 +40,15 @@ class MusicBrainzAPIClient:
data = response.json()
if data.get('artists') and len(data['artists']) > 0:
return data['artists'][0]
artist = data['artists'][0]
return (artist['name'], artist['id'], 1.0) # Perfect match for API
return None
except Exception as e:
print(f"API search failed for artist '{artist_name}': {e}")
return None
def search_recording(self, title: str, artist_mbid: Optional[str] = None) -> Optional[Dict[str, Any]]:
"""Search for recording by title and optionally artist using API."""
def fuzzy_search_recording(self, title: str, artist_mbid: Optional[str] = None) -> Optional[Tuple[str, str, float]]:
"""Search for recording by title and optionally artist using API. Returns (recording_name, mbid, score)."""
try:
if artist_mbid:
url = f"{self.base_url}/ws/2/recording/?query=arid:{artist_mbid}%20AND%20name:{title}&fmt=json"
@ -58,7 +60,8 @@ class MusicBrainzAPIClient:
data = response.json()
if data.get('recordings') and len(data['recordings']) > 0:
return data['recordings'][0]
recording = data['recordings'][0]
return (recording['title'], recording['id'], 1.0) # Perfect match for API
return None
except Exception as e:
print(f"API search failed for recording '{title}': {e}")

View File

@ -13,10 +13,11 @@ from ..config.constants import (
TITLE_SIMILARITY_THRESHOLD, ARTIST_SIMILARITY_THRESHOLD,
SUCCESS_MESSAGES, ERROR_MESSAGES
)
from src.utils.artist_title_processing import parse_complex_collaboration, parse_collaborators, generate_title_variations
from ..utils.artist_title_processing import parse_complex_collaboration, parse_collaborators, generate_title_variations
from ..core.interfaces import MusicBrainzDataProvider
class MusicBrainzDatabase:
class MusicBrainzDatabase(MusicBrainzDataProvider):
"""Direct PostgreSQL database access for MusicBrainz with fuzzy search."""
def __init__(self, host: str = DB_HOST, port: int = DB_PORT,

View File

@ -23,35 +23,23 @@ from ..config.constants import (
# Import database and API clients
from ..api.database import MusicBrainzDatabase
from ..api.api_client import MusicBrainzAPIClient
# Import core components
from ..core.song_processor import SongProcessor
from ..core.factory import DataProviderFactory
class MusicBrainzCleaner:
"""Enhanced MusicBrainz Cleaner with direct database access and fuzzy search."""
"""Enhanced MusicBrainz Cleaner with interface-based data access."""
def __init__(self, use_database: bool = True, base_url: str = DEFAULT_MUSICBRAINZ_URL):
self.use_database = use_database
self.base_url = base_url
# Initialize database connection (primary method)
if use_database:
self.db = MusicBrainzDatabase()
if not self.db.connect():
print("⚠️ Database connection failed, falling back to API")
self.use_database = False
# Create data provider using factory
self.data_provider = DataProviderFactory.create_provider(use_database, base_url)
# Initialize API client (fallback method)
if not self.use_database:
self.api = MusicBrainzAPIClient(base_url)
# Initialize centralized song processor
self.song_processor = SongProcessor(
database_client=self.db if use_database else None,
api_client=self.api if not use_database else None
)
# Initialize centralized song processor with the data provider
self.song_processor = SongProcessor(self.data_provider)
def clean_song(self, song: Dict[str, Any]) -> Tuple[Dict[str, Any], bool]:
"""
@ -394,22 +382,15 @@ def main() -> int:
# Handle test connection
if parsed['test_connection']:
if parsed['use_api']:
api = MusicBrainzAPIClient()
if api.test_connection():
print("✅ Connection to MusicBrainz API server successful")
return ExitCode.SUCCESS
else:
print("❌ Connection to MusicBrainz API server failed")
return ExitCode.ERROR
provider = DataProviderFactory.create_provider(not parsed['use_api'])
if provider.test_connection():
provider_type = "API server" if parsed['use_api'] else "database"
print(f"✅ Connection to MusicBrainz {provider_type} successful")
return ExitCode.SUCCESS
else:
db = MusicBrainzDatabase()
if db.test_connection():
print("✅ Connection to MusicBrainz database successful")
return ExitCode.SUCCESS
else:
print("❌ Connection to MusicBrainz database failed")
return ExitCode.ERROR
provider_type = "API server" if parsed['use_api'] else "database"
print(f"❌ Connection to MusicBrainz {provider_type} failed")
return ExitCode.ERROR
# Process songs (main functionality)
source_file = Path(parsed['source'])

36
src/core/factory.py Normal file
View File

@ -0,0 +1,36 @@
"""
Factory for creating MusicBrainz data providers.
Provides a clean way to instantiate the appropriate data provider based on configuration.
"""
from typing import Optional
from ..api.database import MusicBrainzDatabase
from ..api.api_client import MusicBrainzAPIClient
from .interfaces import MusicBrainzDataProvider
from ..config.constants import DEFAULT_MUSICBRAINZ_URL
class DataProviderFactory:
"""Factory for creating MusicBrainz data providers."""
@staticmethod
def create_provider(use_database: bool = True, base_url: str = DEFAULT_MUSICBRAINZ_URL) -> MusicBrainzDataProvider:
"""
Create a data provider based on configuration.
Args:
use_database: Whether to use database (True) or API (False)
base_url: Base URL for API client (only used if use_database=False)
Returns:
MusicBrainzDataProvider instance
"""
if use_database:
provider = MusicBrainzDatabase()
if provider.connect():
return provider
else:
print("⚠️ Database connection failed, falling back to API")
return MusicBrainzAPIClient(base_url)
else:
return MusicBrainzAPIClient(base_url)

60
src/core/interfaces.py Normal file
View File

@ -0,0 +1,60 @@
"""
Core interfaces for MusicBrainz Data Cleaner.
Defines the common protocol that all data access implementations must follow.
"""
from abc import ABC, abstractmethod
from typing import Dict, Optional, Any, Tuple, List
class MusicBrainzDataProvider(ABC):
"""
Abstract base class defining the interface for MusicBrainz data access.
Both database and API implementations must implement these methods.
"""
@abstractmethod
def test_connection(self) -> bool:
"""Test connection to the data source."""
pass
@abstractmethod
def fuzzy_search_artist(self, artist_name: str) -> Optional[Tuple[str, str, float]]:
"""
Fuzzy search for artist by name.
Returns (artist_name, mbid, similarity_score) or None.
"""
pass
@abstractmethod
def fuzzy_search_recording(self, title: str, artist_mbid: Optional[str] = None) -> Optional[Tuple[str, str, float]]:
"""
Fuzzy search for recording by title and optionally artist.
Returns (recording_name, mbid, similarity_score) or None.
"""
pass
@abstractmethod
def get_artist_info(self, mbid: str) -> Optional[Dict[str, Any]]:
"""Get detailed artist information by MBID."""
pass
@abstractmethod
def get_recording_info(self, mbid: str) -> Optional[Dict[str, Any]]:
"""Get detailed recording information by MBID."""
pass
class SongProcessorInterface(ABC):
"""
Abstract base class defining the interface for song processing.
This ensures consistent behavior across different implementations.
"""
@abstractmethod
def clean_song(self, song: Dict[str, Any]) -> Tuple[Dict[str, Any], bool]:
"""
Clean a single song.
Returns (cleaned_song, success_status)
"""
pass

View File

@ -7,22 +7,20 @@ between CLI and database interactions to ensure consistency.
from typing import Dict, Optional, Tuple, Any, List
from ..utils.artist_title_processing import parse_complex_collaboration, generate_title_variations
from ..utils.data_loader import data_loader
from .interfaces import MusicBrainzDataProvider, SongProcessorInterface
class SongProcessor:
class SongProcessor(SongProcessorInterface):
"""Centralized song processing with consistent logic across all interfaces."""
def __init__(self, database_client=None, api_client=None):
def __init__(self, data_provider: MusicBrainzDataProvider):
"""
Initialize with optional database and API clients.
Initialize with a data provider that implements MusicBrainzDataProvider.
Args:
database_client: MusicBrainzDatabase instance for direct DB access
api_client: MusicBrainzAPIClient instance for API fallback
data_provider: Instance implementing MusicBrainzDataProvider interface
"""
self.db = database_client
self.api = api_client
self.use_database = database_client is not None
self.data_provider = data_provider
def find_artist_mbid(self, artist_name: str) -> Optional[str]:
"""
@ -37,20 +35,9 @@ class SongProcessor:
if not artist_name:
return None
# Try database first if available
if self.use_database:
result = self.db.fuzzy_search_artist(artist_name)
if result and isinstance(result, tuple) and len(result) >= 2:
return result[1] # Return MBID from tuple (artist_name, mbid, score)
# Fallback to API
if self.api:
try:
result = self.api.search_artist(artist_name)
if result:
return result['id']
except:
pass
result = self.data_provider.fuzzy_search_artist(artist_name)
if result and isinstance(result, tuple) and len(result) >= 2:
return result[1] # Return MBID from tuple (artist_name, mbid, score)
return None
@ -83,70 +70,52 @@ class SongProcessor:
return recording_mbid
# Handle collaborations using artist credit
if self.use_database:
# If no artist_mbid (collaboration case), try to find by title and verify artist credit
if not artist_mbid and original_artist:
# This is a collaboration case, try to find by title with all variations
for variation in title_variations:
result = self.db.fuzzy_search_recording(variation)
if result and isinstance(result, tuple) and len(result) >= 2:
recording_mbid = result[1]
# Verify that this recording has the correct artist credit
recording_info = self.get_recording_info(recording_mbid)
if recording_info and recording_info.get('artist_credit'):
# Check if the artist credit matches our expected collaboration
expected_artist_string = original_artist.replace(',', ' & ').replace(' and ', ' & ')
if recording_info['artist_credit'].lower() == expected_artist_string.lower():
return recording_mbid
# If exact match fails, try partial match
if recording_info and recording_info.get('artist_credit'):
# Check if all artists in the collaboration are present in the recording
main_artist, collaborators = parse_complex_collaboration(original_artist)
recording_artists = recording_info['artist_credit'].lower()
# Check if main artist is in the recording
if main_artist.lower() in recording_artists:
# Check if at least one collaborator is also present
for collaborator in collaborators:
if collaborator.lower() in recording_artists:
return recording_mbid
return None
else:
# Regular case with artist_mbid - try all title variations
for variation in title_variations:
result = self.db.fuzzy_search_recording(variation, artist_mbid)
if result and isinstance(result, tuple) and len(result) >= 2:
return result[1] # Return MBID from tuple (recording_name, mbid, score)
else:
# Fallback to API - try all title variations
# If no artist_mbid (collaboration case), try to find by title and verify artist credit
if not artist_mbid and original_artist:
# This is a collaboration case, try to find by title with all variations
for variation in title_variations:
try:
result = self.api.search_recording(variation, artist_mbid)
if result:
return result['id']
except:
pass
result = self.data_provider.fuzzy_search_recording(variation)
if result and isinstance(result, tuple) and len(result) >= 2:
recording_mbid = result[1]
# Verify that this recording has the correct artist credit
recording_info = self.get_recording_info(recording_mbid)
if recording_info and recording_info.get('artist_credit'):
# Check if the artist credit matches our expected collaboration
expected_artist_string = original_artist.replace(',', ' & ').replace(' and ', ' & ')
if recording_info['artist_credit'].lower() == expected_artist_string.lower():
return recording_mbid
# If exact match fails, try partial match
if recording_info and recording_info.get('artist_credit'):
# Check if all artists in the collaboration are present in the recording
main_artist, collaborators = parse_complex_collaboration(original_artist)
recording_artists = recording_info['artist_credit'].lower()
# Check if main artist is in the recording
if main_artist.lower() in recording_artists:
# Check if at least one collaborator is also present
for collaborator in collaborators:
if collaborator.lower() in recording_artists:
return recording_mbid
# Regular case with artist_mbid - try all title variations
for variation in title_variations:
result = self.data_provider.fuzzy_search_recording(variation, artist_mbid)
if result and isinstance(result, tuple) and len(result) >= 2:
return result[1] # Return MBID from tuple (recording_name, mbid, score)
return None
return None
def get_artist_info(self, mbid: str) -> Optional[Dict[str, Any]]:
"""Get artist info using consistent logic."""
if self.use_database:
return self.db.get_artist_info(mbid)
elif self.api:
return self.api.get_artist_info(mbid)
return None
return self.data_provider.get_artist_info(mbid)
def get_recording_info(self, mbid: str) -> Optional[Dict[str, Any]]:
"""Get recording info using consistent logic."""
if self.use_database:
return self.db.get_recording_info(mbid)
elif self.api:
return self.api.get_recording_info(mbid)
return None
return self.data_provider.get_recording_info(mbid)
def _build_artist_string(self, artist_credit: list) -> str:
"""Build artist string from artist credit list (API format)."""
@ -189,9 +158,9 @@ class SongProcessor:
recording_info = self.get_recording_info(recording_mbid)
if recording_info:
# Update with the correct artist credit and title
if self.use_database and recording_info.get('artist_credit'):
if recording_info.get('artist_credit'):
song['artist'] = recording_info['artist_credit']
elif not self.use_database and recording_info.get('artist-credit'):
elif recording_info.get('artist-credit'):
artist_string = self._build_artist_string(recording_info['artist-credit'])
if artist_string:
song['artist'] = artist_string
@ -200,12 +169,11 @@ class SongProcessor:
song['recording_mbid'] = recording_mbid
# For collaborations, try to get the main artist's MBID
if self.use_database:
main_artist, collaborators = parse_complex_collaboration(song.get('artist', ''))
if main_artist:
artist_result = self.db.fuzzy_search_artist(main_artist)
if artist_result and isinstance(artist_result, tuple) and len(artist_result) >= 2:
song['mbid'] = artist_result[1] # Set the main artist's MBID
main_artist, collaborators = parse_complex_collaboration(song.get('artist', ''))
if main_artist:
artist_result = self.data_provider.fuzzy_search_artist(main_artist)
if artist_result and isinstance(artist_result, tuple) and len(artist_result) >= 2:
song['mbid'] = artist_result[1] # Set the main artist's MBID
return song, True
return song, False
@ -229,12 +197,12 @@ class SongProcessor:
recording_info = self.get_recording_info(recording_mbid)
if recording_info:
# Update artist string if there are multiple artists, but preserve the artist MBID
if self.use_database and recording_info.get('artist_credit'):
if recording_info.get('artist_credit'):
song['artist'] = recording_info['artist_credit']
# Keep the original artist MBID even when updating artist name
if 'mbid' not in song:
song['mbid'] = artist_mbid
elif not self.use_database and recording_info.get('artist-credit'):
elif recording_info.get('artist-credit'):
artist_string = self._build_artist_string(recording_info['artist-credit'])
if artist_string:
song['artist'] = artist_string