Signed-off-by: Matt Bruce <mbrucedogs@gmail.com>

This commit is contained in:
Matt Bruce 2025-08-01 10:01:41 -05:00
parent 0f33590eca
commit b794d9dc1c
9 changed files with 252 additions and 157 deletions

View File

@ -2,7 +2,7 @@
## Overview ## Overview
The MusicBrainz Data Cleaner is a command-line interface (CLI) tool that processes JSON song data files and cleans/normalizes the metadata using the MusicBrainz database. The tool creates separate output files for successful and failed songs, along with detailed processing reports. The MusicBrainz Data Cleaner is a command-line interface (CLI) tool that processes JSON song data files and cleans/normalizes the metadata using the MusicBrainz database. The tool uses an interface-based architecture with dependency injection for clean, maintainable code. It creates separate output files for successful and failed songs, along with detailed processing reports.
## Basic Command Structure ## Basic Command Structure
@ -277,6 +277,24 @@ Error: Source file should contain a JSON array of songs
ModuleNotFoundError: No module named 'requests' ModuleNotFoundError: No module named 'requests'
``` ```
## Architecture Overview
### Interface-Based Design
The tool uses a clean interface-based architecture:
- **`MusicBrainzDataProvider` Interface**: Common protocol for data access
- **`DataProviderFactory`**: Creates appropriate provider (database or API)
- **`SongProcessor`**: Centralized processing logic using the interface
- **Dependency Injection**: CLI depends on interfaces, not concrete classes
### Data Flow
1. **CLI** uses `DataProviderFactory` to create data provider
2. **Factory** returns either database or API implementation
3. **SongProcessor** processes songs using the common interface
4. **Same logic** works regardless of provider type
## Environment Configuration ## Environment Configuration
### Docker Environment ### Docker Environment

60
PRD.md
View File

@ -162,6 +162,7 @@ Users have song data in JSON format with inconsistent artist names, song titles,
- **Primary:** Direct PostgreSQL database access - **Primary:** Direct PostgreSQL database access
- **Fallback:** MusicBrainz REST API (local server) - **Fallback:** MusicBrainz REST API (local server)
- **Interface:** Command-line (CLI) - **Interface:** Command-line (CLI)
- **Design Pattern:** Interface-based architecture with dependency injection
### Project Structure ### Project Structure
``` ```
@ -169,20 +170,27 @@ src/
├── __init__.py # Package initialization ├── __init__.py # Package initialization
├── api/ # API-related modules ├── api/ # API-related modules
│ ├── __init__.py │ ├── __init__.py
│ ├── database.py # Direct PostgreSQL access with fuzzy search │ ├── database.py # Direct PostgreSQL access (implements MusicBrainzDataProvider)
│ └── api_client.py # Legacy HTTP API client (fallback) │ └── api_client.py # HTTP API client (implements MusicBrainzDataProvider)
├── cli/ # Command-line interface ├── cli/ # Command-line interface
│ ├── __init__.py │ ├── __init__.py
│ └── main.py # Main CLI implementation │ └── main.py # Main CLI implementation (uses factory pattern)
├── config/ # Configuration ├── config/ # Configuration
│ ├── __init__.py │ ├── __init__.py
│ └── constants.py # Constants and settings │ └── constants.py # Constants and settings
├── core/ # Core functionality ├── core/ # Core functionality
│ ├── __init__.py
│ ├── interfaces.py # Common interfaces and protocols
│ ├── factory.py # Data provider factory
│ └── song_processor.py # Centralized song processing logic
├── tests/ # Test files and scripts ├── tests/ # Test files and scripts
│ ├── __init__.py │ ├── __init__.py
│ ├── test_*.py # Unit and integration tests │ ├── test_*.py # Unit and integration tests
│ └── debug_*.py # Debug scripts │ └── debug_*.py # Debug scripts
└── utils/ # Utility functions └── utils/ # Utility functions
├── __init__.py
├── artist_title_processing.py # Shared artist/title processing
└── data_loader.py # Data loading utilities
``` ```
### Architectural Principles ### Architectural Principles
@ -193,13 +201,18 @@ src/
- **Error Handling**: Graceful error handling with meaningful messages - **Error Handling**: Graceful error handling with meaningful messages
- **Performance First**: Direct database access for maximum speed - **Performance First**: Direct database access for maximum speed
- **Fallback Strategy**: Automatic fallback to API when database unavailable - **Fallback Strategy**: Automatic fallback to API when database unavailable
- **NEW**: **Database-First**: Always use live database data over static caches - **Interface-Based Design**: Uses dependency injection with common interfaces
- **NEW**: **Intelligent Collaboration Detection**: Distinguish band names from collaborations - **Factory Pattern**: Clean provider creation and configuration
- **NEW**: **Test Organization**: All test files must be placed in `src/tests/` directory, not in root - **Single Responsibility**: Each class has one clear purpose
- **Database-First**: Always use live database data over static caches
- **Intelligent Collaboration Detection**: Distinguish band names from collaborations
- **Test Organization**: All test files must be placed in `src/tests/` directory, not in root
### Data Flow ### Data Flow
1. Read JSON input file 1. **CLI** uses `DataProviderFactory` to create appropriate data provider (database or API)
2. For each song: 2. **SongProcessor** receives the data provider and processes songs using the common interface
3. **Data Provider** (database or API) implements the same interface for consistent behavior
4. For each song:
- Clean artist name using name variations - Clean artist name using name variations
- Detect collaboration patterns - Detect collaboration patterns
- Use fuzzy search to find artist in database (including aliases, sort_names) - Use fuzzy search to find artist in database (including aliases, sort_names)
@ -207,7 +220,7 @@ src/
- For collaborations: find artist_credit and recording - For collaborations: find artist_credit and recording
- For single artists: find recording by artist and title - For single artists: find recording by artist and title
- Update song object with corrected data and MBIDs - Update song object with corrected data and MBIDs
3. Write cleaned data to output file 5. Write cleaned data to output file
### Fuzzy Search Implementation ### Fuzzy Search Implementation
- **Algorithm**: Uses fuzzywuzzy library with multiple matching strategies - **Algorithm**: Uses fuzzywuzzy library with multiple matching strategies
@ -316,18 +329,23 @@ docker-compose logs -f musicbrainz
- [x] Fuzzy search for artists and recordings - [x] Fuzzy search for artists and recordings
- [x] Automatic fallback to API mode - [x] Automatic fallback to API mode
- [x] Performance optimizations - [x] Performance optimizations
- [x] **NEW**: Advanced collaboration detection and handling - [x] Advanced collaboration detection and handling
- [x] **NEW**: Artist alias and sort_name search - [x] Artist alias and sort_name search
- [x] **NEW**: Dash variation handling - [x] Dash variation handling
- [x] **NEW**: Numerical suffix handling - [x] Numerical suffix handling
- [x] **NEW**: Band name vs collaboration distinction - [x] Band name vs collaboration distinction
- [x] **NEW**: Complex collaboration parsing - [x] Complex collaboration parsing
- [x] **NEW**: Removed problematic known_artists cache - [x] Removed problematic known_artists cache
- [x] **NEW**: Simplified CLI with default full dataset processing - [x] Simplified CLI with default full dataset processing
- [x] **NEW**: Separate output files for successful and failed songs (array format) - [x] Separate output files for successful and failed songs (array format)
- [x] **NEW**: Human-readable text reports with statistics - [x] Human-readable text reports with statistics
- [x] **NEW**: Smart defaults for all file paths and options - [x] Smart defaults for all file paths and options
- [x] **NEW**: Configurable processing limits and output file paths - [x] Configurable processing limits and output file paths
- [x] **NEW**: Interface-based architecture with dependency injection
- [x] **NEW**: Factory pattern for data provider creation
- [x] **NEW**: Centralized song processing logic
- [x] **NEW**: Common interfaces for database and API clients
- [x] **NEW**: Clean separation of concerns
### 🔄 Future Enhancements ### 🔄 Future Enhancements
- [ ] Web interface option - [ ] Web interface option

View File

@ -1,6 +1,6 @@
# 🎵 MusicBrainz Data Cleaner v3.0 # 🎵 MusicBrainz Data Cleaner v3.0
A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!** A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with interface-based architecture, advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!**
## 🚀 Quick Start for New Sessions ## 🚀 Quick Start for New Sessions
@ -47,6 +47,8 @@ docker-compose run --rm musicbrainz-cleaner python3 [script_name].py
## ✨ What's New in v3.0 ## ✨ What's New in v3.0
- **🏗️ Interface-Based Architecture**: Clean dependency injection with common interfaces
- **🏭 Factory Pattern**: Smart data provider creation and configuration
- **🚀 Direct Database Access**: Connect directly to PostgreSQL for 10x faster performance - **🚀 Direct Database Access**: Connect directly to PostgreSQL for 10x faster performance
- **🎯 Advanced Fuzzy Search**: Intelligent matching for similar artist names and song titles - **🎯 Advanced Fuzzy Search**: Intelligent matching for similar artist names and song titles
- **🔄 Automatic Fallback**: Falls back to API mode if database access fails - **🔄 Automatic Fallback**: Falls back to API mode if database access fails
@ -299,11 +301,19 @@ python3 src/tests/run_tests.py --list
musicbrainz-cleaner/ musicbrainz-cleaner/
├── src/ ├── src/
│ ├── api/ # Database and API access │ ├── api/ # Database and API access
│ │ ├── database.py # Direct PostgreSQL access (implements MusicBrainzDataProvider)
│ │ └── api_client.py # HTTP API client (implements MusicBrainzDataProvider)
│ ├── cli/ # Command-line interface │ ├── cli/ # Command-line interface
│ │ └── main.py # Main CLI implementation (uses factory pattern)
│ ├── config/ # Configuration and constants │ ├── config/ # Configuration and constants
│ ├── core/ # Core functionality │ ├── core/ # Core functionality
│ │ ├── interfaces.py # Common interfaces and protocols
│ │ ├── factory.py # Data provider factory
│ │ └── song_processor.py # Centralized song processing logic
│ ├── tests/ # Test files (REQUIRED location) │ ├── tests/ # Test files (REQUIRED location)
│ └── utils/ # Utility functions │ └── utils/ # Utility functions
│ ├── artist_title_processing.py # Shared artist/title processing
│ └── data_loader.py # Data loading utilities
├── data/ # Data files and output ├── data/ # Data files and output
│ ├── known_artists.json # Name variations (ACDC → AC/DC) │ ├── known_artists.json # Name variations (ACDC → AC/DC)
│ ├── known_recordings.json # Known recording MBIDs │ ├── known_recordings.json # Known recording MBIDs

View File

@ -1,19 +1,20 @@
""" """
Legacy HTTP API client for MusicBrainz Data Cleaner. HTTP API client for MusicBrainz Data Cleaner.
Used as fallback when direct database access is not available. Implements the MusicBrainzDataProvider interface for API-based access.
""" """
import requests import requests
import time import time
from typing import Dict, Optional, Any from typing import Dict, Optional, Any, Tuple
from ..config.constants import ( from ..config.constants import (
DEFAULT_MUSICBRAINZ_URL, API_REQUEST_DELAY, REQUEST_TIMEOUT, DEFAULT_MUSICBRAINZ_URL, API_REQUEST_DELAY, REQUEST_TIMEOUT,
SUCCESS_MESSAGES, ERROR_MESSAGES SUCCESS_MESSAGES, ERROR_MESSAGES
) )
from ..core.interfaces import MusicBrainzDataProvider
class MusicBrainzAPIClient: class MusicBrainzAPIClient(MusicBrainzDataProvider):
"""Legacy HTTP API client for MusicBrainz (fallback option).""" """HTTP API client for MusicBrainz implementing the data provider interface."""
def __init__(self, base_url: str = DEFAULT_MUSICBRAINZ_URL): def __init__(self, base_url: str = DEFAULT_MUSICBRAINZ_URL):
self.base_url = base_url self.base_url = base_url
@ -30,8 +31,8 @@ class MusicBrainzAPIClient:
print(f"API connection test failed: {e}") print(f"API connection test failed: {e}")
return False return False
def search_artist(self, artist_name: str) -> Optional[Dict[str, Any]]: def fuzzy_search_artist(self, artist_name: str) -> Optional[Tuple[str, str, float]]:
"""Search for artist by name using API.""" """Search for artist by name using API. Returns (artist_name, mbid, score)."""
try: try:
url = f"{self.base_url}/ws/2/artist/?query=name:{artist_name}&fmt=json" url = f"{self.base_url}/ws/2/artist/?query=name:{artist_name}&fmt=json"
response = self.session.get(url) response = self.session.get(url)
@ -39,14 +40,15 @@ class MusicBrainzAPIClient:
data = response.json() data = response.json()
if data.get('artists') and len(data['artists']) > 0: if data.get('artists') and len(data['artists']) > 0:
return data['artists'][0] artist = data['artists'][0]
return (artist['name'], artist['id'], 1.0) # Perfect match for API
return None return None
except Exception as e: except Exception as e:
print(f"API search failed for artist '{artist_name}': {e}") print(f"API search failed for artist '{artist_name}': {e}")
return None return None
def search_recording(self, title: str, artist_mbid: Optional[str] = None) -> Optional[Dict[str, Any]]: def fuzzy_search_recording(self, title: str, artist_mbid: Optional[str] = None) -> Optional[Tuple[str, str, float]]:
"""Search for recording by title and optionally artist using API.""" """Search for recording by title and optionally artist using API. Returns (recording_name, mbid, score)."""
try: try:
if artist_mbid: if artist_mbid:
url = f"{self.base_url}/ws/2/recording/?query=arid:{artist_mbid}%20AND%20name:{title}&fmt=json" url = f"{self.base_url}/ws/2/recording/?query=arid:{artist_mbid}%20AND%20name:{title}&fmt=json"
@ -58,7 +60,8 @@ class MusicBrainzAPIClient:
data = response.json() data = response.json()
if data.get('recordings') and len(data['recordings']) > 0: if data.get('recordings') and len(data['recordings']) > 0:
return data['recordings'][0] recording = data['recordings'][0]
return (recording['title'], recording['id'], 1.0) # Perfect match for API
return None return None
except Exception as e: except Exception as e:
print(f"API search failed for recording '{title}': {e}") print(f"API search failed for recording '{title}': {e}")

View File

@ -13,10 +13,11 @@ from ..config.constants import (
TITLE_SIMILARITY_THRESHOLD, ARTIST_SIMILARITY_THRESHOLD, TITLE_SIMILARITY_THRESHOLD, ARTIST_SIMILARITY_THRESHOLD,
SUCCESS_MESSAGES, ERROR_MESSAGES SUCCESS_MESSAGES, ERROR_MESSAGES
) )
from src.utils.artist_title_processing import parse_complex_collaboration, parse_collaborators, generate_title_variations from ..utils.artist_title_processing import parse_complex_collaboration, parse_collaborators, generate_title_variations
from ..core.interfaces import MusicBrainzDataProvider
class MusicBrainzDatabase: class MusicBrainzDatabase(MusicBrainzDataProvider):
"""Direct PostgreSQL database access for MusicBrainz with fuzzy search.""" """Direct PostgreSQL database access for MusicBrainz with fuzzy search."""
def __init__(self, host: str = DB_HOST, port: int = DB_PORT, def __init__(self, host: str = DB_HOST, port: int = DB_PORT,

View File

@ -23,35 +23,23 @@ from ..config.constants import (
# Import database and API clients # Import core components
from ..api.database import MusicBrainzDatabase
from ..api.api_client import MusicBrainzAPIClient
from ..core.song_processor import SongProcessor from ..core.song_processor import SongProcessor
from ..core.factory import DataProviderFactory
class MusicBrainzCleaner: class MusicBrainzCleaner:
"""Enhanced MusicBrainz Cleaner with direct database access and fuzzy search.""" """Enhanced MusicBrainz Cleaner with interface-based data access."""
def __init__(self, use_database: bool = True, base_url: str = DEFAULT_MUSICBRAINZ_URL): def __init__(self, use_database: bool = True, base_url: str = DEFAULT_MUSICBRAINZ_URL):
self.use_database = use_database self.use_database = use_database
self.base_url = base_url self.base_url = base_url
# Initialize database connection (primary method) # Create data provider using factory
if use_database: self.data_provider = DataProviderFactory.create_provider(use_database, base_url)
self.db = MusicBrainzDatabase()
if not self.db.connect():
print("⚠️ Database connection failed, falling back to API")
self.use_database = False
# Initialize API client (fallback method) # Initialize centralized song processor with the data provider
if not self.use_database: self.song_processor = SongProcessor(self.data_provider)
self.api = MusicBrainzAPIClient(base_url)
# Initialize centralized song processor
self.song_processor = SongProcessor(
database_client=self.db if use_database else None,
api_client=self.api if not use_database else None
)
def clean_song(self, song: Dict[str, Any]) -> Tuple[Dict[str, Any], bool]: def clean_song(self, song: Dict[str, Any]) -> Tuple[Dict[str, Any], bool]:
""" """
@ -394,22 +382,15 @@ def main() -> int:
# Handle test connection # Handle test connection
if parsed['test_connection']: if parsed['test_connection']:
if parsed['use_api']: provider = DataProviderFactory.create_provider(not parsed['use_api'])
api = MusicBrainzAPIClient() if provider.test_connection():
if api.test_connection(): provider_type = "API server" if parsed['use_api'] else "database"
print("✅ Connection to MusicBrainz API server successful") print(f"✅ Connection to MusicBrainz {provider_type} successful")
return ExitCode.SUCCESS return ExitCode.SUCCESS
else:
print("❌ Connection to MusicBrainz API server failed")
return ExitCode.ERROR
else: else:
db = MusicBrainzDatabase() provider_type = "API server" if parsed['use_api'] else "database"
if db.test_connection(): print(f"❌ Connection to MusicBrainz {provider_type} failed")
print("✅ Connection to MusicBrainz database successful") return ExitCode.ERROR
return ExitCode.SUCCESS
else:
print("❌ Connection to MusicBrainz database failed")
return ExitCode.ERROR
# Process songs (main functionality) # Process songs (main functionality)
source_file = Path(parsed['source']) source_file = Path(parsed['source'])

36
src/core/factory.py Normal file
View File

@ -0,0 +1,36 @@
"""
Factory for creating MusicBrainz data providers.
Provides a clean way to instantiate the appropriate data provider based on configuration.
"""
from typing import Optional
from ..api.database import MusicBrainzDatabase
from ..api.api_client import MusicBrainzAPIClient
from .interfaces import MusicBrainzDataProvider
from ..config.constants import DEFAULT_MUSICBRAINZ_URL
class DataProviderFactory:
"""Factory for creating MusicBrainz data providers."""
@staticmethod
def create_provider(use_database: bool = True, base_url: str = DEFAULT_MUSICBRAINZ_URL) -> MusicBrainzDataProvider:
"""
Create a data provider based on configuration.
Args:
use_database: Whether to use database (True) or API (False)
base_url: Base URL for API client (only used if use_database=False)
Returns:
MusicBrainzDataProvider instance
"""
if use_database:
provider = MusicBrainzDatabase()
if provider.connect():
return provider
else:
print("⚠️ Database connection failed, falling back to API")
return MusicBrainzAPIClient(base_url)
else:
return MusicBrainzAPIClient(base_url)

60
src/core/interfaces.py Normal file
View File

@ -0,0 +1,60 @@
"""
Core interfaces for MusicBrainz Data Cleaner.
Defines the common protocol that all data access implementations must follow.
"""
from abc import ABC, abstractmethod
from typing import Dict, Optional, Any, Tuple, List
class MusicBrainzDataProvider(ABC):
"""
Abstract base class defining the interface for MusicBrainz data access.
Both database and API implementations must implement these methods.
"""
@abstractmethod
def test_connection(self) -> bool:
"""Test connection to the data source."""
pass
@abstractmethod
def fuzzy_search_artist(self, artist_name: str) -> Optional[Tuple[str, str, float]]:
"""
Fuzzy search for artist by name.
Returns (artist_name, mbid, similarity_score) or None.
"""
pass
@abstractmethod
def fuzzy_search_recording(self, title: str, artist_mbid: Optional[str] = None) -> Optional[Tuple[str, str, float]]:
"""
Fuzzy search for recording by title and optionally artist.
Returns (recording_name, mbid, similarity_score) or None.
"""
pass
@abstractmethod
def get_artist_info(self, mbid: str) -> Optional[Dict[str, Any]]:
"""Get detailed artist information by MBID."""
pass
@abstractmethod
def get_recording_info(self, mbid: str) -> Optional[Dict[str, Any]]:
"""Get detailed recording information by MBID."""
pass
class SongProcessorInterface(ABC):
"""
Abstract base class defining the interface for song processing.
This ensures consistent behavior across different implementations.
"""
@abstractmethod
def clean_song(self, song: Dict[str, Any]) -> Tuple[Dict[str, Any], bool]:
"""
Clean a single song.
Returns (cleaned_song, success_status)
"""
pass

View File

@ -7,22 +7,20 @@ between CLI and database interactions to ensure consistency.
from typing import Dict, Optional, Tuple, Any, List from typing import Dict, Optional, Tuple, Any, List
from ..utils.artist_title_processing import parse_complex_collaboration, generate_title_variations from ..utils.artist_title_processing import parse_complex_collaboration, generate_title_variations
from ..utils.data_loader import data_loader from ..utils.data_loader import data_loader
from .interfaces import MusicBrainzDataProvider, SongProcessorInterface
class SongProcessor: class SongProcessor(SongProcessorInterface):
"""Centralized song processing with consistent logic across all interfaces.""" """Centralized song processing with consistent logic across all interfaces."""
def __init__(self, database_client=None, api_client=None): def __init__(self, data_provider: MusicBrainzDataProvider):
""" """
Initialize with optional database and API clients. Initialize with a data provider that implements MusicBrainzDataProvider.
Args: Args:
database_client: MusicBrainzDatabase instance for direct DB access data_provider: Instance implementing MusicBrainzDataProvider interface
api_client: MusicBrainzAPIClient instance for API fallback
""" """
self.db = database_client self.data_provider = data_provider
self.api = api_client
self.use_database = database_client is not None
def find_artist_mbid(self, artist_name: str) -> Optional[str]: def find_artist_mbid(self, artist_name: str) -> Optional[str]:
""" """
@ -37,20 +35,9 @@ class SongProcessor:
if not artist_name: if not artist_name:
return None return None
# Try database first if available result = self.data_provider.fuzzy_search_artist(artist_name)
if self.use_database: if result and isinstance(result, tuple) and len(result) >= 2:
result = self.db.fuzzy_search_artist(artist_name) return result[1] # Return MBID from tuple (artist_name, mbid, score)
if result and isinstance(result, tuple) and len(result) >= 2:
return result[1] # Return MBID from tuple (artist_name, mbid, score)
# Fallback to API
if self.api:
try:
result = self.api.search_artist(artist_name)
if result:
return result['id']
except:
pass
return None return None
@ -83,70 +70,52 @@ class SongProcessor:
return recording_mbid return recording_mbid
# Handle collaborations using artist credit # Handle collaborations using artist credit
if self.use_database: # If no artist_mbid (collaboration case), try to find by title and verify artist credit
# If no artist_mbid (collaboration case), try to find by title and verify artist credit if not artist_mbid and original_artist:
if not artist_mbid and original_artist: # This is a collaboration case, try to find by title with all variations
# This is a collaboration case, try to find by title with all variations
for variation in title_variations:
result = self.db.fuzzy_search_recording(variation)
if result and isinstance(result, tuple) and len(result) >= 2:
recording_mbid = result[1]
# Verify that this recording has the correct artist credit
recording_info = self.get_recording_info(recording_mbid)
if recording_info and recording_info.get('artist_credit'):
# Check if the artist credit matches our expected collaboration
expected_artist_string = original_artist.replace(',', ' & ').replace(' and ', ' & ')
if recording_info['artist_credit'].lower() == expected_artist_string.lower():
return recording_mbid
# If exact match fails, try partial match
if recording_info and recording_info.get('artist_credit'):
# Check if all artists in the collaboration are present in the recording
main_artist, collaborators = parse_complex_collaboration(original_artist)
recording_artists = recording_info['artist_credit'].lower()
# Check if main artist is in the recording
if main_artist.lower() in recording_artists:
# Check if at least one collaborator is also present
for collaborator in collaborators:
if collaborator.lower() in recording_artists:
return recording_mbid
return None
else:
# Regular case with artist_mbid - try all title variations
for variation in title_variations:
result = self.db.fuzzy_search_recording(variation, artist_mbid)
if result and isinstance(result, tuple) and len(result) >= 2:
return result[1] # Return MBID from tuple (recording_name, mbid, score)
else:
# Fallback to API - try all title variations
for variation in title_variations: for variation in title_variations:
try: result = self.data_provider.fuzzy_search_recording(variation)
result = self.api.search_recording(variation, artist_mbid) if result and isinstance(result, tuple) and len(result) >= 2:
if result: recording_mbid = result[1]
return result['id']
except: # Verify that this recording has the correct artist credit
pass recording_info = self.get_recording_info(recording_mbid)
if recording_info and recording_info.get('artist_credit'):
# Check if the artist credit matches our expected collaboration
expected_artist_string = original_artist.replace(',', ' & ').replace(' and ', ' & ')
if recording_info['artist_credit'].lower() == expected_artist_string.lower():
return recording_mbid
# If exact match fails, try partial match
if recording_info and recording_info.get('artist_credit'):
# Check if all artists in the collaboration are present in the recording
main_artist, collaborators = parse_complex_collaboration(original_artist)
recording_artists = recording_info['artist_credit'].lower()
# Check if main artist is in the recording
if main_artist.lower() in recording_artists:
# Check if at least one collaborator is also present
for collaborator in collaborators:
if collaborator.lower() in recording_artists:
return recording_mbid
# Regular case with artist_mbid - try all title variations
for variation in title_variations:
result = self.data_provider.fuzzy_search_recording(variation, artist_mbid)
if result and isinstance(result, tuple) and len(result) >= 2:
return result[1] # Return MBID from tuple (recording_name, mbid, score)
return None
return None return None
def get_artist_info(self, mbid: str) -> Optional[Dict[str, Any]]: def get_artist_info(self, mbid: str) -> Optional[Dict[str, Any]]:
"""Get artist info using consistent logic.""" """Get artist info using consistent logic."""
if self.use_database: return self.data_provider.get_artist_info(mbid)
return self.db.get_artist_info(mbid)
elif self.api:
return self.api.get_artist_info(mbid)
return None
def get_recording_info(self, mbid: str) -> Optional[Dict[str, Any]]: def get_recording_info(self, mbid: str) -> Optional[Dict[str, Any]]:
"""Get recording info using consistent logic.""" """Get recording info using consistent logic."""
if self.use_database: return self.data_provider.get_recording_info(mbid)
return self.db.get_recording_info(mbid)
elif self.api:
return self.api.get_recording_info(mbid)
return None
def _build_artist_string(self, artist_credit: list) -> str: def _build_artist_string(self, artist_credit: list) -> str:
"""Build artist string from artist credit list (API format).""" """Build artist string from artist credit list (API format)."""
@ -189,9 +158,9 @@ class SongProcessor:
recording_info = self.get_recording_info(recording_mbid) recording_info = self.get_recording_info(recording_mbid)
if recording_info: if recording_info:
# Update with the correct artist credit and title # Update with the correct artist credit and title
if self.use_database and recording_info.get('artist_credit'): if recording_info.get('artist_credit'):
song['artist'] = recording_info['artist_credit'] song['artist'] = recording_info['artist_credit']
elif not self.use_database and recording_info.get('artist-credit'): elif recording_info.get('artist-credit'):
artist_string = self._build_artist_string(recording_info['artist-credit']) artist_string = self._build_artist_string(recording_info['artist-credit'])
if artist_string: if artist_string:
song['artist'] = artist_string song['artist'] = artist_string
@ -200,12 +169,11 @@ class SongProcessor:
song['recording_mbid'] = recording_mbid song['recording_mbid'] = recording_mbid
# For collaborations, try to get the main artist's MBID # For collaborations, try to get the main artist's MBID
if self.use_database: main_artist, collaborators = parse_complex_collaboration(song.get('artist', ''))
main_artist, collaborators = parse_complex_collaboration(song.get('artist', '')) if main_artist:
if main_artist: artist_result = self.data_provider.fuzzy_search_artist(main_artist)
artist_result = self.db.fuzzy_search_artist(main_artist) if artist_result and isinstance(artist_result, tuple) and len(artist_result) >= 2:
if artist_result and isinstance(artist_result, tuple) and len(artist_result) >= 2: song['mbid'] = artist_result[1] # Set the main artist's MBID
song['mbid'] = artist_result[1] # Set the main artist's MBID
return song, True return song, True
return song, False return song, False
@ -229,12 +197,12 @@ class SongProcessor:
recording_info = self.get_recording_info(recording_mbid) recording_info = self.get_recording_info(recording_mbid)
if recording_info: if recording_info:
# Update artist string if there are multiple artists, but preserve the artist MBID # Update artist string if there are multiple artists, but preserve the artist MBID
if self.use_database and recording_info.get('artist_credit'): if recording_info.get('artist_credit'):
song['artist'] = recording_info['artist_credit'] song['artist'] = recording_info['artist_credit']
# Keep the original artist MBID even when updating artist name # Keep the original artist MBID even when updating artist name
if 'mbid' not in song: if 'mbid' not in song:
song['mbid'] = artist_mbid song['mbid'] = artist_mbid
elif not self.use_database and recording_info.get('artist-credit'): elif recording_info.get('artist-credit'):
artist_string = self._build_artist_string(recording_info['artist-credit']) artist_string = self._build_artist_string(recording_info['artist-credit'])
if artist_string: if artist_string:
song['artist'] = artist_string song['artist'] = artist_string