Signed-off-by: Matt Bruce <mbrucedogs@gmail.com>
parent
0f33590eca
commit
b794d9dc1c
20
COMMANDS.md
@ -2,7 +2,7 @@
|
|||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
|
|
||||||
The MusicBrainz Data Cleaner is a command-line interface (CLI) tool that processes JSON song data files and cleans/normalizes the metadata using the MusicBrainz database. The tool creates separate output files for successful and failed songs, along with detailed processing reports.
|
The MusicBrainz Data Cleaner is a command-line interface (CLI) tool that processes JSON song data files and cleans/normalizes the metadata using the MusicBrainz database. The tool uses an interface-based architecture with dependency injection for clean, maintainable code. It creates separate output files for successful and failed songs, along with detailed processing reports.
|
||||||
|
|
||||||
## Basic Command Structure
|
## Basic Command Structure
|
||||||
|
|
||||||
@ -277,6 +277,24 @@ Error: Source file should contain a JSON array of songs
|
|||||||
ModuleNotFoundError: No module named 'requests'
|
ModuleNotFoundError: No module named 'requests'
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Architecture Overview
|
||||||
|
|
||||||
|
### Interface-Based Design
|
||||||
|
|
||||||
|
The tool uses a clean interface-based architecture:
|
||||||
|
|
||||||
|
- **`MusicBrainzDataProvider` Interface**: Common protocol for data access
|
||||||
|
- **`DataProviderFactory`**: Creates appropriate provider (database or API)
|
||||||
|
- **`SongProcessor`**: Centralized processing logic using the interface
|
||||||
|
- **Dependency Injection**: CLI depends on interfaces, not concrete classes
|
||||||
|
|
||||||
|
### Data Flow
|
||||||
|
|
||||||
|
1. **CLI** uses `DataProviderFactory` to create data provider
|
||||||
|
2. **Factory** returns either database or API implementation
|
||||||
|
3. **SongProcessor** processes songs using the common interface
|
||||||
|
4. **Same logic** works regardless of provider type
|
||||||
|
|
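The data flow described in this hunk can be sketched in a few lines. This is a minimal illustration only, not the project's code: `DataProvider`, `FakeDatabase`, `FakeAPI`, `create_provider`, and `Processor` are hypothetical stand-ins for `MusicBrainzDataProvider`, the two real providers, `DataProviderFactory`, and `SongProcessor`, and the MBIDs are made up.

```python
from abc import ABC, abstractmethod
from typing import Optional, Tuple


class DataProvider(ABC):
    """Hypothetical stand-in for MusicBrainzDataProvider, reduced to one method."""

    @abstractmethod
    def fuzzy_search_artist(self, artist_name: str) -> Optional[Tuple[str, str, float]]:
        """Return (artist_name, mbid, score) or None."""


class FakeDatabase(DataProvider):
    """Pretend database provider with a single hard-coded artist."""

    def fuzzy_search_artist(self, artist_name):
        return ("AC/DC", "db-mbid", 0.9) if artist_name.lower() == "acdc" else None


class FakeAPI(DataProvider):
    """Pretend API provider with the same hard-coded artist."""

    def fuzzy_search_artist(self, artist_name):
        return ("AC/DC", "api-mbid", 1.0) if artist_name.lower() == "acdc" else None


def create_provider(use_database: bool) -> DataProvider:
    """Step 1 and 2: the caller asks for a provider and gets one of two implementations."""
    return FakeDatabase() if use_database else FakeAPI()


class Processor:
    """Step 3 and 4: the processing logic only sees the common interface."""

    def __init__(self, data_provider: DataProvider):
        self.data_provider = data_provider

    def find_artist_mbid(self, artist_name: str) -> Optional[str]:
        result = self.data_provider.fuzzy_search_artist(artist_name)
        return result[1] if result else None


db_mbid = Processor(create_provider(use_database=True)).find_artist_mbid("ACDC")
api_mbid = Processor(create_provider(use_database=False)).find_artist_mbid("ACDC")
```

`Processor` contains no branch on the provider type; swapping the provider changes where the data comes from, not how it is processed.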
||||||
## Environment Configuration
|
## Environment Configuration
|
||||||
|
|
||||||
### Docker Environment
|
### Docker Environment
|
||||||
|
|||||||
60
PRD.md
@ -162,6 +162,7 @@ Users have song data in JSON format with inconsistent artist names, song titles,
|
|||||||
- **Primary:** Direct PostgreSQL database access
|
- **Primary:** Direct PostgreSQL database access
|
||||||
- **Fallback:** MusicBrainz REST API (local server)
|
- **Fallback:** MusicBrainz REST API (local server)
|
||||||
- **Interface:** Command-line (CLI)
|
- **Interface:** Command-line (CLI)
|
||||||
|
- **Design Pattern:** Interface-based architecture with dependency injection
|
||||||
|
|
||||||
### Project Structure
|
### Project Structure
|
||||||
```
|
```
|
||||||
@ -169,20 +170,27 @@ src/
|
|||||||
├── __init__.py # Package initialization
|
├── __init__.py # Package initialization
|
||||||
├── api/ # API-related modules
|
├── api/ # API-related modules
|
||||||
│ ├── __init__.py
|
│ ├── __init__.py
|
||||||
│ ├── database.py # Direct PostgreSQL access with fuzzy search
|
│ ├── database.py # Direct PostgreSQL access (implements MusicBrainzDataProvider)
|
||||||
│ └── api_client.py # Legacy HTTP API client (fallback)
|
│ └── api_client.py # HTTP API client (implements MusicBrainzDataProvider)
|
||||||
├── cli/ # Command-line interface
|
├── cli/ # Command-line interface
|
||||||
│ ├── __init__.py
|
│ ├── __init__.py
|
||||||
│ └── main.py # Main CLI implementation
|
│ └── main.py # Main CLI implementation (uses factory pattern)
|
||||||
├── config/ # Configuration
|
├── config/ # Configuration
|
||||||
│ ├── __init__.py
|
│ ├── __init__.py
|
||||||
│ └── constants.py # Constants and settings
|
│ └── constants.py # Constants and settings
|
||||||
├── core/ # Core functionality
|
├── core/ # Core functionality
|
||||||
|
│ ├── __init__.py
|
||||||
|
│ ├── interfaces.py # Common interfaces and protocols
|
||||||
|
│ ├── factory.py # Data provider factory
|
||||||
|
│ └── song_processor.py # Centralized song processing logic
|
||||||
├── tests/ # Test files and scripts
|
├── tests/ # Test files and scripts
|
||||||
│ ├── __init__.py
|
│ ├── __init__.py
|
||||||
│ ├── test_*.py # Unit and integration tests
|
│ ├── test_*.py # Unit and integration tests
|
||||||
│ └── debug_*.py # Debug scripts
|
│ └── debug_*.py # Debug scripts
|
||||||
└── utils/ # Utility functions
|
└── utils/ # Utility functions
|
||||||
|
├── __init__.py
|
||||||
|
├── artist_title_processing.py # Shared artist/title processing
|
||||||
|
└── data_loader.py # Data loading utilities
|
||||||
```
|
```
|
||||||
|
|
||||||
### Architectural Principles
|
### Architectural Principles
|
||||||
@ -193,13 +201,18 @@ src/
|
|||||||
- **Error Handling**: Graceful error handling with meaningful messages
|
- **Error Handling**: Graceful error handling with meaningful messages
|
||||||
- **Performance First**: Direct database access for maximum speed
|
- **Performance First**: Direct database access for maximum speed
|
||||||
- **Fallback Strategy**: Automatic fallback to API when database unavailable
|
- **Fallback Strategy**: Automatic fallback to API when database unavailable
|
||||||
- **NEW**: **Database-First**: Always use live database data over static caches
|
- **Interface-Based Design**: Uses dependency injection with common interfaces
|
||||||
- **NEW**: **Intelligent Collaboration Detection**: Distinguish band names from collaborations
|
- **Factory Pattern**: Clean provider creation and configuration
|
||||||
- **NEW**: **Test Organization**: All test files must be placed in `src/tests/` directory, not in root
|
- **Single Responsibility**: Each class has one clear purpose
|
||||||
|
- **Database-First**: Always use live database data over static caches
|
||||||
|
- **Intelligent Collaboration Detection**: Distinguish band names from collaborations
|
||||||
|
- **Test Organization**: All test files must be placed in `src/tests/` directory, not in root
|
||||||
|
|
||||||
### Data Flow
|
### Data Flow
|
||||||
1. Read JSON input file
|
1. **CLI** uses `DataProviderFactory` to create appropriate data provider (database or API)
|
||||||
2. For each song:
|
2. **SongProcessor** receives the data provider and processes songs using the common interface
|
||||||
|
3. **Data Provider** (database or API) implements the same interface for consistent behavior
|
||||||
|
4. For each song:
|
||||||
- Clean artist name using name variations
|
- Clean artist name using name variations
|
||||||
- Detect collaboration patterns
|
- Detect collaboration patterns
|
||||||
- Use fuzzy search to find artist in database (including aliases, sort_names)
|
- Use fuzzy search to find artist in database (including aliases, sort_names)
|
||||||
@ -207,7 +220,7 @@ src/
|
|||||||
- For collaborations: find artist_credit and recording
|
- For collaborations: find artist_credit and recording
|
||||||
- For single artists: find recording by artist and title
|
- For single artists: find recording by artist and title
|
||||||
- Update song object with corrected data and MBIDs
|
- Update song object with corrected data and MBIDs
|
||||||
3. Write cleaned data to output file
|
5. Write cleaned data to output file
|
||||||
|
|
||||||
### Fuzzy Search Implementation
|
### Fuzzy Search Implementation
|
||||||
- **Algorithm**: Uses fuzzywuzzy library with multiple matching strategies
|
- **Algorithm**: Uses fuzzywuzzy library with multiple matching strategies
|
||||||
@ -316,18 +329,23 @@ docker-compose logs -f musicbrainz
|
|||||||
- [x] Fuzzy search for artists and recordings
|
- [x] Fuzzy search for artists and recordings
|
||||||
- [x] Automatic fallback to API mode
|
- [x] Automatic fallback to API mode
|
||||||
- [x] Performance optimizations
|
- [x] Performance optimizations
|
||||||
- [x] **NEW**: Advanced collaboration detection and handling
|
- [x] Advanced collaboration detection and handling
|
||||||
- [x] **NEW**: Artist alias and sort_name search
|
- [x] Artist alias and sort_name search
|
||||||
- [x] **NEW**: Dash variation handling
|
- [x] Dash variation handling
|
||||||
- [x] **NEW**: Numerical suffix handling
|
- [x] Numerical suffix handling
|
||||||
- [x] **NEW**: Band name vs collaboration distinction
|
- [x] Band name vs collaboration distinction
|
||||||
- [x] **NEW**: Complex collaboration parsing
|
- [x] Complex collaboration parsing
|
||||||
- [x] **NEW**: Removed problematic known_artists cache
|
- [x] Removed problematic known_artists cache
|
||||||
- [x] **NEW**: Simplified CLI with default full dataset processing
|
- [x] Simplified CLI with default full dataset processing
|
||||||
- [x] **NEW**: Separate output files for successful and failed songs (array format)
|
- [x] Separate output files for successful and failed songs (array format)
|
||||||
- [x] **NEW**: Human-readable text reports with statistics
|
- [x] Human-readable text reports with statistics
|
||||||
- [x] **NEW**: Smart defaults for all file paths and options
|
- [x] Smart defaults for all file paths and options
|
||||||
- [x] **NEW**: Configurable processing limits and output file paths
|
- [x] Configurable processing limits and output file paths
|
||||||
|
- [x] **NEW**: Interface-based architecture with dependency injection
|
||||||
|
- [x] **NEW**: Factory pattern for data provider creation
|
||||||
|
- [x] **NEW**: Centralized song processing logic
|
||||||
|
- [x] **NEW**: Common interfaces for database and API clients
|
||||||
|
- [x] **NEW**: Clean separation of concerns
|
||||||
|
|
||||||
### 🔄 Future Enhancements
|
### 🔄 Future Enhancements
|
||||||
- [ ] Web interface option
|
- [ ] Web interface option
|
||||||
|
|||||||
12
README.md
@ -1,6 +1,6 @@
|
|||||||
# 🎵 MusicBrainz Data Cleaner v3.0
|
# 🎵 MusicBrainz Data Cleaner v3.0
|
||||||
|
|
||||||
A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!**
|
A powerful command-line tool that cleans and normalizes your song data using the MusicBrainz database. **Now with interface-based architecture, advanced collaboration detection, artist alias handling, and intelligent fuzzy search for maximum accuracy!**
|
||||||
|
|
||||||
## 🚀 Quick Start for New Sessions
|
## 🚀 Quick Start for New Sessions
|
||||||
|
|
||||||
@ -47,6 +47,8 @@ docker-compose run --rm musicbrainz-cleaner python3 [script_name].py
|
|||||||
|
|
||||||
## ✨ What's New in v3.0
|
## ✨ What's New in v3.0
|
||||||
|
|
||||||
|
- **🏗️ Interface-Based Architecture**: Clean dependency injection with common interfaces
|
||||||
|
- **🏭 Factory Pattern**: Smart data provider creation and configuration
|
||||||
- **🚀 Direct Database Access**: Connect directly to PostgreSQL for 10x faster performance
|
- **🚀 Direct Database Access**: Connect directly to PostgreSQL for 10x faster performance
|
||||||
- **🎯 Advanced Fuzzy Search**: Intelligent matching for similar artist names and song titles
|
- **🎯 Advanced Fuzzy Search**: Intelligent matching for similar artist names and song titles
|
||||||
- **🔄 Automatic Fallback**: Falls back to API mode if database access fails
|
- **🔄 Automatic Fallback**: Falls back to API mode if database access fails
|
||||||
@ -299,11 +301,19 @@ python3 src/tests/run_tests.py --list
|
|||||||
musicbrainz-cleaner/
|
musicbrainz-cleaner/
|
||||||
├── src/
|
├── src/
|
||||||
│ ├── api/ # Database and API access
|
│ ├── api/ # Database and API access
|
||||||
|
│ │ ├── database.py # Direct PostgreSQL access (implements MusicBrainzDataProvider)
|
||||||
|
│ │ └── api_client.py # HTTP API client (implements MusicBrainzDataProvider)
|
||||||
│ ├── cli/ # Command-line interface
|
│ ├── cli/ # Command-line interface
|
||||||
|
│ │ └── main.py # Main CLI implementation (uses factory pattern)
|
||||||
│ ├── config/ # Configuration and constants
|
│ ├── config/ # Configuration and constants
|
||||||
│ ├── core/ # Core functionality
|
│ ├── core/ # Core functionality
|
||||||
|
│ │ ├── interfaces.py # Common interfaces and protocols
|
||||||
|
│ │ ├── factory.py # Data provider factory
|
||||||
|
│ │ └── song_processor.py # Centralized song processing logic
|
||||||
│ ├── tests/ # Test files (REQUIRED location)
|
│ ├── tests/ # Test files (REQUIRED location)
|
||||||
│ └── utils/ # Utility functions
|
│ └── utils/ # Utility functions
|
||||||
|
│ ├── artist_title_processing.py # Shared artist/title processing
|
||||||
|
│ └── data_loader.py # Data loading utilities
|
||||||
├── data/ # Data files and output
|
├── data/ # Data files and output
|
||||||
│ ├── known_artists.json # Name variations (ACDC → AC/DC)
|
│ ├── known_artists.json # Name variations (ACDC → AC/DC)
|
||||||
│ ├── known_recordings.json # Known recording MBIDs
|
│ ├── known_recordings.json # Known recording MBIDs
|
||||||
|
|||||||
@ -1,19 +1,20 @@
|
|||||||
"""
|
"""
|
||||||
Legacy HTTP API client for MusicBrainz Data Cleaner.
|
HTTP API client for MusicBrainz Data Cleaner.
|
||||||
Used as fallback when direct database access is not available.
|
Implements the MusicBrainzDataProvider interface for API-based access.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import requests
|
import requests
|
||||||
import time
|
import time
|
||||||
from typing import Dict, Optional, Any
|
from typing import Dict, Optional, Any, Tuple
|
||||||
from ..config.constants import (
|
from ..config.constants import (
|
||||||
DEFAULT_MUSICBRAINZ_URL, API_REQUEST_DELAY, REQUEST_TIMEOUT,
|
DEFAULT_MUSICBRAINZ_URL, API_REQUEST_DELAY, REQUEST_TIMEOUT,
|
||||||
SUCCESS_MESSAGES, ERROR_MESSAGES
|
SUCCESS_MESSAGES, ERROR_MESSAGES
|
||||||
)
|
)
|
||||||
|
from ..core.interfaces import MusicBrainzDataProvider
|
||||||
|
|
||||||
|
|
||||||
class MusicBrainzAPIClient:
|
class MusicBrainzAPIClient(MusicBrainzDataProvider):
|
||||||
"""Legacy HTTP API client for MusicBrainz (fallback option)."""
|
"""HTTP API client for MusicBrainz implementing the data provider interface."""
|
||||||
|
|
||||||
def __init__(self, base_url: str = DEFAULT_MUSICBRAINZ_URL):
|
def __init__(self, base_url: str = DEFAULT_MUSICBRAINZ_URL):
|
||||||
self.base_url = base_url
|
self.base_url = base_url
|
||||||
@ -30,8 +31,8 @@ class MusicBrainzAPIClient:
|
|||||||
print(f"API connection test failed: {e}")
|
print(f"API connection test failed: {e}")
|
||||||
return False
|
return False
|
||||||
|
|
||||||
def search_artist(self, artist_name: str) -> Optional[Dict[str, Any]]:
|
def fuzzy_search_artist(self, artist_name: str) -> Optional[Tuple[str, str, float]]:
|
||||||
"""Search for artist by name using API."""
|
"""Search for artist by name using API. Returns (artist_name, mbid, score)."""
|
||||||
try:
|
try:
|
||||||
url = f"{self.base_url}/ws/2/artist/?query=name:{artist_name}&fmt=json"
|
url = f"{self.base_url}/ws/2/artist/?query=name:{artist_name}&fmt=json"
|
||||||
response = self.session.get(url)
|
response = self.session.get(url)
|
||||||
@ -39,14 +40,15 @@ class MusicBrainzAPIClient:
|
|||||||
data = response.json()
|
data = response.json()
|
||||||
|
|
||||||
if data.get('artists') and len(data['artists']) > 0:
|
if data.get('artists') and len(data['artists']) > 0:
|
||||||
return data['artists'][0]
|
artist = data['artists'][0]
|
||||||
|
return (artist['name'], artist['id'], 1.0)  # Placeholder score: the first API hit is treated as a full match; no similarity is computed
|
||||||
return None
|
return None
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
print(f"API search failed for artist '{artist_name}': {e}")
|
print(f"API search failed for artist '{artist_name}': {e}")
|
||||||
return None
|
return None
|
||||||
|
|
||||||
def search_recording(self, title: str, artist_mbid: Optional[str] = None) -> Optional[Dict[str, Any]]:
|
def fuzzy_search_recording(self, title: str, artist_mbid: Optional[str] = None) -> Optional[Tuple[str, str, float]]:
|
||||||
"""Search for recording by title and optionally artist using API."""
|
"""Search for recording by title and optionally artist using API. Returns (recording_name, mbid, score)."""
|
||||||
try:
|
try:
|
||||||
if artist_mbid:
|
if artist_mbid:
|
||||||
url = f"{self.base_url}/ws/2/recording/?query=arid:{artist_mbid}%20AND%20name:{title}&fmt=json"
|
url = f"{self.base_url}/ws/2/recording/?query=arid:{artist_mbid}%20AND%20name:{title}&fmt=json"
|
||||||
@ -58,7 +60,8 @@ class MusicBrainzAPIClient:
|
|||||||
data = response.json()
|
data = response.json()
|
||||||
|
|
||||||
if data.get('recordings') and len(data['recordings']) > 0:
|
if data.get('recordings') and len(data['recordings']) > 0:
|
||||||
return data['recordings'][0]
|
recording = data['recordings'][0]
|
||||||
|
return (recording['title'], recording['id'], 1.0)  # Placeholder score: the first API hit is treated as a full match; no similarity is computed
|
||||||
return None
|
return None
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
print(f"API search failed for recording '{title}': {e}")
|
print(f"API search failed for recording '{title}': {e}")
|
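Review note: both search methods in this file interpolate the raw artist name or title into the query string. A name containing `&`, `#`, or spaces would likely be split or truncated before it reaches the server. The sketch below shows one way to percent-encode the query with the standard library. `build_artist_search_url` and the `http://localhost:5000` base URL are assumptions for illustration; the real value of `DEFAULT_MUSICBRAINZ_URL` is not shown in this diff, and escaping of Lucene query syntax is not addressed here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_artist_search_url(base_url: str, artist_name: str) -> str:
    """Build the artist search URL with the query percent-encoded.

    Hypothetical helper: mirrors the '/ws/2/artist/?query=name:...&fmt=json'
    shape used in the diff, but lets urlencode handle reserved characters.
    """
    params = {"query": f"name:{artist_name}", "fmt": "json"}
    return f"{base_url}/ws/2/artist/?{urlencode(params)}"


# Assumed local server address, for illustration only.
url = build_artist_search_url("http://localhost:5000", "Simon & Garfunkel")

# Decoding the query string recovers the original name intact.
decoded = parse_qs(urlsplit(url).query)
```

With `requests`, passing the same dictionary as `params=` to `session.get` should have an equivalent effect, since `requests` encodes query parameters itself.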
||||||
|
|||||||
@ -13,10 +13,11 @@ from ..config.constants import (
|
|||||||
TITLE_SIMILARITY_THRESHOLD, ARTIST_SIMILARITY_THRESHOLD,
|
TITLE_SIMILARITY_THRESHOLD, ARTIST_SIMILARITY_THRESHOLD,
|
||||||
SUCCESS_MESSAGES, ERROR_MESSAGES
|
SUCCESS_MESSAGES, ERROR_MESSAGES
|
||||||
)
|
)
|
||||||
from src.utils.artist_title_processing import parse_complex_collaboration, parse_collaborators, generate_title_variations
|
from ..utils.artist_title_processing import parse_complex_collaboration, parse_collaborators, generate_title_variations
|
||||||
|
from ..core.interfaces import MusicBrainzDataProvider
|
||||||
|
|
||||||
|
|
||||||
class MusicBrainzDatabase:
|
class MusicBrainzDatabase(MusicBrainzDataProvider):
|
||||||
"""Direct PostgreSQL database access for MusicBrainz with fuzzy search."""
|
"""Direct PostgreSQL database access for MusicBrainz with fuzzy search."""
|
||||||
|
|
||||||
def __init__(self, host: str = DB_HOST, port: int = DB_PORT,
|
def __init__(self, host: str = DB_HOST, port: int = DB_PORT,
|
||||||
|
|||||||
@ -23,35 +23,23 @@ from ..config.constants import (
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
# Import database and API clients
|
# Import core components
|
||||||
from ..api.database import MusicBrainzDatabase
|
|
||||||
from ..api.api_client import MusicBrainzAPIClient
|
|
||||||
from ..core.song_processor import SongProcessor
|
from ..core.song_processor import SongProcessor
|
||||||
|
from ..core.factory import DataProviderFactory
|
||||||
|
|
||||||
|
|
||||||
class MusicBrainzCleaner:
|
class MusicBrainzCleaner:
|
||||||
"""Enhanced MusicBrainz Cleaner with direct database access and fuzzy search."""
|
"""Enhanced MusicBrainz Cleaner with interface-based data access."""
|
||||||
|
|
||||||
def __init__(self, use_database: bool = True, base_url: str = DEFAULT_MUSICBRAINZ_URL):
|
def __init__(self, use_database: bool = True, base_url: str = DEFAULT_MUSICBRAINZ_URL):
|
||||||
self.use_database = use_database
|
self.use_database = use_database
|
||||||
self.base_url = base_url
|
self.base_url = base_url
|
||||||
|
|
||||||
# Initialize database connection (primary method)
|
# Create data provider using factory
|
||||||
if use_database:
|
self.data_provider = DataProviderFactory.create_provider(use_database, base_url)
|
||||||
self.db = MusicBrainzDatabase()
|
|
||||||
if not self.db.connect():
|
|
||||||
print("⚠️ Database connection failed, falling back to API")
|
|
||||||
self.use_database = False
|
|
||||||
|
|
||||||
# Initialize API client (fallback method)
|
# Initialize centralized song processor with the data provider
|
||||||
if not self.use_database:
|
self.song_processor = SongProcessor(self.data_provider)
|
||||||
self.api = MusicBrainzAPIClient(base_url)
|
|
||||||
|
|
||||||
# Initialize centralized song processor
|
|
||||||
self.song_processor = SongProcessor(
|
|
||||||
database_client=self.db if use_database else None,
|
|
||||||
api_client=self.api if not use_database else None
|
|
||||||
)
|
|
||||||
|
|
||||||
def clean_song(self, song: Dict[str, Any]) -> Tuple[Dict[str, Any], bool]:
|
def clean_song(self, song: Dict[str, Any]) -> Tuple[Dict[str, Any], bool]:
|
||||||
"""
|
"""
|
||||||
@ -394,21 +382,14 @@ def main() -> int:
|
|||||||
|
|
||||||
# Handle test connection
|
# Handle test connection
|
||||||
if parsed['test_connection']:
|
if parsed['test_connection']:
|
||||||
if parsed['use_api']:
|
provider = DataProviderFactory.create_provider(not parsed['use_api'])
|
||||||
api = MusicBrainzAPIClient()
|
if provider.test_connection():
|
||||||
if api.test_connection():
|
provider_type = "API server" if parsed['use_api'] else "database"
|
||||||
print("✅ Connection to MusicBrainz API server successful")
|
print(f"✅ Connection to MusicBrainz {provider_type} successful")
|
||||||
return ExitCode.SUCCESS
|
return ExitCode.SUCCESS
|
||||||
else:
|
else:
|
||||||
print("❌ Connection to MusicBrainz API server failed")
|
provider_type = "API server" if parsed['use_api'] else "database"
|
||||||
return ExitCode.ERROR
|
print(f"❌ Connection to MusicBrainz {provider_type} failed")
|
||||||
else:
|
|
||||||
db = MusicBrainzDatabase()
|
|
||||||
if db.test_connection():
|
|
||||||
print("✅ Connection to MusicBrainz database successful")
|
|
||||||
return ExitCode.SUCCESS
|
|
||||||
else:
|
|
||||||
print("❌ Connection to MusicBrainz database failed")
|
|
||||||
return ExitCode.ERROR
|
return ExitCode.ERROR
|
||||||
|
|
||||||
# Process songs (main functionality)
|
# Process songs (main functionality)
|
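Review note: in the test-connection hunk above, the printed label is derived from `parsed['use_api']`, while the factory may hand back the API client after a failed database connection. In that case the message could name the database even though the API was tested. A minimal sketch of deriving the label from the provider instance instead follows. The two classes are empty stubs that only borrow the real class names, and `provider_label` and `connection_message` are hypothetical helpers.

```python
class MusicBrainzDatabase:
    """Stub standing in for the real database provider."""

    def test_connection(self) -> bool:
        return True


class MusicBrainzAPIClient:
    """Stub standing in for the real API provider."""

    def test_connection(self) -> bool:
        return False


def provider_label(provider) -> str:
    """Name the provider that was actually created, not the one requested."""
    return "database" if isinstance(provider, MusicBrainzDatabase) else "API server"


def connection_message(provider) -> str:
    """Build the status line from the provider instance and its own test result."""
    label = provider_label(provider)
    if provider.test_connection():
        return f"Connection to MusicBrainz {label} successful"
    return f"Connection to MusicBrainz {label} failed"


db_message = connection_message(MusicBrainzDatabase())
api_message = connection_message(MusicBrainzAPIClient())
```

An alternative with the same effect would be a `name` property on the interface, which avoids `isinstance` checks in the CLI.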
||||||
|
|||||||
36
src/core/factory.py
Normal file
@ -0,0 +1,36 @@
|
|||||||
|
"""
|
||||||
|
Factory for creating MusicBrainz data providers.
|
||||||
|
Provides a clean way to instantiate the appropriate data provider based on configuration.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from typing import Optional
|
||||||
|
from ..api.database import MusicBrainzDatabase
|
||||||
|
from ..api.api_client import MusicBrainzAPIClient
|
||||||
|
from .interfaces import MusicBrainzDataProvider
|
||||||
|
from ..config.constants import DEFAULT_MUSICBRAINZ_URL
|
||||||
|
|
||||||
|
|
||||||
|
class DataProviderFactory:
|
||||||
|
"""Factory for creating MusicBrainz data providers."""
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def create_provider(use_database: bool = True, base_url: str = DEFAULT_MUSICBRAINZ_URL) -> MusicBrainzDataProvider:
|
||||||
|
"""
|
||||||
|
Create a data provider based on configuration.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
use_database: Whether to use database (True) or API (False)
|
||||||
|
base_url: Base URL for API client (only used if use_database=False)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
MusicBrainzDataProvider instance
|
||||||
|
"""
|
||||||
|
if use_database:
|
||||||
|
provider = MusicBrainzDatabase()
|
||||||
|
if provider.connect():
|
||||||
|
return provider
|
||||||
|
else:
|
||||||
|
print("⚠️ Database connection failed, falling back to API")
|
||||||
|
return MusicBrainzAPIClient(base_url)
|
||||||
|
else:
|
||||||
|
return MusicBrainzAPIClient(base_url)
|
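The fallback branch in `create_provider` can be exercised without a real database by passing in stubs. The sketch below is an assumption-laden illustration, not the project's code: `StubDatabase`, `StubAPIClient`, and the extra `database` parameter are hypothetical, and the base URL is a placeholder.

```python
class StubDatabase:
    """Hypothetical database provider whose connect() result is configurable."""

    def __init__(self, reachable: bool):
        self.reachable = reachable

    def connect(self) -> bool:
        return self.reachable


class StubAPIClient:
    """Hypothetical API provider that only records its base URL."""

    def __init__(self, base_url: str):
        self.base_url = base_url


def create_provider(use_database: bool, base_url: str, database: StubDatabase):
    """Same decision logic as the factory in the diff, with the database injected."""
    if use_database and database.connect():
        return database
    # Either the API was requested, or the database connection failed.
    return StubAPIClient(base_url)


BASE_URL = "http://localhost:5000"  # placeholder, not the project's real default

connected = create_provider(True, BASE_URL, StubDatabase(reachable=True))
fell_back = create_provider(True, BASE_URL, StubDatabase(reachable=False))
api_only = create_provider(False, BASE_URL, StubDatabase(reachable=True))
```

The fallback client is returned without a connection test of its own, so a caller that needs certainty would still have to call `test_connection()` on whatever comes back.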
||||||
60
src/core/interfaces.py
Normal file
@ -0,0 +1,60 @@
|
|||||||
|
"""
|
||||||
|
Core interfaces for MusicBrainz Data Cleaner.
|
||||||
|
Defines the common protocol that all data access implementations must follow.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from abc import ABC, abstractmethod
|
||||||
|
from typing import Dict, Optional, Any, Tuple, List
|
||||||
|
|
||||||
|
|
||||||
|
class MusicBrainzDataProvider(ABC):
|
||||||
|
"""
|
||||||
|
Abstract base class defining the interface for MusicBrainz data access.
|
||||||
|
Both database and API implementations must implement these methods.
|
||||||
|
"""
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
def test_connection(self) -> bool:
|
||||||
|
"""Test connection to the data source."""
|
||||||
|
pass
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
def fuzzy_search_artist(self, artist_name: str) -> Optional[Tuple[str, str, float]]:
|
||||||
|
"""
|
||||||
|
Fuzzy search for artist by name.
|
||||||
|
Returns (artist_name, mbid, similarity_score) or None.
|
||||||
|
"""
|
||||||
|
pass
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
def fuzzy_search_recording(self, title: str, artist_mbid: Optional[str] = None) -> Optional[Tuple[str, str, float]]:
|
||||||
|
"""
|
||||||
|
Fuzzy search for recording by title and optionally artist.
|
||||||
|
Returns (recording_name, mbid, similarity_score) or None.
|
||||||
|
"""
|
||||||
|
pass
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
def get_artist_info(self, mbid: str) -> Optional[Dict[str, Any]]:
|
||||||
|
"""Get detailed artist information by MBID."""
|
||||||
|
pass
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
def get_recording_info(self, mbid: str) -> Optional[Dict[str, Any]]:
|
||||||
|
"""Get detailed recording information by MBID."""
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
class SongProcessorInterface(ABC):
|
||||||
|
"""
|
||||||
|
Abstract base class defining the interface for song processing.
|
||||||
|
This ensures consistent behavior across different implementations.
|
||||||
|
"""
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
def clean_song(self, song: Dict[str, Any]) -> Tuple[Dict[str, Any], bool]:
|
||||||
|
"""
|
||||||
|
Clean a single song.
|
||||||
|
Returns (cleaned_song, success_status)
|
||||||
|
"""
|
||||||
|
pass
|
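One practical consequence of declaring the interface as an `ABC` is that a test double must implement every abstract method before it can be instantiated. The sketch below copies the five method signatures from this file so it runs standalone. `InMemoryProvider` and `IncompleteProvider` are hypothetical classes for illustration, and the exact-match lookup is a simplification, not the project's fuzzy search.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple


class MusicBrainzDataProvider(ABC):
    """Local copy of the interface from the diff, so the sketch is self-contained."""

    @abstractmethod
    def test_connection(self) -> bool: ...

    @abstractmethod
    def fuzzy_search_artist(self, artist_name: str) -> Optional[Tuple[str, str, float]]: ...

    @abstractmethod
    def fuzzy_search_recording(self, title: str, artist_mbid: Optional[str] = None) -> Optional[Tuple[str, str, float]]: ...

    @abstractmethod
    def get_artist_info(self, mbid: str) -> Optional[Dict[str, Any]]: ...

    @abstractmethod
    def get_recording_info(self, mbid: str) -> Optional[Dict[str, Any]]: ...


class InMemoryProvider(MusicBrainzDataProvider):
    """Hypothetical test double backed by a dict of mbid -> artist name."""

    def __init__(self, artists: Dict[str, str]):
        self._artists = artists

    def test_connection(self) -> bool:
        return True

    def fuzzy_search_artist(self, artist_name):
        # Case-insensitive exact match; real fuzzy matching is out of scope here.
        for mbid, name in self._artists.items():
            if name.lower() == artist_name.lower():
                return (name, mbid, 1.0)
        return None

    def fuzzy_search_recording(self, title, artist_mbid=None):
        return None

    def get_artist_info(self, mbid):
        name = self._artists.get(mbid)
        return {"name": name, "mbid": mbid} if name else None

    def get_recording_info(self, mbid):
        return None


class IncompleteProvider(MusicBrainzDataProvider):
    """Implements only one of the five abstract methods."""

    def test_connection(self) -> bool:
        return True


provider = InMemoryProvider({"x1": "AC/DC"})
match = provider.fuzzy_search_artist("ac/dc")

try:
    IncompleteProvider()
    incomplete_rejected = False
except TypeError:
    # ABCs refuse to instantiate classes with unimplemented abstract methods.
    incomplete_rejected = True
```

A provider like `InMemoryProvider` can be passed to `SongProcessor` in unit tests, since the constructor in this commit accepts any `MusicBrainzDataProvider`.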
||||||
@ -7,22 +7,20 @@ between CLI and database interactions to ensure consistency.
|
|||||||
from typing import Dict, Optional, Tuple, Any, List
|
from typing import Dict, Optional, Tuple, Any, List
|
||||||
from ..utils.artist_title_processing import parse_complex_collaboration, generate_title_variations
|
from ..utils.artist_title_processing import parse_complex_collaboration, generate_title_variations
|
||||||
from ..utils.data_loader import data_loader
|
from ..utils.data_loader import data_loader
|
||||||
|
from .interfaces import MusicBrainzDataProvider, SongProcessorInterface
|
||||||
|
|
||||||
|
|
||||||
class SongProcessor:
|
class SongProcessor(SongProcessorInterface):
|
||||||
"""Centralized song processing with consistent logic across all interfaces."""
|
"""Centralized song processing with consistent logic across all interfaces."""
|
||||||
|
|
||||||
def __init__(self, database_client=None, api_client=None):
|
def __init__(self, data_provider: MusicBrainzDataProvider):
|
||||||
"""
|
"""
|
||||||
Initialize with optional database and API clients.
|
Initialize with a data provider that implements MusicBrainzDataProvider.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
database_client: MusicBrainzDatabase instance for direct DB access
|
data_provider: Instance implementing MusicBrainzDataProvider interface
|
||||||
api_client: MusicBrainzAPIClient instance for API fallback
|
|
||||||
"""
|
"""
|
||||||
self.db = database_client
|
self.data_provider = data_provider
|
||||||
self.api = api_client
|
|
||||||
self.use_database = database_client is not None
|
|
||||||
|
|
||||||
def find_artist_mbid(self, artist_name: str) -> Optional[str]:
|
def find_artist_mbid(self, artist_name: str) -> Optional[str]:
|
||||||
"""
|
"""
|
||||||
@ -37,21 +35,10 @@ class SongProcessor:
|
|||||||
if not artist_name:
|
if not artist_name:
|
||||||
return None
|
return None
|
||||||
|
|
||||||
# Try database first if available
|
result = self.data_provider.fuzzy_search_artist(artist_name)
|
||||||
if self.use_database:
|
|
||||||
result = self.db.fuzzy_search_artist(artist_name)
|
|
||||||
if result and isinstance(result, tuple) and len(result) >= 2:
|
if result and isinstance(result, tuple) and len(result) >= 2:
|
||||||
return result[1] # Return MBID from tuple (artist_name, mbid, score)
|
return result[1] # Return MBID from tuple (artist_name, mbid, score)
|
||||||
|
|
||||||
# Fallback to API
|
|
||||||
if self.api:
|
|
||||||
try:
|
|
||||||
result = self.api.search_artist(artist_name)
|
|
||||||
if result:
|
|
||||||
return result['id']
|
|
||||||
except:
|
|
||||||
pass
|
|
||||||
|
|
||||||
return None
|
return None
|
||||||
|
|
||||||
def find_recording_mbid(self, artist_mbid: str, title: str, original_artist: str = None) -> Optional[str]:
|
def find_recording_mbid(self, artist_mbid: str, title: str, original_artist: str = None) -> Optional[str]:
|
||||||
@ -83,12 +70,11 @@ class SongProcessor:
|
|||||||
return recording_mbid
|
return recording_mbid
|
||||||
|
|
||||||
# Handle collaborations using artist credit
|
# Handle collaborations using artist credit
|
||||||
if self.use_database:
|
|
||||||
# If no artist_mbid (collaboration case), try to find by title and verify artist credit
|
# If no artist_mbid (collaboration case), try to find by title and verify artist credit
|
||||||
if not artist_mbid and original_artist:
|
if not artist_mbid and original_artist:
|
||||||
# This is a collaboration case, try to find by title with all variations
|
# This is a collaboration case, try to find by title with all variations
|
||||||
for variation in title_variations:
|
for variation in title_variations:
|
||||||
result = self.db.fuzzy_search_recording(variation)
|
result = self.data_provider.fuzzy_search_recording(variation)
|
||||||
if result and isinstance(result, tuple) and len(result) >= 2:
|
if result and isinstance(result, tuple) and len(result) >= 2:
|
||||||
recording_mbid = result[1]
|
recording_mbid = result[1]
|
||||||
|
|
||||||
@ -113,40 +99,23 @@ class SongProcessor:
|
|||||||
if collaborator.lower() in recording_artists:
|
if collaborator.lower() in recording_artists:
|
||||||
return recording_mbid
|
return recording_mbid
|
||||||
|
|
||||||
return None
|
|
||||||
else:
|
|
||||||
# Regular case with artist_mbid - try all title variations
|
# Regular case with artist_mbid - try all title variations
|
||||||
for variation in title_variations:
|
for variation in title_variations:
|
||||||
result = self.db.fuzzy_search_recording(variation, artist_mbid)
|
result = self.data_provider.fuzzy_search_recording(variation, artist_mbid)
|
||||||
if result and isinstance(result, tuple) and len(result) >= 2:
|
if result and isinstance(result, tuple) and len(result) >= 2:
|
||||||
return result[1] # Return MBID from tuple (recording_name, mbid, score)
|
return result[1] # Return MBID from tuple (recording_name, mbid, score)
|
||||||
else:
|
|
||||||
# Fallback to API - try all title variations
|
return None
|
||||||
for variation in title_variations:
|
|
||||||
try:
|
|
||||||
result = self.api.search_recording(variation, artist_mbid)
|
|
||||||
if result:
|
|
||||||
return result['id']
|
|
||||||
except:
|
|
||||||
pass
|
|
||||||
|
|
||||||
return None
|
return None
|
||||||
|
|
||||||
def get_artist_info(self, mbid: str) -> Optional[Dict[str, Any]]:
|
def get_artist_info(self, mbid: str) -> Optional[Dict[str, Any]]:
|
||||||
"""Get artist info using consistent logic."""
|
"""Get artist info using consistent logic."""
|
||||||
if self.use_database:
|
return self.data_provider.get_artist_info(mbid)
|
||||||
return self.db.get_artist_info(mbid)
|
|
||||||
elif self.api:
|
|
||||||
return self.api.get_artist_info(mbid)
|
|
||||||
return None
|
|
||||||
|
|
||||||
def get_recording_info(self, mbid: str) -> Optional[Dict[str, Any]]:
|
def get_recording_info(self, mbid: str) -> Optional[Dict[str, Any]]:
|
||||||
"""Get recording info using consistent logic."""
|
"""Get recording info using consistent logic."""
|
||||||
if self.use_database:
|
return self.data_provider.get_recording_info(mbid)
|
||||||
return self.db.get_recording_info(mbid)
|
|
||||||
elif self.api:
|
|
||||||
return self.api.get_recording_info(mbid)
|
|
||||||
return None
|
|
||||||
|
|
||||||
def _build_artist_string(self, artist_credit: list) -> str:
|
def _build_artist_string(self, artist_credit: list) -> str:
|
||||||
"""Build artist string from artist credit list (API format)."""
|
"""Build artist string from artist credit list (API format)."""
|
||||||
@ -189,9 +158,9 @@ class SongProcessor:
|
|||||||
recording_info = self.get_recording_info(recording_mbid)
|
recording_info = self.get_recording_info(recording_mbid)
|
||||||
if recording_info:
|
if recording_info:
|
||||||
# Update with the correct artist credit and title
|
# Update with the correct artist credit and title
|
||||||
if self.use_database and recording_info.get('artist_credit'):
|
if recording_info.get('artist_credit'):
|
||||||
song['artist'] = recording_info['artist_credit']
|
song['artist'] = recording_info['artist_credit']
|
||||||
elif not self.use_database and recording_info.get('artist-credit'):
|
elif recording_info.get('artist-credit'):
|
||||||
artist_string = self._build_artist_string(recording_info['artist-credit'])
|
artist_string = self._build_artist_string(recording_info['artist-credit'])
|
||||||
if artist_string:
|
if artist_string:
|
||||||
song['artist'] = artist_string
|
song['artist'] = artist_string
|
||||||
@ -200,10 +169,9 @@ class SongProcessor:
|
|||||||
song['recording_mbid'] = recording_mbid
|
song['recording_mbid'] = recording_mbid
|
||||||
|
|
||||||
# For collaborations, try to get the main artist's MBID
|
# For collaborations, try to get the main artist's MBID
|
||||||
if self.use_database:
|
|
||||||
main_artist, collaborators = parse_complex_collaboration(song.get('artist', ''))
|
main_artist, collaborators = parse_complex_collaboration(song.get('artist', ''))
|
||||||
if main_artist:
|
if main_artist:
|
||||||
artist_result = self.db.fuzzy_search_artist(main_artist)
|
artist_result = self.data_provider.fuzzy_search_artist(main_artist)
|
||||||
if artist_result and isinstance(artist_result, tuple) and len(artist_result) >= 2:
|
if artist_result and isinstance(artist_result, tuple) and len(artist_result) >= 2:
|
||||||
song['mbid'] = artist_result[1] # Set the main artist's MBID
|
song['mbid'] = artist_result[1] # Set the main artist's MBID
|
||||||
|
|
||||||
@ -229,12 +197,12 @@ class SongProcessor:
|
|||||||
recording_info = self.get_recording_info(recording_mbid)
|
recording_info = self.get_recording_info(recording_mbid)
|
||||||
if recording_info:
|
if recording_info:
|
||||||
# Update artist string if there are multiple artists, but preserve the artist MBID
|
# Update artist string if there are multiple artists, but preserve the artist MBID
|
||||||
if self.use_database and recording_info.get('artist_credit'):
|
if recording_info.get('artist_credit'):
|
||||||
song['artist'] = recording_info['artist_credit']
|
song['artist'] = recording_info['artist_credit']
|
||||||
# Keep the original artist MBID even when updating artist name
|
# Keep the original artist MBID even when updating artist name
|
||||||
if 'mbid' not in song:
|
if 'mbid' not in song:
|
||||||
song['mbid'] = artist_mbid
|
song['mbid'] = artist_mbid
|
||||||
elif not self.use_database and recording_info.get('artist-credit'):
|
elif recording_info.get('artist-credit'):
|
||||||
artist_string = self._build_artist_string(recording_info['artist-credit'])
|
artist_string = self._build_artist_string(recording_info['artist-credit'])
|
||||||
if artist_string:
|
if artist_string:
|
||||||
song['artist'] = artist_string
|
song['artist'] = artist_string
|
||||||
|
|||||||