How the Chess Database Is Deduplicated
The quality of a chess database depends critically on avoiding duplicates. When the same game appears multiple times in a collection – possibly with slightly different header information – it distorts statistical analyses and wastes storage space. The deduplication script uses a multi-phase algorithm to reliably detect duplicates and intelligently merge them.
The Import Process
PGN Parsing and Chunk Processing
The import begins with reading the PGN files. Since chess databases often span several gigabytes, the files are not loaded completely into memory but split into manageable chunks. Each chunk contains a defined number of games (typically 50,000) and is processed independently.
The parser recognizes:
- Header Tags: All PGN tags such as Event, Site, Date, White, Black, Result, ECO, etc.
- Move Sequences: The complete notation including variations and comments
- Invalid Games: Faulty notation is written to a separate log file
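The header/movetext split described above can be sketched in a few lines. This is a simplified illustration, not the script's actual parser (which additionally handles variations, comments, and malformed games); `parse_headers` and `split_movetext` are hypothetical helper names.

```python
import re

# PGN header tags look like: [Event "World Championship"]
TAG_RE = re.compile(r'\[(\w+)\s+"([^"]*)"\]')

def parse_headers(pgn_text: str) -> dict:
    """Extract PGN header tags into a dict (simplified sketch)."""
    return dict(TAG_RE.findall(pgn_text))

def split_movetext(pgn_text: str) -> str:
    """Return everything after the header block: the move section."""
    lines = pgn_text.splitlines()
    return " ".join(ln for ln in lines if ln.strip() and not ln.startswith("["))

game = '[Event "Example"]\n[White "Carlsen, Magnus"]\n1. e4 e5 2. Nf3'
headers = parse_headers(game)
moves = split_movetext(game)
```

A real parser must also cope with tag values containing escaped quotes and with movetext spread over many lines, which this sketch glosses over.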
Parallel Processing with Worker Pools
To leverage the full power of modern multi-core processors, multiple worker processes run in parallel. Each worker:
- Reads a chunk
- Parses the PGN notation
- Normalizes moves and calculates hash values
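The worker loop above can be sketched with Python's standard pool abstraction. The real script runs separate worker processes; a thread pool is used here only so the sketch stays self-contained, and `process_chunk` with its sha256 fingerprint is an illustrative stand-in for the actual normalization and xxHash step.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def process_chunk(games: list) -> list:
    """Worker sketch: normalize each game's moves and compute a fingerprint."""
    return [
        hashlib.sha256(g.replace("!", "").replace("?", "").encode()).hexdigest()
        for g in games
    ]

def chunked(items, size):
    """Split the game list into fixed-size chunks, one per worker task."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

games = ["1. e4 e5", "1. d4 d5!", "1. c4 e5"]
with ThreadPoolExecutor(max_workers=4) as pool:
    hashes = [h for chunk in pool.map(process_chunk, chunked(games, 2)) for h in chunk]
```

Because each chunk is independent, the same structure scales to a process pool for CPU-bound parsing without shared state.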
Performing this intensive parsing and normalization up front saves work steps and I/O-intensive database queries later on.
Staging Table and Batch Transfer
The import uses a two-stage architecture for maximum robustness:
- Staging Phase: All games are first written to a temporary UNLOGGED table – fast and without transaction overhead
- Finalize Phase: Data is transferred batch-wise (default: 1 million rows per batch) to the final games table
The batch processing uses a temporary lookup table for player IDs. Important: the header quality score (`header_score`) is no longer computed and persisted during the initial import/finalize transfer. It is typically left NULL during staging and calculated later in the dedicated Header-Scores phase (run e.g. with `--phases-to-run header-scores`). This centralizes scoring and avoids repeated rewrites.
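The finalize phase's batch transfer can be sketched as a generator of per-batch SQL statements. The table and column names (`games`, `games_staging`, `staging_id`) are illustrative, not the script's actual schema.

```python
def finalize_batches(total_rows: int, batch_size: int = 1_000_000):
    """Yield SQL statements that move staging rows to the final table in batches.

    Names are hypothetical; the point is the keyset-style batching, which keeps
    each transaction small instead of moving millions of rows at once.
    """
    for start in range(0, total_rows, batch_size):
        yield (
            "INSERT INTO games SELECT * FROM games_staging "
            f"WHERE staging_id > {start} AND staging_id <= {start + batch_size}"
        )

statements = list(finalize_batches(2_500_000))
```

Each statement would be executed in its own transaction, so a failure late in the transfer only rolls back one batch.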
All deduplication phases are fully parallelized. This includes:
- Exact Phase: Parallel processing of duplicate groups
- Subsumed Phase: Parallel processing of subsumed groups
- Join Phase: Parallel fuzzy matching of candidate pairs
- Header Merge: Parallel merging of metadata
Move Sequence Normalization
A critical step before deduplication: moves from different sources are often notated differently. The script normalizes all moves to a unified UCI format:
- Piece symbols are standardized (K, Q, R, B, N)
- Superfluous characters (!, ?, +, #) are removed for hash comparison
- Move numbering is standardized
From the normalized move sequence, a hash value (xxHash) is calculated – a unique fingerprint of the game for fast comparisons.
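The normalization and hashing steps can be sketched as follows. The script uses xxHash; the standard library's sha256 stands in here so the example needs no third-party package, and the regexes are a simplified version of the real normalization.

```python
import hashlib
import re

def normalize_moves(movetext: str) -> str:
    """Strip annotations and move numbers so equivalent notations compare equal."""
    text = re.sub(r"[!?+#]", "", movetext)        # drop annotation/check symbols
    text = re.sub(r"\d+\.(\.\.)?\s*", "", text)   # drop move numbers like '12.' / '12...'
    return " ".join(text.split())                 # collapse whitespace

def game_hash(movetext: str) -> str:
    """Fingerprint of the normalized move sequence (sha256 stands in for xxHash)."""
    return hashlib.sha256(normalize_moves(movetext).encode()).hexdigest()
```

Two differently annotated copies of the same game now produce the same fingerprint, which is what makes the later hash-based grouping possible.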
Comprehensive Data Validation
All fields are strictly validated during import:
- FIDE IDs: Stored as BIGINT (supports all current and future IDs)
- Year values: Validated in range 0-2100
- ELO ratings: Validated in range 0-4000
- Date fields: Must conform to YYYY-MM-DD format
- Invalid values: Markers like “????” are automatically filtered
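The validation rules above can be condensed into one dispatch function. This is a minimal sketch; `clean_field` is a hypothetical helper, and the real script validates far more fields.

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def clean_field(name: str, value: str):
    """Return a validated value or None; ranges mirror the import rules above."""
    if value is None or "?" in value:          # markers like '????' are filtered
        return None
    if name == "year":
        y = int(value)
        return y if 0 <= y <= 2100 else None
    if name == "elo":
        e = int(value)
        return e if 0 <= e <= 4000 else None
    if name == "date":
        return value if DATE_RE.match(value) else None
    return value
```

Returning None rather than raising lets the importer keep the game and simply leave the bad field empty.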
Preparation for Deduplication
Player Name Normalization
One of the biggest challenges: the same player appears under different names. “Carlsen, Magnus”, “Carlsen,Magnus”, “Carlsen, M.”, and “Magnus Carlsen” are all the same person. The script performs several normalization steps:
- Whitespace Normalization: Spaces and commas are unified
- Title Extraction: “GM”, “IM”, “FM” etc. are stored separately
- Unicode Normalization: Special characters and accents are standardized
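These three normalization steps can be sketched together. The title list and the helper name `normalize_name` are illustrative; the real script's rules are more extensive.

```python
import re
import unicodedata

TITLES = {"GM", "IM", "FM", "WGM", "WIM", "WFM", "CM", "NM"}  # illustrative subset

def normalize_name(raw: str):
    """Return (normalized name, extracted title) for a raw player-name string."""
    # Unicode normalization: decompose accents, then drop the combining marks
    ascii_name = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode()
    # Whitespace/comma normalization: split on any run of commas and spaces
    parts = [p for p in re.split(r"[,\s]+", ascii_name) if p]
    # Title extraction: pull 'GM', 'IM', etc. out of the name itself
    title = next((p for p in parts if p.upper() in TITLES), None)
    parts = [p for p in parts if p.upper() not in TITLES]
    return " ".join(parts), title
```

After this step, "Carlsen, Magnus" and "Carlsen,Magnus" collapse to the same key, though "Carlsen, M." still needs the fuzzy matching described below.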
FIDE Player Lookup
A powerful feature: the script matches player names against the official FIDE database, which is downloaded at each execution of the script. When a match is found:
- The official FIDE ID is assigned
- The correct spelling of the name is adopted
- Nationality and title are verified
The lookup uses trigram-based similarity search (pg_trgm) to find the correct player even with typos or alternative spellings.
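To make the trigram idea concrete, here is a pure-Python approximation of what `pg_trgm`'s `similarity()` computes: both strings are broken into padded three-character fragments, and the score is the ratio of shared to total trigrams. The padding details differ slightly from PostgreSQL's, so treat this as a sketch of the principle, not a bit-exact reimplementation.

```python
def trigrams(s: str) -> set:
    """Trigram set roughly as pg_trgm builds it: lowercase, per-word padding."""
    grams = set()
    for word in s.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    """Share of common trigrams, as pg_trgm's similarity() scores it."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0
```

A single typo only disturbs a few trigrams, so "Carlsen" and "Carlson" still score high while unrelated names score near zero, which is exactly why the lookup survives misspellings.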
Reference ID Assignment
Similar player names are grouped into reference groups. All variants of a name receive the same reference_id, which enormously speeds up later queries. Instead of searching for “Carlsen, Magnus OR Carlsen,Magnus OR …”, a single ID suffices.
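Grouping the variants can be done with a small union-find structure: every similarity match links two spellings, and all transitively linked spellings end up under one representative. This is a sketch of the grouping idea; the helper name and data shapes are assumptions, not the script's actual code.

```python
def assign_reference_ids(names, matched_pairs):
    """Union-find sketch: spellings linked by similarity pairs share one reference."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving keeps chains short
            n = parent[n]
        return n

    for a, b in matched_pairs:
        parent[find(a)] = find(b)
    return {n: find(n) for n in names}

names = ["Carlsen, Magnus", "Carlsen,Magnus", "Carlsen, M.", "Kasparov, Garry"]
pairs = [("Carlsen, Magnus", "Carlsen,Magnus"), ("Carlsen,Magnus", "Carlsen, M.")]
refs = assign_reference_ids(names, pairs)
```

Queries can then filter on the single representative instead of enumerating every spelling.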
Materialized View for Duplicate Candidates
Before the actual deduplication, a Materialized View is created in PostgreSQL. This pre-computation identifies all game pairs with identical hashes and stores them for fast access. The deduplication can thus work directly on relevant candidates instead of having to search through all millions of games.
The Deduplication Phases
Phase 1: Exact Duplicate Detection
The first step identifies games with exactly identical move sequences. For each game, a unique hash value (fingerprint) is calculated from the normalized move sequence. Games with identical hashes are grouped as potential duplicates.
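Conceptually, this phase is a single group-by over the fingerprints; only groups with more than one game survive as duplicate candidates. A minimal sketch:

```python
from collections import defaultdict

def group_by_hash(games):
    """Group (game_id, move_hash) records; groups of size > 1 are duplicate candidates."""
    groups = defaultdict(list)
    for game_id, move_hash in games:
        groups[move_hash].append(game_id)
    return {h: ids for h, ids in groups.items() if len(ids) > 1}
```

In the real pipeline this grouping happens in PostgreSQL (via the materialized view described above) rather than in application memory, but the logic is the same.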
Phase 2: Subsumption Detection
Not all duplicates have exactly the same move sequence. Often a game was interrupted or the notation ends earlier than in another source. Subsumption detection finds such cases:
- Game A contains moves 1-40
- Game B contains moves 1-35 (identical to A)
- → Game B is subsumed by Game A
The shorter fragment is marked as a duplicate of the more complete game, while the complete version is preserved as the master.
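The core test of this phase is a strict-prefix check on the normalized move lists, sketched below. The real script avoids comparing full move lists pairwise (prefix hashes make this cheap), but the decision it implements is this one:

```python
def is_subsumed(shorter: list, longer: list) -> bool:
    """True if 'shorter' is a strict prefix of 'longer' (the interrupted-game case)."""
    return len(shorter) < len(longer) and longer[:len(shorter)] == shorter

game_a = ["e4", "e5", "Nf3", "Nc6", "Bb5"]   # the complete master
game_b = ["e4", "e5", "Nf3"]                 # a truncated fragment of it
```

Requiring a *strict* prefix matters: two games of equal length with identical moves are exact duplicates and belong to Phase 1, not here.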
Phase 3: Join-Lines Detection
Sometimes two games complement each other: one source has the opening in detail, another has the endgame. The Join-Lines phase detects such cases and can intelligently merge the move sequences to reconstruct the most complete version.
Example: If the main source records 10. Nf3, but another source has 10. Ng5, this appears in the result as:
10. Nf3 (10. Ng5) 10... Be7
This way, no information is lost, and researchers can trace the different sources.
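A simplified merge that produces the parenthesized-variation output shown above might look like this. It handles only a single divergence point and ignores PGN move numbering, so it is a sketch of the idea, not the script's actual join algorithm:

```python
def merge_lines(main: list, other: list) -> list:
    """Merge two move lists; where they diverge, keep the other source's move
    as a parenthesized variation (simplified: only the first divergence)."""
    merged = []
    for i, mv in enumerate(main):
        if i < len(other) and other[i] != mv:
            merged.append(f"{mv} ({other[i]})")   # record the alternative line
            merged.extend(main[i + 1:])
            return merged
        merged.append(mv)
    merged.extend(other[len(main):])              # other source continues further
    return merged
```

The second branch is what reconstructs the "most complete version": when one source simply ends earlier, the longer tail is appended instead of being annotated.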
Phase 4: Intelligent Header Merging
Different sources provide different quality metadata. One source has the correct FIDE IDs of the players, another has the exact tournament date, yet another has the ELO ratings at the time of the game.
The script uses a score-based merge system:
- Each header value (player name, date, ELO, etc.) receives a quality score
- FIDE-verified data receives higher scores
- For each field, the value with the highest score is adopted
- The result is a “best-of-all” version of the game
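The field-wise "best-of-all" selection reduces to taking the highest-scored candidate per field. The scores below are made up for illustration; in the script they are derived from source quality and FIDE verification as described above.

```python
def merge_headers(candidates: dict) -> dict:
    """candidates maps field -> list of (value, score); keep the best-scored value."""
    return {
        field: max(values, key=lambda vs: vs[1])[0]
        for field, values in candidates.items()
    }

merged = merge_headers({
    "White": [("Carlsen, M.", 1), ("Carlsen, Magnus", 3)],   # FIDE-verified wins
    "Date":  [("1997-05-11", 2), ("1997-??-??", 0)],
})
```

Because selection is per field, the merged header can combine the date from one source with the player names from another, which no single source provided on its own.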
Result
After processing and exporting, the database contains:
- Master Games: The highest quality version of each unique game
- Duplicate Markers: References to the respective master for statistical purposes
- Complete Provenance: All sources are traceably documented
The result is a cleaned, high-quality chess database without redundant entries – the foundation for reliable statistics and analyses.