Game Deduplication

How the Chess Database Is Deduplicated

The quality of a chess database depends critically on avoiding duplicates. When the same game appears multiple times in a collection – possibly with slightly different header information – it distorts statistical analyses and wastes storage space. The deduplication script uses a multi-phase algorithm to reliably detect duplicates and intelligently merge them.

The Import Process

PGN Parsing and Chunk Processing

The import begins with reading the PGN files. Since chess databases often span several gigabytes, the files are not loaded completely into memory but split into manageable chunks. Each chunk contains a defined number of games (typically 50,000) and is processed independently.
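The chunked reading described above can be sketched as a streaming generator. This is a minimal illustration, not the script's actual code: the function name `iter_game_chunks` and the heuristic that a new game starts at a line beginning with `[Event ` are assumptions for the sketch.

```python
from typing import Iterator, List, TextIO

def iter_game_chunks(pgn: TextIO, chunk_size: int = 50_000) -> Iterator[List[str]]:
    """Stream a PGN file and yield lists of raw game texts, chunk_size games each.

    A new game is assumed to start at a line beginning with '[Event ' --
    a simplification; a production parser would use a more robust boundary test.
    """
    chunk: List[str] = []
    current: List[str] = []
    for line in pgn:
        # Flush the previous game when the next one begins
        if line.startswith("[Event ") and current:
            chunk.append("".join(current))
            current = []
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        current.append(line)
    # Flush the trailing game and the final partial chunk
    if current:
        chunk.append("".join(current))
    if chunk:
        yield chunk
```

Because the generator never holds more than one chunk in memory, a multi-gigabyte file can be processed with a constant footprint.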

The parser recognizes:

  • Header Tags: All PGN tags such as Event, Site, Date, White, Black, Result, ECO, etc.
  • Move Sequences: The complete notation including variations and comments
  • Invalid Games: Faulty notation is written to a separate log file
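The three recognition tasks above can be sketched as one small function. The shape of the return value and the validity rules here are illustrative assumptions; the real parser also handles variations and comments inside the movetext.

```python
import re

# A well-formed PGN tag pair: [TagName "value"]
TAG_RE = re.compile(r'\[(\w+)\s+"([^"]*)"\]')

def parse_game(raw: str) -> dict:
    """Split one raw PGN game into header tags and movetext.

    Games with malformed tag lines or empty movetext are flagged invalid
    so the caller can divert them to a separate log file.
    """
    headers, moves, valid = {}, [], True
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("["):
            m = TAG_RE.fullmatch(line)
            if m:
                headers[m.group(1)] = m.group(2)
            else:
                valid = False          # malformed tag line
        else:
            moves.append(line)
    movetext = " ".join(moves)
    if not movetext:
        valid = False                  # no moves at all
    return {"headers": headers, "movetext": movetext, "valid": valid}
```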

Parallel Processing with Worker Pools

To leverage the full power of modern multi-core processors, multiple worker processes run in parallel. Each worker:

  • Reads a chunk
  • Parses the PGN notation
  • Normalizes moves and calculates hash values

Performing the intensive parsing and normalization up front in the workers saves a number of work steps and I/O-intensive database queries later on.
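The worker loop can be sketched as follows. Two deliberate simplifications: the real script uses separate processes, while this sketch uses threads to stay self-contained and easy to run, and the parsing/normalization inside `process_chunk` is a placeholder for the real routines described in the sections below.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Worker body: parse, normalize, and hash every game in one chunk.

    The splitting and annotation stripping here are stand-ins for the
    script's real parsing and normalization steps.
    """
    results = []
    for raw in chunk:
        moves = raw.split()
        normalized = [m.rstrip("!?+#") for m in moves]
        results.append((hash(tuple(normalized)), normalized))
    return results

def run_workers(chunks, max_workers=4):
    # Threads stand in for the script's worker processes; pool.map
    # distributes one chunk per task and preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_chunk, chunks))
```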

Staging Table and Batch Transfer

The import uses a two-stage architecture for maximum robustness:

  • Staging Phase: All games are first written to a temporary UNLOGGED table – fast and without transaction overhead
  • Finalize Phase: Data is transferred batch-wise (default: 1 million rows per batch) to the final games table

The batch processing uses a temporary lookup table for player IDs. IMPORTANT: the header quality score (`header_score`) is no longer computed and persisted during the initial import/finalize transfer; `header_score` is typically left NULL during staging and is calculated later in the dedicated Header‑Scores phase (run e.g. with --phases-to-run header-scores). This centralizes scoring and avoids repeated rewrites.
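The batch transfer can be sketched with SQLite standing in for PostgreSQL. Table and column names are illustrative, not the script's real schema; the one behavior taken from the text above is that `header_score` stays NULL during the transfer and is filled in later by the dedicated Header-Scores phase.

```python
import sqlite3

def finalize(conn: sqlite3.Connection, batch_size: int = 1_000_000) -> int:
    """Move rows from the staging table to the final games table in batches.

    SQLite stands in for PostgreSQL here; header_score is deliberately
    left NULL, mirroring the deferred header-scores phase.
    """
    moved = 0
    while True:
        rows = conn.execute(
            "SELECT id, white, black, moves FROM staging LIMIT ?", (batch_size,)
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            "INSERT INTO games (id, white, black, moves, header_score) "
            "VALUES (?, ?, ?, ?, NULL)",
            rows,
        )
        conn.executemany(
            "DELETE FROM staging WHERE id = ?", [(r[0],) for r in rows]
        )
        conn.commit()  # one commit per batch keeps transactions bounded
        moved += len(rows)
    return moved
```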

All deduplication phases are fully parallelized. This includes:

  • Exact Phase: Parallel processing of duplicate groups
  • Subsumed Phase: Parallel processing of subsumed groups
  • Join Phase: Parallel fuzzy matching of candidate pairs
  • Header Merge: Parallel merging of metadata

Move Sequence Normalization

A critical step before deduplication: moves from different sources are often notated differently. The script normalizes all moves to a unified UCI format:

  • Piece symbols are standardized (K, Q, R, B, N)
  • Superfluous characters (!, ?, +, #) are removed for hash comparison
  • Move numbering is standardized

From the normalized move sequence, a hash value (xxHash) is calculated – a compact fingerprint of the game that enables fast comparisons.
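The normalization-and-fingerprint step can be sketched as below. One substitution to keep the sketch dependency-free: the script uses xxHash, while this example uses `blake2b` from the standard library as a stand-in; the regexes are likewise illustrative.

```python
import hashlib
import re

ANNOTATION_RE = re.compile(r"[!?+#]")        # annotation glyphs dropped for hashing
MOVE_NUMBER_RE = re.compile(r"\d+\.(\.\.)?")  # "12." and "12..." markers

def fingerprint(movetext: str) -> str:
    """Normalize a move sequence and return its hash fingerprint.

    blake2b stands in for xxHash so the sketch runs without
    third-party packages.
    """
    text = ANNOTATION_RE.sub("", movetext)   # remove !, ?, +, #
    text = MOVE_NUMBER_RE.sub("", text)      # standardize away move numbering
    canonical = " ".join(text.split())       # unify whitespace
    return hashlib.blake2b(canonical.encode(), digest_size=8).hexdigest()
```

The point of the normalization is visible in the tests: two differently notated copies of the same game hash identically.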

Comprehensive Data Validation

All fields are strictly validated during import:

  • FIDE IDs: Stored as BIGINT (supports all current and future IDs)
  • Year values: Validated in range 0-2100
  • ELO ratings: Validated in range 0-4000
  • Date fields: Must conform to YYYY-MM-DD format
  • Invalid values: Markers like “????” are automatically filtered
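The validation rules above can be sketched as small cleaning functions that return `None` for anything out of range or malformed. The function names and the strict `YYYY-MM-DD` check follow the list above; everything else is illustrative.

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def clean_elo(value: str):
    """Return the ELO as int if it lies in 0-4000, else None."""
    if not value or not value.isdigit():
        return None                 # also filters markers like "????"
    elo = int(value)
    return elo if 0 <= elo <= 4000 else None

def clean_year(value: str):
    """Return the year as int if it lies in 0-2100, else None."""
    if not value or not value.isdigit():
        return None
    year = int(value)
    return year if 0 <= year <= 2100 else None

def clean_date(value: str):
    """Return the date string only if it matches YYYY-MM-DD."""
    return value if value and DATE_RE.fullmatch(value) else None
```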

Preparation for Deduplication

Player Name Normalization

One of the biggest challenges: the same player appears under different names. “Carlsen, Magnus”, “Carlsen,Magnus”, “Carlsen, M.”, and “Magnus Carlsen” are all the same person. The script performs several normalization steps:

  • Whitespace Normalization: Spaces and commas are unified
  • Title Extraction: “GM”, “IM”, “FM” etc. are stored separately
  • Unicode Normalization: Special characters and accents are standardized
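The three steps can be sketched together. The title set and the exact order of operations are assumptions for the sketch; the script's real rules may differ in detail.

```python
import re
import unicodedata

TITLES = {"GM", "IM", "FM", "WGM", "WIM", "WFM", "CM", "NM"}  # illustrative set

def normalize_player(name: str):
    """Return (normalized_name, title) for one raw player string."""
    # Unicode normalization: decompose accents, then drop combining marks
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    # Whitespace/comma normalization: "Carlsen,Magnus" -> "Carlsen, Magnus"
    name = re.sub(r"\s*,\s*", ", ", name)
    name = re.sub(r"\s+", " ", name).strip()
    # Title extraction: a leading "GM", "IM", ... is stored separately
    title = None
    parts = name.split(" ", 1)
    if parts[0] in TITLES and len(parts) > 1:
        title, name = parts[0], parts[1]
    return name, title
```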

FIDE Player Lookup

A powerful feature: the script matches player names against the official FIDE database, which is downloaded fresh on each run. When a match is found:

  • The official FIDE ID is assigned
  • The correct spelling of the name is adopted
  • Nationality and title are verified

The lookup uses trigram-based similarity search (pg_trgm) to find the correct player even with typos or alternative spellings.
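The trigram measure that pg_trgm computes inside PostgreSQL can be approximated in pure Python, which makes the fuzziness of the lookup easy to see. This is an approximation of pg_trgm's behavior (lowercase the text, pad each word with two leading and one trailing space, take all 3-grams, then compare the sets), not the extension itself.

```python
def trigrams(text: str) -> set:
    """pg_trgm-style trigram set of a string (approximation)."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "   # pg_trgm pads words before slicing
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    """Shared trigrams over total trigrams -- the measure behind
    pg_trgm's similarity(), reimplemented for illustration."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

A typo like “Karlsen” still scores far closer to “Carlsen” than an unrelated name does, which is exactly why the lookup tolerates misspellings.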

Reference ID Assignment

Similar player names are grouped into reference groups. All variants of a name receive the same reference_id, which enormously speeds up later queries. Instead of searching for “Carlsen, Magnus OR Carlsen,Magnus OR …”, a single ID suffices.
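Grouping name variants into reference groups is a connected-components problem, which a union-find structure solves cleanly. The sketch below is illustrative: the pair list would come from the similarity lookup, and the numeric IDs stand in for the real `reference_id` values.

```python
class UnionFind:
    """Minimal union-find used to group name variants under one reference_id."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def assign_reference_ids(names, pairs):
    """Give every variant in a connected group the same reference_id.

    `pairs` are name pairs judged similar (e.g. by the trigram lookup).
    """
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    roots, ref_ids = {}, {}
    for name in names:
        root = uf.find(name)
        ref_ids[name] = roots.setdefault(root, len(roots) + 1)
    return ref_ids
```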

Materialized View for Duplicate Candidates

Before the actual deduplication, a Materialized View is created in PostgreSQL. This pre-computation identifies all game pairs with identical hashes and stores them for fast access. The deduplication can thus work directly on relevant candidates instead of having to search through all millions of games.

The Deduplication Phases

Phase 1: Exact Duplicate Detection

The first step identifies games with exactly identical move sequences. For each game, a unique hash value (fingerprint) is calculated from the normalized move sequence. Games with identical hashes are grouped as potential duplicates.
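In-memory, the grouping is a one-pass bucket by hash; the database does the same thing via the materialized view. The `(game_id, move_hash)` input shape below is an illustrative simplification of the real schema.

```python
from collections import defaultdict

def exact_duplicate_groups(games):
    """Group games by move-sequence hash.

    `games` is a list of (game_id, move_hash) pairs; every bucket with
    more than one member is a set of exact-duplicate candidates.
    """
    groups = defaultdict(list)
    for game_id, move_hash in games:
        groups[move_hash].append(game_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```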

Phase 2: Subsumption Detection

Not all duplicates have exactly the same move sequence. Often a game was interrupted or the notation ends earlier than in another source. Subsumption detection finds such cases:

  • Game A contains moves 1-40
  • Game B contains moves 1-35 (identical to A)
  • → Game B is subsumed by Game A

The shorter fragment is marked as a duplicate of the more complete game, while the complete version is preserved as the master.
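On normalized move lists, subsumption reduces to a strict-prefix test; picking the longest subsuming game as master follows directly. The pairwise loop below is an illustrative sketch (the real phase works on pre-filtered candidate groups, not all pairs).

```python
def is_subsumed(shorter, longer):
    """True if `shorter` is a strict prefix of `longer` (normalized moves)."""
    return len(shorter) < len(longer) and longer[:len(shorter)] == shorter

def mark_subsumed(games):
    """Map each prefix fragment to the most complete game that subsumes it.

    `games` is a dict of game_id -> normalized move list; the returned
    dict is {fragment_id: master_id}.
    """
    marked = {}
    for gid, moves in games.items():
        best = None
        for other_id, other in games.items():
            if other_id != gid and is_subsumed(moves, other):
                if best is None or len(other) > len(games[best]):
                    best = other_id   # prefer the longest master
        if best is not None:
            marked[gid] = best
    return marked
```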

Phase 3: Join-Lines Detection

Sometimes two games complement each other: one source has the opening in detail, another has the endgame. The Join-Lines phase detects such cases and can intelligently merge the move sequences to reconstruct the most complete version.

Example: If the main source records 10. Nf3, but another source has 10. Ng5, this appears in the result as:

10. Nf3 (10. Ng5) 10... Be7

This way, no information is lost, and researchers can trace the different sources.
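Rendering the merged line with the divergence in parentheses can be sketched as follows. The sketch records only the first divergence and restates the move number after a variation, matching the example above; the real phase is more thorough.

```python
def join_lines(main, alt):
    """Render `main` with the first deviation of `alt` as a variation.

    `main` and `alt` are plain SAN move lists (white/black alternating).
    """
    out = []
    diverged = False
    restate = False
    for i, move in enumerate(main):
        num = i // 2 + 1
        if i % 2 == 0:
            out.append(f"{num}. {move}")
            restate = False
        elif restate:
            out.append(f"{num}... {move}")   # number restated after a variation
            restate = False
        else:
            out.append(move)
        if not diverged and i < len(alt) and alt[i] != move:
            marker = f"{num}. " if i % 2 == 0 else f"{num}... "
            out.append(f"({marker}{alt[i]})")
            diverged = True
            restate = (i % 2 == 0)   # black's reply needs its number restated
    return " ".join(out)
```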

Phase 4: Intelligent Header Merging

Different sources provide different quality metadata. One source has the correct FIDE IDs of the players, another has the exact tournament date, yet another has the ELO ratings at the time of the game.

The script uses a score-based merge system:

  • Each header value (player name, date, ELO, etc.) receives a quality score
  • FIDE-verified data receives higher scores
  • For each field, the value with the highest score is adopted
  • The result is a “best-of-all” version of the game
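The field-wise "best-of-all" merge is a simple maximum per field. The `{field: (value, score)}` input shape is an illustrative assumption; the scores themselves would come from the header-scores phase, with FIDE-verified data scoring higher.

```python
def merge_headers(versions):
    """Field-wise best-of merge across all duplicate versions of a game.

    `versions` is a list of {field: (value, score)} dicts; for each
    field, the value with the highest quality score wins.
    """
    merged = {}
    for headers in versions:
        for field, (value, score) in headers.items():
            if field not in merged or score > merged[field][1]:
                merged[field] = (value, score)
    return {field: value for field, (value, _) in merged.items()}
```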

Result

After processing and exporting, the database contains:

  • Master Games: The highest quality version of each unique game
  • Duplicate Markers: References to the respective master for statistical purposes
  • Complete Provenance: All sources are traceably documented

The result is a cleaned, high-quality chess database without redundant entries – the foundation for reliable statistics and analyses.
