How Duplicate Games Are Found
Complete overview of all deduplication phases, algorithms and configuration options
1. Deduplication Phases Overview
The system operates in 9 sequential phases:
2. The Three Duplicate Types
2.1 Exact Duplicates — SHA-256 Move-Hash
Two games are exact duplicates when they have identical move sequences. The check works as follows:
- All moves are converted to UCI notation (e.g. e2e4 instead of e4)
- A SHA-256 hash is computed over the entire move sequence
- Games with identical hashes are exact duplicates
Game A: 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 ← English notation
Game B: 1.e4 e5 2.Sf3 Sc6 3.Lb5 a6 ← German notation
UCI: e2e4 e7e5 g1f3 b8c6 f1b5 a7a6 ← identical!
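The hashing step can be sketched in Python. This is an illustration only, not the tool's own code: the function name `move_hash` and the single-space separator between moves are assumptions, and the moves are taken as already converted to UCI.

```python
import hashlib

def move_hash(uci_moves):
    """Return the SHA-256 hex digest of a UCI move sequence."""
    # Assumption: moves are joined with a single space before hashing.
    joined = " ".join(uci_moves)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Both games reduce to the same UCI sequence, whatever the source notation.
game_a = ["e2e4", "e7e5", "g1f3", "b8c6", "f1b5", "a7a6"]  # from English SAN
game_b = ["e2e4", "e7e5", "g1f3", "b8c6", "f1b5", "a7a6"]  # from German SAN

print(move_hash(game_a) == move_hash(game_b))  # True
```

Because the hash covers the whole sequence, a game that is one move shorter produces a different digest and is left to the subsumption check.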
2.2 Subsumption Duplicates (Prefix Games)
A game is a subsumption when its moves appear exactly at the beginning of a longer game.
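A minimal Python sketch of the prefix test, assuming both games are available as lists of UCI moves. The function name `is_subsumed` is hypothetical.

```python
def is_subsumed(shorter, longer):
    """True if `shorter` is a strict prefix of `longer` (lists of UCI moves)."""
    # Equal-length games are left to the exact-duplicate check.
    return len(shorter) < len(longer) and longer[:len(shorter)] == shorter

full_game = ["e2e4", "c7c6", "d2d4", "d7d5", "b1c3"]
truncated = ["e2e4", "c7c6", "d2d4"]

print(is_subsumed(truncated, full_game))  # True
```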
Partial Subsumption v2.3.1
When a player is unknown ([White "?"]), the system checks the original header instead of resolved values:
Game A: [White "Kitces, Edward"] [Black "?"] [BlackFideId "2004194"]
1.e4 c6 2.d4 d5 ... (63 half-moves)
Game B: [White "Kitces, Edward"] [Black "?"]
1.e4 c6 2.d4 d5 ... (75 half-moves)
→ Grouped by white_norm="kitces, edward" → Game A ⊂ Game B
2.3 Join-Lines (Fuzzy Matching)
For games with minor differences. A match requires all three algorithms (Jaro-Winkler, Token-LCS and Indel count) to meet their thresholds.
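Two of the three measures can be sketched in Python on move lists. This is a rough illustration under assumptions: the LCS ratio is normalized by the longer game, the indel count is derived from the LCS length, and Jaro-Winkler is left out. The tool itself may define these measures differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def token_lcs_ratio(a, b):
    """LCS length relative to the longer sequence (assumed normalization)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0

def indel_count(a, b):
    """Insertions plus deletions needed to turn `a` into `b`."""
    return len(a) + len(b) - 2 * lcs_length(a, b)

game_a = ["e2e4", "e7e5", "g1f3", "b8c6"]
game_b = ["e2e4", "e7e5", "g1f3", "b8c6", "f1b5"]

print(token_lcs_ratio(game_a, game_b))  # 0.8
print(indel_count(game_a, game_b))      # 1
```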
3. Player Grouping
A critical aspect: Only games with the same player combination are compared.
The core_game_hash
core_game_hash = SHA256(white_norm + ":" + black_norm)
Players with resolved FIDE names are correctly grouped.
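The formula can be written out as a short Python sketch. The normalization shown (lowercasing and collapsing whitespace) is an assumption, as are the helper names. The sketch also folds in the unified "unknown" placeholder that the document describes for invalid names; treating an empty name as invalid is a further assumption.

```python
import hashlib

# Assumption: invalid names are compared case-insensitively.
INVALID_NAMES = {"?", "nn", "unknown"}

def normalize_player(name):
    """Lowercase, collapse whitespace, and map invalid names to 'unknown'."""
    norm = " ".join(name.lower().split())
    return "unknown" if not norm or norm in INVALID_NAMES else norm

def core_game_hash(white, black):
    """SHA256(white_norm + ':' + black_norm) as a hex digest."""
    key = normalize_player(white) + ":" + normalize_player(black)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

print(normalize_player("Kitces, Edward"))  # kitces, edward
print(core_game_hash("Kitces, Edward", "?") == core_game_hash("kitces, edward", "NN"))  # True
```

Only games that share this hash land in the same comparison group, which keeps the more expensive move comparisons confined to games between the same two players.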
Handling Invalid Player Names v2.2.1
- Invalid names (?, NN, Unknown) → unified placeholder "unknown"
- FIDE IDs for invalid names are ignored
FIDE ID Conflicts
Source A: Filipenko, Alexander V, BlackFideId 34117881
Source B: Filipenko, Alexander V, BlackFideId 4104471
→ Fuzzy player matching resolves conflicts
4. Phonetic Matching — Triple-Phonetic Blocking v2.5
Since v2.5, the system uses three independent phonetic algorithms for player matching: Double Metaphone, Beider-Morse and Daitch-Mokotoff.
Canonical FIDE Names
| Before (PGN) | After (Backfill) |
|---|---|
| Chilov, A… | Chilov, Alexandros |
| Lemos, N. | Lemos, Nikolaos |
| Carlsen, M | Carlsen, Magnus |
Automatic Initial Aliases v2.4.2
| Canonical Name | Generated Aliases |
|---|---|
| Carlsen, Magnus | Carlsen, M. / Carlsen, M |
| Van der Berg, Jan Peter | Van der Berg, J. / Van der Berg, JP |
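One rule that reproduces both rows of the table is: first initial with a period, plus all given-name initials run together without a period. The Python sketch below implements that rule as an assumption drawn from the two examples; the actual generator may produce further variants.

```python
def initial_aliases(canonical_name):
    """Generate initial-based aliases from a 'Surname, Given Names' string."""
    surname, _, given = canonical_name.partition(",")
    surname = surname.strip()
    parts = given.split()
    if not parts:
        return []  # no given name, nothing to abbreviate
    first_initial = f"{surname}, {parts[0][0]}."
    all_initials = f"{surname}, {''.join(p[0] for p in parts)}"
    return [first_initial, all_initials]

print(initial_aliases("Carlsen, Magnus"))
# ['Carlsen, M.', 'Carlsen, M']
print(initial_aliases("Van der Berg, Jan Peter"))
# ['Van der Berg, J.', 'Van der Berg, JP']
```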
5. International Name Matching v2.5
6. Comparison Parameters
Position and Length
| Parameter | Description | Default |
|---|---|---|
| --join-ply-start | From which half-move to compare | 15 |
| --join-min-compare | Minimum overlap after ply-start | 10 |
| --join-diff-ratio | Allowed relative length deviation | 0.15 |
| --join-min-ply | Minimum game length for candidates | 10 |
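One possible reading of how these four parameters gate a candidate pair, sketched in Python. The interpretation is an assumption: length deviation is measured relative to the longer game, and the overlap is the shorter game's length minus the ply-start offset. The names `JoinParams` and `is_join_candidate` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JoinParams:
    ply_start: int = 15       # --join-ply-start
    min_compare: int = 10     # --join-min-compare
    diff_ratio: float = 0.15  # --join-diff-ratio
    min_ply: int = 10         # --join-min-ply

def is_join_candidate(len_a, len_b, p=JoinParams()):
    """Decide from the two game lengths (in half-moves) whether to compare."""
    shorter, longer = sorted((len_a, len_b))
    if shorter < p.min_ply:
        return False  # too short to be a candidate
    if (longer - shorter) / longer > p.diff_ratio:
        return False  # lengths deviate too much
    # Assumed: the first ply_start half-moves are skipped before comparing.
    return shorter - p.ply_start >= p.min_compare

print(is_join_candidate(70, 75))  # True
print(is_join_candidate(63, 75))  # False, deviation 12/75 = 0.16
```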
Accuracy Presets
| Preset | Jaro-Winkler | Token-LCS | Max Indel | Use Case |
|---|---|---|---|---|
| strict | ≥0.98 | ≥0.98 | ≤2 | High-quality tournament data |
| normal | ≥0.95 | ≥0.95 | ≤4 | General use |
| tolerant | ≥0.92 | ≥0.90 | ≤8 | Noisy data sources |
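The presets translate directly into a lookup table. In this Python sketch the threshold values come from the table above, while the dictionary layout and the function name `is_join_match` are illustrative assumptions.

```python
PRESETS = {
    "strict":   {"jw": 0.98, "lcs": 0.98, "max_indel": 2},
    "normal":   {"jw": 0.95, "lcs": 0.95, "max_indel": 4},
    "tolerant": {"jw": 0.92, "lcs": 0.90, "max_indel": 8},
}

def is_join_match(jw, lcs, indel, preset="normal"):
    """All three measures must meet the thresholds of the chosen preset."""
    t = PRESETS[preset]
    return jw >= t["jw"] and lcs >= t["lcs"] and indel <= t["max_indel"]

print(is_join_match(0.96, 0.96, 3))            # True
print(is_join_match(0.96, 0.96, 3, "strict"))  # False
```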
Duplicate Classification
7. Header Merging & Export
Detected duplicates are merged according to these rules:
- Higher quality preferred: Complete data over placeholders
- FIDE data prioritized: ELO, titles, FIDE IDs from official database
- All sources documented: Source header shows origin
- Merged tag: Merged games are marked with [Merged "true"]
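The first and last rules can be illustrated with a small Python sketch. It is deliberately partial: FIDE prioritization and the Source header are not modeled, and the placeholder set and function name are assumptions.

```python
# Assumption: these values count as placeholders in PGN headers.
PLACEHOLDERS = {"", "?", "????.??.??"}

def merge_headers(master, duplicate):
    """Fill placeholder values in `master` from `duplicate` and tag the result."""
    merged = dict(master)
    for tag, value in duplicate.items():
        if merged.get(tag, "") in PLACEHOLDERS and value not in PLACEHOLDERS:
            merged[tag] = value  # complete data wins over a placeholder
    merged["Merged"] = "true"
    return merged

master = {"Event": "?", "White": "Carlsen, Magnus", "Date": "????.??.??"}
duplicate = {"Event": "Olympiad", "White": "Carlsen, M", "Date": "2014.08.02"}

print(merge_headers(master, duplicate))
# {'Event': 'Olympiad', 'White': 'Carlsen, Magnus', 'Date': '2014.08.02', 'Merged': 'true'}
```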
Single Source of Truth v2.4.0
The merge phase writes the best values directly into dedicated Parquet columns:
| Column | Description |
|---|---|
| event | Best event name from duplicate group |
| site | Best venue |
| game_date | Best date |
| white_elo / black_elo | Best ELO values |
| white_title / black_title | Best titles |
| eco | Best ECO classification |
| time_control | Best time control information |
Consistent Player Names v2.4.0
Three-tier priority:
- Canonical name (canonical_name from SSP/FIDE)
- Original name (white_player/black_player)
- Normalized name (white_norm/black_norm), the lowercase fallback
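The three tiers amount to a first-non-empty lookup. A minimal Python sketch, assuming missing values arrive as None or empty strings; the function name `display_name` is hypothetical.

```python
def display_name(canonical_name, original_name, normalized_name):
    """Return the highest-priority name that is present, or None."""
    for candidate in (canonical_name, original_name, normalized_name):
        if candidate:
            return candidate
    return None

print(display_name("Carlsen, Magnus", "Carlsen, M", "carlsen, m"))  # Carlsen, Magnus
print(display_name(None, "Carlsen, M", "carlsen, m"))               # Carlsen, M
print(display_name(None, "", "carlsen, m"))                         # carlsen, m
```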
Merged Tag for Variants v2.4.0
[Event "Olympiad"]
[White "Carlsen, Magnus"]
[Black "Anand, Viswanathan"]
[Merged "true"]
1.e4 e5 2.Nf3 Nc6 (2...Nf6 {Variant from duplicate}) 3.Bb5 *
8. Performance Optimizations
Example Workflow
# Full deduplication with FIDE data
pgn_deduplicator games.pgn --fide-xml players.xml --load-fide-data -o unique.pgn

# Strict deduplication for tournament data
pgn_deduplicator tournament.pgn --join-accuracy strict -o clean.pgn

# Tolerant deduplication for online games
pgn_deduplicator online.pgn --join-accuracy tolerant -o unique.pgn
System Architecture
Layer model, 9-phase pipeline, memory budget and hash algorithms of PGN Deduplicator v2.5
1. Layer Model
2. Data Flow of the 9-Phase Pipeline
3. Memory Budget (v2.4.2 → v2.5.0)
4. Hash Algorithms
5. Duplicate Classification
| Status | Method | Detection | Export? |
|---|---|---|---|
| NULL | — | Unique or master of a group | ✓ Yes |
| exact | SHA-256 move_hash | Identical move sequence | ✗ No |
| subsumption | Prefix comparison | Shorter version of a longer game | ✗ No |
| join | JW + LCS (Banded) + Indel | Similar game with notation differences | ✗ No |
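Read as a filter, the Export? column means that only rows with a NULL status reach the output. A small Python sketch, assuming each game is a dict whose duplicate_status is None for NULL; the function name `exportable` is hypothetical.

```python
def exportable(games):
    """Keep only unique games and group masters (duplicate_status is NULL)."""
    return [g for g in games if g["duplicate_status"] is None]

games = [
    {"game_id": 1, "duplicate_status": None},
    {"game_id": 2, "duplicate_status": "exact"},
    {"game_id": 3, "duplicate_status": "subsumption"},
    {"game_id": 4, "duplicate_status": "join"},
]

print([g["game_id"] for g in exportable(games)])  # [1]
```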
6. New Dependencies v2.5
| Crate | Version | Features | Purpose |
|---|---|---|---|
| rphonetic | 3.0 | embedded_bm, embedded_dm | Beider-Morse + Daitch-Mokotoff Phonetics |
| unicode-normalization | 0.1 | — | NFKD accent removal |
| redb | 2.4 | — | Embedded KV store (state management) |
| once_cell | 1.21 | — | Lazy-static for thread-safe encoders |
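The NFKD accent removal listed for unicode-normalization can be illustrated in Python, with the standard library's unicodedata module standing in for the Rust crate. The function name `strip_accents` is an assumption; the technique is to decompose each character and drop the combining marks.

```python
import unicodedata

def strip_accents(text):
    """Decompose to NFKD and drop combining marks such as diaeresis or acute."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Müller"))  # Muller
print(strip_accents("José"))    # Jose
```

Letters without a canonical decomposition, such as ø, pass through unchanged, so this step alone does not cover every transliteration case.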
Player Matching
6-stage pipeline with triple-phonetic blocking, transliteration and fuzzy matching for international player names
1. Multi-Stage Matching Pipeline
Player Consolidation (Phase 3) identifies identical players across 6 stages. The stages are ordered by ascending cost: early stages filter cheaply, later stages deliver precise results.
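The cascade idea can be sketched in Python with just two cheap stages, exact equality and name inversion. The stage list, the function names and the inversion heuristic (last word as surname when no comma is present) are assumptions; the phonetic and fuzzy stages are omitted.

```python
def split_name(name):
    """Return (surname, given) in lowercase for 'Surname, Given' or 'Given Surname'."""
    if "," in name:
        surname, given = name.split(",", 1)
    else:
        parts = name.split()
        if not parts:
            return ("", "")
        surname, given = parts[-1], " ".join(parts[:-1])
    return (surname.strip().lower(), " ".join(given.split()).lower())

def exact_match(a, b):
    return a == b

def inversion_match(a, b):
    return split_name(a) == split_name(b)

# Cheapest stage first; the cascade stops at the first stage that matches.
STAGES = [("exact", exact_match), ("inversion", inversion_match)]

def first_matching_stage(a, b):
    for stage_name, matcher in STAGES:
        if matcher(a, b):
            return stage_name
    return None

print(first_matching_stage("Carlsen, Magnus", "Carlsen, Magnus"))  # exact
print(first_matching_stage("Magnus Carlsen", "Carlsen, Magnus"))   # inversion
```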
2. Transliteration Rules
3. Phonetic Algorithms Compared
| Algorithm | Type | Strength | Example | Since |
|---|---|---|---|---|
| Double Metaphone | Phonetic | English, Spanish, French | Smith / Smyth → SM0 / XMT | v2.4 |
| Beider-Morse | Phonetic (Multi-Origin) | Eastern European, Yiddish, Slavic | Schwarzenegger → multiple codes per origin | v2.5 |
| Daitch-Mokotoff | Phonetic-Numeric | Germanic, Slavic, Yiddish | Schwarzenegger → 4-6 digit numeric codes | v2.5 |
| Jaro-Winkler | Similarity (0.0–1.0) | General, especially prefix matches | Carlsen / Karlsen → 0.93 | v2.4 |
| Damerau-Levenshtein | Edit Distance | Typos, transpositions | Fischer / Ficsher → 1 | v2.4 |
| Bigram-Dice | N-Gram Similarity | Script-independent, fallback | Kasparov / Kasparow → 0.86 | v2.5 |
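The Damerau-Levenshtein row can be checked with a short Python sketch. It implements the restricted variant (optimal string alignment), which counts an adjacent transposition as a single edit; whether the tool uses the restricted or the unrestricted variant is not stated in the source.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # An adjacent transposition counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("Fischer", "Ficsher"))  # 1
```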
4. Triple-Phonetic Index — Blocking Strategy
5. Matching Examples
| Name A | Name B | Stage | Method |
|---|---|---|---|
| Carlsen, Magnus | Carlsen, Magnus | 1 (Exact) | String equality |
| Magnus Carlsen | Carlsen, Magnus | 2 (Scid) | Inversion detection |
| Shirov, Alexei | Schirow, Alexej | 3 (Triple-Phonetic) | BM multi-origin match |
| Kramnik, V. | Kramnik, Vladimir | 4 (Variant) | Initials expansion |
| Müller, Thomas | Mueller, Thomas | 5 (Combined) | JW=0.95 → Match |
| Kasparov, Garry | Kasparow, Garri | 6 (Bigram-Dice) | Dice=0.86 → Match |
Duplicate Detection
SHA-256 Exact Dedup, Subsumption and Join Lines — three methods for detecting identical and similar chess games
1. Overview: Three Duplicate Types
2. Exact Dedup — Detail Flow
3. Subsumption — Prefix Detection
4. Join Lines — Similarity-Based Detection
5. Header Score — Master Selection
Scoring reads the dedicated columns (event, site, game_date, etc.), with no re-parsing of the original PGN.
6. Delta Pipeline (NEW v2.5)
7. Schema v2 — Column Overview
| Column Group | Columns | Type |
|---|---|---|
| Header | event, site, game_date, event_round, white_player, black_player, result | Utf8 |
| Player Meta | white_elo, black_elo, white_fide_id, black_fide_id, white_title, black_title | Int32 / Utf8 |
| Hashes | move_hash, core_game_hash | Int64 (v2.5, was Utf8) |
| Deduplication | duplicate_status, dupe_group, header_score | Utf8 / Int32 |
| IDs | white_player_id, black_player_id, game_id | UInt32 / UInt64 |
| Moves | moves, move_count | Utf8 / Int32 |
| Meta | eco, time_control, source, import_date, schema_version | Utf8 / Int32 |