Game Deduplication

How Duplicate Games Are Found

Complete overview of all deduplication phases, algorithms and configuration options

1. Deduplication Phases Overview

The system operates in 9 sequential phases:

① Ingest PGN → Parquet (Streaming, Rayon) ② FIDE Import XML → players + player_lookup ③ Player Consolidation Triple-Phonetic: DM + BM + Daitch-Mokotoff · Transliteration · Asian Names ④ Backfill FIDE IDs, Elo, Names, Hashes DEDUPLICATION ⑤ Exact ⑥ Subsumpt. ⑦ Join-Lines DeltaStateStore (state.redb) — only re-evaluate affected groups Bloom Filter · Pre-Aggregated Hashes · Banded LCS ⑧ Merge Merge headers · Single Source of Truth ⑨ Export Streaming → export.pgn 📄 games.pgn 📄 players.xml 📄 unique.pgn Parquet Files 📊 games.parquet (46 cols) 📊 players.parquet (14 cols) 📊 player_lookup.parquet 🗄️ state.redb (6 tables) 📁 chunks/ (temporary) Each phase is idempotent · Checkpoint-based · Delta-incremental (v2.5)

2. The Three Duplicate Types

Exact Duplicate duplicate_status = “exact” 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 1.e4 e5 2.Sf3 Sc6 3.Lb5 a6 UCI notation is language-independent: e2e4 e7e5 g1f3 b8c6 f1b5 a7a6 ↓ SHA-256 → move_hash (Int64) Identical hashes → Duplicate 100% identical Subsumption duplicate_status = “subsumption” 1.d4 d5 2.c4 e6 3.Nc3 Nf6 4.Bg5 Be7… 1.d4 d5 2.c4 e6 3.Nc3 Nf6 missing Shorter game’s moves are a prefix of the longer game Requires: same core_game_hash Shorter = Duplicate of Longer Prefix duplicate Join Lines duplicate_status = “join” 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 4.Ba4 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 4.Bxc6 All three algorithms must pass: ① Jaro-Winkler ≥ 0.95 ② Token-LCS (Banded) ≥ 0.95 ③ Indel Distance ≤ 4 ~98% similar

2.1 Exact Duplicates — SHA-256 Move-Hash

Two games are exact duplicates when they have identical move sequences:

  1. Converts all moves to UCI notation (e.g. e2e4 instead of e4)
  2. Computes a SHA-256 hash of the entire move sequence
  3. Games with identical hashes are exact duplicates
Game A: 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6    ← English notation
Game B: 1.e4 e5 2.Sf3 Sc6 3.Lb5 a6    ← German notation
UCI:    e2e4 e7e5 g1f3 b8c6 f1b5 a7a6   ← identical!

2.2 Subsumption Duplicates (Prefix Games)

A game is a subsumption when its moves appear exactly at the beginning of a longer game.

Subsumption: Shorter ⊂ Longer Master (40 moves): e2e4 e7e5 g1f3 b8c6 f1b5 a7a6 b5a4 g8f6 e1g1 f8e7 f1e1 b7b5 a4b3 d7d6 c2c3 e8g8 … Subsumption (20 moves): e2e4 e7e5 g1f3 b8c6 f1b5 a7a6 b5a4 g8f6 e1g1 f8e7 ⬚ missing moves (abort / transmission error) Conditions: ① same core_game_hash ② moves = prefix ③ ≥10 moves shorter ④ Partial: LCS ≥ 0.95

Partial Subsumption v2.3.1

When a player is unknown ([White "?"]), the system checks the original header instead of resolved values:

Game A: [White "Kitces, Edward"] [Black "?"] [BlackFideId "2004194"]
       1.e4 c6 2.d4 d5 ... (63 half-moves)

Game B: [White "Kitces, Edward"] [Black "?"]
       1.e4 c6 2.d4 d5 ... (75 half-moves)

→ Grouped by white_norm="kitces, edward" → Game A ⊂ Game B

2.3 Join-Lines (Fuzzy Matching)

For games with minor differences. A match requires all three algorithms to meet their thresholds:

Candidate Pairs (core_game_hash Match) ① Jaro-Winkler Character-based similarity Prefix bonus · Transposition tolerance ② Token-LCS (Banded) Longest common subsequence Band-restricted · ~40× speed (v2.5) ③ Indel Distance Insertions + Deletions missing=1 · replaced=2 · transpos.=4 AND Join-Lines Match → “join” Preset Jaro-Winkler Token-LCS Max Indel strict ≥0.98 ≥0.98 ≤2 normal ≥0.95 ≥0.95 ≤4 tolerant ≥0.92 ≥0.90 ≤8

3. Player Grouping

A critical aspect: Only games with the same player combination are compared.

The core_game_hash

core_game_hash = SHA256(white_norm + ":" + black_norm)

Players with resolved FIDE names are correctly grouped.

Handling Invalid Player Names v2.2.1

  • Invalid names (?, NN, Unknown) → unified placeholder "unknown"
  • FIDE IDs for invalid names are ignored

FIDE ID Conflicts

Source A: Filipenko, Alexander V — BlackFideId 34117881
Source B: Filipenko, Alexander V — BlackFideId 4104471
→ Fuzzy player matching resolves conflicts

4. Phonetic Matching — Triple-Phonetic Blocking v2.5

Since v2.5, the system uses three independent phonetic algorithms for player matching:

Player Name (normalized) Double Metaphone English · Spanish · French Korchnoi → KRXN / KRKN Karpov → KRPF Kasparov → KSPRF Beider-Morse Multi-Origin: Slavic · Germanic · Hebrew · Romance Tchaikovsky → bm:tSakofski, bm:tSaikov Smirnov → bm:smirnof, bm:zmirn Goldberg → bm:goltbɛrk, bm:goltpirk Daitch-Mokotoff Eastern European · Yiddish · 6-digit Schwarzenegger → dm:479465 Kovalchuk → dm:585400 Branching support Union of all candidates (deduplicated) PlayerMatchIndex (in-memory) by_metaphone: HashMap<String, Vec<PlayerId>> by_beider_morse: HashMap<String, Vec<PlayerId>> by_daitch_mokotoff: HashMap<String, Vec<PlayerId>> Blocking Efficiency O(c × k) instead of O(n²) n=30M · c~100 · k~5 Triple approach: maximum recall with high precision BM/DM codes in-memory only (PlayerMatchIndex), not stored in Parquet · ~60 MB instead of ~1 GB

Canonical FIDE Names

Before (PGN)After (Backfill)
Chilov, A…Chilov, Alexandros
Lemos, N.Lemos, Nikolaos
Carlsen, MCarlsen, Magnus

Automatic Initial Aliases v2.4.2

Canonical NameGenerated Aliases
Carlsen, MagnusCarlsen, M. / Carlsen, M
Van der Berg, Jan PeterVan der Berg, J. / Van der Berg, JP
Important: Only unambiguous aliases are added. With ambiguity, no alias is created.

5. International Name Matching v2.5

Transliteration Normalization -ow → -ov Kasparow → Kasparov -off → -ov Spassoff → Spassov -ski → -sky Polanski → Polansky tsch → ch Tschaikowski → Chaikowski -ij → -y Nabokovij → Nabokovy -ej → -ey Sergej → Sergey ew → ev Karpew → Karpev Korean Surnames 이 / Yi, Ri → Lee 김 / Gim → Kim 박 / Bak, Pak → Park 최 → Choi 정 / Jeong → Chung 권 / Gwon → Kwon Chinese Name Spacing Ding, Li Ren → Ding, Liren Wang, Yi Fan → Wang, Yifan NFKD Accent Removal Müller → Muller Éric → Eric Jiráček → Jiracek Bigram Dice Coefficient “Kasparov” → {Ka,as,sp,pa,ar,ro,ov} “Kasparow” → {Ka,as,sp,pa,ar,ro,ow} Dice=0.86 ✓ Dice(A,B) = 2 × |Bigrams(A) ∩ Bigrams(B)| / (|A| + |B|) All normalizations run BEFORE phonetic blocking Transliteration → NFKD → Asian Names → generate_name_variants() → Triple-Phonetic → Similarity-Score → Match

6. Comparison Parameters

Position and Length

ParameterDescriptionDefault
--join-ply-startFrom which half-move to compare15
--join-min-compareMinimum overlap after ply-start10
--join-diff-ratioAllowed relative length deviation0.15
--join-min-plyMinimum game length for candidates10
Why from half-move 15? — The first moves (opening) are often identical between many different games. Half-move 15 (~move 8) focuses on the middlegame and reduces false positives with popular openings.

Accuracy Presets

PresetJaro-WinklerToken-LCSMax IndelUse Case
strict≥0.98≥0.98≤2High-quality tournament data
normal≥0.95≥0.95≤4General use
tolerant≥0.92≥0.90≤8Noisy data sources

Duplicate Classification

NULL Unique / Master → Export ✓ “exact” Exact copy → No export ✗ “subsumption” Prefix of a longer game → No export ✗ “join” Fuzzy match → No export ✗ Master selection: Highest header_score wins header_score = FIDE-IDs(+30) + Elo(+20) + Event/Site(+15) + Date(+10) + ECO(+5) + Move-Length(+1–5)

8. Performance Optimizations

Streaming Architecture Batch processing for arbitrarily large DBs Chunk-based · Temporary files Parallel Processing Rayon · All CPU cores Arrow/Parquet Columnar Storage Bloom Filter Probabilistic player dedup ~9 MB instead of ~3 GB Banded LCS (v2.5) Band-restricted DP computation ~40× speedup for move comparison Pre-Aggregated Hashes (v2.5) HashMap dedup before DB writes 200K transaction chunks Precomputed FIDE Data (v2.5) Metaphones, tokens, variants Reuse instead of recomputation Memory Budget (30M+ records) v2.4.2: ~8–10 GB v2.5.0: ~2.1 GB ↓ ~75% reduction 8 GB RAM limit Limitations: ① Different player IDs → no comparison ② Strong notation differences (>8 moves) ③ Different results in short games ④ Incorrect FIDE IDs in source data

Example Workflow

# Full deduplication with FIDE data
pgn_deduplicator games.pgn --fide-xml players.xml --load-fide-data -o unique.pgn

# Strict deduplication for tournament data
pgn_deduplicator tournament.pgn --join-accuracy strict -o clean.pgn

# Tolerant deduplication for online games
pgn_deduplicator online.pgn --join-accuracy tolerant -o unique.pgn

System Architecture

Layer model, 9-phase pipeline, memory budget and hash algorithms of PGN Deduplicator v2.5

1. Layer Model

CLI (main.rs) Argument Parsing → ArrowPipelineConfig → run_arrow_pipeline Orchestration (arrow_pipeline.rs) Phase Sequencing · Checkpoint Management · DeltaStateStore Phases ingest · fide · backfill player_lookup · exact_dedup subsumption · join_lines merge · export Storage parquet_* · schema v2 delta_state (redb) bloom · compaction fide/ · join_lines/ Core config · logging memory · progress phase · stats Helper & Players similarity (JW, DL, LCS, Bigram-Dice) · regex · hashes · matching (Triple-Phonetic) · BM/DM/Transliteration 📦 games.parquet · players.parquet · player_lookup.parquet · state.redb · export.pgn
CLI
Orchestration
Phase Engines
Storage (Parquet + redb)
Core Infrastructure
Helper & Players

2. Data Flow of the 9-Phase Pipeline

Ingest PGN → Parquet (Streaming, Rayon-parallel) PGN FIDE Import XML → players.parquet + player_lookup.parquet Player Consolidation Triple-Phonetic: Double Metaphone + Beider-Morse + DM Transliteration · Asian Names · NFKD · Bigram-Dice Backfill Single-Pass: FIDE-ID, Title, Elo, Canonical Names, Hashes DEDUPLICATION (Delta-based: affected groups only) Exact Dedup SHA-256 move_hash Subsumption Prefix + Partial Join Lines JW + LCS + Indel state.redb move_hash · core_hash · dupe_groups Merge Header Merge · Single Source of Truth · PGN Variations Compaction (optional) Export Streaming → export.pgn (Headers from dedicated columns) PGN Parquet Files 📄 games.parquet (46 columns) 📄 games_N.parquet (fragments) 📄 players.parquet (14 columns) 📄 player_lookup.parquet (6 col.) 🗄️ state.redb (6 tables) 📋 checkpoints/ 📁 chunks/ (temporary) 🌸 bloom (~9 MB in-memory)

3. Memory Budget (v2.4.2 → v2.5.0)

Component v2.4.2 v2.5.0 Savings move_hash GroupBy ~1.5 GB ~750 MB 750 MB PlayerMatchIndex ~1 GB ~60 MB ~700 MB Join-Lines HashSets ~840 MB ~120 MB ~720 MB FIDE Lookup ~2.5 GB ~180 MB ~2.3 GB State Store (redb) ~200 MB (OS) New Total Peak ~8–10 GB ~2.1 GB ~6–8 GB 8 GB RAM Limit

4. Hash Algorithms

UCI Moves SHA-256 i64 (Int64) move_hash Exact Dedup: GroupBy Birthday: √2⁶⁴ ≈ 4.3×10⁹ player_ids + result SHA-256 i64 core_game_hash Subsumption + Join Lines Schema v2 (NEW v2.5): move_hash as Int64 instead of Utf8 · Backward-compatible reading · Automatic migration on compaction

5. Duplicate Classification

StatusMethodDetectionExport?
NULLUnique or master of a group✓ Yes
exactSHA-256 move_hashIdentical move sequence✗ No
subsumptionPrefix comparisonShorter version of a longer game✗ No
joinJW + LCS (Banded) + IndelSimilar game with notation differences✗ No

6. New Dependencies v2.5

CrateVersionFeaturesPurpose
rphonetic3.0embedded_bm, embedded_dmBeider-Morse + Daitch-Mokotoff Phonetics
unicode-normalization0.1NFKD accent removal
redb2.4Embedded KV store (state management)
once_cell1.21Lazy-static for thread-safe encoders

Player Matching

6-stage pipeline with triple-phonetic blocking, transliteration and fuzzy matching for international player names

1. Multi-Stage Matching Pipeline

Player Consolidation (Phase 3) identifies identical players across 6 stages. Each stage has ascending cost — early stages filter cheaply, later stages deliver precise results.

Input: Canonical Player Names Stage 0 — Name Normalization (NEW v2.5) Transliteration (Sch→Sh, tsch→ch, …) · NFKD Accent Stripping · Asian Name Patterns Korean (이→Lee, 김→Kim) · Chinese Spacing (XiJunfeng → Xi Junfeng) Stage 1 — Exact Match Normalized Name == Normalized Name → Hit MATCH Stage 2 — Scid Inversion “Carlsen, Magnus” ↔ “Magnus Carlsen” MATCH Stage 3 — Triple-Phonetic Blocking (NEW v2.5) Double Metaphone by_metaphone Index KRLSN → KRLSN Beider-Morse by_beider_morse Index Multi-Origin Codes Daitch-Mokotoff by_daitch_mokotoff Index Eastern European Names Candidate pool (union of all 3 indices) Stage 4 — Variant Matching generate_name_variants() (initials, umlaut, transliteration) MATCH Stage 5 — Combined Score JW(surname) × 0.5 + JW(first name) × 0.3 + DL(surname) × 0.2 Threshold: ≥ 0.82 → Match MATCH Stage 6 — Bigram-Dice Coefficient (NEW v2.5) Dice = 2 × |Bigrams(A) ∩ Bigrams(B)| / (|Bigrams(A)| + |Bigrams(B)|) Threshold: ≥ 0.80 → Fallback for phonetically difficult names MATCH No Match → New Player Blocking Efficiency Triple-Phonetic: O(c × k) instead of O(n²) n = 30M players c = cluster size ~100 k = candidates ~1–5 Matcher Output → Canonical name → player_id (ID-based) → white/black_fide_id PlayerMatchIndex Structure (in-memory) by_exact: HashMap<String, Vec<PlayerId>> by_metaphone: HashMap<String, Vec<PlayerId>> — Double Metaphone Codes by_beider_morse: HashMap<String, Vec<PlayerId>> — Beider-Morse Multi-Origin by_daitch_mokotoff: HashMap<String, Vec<PlayerId>> — DM Soundex Codes

2. Transliteration Rules

normalize_transliteration() Slavic / Germanic sch → sh Schirow → Shirow tsch → ch Tschechow → Chechow ew → ev Karpow → Karpov ff → f Hoffmann → Hofmann ou → u Spassouk → Spassuk ow → ov Kramnikow → Kramnikov Korean Surnames 이 → Lee 김 → Kim 박 → Park 최 → Choi 정 → Jung 이세돌 → Lee Sedol 김유진 → Kim Yujin 박지은 → Park Jieun Chinese Names XiJunfeng → Xi Junfeng Auto-Spacing NFKD Accent Removal Müller → Muller · Pérez → Perez · Čech → Cech Bigram-Dice Fallback “Kasparov” → {Ka, as, sp, pa, ar, ro, ov} — “Kasparow” → {Ka, as, sp, pa, ar, ro, ow} Dice = 2×6/14 = 0.857 ≥ 0.80 ✓

3. Phonetic Algorithms Compared

Algorithm Type Strength Example Since
Double Metaphone Phonetic English, Spanish, French Smith / Smyth → SM0 / XMT v2.4
Beider-Morse Phonetic (Multi-Origin) Eastern European, Yiddish, Slavic Schwarzenegger → multiple codes per origin v2.5
Daitch-Mokotoff Phonetic-Numeric Germanic, Slavic, Yiddish Schwarzenegger → 4-6 digit numeric codes v2.5
Jaro-Winkler Similarity (0.0–1.0) General, especially prefix matches Carlsen / Karlsen → 0.93 v2.4
Damerau-Levenshtein Edit Distance Typos, transpositions Fischer / Ficsher → 1 v2.4
Bigram-Dice N-Gram Similarity Script-independent, fallback Kasparov / Kasparow → 0.86 v2.5

4. Triple-Phonetic Index — Blocking Strategy

“Schirow, Alexei” Normalization → “Shirov, Alexei” Double Metaphone → SRF / XRF by_metaphone[“SRF”] Beider-Morse → Sirov|Shirov|Schirov (multi-origin) Daitch-Mokotoff → 490000, 494000 Candidate Pool (union, deduplicated) Each algorithm has its own HashMap in the PlayerMatchIndex. The union of results significantly reduces false negatives for culturally diverse names.

5. Matching Examples

Name AName BStageMethod
Carlsen, MagnusCarlsen, Magnus1 (Exact)String equality
Magnus CarlsenCarlsen, Magnus2 (Scid)Inversion detection
Shirov, AlexeiSchirow, Alexej3 (Triple-Phonetic)BM multi-origin match
Kramnik, V.Kramnik, Vladimir4 (Variant)Initials expansion
Müller, ThomasMueller, Thomas5 (Combined)JW=0.95 → Match
Kasparov, GarryKasparow, Garri6 (Bigram-Dice)Dice=0.86 → Match

Duplicate Detection

SHA-256 Exact Dedup, Subsumption and Join Lines — three methods for detecting identical and similar chess games

1. Overview: Three Duplicate Types

Exact Duplicate duplicate_status = “exact” 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 SHA-256(move_hash) Identical hashes → Duplicate 100% identical Subsumption duplicate_status = “subsumption” 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 1.e4 e5 2.Nf3 Nc6 missing core_game_hash match + Moves are prefix of the longer game Prefix duplicate Join Lines duplicate_status = “join” 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 1.e4 e5 2.Nf3 Nc6 3.Bb5 a5 JW ≥ 0.82 + LCS (Banded) + Indel-Score ≤ Threshold ~98% similar → Master Selection: Highest header_score wins

2. Exact Dedup — Detail Flow

UCI Moves e2e4 e7e5 g1f3 … SHA-256 move_hash Int64 (Schema v2) GroupBy move_hash Pre-Aggregated (v2.5) Chunk-Level → Global Merge Groups ≥ 2 Master = max(header_score) Remainder → “exact” Bloom Filter (v2.5) FPR ≈ 0.1% Quick test: definitely not a duplicate? state.redb (Delta) Only re-evaluate affected groups

3. Subsumption — Prefix Detection

Subsumption: Shorter Game ⊂ Longer Game Master (40 moves): e2e4 e7e5 g1f3 b8c6 f1b5 a7a6 b5a4 g8f6 e1g1 f8e7 f1e1 b7b5 a4b3 d7d6 c2c3 e8g8 … Subsumption (20 moves): e2e4 e7e5 g1f3 b8c6 f1b5 a7a6 b5a4 g8f6 e1g1 f8e7 … ⬚ missing moves (abort / transmission error) Prefix match ✓ Conditions for Subsumption: ① core_game_hash equal (same players + result) · ② Shorter moves = prefix of the longer ③ At least 10 moves shorter · ④ Partial Subsumption: LCS-Ratio ≥ 0.95 (Banded, v2.5)

4. Join Lines — Similarity-Based Detection

Candidate Pairs (core_game_hash match) ① Chunked Processing Move length similarity ≥ 80% → comparison group ② Jaro-Winkler Similarity Move-to-move JW score · Threshold ≥ 0.82 · Prefix bonus for identical openings < 0.82 → NOT similar ③ Banded LCS (NEW v2.5) Longest Common Subsequence with band width · O(n × band) instead of O(n²) ④ Indel Score Insertions + Deletions ≤ max_indel_threshold Join-Lines Match → duplicate_status = “join” Performance v2.5 Banded LCS: ~4× faster than standard LCS for n>50

5. Header Score — Master Selection

header_score — Which version becomes master? FIDE IDs present +30 points Elo rating present +20 points Event + Site details +15 points Complete date +10 points ECO code +5 points More moves (longer game) +1–5 points Time control info +3 points max() → Master
Single Source of Truth: During export, header data is read exclusively from dedicated Parquet columns (event, site, game_date, etc.) — no re-parsing of the original PGN.

6. Delta Pipeline (NEW v2.5)

Incremental Processing with redb State Store New PGN Data Ingest + Hash Pre-Aggregated GroupBy DeltaStateStore state.redb (ACID, MVCC, Zero-Copy) 6 tables · ~200 MB mmap’d Affected Groups Re-Evaluate (new groups) Skip (unchanged) redb tables: move_hashes · core_hashes · game_metadata · dupe_groups · processing_state · schema_version Idempotency Guarantee Unchanged groups are NEVER re-processed

7. Schema v2 — Column Overview

Column GroupColumnsType
Header event, site, game_date, event_round, white_player, black_player, result Utf8
Player Meta white_elo, black_elo, white_fide_id, black_fide_id, white_title, black_title Int32 / Utf8
Hashes move_hash, core_game_hash Int64 (v2.5, was Utf8)
Deduplication duplicate_status, dupe_group, header_score Utf8 / Int32
IDs white_player_id, black_player_id, game_id UInt32 / UInt64
Moves moves, move_count Utf8 / Int32
Meta eco, time_control, source, import_date, schema_version Utf8 / Int32

Views: 2041

Scroll to Top