Game Deduplication

How Duplicate Games Are Found

Complete overview of all deduplication phases, algorithms and configuration options

Deduplication Phases Overview
The Three Duplicate Types
Player Grouping
Phonetic Matching (Triple-Phonetic)
International Name Matching
Comparison Parameters & Accuracy Presets
Header Merging & Export
Performance Optimizations

1. Deduplication Phases Overview

The system operates in 9 sequential phases.

2. The Three Duplicate Types

2.1 Exact Duplicates — SHA-256 Move-Hash

Two games are exact duplicates when they have identical move sequences. The system detects them as follows:

All moves are converted to UCI notation (e.g. e2e4 instead of e4)
A SHA-256 hash is computed over the entire move sequence
Games with identical hashes are exact duplicates

Game A: 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6    ← English notation
Game B: 1.e4 e5 2.Sf3 Sc6 3.Lb5 a6    ← German notation
UCI:    e2e4 e7e5 g1f3 b8c6 f1b5 a7a6   ← identical!
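
The hashing step can be sketched in a few lines of Python. This is a minimal illustration, assuming the moves are already in UCI notation and are joined with single spaces before hashing; the function name `move_hash` and the separator are assumptions, not the tool's actual implementation.

```python
import hashlib

def move_hash(uci_moves):
    """Return the SHA-256 hex digest of a UCI move sequence."""
    # Assumption: moves are joined with single spaces before hashing.
    canonical = " ".join(uci_moves)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Both games from the example above reduce to the same UCI sequence.
game_a = ["e2e4", "e7e5", "g1f3", "b8c6", "f1b5", "a7a6"]
game_b = ["e2e4", "e7e5", "g1f3", "b8c6", "f1b5", "a7a6"]
print(move_hash(game_a) == move_hash(game_b))  # True
```

Because the hash is computed after conversion to UCI, differences in piece letters (N vs. S, B vs. L) never reach the hash input.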

2.2 Subsumption Duplicates (Prefix Games)

A game is a subsumption duplicate when its move sequence is an exact prefix of a longer game.
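
The prefix test itself is simple. The following Python sketch assumes both games are available as lists of UCI moves; the function name `is_subsumed` is hypothetical, and treating equal-length games as exact duplicates rather than subsumptions is an assumption.

```python
def is_subsumed(shorter, longer):
    """True when `shorter` is a strict prefix of `longer`."""
    # Equal-length identical games are exact duplicates, not subsumptions.
    return len(shorter) < len(longer) and longer[:len(shorter)] == shorter

short_game = ["e2e4", "c7c6", "d2d4", "d7d5"]
long_game = ["e2e4", "c7c6", "d2d4", "d7d5", "b1c3", "d5e4"]
print(is_subsumed(short_game, long_game))  # True
print(is_subsumed(long_game, short_game))  # False
```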

Partial Subsumption v2.3.1

When a player is unknown (for example [Black "?"]), the system checks the original header instead of resolved values and groups the games by the known player:

Game A: [White "Kitces, Edward"] [Black "?"] [BlackFideId "2004194"]
       1.e4 c6 2.d4 d5 ... (63 half-moves)

Game B: [White "Kitces, Edward"] [Black "?"]
       1.e4 c6 2.d4 d5 ... (75 half-moves)

→ Grouped by white_norm="kitces, edward" → Game A ⊂ Game B

2.3 Join-Lines (Fuzzy Matching)

Join-Lines handle games with minor differences. A match requires all three measures (Jaro-Winkler, Token-LCS and indel count) to meet the thresholds listed under Accuracy Presets.

3. Player Grouping

A critical aspect: Only games with the same player combination are compared.

The core_game_hash

core_game_hash = SHA256(white_norm + ":" + black_norm)

Because the hash is built from normalized names, games whose players were resolved to their FIDE names fall into the same group.
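
As an illustration, the grouping could look like the Python sketch below. It assumes each game is a dict with `white_norm` and `black_norm` keys and keeps the hash as a hex string; the helper `group_by_players` is hypothetical, and the storage format in the tool may differ (the schema lists the hash columns as Int64 since v2.5).

```python
import hashlib

def core_game_hash(white_norm, black_norm):
    """SHA-256 over 'white_norm:black_norm', as in the formula above."""
    key = f"{white_norm}:{black_norm}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def group_by_players(games):
    """Bucket games so that only equal player pairings share a group."""
    groups = {}
    for game in games:
        h = core_game_hash(game["white_norm"], game["black_norm"])
        groups.setdefault(h, []).append(game)
    return groups

games = [
    {"id": 1, "white_norm": "carlsen, magnus", "black_norm": "anand, viswanathan"},
    {"id": 2, "white_norm": "carlsen, magnus", "black_norm": "anand, viswanathan"},
    {"id": 3, "white_norm": "anand, viswanathan", "black_norm": "carlsen, magnus"},
]
groups = group_by_players(games)
print(len(groups))  # 2
```

Games 1 and 2 share a group, while game 3 has the colors reversed and therefore hashes differently.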

Handling Invalid Player Names v2.2.1

Invalid names (?, NN, Unknown) → unified placeholder "unknown"
FIDE IDs for invalid names are ignored
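
A minimal Python sketch of this rule, assuming names are compared case-insensitively after trimming whitespace; the function name `normalize_player` and the tuple return value are illustrative choices.

```python
# Names treated as invalid, compared in lowercase.
INVALID_NAMES = {"?", "nn", "unknown"}

def normalize_player(name, fide_id=None):
    """Return (normalized_name, fide_id); invalid names lose their FIDE ID."""
    norm = name.strip().lower()
    if norm in INVALID_NAMES:
        # Unified placeholder; any FIDE ID attached to it is ignored.
        return "unknown", None
    return norm, fide_id

print(normalize_player("?", "2004194"))          # ('unknown', None)
print(normalize_player("Kitces, Edward", None))  # ('kitces, edward', None)
```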

FIDE ID Conflicts

Source A: Filipenko, Alexander V — BlackFideId 34117881
Source B: Filipenko, Alexander V — BlackFideId 4104471
→ Fuzzy player matching resolves conflicts

4. Phonetic Matching — Triple-Phonetic Blocking v2.5

Since v2.5, the system uses three independent phonetic algorithms for player matching: Double Metaphone, Beider-Morse and Daitch-Mokotoff.

Canonical FIDE Names

Before (PGN)	After (Backfill)
Chilov, A…	Chilov, Alexandros
Lemos, N.	Lemos, Nikolaos
Carlsen, M	Carlsen, Magnus

Automatic Initial Aliases v2.4.2

Canonical Name	Generated Aliases
Carlsen, Magnus	Carlsen, M. / Carlsen, M
Van der Berg, Jan Peter	Van der Berg, J. / Van der Berg, JP

Important: Only unambiguous aliases are added. If an alias could refer to more than one player, it is not created.
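
The alias rule can be sketched in Python as follows. The generation pattern (first initial with a period, plus all initials concatenated) is inferred from the two table rows and is an assumption; the function names are hypothetical, and "Lemos, Nikos" is a made-up name used only to show an ambiguous case. Names are assumed to be in "Surname, Given" form.

```python
from collections import defaultdict

def initial_aliases(canonical):
    """Generate initial-based aliases for a 'Surname, Given' name."""
    surname, given = [part.strip() for part in canonical.split(",", 1)]
    initials = [word[0] for word in given.split()]
    return {f"{surname}, {initials[0]}.", f"{surname}, {''.join(initials)}"}

def build_alias_index(canonical_names):
    """Map alias -> canonical name, keeping only unambiguous aliases."""
    owners = defaultdict(set)
    for name in canonical_names:
        for alias in initial_aliases(name):
            owners[alias].add(name)
    return {a: next(iter(n)) for a, n in owners.items() if len(n) == 1}

names = ["Carlsen, Magnus", "Van der Berg, Jan Peter",
         "Lemos, Nikolaos", "Lemos, Nikos"]
index = build_alias_index(names)
print(index["Carlsen, M."])       # Carlsen, Magnus
print(index["Van der Berg, JP"])  # Van der Berg, Jan Peter
print("Lemos, N." in index)       # False
```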

5. International Name Matching v2.5

6. Comparison Parameters

Position and Length

Parameter	Description	Default
`--join-ply-start`	From which half-move to compare	15
`--join-min-compare`	Minimum overlap after ply-start	10
`--join-diff-ratio`	Allowed relative length deviation	0.15
`--join-min-ply`	Minimum game length for candidates	10

Why start at half-move 15? The opening moves are often identical across many different games. Starting the comparison at half-move 15 (around move 8) shifts it toward the middlegame and reduces false positives caused by popular openings.
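
How the four parameters could interact is sketched below in Python. The exact semantics are assumptions: half-moves are counted from 1, the length deviation is measured relative to the longer game, and the function and class names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JoinParams:
    ply_start: int = 15       # --join-ply-start
    min_compare: int = 10     # --join-min-compare
    diff_ratio: float = 0.15  # --join-diff-ratio
    min_ply: int = 10         # --join-min-ply

def comparison_window(moves_a, moves_b, params=JoinParams()):
    """Return the move tails to compare, or None if the pair is rejected."""
    len_a, len_b = len(moves_a), len(moves_b)
    if min(len_a, len_b) < params.min_ply:
        return None  # too short to be a candidate
    if abs(len_a - len_b) / max(len_a, len_b) > params.diff_ratio:
        return None  # lengths differ too much
    start = params.ply_start - 1  # assumption: ply_start is 1-based
    tail_a, tail_b = moves_a[start:], moves_b[start:]
    if min(len(tail_a), len(tail_b)) < params.min_compare:
        return None  # not enough overlap after the opening
    return tail_a, tail_b

game = [f"m{i}" for i in range(40)]
window = comparison_window(game, game[:38])
print(len(window[0]), len(window[1]))     # 26 24
print(comparison_window(game, game[:20])) # None
```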

Accuracy Presets

Preset	Jaro-Winkler	Token-LCS	Max Indel	Use Case
strict	≥0.98	≥0.98	≤2	High-quality tournament data
normal	≥0.95	≥0.95	≤4	General use
tolerant	≥0.92	≥0.90	≤8	Noisy data sources
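
The presets can be read as a simple threshold table. The Python sketch below assumes the three scores have already been computed elsewhere; the dictionary layout and the name `is_join_match` are illustrative.

```python
# Thresholds copied from the preset table above.
PRESETS = {
    "strict":   {"jaro_winkler": 0.98, "token_lcs": 0.98, "max_indel": 2},
    "normal":   {"jaro_winkler": 0.95, "token_lcs": 0.95, "max_indel": 4},
    "tolerant": {"jaro_winkler": 0.92, "token_lcs": 0.90, "max_indel": 8},
}

def is_join_match(jw, lcs, indel, preset):
    """All three criteria must hold for the pair to count as a join duplicate."""
    t = PRESETS[preset]
    return (jw >= t["jaro_winkler"]
            and lcs >= t["token_lcs"]
            and indel <= t["max_indel"])

print(is_join_match(0.96, 0.96, 3, "normal"))  # True
print(is_join_match(0.96, 0.96, 3, "strict"))  # False
```

A single failing criterion is enough to reject the pair, which is why the stricter presets produce fewer matches on the same data.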

Duplicate Classification

7. Header Merging & Export

Detected duplicates are merged according to the following rules:

Higher quality preferred: Complete data over placeholders
FIDE data prioritized: Elo ratings, titles and FIDE IDs from the official database
All sources documented: The Source header shows the origin
Merged tag: Merged games are marked with [Merged "true"]
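
The "complete data over placeholders" rule can be illustrated with a small Python sketch. It is not the tool's scoring logic: it simply takes the first non-placeholder value per field, and the function name, field list and sample data are all hypothetical.

```python
# Assumption: standard PGN placeholders count as "missing" values.
PLACEHOLDERS = {"", "?", "????.??.??"}

def merge_headers(games, fields=("event", "site", "game_date")):
    """Merge a duplicate group into one header dict (illustrative rule only)."""
    merged = {}
    for field in fields:
        # Take the first value that is not a placeholder, else keep "?".
        merged[field] = next(
            (g[field] for g in games if g.get(field, "") not in PLACEHOLDERS),
            "?",
        )
    # Document every contributing source, in a stable order.
    merged["source"] = "; ".join(sorted({g["source"] for g in games}))
    merged["merged"] = "true"
    return merged

group = [
    {"event": "?", "site": "Oslo", "game_date": "????.??.??", "source": "a.pgn"},
    {"event": "Olympiad", "site": "?", "game_date": "2014.08.02", "source": "b.pgn"},
]
print(merge_headers(group))
```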

Single Source of Truth v2.4.0

The merge phase writes the best values directly into dedicated Parquet columns:

Column	Description
`event`	Best event name from duplicate group
`site`	Best venue
`game_date`	Best date
`white_elo` / `black_elo`	Best Elo ratings
`white_title` / `black_title`	Best titles
`eco`	Best ECO classification
`time_control`	Best time control information

Benefit: Export reads only from dedicated columns — no JSON parsing, no double processing.

Consistent Player Names v2.4.0

Three-tier priority:

Canonical name (canonical_name from SSP/FIDE)
Original name (white_player/black_player)
Normalized name (white_norm/black_norm) — lowercase fallback
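
The three-tier priority amounts to a fallback chain. A minimal Python sketch, assuming missing values are represented as None or empty strings; the function name `export_name` and the final "?" fallback are assumptions.

```python
def export_name(canonical_name, original_name, norm_name):
    """Pick the best available player name for export."""
    for candidate in (canonical_name, original_name, norm_name):
        if candidate:  # skips None and empty strings
            return candidate
    return "?"  # assumption: PGN placeholder when nothing is known

print(export_name("Carlsen, Magnus", "Carlsen, M", "carlsen, m"))  # Carlsen, Magnus
print(export_name(None, "Carlsen, M", "carlsen, m"))               # Carlsen, M
print(export_name(None, "", "carlsen, m"))                         # carlsen, m
```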

Merged Tag for Variants v2.4.0

[Event "Olympiad"]
[White "Carlsen, Magnus"]
[Black "Anand, Viswanathan"]
[Merged "true"]

1.e4 e5 2.Nf3 Nc6 (2...Nf6 {Variant from duplicate}) 3.Bb5 *

8. Performance Optimizations

Example Workflow

# Full deduplication with FIDE data
pgn_deduplicator games.pgn --fide-xml players.xml --load-fide-data -o unique.pgn

# Strict deduplication for tournament data
pgn_deduplicator tournament.pgn --join-accuracy strict -o clean.pgn

# Tolerant deduplication for online games
pgn_deduplicator online.pgn --join-accuracy tolerant -o unique.pgn

System Architecture

Layer model, 9-phase pipeline, memory budget and hash algorithms of PGN Deduplicator v2.5

Layer Model
Data Flow of the 9-Phase Pipeline
Memory Budget (v2.4.2 → v2.5.0)
Hash Algorithms
Duplicate Classification
New Dependencies v2.5

1. Layer Model

CLI
Orchestration
Phase Engines
Storage (Parquet + redb)
Core Infrastructure
Helper & Players

2. Data Flow of the 9-Phase Pipeline

3. Memory Budget (v2.4.2 → v2.5.0)

4. Hash Algorithms

5. Duplicate Classification

Status	Method	Detection	Export?
NULL	—	Unique or master of a group	✓ Yes
exact	SHA-256 move_hash	Identical move sequence	✗ No
subsumption	Prefix comparison	Shorter version of a longer game	✗ No
join	JW + LCS (Banded) + Indel	Similar game with notation differences	✗ No
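
Read as an export rule, the table says that only rows without a duplicate status are written out. A short Python sketch, assuming each game is a dict whose `duplicate_status` is None for unique games and group masters; the function name `exportable` is hypothetical.

```python
def exportable(games):
    """Keep only unique games and group masters (duplicate_status is NULL)."""
    return [g for g in games if g["duplicate_status"] is None]

games = [
    {"game_id": 1, "duplicate_status": None},
    {"game_id": 2, "duplicate_status": "exact"},
    {"game_id": 3, "duplicate_status": "subsumption"},
    {"game_id": 4, "duplicate_status": "join"},
]
print([g["game_id"] for g in exportable(games)])  # [1]
```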

6. New Dependencies v2.5

Crate	Version	Features	Purpose
rphonetic	3.0	embedded_bm, embedded_dm	Beider-Morse + Daitch-Mokotoff Phonetics
unicode-normalization	0.1	—	NFKD accent removal
redb	2.4	—	Embedded KV store (state management)
once_cell	1.21	—	Lazy-static for thread-safe encoders
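
To illustrate what NFKD accent removal does, here is the same idea in Python using the standard `unicodedata` module. It is only an analogue of the technique, not the Rust code path; the function name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Müller"))  # Muller
print(strip_accents("Jiménez")) # Jimenez
```

Accent removal alone maps "Müller" to "Muller", not "Mueller"; language-specific spellings of that kind need separate transliteration rules.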

Player Matching

6-stage pipeline with triple-phonetic blocking, transliteration and fuzzy matching for international player names

Multi-Stage Matching Pipeline
Transliteration Rules
Phonetic Algorithms Compared
Triple-Phonetic Index — Blocking Strategy
Matching Examples

1. Multi-Stage Matching Pipeline

Player Consolidation (Phase 3) identifies identical players in 6 stages. The stages are ordered by ascending cost: early stages filter cheaply, later stages deliver precise results.

2. Transliteration Rules

3. Phonetic Algorithms Compared

Algorithm	Type	Strength	Example	Since
Double Metaphone	Phonetic	English, Spanish, French	Smith / Smyth → SM0 / XMT	v2.4
Beider-Morse	Phonetic (Multi-Origin)	Eastern European, Yiddish, Slavic	Schwarzenegger → multiple codes per origin	v2.5
Daitch-Mokotoff	Phonetic-Numeric	Germanic, Slavic, Yiddish	Schwarzenegger → 4-6 digit numeric codes	v2.5
Jaro-Winkler	Similarity (0.0–1.0)	General, especially prefix matches	Carlsen / Karlsen → 0.93	v2.4
Damerau-Levenshtein	Edit Distance	Typos, transpositions	Fischer / Ficsher → 1	v2.4
Bigram-Dice	N-Gram Similarity	Script-independent, fallback	Kasparov / Kasparow → 0.86	v2.5
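
Two of the measures in the table are small enough to reproduce. The Python sketch below implements a Bigram-Dice coefficient and the optimal string alignment variant of Damerau-Levenshtein; these are textbook formulations and may differ in detail (case handling, padding) from the tool's implementation.

```python
def bigram_dice(a, b):
    """Dice coefficient over character bigrams, case-insensitive."""
    a, b = a.lower(), b.lower()
    grams_a = [a[i:i + 2] for i in range(len(a) - 1)]
    grams_b = [b[i:i + 2] for i in range(len(b) - 1)]
    if not grams_a or not grams_b:
        return 1.0 if a == b else 0.0
    remaining = list(grams_b)
    shared = 0
    for gram in grams_a:
        if gram in remaining:  # count each bigram occurrence at most once
            remaining.remove(gram)
            shared += 1
    return 2 * shared / (len(grams_a) + len(grams_b))

def damerau_levenshtein(a, b):
    """Optimal string alignment distance: adjacent transpositions cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]

print(round(bigram_dice("Kasparov", "Kasparow"), 2))  # 0.86
print(damerau_levenshtein("Fischer", "Ficsher"))      # 1
```

"Kasparov" and "Kasparow" share 6 of 7 bigrams each, giving 12/14 ≈ 0.86, which matches the table entry.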

4. Triple-Phonetic Index — Blocking Strategy

5. Matching Examples

Name A	Name B	Stage	Method
Carlsen, Magnus	Carlsen, Magnus	1 (Exact)	String equality
Magnus Carlsen	Carlsen, Magnus	2 (Scid)	Inversion detection
Shirov, Alexei	Schirow, Alexej	3 (Triple-Phonetic)	BM multi-origin match
Kramnik, V.	Kramnik, Vladimir	4 (Variant)	Initials expansion
Müller, Thomas	Mueller, Thomas	5 (Combined)	JW=0.95 → Match
Kasparov, Garry	Kasparow, Garri	6 (Bigram-Dice)	Dice=0.86 → Match
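
The cascade idea (cheap checks first, expensive ones later) can be illustrated with the three string-only stages from the table. This Python sketch covers exact matching, inversion detection and initials expansion; the phonetic and similarity stages are omitted, and all function names and stage labels are hypothetical.

```python
def invert_name(name):
    """Turn 'Given Surname' into 'Surname, Given'; leave other forms unchanged."""
    if "," in name:
        return name
    parts = name.split()
    if len(parts) < 2:
        return name
    return f"{parts[-1]}, {' '.join(parts[:-1])}"

def initials_match(a, b):
    """True when surnames match and one given name is an initial of the other."""
    sur_a, _, giv_a = a.partition(",")
    sur_b, _, giv_b = b.partition(",")
    giv_a, giv_b = giv_a.strip().rstrip("."), giv_b.strip().rstrip(".")
    if sur_a.strip().lower() != sur_b.strip().lower() or not giv_a or not giv_b:
        return False
    short, long_ = sorted((giv_a.lower(), giv_b.lower()), key=len)
    return len(short) == 1 and long_.startswith(short)

def match_stage(a, b):
    """Return the first (cheapest) stage that matches, or None."""
    if a == b:
        return "exact"
    if invert_name(a) == invert_name(b):
        return "inversion"
    if initials_match(a, b):
        return "initials"
    return None  # would fall through to phonetic and fuzzy stages

print(match_stage("Carlsen, Magnus", "Carlsen, Magnus"))  # exact
print(match_stage("Magnus Carlsen", "Carlsen, Magnus"))   # inversion
print(match_stage("Kramnik, V.", "Kramnik, Vladimir"))    # initials
print(match_stage("Shirov, Alexei", "Schirow, Alexej"))   # None
```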

Duplicate Detection

SHA-256 Exact Dedup, Subsumption and Join Lines — three methods for detecting identical and similar chess games

Overview: Three Duplicate Types
Exact Dedup — Detail Flow
Subsumption — Prefix Detection
Join Lines — Similarity-Based Detection
Header Score — Master Selection
Delta Pipeline (NEW v2.5)
Schema v2 — Column Overview

1. Overview: Three Duplicate Types

2. Exact Dedup — Detail Flow

3. Subsumption — Prefix Detection

4. Join Lines — Similarity-Based Detection

5. Header Score — Master Selection

Single Source of Truth: During export, header data is read exclusively from dedicated Parquet columns (event, site, game_date, etc.) — no re-parsing of the original PGN.

6. Delta Pipeline (NEW v2.5)

7. Schema v2 — Column Overview

Column Group	Columns	Type
Header	event, site, game_date, event_round, white_player, black_player, result	Utf8
Player Meta	white_elo, black_elo, white_fide_id, black_fide_id, white_title, black_title	Int32 / Utf8
Hashes	move_hash, core_game_hash	Int64 (v2.5, was Utf8)
Deduplication	duplicate_status, dupe_group, header_score	Utf8 / Int32
IDs	white_player_id, black_player_id, game_id	UInt32 / UInt64
Moves	moves, move_count	Utf8 / Int32
Meta	eco, time_control, source, import_date, schema_version	Utf8 / Int32
