The PGN Deduplicator: Systematic Cleanup of Chess Games
I was recently informed by a user that Chessbase, the database program they use, still found several hundred thousand duplicates despite prior cleaning with Scid. For this reason, I developed a script that systematically deduplicates large PGN files in multiple phases. On this page, I describe how the script works. It is written in Python and leverages the PostgreSQL database for efficient storage and processing of the games.
The key phases of the process:
Import (Phase 1)
This phase initiates processing with the source PGN file. I first check whether smaller "chunks" (parts) of the PGN file already exist in a dedicated `TEMP` directory. If present, I reuse them to speed up processing. If no chunks exist and a large PGN file is provided, I use pgn-extract to split the input file into manageable segments. From each game in these chunks, I extract essential information such as players, event, site, date, and the complete move sequence. During this process, I automatically remove Chessbase analysis/EVAL comments (`[%evp...]`) and normalize blank lines to ensure a clean and consistent PGN structure for the database. Subsequently, for each game and its core details (e.g., player names), I calculate unique "hashes" (like digital fingerprints) that enable fast comparisons in later deduplication phases. The structured game data is then loaded efficiently into the PostgreSQL database using the `COPY` command. Before the actual deduplication begins, I load FIDE player data (FIDE IDs and names) from a separate database table into memory. This information is crucial for evaluating header quality and for more precise matching of player names during header merging.
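To illustrate the cleanup and fingerprinting steps, here is a minimal sketch; the function names, the hash inputs, and the exact comment pattern are my own assumptions, not the script's actual code:

```python
import hashlib
import re

# Chessbase evaluation comments of the form {[%evp ...]} (pattern is an assumption)
EVP_RE = re.compile(r"\{\[%evp[^}]*\]\}")

def clean_pgn(text: str) -> str:
    """Strip [%evp ...] evaluation comments and collapse runs of blank lines."""
    text = EVP_RE.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()

def game_hash(white: str, black: str, moves_san: str) -> str:
    """Fingerprint of the full game: players plus the complete move sequence."""
    key = f"{white.lower()}|{black.lower()}|{moves_san}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def core_game_hash(white: str, black: str) -> str:
    """Coarser fingerprint based on player names only, used for group formation."""
    key = f"{white.lower()}|{black.lower()}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```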
Exact Deduplication (Phase 2)
In this phase, I identify and group games that are exact duplicates based on their unique game hash. Within each group of exact duplicates, I intelligently select the “best” version of the game (typically the one with the most complete headers or the longest PGN text) to serve as the “master” record. I then optimize the headers of this master game by merging relevant information from all exact duplicates to ensure the most comprehensive metadata for the master game. All identified exact duplicates are marked accordingly in the database to prevent them from being processed again as unique games.
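As a minimal sketch, grouping by game hash and marking everything but the master could look like this in PostgreSQL; the table and column names are illustrative assumptions, and the master is chosen here purely by the longest PGN text:

```python
import psycopg2

# Find groups of games sharing a game_hash; order the members so the longest
# PGN text comes first and becomes the master. Schema names are assumptions.
GROUP_SQL = """
SELECT game_hash, array_agg(id ORDER BY length(pgn_text) DESC) AS ids
FROM games
WHERE duplicate_status IS NULL
GROUP BY game_hash
HAVING count(*) > 1;
"""

with psycopg2.connect("dbname=chess") as conn, conn.cursor() as cur:
    cur.execute(GROUP_SQL)
    for _game_hash, ids in cur.fetchall():
        master_id, duplicate_ids = ids[0], ids[1:]
        # Mark all non-master group members as exact duplicates of the master.
        cur.execute(
            "UPDATE games SET duplicate_status = 'exact', master_id = %s "
            "WHERE id = ANY(%s)",
            (master_id, duplicate_ids),
        )
```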
Subsumption Deduplication (Phase 3)
This phase targets games where one game is a complete subset of another (e.g., a shorter game contained within a longer version of the same game), a common scenario for partial games or analysis snippets. For this, I use a "Core Game Hash" (based on player names) to group potentially subsumed duplicates. Within these groups, I check whether the move sequence of one game is a perfect prefix of another's, and additionally examine the similarity of the player names using algorithms like the Jaro-Winkler distance (which measures string similarity). If a game is found to be subsumed, I mark it in the database as a "subsumed duplicate," and its header information can contribute to enriching the master game.
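A minimal sketch of such a subsumption check, assuming a recent version of the jellyfish library for Jaro-Winkler similarity; the threshold and the data layout are illustrative, not the script's actual values:

```python
import jellyfish  # assumed string-similarity library

JW_THRESHOLD = 0.9  # illustrative similarity threshold

def is_subsumed(short: dict, long: dict) -> bool:
    """Return True if `short` is a strict move-prefix of `long` with matching
    players. Both arguments are dicts with 'moves_san', 'white', 'black' keys."""
    if short["moves_san"] == long["moves_san"]:
        return False  # identical games are handled by exact deduplication
    if not long["moves_san"].startswith(short["moves_san"]):
        return False
    return (
        jellyfish.jaro_winkler_similarity(short["white"], long["white"]) >= JW_THRESHOLD
        and jellyfish.jaro_winkler_similarity(short["black"], long["black"]) >= JW_THRESHOLD
    )
```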
FEN Recalculation (Optional Pre-Phase for Textual Fuzzy Deduplication)
If requested, I re-analyze all games in the database in this phase to ensure that their "End-FEN" (the unique description of the board's final position) is correctly calculated and stored. This recalculation is useful for consistency and can serve as an important pre-filter for the subsequent textual fuzzy deduplication, since only games that share the same end position then need to be compared against each other.
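Such a recalculation can be sketched with the python-chess library (the script's actual implementation may differ):

```python
import io
import chess.pgn

def end_fen(pgn_text: str):
    """Replay a game's mainline and return the FEN of the final position."""
    game = chess.pgn.read_game(io.StringIO(pgn_text))
    if game is None:
        return None  # unparseable game
    return game.end().board().fen()
```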
Textual Fuzzy Deduplication (Phase 4)
This is an optional, more advanced phase for detecting duplicates that are not exact matches but are very similar in their move sequences due to minor variations (e.g., typos, different notation for the same move) or similar end positions. For this, I utilize string similarity algorithms such as the Levenshtein distance (counts the minimum edit operations required to transform one string into another) and the Jaro-Winkler distance (a weighted measure of similarity, particularly effective for short strings or those with common prefixes). I group games by their Core Game Hash (players) and can optionally pre-filter them by their End-Board State Hash (FEN – Forsyth-Edwards Notation) to accelerate checks. If games meet my defined similarity thresholds for Levenshtein and Jaro-Winkler, I mark them as “textual fuzzy duplicates.” The headers of the master game in these fuzzy groups are also optimized by intelligently merging relevant information from all found fuzzy duplicates, based on the quality score of the header values and leveraging FIDE player data for improved player name resolution.
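A minimal sketch of such a combined similarity test, again assuming the jellyfish library; the thresholds are illustrative placeholders, not the script's configured values:

```python
import jellyfish  # assumed string-similarity library

LEV_MAX_RATIO = 0.05  # max edit distance relative to move-text length (assumption)
JW_MIN = 0.95         # min Jaro-Winkler similarity for the move text (assumption)

def is_textual_fuzzy_duplicate(moves_a: str, moves_b: str) -> bool:
    """Heuristic similarity test on two SAN move sequences."""
    max_len = max(len(moves_a), len(moves_b))
    if max_len == 0:
        return False
    # Cheap length filter first: the edit distance is at least the length
    # difference, so very different lengths can never pass the threshold.
    if abs(len(moves_a) - len(moves_b)) / max_len > LEV_MAX_RATIO:
        return False
    if jellyfish.levenshtein_distance(moves_a, moves_b) / max_len > LEV_MAX_RATIO:
        return False
    return jellyfish.jaro_winkler_similarity(moves_a, moves_b) >= JW_MIN
```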
Export (Phase 5)
Finally, I can export the processed games back into PGN files. I can export only the unique master games (with their merged and optimized headers) into a dedicated file. Here, games can optionally be filtered based on a minimum header quality score to ensure that only games with sufficiently complete metadata are exported. Optionally, I can also export all games that are part of a duplicate group with more than one game (both the master games and their duplicates) into a separate file. For duplicates, I adjust their header information (e.g., the “Site” and “Event” fields) to indicate their specific duplicate status and the group they belong to. For the master games within such groups, I set the “Event” tag to “MASTER_OF_DUPLICATE_GROUP,” which is crucial for verifying potential errors in duplicate detection.
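As a sketch of the header adjustments described above (the key names and the exact tag wording for the duplicates are assumptions; only the MASTER_OF_DUPLICATE_GROUP tag is taken from the script's behavior):

```python
def tag_duplicate_group(games: list) -> list:
    """Adjust the headers of one duplicate group before export.

    `games` is a list of dicts with 'id' and 'headers' keys; the first
    entry is the master."""
    master, duplicates = games[0], games[1:]
    group_id = master["id"]
    master["headers"]["Event"] = "MASTER_OF_DUPLICATE_GROUP"
    for dup in duplicates:
        # Record duplicate status and group membership in the Event/Site tags.
        dup["headers"]["Event"] = f"DUPLICATE_OF_GROUP_{group_id}"
        dup["headers"]["Site"] = f"duplicate of master game {group_id}"
    return games
```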
Throughout all these phases, I employ a multiprocessing approach that runs multiple “workers” in parallel to significantly accelerate the processing of large game collections. Robust logging is also integrated to track progress and report any issues. Nevertheless, my script still requires approximately 10 hours to complete the process for about 10 million games.
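The fan-out itself can be sketched with Python's multiprocessing module; the per-chunk worker below is a trivial stand-in (it only counts games), whereas the real worker performs the parsing, cleaning, hashing, and database loading described above:

```python
from multiprocessing import Pool

def process_chunk(chunk_path: str) -> int:
    """Stand-in worker: count the games in one PGN chunk."""
    with open(chunk_path, encoding="utf-8", errors="replace") as f:
        return sum(line.startswith("[Event ") for line in f)

def run_import(chunk_paths: list, workers: int = 8) -> int:
    """Distribute the chunks across a pool of worker processes."""
    # Note: on platforms using the "spawn" start method, call this from
    # inside an `if __name__ == "__main__":` guard.
    with Pool(processes=workers) as pool:
        return sum(pool.imap_unordered(process_chunk, chunk_paths))
```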
Estimation of Comparison Operations in the PGN Deduplicator
Below, I estimate the approximate number of comparisons required.
The performance of the PGN Deduplicator script is significantly influenced by the number of internal comparison operations performed. The following analysis provides an approximate estimate of these comparisons for a dataset of 10 million games, under specific assumptions regarding duplicate distribution and without active END-FEN pre-filtering.
For this calculation, I make the following assumptions:
- Total number of games: 10,000,000
- Total number of duplicates: 500,000, split into:
  - Exact duplicates: 480,000
  - Subsumption duplicates: 2,000
  - Textual fuzzy duplicates: 18,000
- Average group size: Duplicates typically occur in groups of 2 to 5 games. For the majority of groups with unique hashes (e.g., `game_hash`, `core_game_hash`), it is assumed that they contain only a single game.
- FIDE data: The use of FIDE player data can reduce the need for computationally intensive Jaro-Winkler/Levenshtein comparisons for player names if a unique FIDE ID assignment is found. The estimation assumes that a portion of the player names (10%) still requires fuzzy checking.
- Iterations: I assume that the deduplication process converges in two iterations, with the first iteration finding the majority of duplicates.
- Disabled END-FEN pre-filtering: The optional FEN recalculation and the associated pre-filtering in the textual fuzzy deduplication phase are not used. This means that the `core_game_hash` groups in Phase 4 can potentially be larger, as no additional pre-filtering based on the end position is performed.
The estimation focuses on string distance calculations (Levenshtein and Jaro-Winkler distances), as these algorithms represent the most computationally intensive operations compared to simple hash or length comparisons.
Analysis of the Phases:
Import (Initialization)
This phase primarily deals with parsing PGN files, calculating hashes (such as `game_hash`, `core_game_hash`, `end_fen_hash`, `partial_move_hashes`), and loading the data into the PostgreSQL database. In this phase, no direct game-to-game comparisons for duplicate detection are performed; the operations are primarily database- and hash-based.
Exact Deduplication (approx. 480,000 duplicates)
- Goal: Identification of games with an exactly identical `game_hash`.
- Group formation: With 480,000 exact duplicates, it is assumed that these result in approximately 480,000 / 2 = 240,000 duplicate groups, each consisting of one master game and at least one duplicate.
- Comparisons: Within each of these groups, header optimization is performed, which involves comparisons and score calculations for various header tags.
- Player name comparisons: Although FIDE data is used to improve player name resolution, it is assumed that in approximately 10% of cases fuzzy checking is still required, performing Jaro-Winkler and Levenshtein distance calculations for the player names (worst case: 4 calculations per pair).
- Estimation: 240,000 groups × 0.1 (fuzzy proportion) × 1 pair × 4 (distance calculations per pair) ≈ 96,000 string distance calculations for player names.
- Other header tags: For the remaining ~18 header tags, less computationally intensive internal score comparisons are performed.
Subsumption Deduplication (approx. 2,000 duplicates)
- Goal: Finding games that represent complete sub-segments of a longer game.
- Group formation: With 2,000 subsumption duplicates, it is assumed that these result in 2,000 groups, each consisting of a master game and a subsumed game.
- Comparisons: For each of these 2,000 groups:
  - `moves_san.startswith()` comparison: one string prefix comparison per group.
  - Player name comparisons: If the `startswith` condition is met, the player names are checked for similarity.
  - Estimation: 2,000 groups × 2 (players) × 1 (Jaro-Winkler call) ≈ 4,000 Jaro-Winkler distance calculations for player names.
Textual Fuzzy Deduplication (approx. 18,000 duplicates)
- Goal: Finding very similar games based on the textual similarity of their move sequences (`moves_san`). END-FEN pre-filtering is disabled.
- Group formation: With 18,000 textual fuzzy duplicates, it is assumed that this implies 9,000 duplicate pairs (master + fuzzy duplicate).
- Comparisons: This phase is the most computationally intensive, as it performs pairwise string distance calculations on `moves_san`. Since END-FEN pre-filtering is disabled, the `core_game_hash` groups, which include all games with the same player names (not yet marked as duplicates), can be very large.
  - For the actually identified duplicates: 9,000 pairs × (1 Levenshtein distance for `moves_san` + 1 Jaro-Winkler distance for `moves_san` + 2 Jaro-Winkler distances for player names) = 9,000 × 4 ≈ 36,000 string distance calculations.
  - For non-duplicate pairs within groups: This is the largest factor. After Phases 2 and 3, approximately 9,518,000 "unique" games remain. A significant portion of these games (assumed 5%, i.e., ~475,900 games) belong to `core_game_hash` groups with more than one game but are not duplicates. Assuming these games are distributed among groups with an average of 3 games, this corresponds to approximately 475,900 / 3 ≈ 158,633 groups. Each group of 3 games leads to 3 pair comparisons: 158,633 groups × 3 pairs/group × 4 (distance calculations per pair) ≈ 1,903,600 string distance calculations.
Estimated total number of string distance calculations per iteration: The sum of computationally intensive string distance calculations (Levenshtein and Jaro-Winkler) is as follows:
- Exact Deduplication (player names): approx. 96,000
- Subsumption Deduplication (player names): approx. 4,000
- Textual Fuzzy Deduplication (games & player names): approx. 36,000 + 1,903,600 ≈ 1,939,600
- Total per iteration: approx. 2,039,600 string distance calculations.
Estimation for two iterations: Since I assume that the process converges in two iterations, the number of comparisons would roughly double. While the second iteration might find fewer new duplicates, the need to re-check existing groups persists.
Estimated total string distance calculations over 2 iterations: approx. 2,039,600 × 2 ≈ 4,079,200
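The arithmetic behind this estimate can be reproduced in a few lines; all constants are the assumptions stated above:

```python
# Assumptions from the estimate above.
exact_groups = 480_000 // 2        # exact duplicate groups
fuzzy_share = 0.10                 # share of player names needing fuzzy checks
calcs_per_pair = 4                 # string distance calls per compared pair

exact = exact_groups * fuzzy_share * 1 * calcs_per_pair          # 96,000
subsumption = 2_000 * 2 * 1                                      # 4,000
fuzzy_hits = 9_000 * calcs_per_pair                              # 36,000

unique_after_p2_p3 = 10_000_000 - 480_000 - 2_000                # 9,518,000
in_multi_game_groups = unique_after_p2_p3 * 0.05                 # ~475,900
groups = in_multi_game_groups / 3                                # ~158,633
fuzzy_misses = groups * 3 * calcs_per_pair                       # ~1,903,600

per_iteration = exact + subsumption + fuzzy_hits + fuzzy_misses  # ~2,039,600
print(f"per iteration: {per_iteration:,.0f}")
print(f"two iterations: {2 * per_iteration:,.0f}")               # ~4,079,200
```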
Additional considerations: In addition to these computationally intensive string distance calculations, millions of simpler comparisons are performed (e.g., length comparisons, score comparisons, hash equality checks). While these are far more numerous, they are computationally much less demanding and typically do not represent the bottleneck of the process.
In summary, when analyzing 10 million games with the assumed duplicate distribution (without END-FEN pre-filtering) and two iterations of the deduplication process, approximately 4 to 5 million computationally intensive string distance calculations need to be performed. These operations are largely responsible for the script’s overall runtime.