How duplicate games are found

Since all versions of Scid seem to miss many duplicate games, I have written a script that deduplicates the games of a larger PGN file in several phases. The following article describes how the script works.

This script, named “PGN Deduplicator,” is designed to manage and clean large collections of chess games stored in PGN (Portable Game Notation) format. It uses a PostgreSQL database to efficiently store and process these games, identifying and eliminating duplicates.

The script operates in several key phases:

  1. Import (Phase 1):
    • It takes a PGN file as input. If the file is very large, it can split it into smaller “chunks” to make processing more manageable.
    • For each game in these chunks, it extracts important information like players, event, site, date, and the sequence of moves.
    • It then calculates “hashes” (digital fingerprints) for each game and for its core details (e.g., player names) to enable quick comparisons later (a sketch of one possible hashing scheme follows this list).
    • This structured game data is then efficiently loaded into a PostgreSQL database, ready for deduplication.
  2. Exact Deduplication (Phase 2):
    • In this phase, the script identifies and groups games that are exact duplicates based on their unique game hash.
    • Within each group of exact duplicates, it selects the “best” version of the game (e.g., the one with the most complete headers or the longest PGN text) to serve as the “master” record (a SQL sketch of this step follows the list).
    • The headers of this master game are then optimized by merging relevant information from all its exact duplicates.
    • All identified exact duplicates are marked as such in the database, preventing them from being processed again as unique games.
  3. Subsumption Deduplication (Phase 3):
    • This phase targets games where one game is a complete subset of another (e.g., a shorter game is contained within a longer version of the same game). This is common for partial games or analysis snippets.
    • It uses a “core game hash” (based on players) to group potential subsumed duplicates.
    • Within these groups, it checks whether one game’s move sequence is a prefix of another’s, and it also verifies player-name similarity using the Jaro-Winkler distance, a measure of string similarity (see the subsumption sketch after this list).
    • If a game is found to be subsumed, it’s marked as a “subsumed duplicate” in the database, and its header information can contribute to the master game’s enrichment.
  4. FEN Recalculation (optional pre-phase of Textual Fuzzy Deduplication):
    • If requested, this phase re-analyzes all games in the database to ensure that their “end FEN” (the Forsyth-Edwards Notation describing the final board position) is correctly calculated and stored. This keeps the data consistent and provides a pre-filter for fuzzy deduplication (a replay sketch follows the list).
  5. Textual Fuzzy Deduplication (Phase 4):
    • This is an optional, more advanced phase for finding duplicates that aren’t exact matches but are very similar in their move sequences due to minor variations (e.g., typos, different notation for the same move).
    • It employs string similarity algorithms: Levenshtein distance (counts minimum edits needed to transform one string into another) and Jaro-Winkler distance (a weighted measure of similarity, especially good for short strings or those with common prefixes).
    • Games are grouped by their core game hash (players), and optionally pre-filtered by their end FEN to speed up the checks.
    • If two games meet the defined similarity thresholds for both Levenshtein and Jaro-Winkler, they are marked as “textual fuzzy duplicates” (see the similarity sketch after this list).
  6. Export (Phase 5):
    • Finally, the script can export the processed games back into PGN files.
    • It can export only the unique, master games (with their merged and optimized headers).
    • Optionally, it can also export the identified duplicate games into a separate file. For these duplicates, it rewrites their headers to indicate their duplicate status and the group they belong to, which is useful for review (a header-tagging sketch follows the list).
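
Phase 1’s fingerprints can be pictured with a few lines of Python. This is only a minimal sketch: the function names (game_hash, core_game_hash), the normalization rules, and the choice of SHA-256 are illustrative assumptions, not the script’s actual implementation.

```python
import hashlib

def normalize(value: str) -> str:
    """Lower-case and collapse whitespace so formatting differences do not change the hash."""
    return " ".join(value.lower().split())

def game_hash(white: str, black: str, moves_san: str) -> str:
    """Fingerprint of the full game: both players plus the complete move text."""
    payload = "|".join([normalize(white), normalize(black), normalize(moves_san)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def core_game_hash(white: str, black: str) -> str:
    """Coarser fingerprint based on the players only; used later to group candidate duplicates."""
    payload = "|".join([normalize(white), normalize(black)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```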
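
The exact-duplicate grouping of Phase 2 can be done almost entirely inside PostgreSQL. The sketch below assumes a hypothetical games table with id, game_hash, header_count, pgn_text, and duplicate_of columns; it ranks the games within each hash group, keeps the best-ranked game as the master, and marks the rest as duplicates. psycopg2 is used here simply as one common PostgreSQL client.

```python
import psycopg2

# Hypothetical schema: games(id, game_hash, header_count, pgn_text, duplicate_of)
MARK_EXACT_DUPLICATES = """
WITH ranked AS (
    SELECT id,
           ROW_NUMBER()    OVER w AS rn,
           FIRST_VALUE(id) OVER w AS master_id
    FROM games
    WINDOW w AS (
        PARTITION BY game_hash
        ORDER BY header_count DESC, length(pgn_text) DESC
    )
)
UPDATE games g
SET duplicate_of = r.master_id
FROM ranked r
WHERE g.id = r.id
  AND r.rn > 1;
"""

with psycopg2.connect("dbname=chess") as conn:
    with conn.cursor() as cur:
        cur.execute(MARK_EXACT_DUPLICATES)
```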
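
The subsumption test of Phase 3 combines a prefix check on the move lists with a fuzzy comparison of the player names. A minimal sketch, using the jellyfish library as one readily available Jaro-Winkler implementation and an assumed name threshold of 0.9:

```python
import jellyfish

def is_subsumed(short_moves: list[str], long_moves: list[str],
                short_players: tuple[str, str], long_players: tuple[str, str],
                name_threshold: float = 0.9) -> bool:
    """True if the shorter game is a strict prefix of the longer one and the
    player names are similar enough to be considered the same people."""
    if len(short_moves) >= len(long_moves):
        return False
    if long_moves[:len(short_moves)] != short_moves:
        return False
    # Jaro-Winkler similarity lies in [0, 1]; 1.0 means identical strings.
    white_ok = jellyfish.jaro_winkler_similarity(short_players[0], long_players[0]) >= name_threshold
    black_ok = jellyfish.jaro_winkler_similarity(short_players[1], long_players[1]) >= name_threshold
    return white_ok and black_ok
```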
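
Recomputing the end FEN amounts to replaying each game’s mainline and reading off the final position. The article does not say which parser the script uses; the sketch below assumes the python-chess library.

```python
import io
import chess.pgn

def end_fen(pgn_text: str) -> str | None:
    """Replay the mainline of a single PGN game and return the FEN of its final position."""
    game = chess.pgn.read_game(io.StringIO(pgn_text))
    if game is None:          # unparsable game
        return None
    board = game.board()      # honours a SetUp/FEN header if the game has one
    for move in game.mainline_moves():
        board.push(move)
    return board.fen()
```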
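
The double threshold of Phase 4 can be expressed as a single predicate over two move-text strings. The threshold values below are illustrative only; the actual limits are whatever the script is configured with.

```python
import jellyfish

def is_textual_fuzzy_duplicate(moves_a: str, moves_b: str,
                               max_levenshtein: int = 10,
                               min_jaro_winkler: float = 0.95) -> bool:
    """Two move strings count as textual fuzzy duplicates only if they pass
    both the edit-distance and the Jaro-Winkler similarity test."""
    if jellyfish.levenshtein_distance(moves_a, moves_b) > max_levenshtein:
        return False
    return jellyfish.jaro_winkler_similarity(moves_a, moves_b) >= min_jaro_winkler
```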
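
Tagging exported duplicates in Phase 5 comes down to adding a couple of headers before the game is written out. The tag names DuplicateGroup and DuplicateOf below are hypothetical, since the article does not specify them; python-chess is again assumed as the PGN library.

```python
import io
import chess.pgn

def export_duplicate(pgn_text: str, group_id: int, master_id: int) -> str | None:
    """Return the PGN of a duplicate game with extra headers marking its status."""
    game = chess.pgn.read_game(io.StringIO(pgn_text))
    if game is None:
        return None
    game.headers["DuplicateGroup"] = str(group_id)  # hypothetical tag names
    game.headers["DuplicateOf"] = str(master_id)
    return str(game)
```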

Throughout these phases, the script uses a multiprocessing approach, running several “workers” in parallel to speed up the processing of large game collections. It also includes robust logging to track its progress and report any issues encountered. Nevertheless, the script still needs around 10 hours to complete the process for around 10 million games.
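
The worker model can be pictured as a process pool in which each worker handles one chunk of the split input file. A rough sketch, with placeholder chunk names and a deliberately simplified worker:

```python
from multiprocessing import Pool

def import_chunk(chunk_path: str) -> int:
    """Simplified worker: count the games in one PGN chunk. A real worker would
    parse headers and moves, hash them, and insert the rows into PostgreSQL."""
    games = 0
    with open(chunk_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("[Event "):  # every PGN game begins with an Event tag
                games += 1
    return games

if __name__ == "__main__":
    chunks = ["chunk_000.pgn", "chunk_001.pgn"]  # hypothetical files produced by the splitter
    with Pool(processes=4) as pool:
        per_chunk = pool.map(import_chunk, chunks)
    print(f"imported {sum(per_chunk)} games")
```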
