How Duplicates Are Removed from PGN Files
A Quick Overview
This system for deduplicating chess games processes PGN files in several phases to identify duplicates and optimize data quality. First, it reads PGN files, extracts and cleans essential data, calculates hashes, and recognizes metadata. Then, it consolidates player-pair groups using fuzzy name comparisons. This is followed by exact deduplication based on move sequence hashes, where the header of the best game is chosen as the master. Games with subsumed move sequences are also flagged. Another phase uses fuzzy matching for textual similarities of move sequences. Finally, the system exports the unique games and, optionally, the flagged duplicates, optimizing header quality through the integration of FIDE data and a detailed evaluation to ensure the master game contains the best available information.
Feel free to leave a comment below this article.
1. Phase: PGN File Import
This phase is responsible for reading and preprocessing PGN chess games before they are imported into the database.
- File Processing: The script reads PGN files from the source. It can split large files into smaller, more manageable “chunks” using `pgn-extract`. Existing chunks in the temporary directory are reused unless overwriting is forced.
- Data Extraction: For each game, important header information (such as players, date, event, site) and the move sequence are extracted.
- Cleaning and Normalization:
- Chessbase EVAL comments (`[%evp…]`) are automatically removed.
- Blank lines in the PGN text are normalized.
- Player names are comprehensively normalized and cleaned, including the removal of titles (GM, IM, etc.) and country codes.
- FIDE player data (ID, names, titles) from the `fide_players` table is integrated to enrich headers and enable more precise player assignment.
- Hash Calculation: Unique hashes are calculated for various aspects of the game:
- A hash for the entire game (`game_hash`).
- A hash for player pairs (`player_pair_hash`).
- A hash for the full move sequence (`moves_hash_full`).
- Partial move hashes (every 10 half-moves) and end-FENs at specific half-move counts (`moves_san_segment_hashes` and `end_fens_at_ply`).
- Metadata Detection: It is detected and stored whether the PGN text of the game contains comments or variations.
- Date Handling: The script attempts to parse flexible date formats and calculate a quality score for the date. It can also prefer the `EventDate` if it has a higher quality.
- FEN Generation: The end-FEN of the game is calculated and a hash of it is stored.
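The hash calculation can be pictured with a minimal sketch. It assumes SHA-1 over the space-joined SAN moves and treats each partial hash as the hash of the first 10, 20, 30, ... half-moves; the actual script may use a different hash function and segment layout. The function names `sha1_hex` and `move_hashes` are hypothetical.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Hex digest of a string (SHA-1 is an assumption, not confirmed by the script)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def move_hashes(san_moves: list[str], segment_size: int = 10) -> tuple[str, dict[int, str]]:
    """Return (full-sequence hash, {ply: hash of the first `ply` half-moves})."""
    full_hash = sha1_hex(" ".join(san_moves))
    segment_hashes = {
        ply: sha1_hex(" ".join(san_moves[:ply]))
        for ply in range(segment_size, len(san_moves) + 1, segment_size)
    }
    return full_hash, segment_hashes
```

With cumulative prefix hashes like these, a game of exactly 10 half-moves has a full hash equal to the 10-ply segment hash of any longer game that starts with the same moves, which is what makes prefix comparisons cheap later on.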
2. Phase: Consolidated Player Pair Grouping
This phase is a crucial preprocessing step that groups games based on the similarity of player names.
- Fuzzy Name Comparisons: Games played by the logically same player pairs are grouped. This is done through fuzzy name comparisons using algorithms such as Jaro-Winkler and Monge-Elkan.
- Canonical ID Assignment: A `canonical_player_pair_id` is assigned to indicate that these games originate from the same player pairs, even if the names vary slightly. If the names differ too much, no assignment can be made here, and the game will likely be ignored during duplicate detection.
- Parallel Processing: The initial assignment of temporary IDs occurs in parallel in workers. The final consolidation and assignment of global IDs takes place in the main process to ensure quality. Workers identify equivalence classes of player names and send these relationships to the main process, which consolidates them globally.
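One common way to consolidate equivalence classes reported by several workers is a union-find (disjoint-set) structure. The sketch below assumes each worker returns pairs of player-pair keys it considers equivalent; the key format, the `UnionFind` class, and the `consolidate` function are hypothetical, and the script may consolidate differently.

```python
class UnionFind:
    """Minimal disjoint-set structure with path compression."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}

    def find(self, item: str) -> str:
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a: str, b: str) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Deterministic choice of representative, independent of input order.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)


def consolidate(worker_results: list[list[tuple[str, str]]]) -> dict[str, int]:
    """Merge the pairwise equivalences from all workers into global group IDs."""
    uf = UnionFind()
    for pairs in worker_results:
        for a, b in pairs:
            uf.union(a, b)
    group_ids: dict[str, int] = {}
    result: dict[str, int] = {}
    for key in sorted(uf.parent):
        root = uf.find(key)
        result[key] = group_ids.setdefault(root, len(group_ids) + 1)
    return result
```

Running the merge in a single process means two workers that each saw only part of a chain (A matches B in one worker, B matches C in another) still end up with A, B, and C in the same group.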
3. Phase: Exact Move Sequence Deduplication
This phase identifies and marks games that have exactly identical move sequences.
- Hash Comparison: Games are compared based on their `moves_hash_full`. If this hash is identical, the games are considered exact duplicates.
- Master Selection: For a group of exact duplicates, the game with the “best” header is selected as the master. The header quality score is calculated based on the completeness and quality of the header information (player names, date, Elo, event, site). See also the section “PGN Header Optimization Process”.
- Marking: The master game receives a `duplicate_status` of `NULL`, while the other duplicates are marked as `exact_dupe`. The optimized headers of the master are stored in the database.
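As a rough illustration of this phase, the sketch below groups games by `moves_hash_full` and keeps the game with the highest header score as the master. The dictionary layout, the `header_score` field, and the tie-break on the lower `game_id` are assumptions made for the example.

```python
from collections import defaultdict


def mark_exact_duplicates(games: list[dict]) -> list[dict]:
    """Set `duplicate_status` to None for masters and 'exact_dupe' for the rest."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for game in games:
        groups[game["moves_hash_full"]].append(game)

    for members in groups.values():
        # Highest header score wins; on a tie the lower game_id is kept (assumption).
        master = max(members, key=lambda g: (g["header_score"], -g["game_id"]))
        for game in members:
            game["duplicate_status"] = None if game is master else "exact_dupe"
    return games
```

A game whose hash occurs only once forms a group of one and therefore stays unmarked.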
4. Phase: Subsumption Deduplication
This phase finds games whose move sequence is an exact prefix of another game’s move sequence, that is, shorter games that are fully contained in the opening part of a longer game.
- Subset Detection: The script searches for games where the move sequence of a shorter game exactly matches the beginning of the move sequence of a longer game.
- Marking: Such games are marked as `subsumed_dupe`. Here too, the “best” header for the master games in the duplicate group is selected and applied.
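Prefix detection can be sketched without comparing every game against every other one. The example below assumes exact duplicates have already been removed, so all move sequences are distinct. After lexicographic sorting, a sequence that is a prefix of any other sequence is also a prefix of its immediate successor, so one pass over neighbours is enough. The function name and the input layout are hypothetical; the script itself may rely on its stored segment hashes instead.

```python
def find_subsumed(games: dict[int, list[str]]) -> dict[int, int]:
    """Map each subsumed game_id to the id of a longer game that starts with it.

    Assumes all move sequences are distinct (exact duplicates already handled).
    """
    ordered = sorted(games.items(), key=lambda item: item[1])
    subsumed: dict[int, int] = {}
    for (short_id, short), (long_id, longer) in zip(ordered, ordered[1:]):
        if len(short) < len(longer) and longer[: len(short)] == short:
            subsumed[short_id] = long_id
    return subsumed
```

Note that the game a shorter one is mapped to may itself be subsumed by an even longer game; following the chain leads to the longest game in the group.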
5. Phase: Textual Fuzzy Deduplication
This phase uses advanced fuzzy matching algorithms to identify similar but not identical games.
- Algorithms: Jaro-Winkler and Monge-Elkan are applied to the normalized move sequences.
- Dynamic Thresholds: The similarity thresholds can be dynamically adjusted based on the move count to control the sensitivity of the detection.
- Pre-filters: There are filters based on the minimum number of moves and the maximum percentage move difference to reduce the number of comparisons.
- Marking: Games that meet these fuzzy criteria and are not yet marked as duplicates (i.e., `duplicate_status IS NULL`) are marked as `textual_fuzzy_dupe`. The master header is also optimized.
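The interplay of pre-filter and dynamic threshold might look roughly like the sketch below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for Jaro-Winkler and Monge-Elkan, which are not reimplemented here. All numbers except the 30% length difference are invented for illustration, and the direction of the threshold (stricter for short games) is an assumption.

```python
from difflib import SequenceMatcher


def passes_prefilter(plies_a: int, plies_b: int,
                     min_plies: int = 20, max_diff: float = 0.30) -> bool:
    """Skip pairs that are too short or differ too much in length."""
    longer, shorter = max(plies_a, plies_b), min(plies_a, plies_b)
    if shorter < min_plies:
        return False
    return (longer - shorter) / longer <= max_diff


def threshold_for(plies: int, strict: float = 0.98,
                  relaxed: float = 0.92, relaxed_from: int = 120) -> float:
    """Interpolate linearly from `strict` (0 plies) to `relaxed` (>= relaxed_from plies)."""
    fraction = min(plies, relaxed_from) / relaxed_from
    return strict - (strict - relaxed) * fraction


def is_fuzzy_duplicate(moves_a: list[str], moves_b: list[str]) -> bool:
    """Compare two SAN move lists after the cheap pre-filter."""
    if not passes_prefilter(len(moves_a), len(moves_b)):
        return False
    similarity = SequenceMatcher(None, moves_a, moves_b, autojunk=False).ratio()
    return similarity >= threshold_for(min(len(moves_a), len(moves_b)))
```

The pre-filter runs first because it only needs the two ply counts, so most candidate pairs are discarded before the comparatively expensive similarity computation.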
6. Phase: Export
The final phase is responsible for exporting the deduplicated and normalized games into PGN files.
- Unique Games: Unique games (the master games of the duplicate groups, i.e., `duplicate_status IS NULL`) can be exported to a separate PGN file. A minimum header quality is considered here.
- Duplicate Games: Optionally, games marked as duplicates (`exact_dupe`, `subsumed_dupe`, `textual_fuzzy_dupe`) can be exported to a separate PGN file. The `Site` and `Event` tags are modified to reflect the duplicate status and duplicate group.
- Header Optimization: During export, header optimization is applied to ensure that the best available header information is included in the exported games.
PGN Header Optimization Process
The header optimization process aims to ensure the highest possible quality of header data for each game, especially for the “master” game within a duplicate group. This involves several steps of cleaning, scoring, and merging, heavily leveraging FIDE player data.
1. Initialization and FIDE Data Integration (during Import)
During the initial import of a PGN game, player names are transformed and enriched with FIDE data:
- Player names (`White` and `Black`) are cleaned by removing common titles (GM, IM, etc.) and country codes.
- A “normalized” string is created for hashing and general comparisons, typically by removing all non-alphanumeric characters and converting to lowercase.
- The script attempts to link players to FIDE data:
- If a normalized player name uniquely matches the `main_name` of a FIDE player in the loaded FIDE data, the game’s header is updated with the FIDE player’s `main_name`, `FideId`, and `Title` (if available). This only occurs if the FIDE name has sufficient length. This enriches the game’s header early on with authoritative FIDE information.
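The cleaning and normalization steps can be sketched as follows. The title list, the assumption that country codes appear as three uppercase letters in parentheses, and the function names `clean_name` and `normalize_name` are illustrative; the script's actual rules are likely more extensive.

```python
import re

# Titles are matched case-sensitively so that surnames such as "Im" survive.
TITLE_PATTERN = re.compile(r"\b(?:GM|IM|FM|CM|WGM|WIM|WFM|WCM)\b\.?")
# Assumed country-code format: three uppercase letters in parentheses, e.g. "(NOR)".
COUNTRY_PATTERN = re.compile(r"\(\s*[A-Z]{3}\s*\)")


def clean_name(raw: str) -> str:
    """Remove titles and country codes, then tidy up whitespace and stray commas."""
    name = TITLE_PATTERN.sub(" ", raw)
    name = COUNTRY_PATTERN.sub(" ", name)
    return re.sub(r"\s+", " ", name).strip(" ,")


def normalize_name(raw: str) -> str:
    """Lowercase the cleaned name and keep only alphanumeric characters."""
    return "".join(ch for ch in clean_name(raw).lower() if ch.isalnum())
```

The cleaned form is what a reader would see in a header, while the normalized form is only meant as a lookup and hashing key.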
2. Loading FIDE Player Data
Before deduplication or header optimization phases begin, the script loads FIDE player data from the `fide_players` table. This data is crucial for robust player name matching and header enhancement:
- The data is stored in two main structures:
- `fide_data_by_id`: A dictionary mapping FIDE IDs to the full player data (main name, alternative names, country, titles, etc.).
- `fide_data_by_normalized_name`: A dictionary mapping various normalized name variants (main names and alternative names) to a list of corresponding FIDE IDs. This allows for flexible lookups based on different spellings or formats of a player’s name.
- Both the `main_name` and any `alternative_names` from the FIDE data are normalized and added to the `fide_data_by_normalized_name` map. This ensures that different ways a player’s name might appear in a PGN can still be linked to their official FIDE record.
- Names in the PGN are only replaced if a unique assignment to a single person in the `fide_players` table occurs.
3. Header Quality Scoring
To determine the “best” header among duplicates, each game’s header is assigned a quality score. This score is a composite of various header fields:
- Player Names: Scored based on length and the presence of problematic characters. A significant bonus is added if a player name can be uniquely linked to a FIDE ID.
- Event/Site: Scored based on length and the presence of meaningful data.
- Date: Scored based on the precision of the date (full date > month-year > year-only). A bonus is given if the date did not need to be changed.
- Elo Ratings: Scored based on their numerical value.
- Result: A result like “1-0”, “0-1”, or “1/2-1/2” receives a high score. Missing results (“*” or “?”) receive a negative score.
- Other Headers: General header fields receive a score based on their length.
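To make the idea of a composite score concrete, here is a small sketch covering date precision, result, and event/site. All weights and the function names `score_date`, `score_result`, and `score_header` are made up for illustration; player names and Elo are left out to keep the example short.

```python
import re

DATE_PATTERN = re.compile(r"(\d{4}|\?{4})\.(\d{2}|\?{2})\.(\d{2}|\?{2})")


def score_date(date: str) -> int:
    """3 = full date, 2 = year and month, 1 = year only, 0 = unknown or malformed."""
    match = DATE_PATTERN.fullmatch(date.strip())
    if not match or not match.group(1).isdigit():
        return 0
    _, month, day = match.groups()
    if month.isdigit() and day.isdigit():
        return 3
    return 2 if month.isdigit() else 1


def score_result(result: str) -> int:
    """Decided or drawn games score positively, unknown results negatively."""
    return 10 if result in {"1-0", "0-1", "1/2-1/2"} else -10


def score_header(header: dict[str, str]) -> int:
    """Combine the partial scores into one number (weights are hypothetical)."""
    score = 5 * score_date(header.get("Date", ""))
    score += score_result(header.get("Result", "*"))
    for tag in ("Event", "Site"):
        value = header.get(tag, "").strip()
        if value and value != "?":
            score += min(len(value), 20)  # cap so that long text cannot dominate
    return score
```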
4. Selecting the Best Player Name from a Pool
Within a duplicate group that contains multiple games with potentially different player name spellings, the script selects the “best” player name. This process is crucial for data consistency:
- All player names (White and Black) from all games in the duplicate group are collected.
- Each of these candidate names is individually scored, with a significant bonus for linking to a FIDE ID.
- Candidates are sorted by their score, length (longer, more complete names are preferred), and `game_id` (as a tie-breaker).
- The name with the highest score is selected as the “best” name. If this best name is linked to a unique FIDE ID and the FIDE main name meets the minimum length requirement, the FIDE main name is used as the final best player name, along with the FIDE ID and title. This ensures that the most official and highest quality name form is preferred.
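The selection rule can be sketched as a single ordering over the candidates. The candidate layout, the minimum length of 5, and the choice of the lower `game_id` as tie-breaker are assumptions for this example.

```python
MIN_FIDE_NAME_LENGTH = 5  # hypothetical minimum length for a usable FIDE name


def pick_best_name(candidates: list[dict]) -> dict:
    """Pick the best candidate by score, then name length, then game_id."""
    best = min(
        candidates,
        key=lambda c: (-c["score"], -len(c["name"]), c["game_id"]),
    )
    fide = best.get("fide")
    if fide and len(fide["main_name"]) >= MIN_FIDE_NAME_LENGTH:
        # A unique FIDE link replaces the PGN spelling with the official one.
        return {"name": fide["main_name"], "fide_id": fide["fide_id"], "title": fide.get("title")}
    return {"name": best["name"], "fide_id": None, "title": None}
```

Here each candidate carries a `fide` entry only if its name was linked to exactly one FIDE player, so the replacement step never has to resolve ambiguity itself.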
5. Merging Header Data
After the master of a duplicate set has been identified, the header information of all games in this group is merged to create the optimal header for the master. This is done field by field:
- Player Names (`White`, `Black`): The previously determined “best” player name from the pool of all candidates in the group is adopted, including the associated FIDE ID and title, if applicable.
- Date (`Date`): The script selects the date with the highest quality score from all games in the group. If scores are equal, the later date is preferred.
- Other Tags (e.g., `Event`, `Site`, `Result`, `WhiteElo`, `BlackElo`, `PlyCount`): For each of these fields, all values from all games in the group are collected. Each value receives its own score (e.g., `score_generic_header`, `score_elo`). The value with the highest score is selected for the master header. In case of a tie, the longer value is preferred.
- Irrelevant or empty tags are removed from the optimized header.
6. Updating the Master Game
Finally, the optimized headers are updated in the database for the master game. The master game is also marked as “header_optimized” and its `duplicate_status` is set to `NULL` to indicate that it is the canonical entry for this duplicate group.
Final Cleanup with Scid
Finally, a cleanup is carried out with Scid. Scid still finds some duplicates here, which is mainly due to two things:
- Formation of the player pair groupings: If the players’ names are spelled so differently that they are not included in the grouping, they cannot be recognized as duplicates.
- The maximum difference in game length of 30%: If the ply counts of two games differ by more than 30%, the games are not recognized as duplicates either.
This cleanup catches approximately 1,500 to 2,000 additional duplicate games.
Script version 3.2