Check for exact overlap between train and test data so evaluation scores are not quietly inflated by repeated records.
Understand what leakage means in practical terms
Leakage happens when information from the training side reaches the evaluation side in a way that makes the test easier than it should be. One of the most obvious forms is exact overlap between train and test records.
That overlap can come from repeated exports, careless merges, deduplication done too late or splits created before cleanup was finished.
- Exact repeated records are one of the clearest leakage patterns.
- Small datasets are especially sensitive to repeated examples across splits.
- Leakage checks are most useful when they are routine rather than reactive.
Decide whether to compare full rows or one key field
Full-row comparison is strict and useful when the complete record defines the example. Field-based comparison is helpful when metadata changes but the core text or identifier repeats across splits.
For prompt datasets, comparing by `prompt`, `instruction` or another main text field can reveal overlaps that full-object matching would miss.
- Use full rows when metadata is part of the example meaning.
- Use one field when the main text is what drives leakage risk.
- Pick the comparison mode that best matches how the model sees the data.
Normalize the comparison so trivial formatting differences do not hide overlaps
Whitespace differences and minor casing changes can hide exact repeats that are practically the same example. Normalizing those differences before comparison makes exact-overlap checks more realistic.
This is especially useful when train and test files come from different exports or cleanup passes.
- Normalize whitespace before comparing text-heavy datasets.
- Use case-insensitive checks when casing is not meaningful for the task.
- Keep the normalization rules simple enough that the results stay explainable.
Use sample matches to decide whether the split needs to be rebuilt
The overlap count tells you the scale of the problem, but the sample matches tell you how serious it is. Seeing which test rows match which train rows makes it easier to decide whether the issue is harmless duplication or something that invalidates the split.
When the examples are clearly the same task content, rebuilding the split after deduplication is often the safer move.
- Inspect matched row pairs instead of relying only on the count.
- Rebuild the split if repeated examples clearly weaken the evaluation boundary.
- Keep the cleaned and re-split version as the new working dataset.
Place leakage checks in the normal preparation workflow
A practical order is clean the dataset, deduplicate it, create the split and then run a leakage check before you trust the metrics. This order makes it easier to catch exact overlap while the split is still easy to rebuild.
That also makes the evaluation process more defensible because the integrity checks happened before the results were shared.
- Deduplicate before splitting when possible.
- Run a leakage check right after the split is created.
- Treat overlap review as part of evaluation setup, not as a late correction.