What this tool does
Train/Test Leakage Checker compares two datasets to find exact overlaps before you measure model performance. It can compare full normalized rows or a selected field when you want to focus on the main text or identifier rather than every metadata column.
This matters because even a small amount of leakage can make evaluation results look better than they really are, especially on narrow tasks or smaller datasets.
- Compare full rows or one selected field across train and test sets.
- Normalize whitespace before matching, so rows that differ only in spacing still count as overlaps.
- Review example overlaps with train and test row numbers.
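The matching described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation. The function names (`normalize`, `row_key`, `find_overlaps`), the rule of collapsing whitespace runs to a single space, and the use of 1-based row numbers are all assumptions made for the example.

```python
import re

def normalize(value):
    """Collapse runs of whitespace to one space and trim the ends (assumed rule)."""
    return re.sub(r"\s+", " ", str(value)).strip()

def row_key(row, field=None):
    """Build the comparison key: one field, or every field in sorted column order."""
    if field is not None:
        return normalize(row.get(field, ""))
    return tuple((name, normalize(row[name])) for name in sorted(row))

def find_overlaps(train_rows, test_rows, field=None):
    """Return (train_row_number, test_row_number) pairs, using 1-based row numbers."""
    seen = {}
    for number, row in enumerate(train_rows, start=1):
        # Keep the first train row that produced each key.
        seen.setdefault(row_key(row, field), number)
    overlaps = []
    for number, row in enumerate(test_rows, start=1):
        key = row_key(row, field)
        if key in seen:
            overlaps.append((seen[key], number))
    return overlaps

# Hypothetical sample data: the same sentence appears in both sets,
# with different spacing and a different "source" value.
train = [
    {"text": "The  cat sat.", "source": "a"},
    {"text": "Dogs bark.", "source": "b"},
]
test = [
    {"text": "Birds sing.", "source": "c"},
    {"text": "The cat sat. ", "source": "z"},
]

full_row_overlaps = find_overlaps(train, test)            # metadata differs, so no match
text_overlaps = find_overlaps(train, test, field="text")  # matches on the text field
```

In this sample, the full-row comparison finds nothing because the `source` column differs, while the comparison on `text` pairs train row 1 with test row 2.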
When to use it
Run a leakage check after splitting a dataset, or when the train and test sets were assembled from several sources, and you want confidence that evaluation data was not already seen in training. It is a natural companion to deduplication and splitting workflows.
For exact-match leakage, this browser-first approach is fast and practical. More advanced semantic overlap checks would require heavier tooling and are outside the scope of this page.
- Run after dataset splitting and before reporting model metrics.
- Compare by text field when metadata differs but the core example may repeat.
- Use the sample overlaps to decide whether the split needs to be rebuilt.
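To help judge whether a split needs rebuilding, one option is to summarize how much of the test set also appears in training. The sketch below is illustrative only: `leakage_report` and its output keys are hypothetical names, whitespace normalization is assumed to collapse runs of spaces, and this page does not define a threshold, so what rate counts as acceptable is left to you.

```python
import re

def normalize(text):
    """Collapse runs of whitespace to one space and trim the ends (assumed rule)."""
    return re.sub(r"\s+", " ", text).strip()

def leakage_report(train_texts, test_texts):
    """Summarize which test rows (1-based) also appear in train after normalization."""
    train_keys = {normalize(t) for t in train_texts}
    leaked = [
        number
        for number, text in enumerate(test_texts, start=1)
        if normalize(text) in train_keys
    ]
    rate = len(leaked) / len(test_texts) if test_texts else 0.0
    return {"leaked_test_rows": leaked, "leak_rate": rate}

# Hypothetical sample data for illustration.
report = leakage_report(
    ["alpha beta", "gamma", "delta"],
    ["gamma ", "epsilon", "alpha   beta", "zeta"],
)
```

A few leaked rows can often be dropped from the test set. A high rate suggests the split itself was built incorrectly and should be redone, for example by deduplicating before splitting.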