Train/Test Leakage Checker

Check whether train and test datasets contain the same records before you trust model metrics. This browser-based leakage checker compares full normalized rows or a selected field and reports overlap counts, rates and example matches.

What this tool does

Train/Test Leakage Checker compares two datasets to find exact overlaps before you measure model performance. It can compare full normalized rows or a selected field when you want to focus on the main text or identifier rather than every metadata column.

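Even a small amount of leakage inflates evaluation metrics, because the model is scored on examples it already saw during training. The effect is largest on narrow tasks and small datasets, where a handful of repeated rows makes up a meaningful share of the test set.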

  • Compare full rows or one selected field across train and test sets.
  • Normalize whitespace before matching, so rows that differ only in spacing still count as overlaps.
  • Review example overlaps with train and test row numbers.
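
The full-row and single-field modes described above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the helper names `normalize` and `row_key` are hypothetical, and both collapsing all whitespace runs to a single space and ignoring column order are assumptions about what "normalized" means here.

```python
import json
import re

def normalize(value):
    """Collapse whitespace runs to one space and trim the ends of string values."""
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value).strip()
    return value

def row_key(row, field=None):
    """Build a comparison key from one field, or from the whole row when field is None."""
    if field is not None:
        # A missing field becomes "null"; a stricter checker might skip such rows.
        return json.dumps(normalize(row.get(field)))
    cleaned = {name: normalize(value) for name, value in row.items()}
    # sort_keys makes the key independent of column order (an assumption).
    return json.dumps(cleaned, sort_keys=True)

a = {"prompt": "Summarize  this\ttext ", "answer": "Short"}
b = {"answer": "Short", "prompt": "Summarize this text"}

print(row_key(a) == row_key(b))                      # full-row comparison
print(row_key(a, "prompt") == row_key(b, "prompt"))  # single-field comparison
```

Two rows count as an exact overlap when their keys are equal, so the normalization rules decide how strict "exact" is.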

When to use it

Run a leakage check after splitting a dataset, or whenever train and test data were assembled from several sources, to confirm that evaluation data was not already seen in training. It is a natural companion to deduplication and splitting workflows.

For exact-match leakage, this browser-first approach is fast and practical. More advanced semantic overlap checks would require heavier tooling and are outside the scope of this page.

  • Run after dataset splitting and before reporting model metrics.
  • Compare by text field when metadata differs but the core example may repeat.
  • Use the sample overlaps to decide whether the split needs to be rebuilt.
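
As a rough sketch of the field-based check, the snippet below matches rows on a `text` field even though the `id` and `source` metadata differ, and reports train and test row numbers for each match. The function name `find_overlaps`, the sample rows and the 1-based row numbering are assumptions made for illustration; the tool's own report may be laid out differently.

```python
from collections import defaultdict

def squeeze(value):
    """Collapse whitespace runs so spacing differences do not hide a match."""
    return " ".join(str(value).split())

def find_overlaps(train, test, field):
    """Return (train_row, test_row, value) for test rows whose field value appears in train."""
    seen = defaultdict(list)
    for number, row in enumerate(train, start=1):  # 1-based row numbers (assumption)
        if field in row:
            seen[squeeze(row[field])].append(number)
    matches = []
    for number, row in enumerate(test, start=1):
        if field not in row:
            continue  # rows without the field cannot match
        value = squeeze(row[field])
        for train_number in seen.get(value, []):
            matches.append((train_number, number, value))
    return matches

train = [
    {"id": "a1", "source": "web", "text": "The cat sat."},
    {"id": "a2", "source": "web", "text": "Dogs bark."},
]
test = [
    {"id": "b7", "source": "forum", "text": "The  cat sat. "},
    {"id": "b8", "source": "forum", "text": "Birds sing."},
]

print(find_overlaps(train, test, "text"))  # → [(1, 1, 'The cat sat.')]
```

A full-row comparison of the same data would find nothing, because `id` and `source` differ in every row.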

How to use

  • Paste train and test datasets in CSV, JSON or JSONL format, or import local files.
  • Choose whether to compare full rows or a single field such as `prompt`, `id` or `text`.
  • Run the check to review overlap counts, sample matches and a downloadable JSON report.
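
To reproduce the check outside the browser, the three input formats can be read into the same list-of-rows shape first. The sketch below uses only the Python standard library; `parse_records` is a hypothetical helper, and treating a single top-level JSON object as a one-row dataset is an assumption.

```python
import csv
import io
import json

def parse_records(text, fmt):
    """Parse CSV, JSON or JSONL text into a list of dict rows."""
    if fmt == "csv":
        # The first line is taken as the header; all values stay strings.
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    if fmt == "jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported format: {fmt}")

csv_text = "prompt,answer\nSummarize,Short\n"
json_text = '[{"prompt": "Summarize", "answer": "Short"}]'
jsonl_text = '{"prompt": "Summarize", "answer": "Short"}\n'

rows_csv = parse_records(csv_text, "csv")
rows_json = parse_records(json_text, "json")
rows_jsonl = parse_records(jsonl_text, "jsonl")
print(rows_csv == rows_json == rows_jsonl)  # → True
```

CSV values are always strings, so a numeric `id` loaded from CSV will not equal the same number loaded from JSON unless both are converted to a common type before comparison.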

Example

Input

Train: [{"prompt":"Summarize","answer":"Short"}] | Test: [{"prompt":"Summarize","answer":"Short"}]

Output

Exact overlaps: 1 | Overlap rate: 100%
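
The same example can be worked through in Python. Two details here are assumptions rather than documented behavior: the overlap rate is taken as overlapping test rows divided by total test rows, and the `summary` field names are hypothetical stand-ins for whatever the downloadable JSON report actually contains.

```python
import json

train = [{"prompt": "Summarize", "answer": "Short"}]
test = [{"prompt": "Summarize", "answer": "Short"}]

def key(row):
    """Serialize a row with sorted keys so column order does not matter."""
    return json.dumps(row, sort_keys=True)

train_keys = {key(row) for row in train}
overlaps = sum(1 for row in test if key(row) in train_keys)

# Assumption: rate = overlapping test rows / total test rows.
rate = overlaps / len(test) if test else 0.0

# Hypothetical summary layout, for illustration only.
summary = {
    "train_rows": len(train),
    "test_rows": len(test),
    "exact_overlaps": overlaps,
    "overlap_rate": rate,
}

print(f"Exact overlaps: {overlaps} | Overlap rate: {rate:.0%}")
# → Exact overlaps: 1 | Overlap rate: 100%
```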

Privacy note

Train/test comparison happens locally in your browser. Both imported datasets stay on your device and are not uploaded anywhere.

FAQ

Can I compare only one field such as `prompt`?

Yes. Switch from full-row mode to single-field mode and enter the field name. This is useful when the main text matters most for overlap detection and the other columns are metadata.

Does this detect semantic leakage or paraphrases?

No. It checks exact overlap after optional normalization, not meaning-level similarity.

Should I run this before or after train/test splitting?

Use it after a split exists, or whenever you receive separate train and test files and want to verify the boundary is clean.
