Train/Test Leakage Checker

Check whether train and test datasets contain the same records before you trust model metrics. This browser-based leakage checker compares full normalized rows or a selected field and reports overlap counts, rates and example matches.

Comparison options

Input format

Match field (optional)

Normalize whitespace before comparing
Ignore uppercase and lowercase differences

What this tool does

Train/Test Leakage Checker compares two datasets to find exact overlaps before you measure model performance. It can compare full normalized rows or a selected field when you want to focus on the main text or identifier rather than every metadata column.

This matters because even a small amount of leakage can inflate evaluation scores, especially on narrow tasks or small datasets.

Compare full rows or one selected field across train and test sets.
Normalize whitespace, and optionally ignore case, so near-identical formatting variants still match.
Review example overlaps with train and test row numbers.
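
As a rough illustration of the comparison modes listed above, the sketch below builds a comparison key from either a whole row or one field. The helper names `normalize` and `row_key` and their options are hypothetical, and the tool's actual normalization rules may differ.

```python
import json
import re

def normalize(value, collapse_whitespace=True, ignore_case=False):
    """Turn one value into a comparable string (assumed normalization rules)."""
    text = value if isinstance(value, str) else json.dumps(value, sort_keys=True)
    if collapse_whitespace:
        # Collapse runs of whitespace and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    if ignore_case:
        text = text.lower()
    return text

def row_key(row, field=None, **options):
    """Build a comparison key from the whole row or from one selected field."""
    if field is not None:
        return normalize(row.get(field, ""), **options)
    # Full-row mode: normalize every value, then serialize with sorted keys
    # so that column order does not affect the key.
    cleaned = {name: normalize(value, **options) for name, value in row.items()}
    return json.dumps(cleaned, sort_keys=True)

a = {"prompt": "Summarize  this", "source": "web"}
b = {"source": "web", "prompt": " Summarize this "}
print(row_key(a) == row_key(b))    # True
print(row_key(a, field="prompt"))  # Summarize this
```

Two rows count as an exact overlap in this sketch when their keys are equal, which is why extra spaces or a different column order do not hide a repeated record.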

When to use it

Use leakage checks after splitting or when datasets were assembled from several sources and you want confidence that evaluation data was not already seen in training. It is a natural companion to deduplication and splitting workflows.

For exact-match leakage, this browser-first approach is fast and practical. More advanced semantic overlap checks would require heavier tooling and are outside the scope of this page.

Run after dataset splitting and before reporting model metrics.
Compare by text field when metadata differs but the core example may repeat.
Use the sample overlaps to decide whether the split needs to be rebuilt.
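
A minimal sketch of a field-based check, assuming a `text` field and 1-based row numbers (both are assumptions, not confirmed behavior of the tool). It shows how rows with different metadata can still be flagged as the same example.

```python
def find_overlaps(train_rows, test_rows, field="text"):
    """Return (train_row_number, test_row_number, value) for each test row seen in train."""
    first_seen = {}
    for number, row in enumerate(train_rows, start=1):
        key = " ".join(str(row.get(field, "")).split())  # collapse whitespace
        first_seen.setdefault(key, number)               # keep the first train row
    matches = []
    for number, row in enumerate(test_rows, start=1):
        key = " ".join(str(row.get(field, "")).split())
        if key in first_seen:
            matches.append((first_seen[key], number, key))
    return matches

train = [{"text": "Reset my password", "source": "chat"},
         {"text": "Cancel my order", "source": "email"}]
test = [{"text": "Track my parcel", "source": "web"},
        {"text": "Cancel  my order", "source": "web"}]
print(find_overlaps(train, test))  # [(2, 2, 'Cancel my order')]
```

If the returned list is longer than expected, that is usually a sign the split should be rebuilt after deduplicating the source data.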

How to use

Paste train and test datasets in CSV, JSON or JSONL format, or import local files.
Choose whether to compare full rows or a single field such as `prompt`, `id` or `text`.
Run the check to review overlap counts, sample matches and a downloadable JSON report.
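
To reproduce the first step outside the browser, the three input formats can be read into a common list-of-rows shape before comparing. This is only a sketch using the Python standard library; `parse_rows` is a hypothetical helper, not part of the tool.

```python
import csv
import io
import json

def parse_rows(text, fmt):
    """Parse CSV, JSON (array of objects) or JSONL text into a list of dicts."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    if fmt == "jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported format: {fmt}")

csv_text = "prompt,answer\nSummarize,Short\n"
jsonl_text = '{"prompt": "Summarize", "answer": "Short"}\n'
print(parse_rows(csv_text, "csv") == parse_rows(jsonl_text, "jsonl"))  # True
```

Note that CSV values always arrive as strings, so numeric fields may need converting before a full-row comparison against JSON data.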

Example

Input

Train: [{"prompt":"Summarize","answer":"Short"}] | Test: [{"prompt":"Summarize","answer":"Short"}]

Output

Exact overlaps: 1 | Overlap rate: 100%
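
The example above can be reproduced with a few lines of Python. The summary key names are hypothetical, and the rate is assumed to be overlapping test rows divided by total test rows; the tool may define it differently.

```python
import json

train = [{"prompt": "Summarize", "answer": "Short"}]
test = [{"prompt": "Summarize", "answer": "Short"}]

def full_row_key(row):
    # Sorted keys make the comparison independent of column order.
    return json.dumps(row, sort_keys=True)

train_keys = {full_row_key(row) for row in train}
overlaps = sum(1 for row in test if full_row_key(row) in train_keys)

# Assumption: rate = overlapping test rows / total test rows.
rate = overlaps / len(test) if test else 0.0

summary = {"train_rows": len(train), "test_rows": len(test),
           "exact_overlaps": overlaps, "overlap_rate": rate}
print(json.dumps(summary))
# {"train_rows": 1, "test_rows": 1, "exact_overlaps": 1, "overlap_rate": 1.0}
```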

Privacy note

Train/test comparison happens locally in your browser. Both imported datasets stay on your device and are not uploaded anywhere.

Recommended Guides

Start with these higher-value walkthroughs to understand the workflow around this tool, not just the button clicks.

How to Check Train/Test Leakage Before Trusting Model Metrics

Check for exact overlap between train and test data so evaluation scores are not quietly inflated by repeated records.

How to Deduplicate JSONL Training Data Before Splitting or Validation

Remove repeated JSONL records before validation, splitting and model training so your dataset counts and evaluation stay more trustworthy.

FAQ

Can I compare only one field such as `prompt`?

Yes. Switch from full-row mode and enter a single field name when the main text is what matters most for overlap detection.

Does this detect semantic leakage or paraphrases?

No. It checks exact overlap after optional normalization, not meaning-level similarity.
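
A small sketch of what this limitation means in practice, assuming whitespace and case normalization are both enabled (`exact_match` is a hypothetical helper):

```python
def exact_match(a, b):
    """Compare two strings after collapsing whitespace and lowercasing."""
    def canon(s):
        return " ".join(s.split()).lower()
    return canon(a) == canon(b)

# Formatting variants still match after normalization.
print(exact_match("Summarize  this article", "summarize this article"))  # True

# A paraphrase with the same meaning does not.
print(exact_match("Summarize this article", "Give a short summary of this article"))  # False
```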

Should I run this before or after train/test splitting?

Use it after a split exists, or whenever you receive separate train and test files and want to verify the boundary is clean.

Related Tools

Dataset Splitter

Split CSV or JSON datasets into train, validation and test sets in your browser.

JSONL Deduplicator

Remove repeated JSONL records by full object or a selected key field.

Prompt Dataset Converter

Convert CSV or JSON rows into instruction or chat-style prompt datasets.
