A practical guide to checking headers, fixing missing values, removing duplicates and preparing cleaner CSV datasets for ML projects.
Start with column structure
The first thing to check is whether the file structure is stable. Column names should be clear, consistent and easy to map to features. Mixed naming styles such as `user_id`, `User ID` and `userid` make it easy to treat the same field as separate columns when you merge or preprocess data later.
Look for shifted columns, accidental separators inside values and rows with different field counts. Even small structure problems can break imports or silently move values into the wrong columns.
- Rename headers into one consistent naming style.
- Confirm that every row has the same number of fields.
- Check that delimiters and quoted values are parsed correctly.
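As a minimal sketch of these checks, the snippet below uses only Python's standard `csv` module. The helper names `normalize_header` and `check_structure`, the snake_case target style and the sample data are assumptions made for illustration, not part of any particular tool.

```python
import csv
import io
import re


def normalize_header(name):
    """Convert a header such as 'User ID' or 'signupDate' to snake_case."""
    name = name.strip()
    # Insert an underscore at lower-to-upper boundaries: signupDate -> signup_Date
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse any run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def check_structure(text, delimiter=","):
    """Return (normalized_headers, bad_rows) for CSV text.

    bad_rows lists (line_number, field_count) for every row whose
    field count differs from the number of headers.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    headers = [normalize_header(h) for h in next(reader)]
    bad_rows = []
    for row in reader:
        if len(row) != len(headers):
            bad_rows.append((reader.line_num, len(row)))
    return headers, bad_rows


# Hypothetical sample: the second data row has an unquoted extra separator.
sample = (
    "User ID,Full Name,signupDate\n"
    '1,"Doe, Jane",2024-01-05\n'
    "2,John,Smith,2024-02-10\n"
)
headers, bad_rows = check_structure(sample)
print(headers)   # ['user_id', 'full_name', 'signup_date']
print(bad_rows)  # [(3, 4)]
```

The quoted value `"Doe, Jane"` is parsed as one field, while the unquoted `John,Smith` shifts the row to four fields and is reported. A header like `userid` has no word boundary to detect, so it still needs a manual mapping to `user_id`.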
Handle missing and invalid values carefully
Missing values are not always just blank cells. Sometimes they appear as `N/A`, `null`, `unknown`, `-`, `?` or whitespace. Normalize these patterns first so you can decide whether to drop, fill or flag them.
You should also check for impossible values. Negative ages, future dates, broken categories and empty labels are signs that a dataset needs validation rules, not just reformatting.
- Replace inconsistent missing-value markers with one standard representation.
- Drop rows only when the missing information makes the record unusable.
- Use domain knowledge before filling values with averages or defaults.
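One way to apply the first step is sketched below in plain Python. The marker set mirrors the examples above. The `age` column, the 0 to 120 range and the helper names are assumptions for illustration only, and the right bounds depend on your domain.

```python
# Markers treated as missing, compared after trimming and lowercasing.
MISSING_MARKERS = {"", "n/a", "null", "unknown", "-", "?"}


def normalize_missing(value):
    """Map known missing-value markers to None; otherwise return the trimmed value."""
    cleaned = value.strip()
    if cleaned.lower() in MISSING_MARKERS:
        return None
    return cleaned


def find_invalid_ages(rows, column="age", low=0, high=120):
    """Return indexes of rows whose age is present but not a number in [low, high]."""
    invalid = []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value is None:
            continue  # missing values are handled separately from invalid ones
        try:
            age = float(value)
        except ValueError:
            invalid.append(i)
            continue
        if not (low <= age <= high):
            invalid.append(i)
    return invalid


# Hypothetical rows as they might come out of csv.DictReader.
raw_rows = [
    {"age": " 34 ", "city": "Oslo"},
    {"age": "N/A", "city": "  "},
    {"age": "-5", "city": "unknown"},
    {"age": "?", "city": "Lima"},
]
clean_rows = [{k: normalize_missing(v) for k, v in row.items()} for row in raw_rows]
print(find_invalid_ages(clean_rows))  # [2]
```

The sketch only flags rows and leaves the decision to drop, fill or keep them to you, which keeps the domain judgment described above out of the code.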
Remove duplicates and normalize text
Duplicate records can overweight some patterns and distort training results. This is especially important in customer lists, support tickets and manually exported spreadsheet data.
Text values should also be normalized where useful. Extra spaces, inconsistent case, different abbreviations and trailing punctuation can create artificial categories that mean the same thing.
- Remove exact duplicate rows before splitting the dataset.
- Trim whitespace from all text fields.
- Standardize repeated labels such as `yes/Yes/YES` into one version.
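The list above could look roughly like this in plain Python. The helper names and sample rows are hypothetical. Lowercasing and stripping trailing punctuation are assumptions that suit categorical fields, not free text you intend to feed to a model verbatim.

```python
def normalize_text(value):
    """Trim, collapse internal whitespace, drop trailing punctuation, lowercase."""
    collapsed = " ".join(value.split())
    return collapsed.rstrip(".,;:!").strip().lower()


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        # Sort items so that key order inside the dict does not matter.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: the first two differ only in spacing, case and punctuation.
rows = [
    {"customer": "Acme Corp.", "subscribed": "Yes"},
    {"customer": "  acme   corp", "subscribed": "YES "},
    {"customer": "Globex", "subscribed": "no"},
]
normalized = [{k: normalize_text(v) for k, v in row.items()} for row in rows]
unique = drop_exact_duplicates(normalized)
print(len(rows), "->", len(unique))  # 3 -> 2
```

Normalizing before deduplicating matters here: the first two rows are not exact duplicates in the raw data and only collapse once the text is cleaned. Different abbreviations (for example `Corp` and `Corporation`) are not handled and would need an explicit mapping.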
Check target labels and leakage risks
If you are training a supervised model, the label column needs special attention. Inconsistent class names, missing labels or mixed label definitions can make evaluation results unreliable.
You should also look for leakage. A feature that directly reveals the answer may boost metrics during testing but fail in real-world use. Data cleaning is not only about formatting; it is also about protecting the integrity of the experiment.
- Verify that labels use one clean set of class names.
- Remove columns that reveal future information or direct answers.
- Review duplicates across train and test data to avoid leakage.
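A rough sketch of the label and overlap checks follows. The function names, the `label` and `text` columns and the allowed class set are all assumptions for illustration. Spotting columns that leak future information still requires knowing how each field was produced, so it is not automated here.

```python
def unexpected_labels(rows, label_column, allowed):
    """Return label values outside the allowed classes (None means a missing label)."""
    found = {row.get(label_column) for row in rows}
    return sorted(found - set(allowed), key=str)


def train_test_overlap(train_rows, test_rows, feature_columns):
    """Return feature tuples that appear in both the train and the test rows."""
    def key(row):
        return tuple(row[c] for c in feature_columns)

    train_keys = {key(r) for r in train_rows}
    test_keys = {key(r) for r in test_rows}
    return sorted(train_keys & test_keys)


# Hypothetical support-ticket data.
train = [
    {"text": "refund please", "label": "billing"},
    {"text": "app crashes", "label": "Bug"},
    {"text": "reset password", "label": "account"},
]
test = [
    {"text": "app crashes", "label": "bug"},
    {"text": "where is my invoice", "label": None},
]
allowed = {"billing", "bug", "account"}

print(unexpected_labels(train + test, "label", allowed))  # ['Bug', None]
print(train_test_overlap(train, test, ["text"]))          # [('app crashes',)]
```

The overlap check compares features only, not labels, so it also catches the case shown here where the same ticket appears in both splits under differently spelled classes.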
Export a cleaner dataset for the next step
Once the data is stable, export a clean working version and keep the raw file unchanged as a reference. This makes your workflow easier to reproduce and audit later.
At this stage, simple browser-based tools can help with conversions, deduplication and validation before you move into notebooks or pipelines.
- Keep one raw CSV and one cleaned CSV version.
- Document the transformations you applied.
- Run your train/test split only after the cleanup is complete.
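The export step might be sketched like this with the standard library. The `.clean.csv` and `.clean.log.json` naming, the helper names and the 80/20 split are assumptions, not a required convention. The sketch also assumes `rows` is non-empty and that every row has the same keys.

```python
import csv
import json
import random
import tempfile
from pathlib import Path


def export_clean(rows, raw_path, steps):
    """Write rows next to the raw file as <name>.clean.csv, plus a JSON log of steps.

    The raw file itself is never opened for writing.
    """
    raw_path = Path(raw_path)
    clean_path = raw_path.with_name(raw_path.stem + ".clean.csv")
    log_path = raw_path.with_name(raw_path.stem + ".clean.log.json")
    with clean_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    log = {"source": raw_path.name, "steps": steps}
    log_path.write_text(json.dumps(log, indent=2), encoding="utf-8")
    return clean_path, log_path


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows with a fixed seed and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "tickets.csv"
    raw.write_text("text,label\n app crashes ,Bug\n", encoding="utf-8")
    cleaned = [{"text": "app crashes", "label": "bug"}]
    clean_path, log_path = export_clean(
        cleaned, raw, ["trimmed whitespace", "lowercased labels"]
    )
    print(clean_path.name, log_path.name)  # tickets.clean.csv tickets.clean.log.json

train_rows, test_rows = split_rows(list(range(10)), test_fraction=0.2, seed=42)
print(len(train_rows), len(test_rows))  # 8 2
```

Fixing the seed makes the split repeatable, and keeping the step log beside the cleaned file gives you a record of what changed between the raw and cleaned versions.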