Best Practices for Data Cleaning

Here are some key guidelines to follow when cleaning your data:

Keep a detailed log of all modifications made to the dataset.
Store this documentation alongside the data for transparency and reproducibility.
Documenting your cleaning steps is not optional: it is what separates scientific work from data dredging. Without clear records, your analysis becomes difficult to verify, reproduce, or trust.
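
The logging practice described above could be sketched as follows. This is a minimal illustration using only the Python standard library; the `log_step` helper, the field names, and the example entries are assumptions made for the example, not a prescribed format.

```python
import json
from datetime import datetime, timezone

# Hypothetical in-memory log: one entry per modification made to the dataset
cleaning_log = []

def log_step(action, column, rows_affected, reason):
    """Append one cleaning step to the log with a UTC timestamp."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "column": column,
        "rows_affected": rows_affected,
        "reason": reason,
    }
    cleaning_log.append(entry)
    return entry

# Example entries (made up for illustration)
log_step("recode", "age", 3, "replaced -999 sentinel with missing")
log_step("drop_column", "notes_2", 0, "column was entirely empty")

# Serialize the log so it can be saved next to the dataset,
# for instance in a file stored in the same folder as the data
log_text = json.dumps(cleaning_log, indent=2)
print(log_text)
```

Writing the log as JSON (or any plain-text format) keeps it readable without special software and easy to archive with the data.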

Identify how missing values are encoded (e.g., NA, -999, "I don't know", "Prefer not to answer").
Distinguish between different types of missingness (e.g., refusal to answer vs. lack of knowledge).
Decide on appropriate strategies: imputation, removal, or flagging.
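
As an illustration of the flagging strategy, the sketch below separates a value from the reason it is missing, so that a refusal and a lack of knowledge are not merged into one generic blank. The `MISSING_CODES` codebook and the category labels are hypothetical; a real project would take them from its own questionnaire documentation.

```python
# Hypothetical codebook: maps each raw missing-value code to a missingness type
MISSING_CODES = {
    "NA": "not_available",
    "-999": "not_available",
    "I don't know": "lack_of_knowledge",
    "Prefer not to answer": "refusal",
}

def split_missing(raw):
    """Return (value, missing_reason); exactly one of the two is None."""
    text = raw.strip()
    if text == "":
        return None, "not_available"
    if text in MISSING_CODES:
        return None, MISSING_CODES[text]
    return text, None

responses = ["42", "-999", "Prefer not to answer", "I don't know", "37"]
cleaned = [split_missing(r) for r in responses]

for raw, (value, reason) in zip(responses, cleaned):
    print(f"{raw!r:25} value={value!r:6} missing_reason={reason!r}")
```

Keeping the reason in its own field leaves the later choice between imputation and removal open, because no information has been discarded yet.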

Ensure that each row represents an observation and each column a variable.
Column names should be on a single line and clearly labeled.
Remove empty columns and rows unless there is a documented reason to keep them.
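
The structural checks above can be sketched with the standard `csv` module. The sample data and column names are invented for the example, and the definition of "empty" (only whitespace or nothing) is an assumption.

```python
import csv
import io

# Made-up sample: one fully empty row and one fully empty column ("comment")
raw = """id,age,comment,city
1,34,,Lyon
,,,
2,29,,Paris
"""

rows = list(csv.DictReader(io.StringIO(raw)))

def is_blank(value):
    """Treat None and whitespace-only strings as empty."""
    return value is None or value.strip() == ""

# Drop rows where every cell is blank
rows = [r for r in rows if not all(is_blank(v) for v in r.values())]

# Find columns that are blank in every remaining row
empty_columns = [c for c in rows[0] if all(is_blank(r[c]) for r in rows)]

# Keep one observation per row and only the non-empty columns
tidy = [{c: v for c, v in r.items() if c not in empty_columns} for r in rows]

print("dropped columns:", empty_columns)
print(tidy)
```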

Each column should contain a single, consistent data type (e.g., numeric, categorical, date).
Convert or correct types as needed (e.g., strings to dates, floats to integers).
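
A small sketch of the two conversions mentioned above (strings to dates, floats to integers). The helper names and the day/month/year date format are assumptions for the example; the key idea is that a conversion should fail loudly instead of silently altering a value.

```python
from datetime import date, datetime

def to_int(text):
    """Convert '12' or '12.0' to 12; reject values with a fractional part."""
    number = float(text)
    if not number.is_integer():
        raise ValueError(f"not a whole number: {text!r}")
    return int(number)

def to_date(text, fmt="%d/%m/%Y") -> date:
    """Parse a date string; the day/month/year format is an assumption."""
    return datetime.strptime(text, fmt).date()

# Made-up record in which every field was read as text
record = {"visits": "12.0", "enrolled": "18/09/2025"}

typed = {
    "visits": to_int(record["visits"]),
    "enrolled": to_date(record["enrolled"]),
}
print(typed)
```

Rejecting `"12.5"` instead of truncating it to 12 makes the unexpected value visible, so it can be investigated and recorded in the cleaning log.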

After importing/exporting data, check that the number of rows and columns remains unchanged.
Compare summary statistics before and after transformations.
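
The round-trip check above might look like the following sketch. An in-memory buffer stands in for a real file, and the `shape` and `mean_score` helpers and the sample values are hypothetical.

```python
import csv
import io

def shape(rows):
    """Return (number of rows, number of columns) for a list of dicts."""
    n_columns = len(rows[0]) if rows else 0
    return len(rows), n_columns

def mean_score(rows):
    """Summary statistic used to compare the data before and after."""
    return sum(float(r["score"]) for r in rows) / len(rows)

original = [
    {"id": "1", "score": "4.5"},
    {"id": "2", "score": "3.0"},
    {"id": "3", "score": "5.0"},
]

# Export to CSV text, then import it again (in-memory stand-in for a file)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(original[0]))
writer.writeheader()
writer.writerows(original)
buffer.seek(0)
reloaded = list(csv.DictReader(buffer))

# Fail immediately if the round trip changed the data
assert shape(reloaded) == shape(original), "row or column count changed"
assert mean_score(reloaded) == mean_score(original), "summary statistic changed"
print("shape preserved:", shape(reloaded))
```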

Use descriptive statistics (min, max, mean, quantiles) and visualizations (boxplots, histograms).
Identify outliers or impossible values (e.g., negative heart rate).
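
As a sketch of these checks, the example below computes descriptive statistics with the standard `statistics` module and flags suspicious values. The heart-rate values are made up, and the 1.5 × IQR threshold is one common convention, not the only valid rule.

```python
import statistics

# Made-up heart rates (beats per minute), including two suspicious entries
heart_rates = [62, 71, 68, -5, 75, 66, 240, 70, 64, 73]

summary = {
    "min": min(heart_rates),
    "max": max(heart_rates),
    "mean": statistics.mean(heart_rates),
    "quartiles": statistics.quantiles(heart_rates, n=4),
}

# Impossible values: a heart rate cannot be negative
impossible = [x for x in heart_rates if x < 0]

# Outliers by the 1.5 * IQR rule (a convention, chosen here as an assumption)
q1, _, q3 = summary["quartiles"]
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in heart_rates if x < low or x > high]

print(summary)
print("impossible:", impossible)
print("outliers:", outliers)
```

An impossible value is an error to correct or mark as missing, while an outlier may be a genuine observation, so the two lists usually call for different decisions.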

Cross-check related variables for contradictions (e.g., Marital Status = Single but Spouse Name is filled).
Use validation rules to flag inconsistencies.
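
Validation rules can be written as small named checks that flag records instead of changing them. In the sketch below, the field names and the second rule are hypothetical; only the marital status example comes from the guideline above.

```python
# Each rule is (name, predicate); the predicate returns True
# when a record is inconsistent.
RULES = [
    # Rule taken from the guideline: single, yet a spouse name is filled in
    ("single_with_spouse",
     lambda r: r["marital_status"] == "Single" and r["spouse_name"].strip() != ""),
    # Hypothetical extra rule, shown only to illustrate adding more checks
    ("unknown_marital_status",
     lambda r: r["marital_status"] not in {"Single", "Married", "Divorced", "Widowed"}),
]

def validate(records):
    """Return a list of (row_index, rule_name) for every violated rule."""
    flags = []
    for index, record in enumerate(records):
        for name, is_inconsistent in RULES:
            if is_inconsistent(record):
                flags.append((index, name))
    return flags

# Made-up records
records = [
    {"marital_status": "Single", "spouse_name": ""},
    {"marital_status": "Single", "spouse_name": "Alex Martin"},
    {"marital_status": "Married", "spouse_name": "Sam Dubois"},
]

flags = validate(records)
print(flags)
```

Flagging keeps the decision (correct, remove, or leave as is) with the researcher, and the list of flags can go straight into the cleaning log.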
