January 21, 2026
Best Practices for Data Cleaning
Here are some key guidelines to follow when cleaning your data:
1. Document All Changes
- Keep a detailed log of all modifications made to the dataset.
- Store this documentation alongside the data for transparency and reproducibility.
- Documenting your cleaning steps is not optional — it’s what separates scientific work from data dredging. Without clear records, your analysis becomes difficult to verify, reproduce, or trust.
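
One possible way to keep such a log is sketched below. The function name log_change and the field names are illustrative assumptions, not a prescribed format; any structure that records what was changed, when, and how many rows were affected would serve the same purpose.

```python
import json
from datetime import datetime, timezone

def log_change(log, step, description, rows_affected):
    """Append one cleaning step to an in-memory change log (hypothetical helper)."""
    log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "description": description,
        "rows_affected": rows_affected,
    })
    return log

change_log = []
log_change(change_log, "recode_missing", "Replaced -999 with None in 'age'", 3)
log_change(change_log, "drop_duplicates", "Removed exact duplicate rows", 1)

# Serialize the log so it can be stored alongside the dataset
log_json = json.dumps(change_log, indent=2)
```

Writing log_json to a file next to the data keeps the record and the dataset together.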
2. Understand the Data
- Review the data description and metadata to ensure consistency.
- Verify that the data complies with relevant regulations (e.g., GDPR).
3. Handle Missing Values Carefully
- Identify how missing values are encoded (e.g., NA, -999, "I don't know", "Prefer not to answer").
- Distinguish between different types of missingness (e.g., refusal to answer vs. lack of knowledge).
- Decide on appropriate strategies: imputation, removal, or flagging.
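
As a minimal sketch of the flagging strategy, the snippet below separates each raw answer into a value and a missingness reason. The mapping MISSING_CODES and the category names are assumptions chosen for illustration; the actual codes should come from the dataset's own documentation.

```python
# Hypothetical mapping from raw codes to a missingness category
MISSING_CODES = {
    "NA": "not_available",
    "-999": "not_available",
    "I don't know": "unknown",
    "Prefer not to answer": "refused",
}

def split_missing(raw):
    """Return (value, missing_reason); exactly one of the two is None."""
    text = raw.strip()
    if text == "":
        return None, "not_available"
    if text in MISSING_CODES:
        return None, MISSING_CODES[text]
    return text, None

answers = ["42", "-999", "Prefer not to answer", "I don't know", ""]
cleaned = [split_missing(a) for a in answers]
```

Keeping the reason in a separate field preserves the difference between a refusal and a lack of knowledge, which a single NA would erase.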
4. Check Data Structure
- Ensure that each row represents an observation and each column a variable.
- Column names should occupy a single header row and be clearly labeled.
- Remove empty columns and rows unless there is a documented reason to keep them.
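
The removal of empty rows and columns could look roughly like this. The helper drop_empty is a hypothetical name, and the sketch assumes a rectangular table held as a header list plus a list of rows.

```python
def drop_empty(header, rows):
    """Remove rows and columns in which every cell is empty or whitespace."""
    def is_blank(cell):
        return cell is None or str(cell).strip() == ""

    # Drop fully blank rows first, then judge columns on the remaining rows
    rows = [r for r in rows if not all(is_blank(c) for c in r)]
    keep = [i for i in range(len(header))
            if not all(is_blank(r[i]) for r in rows)]
    return [header[i] for i in keep], [[r[i] for i in keep] for r in rows]

header = ["id", "notes", "age"]
rows = [["1", "", "34"], ["", "", ""], ["2", "", "51"]]
new_header, new_rows = drop_empty(header, rows)
```

Here the blank second row and the entirely empty notes column are dropped, while id and age are kept.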
5. Verify Data Types
- Each column should contain a single, consistent data type (e.g., numeric, categorical, date).
- Convert or correct types as needed (e.g., strings to dates, floats to integers).
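
A small sketch of type checking before conversion follows. The helpers to_int, to_date, and find_bad_values are illustrative names, and the date format is an assumption; the idea is to list values that cannot be converted rather than coerce them silently.

```python
from datetime import datetime

def to_int(text):
    """Convert '12' or '12.0' to 12; raise if a fractional part would be lost."""
    number = float(text)
    if not number.is_integer():
        raise ValueError(f"not a whole number: {text!r}")
    return int(number)

def to_date(text, fmt="%Y-%m-%d"):
    """Parse a date string; the ISO-style format is an assumed default."""
    return datetime.strptime(text, fmt).date()

def find_bad_values(values, converter):
    """Return (index, value) pairs that the converter rejects."""
    bad = []
    for index, value in enumerate(values):
        try:
            converter(value)
        except ValueError:
            bad.append((index, value))
    return bad

visits = ["3", "4.0", "2.5", "five"]
bad_visits = find_bad_values(visits, to_int)
first_visit = to_date("2026-01-21")
```

Values reported in bad_visits can then be corrected at the source or recorded as exceptions in the change log.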
6. Validate Data Integrity
- After importing/exporting data, check that the number of rows and columns remains unchanged.
- Compare summary statistics before and after transformations.
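
The row and column check can be illustrated with a CSV round trip. This sketch writes to an in-memory buffer instead of a file, and the helper name shape is an assumption.

```python
import csv
import io

def shape(rows):
    """Return (number of rows, number of columns) for a list of rows."""
    return len(rows), (len(rows[0]) if rows else 0)

original = [["id", "country"], ["1", "Belgium"], ["2", "France, Metropolitan"]]

# Export and re-import through an in-memory CSV buffer
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(original)
buffer.seek(0)
reloaded = list(csv.reader(buffer))

# A field containing a comma is a typical cause of shifted columns
assert shape(reloaded) == shape(original), "row or column count changed"
```

The same comparison applies when moving between formats or tools; only the read and write calls change.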
7. Explore Numerical Variables
- Use descriptive statistics (min, max, mean, quantiles) and visualizations (boxplots, histograms).
- Identify outliers or impossible values (e.g., negative heart rate).
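
As a rough sketch, descriptive statistics and a plausibility check can be computed with Python's statistics module. The sample values and the plausible range of 20 to 250 beats per minute are assumptions made for the example, not clinical thresholds.

```python
import statistics

heart_rates = [62, 71, 58, -5, 80, 66, 300, 74]  # made-up sample data

summary = {
    "min": min(heart_rates),
    "max": max(heart_rates),
    "mean": statistics.mean(heart_rates),
    "quartiles": statistics.quantiles(heart_rates, n=4),
}

# Assumed plausible range; adjust to the variable being checked
LOW, HIGH = 20, 250
impossible = [(i, v) for i, v in enumerate(heart_rates) if not LOW <= v <= HIGH]
```

The minimum and maximum already reveal the two suspicious values here; the impossible list gives their positions so they can be reviewed.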
8. Review Categorical Variables
- Check for inconsistent labels (e.g., "Belgium" vs. "belgium").
- Standardize categories and remove duplicates or typos.
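
One way to surface inconsistent labels is to normalize case and whitespace and then count the results, as sketched below. The function normalize_label and the sample list are illustrative assumptions.

```python
from collections import Counter

def normalize_label(label):
    """Collapse whitespace and ignore case so equivalent labels match."""
    return " ".join(label.split()).casefold()

countries = ["Belgium", "belgium", " Belgium ", "France", "FRANCE", "Belgum"]
counts = Counter(normalize_label(c) for c in countries)

# Labels that occur only once are candidates for manual review (possible typos)
rare = [label for label, n in counts.items() if n == 1]
```

Normalization merges case and spacing variants automatically. A typo such as "Belgum" still needs a human decision, which is why it is only flagged.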
9. Detect Duplicates
- Look for and remove duplicate rows or records unless justified.
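
A minimal sketch of exact-duplicate detection is shown below. It reports each duplicate together with the row it repeats so the decision to remove it can be documented; find_duplicates is a hypothetical helper and the rows are made-up data.

```python
def find_duplicates(rows):
    """Return (duplicate_index, first_index) pairs for exactly repeated rows."""
    first_seen = {}
    duplicates = []
    for index, row in enumerate(rows):
        key = tuple(row)
        if key in first_seen:
            duplicates.append((index, first_seen[key]))
        else:
            first_seen[key] = index
    return duplicates

rows = [["1", "Ana"], ["2", "Ben"], ["1", "Ana"], ["3", "Cy"], ["2", "Ben"]]
duplicates = find_duplicates(rows)

# Keep the first occurrence of each row
duplicate_indexes = {index for index, _ in duplicates}
unique_rows = [row for index, row in enumerate(rows)
               if index not in duplicate_indexes]
```

This only catches exact matches. Near-duplicates, such as the same person entered with different spellings, need a comparison on selected key columns.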
10. Ensure Logical Consistency
- Cross-check related variables for contradictions (e.g., Marital Status = Single but Spouse Name is filled).
- Use validation rules to flag inconsistencies.
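
Validation rules can be written as small named checks applied to every record, as in the sketch below. The field names marital_status and spouse_name, the rule name, and the sample records are assumptions for illustration.

```python
# Each rule is (name, predicate); the predicate returns True when a record
# violates the rule. Field names are hypothetical.
RULES = [
    ("single_with_spouse",
     lambda r: r["marital_status"] == "Single" and r["spouse_name"].strip() != ""),
]

def validate(records, rules):
    """Return (record_index, rule_name) for every rule violation found."""
    return [(index, name)
            for index, record in enumerate(records)
            for name, violates in rules
            if violates(record)]

records = [
    {"marital_status": "Single", "spouse_name": ""},
    {"marital_status": "Single", "spouse_name": "Alex"},
    {"marital_status": "Married", "spouse_name": "Sam"},
]
flags = validate(records, RULES)
```

Flagged records are reported and left unchanged, because only the data owner can say which of the two contradictory fields is wrong.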