Data Cleaning

Data cleaning (also known as data cleansing or data scrubbing) is the process of identifying and correcting (or removing) errors and inconsistencies in data to improve its quality and reliability. Clean data is essential for accurate analysis, modeling, and decision-making. The goal is a dataset that is as complete and consistent as possible, with any remaining gaps documented, so that it is ready for use.

Best Practices for Data Cleaning

Here are some key guidelines to follow when cleaning your data:

1. Document All Changes

  • Keep a detailed log of all modifications made to the dataset.
  • Store this documentation alongside the data for transparency and reproducibility.
  • Documenting your cleaning steps is not optional: it is what separates scientific work from data dredging. Without clear records, your analysis becomes difficult to verify, reproduce, or trust.
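
One way to keep such a log is to record every cleaning step at the moment it is applied. The sketch below uses only the Python standard library; the field names, the step name, and the example rows are assumptions made for illustration, not a prescribed format.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed log layout; adapt the fields to your project.
LOG_FIELDS = ["timestamp", "step", "reason", "rows_before", "rows_after"]

def log_change(log, step, reason, rows_before, rows_after):
    """Append one cleaning step to an in-memory change log."""
    log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "step": step,
        "reason": reason,
        "rows_before": rows_before,
        "rows_after": rows_after,
    })

def write_log(log, stream):
    """Write the log as CSV so it can be stored next to the dataset."""
    writer = csv.DictWriter(stream, fieldnames=LOG_FIELDS)
    writer.writeheader()
    writer.writerows(log)

# Hypothetical cleaning step: drop rows with a negative heart rate.
rows = [{"id": 1, "heart_rate": 72}, {"id": 2, "heart_rate": -5}]
cleaned = [r for r in rows if r["heart_rate"] >= 0]

change_log = []
log_change(change_log, "drop_negative_heart_rate",
           "heart rate cannot be negative", len(rows), len(cleaned))

buffer = io.StringIO()  # stand-in for a file saved alongside the data
write_log(change_log, buffer)
print(buffer.getvalue())
```

Recording the row counts before and after each step makes it easy to see later how much data a given decision removed.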

2. Understand the Data

  • Review the data description and metadata, and check that variable names, units, and coding schemes match what the dataset actually contains.
  • Verify that the data complies with relevant regulations (e.g., GDPR).
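
Checking the data against its description can be partly automated. The sketch below compares the column names of a dataset with a data dictionary; both the dictionary and the column list are hypothetical examples.

```python
# Hypothetical data dictionary: variable name -> documented meaning or unit.
data_dictionary = {
    "id": "participant identifier",
    "age": "age in years",
    "weight_kg": "body weight in kilograms",
}

# Hypothetical header row read from the data file.
dataset_columns = ["id", "age", "weight", "country"]

# Columns present in the data but not described anywhere.
undocumented = sorted(set(dataset_columns) - set(data_dictionary))

# Variables that are documented but absent from the data.
missing_from_data = sorted(set(data_dictionary) - set(dataset_columns))

print("Undocumented columns:", undocumented)
print("Documented but missing:", missing_from_data)
```

Here the mismatch between weight and weight_kg is the kind of discrepancy worth resolving before any analysis, since the unit of the undocumented column is unknown.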

3. Handle Missing Values Carefully

  • Identify how missing values are encoded (e.g., NA, -999, "I don't know", "Prefer not to answer").
  • Distinguish between different types of missingness (e.g., refusal to answer vs. lack of knowledge).
  • Decide on appropriate strategies: imputation, removal, or flagging.
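
The first two points can be sketched as a small decoding step that replaces each missing-value code with an empty value while keeping the reason. The mapping below is an assumption for illustration; in particular, treating NA and -999 as "not recorded" must be confirmed against your own codebook.

```python
# Hypothetical codebook: which raw values mean "missing", and why.
MISSING_CODES = {
    "NA": "not_recorded",
    "-999": "not_recorded",
    "I don't know": "unknown_to_respondent",
    "Prefer not to answer": "refused",
}

def decode_missing(raw):
    """Return (value, missing_reason); value is None when the entry is missing."""
    text = raw.strip()
    if text == "":
        return None, "not_recorded"
    if text in MISSING_CODES:
        return None, MISSING_CODES[text]
    return text, None

responses = ["42", "NA", "Prefer not to answer", "-999", "I don't know", ""]
decoded = [decode_missing(r) for r in responses]

for raw, (value, reason) in zip(responses, decoded):
    print(repr(raw), "->", value, reason)
```

Keeping the reason in a separate field means a refusal and a lack of knowledge stay distinguishable after cleaning, which leaves the choice between imputation, removal, and flagging open for each type.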

4. Check Data Structure

  • Ensure that each row represents an observation and each column a variable.
  • Keep column names on a single header row and make them descriptive.
  • Remove empty columns and rows unless there is a documented reason to keep them.
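
As a minimal sketch of the last point, the following reads a small CSV export and drops rows and columns that contain no values. The sample data is made up, and an in-memory string stands in for a real file.

```python
import csv
import io

# Made-up export with one blank row and one column ("notes") that is always empty.
raw = "id,age,notes,country\n1,34,,Belgium\n,,,\n2,29,,France\n"

rows = list(csv.DictReader(io.StringIO(raw)))

def is_blank(value):
    """Treat None and whitespace-only strings as empty cells."""
    return value is None or value.strip() == ""

# Drop rows in which every cell is blank.
non_empty_rows = [r for r in rows if not all(is_blank(v) for v in r.values())]

# Keep only columns that hold at least one value.
kept_columns = [
    col for col in rows[0].keys()
    if any(not is_blank(r[col]) for r in non_empty_rows)
]

tidy = [{col: r[col] for col in kept_columns} for r in non_empty_rows]
print(kept_columns)
print(tidy)
```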

5. Verify Data Types

  • Each column should contain a single, consistent data type (e.g., numeric, categorical, date).
  • Convert or correct types as needed (e.g., strings to dates, floats to integers).
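
Type conversion is safest when values that cannot be converted are reported rather than silently changed. The sketch below assumes dates are expected in YYYY-MM-DD form and that whole numbers may have been stored as floats; the helper names and sample values are illustrative.

```python
from datetime import datetime

def to_int(text):
    """Convert '34' or '34.0' to 34; raise ValueError for '34.5' or 'abc'."""
    number = float(text)
    if not number.is_integer():
        raise ValueError(f"not a whole number: {text!r}")
    return int(number)

def to_date(text):
    """Parse a date written as YYYY-MM-DD (assumed format)."""
    return datetime.strptime(text, "%Y-%m-%d").date()

def convert_column(values, converter):
    """Return (converted, errors); errors lists (index, raw value) pairs."""
    converted, errors = [], []
    for i, raw in enumerate(values):
        try:
            converted.append(converter(raw))
        except ValueError:
            converted.append(None)   # leave a gap instead of guessing
            errors.append((i, raw))
    return converted, errors

ages, age_errors = convert_column(["34", "29.0", "abc", "41.5"], to_int)
visits, visit_errors = convert_column(["2024-03-15", "15/03/2024"], to_date)

print(ages, age_errors)
print(visits, visit_errors)
```

The error lists show exactly which entries need a manual decision, for example whether 41.5 is a typo or a sign that the column should not be an integer at all.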

6. Validate Data Integrity

  • After importing/exporting data, check that the number of rows and columns remains unchanged.
  • Compare summary statistics before and after transformations.
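
Both checks can be sketched as a round trip: export the data, read it back, and compare the shape and a few summary statistics. The column names and values below are made up, and an in-memory buffer stands in for the exported file.

```python
import csv
import io
import statistics

def shape(rows):
    """Return (number of rows, number of columns)."""
    return len(rows), len(rows[0]) if rows else 0

def summary(rows, column):
    """Basic summary statistics for one numeric column stored as text."""
    values = [float(r[column]) for r in rows]
    return {"n": len(values), "min": min(values), "max": max(values),
            "mean": statistics.fmean(values)}

original = [{"id": "1", "weight_kg": "70.5"},
            {"id": "2", "weight_kg": "82.0"},
            {"id": "3", "weight_kg": "65.2"}]

# Export to CSV, then re-import.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "weight_kg"])
writer.writeheader()
writer.writerows(original)
buffer.seek(0)
reloaded = list(csv.DictReader(buffer))

same_shape = shape(original) == shape(reloaded)
same_summary = summary(original, "weight_kg") == summary(reloaded, "weight_kg")
print("shape preserved:", same_shape)
print("summary preserved:", same_summary)
```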

7. Explore Numerical Variables

  • Use descriptive statistics (min, max, mean, quantiles) and visualizations (boxplots, histograms).
  • Identify outliers or impossible values (e.g., negative heart rate).
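
The sketch below separates two kinds of suspicious values: statistical outliers, found here with the 1.5 × IQR fences that boxplots commonly use, and impossible values, which violate a hard limit. The heart rates are made up, and the "must be greater than zero" limit is an assumption for illustration.

```python
import statistics

# Made-up heart rates in beats per minute.
heart_rates = [62, 71, 68, -4, 75, 80, 66, 210, 73, 69]

print("min:", min(heart_rates), "max:", max(heart_rates))
print("mean:", statistics.fmean(heart_rates))

# Quartiles and the 1.5 * IQR fences.
q1, median, q3 = statistics.quantiles(heart_rates, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outliers are unusual given the rest of the data; they need a closer look.
outliers = sorted(v for v in heart_rates if v < lower_fence or v > upper_fence)

# Impossible values break a physical limit regardless of the distribution.
impossible = sorted(v for v in heart_rates if v <= 0)

print("outliers:", outliers)
print("impossible:", impossible)
```

An outlier is not automatically an error: a very high heart rate may be a genuine measurement, whereas a negative one cannot be.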

8. Review Categorical Variables

  • Check for inconsistent labels (e.g., "Belgium" vs. "belgium").
  • Standardize categories: merge duplicate labels that refer to the same category and correct typos.
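
A simple way to standardize labels is to normalize case and whitespace first, then map each normalized label to one canonical spelling. The mapping below, including the typo "belgum", is a hypothetical example; labels that are not in the mapping are left for manual review rather than guessed.

```python
from collections import Counter

# Hypothetical mapping from normalized label to canonical category.
CANONICAL = {
    "belgium": "Belgium",
    "belgum": "Belgium",   # known typo
    "france": "France",
}

def standardize(label):
    """Return the canonical label, or None if the label needs manual review."""
    key = " ".join(label.split()).casefold()  # trim, collapse spaces, ignore case
    return CANONICAL.get(key)

countries = ["Belgium", "belgium", " Belgium ", "BELGIUM", "Belgum", "France", "Fr."]

print(Counter(countries))  # raw labels: many variants of the same category
cleaned = [standardize(c) for c in countries]
print(Counter(cleaned))

needs_review = [c for c in countries if standardize(c) is None]
print("needs review:", needs_review)
```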

9. Detect Duplicates

  • Look for duplicate rows or records and remove them unless keeping them is justified.
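
The sketch below distinguishes exact duplicates, which can usually be removed, from records that merely share an identifier, which may be legitimate (for example, repeated visits). The records and field names are made up.

```python
from collections import Counter

records = [
    {"id": "1", "name": "Ana", "visit": "2024-01-10"},
    {"id": "2", "name": "Ben", "visit": "2024-01-11"},
    {"id": "1", "name": "Ana", "visit": "2024-01-10"},  # exact duplicate
    {"id": "2", "name": "Ben", "visit": "2024-02-01"},  # same id, different visit
]

def row_key(row):
    """Hashable representation of a full row, independent of key order."""
    return tuple(sorted(row.items()))

# Rows that appear more than once with identical values in every column.
counts = Counter(row_key(r) for r in records)
exact_duplicates = [dict(k) for k, n in counts.items() if n > 1]

# Keep the first occurrence of each row, preserving the original order.
seen = set()
deduplicated = []
for r in records:
    key = row_key(r)
    if key not in seen:
        seen.add(key)
        deduplicated.append(r)

# Identifiers that still repeat: flag for review instead of deleting.
id_counts = Counter(r["id"] for r in deduplicated)
repeated_ids = sorted(i for i, n in id_counts.items() if n > 1)

print("exact duplicates:", exact_duplicates)
print("rows kept:", len(deduplicated))
print("ids to review:", repeated_ids)
```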

10. Ensure Logical Consistency

  • Cross-check related variables for contradictions (e.g., Marital Status = Single but Spouse Name is filled).
  • Use validation rules to flag inconsistencies.
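
Validation rules can be written as small named checks that are applied to every record. In the sketch below, the first rule follows the marital status example above; the second rule and all field names and records are hypothetical.

```python
# Each rule is (name, function that returns True when the record is inconsistent).
RULES = [
    ("single_with_spouse",
     lambda r: r["marital_status"] == "Single" and r["spouse_name"].strip() != ""),
    # Hypothetical second rule; ISO dates (YYYY-MM-DD) compare correctly as text.
    ("discharged_before_admitted",
     lambda r: r["discharged"] < r["admitted"]),
]

records = [
    {"id": 1, "marital_status": "Single", "spouse_name": "",
     "admitted": "2024-01-05", "discharged": "2024-01-09"},
    {"id": 2, "marital_status": "Single", "spouse_name": "Alex Martin",
     "admitted": "2024-02-01", "discharged": "2024-02-03"},
    {"id": 3, "marital_status": "Married", "spouse_name": "Sam Lee",
     "admitted": "2024-03-10", "discharged": "2024-03-01"},
]

def validate(records, rules):
    """Return a list of (record id, rule name) for every violated rule."""
    flags = []
    for record in records:
        for name, is_violated in rules:
            if is_violated(record):
                flags.append((record["id"], name))
    return flags

flags = validate(records, RULES)
for record_id, rule in flags:
    print(f"record {record_id}: {rule}")
```

The rules only flag records; deciding which of the contradicting values is wrong usually requires going back to the source.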

 
