Type, format and volume of data

Depending on your discipline and research design, the data you collect can vary widely in type, format, and volume.

This page outlines key considerations to help you manage these differences effectively throughout your research lifecycle.

1. Physical format and structure

Research data often comes as digital files containing numbers or text, but it may also take non-standard or non-digital forms, such as sound recordings, high-resolution images, video, biological samples, or archaeological artefacts.

Digital files may be:

  • Structured: Data organized in a tabular or relational model (e.g., spreadsheets, databases)
  • Unstructured: Content without a predefined schema (e.g., text corpora, multimedia, web content)

Regardless of data format:

  • Assign unique identifiers to each physical or digital item
  • Create a digital inventory with detailed descriptions and metadata to support traceability and reuse
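The two recommendations above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed workflow: the field names (`item_id`, `kind`, `description`, `location`) and the sample items are hypothetical, and random UUIDs are only one possible identifier scheme.

```python
import csv
import io
import uuid

# Hypothetical inventory fields; adapt them to your project's metadata needs.
FIELDS = ["item_id", "kind", "description", "location"]

def build_inventory(items):
    """Return inventory rows, giving each item a random UUID as its identifier."""
    rows = []
    for item in items:
        row = {"item_id": str(uuid.uuid4())}
        row.update(item)
        rows.append(row)
    return rows

def inventory_to_csv(rows):
    """Serialize the inventory to CSV text so it can be saved or shared."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Invented example items: one physical, one digital.
items = [
    {"kind": "physical", "description": "Soil sample, plot A", "location": "Freezer 2, shelf 3"},
    {"kind": "digital", "description": "Interview recording 01", "location": "raw/audio/interview_01.wav"},
]
inventory = build_inventory(items)
print(inventory_to_csv(inventory))
```

The same inventory covers physical and digital items, so a label on a sample tube and a file on a server can both be traced back to one row.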

2. Volume of data

The volume of data refers to the number of data points, items, or observations you collect, not just their total size in megabytes or gigabytes. Volume influences:

  • Data cleaning and preprocessing time
  • Complexity of data modeling or statistical analysis
  • Required tools and computing capacity

Examples:

  • High-volume, small-size data: survey responses from thousands of users (lightweight text files)
  • Low-volume, large-size data: a few high-resolution MRI scans or satellite images (heavy files)
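The contrast between the two examples can be made concrete with some simple arithmetic. The item counts and per-item sizes below are invented for illustration, and decimal units (1 MB = 1,000,000 bytes) are assumed.

```python
def describe(name, n_items, bytes_per_item):
    """Summarize a dataset by item count (volume) and total size in MB."""
    total_bytes = n_items * bytes_per_item
    return {"name": name, "items": n_items, "total_mb": total_bytes / 1_000_000}

# Hypothetical figures: many small records versus a few very large files.
survey = describe("survey responses", n_items=5_000, bytes_per_item=2_000)
scans = describe("MRI scans", n_items=5, bytes_per_item=200_000_000)

for dataset in (survey, scans):
    print(f"{dataset['name']}: {dataset['items']} items, {dataset['total_mb']:.0f} MB")
```

Under these assumptions the survey has a thousand times more items than the scan set but occupies a hundredth of the space, which is why volume and size are planned for separately.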

Anticipating volume helps:

  • Structure your data collection protocol
  • Select appropriate storage solutions and database models
  • Determine when automation or advanced data management tools are necessary

3. File size and storage needs

Data size refers to the amount of digital storage space your data occupies. This has direct implications for:

  • Storage infrastructure: local drives, institutional servers, or cloud services
  • Data backup strategies
  • Accessibility and transfer times

Estimate storage requirements in MB, GB, or TB at different project stages. Plan ahead for growth, especially in data-intensive disciplines like genomics, digital imaging, or remote sensing.
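A rough estimate can be computed from a handful of planning figures. The sketch below is one possible back-of-the-envelope model, not an institutional formula: the parameter names, the default `derived_factor` (space for processed files relative to raw data), and the default of three copies for backups are all assumptions to replace with your own numbers.

```python
def estimate_storage_gb(files_per_month, mb_per_file, months,
                        derived_factor=1.0, copies=3):
    """Estimate total storage in GB (decimal units, 1 GB = 1000 MB).

    derived_factor: extra space for processed/derived files, as a
                    fraction of the raw data (assumed value).
    copies:         number of copies kept, including backups (assumed value).
    """
    raw_gb = files_per_month * mb_per_file * months / 1000
    working_gb = raw_gb * (1 + derived_factor)
    return working_gb * copies

# Hypothetical project: 200 files a month at 50 MB each, over 24 months.
total_gb = estimate_storage_gb(files_per_month=200, mb_per_file=50, months=24)
print(f"Estimated storage: {total_gb:.0f} GB")
```

Re-running the estimate with the figures for each project stage (pilot, main collection, analysis) shows when a move from local drives to institutional or cloud storage becomes necessary.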

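Choose efficient file formats for large datasets. Binary formats and lossless compression can reduce file size and speed up access without loss of fidelity; lossy compression saves more space but discards information that cannot be recovered.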
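Whether a compression step preserves fidelity can be checked directly by comparing the restored bytes with the original. The sketch below uses gzip from the Python standard library as one example of lossless compression; the generated table is invented test data.

```python
import gzip

# Invented tabular data: 10,000 rows of an id and a temperature reading.
rows = ["id,temperature_c"] + [f"{i},{20 + (i % 7) * 0.5}" for i in range(10_000)]
original = "\n".join(rows).encode("utf-8")

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

print(f"original: {len(original)} bytes, compressed: {len(compressed)} bytes")
# Lossless means the round trip returns exactly the same bytes.
print("identical after round trip:", restored == original)
```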

4. Digital file formats

Choosing the right file format ensures long-term usability, interoperability, and preservation. Favor:

  • Open, non-proprietary formats (e.g., .csv, .xml, .json, .txt)
  • Lossless compression where data integrity is critical

When proprietary formats are required (e.g., .sav, .psd, .mat), also produce:

  • Portable copies in widely supported formats to safeguard accessibility

Selection criteria include:

  • Team expertise
  • Accepted standards in your research community
  • Compatibility with repository or funder requirements

5. How to select a data format (Adapted from ANDS)

Follow these best practices:

  • Decide early: Agree on formats before data collection begins
  • Compare proprietary and open formats for accessibility, functionality, and sustainability
  • Anticipate obsolescence: software and formats may not be supported forever
  • Dual-format storage: consider saving in both proprietary and open formats to reduce risk
  • Plan for conversion: high-resolution data may need to be converted to a lighter format for online display or transmission
  • Ask colleagues or your data steward about preferred formats in your field

Recommended universal backup formats: .csv, .tab, .txt, .rtf
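As an illustration of dual-format storage, the sketch below exports a database table to CSV text so that a plain-text copy exists alongside the working file. SQLite stands in here for whatever working format a project actually uses, and the table, column names, and values are hypothetical.

```python
import csv
import io
import sqlite3

def export_table_to_csv(conn, table):
    """Return the contents of one table as CSV text, header row first."""
    # The table name is interpolated directly, so only pass trusted names.
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid')
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()

# Hypothetical working database with two measurements.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sample_id TEXT, ph REAL)")
conn.executemany("INSERT INTO measurements VALUES (?, ?)", [("S1", 6.8), ("S2", 7.1)])

csv_text = export_table_to_csv(conn, "measurements")
print(csv_text)
```

A CSV copy keeps the values but not everything a richer format stores (data types, labels, relationships between tables), so it is worth saving a short description of the variables next to it.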

> Need help choosing a format? Consult DMP – data formats for preservation

6. Variable types in structured data

Correctly identifying variable types improves how your data is interpreted and analyzed by software tools.

Quantitative variables

  • Discrete: Whole numbers (e.g., number of publications)
  • Continuous: Real numbers on a scale (e.g., time, distance)

Qualitative (categorical) variables

  • Nominal: Unordered categories (e.g., language, country)
  • Ordinal: Ordered categories (e.g., Likert scale, academic level)

Many tools also support:

  • String or character variables: Free text entries, notes, open-ended responses

Clearly documenting variable types ensures accurate processing, facilitates interoperability, and supports statistical integrity.
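One way to document variable types is a small machine-readable codebook that software can check values against. The sketch below is an assumed, simplified structure rather than a standard: the variable names, category lists, and the `check_value` and `ordinal_rank` helpers are all hypothetical.

```python
# Hypothetical codebook: one entry per variable, using the types described above.
CODEBOOK = {
    "n_publications": {"type": "discrete"},
    "reaction_time_s": {"type": "continuous"},
    "country": {"type": "nominal", "categories": ["BE", "FR", "NL"]},
    "academic_level": {"type": "ordinal", "categories": ["bachelor", "master", "phd"]},
    "comments": {"type": "string"},
}

def check_value(variable, value):
    """Return True if the value is consistent with the variable's documented type."""
    spec = CODEBOOK[variable]
    kind = spec["type"]
    if kind == "discrete":
        return isinstance(value, int) and not isinstance(value, bool)
    if kind == "continuous":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind in ("nominal", "ordinal"):
        return value in spec["categories"]
    return isinstance(value, str)

def ordinal_rank(variable, value):
    """Return the position of a value in an ordinal variable's ordered categories."""
    return CODEBOOK[variable]["categories"].index(value)

print(check_value("n_publications", 3))    # a whole number fits a discrete variable
print(check_value("n_publications", 2.5))  # a fraction does not
print(ordinal_rank("academic_level", "master") < ordinal_rank("academic_level", "phd"))
```

Only the ordinal variable gets a meaningful rank; for a nominal variable such as `country` the order of the category list carries no information.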

7. Discipline-specific formats and integrated metadata

Certain research disciplines use specialized file formats that already integrate structured metadata directly within the file. These formats:

  • Facilitate automated metadata extraction
  • Enhance interoperability with community-specific tools and platforms
  • Support standardized documentation, boosting reuse and reproducibility

Common examples by discipline:

Discipline                | Format                     | Metadata features
--------------------------|----------------------------|-------------------------------------------------------------------
Social Sciences           | DDI (.xml)                 | Documents study-level metadata, variable-level details, methodology
Genomics / Bioinformatics | FASTQ, BAM, VCF            | Includes sequencing information, read quality, genome annotations
Geospatial Sciences       | GeoTIFF, NetCDF, Shapefile | Captures geolocation, spatial resolution, time stamps
Digital Humanities        | TEI (.xml)                 | Encodes text structure, annotations, provenance
Engineering / CAD         | STEP, IGES, DXF            | Stores design metadata, units, geometry standards
Astronomy                 | FITS                       | Integrates metadata headers with observational data
Imaging (Medical)         | DICOM                      | Embeds patient, modality, and capture metadata
Environmental Science     | HDF5, NetCDF               | Handles multidimensional sensor datasets with metadata
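To show what automated metadata extraction can look like, the sketch below reads the title and author from the header of a TEI document using only the Python standard library. The sample document is invented and heavily simplified; real TEI files carry far richer headers, and the other formats listed above generally need their own dedicated libraries.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented, minimal TEI document used only for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample letter</title>
        <author>Unknown correspondent</author>
      </titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Invented for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Body text.</p></body></text>
</TEI>"""

def extract_tei_metadata(xml_text):
    """Return the title and author recorded in the TEI header."""
    root = ET.fromstring(xml_text)
    title_stmt = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt", TEI_NS)
    return {
        "title": title_stmt.findtext("tei:title", namespaces=TEI_NS),
        "author": title_stmt.findtext("tei:author", namespaces=TEI_NS),
    }

metadata = extract_tei_metadata(SAMPLE)
print(metadata)
```

Because the metadata travels inside the file, the same function can be run over a whole folder of documents to build or verify an inventory without retyping anything.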

 
