January 21, 2026
Research data varies in format, volume, and size. This page outlines key considerations to help you manage these differences effectively throughout your research lifecycle.
1. Physical format and structure
Research data often comes as digital files containing numbers or text, but it may also include non-digital or non-standard materials, such as sound recordings, high-resolution images, video, biological samples, or archaeological artefacts.
Digital files may be:
- Structured: Data organized in a tabular or relational model (e.g., spreadsheets, databases)
- Unstructured: Content without a predefined schema (e.g., text corpora, multimedia, web content)
Regardless of data format:
- Assign unique identifiers to each physical or digital item
- Create a digital inventory with detailed descriptions and metadata to support traceability and reuse
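The two recommendations above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed schema: the `InventoryItem` fields, the sample items, and the use of random UUIDs as identifiers are all assumptions made for the example, and your project or repository may call for a different identifier scheme.

```python
import csv
import io
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class InventoryItem:
    """One physical or digital item in the inventory (hypothetical schema)."""
    description: str
    item_type: str   # e.g. "digital file" or "physical sample"
    location: str    # file path, shelf, freezer, etc.
    # A random UUID is assigned automatically so every item gets a unique identifier.
    item_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def inventory_to_csv(items):
    """Serialize the inventory to CSV text, one row per item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["item_id", "description", "item_type", "location"]
    )
    writer.writeheader()
    for item in items:
        writer.writerow(asdict(item))
    return buffer.getvalue()


# Illustrative items only; one digital, one physical.
items = [
    InventoryItem("Interview recording, participant 01", "digital file", "audio/p01.wav"),
    InventoryItem("Soil core, site A", "physical sample", "Freezer 2, rack 4"),
]
inventory_csv = inventory_to_csv(items)
print(inventory_csv)
```

Keeping the inventory in a plain CSV file means it can be opened without specialized software and deposited alongside the data.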
2. Volume of data
The volume of data refers to the number of data points, items, or observations you collect—not just their total size in megabytes or gigabytes. Volume influences:
- Data cleaning and preprocessing time
- Complexity of data modeling or statistical analysis
- Required tools and computing capacity
Examples:
- High-volume, small-size data: survey responses from thousands of users (lightweight text files)
- Low-volume, large-size data: a few high-resolution MRI scans or satellite images (heavy files)
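The contrast between volume and size in these two examples can be made concrete with a small calculation. The item counts and per-item sizes below are invented for illustration (they are not measurements), and `describe_dataset` is a hypothetical helper.

```python
def describe_dataset(name, n_items, bytes_per_item):
    """Report volume (number of items) and size (total bytes) separately."""
    return {
        "name": name,
        "volume_items": n_items,
        "size_bytes": n_items * bytes_per_item,
    }


# Assumed figures: 5,000 survey responses of about 2 KB each,
# versus 12 MRI scans of about 150 MB each.
survey = describe_dataset("survey responses", 5_000, 2_000)
scans = describe_dataset("MRI scans", 12, 150_000_000)

for dataset in (survey, scans):
    print(dataset["name"], dataset["volume_items"], "items,", dataset["size_bytes"], "bytes")
```

Under these assumptions the survey has far more items to clean and model but occupies far less storage, which is why the two quantities are best planned for separately.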
Anticipating volume helps:
- Structure your data collection protocol
- Select appropriate storage solutions and database models
- Determine when automation or advanced data management tools are necessary
3. File size and storage needs
Data size refers to the amount of digital storage space your data occupies. This has direct implications for:
- Storage infrastructure: local drives, institutional servers, or cloud services
- Data backup strategies
- Accessibility and transfer times
Estimate storage requirements in MB, GB, or TB at different project stages. Plan ahead for growth, especially in data-intensive disciplines like genomics, digital imaging, or remote sensing.
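A rough estimate per project stage might be computed as in the sketch below. The stage names, file counts, average file size, and the default of three copies (one working copy plus two backups) are assumptions chosen for the example; substitute figures from your own project.

```python
def format_size(n_bytes):
    """Render a byte count using decimal units (1 MB = 10**6 bytes)."""
    for unit, factor in (("TB", 10**12), ("GB", 10**9), ("MB", 10**6)):
        if n_bytes >= factor:
            return f"{n_bytes / factor:.1f} {unit}"
    return f"{n_bytes} bytes"


def estimate_storage(files_per_stage, avg_file_bytes, copies=3):
    """Estimate cumulative storage after each stage, including backup copies."""
    estimates = {}
    running_total = 0
    for stage, n_files in files_per_stage.items():
        running_total += n_files * avg_file_bytes * copies
        estimates[stage] = running_total
    return estimates


# Hypothetical project: number of new files produced at each stage, about 25 MB each.
stages = {"pilot": 50, "collection": 2_000, "analysis": 500}
estimates = estimate_storage(stages, avg_file_bytes=25_000_000)
for stage, n_bytes in estimates.items():
    print(stage, format_size(n_bytes))
```

Because the totals are cumulative, the last stage shows the capacity you would need to have available by the end of the project.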
Choose efficient file formats for large datasets. Binary formats and lossless compression can reduce storage space and transfer times without loss of fidelity; lossy compression saves more space but discards information.
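To check that a compression step really is lossless, you can compare checksums before and after a round trip. The sketch below uses gzip from the Python standard library on generated sample data; the data itself is invented for the example.

```python
import gzip
import hashlib


def sha256(data: bytes) -> str:
    """Return the SHA-256 checksum of a byte string as hexadecimal text."""
    return hashlib.sha256(data).hexdigest()


# Illustrative tabular data: a header line plus 10,000 rows.
rows = "\n".join(f"{i},{i * 0.5}" for i in range(10_000))
original = ("sample_id,value\n" + rows).encode("utf-8")

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# Identical checksums show that no information was lost in the round trip.
print("lossless:", sha256(restored) == sha256(original))
print("original bytes:", len(original), "compressed bytes:", len(compressed))
```

Storing the checksum of the original file next to the compressed copy lets you repeat this check later, for example after a transfer or a restore from backup.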
4. Digital file formats
Choosing the right file format ensures long-term usability, interoperability, and preservation. Favor:
- Open, non-proprietary formats (e.g., .csv, .xml, .json, .txt)
- Lossless compression where data integrity is critical
When proprietary formats are required (e.g., .sav, .psd, .mat), also produce portable backups in widely supported formats to safeguard accessibility.
Selection criteria include:
- Team expertise
- Accepted standards in your research community
- Compatibility with repository or funder requirements
5. How to select a data format (Adapted from ANDS)
Follow these best practices:
- Decide early: Agree on formats before data collection begins
- Compare proprietary and open formats for accessibility, functionality, and sustainability
- Anticipate obsolescence: software and formats may not be supported forever
- Dual-format storage: consider saving in both proprietary and open formats to reduce risk
- Plan for conversion: high-resolution data may require format conversion for online display or transmission
- Ask colleagues or your data steward about preferred formats in your field
Recommended universal backup formats: .csv, .tab, .txt, .rtf
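As a minimal sketch of producing open-format backups, the Python example below writes the same records as both .csv and .tab files using only the standard library. It assumes the records have already been read from the working format into a list of dictionaries; the `export_backups` helper, the field names, and the sample values are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def export_backups(rows, fieldnames, out_dir, stem):
    """Write the same records as comma-separated (.csv) and tab-separated (.tab) text."""
    out_dir = Path(out_dir)
    paths = {}
    for suffix, delimiter in ((".csv", ","), (".tab", "\t")):
        path = out_dir / f"{stem}{suffix}"
        # newline="" lets the csv module control line endings itself.
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter=delimiter)
            writer.writeheader()
            writer.writerows(rows)
        paths[suffix] = path
    return paths


# Illustrative records; a temporary directory stands in for your backup location.
rows = [{"participant": "p01", "score": 4}, {"participant": "p02", "score": 5}]
with tempfile.TemporaryDirectory() as tmp:
    backup_paths = export_backups(rows, ["participant", "score"], tmp, "survey")
    csv_text = backup_paths[".csv"].read_text(encoding="utf-8")
    tab_text = backup_paths[".tab"].read_text(encoding="utf-8")

print(csv_text)
```

Writing the files as UTF-8 text keeps them readable in any spreadsheet or statistics package, independently of the software that produced the original data.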
> Need help choosing a format? Consult DMP – data formats for preservation
6. Variable types in structured data
Correctly identifying variable types improves how your data is interpreted and analyzed by software tools.
Quantitative variables
- Discrete: Whole numbers (e.g., number of publications)
- Continuous: Real numbers on a scale (e.g., time, distance)
Qualitative (categorical) variables
- Nominal: Unordered categories (e.g., language, country)
- Ordinal: Ordered categories (e.g., Likert scale, academic level)
Many tools also support:
- String or character variables: Free text entries, notes, open-ended responses
Clearly documenting variable types ensures accurate processing, facilitates interoperability, and supports statistical integrity.
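One way to document variable types is a small machine-readable codebook that can also be used to check incoming values. The Python sketch below is illustrative only: the `VariableType` and `Variable` names, the validation rules, and the sample variables are assumptions, not a standard.

```python
from dataclasses import dataclass
from enum import Enum


class VariableType(Enum):
    DISCRETE = "discrete"
    CONTINUOUS = "continuous"
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    STRING = "string"


@dataclass(frozen=True)
class Variable:
    name: str
    var_type: VariableType
    categories: tuple = ()  # allowed values; listed in rank order for ordinal variables


def is_valid(variable, value):
    """Check a single value against the declared variable type."""
    if variable.var_type is VariableType.DISCRETE:
        return isinstance(value, int) and not isinstance(value, bool)
    if variable.var_type is VariableType.CONTINUOUS:
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if variable.var_type in (VariableType.NOMINAL, VariableType.ORDINAL):
        return value in variable.categories
    return isinstance(value, str)


# Hypothetical codebook covering each type described above.
codebook = [
    Variable("n_publications", VariableType.DISCRETE),
    Variable("distance_km", VariableType.CONTINUOUS),
    Variable("country", VariableType.NOMINAL, ("CA", "FR", "JP")),
    Variable("academic_level", VariableType.ORDINAL, ("bachelor", "master", "doctorate")),
    Variable("comments", VariableType.STRING),
]

record = {
    "n_publications": 7,
    "distance_km": 12.5,
    "country": "CA",
    "academic_level": "master",
    "comments": "none",
}
problems = [v.name for v in codebook if not is_valid(v, record[v.name])]
print("invalid variables:", problems)
```

For ordinal variables, the position of a value in `categories` records its rank, so the ordering is documented alongside the labels.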
7. Discipline-specific formats and integrated metadata
Certain research disciplines use specialized file formats that already integrate structured metadata directly within the file. These formats:
- Facilitate automated metadata extraction
- Enhance interoperability with community-specific tools and platforms
- Support standardized documentation, boosting reuse and reproducibility
Common examples by discipline:
| Discipline | Format | Metadata Features |
| --- | --- | --- |
| Social Sciences | DDI (.xml) | Documents study-level metadata, variable-level details, methodology |
| Genomics / Bioinformatics | FASTQ, BAM, VCF | Includes read sequences and quality scores (FASTQ), alignment information (BAM), variant calls and annotations (VCF) |
| Geospatial Sciences | GeoTIFF, NetCDF, Shapefile | Captures geolocation, spatial resolution, time stamps |
| Digital Humanities | TEI (.xml) | Encodes text structure, annotations, provenance |
| Engineering / CAD | STEP, IGES, DXF | Stores design metadata, units, geometry standards |
| Astronomy | FITS | Integrates metadata headers with observational data |
| Imaging (Medical) | DICOM | Embeds patient, modality, and capture metadata |
| Environmental Science | HDF5, NetCDF | Handles multidimensional sensor datasets with metadata |
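As an illustration of automated metadata extraction from such a format, the Python sketch below reads the title and source description from the header of a TEI document using the standard library. The sample document is invented for the example and contains only a minimal header; real TEI files carry much richer metadata, and `extract_tei_metadata` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document with a minimal TEI header.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter (illustrative)</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Invented for this sketch</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Dear colleague, ...</p></body></text>
</TEI>"""


def extract_tei_metadata(xml_text):
    """Pull a few descriptive fields out of the embedded TEI header."""
    root = ET.fromstring(xml_text)
    header = root.find("tei:teiHeader", TEI_NS)
    return {
        "title": header.findtext("tei:fileDesc/tei:titleStmt/tei:title", namespaces=TEI_NS),
        "source": header.findtext("tei:fileDesc/tei:sourceDesc/tei:p", namespaces=TEI_NS),
    }


metadata = extract_tei_metadata(SAMPLE_TEI)
print(metadata)
```

Because the metadata travels inside the file, a script like this can populate an inventory or repository record without retyping the descriptions by hand.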