Artificial Intelligence and Digital Data Practices in Research

Artificial intelligence and large-scale digital data practices are transforming research across all disciplines. Whether you are training a machine learning model, using a generative AI tool to assist with writing or analysis, scraping web data for a corpus, or analysing datasets derived from digital platforms, these activities carry a distinct and growing set of ethical and legal responsibilities.

Artificial Intelligence and Research

Why does it matter?

Generative AI has great potential to accelerate scientific discovery, enable new research breakthroughs and significant productivity gains, and improve the effectiveness and pace of research and verification processes. At the same time, the technology entails real risks of abuse. Some risks stem from the tools' technical limitations; others arise from intentional or unintentional use in ways that erode sound research practices. These tools can therefore harm research integrity, and they raise questions about the ability of current models to combat deceptive scientific practices and misinformation.

To address this, the European Research Area Forum developed guidelines on the use of generative AI in research for funding bodies, research organisations, and researchers, both in the public and private research ecosystems. 

The result is the Living Guidelines on the Responsible Use of Generative AI in Research (European Commission, 2026), which provide a shared framework applicable across Europe. While non-binding, they should be considered as a supporting tool for researchers, research organisations and research funding bodies, including those applying to the European Framework Programme for Research and Innovation.

Main issues

AI systems can exhibit a range of limitations that directly affect research quality and integrity:

  • Training data bias: biases present in training data can produce skewed or inaccurate outputs, reflecting and amplifying systemic inequalities in the source material.
  • Sycophantic behaviour: models may align their responses with the perceived beliefs or preferences of the user, producing misleadingly agreeable outputs rather than accurate ones.
  • Invented citations: generative AI models may produce plausible-sounding but entirely fictitious references, which can seriously mislead anyone relying on those sources.
  • Opacity: AI models operate as "black boxes", making it difficult to understand how specific responses are generated. This makes independent verification essential, particularly in automated data analysis.
  • Hallucinations: models regularly produce confident but factually incorrect statements, requiring systematic critical review of all outputs before use.

Beyond these technical limitations, there are also risks linked to the proprietary nature of many tools, including lack of openness, fees, and the potential use of input data by the platform provider.

AI and research integrity

The use of AI in research is not a neutral act: it engages directly with the core principles of research integrity. The key principles framing the Living Guidelines are grounded in the European Code of Conduct for Research Integrity and the guidelines on Trustworthy AI. They cover: 

  • Reliability in ensuring the quality of research, including verifying and reproducing AI-generated information;
  • Honesty in transparently disclosing the use of generative AI;
  • Respect for colleagues, participants, society, and the environment, including proper management of privacy, confidentiality, and intellectual property;
  • Accountability for all outputs produced, underpinned by human agency and oversight.

Concealing the use of AI in the creation of content or in drafting publications is considered an unacceptable practice under the ALLEA European Code of Conduct for Research Integrity. AI systems are not authors or co-authors: authorship implies agency and responsibility, which rest solely with human researchers.

What is recommended

Transparency and disclosure

Any substantial use of AI must be clearly described in your research outputs. This applies across the full research process: if AI has been used to carry out a literature review, analyse data, generate or refine text, develop hypotheses, or produce images, this must be disclosed in the methods section (or equivalent). You should specify the tool name, version, and date of use, and explain how it shaped your results.
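One way to keep these details consistent is to log each use of an AI tool as a small structured record and generate the disclosure text from it. The sketch below is only an illustration: the class `AIUseRecord`, its field names, the function `disclosure_statement`, and the tool name "ExampleLLM" are all hypothetical and do not come from the Living Guidelines or any journal template.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AIUseRecord:
    """One documented use of an AI tool (field names are illustrative)."""
    tool: str        # tool name
    version: str     # model or software version
    used_on: date    # date of use
    stage: str       # research stage, e.g. "a literature review"
    influence: str   # how the output shaped the results


def disclosure_statement(records: list[AIUseRecord]) -> str:
    """Render the records as a short paragraph for a methods section."""
    if not records:
        return "No generative AI tools were used in this work."
    sentences = [
        f"{r.tool} (version {r.version}, used on {r.used_on.isoformat()}) "
        f"was used for {r.stage}; {r.influence}."
        for r in sorted(records, key=lambda r: r.used_on)
    ]
    return " ".join(sentences)


records = [
    AIUseRecord(
        tool="ExampleLLM",   # hypothetical tool name
        version="1.0",       # hypothetical version
        used_on=date(2025, 3, 14),
        stage="refining the text of the introduction",
        influence="all suggestions were reviewed and edited by the authors",
    ),
]
print(disclosure_statement(records))
```

The generated paragraph is a starting point only; the final wording should follow the norms of your discipline and the requirements of your journal or funder.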

Human agency and oversight

AI tools must support rather than replace human judgement. Researchers should maintain a critical relationship with AI outputs, verifying claims, checking citations, and evaluating results independently. Acknowledging the stochastic nature of generative AI (the tendency to produce different outputs from the same input), researchers should strive for reproducibility and robustness, and openly discuss any limitations arising from the tools used.

When AI systems interact with study participants or the public, those individuals must be clearly informed that they are engaging with an AI and must receive comprehensible information about its capabilities and limitations.

Privacy and intellectual property

Special caution is required when inputting data into external AI platforms. The output of generative AI can also contain personal data. If this becomes apparent, researchers are responsible for handling that output appropriately and in line with EU data protection rules.

Researchers should not upload personal data to external AI systems without explicit consent and a documented lawful basis under the GDPR, and should check the data governance policies of any tool before use.

AI-generated text, code, or images may incorporate or closely resemble existing protected works. All AI-generated content must be critically reviewed before publication for factual errors, invented citations, bias, and inadvertent reproduction of third-party material.

Environmental responsibility

Large generative models carry significant computational costs. Researchers should evaluate whether the tool chosen is appropriately matched to the task, and consider the environmental impact of AI use as part of responsible research practice.

 

Digital Data Collection and Research

Key ethical and legal considerations

Digital data collection (including web scraping, API access, use of social media datasets, or mining of existing databases) engages multiple overlapping legal and ethical obligations. The regulatory frameworks differ substantially depending on the type of data involved.

Personal data (names, email addresses, location data, IP addresses, or any information that could directly or indirectly identify an individual) falls under the General Data Protection Regulation (GDPR). If your dataset includes personal data, you must establish a lawful basis for processing, apply the principles of data minimisation and purpose limitation, implement appropriate security measures, and consider conducting a Data Protection Impact Assessment (DPIA) when the processing is likely to result in high risks to individuals.
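As a minimal sketch of data minimisation in practice, the example below keeps only the fields an analysis needs and replaces a direct identifier with a keyed hash. The field names, the `FIELDS_NEEDED` set, and the functions `pseudonymise` and `minimise` are assumptions made for illustration. Pseudonymised data remains personal data under the GDPR, so this step reduces risk but does not remove the obligations described above.

```python
import hashlib
import hmac

# Hypothetical field names: only these are assumed to be needed for the analysis.
FIELDS_NEEDED = {"user_id", "post_text", "posted_at"}


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256, hex encoded)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimise(record: dict, secret_key: bytes) -> dict:
    """Keep only the needed fields and pseudonymise the direct identifier."""
    kept = {k: v for k, v in record.items() if k in FIELDS_NEEDED}
    if "user_id" in kept:
        kept["user_id"] = pseudonymise(str(kept["user_id"]), secret_key)
    return kept


raw = {
    "user_id": "alice42",
    "email": "alice@example.org",
    "ip_address": "203.0.113.7",
    "post_text": "Hello world",
    "posted_at": "2025-01-01T12:00:00Z",
}
# Placeholder key: in a real project the key would be generated securely
# and stored separately from the dataset.
key = b"replace-with-a-secret-stored-separately"
clean = minimise(raw, key)
print(sorted(clean))
```

Free-text fields such as `post_text` may themselves contain personal data, so dropping columns is rarely sufficient on its own.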

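Non-personal data may still be protected by other legal instruments. The Database Directive (96/9/EC) protects databases by copyright where the selection or arrangement of their contents is original, and by a sui generis right where obtaining, verifying, or presenting the contents required substantial investment. Extracting or reusing a substantial part of such a database without authorisation may infringe these rights.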

Web scraping and terms of service

Before scraping any website, review its terms of service carefully: many platforms explicitly restrict automated access or commercial reuse of their content. Breaching these terms may expose researchers to civil liability. EU copyright law includes a specific exception for text and data mining (TDM) for scientific research purposes under Directive 2019/790, but this exception is subject to conditions, such as lawful access to the content, and does not override the GDPR or database rights.
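Alongside reading the terms of service, a scraper can also check a site's robots.txt file before fetching pages. This step is an addition for illustration and is not part of the guidance above: robots.txt is a technical convention, not a legal instrument, and respecting it does not replace reviewing the terms of service or establishing a legal basis. The sketch uses Python's standard `urllib.robotparser`; the robots.txt content, the agent name `research-crawler`, and the example.org URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


AGENT = "research-crawler"  # hypothetical user-agent name
parser = build_parser(ROBOTS_TXT)

allowed = parser.can_fetch(AGENT, "https://example.org/articles/1")
blocked = parser.can_fetch(AGENT, "https://example.org/private/data")
delay = parser.crawl_delay(AGENT)  # seconds to wait between requests, if declared

print(allowed, blocked, delay)
```

In a real project the parser would typically load the live file with `set_url()` and `read()`, and the declared delay would be honoured between requests.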

Intellectual property and third-party data

Before incorporating datasets, images, or text produced by others into your research, including as inputs to AI tools, verify whether the material is protected by copyright or database rights. If you are publishing datasets derived from third-party sources, ensure that your reuse is covered by an appropriate licence or falls within a recognised legal exception.

The EU AI Act

The EU Artificial Intelligence Act entered into force on 1 August 2024, with provisions being phased in over two to three years. It applies a risk-based framework to AI systems across four tiers: prohibited applications, high-risk systems (subject to strict obligations including conformity assessments and human oversight), limited-risk systems (subject to transparency obligations), and minimal-risk systems. Researchers developing or deploying AI should assess where their systems fall within this framework, maintain detailed technical documentation throughout the project, and apply the principles of transparency, human oversight, and non-discrimination from the outset.

A useful self-assessment tool is the Assessment List for Trustworthy Artificial Intelligence (ALTAI), developed by the EU High-Level Expert Group on AI.

 

What you should do

Before and during your project:

  • Conduct an ethics self-assessment addressing AI and digital data dimensions if applying for EU-funded research. Consult the guide "EU Grants: How to complete your ethics self-assessment" (pages 39–45 for AI).
  • Contact your local ethics committee if your research involves the development of AI systems, large-scale digital data collection, or the use of AI in interaction with human participants.
  • Apply the principles of data minimisation, purpose limitation, and security to all personal data, and consult your institution’s Data Protection Officer where needed.
  • Review website terms of service before any scraping activity and confirm that your collection falls within applicable legal exceptions.
  • Use the ALTAI checklist to evaluate the trustworthiness of any AI system you develop or deploy.
  • Document your use of AI tools throughout the project, including which tools were used, how, and at which stages.

When publishing:

  • Disclose all substantial uses of AI in methods sections, following the norms of your discipline and any requirements set by your target journal or funder.
  • Critically review all AI-generated content before publication: check for factual errors, invented citations, bias, and any inadvertent reproduction of third-party material.
  • Do not list AI systems as authors or co-authors.

 

Key references and further reading:

Living Guidelines on the Responsible Use of Generative AI in Research 

ALLEA European Code of Conduct for Research Integrity

Guidelines for AI use in EU project deliverables (based on EC recommendations)

EU Grants: How to complete your ethics self-assessment 

Assessment List for Trustworthy Artificial Intelligence (ALTAI) 

EU Artificial Intelligence Act 

Directive (EU) 2019/790

General Data Protection Regulation (GDPR)
