SynDAiTE: Synthetic Data for AI Trustworthiness and Evolution

Workshop at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2025), September 15, 2025 - Porto, Portugal


Organisers

Dr. Marco Piangerelli
University of Camerino | Vici & C.
Ylenia Rotalinti
Brunel University London
Prof. Heitor Murilo Gomes
Victoria University of Wellington
Prof. Maroua Bahari
Sorbonne Université
Prof. Yi He
William & Mary
Dr. Bardh Prenkaj
Technical University of Munich
André Carreiro
Fraunhofer AICOS
Prof. Ana Carolina Lorena
Instituto Tecnológico de Aeronáutica
Prof. Kate Smith-Miles
University of Melbourne
Zafeiris Kokkinogenis
University of Porto
Prof. Albert Bifet
University of Waikato
Prof. Carlos Soares
University of Porto | Fraunhofer AICOS

🚨📢 Where to submit? 📢🚨 All submissions must be made via Microsoft CMT.

Aims and Scope

The rapid advancement of artificial intelligence (AI) relies heavily on access to large, diverse, and high-quality datasets for training and evaluation. However, the increasing scarcity of data, strict privacy regulations, and the high costs of collection and annotation are creating significant barriers to progress. Projections suggest that by 2050 we may face a shortage of fresh text data, and that by 2060 image data may become similarly limited. These challenges make it imperative to explore alternatives that can sustain AI’s growth and effectiveness. Synthetic data is a compelling solution to these issues, offering scalability, customisation, and inherent anonymisation. It allows large volumes of tailored datasets to be generated without the privacy and cost concerns that come with real data.

Important Dates

All deadlines are 11:59 pm, AoE.

Topics

SynDAiTE welcomes contributions on the use of synthetic data on all topics below, independent of the application domain (e.g., health, finance, business, basic sciences, construction, computational advertising, IoT, etc.) and of data types (e.g., networks, graphs, logs, spatiotemporal, multimedia, time series, genomic sequences, and streaming data):

Invited Talks:

Simulation as a Data Engine for Physical AI - Hugo Penedones - 14:10

Hugo Penedones
Abstract:

AI is having a remarkable impact on the physical sciences and engineering. From AI-driven material discovery and drug design to robotics and weather forecasting, the progress of Physical AI depends on high-quality data. This talk will show how we can leverage the mature field of numerical simulation to generate synthetic data for training machine learning models. Examples will be demonstrated using the Inductiva cloud HPC platform and its minimalist Python SDK, which is designed to be intuitive for the machine learning community.

Speaker's Bio:

Hugo Penedones is a Machine Learning researcher and engineer, co-founder of Inductiva Research Labs, on a mission to blend Scientific Computing and Machine Learning. Most recently, he worked at Google DeepMind, in London and Zurich. Prior to that, he worked in the Query Formulation team at Microsoft Bing and did research in Machine Learning and Computer Vision at Idiap Research Institute and École Polytechnique Fédérale de Lausanne, both in Switzerland. He did his undergraduate studies in Informatics and Computing Engineering at FEUP, in Portugal.

Lessons from Synthetic Health Data Generation: Fidelity, Privacy, Augmentation & Time - Allan Tucker - 17:00

Professor Allan Tucker
Abstract:

Primary healthcare data offers huge value in modelling disease and illness. However, this data holds extremely private information about individuals, and privacy concerns continue to limit the widespread use of such data, both by public research institutions and by the private health-tech sector. One possible solution is the use of synthetic data, which mimics the underlying correlational structure and distributions of real data but avoids many of the privacy concerns. Brunel University London has been working in a long-term collaboration with the Medicines and Healthcare products Regulatory Agency in the UK to construct a high-fidelity synthetic data generator using probabilistic models with complex underlying latent variable structures. This work has led to multiple releases of synthetic data on a number of diseases, including COVID-19 and cardiovascular disease, which are available for state-of-the-art AI research. Two major issues that have arisen from our synthetic data work are bias, even when working with comprehensive national data, and concept drift, where subsequent batches of data move away from current models, along with the impact this may have on regulation. In this talk I will discuss some of the key results of the collaboration: our experiences of synthetic data generation, the detection of bias and how to better represent the true underlying UK population, and how to handle concept drift when building models of healthcare data that evolves over time.

Speaker's Bio:

Allan Tucker is Professor of Artificial Intelligence in the Department of Computer Science at Brunel University London, where he heads the Intelligent Data Analysis (IDA) Group. His research spans biomedical informatics, eco-informatics, machine learning, and Bayesian networks, with current projects involving Google, the Royal Free Hospital, UCL, the Zoological Society of London, and the Royal Botanic Gardens, Kew. He is also involved in significant grants, including a Natural Environment Research Council (NERC) project on improved estimation of global-scale groundwater changes (2024–2027) and a BEIS Innovate UK Regulatory Pioneer Fund project on using high-fidelity synthetic data in clinical trials (2023–2025).

Program at a Glance (Level 2, INFANTE Room, Alfândega do Porto):

Registration and Presentation Policy

Each accepted paper must have at least one author registered for the full conference by the early registration deadline and must be presented at the workshop, even if the authors opt out of the post-proceedings. We expect the authors, the program committee, and the organizing committee to adhere to the ECML-PKDD Code of Conduct.

Acknowledgement

Vici & C. SpA is an Italian company that has been operating in the industrial automation sector since 1977. It designs, produces, and distributes electrical circuit boards and machines.

The “Synthetic Data for AI Trustworthiness and Evolution (SynDAiTE 2025)” workshop has been supported by Vici & C. and the @HOME Project: Lazio Region, FESR Lazio 2021–2027 (# F89J23001050007, CUP B83C23006240002).

Contacts

For general inquiries about the workshop, please email syndaite@gmail.com