🚨📢 Where to submit? 📢🚨 All submissions must be made via Microsoft CMT.
The rapid advancement of artificial intelligence (AI) relies heavily on access to large, diverse, and high-quality datasets for training and evaluation. However, the increasing scarcity of data, strict privacy regulations, and the high costs of collection and annotation are creating significant barriers to progress. Projections suggest that by 2050 we may face a shortage of fresh text data, and that by 2060 image data may become similarly limited. These challenges make it imperative to explore alternatives that can sustain AI’s growth and effectiveness. Synthetic data is a compelling solution to these issues, offering scalability, customisation, and inherent anonymisation. It allows large volumes of tailored datasets to be generated without the privacy and cost concerns that come with real data.
All deadlines are 11:59 pm, AoE.
SynDAiTE welcomes contributions on the use of synthetic data on all topics below, independent of the application domain (e.g., health, finance, business, basic sciences, construction, computational advertising, IoT, etc.) and of data types (e.g., networks, graphs, logs, spatiotemporal, multimedia, time series, genomic sequences, and streaming data):
Abstract: AI is having a remarkable impact on the physical sciences and engineering. From AI-driven material discovery and drug design to robotics and weather forecasting, the progress of Physical AI depends on high-quality data. This talk will show how we can leverage the mature field of numerical simulation to generate synthetic data for training machine learning models. Examples will be demonstrated using the Inductiva cloud HPC platform and its minimalist Python SDK, which is designed to be intuitive for the machine learning community.

Speaker's Bio: Hugo Penedones is a Machine Learning researcher and engineer, co-founder of Inductiva Research Labs, on a mission to blend Scientific Computing and Machine Learning. Most recently, he worked at Google DeepMind, in London and Zurich. Prior to that, he worked in the Query Formulation team at Microsoft Bing and did research in Machine Learning and Computer Vision at Idiap Research Institute and École Polytechnique Fédérale de Lausanne, both in Switzerland. He did his undergraduate studies in Informatics and Computing Engineering at FEUP, in Portugal.
Abstract: Primary healthcare data offers huge value in modelling disease and illness. However, this data holds extremely private information about individuals, and privacy concerns continue to limit its widespread use, both by public research institutions and by the private health-tech sector. One possible solution is synthetic data, which mimics the underlying correlational structure and distributions of real data but avoids many of the privacy concerns. Brunel University London has been working in a long-term collaboration with the Medicines and Healthcare products Regulatory Agency (MHRA) in the UK to construct a high-fidelity synthetic data generator using probabilistic models with complex underlying latent variable structures. This work has led to multiple releases of synthetic data on a number of diseases, including COVID-19 and cardiovascular disease, which are available for state-of-the-art AI research. Two major issues have arisen from our synthetic data work: bias, even when working with comprehensive national data, and concept drift, where subsequent batches of data move away from current models, along with the impact this may have on regulation. In this talk I will discuss some of the key results of the collaboration: our experiences of synthetic data generation, the detection of bias and how to better represent the true underlying UK population, and how to handle concept drift when building models of healthcare data that evolves over time.

Speaker's Bio: Allan Tucker is Professor of Artificial Intelligence in the Department of Computer Science at Brunel University London, where he heads the Intelligent Data Analysis (IDA) Group. His research spans biomedical informatics, eco-informatics, machine learning, and Bayesian networks, with current projects involving Google, the Royal Free Hospital, UCL, the Zoological Society of London, and the Royal Botanic Gardens, Kew. He is also involved in significant grants, including a Natural Environment Research Council (NERC) project on improved estimation of global-scale groundwater changes (2024–2027) and a BEIS Innovate UK Regulatory Pioneer Fund project on using high-fidelity synthetic data in clinical trials (2023–2025).
Each accepted paper must have at least one author registered for the full conference by the early registration deadline and must be presented at the workshop, even if the authors opt out of the post-proceedings. We expect the authors, the program committee, and the organizing committee to adhere to the ECML-PKDD Code of Conduct.
The “Synthetic Data for AI Trustworthiness and Evolution (SynDAiTE 2025)” workshop is supported by VICI & C and the @HOME Project: Lazio Region, FESR Lazio 2021–2027 (# F89J23001050007, CUP B83C23006240002).
For general inquiries about the workshop, please email syndaite@gmail.com