Machine learning: New paper published with Prof. Grimm
Minimizing the risk of data leaks

2024-09-10 TUMCS, BIT

The use of artificial intelligence has opened up countless possibilities and opportunities in biological research and has become indispensable for understanding complex biological systems. By applying machine learning (ML) methods to biomolecular data, researchers can identify patterns and relationships in DNA, RNA, and protein sequences, for example. This has led to significant advances in many areas of biological research, such as the prediction of 3D protein structures.

In practice, however, researchers repeatedly find that the reported results of ML-based predictors are too optimistic and cannot be reproduced on independent data. One of the main reasons is so-called “data leakage”, i.e., the unintended transfer of information between training and test data. A model evaluated this way has, in effect, already seen part of what it is tested on, so its performance estimates are inflated and do not hold up in practice.
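One common source of such leakage is related samples ending up on both sides of the train/test split. The following is a minimal sketch, not taken from the paper, of a split that keeps every group of related samples entirely in one set. The sample names, the "family" grouping, and the helper functions `group_split` and `shared_groups` are hypothetical and serve only as illustration.

```python
import random
from collections import defaultdict


def group_split(samples, groups, test_fraction=0.25, seed=0):
    """Split samples so that no group appears in both train and test.

    Hypothetical helper for illustration: whole groups, not individual
    samples, are assigned to the test set.
    """
    by_group = defaultdict(list)
    for sample, group in zip(samples, groups):
        by_group[group].append(sample)

    # Shuffle the group IDs reproducibly, then hold out whole groups.
    group_ids = sorted(by_group)
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(len(group_ids) * test_fraction))

    test = [s for g in group_ids[:n_test] for s in by_group[g]]
    train = [s for g in group_ids[n_test:] for s in by_group[g]]
    return train, test


def shared_groups(train, test, group_of):
    """Return the groups that occur in both sets (empty set = no overlap)."""
    return {group_of[s] for s in train} & {group_of[s] for s in test}


# Toy data (assumed): 12 sequences in 4 families of 3 related sequences each.
samples = [f"seq{i}" for i in range(12)]
groups = [f"family{i // 3}" for i in range(12)]
group_of = dict(zip(samples, groups))

train, test = group_split(samples, groups)
print("groups shared between train and test:", shared_groups(train, test, group_of))
```

A purely random split of the same twelve sequences would usually place members of one family in both sets. Libraries such as scikit-learn offer group-aware splitters (for example `GroupShuffleSplit`) for the same purpose. Which grouping is appropriate depends on the biological question and is not something the code can decide.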

A team of researchers from the Technical University of Munich (TUM), Friedrich-Alexander University Erlangen-Nuremberg (FAU), Weihenstephan-Triesdorf University of Applied Sciences (HSWT), the Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), and Saarland University (UdS) has therefore addressed the question of how to avoid the pitfalls in applying ML-based approaches that can quickly lead to data leaks, and thus to overly optimistic results, especially in biological applications.

“Data leaks that lead to unrealistic assessments of the performance of ML approaches are particularly dangerous in biological and medical applications,” says Prof. Olga Kalinina from HIPS/UdS, “as they can potentially even jeopardize patient safety.”

Against this background, the researchers present seven questions that should help to avoid data leaks when constructing machine learning models in biology. By applying these questions to concrete examples, the researchers demonstrate their usefulness and offer a guide for robust and reproducible research in the field of machine learning in biology. “Our goal is to raise awareness of potential problems caused by data leaks and to contribute to the development of reliable machine learning models. We hope that our questions will help researchers identify complex and hidden dependencies in biological data and thus avoid data leaks,” says Prof. Grimm, head of the Professorship of Bioinformatics at TUM Campus Straubing and HSWT.

“Nowadays, popular software and programming frameworks have made it easier to ensure a valid ML workflow. In practice, however, their user-friendliness increases the risk of scientifically incorrect applications and false results,” notes Prof. David Blumenthal from the Department of Artificial Intelligence in Biomedical Engineering at FAU.

“Conversely, the complexity of biological data can lead to data leaks if it is overlooked by data scientists without sufficient qualifications in the respective application domain. For these reasons, we strongly recommend interdisciplinary collaboration between experts from both fields,” says Prof. Markus List, Professor of Data Science in Systems Biology at TUM in Freising.

In summary, Prof. Haselbeck, Professor of Smart Farming at HSWT, explains: “I would particularly like to highlight the excellent inter-institutional collaboration. We hope that our work will improve the quality and reliability of future machine learning models for biological applications.”

Bernett, J., Blumenthal, D.B., Grimm, D.G. et al. Guiding questions to avoid data leakage in biological machine learning applications. Nat Methods 21, 1444–1453 (2024). doi.org/10.1038/s41592-024-02362-y