Author(s)

  • Armin Soleymaniniya | Computational Mass Spectrometry Department, School of Life Sciences, Technical University of Munich | Maximus-von-Imhof-Forum 3, 85354, Freising, Germany
  • Harikrishnan Ramadasan (Presenting Author) | CompOmics - VIB Ghent University Center for Medical Biotechnology | FSVM II - Technologiepark Zwijnaarde 75, 9052, Ghent, Belgium
  • Julian Uszkoreit | Ruhr University Bochum, Medical Faculty, Medical Bioinformatics | Universitätsstr. 150, D-44801, Bochum, Germany
  • Katerina Nastou | Statens Serum Institut | Artillerivej 5, 2300, Copenhagen, Denmark
  • Lev Levitsky | Protein Research Group, Department of Biochemistry and Molecular Biology, University of Southern Denmark | Campusvej, 5230, Odense, Denmark
  • Magnus Palmblad | Center for Proteomics and Metabolomics | Postbus 9600, 2300 RC, Leiden, Netherlands
  • Natalia Tischenko | CompOmics - VIB - Ghent University Center for Medical Biotechnology | FSVM II - Technologiepark Zwijnaarde 75, 9052, Ghent, Belgium
  • Samuel de la Cámara Fuentes | Proteomics Unit, Faculty of Pharmacy, Complutense University of Madrid | Plaza Ramón y Cajal, 28040, Madrid, Spain
  • Tim Van Den Bossche | CompOmics - VIB Ghent University Center for Medical Biotechnology | FSVM II - Technologiepark Zwijnaarde 75, 9052, Ghent, Belgium
  • Veit Schwämmle | University of Southern Denmark | Campusvej 55, 5230, Odense, Denmark
  • Tine Claeys | CompOmics - VIB Ghent University Center for Medical Biotechnology | FSVM II - Technologiepark Zwijnaarde 75, 9052, Ghent, Belgium

Abstract

Originating from the 2025 EuBIC-MS Developers Meeting, this initiative introduces an automated metadata annotation pipeline that combines natural language processing (NLP) and large language models (LLMs) with manual curation. Implemented in lesSDRF 2.0, the pipeline extracts biological and technical metadata from open-access publications, supplementary files, and mzML files in accordance with the Sample and Data Relationship Format (SDRF) guidelines. To ensure accuracy, we use a hybrid approach: 42 open-access papers were manually annotated following the SDRF guidelines and used to fine-tune a Generative Pre-trained Transformer (GPT) model. To mitigate hallucinations, the model is supplemented by a Named Entity Recognition (NER) pipeline incorporating 18 SDRF-compatible ontologies. Once refined, the GPT model will be used to annotate a large corpus of open-access papers, which, after expert validation, will serve as training data for a BERT-based LLM, selected for its state-of-the-art performance in NER tasks. In parallel, we are developing workflows to extract metadata from supplementary files and MS data formats, so that both the biological context and the technical parameters are covered. This dual strategy consolidates experimental and technical details into a streamlined, FAIR-compliant framework. By combining careful manual curation with NLP models, the initiative pairs human expertise with computational efficiency and is a step toward automated proteomics metadata annotation.
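
As an illustration of how ontology-backed checking can limit hallucinated annotations, the following is a minimal sketch, not the lesSDRF 2.0 implementation. It assumes that an LLM proposes one value per SDRF column and that a proposal is kept only if it matches a controlled vocabulary term and also occurs in the source text. The function names (`normalize`, `validate`) and the three-term vocabulary are hypothetical; a real pipeline would load full ontologies and use a trained NER model rather than exact string matching.

```python
import re

# Toy controlled vocabulary (hypothetical subset for illustration only;
# a real pipeline would load complete SDRF-compatible ontologies).
VOCAB = {
    "characteristics[organism]": {
        "homo sapiens": "NCBITaxon:9606",
        "mus musculus": "NCBITaxon:10090",
    },
    "comment[instrument]": {
        "q exactive": "MS:1001911",
    },
}


def normalize(term):
    """Lowercase a term and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", term.strip().lower())


def validate(proposed, source_text):
    """Split LLM-proposed annotations into accepted and rejected.

    A value is accepted only if it (a) maps to a vocabulary term for its
    column and (b) literally occurs in the source text. Everything else is
    rejected and left for manual curation.
    """
    text = normalize(source_text)
    accepted, rejected = {}, {}
    for column, value in proposed.items():
        key = normalize(value)
        accession = VOCAB.get(column, {}).get(key)
        if accession is not None and key in text:
            accepted[column] = (value.strip(), accession)
        else:
            rejected[column] = value
    return accepted, rejected


if __name__ == "__main__":
    paper = "Samples from Homo sapiens were analysed on a Q Exactive instrument."
    llm_output = {
        "characteristics[organism]": "Homo sapiens",
        "comment[instrument]": "Orbitrap Fusion",  # not supported by the text
    }
    ok, flagged = validate(llm_output, paper)
    print("accepted:", ok)
    print("flagged for curation:", flagged)
```

Requiring both a vocabulary match and textual support is a deliberately strict choice in this sketch: it favours precision and routes anything uncertain to a human curator, in keeping with the hybrid approach described in the abstract.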