Abstract
Determining the amino acid sequence of proteins directly from experimental data remains a fundamental challenge in proteomics, especially where genome-based reference databases are incomplete or unavailable. De novo sequencing with tandem mass spectrometry (MS/MS) offers a route to identifying proteins without relying on prior sequence knowledge. However, accurate interpretation of MS/MS spectra is complex, requiring models that can robustly infer peptide sequences from sparse and noisy signals.
In this talk, I will give a brief overview of the current state of de novo sequencing using mass spectrometry and of how deep learning has transformed the field. I will discuss strategies for encoding mass spectra into formats suitable for neural networks and show how architectural choices affect model performance. In particular, I will present our recent work leveraging pairwise attention mechanisms, a variant of transformer-style attention, to better capture the relationships between fragment ions. This approach significantly improves sequence reconstruction accuracy over previous models.
Additionally, I will explore the emerging potential of cryo-electron microscopy (cryo-EM) data in de novo sequencing. While typically used for structural analysis, cryo-EM offers complementary constraints that, when integrated with MS/MS and deep learning, may enable full sequence determination of novel proteins. By combining these modalities, we can move closer to the goal of reference-free, high-throughput proteomics.
This presentation will be of interest to researchers in computational proteomics, structural biology, and machine learning, and will highlight opportunities for further integration between modalities and disciplines.

