Lund University Publications

Widespread false negatives in DNA-encoded library data: how linker effects impair machine learning-based lead prediction

Montoya, Alba L. ; Hogendorf, Adam S. ; Tingey, Steven ; Kuberan, Aadarsh ; Yuen, Lik Hang ; Schüler, Herwig and Franzini, Raphael M. (2025) In Chemical Science 16(24). p.10918-10927
Abstract

DNA-encoded chemical libraries (DECLs) have become integral to early-stage drug discovery, yielding active compounds and extensive labeled datasets for machine learning (ML)-based prediction of bioactive molecules. However, the information content of DECL selection data remains scarcely explored. This study systematically investigates for the first time the prevalence of false negatives and the influence of the linker in DECL data. Using a focused DECL targeting the poly-(ADP-ribose) polymerases PARP1/2 and TNKS1/2 as a model system, we found that our DECL selections frequently miss active compounds, with numerous false negatives for each identified hit. The presence of the DNA-conjugation linker emerged as a factor contributing to the underdetection of active molecules. This bias toward false negatives compromises the predictive power of DECL data for prioritizing hits, anticipating target selectivity, and training ML models, as determined by analyzing the effects of undersampling and oversampling techniques in learning the PARP2 data. Conversely, the linker's presence in DECLs offers advantages, such as enabling the identification of target-selective protein engagers, even when the underlying molecules themselves may not be selective. These findings highlight the challenges and opportunities of DECL data, emphasizing the need for best practices in data handling and ML model development in drug discovery.

author
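Montoya, Alba L. ; Hogendorf, Adam S. ; Tingey, Steven ; Kuberan, Aadarsh ; Yuen, Lik Hang ; Schüler, Herwig and Franzini, Raphael M.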
publishing date
2025
type
Contribution to journal
publication status
published
in
Chemical Science
volume
16
issue
24
pages
10 pages
publisher
Royal Society of Chemistry
external identifiers
  • scopus:105005756756
  • pmid:40395382
ISSN
2041-6520
DOI
10.1039/d5sc00844a
language
English
LU publication?
yes
id
8075beb0-7ac1-430d-b447-efaab59b1940
date added to LUP
2025-09-24 16:01:42
date last changed
2025-09-25 08:29:39
@article{8075beb0-7ac1-430d-b447-efaab59b1940,
  abstract     = {{DNA-encoded chemical libraries (DECLs) have become integral to early-stage drug discovery, yielding active compounds and extensive labeled datasets for machine learning (ML)-based prediction of bioactive molecules. However, the information content of DECL selection data remains scarcely explored. This study systematically investigates for the first time the prevalence of false negatives and the influence of the linker in DECL data. Using a focused DECL targeting the poly-(ADP-ribose) polymerases PARP1/2 and TNKS1/2 as a model system, we found that our DECL selections frequently miss active compounds, with numerous false negatives for each identified hit. The presence of the DNA-conjugation linker emerged as a factor contributing to the underdetection of active molecules. This bias toward false negatives compromises the predictive power of DECL data for prioritizing hits, anticipating target selectivity, and training ML models, as determined by analyzing the effects of undersampling and oversampling techniques in learning the PARP2 data. Conversely, the linker's presence in DECLs offers advantages, such as enabling the identification of target-selective protein engagers, even when the underlying molecules themselves may not be selective. These findings highlight the challenges and opportunities of DECL data, emphasizing the need for best practices in data handling and ML model development in drug discovery.}},
  author       = {{Montoya, Alba L. and Hogendorf, Adam S. and Tingey, Steven and Kuberan, Aadarsh and Yuen, Lik Hang and Schüler, Herwig and Franzini, Raphael M.}},
  issn         = {{2041-6520}},
  language     = {{eng}},
  number       = {{24}},
  pages        = {{10918--10927}},
  publisher    = {{Royal Society of Chemistry}},
  journal      = {{Chemical Science}},
  title        = {{Widespread false negatives in DNA-encoded library data: how linker effects impair machine learning-based lead prediction}},
  url          = {{http://dx.doi.org/10.1039/d5sc00844a}},
  doi          = {{10.1039/d5sc00844a}},
  volume       = {{16}},
  year         = {{2025}},
}