2021
DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
Abstract: Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT…
Cited by 1,079 publications (1,110 citation statements)
References: 59 publications
“…Among these, six algorithms (DeepSEA, Basset, DanQ, ExplaiNN, SATORI, and Scover) were specifically designed for diverse predictive tasks. Three models were DNA foundation models, namely DNABERT2 [30,31], Nucleotide Transformer (NT) [32], and HyenaDNA [33]. Consistent with previous studies, we observed a decline in the performance of deep learning models in cell type-specific regions (Fig. S2d) [34], with models performing better in regions associated with active histone modifications, such as H3K4me3 and H3K27ac, compared to repressive modifications like H3K9me3 and H3K27me3 (Fig. S2e).…”
Section: Benchmark Pipeline (supporting)
confidence: 84%
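The quoted benchmark reports performance stratified by region category (cell type-specific regions, active vs. repressive histone marks). Below is a minimal sketch of that kind of stratified evaluation, not the cited pipeline's actual code: predictions are grouped by a hypothetical region annotation and a per-group AUROC is computed. The column names and random data are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical per-region predictions: each row is one genomic region with a
# binary label, a model score, and the histone-mark category it falls in.
df = pd.DataFrame({
    "region_type": ["H3K4me3", "H3K27ac", "H3K9me3", "H3K27me3"] * 50,
    "label":       np.random.randint(0, 2, 200),
    "score":       np.random.rand(200),
})

# AUROC stratified by region category, mirroring the active-vs-repressive
# comparison described in the quoted benchmark.
for region, group in df.groupby("region_type"):
    if group["label"].nunique() < 2:   # AUROC is undefined with a single class
        continue
    auc = roc_auc_score(group["label"], group["score"])
    print(f"{region}: AUROC = {auc:.3f}")
```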
“…Experimental validation on real-world swine genomic datasets (PIC-GD and HZA-PMB) demonstrates that our model substantially outperforms baselines, including GBLUP and a Transformer trained from scratch [9, 23]. This confirms our central scientific hypothesis that pre-training on the genomic data itself enables the model to learn intrinsic genomic structures, thereby boosting performance in the downstream task of phenotype prediction by capturing non-linear genetic signals [15, 17, 18].…”
Section: Discussion (supporting)
confidence: 73%
“…These results robustly demonstrate that (1) the Transformer architecture is inherently more capable of capturing complex genetic effects than linear models and other tested architectures [13, 14], and (2) self-supervised pre-training is the critical step that unlocks this potential by providing a powerful initialization based on general genomic knowledge [15, 16, 18]. Figure 6 visually confirms the strong agreement between predicted and true phenotypic values (R² = 0.552) for PIC-GD T5.…”
Section: Results (mentioning)
confidence: 74%
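The quote reports agreement between predicted and true phenotypes as R² = 0.552. Below is a minimal sketch of one common way that metric is computed (the coefficient of determination; some genomic-prediction studies instead report squared Pearson correlation). The arrays are hypothetical placeholders, not the PIC-GD data.

```python
import numpy as np

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination between true and predicted phenotypes."""
    ss_res = np.sum((y_true - y_pred) ** 2)             # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical true vs. predicted phenotypic values for a handful of animals.
y_true = np.array([1.2, 0.8, -0.3, 0.5, 1.9])
y_pred = np.array([1.0, 0.9, -0.1, 0.6, 1.5])
print(f"R^2 = {r_squared(y_true, y_pred):.3f}")
```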
“…We pre-trained EBERTs with k ∈ {5, 6, 7} and DBERT with k ∈ {6, 7}. Our general findings on tokenization schemes agree with other DNA sequence embedding models, DNABERT [Ji et al., 2021] and EP2vec [Zeng et al., 2018], which found slight increases in downstream performance with increasing k up to 6, with diminishing returns as k increases past 6 up to 10. Here we show results from our best-performing EBERT: k = 7 with a stride of 7, which produces an input length L of 150 tokens.…”
Section: Genomic and Epigenetic Data (supporting)
confidence: 76%
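The quote describes k-mer tokenization with k = 7 and stride 7, i.e. non-overlapping 7-mers, so a roughly 1,050 bp sequence yields 150 tokens. Below is a minimal sketch of that tokenization; the function name and example sequence are illustrative and not taken from either cited model's code.

```python
def kmer_tokenize(sequence: str, k: int = 7, stride: int = 7) -> list[str]:
    """Split a DNA sequence into k-mer tokens taken every `stride` bases.

    stride == k gives non-overlapping k-mers (the k = 7 / stride 7 setting
    quoted above); stride == 1 gives overlapping k-mers, as in DNABERT-style
    tokenization.
    """
    return [sequence[i:i + k]
            for i in range(0, len(sequence) - k + 1, stride)]

# Illustrative example: a 1,050 bp sequence tokenized with k = 7, stride 7
# yields 1,050 / 7 = 150 tokens.
seq = "ACGT" * 262 + "AC"          # 1,050 bases (hypothetical sequence)
tokens = kmer_tokenize(seq, k=7, stride=7)
print(len(tokens), tokens[:3])     # 150 ['ACGTACG', 'TACGTAC', 'GTACGTA']
```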
