Bioinformatics. 2021 Aug 9;37(15):2112-2120.
doi: 10.1093/bioinformatics/btab083.

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Yanrong Ji et al. Bioinformatics.

Abstract

Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios.

Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrated its ease of use, accuracy and efficiency. We show that a single pre-trained Transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled datasets. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks.

Availability and implementation: The source code and pre-trained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT).

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
Details of the architecture and characteristics of the DNABERT model. (a) Differences between RNN, CNN and Transformer in understanding contexts. T1 to T5 denote embedded tokens that are input into the models to develop hidden states (white boxes; the orange box is the current token of interest). An RNN propagates information through all hidden states, and a CNN takes local information in developing each representation. In contrast, Transformers develop global contextual embeddings via self-attention. (b) DNABERT uses tokenized k-mer sequences as input, which also contain a CLS token (a tag representing the meaning of the entire sentence), a SEP token (sentence separator) and MASK tokens (representing masked k-mers in pre-training). The input passes through an embedding layer and is fed to 12 Transformer blocks. The first output among the last hidden states is used for sentence-level classification, while the outputs for individual masked tokens are used for token-level classification. Et, It and Ot denote the positional embedding, input embedding and last hidden state at token t, respectively. (c) DNABERT adopts general-purpose pre-training, which can then be fine-tuned for multiple purposes using various task-specific data. (d) Example overview of global attention patterns across 12 attention heads, showing DNABERT correctly focusing on two important regions corresponding to known binding sites within the sequence (boxed regions, where self-attention converged).
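The k-mer tokenization described in panel (b) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `tokenize_kmers` and the bracketed special-token strings are assumptions for illustration, following the BERT convention of CLS/SEP markers around the sequence.

```python
def tokenize_kmers(seq, k=6):
    """Split a DNA sequence into overlapping k-mer tokens (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Example with k=3 for readability (DNABERT uses k from 3 to 6):
tokens = tokenize_kmers("ATGCGT", k=3)
# tokens == ['ATG', 'TGC', 'GCG', 'CGT']

# BERT-style input: a sentence-level CLS tag, the k-mers, then a separator.
input_tokens = ["[CLS]"] + tokens + ["[SEP]"]
```

During pre-training, a fraction of the k-mer tokens would be replaced with a MASK token and the model trained to recover them from the surrounding context.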
Fig. 2.
DNABERT significantly outperforms other models in identifying promoter regions. (a) (Left to right) Accuracy, F1 and MCC of Prom-300 prediction on the TATA, noTATA and combined datasets. (b) Stacked barplots showing F1 (left) and MCC (right) of Prom-scan predictions in different settings. (c–f) ROC (c, TATA; d, noTATA) and precision-recall (PR) curves (e, TATA; f, noTATA) with adjusted P-values from the DeLong test. (g) (Left to right) Accuracy, F1 and MCC of core promoter prediction on the TATA, noTATA and combined datasets.
Fig. 3.
DNABERT accurately identifies TFBSs. Violin plots showing accuracy (top left), precision (top right), recall (middle left), F1 (middle right), MCC (bottom left) and AUC (bottom right) of TFBS prediction with the ENCODE 690 ChIP-Seq datasets. Pairwise comparisons used the Wilcoxon one-sided signed-rank test (n = 690), with P-values adjusted by the Benjamini-Hochberg procedure. Global hypothesis testing across all models was done by the Kruskal-Wallis test (n = 690).
Fig. 4.
Visualizations of attention and context by DNABERT-viz. (a) Attention maps of two example ChIP-Seq-validated TAp73-beta binding sites (top, middle) and one non-binding site (bottom). The numbers below represent binding scores previously predicted by P53Scan. (b) Attention landscapes of TATA (top) and noTATA (bottom) promoters in the Prom-300 test set. (c, d) Example attention landscapes for individual ENCODE 690 datasets. CTCF (left) is of good quality, while SMARCA4 (right) is of concerning quality. (e) Attention-head (context) plots of a p53 binding site. (Left) Sentence-level self-attention across all heads; (middle left, middle right, right) attention of the 'CTT' token within one of the important regions, with only attention ≥ 0.2, 0.4 and 0.6 shown, respectively. The heatmap on the left shows the corresponding attention head.
Fig. 5.
DNABERT significantly outperforms other models in finding splice sites. (a) (Left to right) Multiclass accuracy, F1 and MCC of splice donor and acceptor prediction. GBM: gradient boosting; LR: logistic regression; DBN: deep belief network; RF: random forest; tree: decision tree; SVM_RBF: support vector machine with radial basis function kernel. (b, c) ROC (top) and PR curves (bottom) on the splice donor (b) and acceptor (c) datasets with adjusted P-values from the DeLong test.
Fig. 6.
DNABERT identifies functional genetic variants, and pre-training is essential and generalizable. (a–c) Mutation maps of difference scores (top three) and log-odds ratio scores (logOR, bottom three). Each mutation map contains the attention score indicating the importance of the region (top), scores for the wild type (WT, middle) and scores for the mutant (mut, bottom). (Left to right) A rare deletion within a CTCF binding site inside the MYO7A gene in the ECC-1 cell line completely disrupts the binding site; a rare single-nucleotide variant (SNV) at the initiator codon of the SUMF1 gene also disrupts a YY1 binding site (5'-CCGCCATNTT-3'); a common intronic SNP within the XPC gene weakens a CTCF binding site and is associated with pancreatic cancer. (d) Fine-tuning loss of pre-trained (pre) versus randomly initialized (init) DNABERT on Prom-300 (left) and Prom-core (right). (e) p53 attention maps for the randomly initialized (top), pre-trained (middle) and fine-tuned (bottom) DNABERT model. (f) Mean accuracy (top left), F1 (top right), MCC (bottom left) and AUC (bottom right) across 78 mouse ENCODE datasets.
