Ramana V Davuluri | Stony Brook Dept of Biomedical Informatics

Ph.D., Distinguished Professor

Department of Biomedical Informatics

Stony Brook Cancer Center

Stony Brook, NY 11794

Phone: (631) 638-2590
Email: Ramana.Davuluri@stonybrookmedicine.edu

INTERESTS
Machine Learning; Molecular Data Science; Foundation Models for Genomics; DNABERT; Bioinformatics; Cancer Genomics

BIOGRAPHY

Ramana Davuluri is a world leader in Molecular Data Science, with a research focus on computational analysis of non-coding genomic regions and isoform-level gene regulation. Davuluri is well known for his pioneering efforts to use machine learning in biomedical applications, for example, the development of gene promoter prediction algorithms, molecular subtyping assays for glioblastoma and ovarian cancers, a deep learning algorithm for integrative learning from multi-omics datasets, and informatics pipelines for the analysis of cancer drug target interactions affected by alternative splicing. Recently, Davuluri has developed novel deep learning approaches for understanding the DNA language and how genetic and epigenetic changes in the non-coding genome alter DNA linguistics. Working at the interface of artificial intelligence and genomics, the Davuluri group was one of the first to develop a genomic large language model, called “DNABERT”. Released in 2021, DNABERT has been widely used in understanding and decoding genomic and epigenomic languages. The DNABERT model can predict allele-specific activity based only on local nucleotide sequence context, and prioritize candidate transcription-factor-binding sites, core promoters and splice sites that are sensitive to variants at genome scale. Building on DNABERT’s success, his group is developing informatics methods to calculate genome-wide mutational scores based on whole-genome sequence data, and to integrate other biomedical data, such as histology and RNA expression, to improve prognosis and cancer outcome prediction.

RESEARCH

Hypothesis: The central hypothesis of his laboratory research is that the isoform-level gene products, “transcript variants” and “protein isoforms”, are the basic functional units in the mammalian cell, and accordingly, the informatics platforms, ranging from basic molecular biology data management systems to biomarker and therapeutic drug target discovery for precision medicine, should adopt “gene isoform centric” rather than “gene centric” approaches.

The complexity of mammalian gene structure and its regulation (see our review paper in Pharmacology and Therapeutics journal for further discussion).

Summary of Ongoing Research Projects:

Developing novel deep-learning based methods for deciphering non-coding gene regulatory code (1R01LM013722): This project develops a novel pre-trained model, DNA Bidirectional Encoder Representations from Transformers (DNABERT), and associated deep-learning tools to decipher the language of non-coding DNA and facilitate integration of gene regulatory information from rapidly accumulating sequence data with NLM’s genetic databases (for example, dbSNP, dbGaP and ClinVar), which serve both scientists and public health by helping identify the genetic components of disease. The goal of the proposed research is to develop DNABERT for a variety of sequence prediction tasks and benchmark it against existing state-of-the-art deep-learning based methods. The specific aims are to (1) develop novel deep-learning methods by adapting BERT; (2) apply the proposed deep-learning methods to specifically target non-coding DNA sequence analyses and predictions; and (3) predict and validate functional non-coding genetic variants by applying DNABERT prediction models. A major contribution of the proposed research is the development of the pre-trained DNABERT model and prediction algorithms, which provide powerful new methods for analyses and predictions of DNA sequences.

Molecular signaling in mechanobiology regulation by single-cell analyses using bioinformatics approach (NSF; PI: Yi-Xian Qin, co-PI: Davuluri): The project will develop a single-cell multiplex in situ tagging (scMIST) system combined with advanced machine learning algorithms. Through successive rounds of labeling and imaging, the system aims to achieve a multiplexity of thousands of data points using a common fluorescence microscope and a simple procedure in a typical biological laboratory setting, which has the potential to revolutionize the field of mechanical biology. The Davuluri group provides bioinformatics and machine-learning expertise to the research project.

The spinal cell atlas of opioid-targeted inflammasomes in the HIV pain model: mechanism and pathogenic roles (NIDA/NIH; 1R01DA062257; PI: Tang, Shao Jun; Co-I: Davuluri): Davuluri, as co-Investigator, provides bioinformatics support for analysis of the NGS data that will be generated by this project.

DNABERT for interpretation of germline and somatic non-coding mutations in cancer: This project aims to develop (1) novel deep-learning methods, by adapting DNABERT, to identify combinations of regulatory SNPs that are associated with cancer risk and survival, and (2) software for region selection and prediction models that demonstrate the utility of genome-wide DNABERT-mv based mutational profiles for cancer prediction tasks. Variant data from the large All of Us cohort will be used for training the models, while TCGA, ICGC and other cancer data portals will be used for testing the models.

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome: Understanding the hidden instructions within the genome on gene regulation is crucial for biological research. However, complex language patterns widely exist in DNA, such as polysemy and distant semantic relationships, which previous methods often fail to capture, especially in data-scarce scenarios. For the first time, the Davuluri group (in collaboration with Dr. Han Liu, Department of Computer Science, Northwestern University) is developing DNABERT to form a global understanding of genomic sequences based on upstream and downstream sequence contexts. Using an innovative global contextual embedding of input sequences, DNABERT attempts to tackle the problem of sequence specificity prediction with a “top-down” approach: it develops a general understanding of DNA language via self-supervised pre-training and then applies it to specific tasks (for example, prediction of promoters, transcription factor binding sites and splice sites), in contrast to the traditional “bottom-up” approach using task-specific data. Various modules of DNABERT are currently under development. It is anticipated that DNABERT pre-trained on the human genome can also be readily applied to data from other organisms with exceptional performance.
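The self-supervised pre-training idea described above can be illustrated with a toy sketch. It assumes, as is not stated on this page, that a DNA sequence is split into overlapping k-mer tokens and that a random subset of tokens is hidden so a model can be trained to recover them from the surrounding context. The function names (kmer_tokens, mask_tokens), the 15% masking rate and the "[MASK]" symbol are illustrative choices, not the DNABERT code itself.

```python
import random

def kmer_tokens(seq, k=6):
    """Split a DNA string into overlapping k-mers (stride 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Hide a random subset of tokens.

    Returns (masked, labels): masked is the model input, and labels holds
    the original token at every hidden position and None elsewhere.
    The masking rate and mask symbol are assumptions for illustration.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked[i] = mask_token
            labels[i] = tok
    return masked, labels

if __name__ == "__main__":
    tokens = kmer_tokens("ACGTACGTTAGC", k=3)
    masked, labels = mask_tokens(tokens, mask_rate=0.3, seed=1)
    print(tokens)
    print(masked)
```

A model pre-trained to fill in the hidden tokens would then be fine-tuned on labeled examples for a specific task, such as promoter prediction, which is the "top-down" order the paragraph describes.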

Algorithms and bioinformatics software for promoter prediction: As a postdoctoral fellow in Michael Zhang's group at Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, Davuluri developed novel algorithms and computer programs for predicting gene promoters and non-coding first exons, a highly complex problem that had remained a critical gap in gene prediction for several years. These creative and groundbreaking approaches facilitated the prediction and annotation of Pol-II promoters in human and mouse genomes.

Data-mining methods for integrative analysis of transcriptome and epigenome data: In collaboration with Dr. Tim Huang's group at Ohio State University Comprehensive Cancer Center, Davuluri applied integrative microarray technology and statistical modeling approaches to predict which proteins work with estrogen to contribute to breast cancer development. The computational predictions in this study indicated that the interaction of estrogen with one of seven different partner proteins determines whether the gene is activated or suppressed in breast cancer cells. This was a noteworthy data-mining methodology breakthrough because it allowed integrative analysis of big datasets, often consisting of expression, chromatin landscaping data and TF-binding information for thousands of genomic loci, for predicting small networks of transcription factors, which could then be validated by more traditional experimental biology techniques. This is one of the early integrative chromatin immunoprecipitation (ChIP)-chip/ChIP-seq and gene expression profiling studies, which laid the foundations for efficient methods for integrative analysis of multi-omics data on a genomic scale.

Isoform-level gene expression and regulation in mammalian development and cancer: Recent genome-wide studies have discovered that the majority of human genes produce multiple transcript variants/protein isoforms, which could be involved in different functional pathways. Moreover, altered expression of specific isoforms for numerous genes is linked with cancer and its prognosis, as cancer cells manipulate regulatory mechanisms to express specific isoforms that confer drug resistance and survival advantages. For example, cancer-associated alterations in alternative exons and splicing machinery have been identified in cancer samples, suggesting that specific transcript variants could be more effective as diagnostic and prognostic markers than the corresponding genes. In a recent study using integrative NextGen sequencing based experimental approaches and bioinformatics analysis, the Davuluri group discovered that the majority of genes associated with neurological diseases express multiple transcripts through alternative promoters. The study also observed aberrant use of alternative promoters and splice variants in different cancers. Subsequently, his group demonstrated that cancer cell lines, regardless of their tissue of origin, can be effectively discriminated from non-cancer cell lines at the isoform level, but not at the gene level. The novel informatics methods have been successfully applied by his collaborators in different cancer studies.

Platform-independent isoform-level gene signatures for stratification of cancer patients into molecular subgroups: Based on recent studies from the Davuluri group and others, significant expression differences were observed between different sample groups (e.g., developmental stages, cancer subtypes, normal vs cancer) for numerous genes at the isoform level but not at the overall gene level. The Davuluri group investigated whether the isoform-level transcriptome changes could provide better patient stratification in terms of overall prognosis and classification accuracy. His group developed novel methods, by integrating data discretization, feature selection, and meta-classification algorithms, for deriving platform-independent gene signatures for multi-label molecular stratification of cancer patients from exon-array and RNA-seq data. The application of these algorithms has led to the development of new methods for diagnosis of glioblastoma and other cancers and to the investigation of the effect of alternative splicing on drug-target gene interactions.
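As a rough illustration of why data discretization can help make a signature platform-independent, the sketch below bins each sample's isoform expression values by their within-sample rank, so that only the ordering of isoforms matters and not the measurement scale. This equal-frequency scheme and the function name discretize_sample are assumptions chosen for illustration; the page does not say which discretization methods the group used.

```python
def discretize_sample(values, n_bins=3):
    """Map each value to a bin index in 0..n_bins-1 by within-sample rank.

    Equal-frequency binning: the lowest-ranked values get bin 0 and the
    highest-ranked values get bin n_bins-1. Ties keep their input order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // n
    return bins

if __name__ == "__main__":
    # Hypothetical expression values for the same six isoforms measured on
    # two platforms with very different scales.
    array_like = [7.1, 2.3, 9.8, 4.4, 3.0, 8.2]
    count_like = [510.0, 12.0, 2300.0, 80.0, 35.0, 900.0]
    print(discretize_sample(array_like))
    print(discretize_sample(count_like))
```

Because both hypothetical samples rank the isoforms in the same order, they map to the same discrete profile, which is the property a downstream feature-selection and classification step would rely on.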

Algorithms and bioinformatics software for analyses of NextGen sequence data: Mapping genome-wide data to human subtelomeres has been problematic due to incomplete assembly and the challenges of low-copy repetitive DNA elements. The Davuluri group developed novel bioinformatics pipelines that incorporate multi-read mapping for annotation of the updated assemblies using short-read data sets from ChIP-seq and RNA-seq data. As part of other collaborative efforts, his group developed bioinformatics methods for identification of single-nucleotide polymorphisms (SNPs) that alter miRNA gene regulation and influence tumor susceptibility. Similarly, his group played a pivotal role in the development of informatics methods required for analysis of small-RNA sequence data, with the Nishikura group at the Wistar Institute, Philadelphia, PA.

PUBLICATIONS

Google Scholar | PubMed

Selected publications:

Surana P, Dutta P, Davuluri RV (2024). TransTEx: Novel tissue-specificity scoring method for grouping human transcriptome into different expression groups. Bioinformatics, 40 (8).
Zhou Z, Wu W, Ho H, Wang J, Shi L, Davuluri RV, Wang Z, Liu H. (2024) DNABERT-S: Learning species-aware DNA embedding with genome foundation models. arXiv.
Zhou Z, Ji Y, Li W, Dutta P, Davuluri RV, Liu H. (2024) DNABERT-2: Efficient foundation model and benchmark for multi-species genome. Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024); May 7–11, 2024; Vienna, Austria.
Ji Y, Dutta P, Davuluri RV. (2023). Deep Multi-Omics Integration by Learning Correlation-Maximizing Representation Identifies Prognostically Better Stratified Cancer Subtypes. Bioinformatics Advances, 3(1):vbad075.
Keathley R, Kocherginsky M, Davuluri R, Matei D. (2023). Integrated Multi-Omic Analysis Reveals Immunosuppressive Phenotype Associated with Poor Outcomes in High-Grade Serous Ovarian Cancer. Cancers (Basel), 15(14):3649.
Yang L, Dutta P, Davuluri R, Wang J. (2023) Rapid, High-Throughput Single-Cell Multiplex In Situ Tagging (MIST) Analysis of Immunological Disease with Machine Learning. Anal Chem. 95(19):7779-7787.
Ji Y, Tong X, Xu DD, Liao J, Davuluri RV, Yang GY, Mishra RK. (2023). A new robust classifier to detect hot-spots and null-spots in protein–protein interface: validation of binding pocket and identification of inhibitors in in vitro and in vivo models. Big Data Analytics in Chemoinformatics and Bioinformatics, 247-263.
Ji Y, Zhou Z, Liu H, Davuluri RV. (2021) DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. Epub 2021/02/05. doi: 10.1093/bioinformatics/btab083. PubMed PMID: 33538820.
Chakraborty A, Ay F, Davuluri RV. (2021) ExTraMapper: Exon- and Transcript-level mappings for orthologous gene pairs. Bioinformatics. Epub 2021/05/21. doi: 10.1093/bioinformatics/btab393. PubMed PMID: 34014317.
Ji Y, Mishra R, Davuluri RV (2020) In silico analysis of alternative splicing on drug target genes. Sci Rep. 10(1):134.
Shilpi A, Kandpal M, Ji Y, Seagle BL, Shahabi S, Davuluri RV (2019) Platform-independent classification system for predicting high-grade serous ovarian carcinoma molecular subtypes. JCO Clin Cancer Inform, 3, 1-9.
Dapas M, Kandpal M, Bi Y, Davuluri RV (2017). Comparative evaluation of isoform-level gene expression estimation algorithms for RNA-seq and exon-array platforms. Briefings in Bioinformatics Mar 1;18(2):260-269. doi: 10.1093/bib/bbw016.
Jin H-J, Jung S, DebRoy A, Davuluri RV (2016). Identification and validation of regulatory SNPs that modulate transcription factor chromatin binding and gene expression in prostate cancer Oncotarget 7:54616-54626, doi: 10.18632/oncotarget.10520.
Jung S, Bi Y, Davuluri RV (2015). Evaluation of data discretization methods to derive platform independent isoform expression signatures for multi-class tumor subtyping. BMC Genomics. Nov 10;16 Suppl 11:S3. doi: 10.1186/1471-2164-16-S11-S3. PMCID: PMC4652565.
Pal S, Bi Y, Macyszyn L, Showe LC, O’Rourke DM and Davuluri RV. (2014) Isoform-level gene signature improves prognostic stratification and accurately classifies glioblastoma subtypes. Nucleic Acids Res. Epub 2014/02/08. doi: 10.1093/nar/gku121. PubMed PMID: 24503249.
Bi Y and Davuluri RV (2013) NPEBseq: Nonparametric Empirical Bayesian based Procedure for Differential Expression Analysis from RNA-seq Data. BMC Bioinformatics, 14: 262. Highly accessed.
Ota, H.*, Sakurai, M.*, Gupta, R.*, Valente, L., Wulff, B.-E., Ariyoshi, K., Iizasa, H., Davuluri, R.V. and Nishikura, K. (2013) ADAR1 complexes with Dicer and plays a role in microRNA processing and RNA-induced gene silencing mechanisms. Cell, 153(3): 575-589. (*Equal contribution)
Zhang, Z., Pal, S., Tchou, J. and Davuluri, R.V. (2013) Isoform-level expression profiles provide better cancer signatures than gene-level expression profiles, Genome Medicine, Apr 17;5(4):33.
Pal, S., Gupta, R., Kim, H., Wickramasinghe, P., Baubet, V., Showe, L.C., Dahmane, N. and Davuluri, R.V. (2011) Alternative transcription exceeds alternative splicing in generating the transcriptome diversity of cerebellar development, Genome Research, 21(8): 1260-1272.

Davuluri R.V., Grosse, I. and Zhang, M.Q. (2001). Computational identification of first exons and promoters in the human genome. Nature Genetics, 29: 412-417. (Note: This work was featured in Nature Reviews Genetics, 3: 3-9; in Bioinformatics section of Highlights as “Filling the gap in gene prediction” January, 2002).

Present Lab Members & Trainees

Name	Duration	Qualification	Number of publications	Present position of mentee
Pallavi Surana	June 2021-Present	M.S. in Bioinformatics, Pursuing Ph.D. in Biomedical Informatics	1	PhD Student
Rekha Sathian	Sept. 2021-Present	M.S. in Bioinformatics, Pursuing Ph.D. in Biomedical Informatics	-	PhD Student
Nimisha Papineni	Sept. 2022-Present	M.S. in Statistics, Pursuing Ph.D. in Biomedical Informatics	-	PhD Student
Matthew Obusan	Sept 2023-Present	MD PhD (MSTP Student) Pursuing PhD in Biomedical Informatics		MSTP Student
Max Chao	Sept 2023-Present	MS in Computational Biology, Pursuing PhD in Biomedical Informatics		PhD Student
Chelsea Kirkland	April. 2024 - Present	Pursuing PhD in Pharmacology		PhD Student

Past Trainees in my Research Laboratory

Name	Duration	Qualification	Number of publications	Present position of mentee
Pratik Dutta	May. 2021- 2022	PhD in Computer Science	2	Sr Research Scientist, Stony Brook University
Russell Keathley	Aug. 2019 – 2023	Ph.D. in Biomedical Informatics	1	Graduate Research Assistant
Yanrong Ji	Oct. 2017 – May, 2021	Ph.D. in Biomedical Informatics	4	Senior Manager, Statistics at AbbVie
Manoj Kandpal	Mar. 2014 – June 2020	Ph.D. (Chemical Engineering, NUS, Singapore)	5	Director of Research Bioinformatics, Center for Clinical and Translational Science, Rockefeller University, New York, NY.
Yue Gao	Mar. 2019- Dec. 2020	M.S. Biotechnology Student (Bioinformatics Track)	-	Bioinformatics Intern, Division of Genetics and Genomics at Boston Children's Hospital and Harvard Medical School, Boston, MA.
Rethavathi Janarthanam	May. 2019- Dec. 2020	M.S. HBMI Student (Bioinformatics Track)	-	Research Scientist, Robert Lurie Children’s Research Institute, Northwestern University, Chicago, IL.
Mathew Dapas	Mar. 2015 – Mar. 2019	Ph.D. Student, Bioinformatics Track	1	Research Assistant Professor at Northwestern University, Chicago, IL.
Sudesh Pundir	January 2018 – January 2019	PhD (Statistics)	1	Asst Professor (From Pondicherry University, India)
Arunima Shilpi	Dec. 2016 – Dec. 2018	PhD, Bioinformatics	1	Postdoctoral Fellow
Yingtao Bi	Oct. 2009-March 2017	Ph.D. (Applied Statistics University of California)	6	Director of Bioinformatics, AbbVie, Boston, MA
HongJian Jin	August 2014 – 2017	Ph.D. in Cell Biology, Zhejiang University, Hangzhou, China. Postdoctoral Fellow (Division of Hematology/Oncology, Northwestern University Feinberg School of Medicine)	1	Research Assistant Professor, St. Jude Children’s Research Hospital, Memphis
Segun Jung	Nov. 2013 – Dec. 2015	Ph.D. (Computational Biology, NYU)	2	Associate Director, R&D and Clinical Bioinformatics, NeoGenomics Laboratories, Los Angeles, California
Auditi Debroy	October 2014 – September 2015	Ph.D. in Pharmacology, University of Illinois at Chicago (UIC)	-	Associate Director (Immuno-Oncology) at Bristol-Myers Squibb
Ferhat Ay	March, 2015 – Dec. 2015	Ph.D. in Computer Science University of Florida (UF)	1	Institute Leadership Associate Professor of Computational Biology, La Jolla Institute for Allergy and Immunology, San Diego, CA.
Arunima Shilpi	July 2014 – November 2014	Ph.D. Student, National Institute of Technology, Rourkela, Odisha, India	1	Bioinformatics Scientist, NIDDK, NIH, Bethesda, MD.
Sharmistha Pal	Sep. 2008-Feb. 2013	Ph.D. (Molecular, Cellular and Developmental Biology, OSU)	7	Senior Scientist, Haas-Kogan Lab, Dana Farber Cancer Institute, Harvard University, Boston, MA.
Ravi Gupta	Jul. 2008-Feb. 2012	Ph.D. (Computer Science & Engineering – Indian Institute of Technology, India)	12	VP of Bioinformatics, MedGenome Labs, India
Madhu Bhattacharjee	Nov. 2009-Apr 2010	Ph.D. (Statistics); Lecturer-B, University of St Andrews, Scotland	1	Associate Professor, University of Hyderabad, Hyderabad, India.
Murali Bashyam	Jan. 2013 – Jun, 2013	Ph.D. (Biochemistry); Group Head (PI) of Laboratory of Molecular Oncology, Centre for DNA Fingerprinting and Diagnostics (CDFD), Hyderabad, India	-	Group Head (PI) of Laboratory of Molecular Oncology, Centre for DNA Fingerprinting and Diagnostics (CDFD), Hyderabad, India
Hyunsoo Kim	May 2009-Aug. 2011	Ph.D. (Computer science, University of Minnesota)	4	Bioinformatics Scientist, Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
Zhongfa (Jacob) Zhang	Oct. 2010-Mar. 2012	Ph.D. (Biostatistics, Case Western Reserve University, Cleveland, OH)	1	Bioinformatics Scientist, Millennium Pharmaceuticals, Cambridge, MA
Priyankara Wickramasinghe	May. 2006-Apr. 2008	Ph.D. (Physics, University of Cincinnati, OH)	1	Associate Managing Director of Bioinformatics and Associate Wistar Scientist, The Wistar Institute, Philadelphia, PA.
Anirban Bhattacharya	Nov. 2007-May, 2010	Ph.D. (Physics, OSU, Columbus, OH)	1	SVP, Data Science at Chubb, Boston, MA.
Hao Sun	April 2002-Dec. 2003	Ph.D. (Environmental Chemistry, Nanjing University, China)	11	Full Professor, Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Hong Kong
Gregory Singer	May 2004-Apr. 2008	Ph.D. (Bioinformatics & Genomics, University of British Columbia, Vancouver, Canada)	3	Director of Research Grants at Ryerson University
Huaxia Qin	Sep. 2005-2007	Ph.D. (Plant Genetics & Molecular Biology – University of Tennessee, Knoxville); M.A. (Biostatistics, University of California, Berkeley)	1	Statistician, JP Morgan Chase Bank & Co., Columbus, OH
Victor Jin	April 2003-Nov. 2005	Ph.D. (Supramolecular Chemistry, Queen’s University, Ontario, Canada)	8	Linda T. and John A. Mellowes Endowed Chair of Bioinformatics and Data Analytics; Professor, IHE/Biostatistics, Medical College of Wisconsin, Milwaukee, WI.

Rotation Students and CSIRE Program Trainees

S. No.	Name	Duration	Qualification	Present position of mentee
1	Margalit Mitzner	Summer, 2024	MSTP student	PhD Student of Dr. Chao Chen & Dr. Pratik Prasanna
2	Naheel Khatri	Summer, 2024	MSTP student	PhD Student of Dr. Mehdi Damaghi & Dr. Chao Chen
3	Pranav Mukhi	Summer, 2024	High School Student (CSIRE program)	-
4	Mannat Vikramaditya Jain	Summer, 2024	High School Student (CSIRE program)	-

AWARDS AND ACTIVITIES

Dr. Davuluri’s awards include the Young Scientist Award – Merit Certificate (Statistics) from the Indian Science Congress Association, 84th annual session (1996-97), and the V Scholar Award from the V Foundation for Cancer Research. He held the Philadelphia Healthcare Trust Endowed Chair and the Tobin Kestenbaum Family Endowed Professorship while on the faculty at The Wistar Institute, Philadelphia. He is currently serving as a regular member of the Biomedical Informatics, Library and Data Science (BILDS) Review Committee, National Library of Medicine, NIH.

TEACHING AND TRAINING SUMMARY

Davuluri has trained 20 graduate students and postdocs, and 4 junior investigators so far. These include trainees who have gone on to faculty positions in academia (Sharmistha Pal, Scientist, Dana Farber Cancer Institute, Harvard University, Boston, MA; Hao Sun, Associate Professor, The Chinese University of Hong Kong, Hong Kong; Victor Jin, Professor, Dept of Molecular Medicine, UTHSCSA, San Antonio, TX) or to leadership positions in industry or academia (Yingtao Bi, Bioinformatics Director, AbbVie Inc, Boston).

Course Director and Teaching:

Course Director, BMI 511 Translational Bioinformatics