Bioinformatics, Machine Learning (ML), and Artificial Intelligence (AI)

Keywords: Machine Learning, Natural Language Processing, Electronic Health Records, Infectious Disease, Genomics, noncoding RNA, Gene Regulation


Our expertise is in implementing bioinformatics methods and machine learning models to answer biological research questions. The group has two research focus areas:


1) Health Informatics

We apply cutting-edge AI and genomics technologies to discover new treatments and improve healthcare, with significant outcomes for the academic and clinical communities. Our approach employs machine learning methods to automatically learn complex features from individual data types and to harmonise heterogeneous multimodal information.

Understanding complex diseases requires the integration of multi-modal big data, with each modality contributing prominent features of biological significance. Existing integrative approaches merge multi-modal data during post-processing, which risks losing the quantitative information of individual modalities and can lead to erroneous analysis. We address this problem by using machine learning to analyse large volumes of sequencing, image and digital health data in their raw form. A critical advantage of this design is that it limits the number of significant assumptions, because the user inputs data in its primary form. This reduces information loss and increases sensitivity to weak signals in the data, as well as robustness and reproducibility.

Our work not only opens up new avenues for genome-editing-based therapy of diseases but also aims to create a tracking and response system that leads to earlier detection of superbugs, personalised treatment for patients and prevention of outbreaks.
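The contrast drawn above, between analysing each modality on its own and analysing modalities together, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the group's pipeline: the two "omics blocks" are simulated, the names `block_a`, `block_b` and `recovery` are hypothetical, and a principal component analysis of the concatenated, standardised blocks is used as a simple stand-in for joint analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, loading = 300, 40, 0.3

# Simulated data (an assumption for illustration): one hidden biological
# signal weakly drives every feature in both hypothetical omics blocks.
latent = rng.standard_normal(n_samples)
block_a = loading * latent[:, None] + rng.standard_normal((n_samples, n_features))
block_b = loading * latent[:, None] + rng.standard_normal((n_samples, n_features))


def standardise(block):
    """Centre each feature and scale it to unit variance."""
    return (block - block.mean(axis=0)) / block.std(axis=0)


def first_component_scores(block):
    """Return per-sample scores on the first principal component via SVD."""
    u, s, _ = np.linalg.svd(standardise(block), full_matrices=False)
    return u[:, 0] * s[0]


def recovery(block):
    """Absolute correlation between the first component and the hidden signal."""
    return abs(np.corrcoef(first_component_scores(block), latent)[0, 1])


separate_a = recovery(block_a)                   # block A analysed alone
separate_b = recovery(block_b)                   # block B analysed alone
joint = recovery(np.hstack([block_a, block_b]))  # both blocks analysed together

print(f"block A alone: {separate_a:.2f}")
print(f"block B alone: {separate_b:.2f}")
print(f"joint analysis: {joint:.2f}")
```

In this toy setting the jointly analysed data would typically recover the weak shared signal at least as well as either block alone, because more features carry it. Real multi-omics data differ in scale, noise and missingness, so this is only an intuition aid.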



 COVID-19 drug target identification pipeline 

A Machine Learning-based multivariate approach to harmonise multi-omics data from SARS-CoV-2



2) Integrative Genomics

Our research interest is in combining data from multiple genomics layers to generate gene regulatory signatures. We have developed computational methods to integrate epigenomics and transcriptomics data. We study the non-coding parts of the genome, which comprise DNA regulatory elements such as promoters and enhancers and the genomic regions encoding small and long non-coding RNAs (ncRNA).

There are more than 30 trillion cells in the average human body, but essentially all of them carry the same DNA. How, then, do different cells express different information from that DNA? They do so by turning different sets of genes 'on' and 'off' under different conditions or states. The epigenome can change the way cells use the instructions encoded in the DNA without changing the DNA itself. The components of the epigenome include chemical compounds, proteins and non-coding RNAs. Non-coding RNAs tell other epigenetic players what to do: they direct the show and orchestrate changes. Our group is teasing apart the biogenesis and function of short and long non-coding RNAs and the role they play in epigenomic gene regulation.


                                                

                "The answers to gene regulation of development and disease lie encrypted in the epigenome" 


Computational methods and data analysis pipelines developed by us:


  1. CID-miRNA implements a probabilistic model to predict pre-miRNA structures from genomic data (Tyagi et al 2008; Dubrovski & Tyagi 2015; Stark, Tyagi et al 2010).
  2. Evolutionary Motif Scan (EMSCAN) implements evolutionary models for scanning known protein-binding motifs (Tyagi et al, unpublished).
  3. Co-regulatory motif cluster (CRC) finder and CRC database (Bonu & Tyagi 2020, unpublished).
  4. Linc2Function: a deep learning model for functional annotation of long non-coding RNA (lncRNA) (Manuscripts: 1. Ramakrishnaiah, Kuhlmann & Tyagi 2020; 2. Ramakrishnaiah & Tyagi 2021).
    1. DATA: lncRNA annotation summary from available public resources
    2. linc2function pipeline source on GitLab (MIT license)
  5. multiomics: a user-friendly multi-omics data harmonisation R pipeline (Manuscript: Chen et al 2021, under review, F1000).
  6. Multi-omics data harmonisation by applying machine learning frameworks (Manuscript: Chen et al BIB 2021; Review: Chen & Tyagi 2020; Poster: Chen et al 2019).
  7. Noncoding variant detection and prioritisation in diseases (Chahal et al 2019).
  8. De novo assembly of genome/transcriptome/meta-genome (Sarristo et al 2021).
  9. Annotation of somatic and germline variants in WES or WGS data (Stark et al 2012, Nature Genetics 44(2), 165-169).


Current Projects   


AI to Translate Electronic Medical Records & Genomic Data into Clinical Assets

Antimicrobial resistance (AMR) is the biggest cause of hospital infections that significantly affect patient survival and length of stay, with associated health care costs estimated in the billions.

Combined with electronic health data, genomic data can help researchers and clinicians discover early signs of antibiotic resistance or determine an individual’s risk of developing AMR. Genomics can point to the underlying causes of clinical changes, leading to more personalised, effective treatments.


Multi-omics Data Harmonisation

In this project, we are working on harmonising data from multiple modalities to build regulatory and functional signatures of a biological process.

We are working on a universal harmoniser that has major medical implications: 1) identifying the dysregulated biological pathways responsible for a disease provides a powerful diagnostic tool; 2) investigating these pathways further allows the biological community to better understand a disease’s mechanisms; and 3) precision medicine also benefits from developments in this area, particularly in the context of the growing field of selective epigenome editing, which can suppress or induce a desired phenotype.


Application of AI in Genomics

In this industry-funded collaboration with Reliance India Ltd. and the Monash-IITB academy, we are building Asia-Pacific resources for studying disease-causing genomic variants. We are focusing on developing diverse, population-specific datasets and computational pipelines to study disease genomes.



Multimodal and Machine Learning Approach to Study Antibiotic-Resistant Pathogens

As part of the national data framework project led by Bioplatforms Australia Ltd., genomics, transcriptomics, proteomics and metabolomics data were generated from bacteria known to cause sepsis in hospitals. In many cases, these microbes are resistant to antibiotics. Developing biomarkers and characterising the functional molecules involved in gene regulation in these microbes is important for designing targeted drugs for sepsis. Our group is studying the transcriptomics of these bacteria and contributing to the development of a multi-omics data integration framework.


Noncoding RNA Structure and Functions

Noncoding RNAs (ncRNAs) do not usually code directly for protein, and the regions that encode them used to be thought of as “junk” DNA with little to no functional significance. More recently, they have started to garner attention because of the realisation that they play a vital role in genome regulation. Our group has been developing pipelines to predict the primary and secondary structures of both short and long non-coding RNAs using probabilistic and machine learning frameworks.


Regulatory Protein Binding Site Detection

DNA or RNA motifs are short (5-20 bp) recurring patterns that are presumed to carry out a biological function by binding proteins. Searching for these small patterns in large genomic data (up to billions of bp) is very challenging. To address this, we have built an analysis pipeline that considers a hierarchy of increasingly sophisticated motif scanning algorithms and tests their ability to identify known binding sites in a genomic sequence. Further, we are implementing a deep learning framework to predict the most probable combination of two or more such motifs from the various permutations possible in a given sequence.
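One of the simpler algorithms in such a hierarchy is a position weight matrix (PWM) scan, sketched below. This is an illustrative example only, not the group's pipeline: the 4-bp motif counts, the uniform background, the pseudocount and the score threshold are all made-up assumptions, and `build_pwm` and `scan` are hypothetical helper names.

```python
import math

# Hypothetical per-position base counts for a 4-bp motif (illustration only).
COUNTS = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
# Assumed uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_pwm(counts, background, pseudocount=1.0):
    """Convert per-position counts to log2-odds scores against the background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(column)
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background[base])
            for base in column
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Slide the PWM along the sequence; return (start, window, score) hits."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


PWM = build_pwm(COUNTS, BACKGROUND)
HITS = scan("TTACGTGGNACGTAA", PWM, threshold=4.0)
for start, window, score in HITS:
    print(start, window, round(score, 2))
```

With these toy values the scan reports the two exact ACGT occurrences, at 0-based positions 2 and 9, and skips every window that contains the ambiguous base N. A genome-scale scan would also need to cover the reverse strand and choose thresholds statistically, which this sketch leaves out.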




Machine Learning-based Data Harmonisation for SARS-CoV-2 Drug Target Identification

We demonstrate that a holistic cross-omics approach is needed for studying complex phenotypes, such as COVID-19, to develop biomarkers from a system-level perspective.

Secondly, we present an open-access, flexible, machine learning multivariate approach to extract relevant information from noisy, heterogeneous multi-omics data.

We define data harmonisation as simultaneous/parallel multi-omics integration to highlight the inter-relationships of disease-driving biomolecules. This is in contrast to comparing processed information from each omics level separately.


Unravelling Regulatory Mechanisms of Preterm Labour

Preterm birth is a prominent cause of infant death and permanent disability. In Australia, ~9% of births are preterm (<37 weeks). Currently, the mechanisms regulating activation of the muscle of the uterus are unknown, making treatment difficult. We have sequenced RNA from the myometrium of women in term and preterm labour to assess the transcriptome and develop a novel computational model of the process of labour. We are working on a new hypothesis that “junk” DNA regulates the transformation of the uterus at labour. This new knowledge may lead to improvements in the diagnosis and treatment of preterm labour.