Computational Biology, Machine Learning (ML), and Artificial Intelligence (AI)
Keywords: Machine Learning, Natural Language Processing, Electronic Health Records, Complex Diseases, Genomics, noncoding RNA, Gene Regulation
Computational pipelines and research software developed by us:
Our expertise is in implementing Bioinformatics methods and machine learning models to solve biological research and clinical outcome questions. The two research focus areas for the group are:
1) Multimodal data integration for personalised medicine
We use cutting-edge AI and genomics technologies to discover new treatments and improve healthcare, with outcomes relevant to both the academic and clinical communities. Our approach employs machine learning methods to automatically learn complex features from individual data types and to harmonise heterogeneous multimodal information.
Solving complex diseases requires the integration of multimodal big data comprising genomic and healthcare information. Healthcare data contain an individual's medical history in the form of digital health records, whereas multi-omics genomic assays generate sequence and numerical data at large scale. Existing integrative approaches merge multimodal data during post-processing, which risks losing quantitative information from individual modalities and can lead to erroneous analysis. We address this problem by using machine learning to analyse large volumes of sequencing, image and digital health data in their raw form. A critical advantage of this design is that it limits the assumptions imposed on the data, because the user inputs data in its primary form. This reduces information loss and increases sensitivity to weak signals in the data, as well as robustness and reproducibility.
2) Integrative Genomics
Our research interest is in combining data from multiple genomic layers to generate gene regulatory signatures. We have developed computational methods to integrate epigenomics and transcriptomics data. We study non-coding parts of the genome, comprising DNA regulatory elements such as promoters and enhancers, and genomic regions encoding small and long non-coding RNAs (ncRNAs).
There are more than 30 trillion cells in the average human body, but all of them carry essentially the same DNA. How do different cells express the information in the DNA? This is achieved by different cells turning different sets of genes 'on' and 'off' under different conditions or states. The epigenome can change the way cells use instructions coded into RNAs without changing the DNA itself. These epigenomic components consist of chemical compounds, proteins, or non-coding RNAs. Non-coding RNAs tell other epigenetic players what to do: they direct the show and orchestrate changes. Our group is teasing apart the biogenesis and function of short and long non-coding RNAs and the roles they play in epigenomic gene regulation.
"The answers to gene regulation of development and disease lie encrypted in the epigenome"
Current Projects
AI to Translate Electronic Medical Records & Genomic Data into Clinical Assets | Decoding genomic grammar using DNA language models | Multimodal Learning for Personalised Medicine |
Antimicrobial resistance (AMR) is the biggest cause of hospital infections that significantly affect patient survival, length of stay and healthcare costs, which are estimated in the billions. Combined with electronic health data, genomic data can help researchers and clinicians discover early signs of antibiotic resistance or determine an individual’s risk of developing AMR. Genomics can point to the underlying causes of clinical changes, leading to more personalized, effective treatments. | Genome language modelling: we are applying NLP methods to process genomic data for downstream ML applications. We have developed genomicBERT models to integrate genomic data for ML modelling tasks. We are extending these approaches to RNA and amino acid sequences. | In this industry-funded collaboration with the Monash-IITB academy, we are building Asia-Pacific resources for studying disease-causing genomic variants. We are focusing on developing diverse, population-specific datasets and computational pipelines to study disease genomes. We are developing deep learning models to fuse data from multiple modalities. |
Multiomics data integration to Study Antibiotic-Resistant Pathogens | Noncoding RNA Structure and Functions | Regulatory Protein Binding Site Detection on DNA and RNA |
As part of the national data framework project led by Bioplatforms Australia Ltd., genomics, transcriptomics, proteomics, and metabolomics data were generated from bacteria that are known to cause sepsis in hospitals. In many cases, these microbes are resistant to an antibiotic. Developing biomarkers and characterizing functional molecules involved in gene regulation of these microbes is important for designing targeted drugs for sepsis. Our group is working on harmonising data from multiple modalities to build regulatory and functional signatures of a biological process. | Noncoding RNAs (ncRNAs) do not usually code directly for protein, and the regions encoding them used to be thought of as “junk” DNA with little to no functional significance. Recently, they have started to garner attention with the realization that they play a vital role in genome regulation. RNAs typically function by interacting with other biomolecules such as DNA, RNA or protein via site-specific interactions. Our group has been developing pipelines to predict secondary structures and their functional motifs for both short and long non-coding RNAs using probabilistic and machine learning frameworks. | DNA or RNA motifs are short (5-20 bp) recurring patterns that are presumed to have a biological function by binding to proteins. Searching for these small patterns in large genomic datasets (up to billions of bp) is very challenging. To address this, we have built an analysis pipeline that applies a hierarchy of increasingly sophisticated motif scanning algorithms and tests their ability to identify known binding sites in a genomic sequence. Further, we are implementing a deep learning framework to predict the most probable combination of two or more such motifs from the various permutations possible in a given sequence. |
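As a rough illustration of the simplest kind of motif scanning mentioned above, the Python sketch below scores every window of a sequence against a position weight matrix (PWM) and keeps the windows that score above a threshold. The matrix values, the example sequence, the threshold and the function names are all assumptions made for illustration; they are not taken from the group's pipeline.

```python
import math

BASES = "ACGT"
BACKGROUND = 0.25  # assumed uniform background base frequency

# Hypothetical 4-bp motif (illustrative values only).
# Each row is one motif position; columns are probabilities of A, C, G, T.
MOTIF_PROBS = [
    [0.05, 0.05, 0.05, 0.85],  # position 1 favours T
    [0.85, 0.05, 0.05, 0.05],  # position 2 favours A
    [0.05, 0.05, 0.05, 0.85],  # position 3 favours T
    [0.85, 0.05, 0.05, 0.05],  # position 4 favours A
]


def log_odds_matrix(probs, background=BACKGROUND):
    """Convert base probabilities into log2 odds against the background."""
    return [[math.log2(p / background) for p in row] for row in probs]


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for each window scoring >= threshold."""
    width = len(pwm)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in BASES for base in window):
            continue
        score = sum(pwm[i][BASES.index(base)] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


PWM = log_odds_matrix(MOTIF_PROBS)
hits = scan("GGCTATAAGGCGTATACC", PWM, threshold=5.0)
for start, window, score in hits:
    print(start, window, score)
```

With these toy values, only the two exact TATA windows (0-based positions 3 and 12) clear the threshold. A scan of this kind is linear in sequence length, so it is usually treated as a baseline against which more sophisticated scanners are compared.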
Machine Learning based Data Harmonisation for SARS-CoV-2 Drug Target Identification | AI-driven diagnosis of Preterm birth (PTB) and Unravelling Regulatory Mechanisms of Preterm Labour | Healthcare data processing and Clinical Data standards |
We demonstrate that a holistic cross-omics approach is needed for studying complex phenotypes, such as COVID-19, to develop biomarkers from a system-level perspective. Secondly, we present an open-access, flexible, multivariate machine learning approach to extract relevant information from noisy, heterogeneous multi-omics data. We define data harmonisation as simultaneous/parallel multi-omics integration that highlights the inter-relationships of disease-driving biomolecules. This is in contrast to comparing processed information from each omics level separately. [COMPLETED and CLOSED] | Preterm birth is a prominent cause of infant death and permanent disability. In Australia, ~9% of births are preterm (<37 weeks). Currently, the mechanisms regulating activation of the muscle of the uterus are unknown, making treatment difficult. We have sequenced RNA from the myometrium of women in term and preterm labour to assess the transcriptome and develop a novel computational model of the process of labour. We are working on a new hypothesis that “junk” DNA regulates the transformation of the uterus at labour. This new knowledge may lead to improvements in the diagnosis and treatment of preterm labour. | Electronic Health Records (EHRs) are the most common type of data accessible from hospitals. These data come in static, temporal and narrative text formats. They are full of errors, inconsistencies and anomalies that need to be processed before the information can be used for data science tasks. We are developing pipelines to process EHR data in an automated fashion, producing standardised data encodings and studying specific data metrics for downstream data science and data modelling applications. |
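To make the EHR standardisation step described above more concrete, here is a minimal Python sketch that maps inconsistent codes to a single encoding, parses dates written in different formats, and flags implausible values instead of silently correcting them. The records, field names, accepted date formats and the plausible temperature range are all hypothetical and chosen only for illustration; a real pipeline would follow the relevant clinical data standard.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are invented for illustration.
RAW_RECORDS = [
    {"patient_id": "P001", "sex": "F", "admitted": "2021-03-04", "temp_c": "37.2"},
    {"patient_id": "P002", "sex": "female", "admitted": "04/03/2021", "temp_c": "98.6"},
    {"patient_id": "P003", "sex": "m", "admitted": "not recorded", "temp_c": ""},
]

SEX_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumption: slash dates are day-first


def parse_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_temperature(text):
    """Return (value, flag); values outside an assumed 30-45 range are flagged."""
    try:
        value = float(text)
    except ValueError:
        return None, "missing"
    if 30.0 <= value <= 45.0:
        return value, "ok"
    return value, "out_of_range"


def standardise(record):
    """Map one raw record onto a consistent encoding."""
    temp, temp_flag = parse_temperature(record["temp_c"])
    return {
        "patient_id": record["patient_id"],
        "sex": SEX_CODES.get(record["sex"].strip().lower(), "U"),  # U = unknown
        "admitted": parse_date(record["admitted"]),
        "temp_c": temp,
        "temp_flag": temp_flag,
    }


CLEAN_RECORDS = [standardise(r) for r in RAW_RECORDS]
for row in CLEAN_RECORDS:
    print(row)
```

Flagging rather than repairing suspicious values (such as a temperature that was probably entered in Fahrenheit) keeps the original information available for later review, which matters when the cleaned data feed downstream models.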