AI for Bioinformatics

In this vertical, we explore applications of artificial intelligence in bioinformatics, where AI can help unravel biological complexity and advance precision medicine.

a. Ensembled Lower-dimensional Projections of Cellular Expression Improve the Cell Type Classification from Single-cell Sequencing

Single-cell RNA sequencing (scRNA-seq) enables the study of cellular diversity at the single-cell level. It provides a global view of cell-type specification during the onset of biological processes such as development and human organogenesis. Various statistical, machine learning, and deep learning methods have been proposed for cell-type classification. Most of these methods use unsupervised lower-dimensional projections obtained from a large reference dataset. In this work, we propose a reference-based method for cell-type classification called EnProCell. EnProCell first computes lower-dimensional projections that capture both high variance and class separability through an ensemble of principal component analysis and multiple discriminant analysis. In the second phase, EnProCell trains a deep neural network on the lower-dimensional representation of the data to classify cell types. The proposed method outperformed existing state-of-the-art methods when tested on four datasets produced by different single-cell sequencing technologies. EnProCell showed higher accuracy (98.91) and F1 score (98.64) than other methods for predicting cell types of reference data from reference datasets. Similarly, EnProCell performed better than existing methods in predicting cell types for data with unknown cell types (query) from reference datasets (accuracy: 99.52; F1 score: 99.07). In addition to improved performance, the proposed methodology is simple and does not require additional computational resources or time. EnProCell is available at https://github.com/umar1196/EnProCell.
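The two-phase idea described above can be sketched in a few lines. This is an illustrative sketch only, not the EnProCell implementation: it assumes scikit-learn is available, uses `LinearDiscriminantAnalysis` as a stand-in for multiple discriminant analysis, uses a small `MLPClassifier` in place of the deep neural network, and runs on synthetic data rather than real scRNA-seq counts. All sizes and hyperparameters are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a cells x genes matrix with 4 cell types (assumption).
X, y = make_classification(
    n_samples=600, n_features=200, n_informative=20,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
# Treat 75% of cells as the labelled reference and 25% as the query.
X_ref, X_query, y_ref, y_query = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Phase 1: ensemble of projections. PCA keeps high-variance directions,
# the discriminant analysis keeps class-separating directions.
projections = FeatureUnion([
    ("pca", PCA(n_components=20, random_state=0)),
    ("lda", LinearDiscriminantAnalysis(n_components=3)),  # at most n_classes - 1
])

# Phase 2: a neural network trained on the concatenated projections.
model = make_pipeline(
    StandardScaler(),
    projections,
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
)
model.fit(X_ref, y_ref)

embedding = model[:-1].transform(X_query)  # 20 PCA + 3 discriminant components
predicted = model.predict(X_query)
print(embedding.shape, predicted[:10])
```

Concatenating the two projections lets the classifier see both an unsupervised and a supervised view of the same cells; the published method should be consulted for the actual architecture and preprocessing.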

b. Deep-Ace: LSTM-based Prokaryotic Lysine Acetylation Site Predictor

Acetylation of lysine residues (K-Ace) is a crucial post-translational modification present in both prokaryotes and eukaryotes. It plays a significant role in disease pathology and cell biology, which makes the identification of K-Ace sites important. Previous approaches have analyzed the characteristics of K-Ace sites using hand-crafted features and encodings in machine learning models. However, these methods overlook long-term relationships within sequences, which degrades performance. In this study, we propose Deep-Ace, a deep learning framework built on a Long Short-Term Memory (LSTM) network that effectively captures and encodes long-term relationships in sequences. Such relationships are essential for learning discriminative sequence representations. We employ the LSTM for both deep feature extraction and K-Ace site prediction, with fully connected layers, in models for eight prokaryotic species: B. subtilis, C. glutamicum, E. coli, G. kaustophilus, S. eriocheiris, B. velezensis, S. typhimurium, and M. tuberculosis. All code will soon be made publicly available at https://github.com/Maham-Ilyas/Deep-Ace.
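As an illustration of how protein sequences might be prepared for a sequence model such as an LSTM, the sketch below extracts a fixed-length window around each lysine (K) and maps residues to integer indices. This is an assumption about a typical preprocessing step, not the Deep-Ace pipeline: the window size, padding symbol, alphabet ordering, and function names are all hypothetical.

```python
# Hypothetical preprocessing sketch: windows around lysines, integer-encoded.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
PAD = "X"                             # padding symbol for sequence termini
TOKEN_TO_INDEX = {PAD: 0, **{aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}}


def lysine_windows(sequence, flank=5):
    """Yield (position, window) for every lysine, padding at the termini."""
    sequence = sequence.upper()
    padded = PAD * flank + sequence + PAD * flank
    for pos, residue in enumerate(sequence):
        if residue == "K":
            # The lysine sits at the centre of a window of length 2 * flank + 1.
            yield pos, padded[pos:pos + 2 * flank + 1]


def encode(window):
    """Map each residue to its integer index; unknown residues map to 0."""
    return [TOKEN_TO_INDEX.get(residue, 0) for residue in window]


protein = "MKTAYIAKQR"  # toy sequence, not real data
sites = [(pos, window, encode(window))
         for pos, window in lysine_windows(protein, flank=3)]
for pos, window, indices in sites:
    print(pos, window, indices)
```

Each integer-encoded window could then be passed through an embedding layer and an LSTM, with the window's label indicating whether the central lysine is acetylated.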

c. Data Science Meets High-throughput Single-cell Sequencing

Spatial Transcriptomics (ST) measures cellular gene expression profiles while preserving the spatial context of genes. ST helps in understanding the architecture of heterogeneous tissues and cell-to-cell communication. ST technologies provide the spatial distribution of RNA abundance in spots. Spots are relatively large and may capture a heterogeneous population of cells, resulting in mixed expression from each spot. Identifying the cellular composition of each spot raises a computational challenge commonly referred to as ST spot deconvolution. Addressing the computational challenges raised by emerging ST technologies requires collaboration between people with different skills. Statistical, machine learning, and deep learning methods have been proposed to deconvolute ST spots. Such methods are available as R or Python packages that the bioinformatics community can use to analyze their data. Despite the availability of easy-to-use packages, a deeper knowledge of machine and deep learning methods is required to apply them successfully to ST spot deconvolution. Because ST is a recently emerging technology, bioinformaticians often need a deeper understanding of machine and deep learning methods in the context of spot deconvolution. Additionally, ST technology is evolving rapidly, posing new computational challenges that call for more robust, state-of-the-art methods to take full advantage of its data. Data science has gained immense popularity due to its applications in almost every field. However, data scientists have played a restricted role in addressing ST challenges because of their limited understanding of the data. In this review, to help the bioinformatics community, we provide a detailed description of the machine and deep learning methods. An overview of evaluation methods for the proposed learning architectures is also presented, along with an introduction to cloud-based tools that might have applications in ST. To give data scientists a broader and clearer picture of ST datasets, we also discuss the ST technologies and the data they produce. The purpose of this review is twofold: first, to provide an overview of the machine and deep learning methods along with ST datasets; second, to reduce the gap between the bioinformatics and data science communities, which may stimulate the development of more robust and accurate methods for spatial transcriptomics.
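To make the spot deconvolution problem concrete, the sketch below treats one spot's expression as a non-negative mixture of cell-type signature profiles and recovers the mixing proportions with non-negative least squares. This is a simple baseline formulation chosen for illustration, not a specific method from the review. It assumes NumPy and SciPy are available, and the signatures, proportions, and sizes are synthetic and hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_genes, n_types = 50, 3

# Hypothetical reference: mean expression of each gene in each cell type.
signatures = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_types))

# Simulate one spot as a noise-free mixture of the three cell types.
true_props = np.array([0.6, 0.3, 0.1])
spot = signatures @ true_props

# Solve min ||signatures @ w - spot|| subject to w >= 0.
weights, residual = nnls(signatures, spot)

# Normalise the weights so they can be read as cell-type proportions.
estimated = weights / weights.sum()
print(np.round(estimated, 3))
```

On real data the spot counts are noisy and the reference signatures are imperfect, which is one reason the more elaborate statistical and deep learning methods discussed in the review exist.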