In this vertical, we explore the applications of artificial intelligence in bioinformatics. AI can help unravel biological complexity and advance precision medicine.
a. Ensembled Lower-dimensional Projections of Cellular Expression Improve the Cell Type Classification from Single-cell Sequencing
Single-cell RNA sequencing (scRNA-seq) enables the study of cellular diversity at
the single-cell level. It provides a global view of cell-type specification during the onset
of biological mechanisms such as developmental processes and human
organogenesis. Various statistical, machine learning, and deep learning methods have
been proposed for cell-type classification. Most of these methods rely on unsupervised
lower-dimensional projections obtained from a large reference dataset. In this work,
we propose EnProCell, a reference-based method for cell-type classification.
EnProCell first computes lower-dimensional projections that capture both
high variance and class separability through an ensemble of principal component
analysis and multiple discriminant analysis. In the second phase, EnProCell trains a
deep neural network on the lower-dimensional representation of the data to classify cell
types. The proposed method outperformed the existing state-of-the-art methods when
tested on four different data sets produced from different single-cell sequencing
technologies. EnProCell showed higher accuracy (98.91) and F1 score (98.64)
than other methods when cell types in reference data were predicted from reference datasets. Similarly,
EnProCell performed better than existing methods in predicting cell
types for data with unknown cell types (query) from reference datasets
(accuracy: 99.52; F1 score: 99.07). In addition to improved performance, the proposed
methodology is simple and does not require additional computational resources or time.
EnProCell is available at https://github.com/umar1196/EnProCell.
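The projection phase described above can be illustrated with a minimal NumPy sketch. It is not the EnProCell implementation: the function names, the synthetic data, the number of components, and the regularization constant are all assumptions made for illustration, and a nearest-centroid rule stands in for the deep neural network of the second phase to keep the example short.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return (mean, axes) for a PCA projection computed via SVD."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def lda_fit(X, y, n_components, reg=1e-3):
    """Return (mean, directions) that separate the labelled classes."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Assumed ridge term: keeps Sw invertible when genes outnumber cells.
    Sw += reg * np.trace(Sw) / d * np.eye(d)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)[:n_components]
    return mean, evecs[:, order].real

def project(X, models):
    """Concatenate the projections of X under every fitted (mean, W) pair."""
    return np.hstack([(X - mean) @ W for mean, W in models])

# Synthetic stand-in for an expression matrix (cells x genes).
rng = np.random.default_rng(0)
n_types, n_genes = 3, 50
centers = rng.normal(0.0, 2.0, size=(n_types, n_genes))

def simulate(cells_per_type):
    y = np.repeat(np.arange(n_types), cells_per_type)
    X = centers[y] + rng.normal(0.0, 1.0, size=(len(y), n_genes))
    return X, y

X_ref, y_ref = simulate(60)      # labelled reference cells
X_query, y_query = simulate(20)  # query cells (labels used only for scoring)

# Ensemble of projections: variance-driven (PCA) plus class-driven (LDA).
models = [pca_fit(X_ref, 10), lda_fit(X_ref, y_ref, n_types - 1)]
Z_ref, Z_query = project(X_ref, models), project(X_query, models)

# Stand-in classifier: assign each query cell to the nearest class centroid.
centroids = np.stack([Z_ref[y_ref == c].mean(axis=0) for c in range(n_types)])
dists = ((Z_query[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == y_query).mean()
print(f"query accuracy on synthetic data: {accuracy:.2f}")
```

Both projections are fitted on the reference cells only and then applied unchanged to the query cells, which mirrors the reference-based setting described in the abstract.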
b. Deep-Ace: LSTM-based Prokaryotic Lysine Acetylation Site Predictor
Acetylation of lysine residues (K-Ace) is a crucial post-translational modification
present in both prokaryotes and eukaryotes. It plays a significant role in disease
pathology and cell biology, which makes the identification of K-Ace sites important. Previous
approaches have used hand-crafted features and encodings in machine learning models
to analyze the characteristics of K-Ace
sites. However, these methods overlook long-term relationships
within sequences, which degrades performance. In
this study, we propose Deep-Ace, a deep learning-based
framework built on a Long Short-Term Memory (LSTM)
network that effectively captures and encodes long-term
relationships in sequences. Such relationships are essential for
learning discriminative and impactful sequence representations.
We employ the LSTM for deep feature extraction and predict
K-Ace sites using fully connected layers, with models for
eight prokaryotic species: B. subtilis, C.
glutamicum, E. coli, G. kaustophilus, S. eriocheiris, B. velezensis, S. typhimurium,
and M. tuberculosis. The code will soon be made publicly available at
https://github.com/Maham-Ilyas/Deep-Ace.
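To make the idea of an LSTM followed by a fully connected layer concrete, the sketch below runs a forward pass over a one-hot encoded peptide window in NumPy. It is an illustration only, not the Deep-Ace code: the 21-residue window centred on a lysine, the one-hot encoding, the hidden size, and the randomly initialized (untrained) weights are all assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(peptide):
    """Encode a peptide as a (length, 20) one-hot matrix."""
    x = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for t, aa in enumerate(peptide):
        x[t, AA_INDEX[aa]] = 1.0
    return x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_hidden(x, W, U, b):
    """Run a single-layer LSTM over x and return the final hidden state.

    W: (input_dim, 4 * hidden), U: (hidden, 4 * hidden), b: (4 * hidden,).
    The stacked gates are ordered: input, forget, candidate, output.
    """
    hidden = U.shape[0]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x_t in x:
        z = x_t @ W + h @ U + b
        i = sigmoid(z[:hidden])                   # input gate
        f = sigmoid(z[hidden:2 * hidden])         # forget gate
        g = np.tanh(z[2 * hidden:3 * hidden])     # candidate cell state
        o = sigmoid(z[3 * hidden:])               # output gate
        c = f * c + i * g   # cell state carries information across residues
        h = o * np.tanh(c)
    return h

def predict_site(peptide, params):
    """Return a score in (0, 1) from a dense layer on the last hidden state."""
    h = lstm_last_hidden(one_hot(peptide), params["W"], params["U"], params["b"])
    return float(sigmoid(h @ params["w_out"] + params["b_out"]))

# Untrained, randomly initialized weights (assumption: hidden size 16).
rng = np.random.default_rng(0)
hidden = 16
params = {
    "W": rng.normal(0.0, 0.1, (len(AMINO_ACIDS), 4 * hidden)),
    "U": rng.normal(0.0, 0.1, (hidden, 4 * hidden)),
    "b": np.zeros(4 * hidden),
    "w_out": rng.normal(0.0, 0.1, hidden),
    "b_out": 0.0,
}

# Hypothetical 21-residue window with the candidate lysine (K) at the centre.
window = "AGLSDEVTRAKLEGHIDPQSV"
score = predict_site(window, params)
print(f"untrained score for the central K: {score:.3f}")
```

Because the weights are random, the score carries no biological meaning; in practice the parameters would be learned from labelled acetylated and non-acetylated sites, with one model per species as the abstract describes.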
c. Data Science Meets High-throughput Single-cell Sequencing
Spatial transcriptomics (ST) measures cellular gene expression profiles while
preserving their spatial context. ST helps to understand the architecture of
heterogeneous tissues and cell-to-cell communication. ST technologies provide the
spatial distribution of RNA abundance in spots. A spot is typically larger than a single cell and may capture
a heterogeneous population of cells, resulting in a mixed expression profile for each spot.
Identifying the cellular composition of each spot poses a computational challenge
commonly referred to as ST spot deconvolution. Addressing the computational
challenges raised by emerging ST technologies requires collaborative efforts from
researchers with different skill sets.
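One simple way to picture spot deconvolution is as a non-negative regression: the expression of a spot is modelled as a mixture of per-cell-type signature profiles, and the mixture weights are read as cell-type proportions. The NumPy sketch below follows that idea under stated assumptions: the signature matrix and the spot are simulated, the function names are hypothetical, and projected gradient descent is just one of several ways to solve the non-negative least-squares problem. It does not represent any specific published method.

```python
import numpy as np

def nnls_projected_gradient(S, y, n_iter=2000):
    """Minimize 0.5 * ||S w - y||^2 subject to w >= 0."""
    # Step size 1/L, where L = ||S||_2^2 is the gradient's Lipschitz constant.
    step = 1.0 / np.linalg.norm(S, 2) ** 2
    w = np.zeros(S.shape[1])
    for _ in range(n_iter):
        grad = S.T @ (S @ w - y)
        # Gradient step followed by projection onto the non-negative orthant.
        w = np.maximum(w - step * grad, 0.0)
    return w

def deconvolve_spot(S, y):
    """Estimate cell-type proportions for one spot (weights rescaled to sum to 1)."""
    w = nnls_projected_gradient(S, y)
    return w / w.sum()

# Simulated reference signatures (genes x cell types); values are made up.
rng = np.random.default_rng(0)
n_genes, n_types = 200, 4
signatures = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_types))

# Simulated spot: a mixture of three cell types plus measurement noise.
true_props = np.array([0.6, 0.3, 0.1, 0.0])
spot = signatures @ true_props + rng.normal(0.0, 0.05, size=n_genes)

est_props = deconvolve_spot(signatures, spot)
print("estimated proportions:", np.round(est_props, 2))
```

In practice the signature matrix would typically be derived from annotated single-cell data, and real counts are noisier and sparser than this simulation suggests.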
Statistical, machine learning, and deep learning methods have been proposed to deconvolute ST
spots. Such methods are available as R or Python packages that
the bioinformatics community can use to analyze their data. Despite the availability of easy-to-use
software packages, a deeper knowledge of machine and deep learning
methods is required to apply them successfully to ST spot deconvolution. Because ST is a
recently emerging technology, bioinformaticians often need a deeper understanding
of machine and deep learning methods in the context of spot deconvolution.
Additionally, ST technology is evolving rapidly, posing new computational
challenges and calling for more robust, state-of-the-art methods that take full advantage
of its data. The data science discipline has gained immense popularity because of its applications
in almost every field. However, data scientists have so far played a restricted role in addressing ST challenges
because of their limited understanding of the data.
In this review, to help the bioinformatics community, we provide a detailed
description of machine and deep learning methods. We also present an overview of evaluation
methods for the proposed learning architectures and introduce
cloud-based tools that may have applications in ST. To give
data scientists a broader and clearer picture of ST datasets, we also discuss ST
technologies and the data they produce. The purpose of this review is twofold: first, to
provide an overview of machine and deep learning methods along with ST datasets;
second, to narrow the gap between the bioinformatics and data science communities,
which may stimulate the development of more robust and accurate methods for spatial
transcriptomics.
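As a small illustration of how a deconvolution result might be evaluated, the sketch below compares estimated and true cell-type proportions for one spot using root-mean-square error and Jensen-Shannon divergence. These two metrics are an assumption chosen for illustration; the text above does not state which evaluation methods the review covers, and the example proportions are made up.

```python
import math

def rmse(true, est):
    """Root-mean-square error between two proportion vectors."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true, est)) / len(true))

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Made-up proportions for one spot with four candidate cell types.
true_props = [0.6, 0.3, 0.1, 0.0]
est_props = [0.55, 0.3, 0.1, 0.05]

print(f"RMSE: {rmse(true_props, est_props):.4f}")
print(f"JSD:  {js_divergence(true_props, est_props):.4f}")
```

Lower values indicate closer agreement for both metrics; averaging them over all spots gives a single summary per method.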
