Studying the COVID-19 genome sequences

Featured image

This notebook is an attempt to answering some of the questions of bio-informatics in regards to the recent panademic outbreak of Corona virus, from a non-medical background person’s perspective. Every analysis presented here is based on the stuff I learnt and understood (or felt that I actually understood) about the genomes from the internet.

Some Context

[Ques.] What is Genome?
Genomes are considered to be genetic material of any organism. It consists of DNA(or RNA) which in turn can be considered as the blueprints of any organisms’ origin.

[Ques.] How to study Genomes?
A genome consists of a sequence of nucleotides that together make up all the chromosomes of any organism. These nucleotides are the basic building blocks of DNA and RNA. A DNA genome has 4 types of base nucleotides -
Adenine also represented as A,
Thymine represented as T,
Guanine represented as G and
Cytosine represented as C.
In RNA genome, the Thymine nucleotide is replaced with what is known as Uracil represented as U.
In short, any genome sequence is basically a combination of all these nucleotide in smoe order.

[Ques.] Can one find a cure for COVID-19 by just studying the genomes?
Not sure, as the molecular biology has not really been my field of study (never liked Biology tbh) but also because its a very vast subject where to confirm on a single hypothesis also takes months/years of evaluation and testing. Having said that, the whole purpose of this analysis is to help the ones who actually understand molecular biology get some data-driven insights to be able to preapare a RNA vaccine at the earliest.

lets have a look at a snippet of one of the genomes,

having a huge sequence and with everyting in single color could be difficult to read. So let’s color-code these nucleotides in the sequences and have them represented as they are normally written.

Comparison of composition of genomes

Comparing the genomic composition, helps Vaccinologist do reverse vaccinology to discover candidate antigents for vaccine development by analyzing the genome of a pathogen or a family of pathogens.

For the comparison of composition of these genomes, I’ll be looking at the

1. Oligonucleotide composition:

Oligonucleotides are DNA or RNA molecules made up of a sequence of nucleotides. The length of the oligonucleotide is represented as “k-mer” ($k$ being the number of nucleotides in one oligonulceotide). In this analysis, I’ll be looking at the 3-mers (or trinucleotides) and 4-mer (or tetranucleotides) compositions against their normalized frquency of occurrences.

Now lets have a look at the tetramer composition of these genomes

Well that’s alot to put in one plot, but the good thing with plotly is that you can always turn off a few legends to compare the values just among the selected few.

For a moment, lets just look at the tetranucleotide composition of all the “corona-virus infected species”.

2. GC Content:

GC content (or Guanine-Cytosine content) is the percentage of nucleotides bases in a DNA/RNA molecule that are either guanine (G) or cytosine (C). GC content is always expressed as a percentage value and is supposed to remaine the same among genomes of same specie.

\(\frac{G + C}{A + C + G + T} \times 100\ \%\)

Proteins & Amino Acids

The most commonly occuring amino acids are abbreviated by using 20 letters from the english alphabet (all letters except for B,J,O,U,X and Z). The protein strings are constructed from these 20 alphabets as symbols. Henceforth, a Protein Sequence can also be incorporated using the nucleotide sequences.

The DNA codon table on wiki dictates the details regarding the encoding of specific codons into amino acids alphabets.

lets have a look at the protein sequences of a few of these genome before we go for a comparative analysis

1. Protein Strands by genome

2. Amino Acids Distribution

3. Finding the ORFs (Open Reading Frames)

The stretch of codons from start-codon (ATG) to stop-codon (TAA/TAG/TGA) is called Open Reading Frame (ORF)

the highlighted sequences below represent the ORFs

4. Comparative Summary of Protein Sequences

Sequence Alignement

The sequence alignment is a way of arranging the sequences of DNA/RNA or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. These in turn help in identifying the sequence similarities and developing homology models of protein structures. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor.

as the genomes sequences are too big in length, lets align and have a look at just the first few nucleodtides in the sequences

let us also have a look at the first few aligned protein sequences of these genomes

Pairwise Sequence Alignment

earlier we had aligned every sequence based on the Clustal algorithm, now let’s align every other genome with the covid19 genomes.

Genome Sequence Similarity

using Clustal algorithm

for establishing the similarity among genomes, we’ll be changing the scoring parameter
for every match there will be a +1 point
for every mismatch there will be a -1 point
for every open gap there will be a -0.5 points
for every extended gap there will be a -0.1 point

Similarity scores between

COVID-19 & SARS genome sequences:	 20885.00000000038 (69.84%)
COVID-19 & MERS genome sequences:	 15122.599999998136 (50.57%)
COVID-19 & Civet_SL_CoV genome sequences:	 20616.90000000066 (68.95%)
COVID-19 & Bat_SL_CoV genome sequences:	 20706.000000000255 (69.24%)
COVID-19 & Ebola genome sequences:	 10233.39999999796 (34.22%)
COVID-19 & Camel_CoV genome sequences:	 15134.599999998123 (50.61%)
COVID-19 & Malaria genome sequences:	 8.800000000105774 (0.03%)
COVID-19 & HIV genome sequences:	 5962.5999999990445 (19.94%)
COVID-19 & Hedgehog_CoV genome sequences:	 15227.29999999811 (50.92%)

Generating a Phylogenetic Tree

A Phylogenetic Tree is the tree of evolutionary relationship among various species based on their genetic characteristics.

UPGMA (Unweighted Pair Group Method with Arithmetic mean) algorithm

This is where you can learn more about it wiki
[NOTE] unlike the above sequence similarity, here only the mismatch cases are scored using the aligner and no scoring is done for gaps and extended gaps.

Clustering all COVID-19 patients’ genomes

using the UPGMA algorithm

Conclusions


If you liked this project, please consider buying me a coffee

Buy me a coffeeBuy me a coffee