Tags:
I expect the audience of this post to be both programmers and biologists so I'll bring you up to speed on a topic before going into it. Feel free to skip a sentence, paragraph or even section if you're familiar with a topic.
A genome is the entire genetic code of an organism. While computational data is represented in binary form, ones, and zeros, biochemical data is represented by nitrogenous bases that seem to stick out of a DNA or RNA molecule/strand abbreviated A, T, C, and G for DNA and A, G, C and U for RNA. We care about RNA because some viruses have RNA and not DNA.
If this is confusing, you can think of a strand of DNA or RNA as a thread with knots where each knot is a base.
A sequence is an ordering of something. A programming analog is a sequence vs a collection. Sequences are ordered, for example lists, and therefore have the potential for a next and a previous element while collections are just data thrown together, for example, a dictionary or a set.
Genome sequencing (or sequencing a genome) therefore, is determining the order of bases in all of the DNA or RNA in an organism. What makes this easy is that all the cells in an individual organism have the same DNA so we can get all the DNA in an organism from a single cell. In practice, however, we can't work with a single cell due to its size. Ignore chromosomes, haplotypes and other things you may know about DNA for now.
To determine the sequence of bases in an entire genome of an organism we focus on only one of the alleles (a haplotype) and only one strand of the double helix. Since 2005 we've used methods broadly categorized under Next Generation Sequencing (NGS) to perform genome sequencing. There are two main ways of performing NGS:
As you would expect, each method has its drawbacks and advantages. What we get out of the machine that does the actual sequencing of DNA is called a read and reads have to be aligned and assembled2. Alignment involves stacking reads on top of each other and assembling is the greater process that involves alignment, algorithmically choosing the best alignment and determining what the original sequence was.
There are two broad categories of assembly4:
A reference genome is a consensus sequence that accepted as the genome of a species2. It’s stored as one long sequence of characters/bases. You may wonder how we can have a known genome of an entire species when every individual has a unique genetic code or how humans are 99% chimp. Well, the answer is that genetic code of most organisms is similar and this similarity increases as we narrow down taxonomically. When we say that a human is closer to a chimp than a monkey what we mean is that we can observe greater variation between the genomes of the two, man+chimp vs monkey, than man vs chimp alone.
This isn't actual math but may help clear things up.
variation(combine_genomes(man, chimp), monkey) > variation(man, chimp)
However, there are still genomic differences and they should not be ignored. The ignoring of differences is implicit in a linear reference. A better way to describe them is to say that the differences are segregating within the population. We may also want to carry out a comparison between species or between related species which is done in pangenomics.
DNA has sections which are identical between individuals (conserved regions), and the number of these sections grow as we narrow down taxonomically and there are sections which vary between individuals, for example, the short sequence repeats that are compared in paternity testing.
Graph theory is an area in math that can help us understand variable regions within genomes. The idea of representing genomes as graphs isn’t new, however, the low number of tools like vg which apply graph theory to genomics and the little that we know about genomes has been a drawback.
A graph is a series of vertices (also known as nodes) and edges.
For genome graphs, we focus on directed acyclic graphs. A walk in a directed graph is traversal from one node to another through an edge, for example, a to b to d or a to c to d.
Once the reference genome of an organism has been determined, it is stored in fasta format which contains the sequence and metadata. Moving forward, anyone sequencing the same species aligns against this reference. Differences that occur in less than 1% of the reads are usually thrown out; the ones that aren’t thrown out don’t help to update the reference but are stored in Variant Call Format (VCF) which contains the variation data and their positions plus metadata. These VCF files are spread out amongst researchers and aid in the particular thing being researched but generally don’t contribute in and of themselves to the general genomic body of knowledge. However, every once in awhile the reference is updated but not on a fixed schedule2. It’s for this reason that the variation graph would be a good way of representing the reference. There is research that confirmed that short reads align better to the variation graph than to a linear reference3.
Graphs that are applied to genomes are generally called genome graphs. However, there are two more specific categories which are sequence graphs and variation graphs.
As an example assume that we zoom on a hypothetical reference: "ACTGAATTTGTA"
Variation | Position | Alternative |
---|---|---|
Variation1 | 2 | GGGA |
Variation2 | 4 | C |
We could recursively insert Variation1 at position 2 and Variation2 at position 4 to generate the graph below:
(generated using graphite and my current fork of graph)
In this case, a single walk would represent a possible genome. Compared to the reference, this variation information is maintained and the graph still holds the data that was in reference.
These are graphs with sequence labels on the nodes or edges.
Sequence graphs or equivalent structures have been used previously to represent multiple sequences that contain shared differences or ambiguities in a single structure. Related structures used in genome assembly which collapse long repeated sequences, so the same nodes are used for different regions of the genome include the De Bruijn graph5. Graphs to represent genetic variation have previously been used for microbial genomes & localized regions of the human genome such as the major histocompatibility complex.
A variation graph is a sequence graph together with a set of paths representing possible sequences from a population. However, what makes it so unique is it's tight mapping between the graph and the reference.
Human orthopneumovirus, formerly known as Respiratory Syncytial Virus (RSV), is a single-stranded RNA virus and a good candidate for exploration using the variation graph because viruses don’t have proofreading in their genetic code. Proofreading is a process in which the cell ensures that it has copied the genetic code correctly in preparation for cell division. Without proofreading, errors will be commonplace leading to high mutation rates. Another advantage is the size of its genome; the reference stands at 15,206 bases which translate to 15206 bytes or 14.8 KB of memory.
As of writing this, graphite can’t generate a graph out of reads alone (perform an alignment). It supports a reference in fasta and a single VCF file.
I'll detail the algorithm in a later post but the gist of it is this:
'((a b) (a c)) to become a node with edges from a to b and c to be and a->b and a->b
A variation is a struct of position
and sequence
.
I’m using the racket graph library graph to generate a graph out of the nested lists and treating the graph as a “dynamic tree”.
We then rely on graph to generate an unweighted directed graph through unweighted-graph/directed. We export the graph in dot format and visualize via graphviz. Serialization isn’t implemented yet.