Tags:
The Google Summer of Code for 2019 is coming to a close. I've been working on the Graphite project aimed at creating a variation graph tool under the Open Bioinformatics Foundation.
I shall give an overview of genome graphs; for a more thorough introduction, I advise reading On a reference pan-genome model or my very own An Introduction to Variation Graphs.
Variation graphs are a class of genome graphs that, among other things, maintain:
All this with the main aim of representing genomic variation.
Currently genomes are represented as a consensus; for example, here's part of chromosome 20 of the human genome:
TGGGAGAGAACTGGAACAAGAACCCAGTGCTCTTTCTGCTCTACCCACTGACCCATCCTCTCACGCATCATACACCCATA
CTCCCATCCACCCACCTTCCCATTCATGCATTCACCCATTCACCCACCTTCCATCCATCTACCATCCACCACGTACCTAC
ACTCCCATCTACCATCCAACCACATTTCCATTCACCCATCCTCCCATCCATCAACCCTCCAATCCACCACCCACAGACCT
TCCCATCCATTCATTTACCCATCCACATATTCACCCACCCTCCCATCCATCCATCTACTGTCTATCACCTACTCATTTTC
...
However, variation does exist on this chromosome between individuals. To demonstrate, when a single file of variation data is "applied" to it we end up with a graph that can be represented as in the below zoomed in section of chromosome 20.
Not all graphs are linear like the one above; in fact, the kind of graphs we're hoping to generate soon will have many breaks and will look pretty jumbled up. The amount of variation within and between genomes depends on a lot of factors including but not limited to the organism(s) whose genomes are studying and how evolutionarily distant the genomes we are comparing are (pan-genomes).
The reference being a consensus introduces a problem known as reference bias; which is analogous to a false negative during read mapping, that is, claiming that a variation does not exist where it actually exists. Research has shown an improvement in read mapping of short read data when short reads are mapped to a reference graph instead of a reference consensus.
Currently, graphite shows that we can generate a graph from a reference and variation data and progressively update it; not far from what Heng Li theorizes in: On a reference pan-genome model. We plan on supporting the generation of graphs from short and long-read data on its own (de novo), you can track its progress under the alignment project board.
You may wonder why Graphite is written in Racket yet the project name is Add Variant Graph (VG) support to BioD. Genome graphs have been theorized and written about for a while now but there are only a few tools that implement them and there has been even much less use by bioinformaticians.
We decided to build something people can use over just adding variation graph support to BioD believing that it would've taken me longer to write something people could use in D.
I have experience in functional programming and particularly professional experience with LISPs, I could, therefore, move much faster in it. Racket is a good LISP to use when working in an unfamiliar domain because it has a straightforward build system compared to the overhead of the JVM and leiningen in the most popular LISP, Clojure does and has an easy to use C/C++ FFI, languages used in a lot of Bioinformatics applications and libraries. Being untyped and programming in LISP being REPL driven also helps to venture into unfamiliar territory. Racket has immutability in a lot, if not all, of its data structures which also helps with all the data manipulation we would have to do. Finally, Racket also implements core programming concepts elegantly compared to many other languages for example with its hygienic macros.
This shouldn't make the D community feel betrayed because given what we know now, we can always go back and reimplement either all or part of Graphite in D, especially for the possible performance improvements.
For the fine detail install and set up Graphite as instructed in the README
and run graphite --help
.
We implemented the graph as an association hash table. I went into more detail on how it's built and the rationale behind certain choices in Creating the Initial Variation Graph and Justifying SHA256 in Graphite.
Graphite allows you to build an initial graph out of a reference in
FASTA format and a VCF file.
In the example below I output a serialized graph but you can output .dot
or
.gfa
.
./bin/graphite construct \
-o rsv1.gra \
-f gra \
data/RSV/refererence_and_vcf_file/9465113.fa data/RSV/refererence_and_vcf_file/H_3801_22_04.freebayes.vcf
Formally, progressive construction. Graphite lets the user update a serialized
graph generated via construct
.
In the update, it takes serialized graph .gra
and variation data in VCF.
Example:
./bin/graphite update \
-o rsv2.dot \
-f dot \
rsv1.gra data/RSV/refererence_and_vcf_file/fake_H_3801_22_04.freebayes.vcf
Graphite allows you to generate graphs in
Using update
and view
, one can visualize changes introduced by different VCF
files as of writing this.
The view command takes a serialized graph .gra
, an output format, and an output
file as arguments.
Example:
./bin/graphite view \
-o rsv1.dot \
-f dot \
rsv1.gra
Look at the Graphite project boards for further detail.
This would allow for aligning reads to a graph or against each other bypassing the consensus reference.
For now, graphite can only align against strings (however this functionality isn't exposed because it's not ready yet) mainly because it only stores forward edges which makes it hard to implement Partial Order Alignment (POA) which strictly depends on backward edges. The short term options are adding a pre-processing step to generate backward edges or using Racket's FFI to call spoa or gssw. In the medium to long term Graphite's nodes should support backward edges which would, on top of assisting with POA, allow for encoding more complex mutations such as inversions.
This involves having a kmer, the substring of a genome, and searching for its most likely position in the graph. I have multiple problems with this as of now such as completely not finding kmers at all or having them point to the wrong location.
Graphite doesn't support complementarity (only supports the positive strands) therefore searching for a kmer in the negative strand wouldn't even work. I am evaluating different ways of implementing complementarity.
Search doesn't work for some strings and even worse the method I am using to build the index is not ideal. Currently, I am building the FM index via a Burrows-Wheeler Transform (BWT) which is in turn built from rotating the given string, this is far from ideal. A better strategy I am considering is:
I could then complement the BWT and then generate the FM index and therefore get fast queries onto the graph. I believe this to be the basic idea behind A Graph Extension of the Positional Burrows-Wheeler Transform and its Applications.
We could add a metadata field to the nodes which will allow for something like inbuilt annotation support.
Another feature not related to Bioinformatics but is surprisingly lacking in Racket is to extend cmdline to have command-line options as is described in Multi-command-line in Racket.
To reiterate, you can look at these 3 other posts regarding Graphite:
I'd like to acknowledge the contribution of the following organizations and people.
Sponsors:
my mentors:
and those who weren't my mentors but helped along the way by offering technical advice, testing graphite, reading the code or reviewing the blog posts: