Core to variation graphs is the maintenance of a tight mapping between the reference and the graph. To maintain this mapping we establish a coordinate system — a way to reliably associate a node in the graph, with a position in the reference and vice versa.
We use the concepts offset and ref to maintain a coordinate system.
An offset is the number of bases from the first node where the variation occurs; offsets are one-indexed in the reference but zero-indexed in the graph. Offsets are suited to translating linear reference to graphs because it's how variations are viewed within the reference anyway.
For example, we could represent a variation "A" occuring at position 3 in the reference "ATCGAT" as:
Notice how we start counting from 0 in the graph? We call that being zero-indexed.
A ref is a unique identifier which we get from the reference
description line. A graph created from just one reference will have all nodes
contain the same value in the
As you may have suspected, some problems arise from this coordinate system. They are a matter of progressive update and read alignment but not a matter of initial graph construction and are therefore beyond the scope of this post. They include:
Properties of our graph:
A node is built out of a racket
struct in many
languages, with the following fields:
|segment||a string of alphabet A, T, C, and G|
|offset||offset from zero on the reference|
|id||sha256 hash of the concatenation of segment, "+" and offset|
|ref||reference from which the segment is derived|
|links||a list of the IDs of the next nodes|
The use of
links to mean
edges are inspired by
A proposal of the Graphical Fragment Assembly format.
We generate a sha256 hash out of the segment, a plus symbol and the offset to
generate a value for
For example, given a segment "ATCGATG" at offset 34 we can generate an ID like so:
generate-id(<string> segment, <natural-number> offset) // take note of the + sign in the concatenation string-and-offset <- concatenate("ATCGATG", "+","34") hash-as-bytestring <- sha256hash(string-and-offset) id <- bytestring-to-hex-string(hash-as-bytestring) return id
I chose hashes over UUIDs because they are reproducible and will have constant time lookups in the occasion that we want to retrieve a node from the graph given its sequence and offset. This should come in handy in visualization especially on the web.
I also considered the likelihood of collisions in the hashes. I expect it to be low when dealing with 15,000 base pair size viruses. I shall expound on this in a later post. One thing to note is that vg uses UUIDs and they work for human genome so I believe graphite, the tool that I'm writing to implement this, can get away with sha256 hashes for more complex genomes.
A variation is a
structure containing the following fields:
|segment||a string of single of alphabet A, T, C, and G|
|offset||offset from zero on the reference|
|ref||an identifier of the reference it's derived from|
It is extracted from a Variant Call Format file, the main file format for genomic variation data.
I had to implement a graph in graphite due to the lack of serialization (a required feature for progressive updates) in the racket graph library; I would have preferred to add serialization support to graph but couldn't do that and still stay on track with graphite.
The graph is built out of an adjacency map of
id, key, to
hash table and not a
list has the following pros:
The general idea is:
offsetand a linear reference (
In the case of graphite, we recursively split the reference into a list of
pairs that imply directionality.
For example, the pair
(a b) would translate to an edge from node a to node b.
We then have a function
gen-directed-graph that takes this
and generates a directed graph from it using
foldl. Graphite creates the graph
in the 3 steps detailed below.
O(n); n being the size of the variation list
gen-node-list(reference, variations, prev-position = f, prev-nodes = <empty-list>) if empty-list? variations // the base case of gen node list cap(reference, previous-position, previous-nodes) else if (is-number previous-position) and (previous-position = current-offset) // we have more than one variation in this position handle-duplicate(reference, variations, previous-position, previous-nodes) else // we have just one variation in this position handle-unique(reference, variations, previous-position, previous-nodes)
A mutually recursive function takes from the
tail of variation list,
variations, and returns a list of pair of nodes
(a, b) where the direction
of the nodes is
a -> b for example a list like
[(a b), (b c), (c d)] should
later translate to
a -> b -> c -> d.
Creates the initial variation i.e "caps" the directed graph. It creates a first node that points to the first variations.
cap(reference, previous-position, previous-nodes) map( lambda node: (substring(reference, 0, previous-position), node) previous-nodes )
Inserts a variation where there isn't an alternative.
In a case where there's only 1 alternative path so we break the current sequence
and insert our alternative path, for example,
a -> b and
a -> c.
handle-unique(reference, variations, previous-position, previous-nodes) ...
Inserts extra alternative variations where they already exist.
a -> b,
a -> c and
a -> d.
handle-duplicate(reference, variations, previous-position, previous-nodes) ...
O(n); with n being the size of the list of pairs
gen-directed-graph(g, list-of-pairs) foldl( // make sure that you're not overwriting the list of edges of a node as you // update it. This check makes `gen-directed-graph` slow approx 4n. lambda pair: add-adjacent-node(g, first(pair), second(pair)) g list-of-pairs)
The reason for the bad performance of
gen-directed-graph is that it checks to
avoid overwriting any existing nodes.
This is to mean that if there's a relationship like:
a -> b and
a -> c
we have to make sure not to lose the edge
a -> b when creating
a -> c.
It, however, does suffice for virus data.
A composition of
gen-vg(reference, variations) node-list <- gen-node-list(reference, variation) graph <- gen-directed-graph(node-list) return graph
Representing the alphabet in 4 bits, as is done in BioD, because:
The alphabet would be:
However, most of the optimization would come from graph creation, graph update and search which is what I'm focused on for now.