Tags:
Variation graphs represent the reference genome as a graph. For an introduction, read my previous post An Introduction to Variation Graphs or Untangling graphical pangenomics by Erik Garrison.
Core to variation graphs is the maintenance of a tight mapping between the reference and the graph. To maintain this mapping we establish a coordinate system — a way to reliably associate a node in the graph, with a position in the reference and vice versa.
We use the concepts offset and ref to maintain a coordinate system.
An offset is the number of bases from the first node where the variation occurs; offsets are one-indexed in the reference but zero-indexed in the graph. Offsets are suited to translating linear reference to graphs because it's how variations are viewed within the reference anyway.
For example, we could represent a variation "A" occuring at position 3 in the reference "ATCGAT" as:
Notice how we start counting from 0 in the graph? We call that being zero-indexed.
A ref is a unique identifier which we get from the reference
description line. A graph created from just one reference will have all nodes
contain the same value in the ref
field.
As you may have suspected, some problems arise from this coordinate system. They are a matter of progressive update and read alignment but not a matter of initial graph construction and are therefore beyond the scope of this post. They include:
Properties of our graph:
A node is built out of a racket structure
, a struct
in many
languages, with the following fields:
Name | Description |
---|---|
segment | a string of alphabet A, T, C, and G |
offset | offset from zero on the reference |
id | sha256 hash of the concatenation of segment, "+" and offset |
ref | reference from which the segment is derived |
links | a list of the IDs of the next nodes |
The use of segment
and links
to mean vertices
and edges
are inspired by
A proposal of the Graphical Fragment Assembly format.
We generate a sha256 hash out of the segment, a plus symbol and the offset to
generate a value for id
.
For example, given a segment "ATCGATG" at offset 34 we can generate an ID like so:
generate-id(<string> segment, <natural-number> offset)
// take note of the + sign in the concatenation
string-and-offset <- concatenate("ATCGATG", "+","34")
hash-as-bytestring <- sha256hash(string-and-offset)
id <- bytestring-to-hex-string(hash-as-bytestring)
return id
I chose hashes over UUIDs because they are reproducible and will have constant time lookups in the occasion that we want to retrieve a node from the graph given its sequence and offset. This should come in handy in visualization especially on the web.
I also considered the likelihood of collisions in the hashes. I expect it to be low when dealing with 15,000 base pair size viruses. I shall expound on this in a later post. One thing to note is that vg uses UUIDs and they work for human genome so I believe graphite, the tool that I'm writing to implement this, can get away with sha256 hashes for more complex genomes.
A variation is a structure
containing the following fields:
Name | Description |
---|---|
segment | a string of single of alphabet A, T, C, and G |
offset | offset from zero on the reference |
ref | an identifier of the reference it's derived from |
It is extracted from a Variant Call Format file, the main file format for genomic variation data.
I had to implement a graph in graphite due to the lack of serialization (a required feature for progressive updates) in the racket graph library; I would have preferred to add serialization support to graph but couldn't do that and still stay on track with graphite.
The graph is built out of an adjacency map of id
, key, to node
, value.
Using a hash table
and not a list
has the following pros:
segment
and its offset
and cons:
The general idea is:
list
of variation structures
sorted by offset
and a
linear reference (string
).In the case of graphite, we recursively split the reference into a list of
pairs that imply directionality.
For example, the pair (a b)
would translate to an edge from node a to node b.
We then have a function gen-directed-graph
that takes this list
of pairs
and generates a directed graph from it using foldl
. Graphite creates the graph
in the 3 steps detailed below.
O(n); n being the size of the variation list
gen-node-list(reference, variations, prev-position = f, prev-nodes = <empty-list>)
if empty-list? variations
// the base case of gen node list
cap(reference, previous-position, previous-nodes)
else if (is-number previous-position) and (previous-position = current-offset)
// we have more than one variation in this position
handle-duplicate(reference, variations, previous-position, previous-nodes)
else
// we have just one variation in this position
handle-unique(reference, variations, previous-position, previous-nodes)
A mutually recursive function takes from the tail
of variation list,
variations
, and returns a list of pair of nodes (a, b)
where the direction
of the nodes is a -> b
for example a list like [(a b), (b c), (c d)]
should
later translate to a -> b -> c -> d
.
Creates the initial variation i.e "caps" the directed graph. It creates a first node that points to the first variations.
cap(reference, previous-position, previous-nodes)
map(
lambda node: (substring(reference, 0, previous-position), node)
previous-nodes
)
Inserts a variation where there isn't an alternative.
In a case where there's only 1 alternative path so we break the current sequence
and insert our alternative path, for example, a -> b
and a -> c
.
handle-unique(reference, variations, previous-position, previous-nodes)
...
Inserts extra alternative variations where they already exist.
for example a -> b
, a -> c
and a -> d
.
handle-duplicate(reference, variations, previous-position, previous-nodes)
...
O(n); with n being the size of the list of pairs
gen-directed-graph(g, list-of-pairs)
foldl(
// make sure that you're not overwriting the list of edges of a node as you
// update it. This check makes `gen-directed-graph` slow approx 4n.
lambda pair: add-adjacent-node(g, first(pair), second(pair))
g
list-of-pairs)
The reason for the bad performance of gen-directed-graph
is that it checks to
avoid overwriting any existing nodes.
This is to mean that if there's a relationship like:
a -> b
and a -> c
we have to make sure not to lose the edge a -> b
when creating a -> c
.
It, however, does suffice for virus data.
A composition of gen-node-list
and gen-directed-graph
gen-vg(reference, variations)
node-list <- gen-node-list(reference, variation)
graph <- gen-directed-graph(node-list)
return graph
Graphite supports the generation of graphs in: GFA, for interoperability with tools such as vg and bandage; DOT, for visualization; and a serialized form, .gra.
Representing the alphabet in 4 bits, as is done in BioD, because:
The alphabet would be:
However, most of the optimization would come from graph creation, graph update and search which is what I'm focused on for now.