Documentation ¶
Overview ¶
Package seqhash contains the reference seqhash algorithm.
There is a big problem with current sequence databases - they all use different identifiers and accession numbers. This means cross-referencing databases is a complicated exercise, especially as the quantity of databases increases, or if you need to compare "wild" DNA sequences.
Seqhash is a simple algorithm to produce consistent identifiers for any genetic sequence. The basic premise of the Seqhash algorithm is to hash sequences with the hash being a robust cross-database identifier. Sequences themselves shouldn't be used as a database index (often, they're too big), so a hash based off of a sequence is the next best thing.
Usability-wise, you should be able to Seqhash any rotation of a sequence in any direction and get a consistent hash.
The Seqhash algorithm makes several opinionated design choices, primarily to make working with Seqhashes more consistent and pleasant. The Seqhash algorithm only uses a single hash function, Blake3, and only operates on DNA, RNA, and Protein sequences. These identifiers will be seen by human beings, so versioning and metadata are attached to the front of the hashes so that a human operator can quickly identify problems with hashing.
If the sequence is DNA or RNA, the Seqhash algorithm needs to know whether or not the nucleic acid is circular and/or double stranded. If circular, the sequence is rotated to a deterministic point. If double stranded, the sequence is compared to its reverse complement, and the lexicographically minimal sequence is taken (whether the min or the max is used doesn't matter; the choice just needs to be consistent).
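The double-stranded rule can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `reverseComplement` and `canonicalStrand` are hypothetical helper names, and the complement table only covers the four unambiguous bases.

```go
package main

import (
	"fmt"
	"strings"
)

// complement maps each unambiguous DNA base to its Watson-Crick partner.
// Ambiguity codes are omitted to keep the sketch short (assumption).
var complement = map[byte]byte{'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

// reverseComplement returns the reverse complement of an uppercase DNA string.
func reverseComplement(seq string) string {
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		out[len(seq)-1-i] = complement[seq[i]]
	}
	return string(out)
}

// canonicalStrand picks the lexicographically smaller of a sequence and its
// reverse complement, so that both strands map to the same string.
func canonicalStrand(seq string) string {
	seq = strings.ToUpper(seq)
	if rc := reverseComplement(seq); rc < seq {
		return rc
	}
	return seq
}

func main() {
	fmt.Println(canonicalStrand("GTT")) // AAC
	fmt.Println(canonicalStrand("AAC")) // AAC
}
```

Both strands of the same duplex produce the same string, which is what makes the resulting hash independent of read direction.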
If the sequence is RNA, the sequence will be converted to DNA before hashing. While the full Seqhash will still be different between RNA and DNA (due to the metadata), the hash portion that follows the metadata will be the same. This makes it easy to cross-reference DNA and RNA sequences. This matters for the parts of DnaDesign that store and search large quantities of sequences: those Seqhashes can be deduplicated to save a lot of space.
For DNA or RNA sequences, only ATUGCYRSWKMBDHVNZ characters are allowed. For Proteins, only ACDEFGHIKLMNPQRSTVWYUO*BXZ characters are allowed in sequences. Selenocysteine (Sec; U) and pyrrolysine (Pyl; O) are included in the protein character set - usually U and O don't occur within protein sequences, but for certain organisms they do, and they are certainly relevant amino acids for those particular proteins.
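A check against these character sets might look like the sketch below. The two alphabets are copied from the text above; `firstInvalid` is a hypothetical helper, and the sketch is case-sensitive, which may or may not match how the package treats lowercase input.

```go
package main

import (
	"fmt"
	"strings"
)

// Character sets copied from the documentation above.
const (
	nucleicAlphabet = "ATUGCYRSWKMBDHVNZ"
	proteinAlphabet = "ACDEFGHIKLMNPQRSTVWYUO*BXZ"
)

// firstInvalid returns the index of the first character of seq that is not
// in alphabet, or -1 if every character is allowed.
func firstInvalid(seq, alphabet string) int {
	for i, r := range seq {
		if !strings.ContainsRune(alphabet, r) {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid("ATGCN", nucleicAlphabet)) // -1
	fmt.Println(firstInvalid("ATGXC", nucleicAlphabet)) // 3
	fmt.Println(firstInvalid("MKO*", proteinAlphabet))  // -1
}
```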
Seqhash version 2 ¶
Version 1 seqhashes are deprecated.
Version 1 seqhashes were rather long, and version 2 seqhashes are built to be much shorter. The intended use case is handling sequences with LLM systems: these systems' context windows are a valuable resource, and smaller references allow the system to be more focused. Version 2 seqhashes are approximately 3x smaller than version 1 seqhashes. Officially, they are [16]byte arrays, but they can also be encoded with base58 to get a string that can be used across different systems. Here is a length comparison:
version 1: v1_DLD_f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1508a615f46350
version 2: C_5X6Hudy3K8ht7r4mvu9Gco
The metadata is now encoded in a 1 byte flag rather than the 7 rune metadata string used in version 1. Rather than use 256 bits for encoding the hash, we use 120 bits. Since seqhashes are not meant for security, this is good enough (50% collision chance at roughly 1.3x10^18 hashes), while making them conveniently only 16 bytes long. Additionally, encoded prefixes are added to the front of the base58 encoded hash as a heuristic device for LLMs while processing batches of seqhashes.
In addition, seqhashes can now encode fragments. Fragments are double stranded DNA that are the result of restriction digestion, with single stranded overhangs flanking both sides. These fragments can encode genetic parts - and an important part of any vector containing these parts would be the part seqhash, rather than the vector seqhash. This enhancement allows you to identify genetic parts regardless of their context.
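The collision figure can be checked with the usual birthday approximation, n ≈ sqrt(2 · 2^bits · ln(1/(1-p))). The sketch below assumes hashes are uniformly distributed over the 120-bit space; `hashesForCollision` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math"
)

// hashesForCollision returns the approximate number of uniformly random
// hashes of the given bit width needed to reach collision probability p,
// using the birthday approximation n ≈ sqrt(2 * 2^bits * ln(1/(1-p))).
func hashesForCollision(bits int, p float64) float64 {
	space := math.Pow(2, float64(bits))
	return math.Sqrt(2 * space * math.Log(1/(1-p)))
}

func main() {
	// 120 hash bits remain after the 1 byte flag in a 16 byte seqhash.
	fmt.Printf("%.2e\n", hashesForCollision(120, 0.5)) // 1.36e+18
}
```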
Base58 is used rather than base64 so that seqhashes can easily be added into URLs without a "/" in the identifier. Ironically, it also makes smaller hashes than padded base64, because base64 chunks 3 bytes at a time - at 16 bytes, 2 blank bytes are added to make the seqhash length divisible by 3. Base58 chunks differently, and so doesn't encounter this problem.
Example (Basic) ¶
This example shows how to seqhash a sequence.
package main

import (
	"fmt"

	"github.com/koeng101/dnadesign/lib/seqhash"
)

func main() {
	sequence := "ATGC"
	sequenceType := seqhash.DNA
	circular := false
	doubleStranded := true

	sequenceSeqhash, _ := seqhash.EncodeHash2(seqhash.Hash2(sequence, sequenceType, circular, doubleStranded))
	fmt.Println(sequenceSeqhash)
}
Output: C_5X6Hudy3K8ht7r4mvu9Gco
Index ¶
- Variables
- func DecodeHash2(encodedString string) ([16]byte, error)
- func EncodeFlag(version int, sequenceType SequenceType, circularity bool, doubleStranded bool) byte
- func EncodeHash2(hash [16]byte, err error) (string, error)
- func Hash2(sequence string, sequenceType SequenceType, circular bool, doubleStranded bool) ([16]byte, error)
- func Hash2Fragment(sequence string, fwdOverhangLength int8, revOverhangLength int8) ([16]byte, error)
- func RotateSequence(sequence string) string
- type Hash2MetadataKey
- type SequenceType
Constants ¶
This section is empty.
Variables ¶
var Hash2Metadata = map[Hash2MetadataKey]rune{
	{DNA, true, true}:        'A',
	{DNA, true, false}:       'B',
	{DNA, false, true}:       'C',
	{DNA, false, false}:      'D',
	{RNA, true, true}:        'E',
	{RNA, true, false}:       'F',
	{RNA, false, true}:       'G',
	{RNA, false, false}:      'H',
	{PROTEIN, false, false}:  'I',
	{PROTEIN, true, false}:   'J',
	{FRAGMENT, false, false}: 'K',
	{FRAGMENT, true, false}:  'L',
	{FRAGMENT, false, true}:  'M',
	{FRAGMENT, true, true}:   'N',
}
Hash2Metadata contains the seqhash v2 single letter metadata tags.
Functions ¶
func DecodeHash2 ¶
func DecodeHash2(encodedString string) ([16]byte, error)
DecodeHash2 decodes a seqhash into a [16]byte, including the metadata tag.
func EncodeFlag ¶
func EncodeFlag(version int, sequenceType SequenceType, circularity bool, doubleStranded bool) byte
EncodeFlag encodes the version, circularity, double-strandedness, and type into a single byte flag. Used for seqhash v2.
func EncodeHash2 ¶
func EncodeHash2(hash [16]byte, err error) (string, error)
EncodeHash2 encodes Hash2 as a base58 string. It also adds a single letter metadata tag that can be used as an easy heuristic for an LLM to identify misbehaving code.
func Hash2 ¶
func Hash2(sequence string, sequenceType SequenceType, circular bool, doubleStranded bool) ([16]byte, error)
Hash2 creates a version 2 seqhash.
Example ¶
package main

import (
	"fmt"

	"github.com/koeng101/dnadesign/lib/seqhash"
)

func main() {
	sequence := "ATGC"
	sequenceType := seqhash.DNA
	circular := false
	doubleStranded := true

	sequenceSeqhash, _ := seqhash.Hash2(sequence, sequenceType, circular, doubleStranded)
	fmt.Println(sequenceSeqhash)
}
Output: [36 152 32 245 168 76 196 4 51 14 109 151 189 225 59 88]
func Hash2Fragment ¶
func Hash2Fragment(sequence string, fwdOverhangLength int8, revOverhangLength int8) ([16]byte, error)
Hash2Fragment creates a version 2 fragment seqhash. Fragment seqhashes are a special kind of seqhash that are used to identify fragments, usually released by restriction enzyme digestion, rather than complete DNA sequences. This is very useful for tracking genetic parts in a database: parts are abstracted away from their container vectors, so that many fragments in different vectors can be identified consistently.
fwdOverhangLength and revOverhangLength are the lengths of the forward and reverse overhangs. Sequences are hashed with their overhangs attached. Most of the time, both of these will equal 4, as the fragments are released by Type IIS restriction enzymes.
In order to make sure fwdOverhangLength and revOverhangLength fit in the hash, the hash is truncated at 13 bytes rather than 16, and both int8 are inserted. So the bytes would be:
flag + fwdOverhangLength + revOverhangLength + [13]byte(hash)
fwdOverhangLength and revOverhangLength are both int8, and their negatives are used if the overhang is on the 3prime strand, rather than the 5prime strand.
13 bytes is considered enough, because the number of fragments is limited by our ability to physically produce them, while other sequence types can be found in nature.
The fwdOverhang and revOverhang are the lengths of the overhangs of the input sequence. The hash, however, contains the forward and reverse overhang lengths of the deterministic sequence - i.e., the alphabetically less-than strand, when comparing the uppercase forward and reverse complement strands. This means if the input sequence is not less than its reverse complement (for example, GTT is greater than AAC), then the output hash will have the forward and reverse overhang lengths of the reverse complement, not the input strand.
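The byte layout described above (flag, two overhang lengths, then 13 hash bytes) can be sketched as follows. `packFragment` is a hypothetical helper, the flag value is a placeholder, and the digest is filler data standing in for the real hash output.

```go
package main

import "fmt"

// packFragment lays out a 16 byte fragment seqhash as:
// flag + fwdOverhangLength + revOverhangLength + first 13 bytes of the digest.
func packFragment(flag byte, fwd, rev int8, digest []byte) ([16]byte, error) {
	var out [16]byte
	if len(digest) < 13 {
		return out, fmt.Errorf("digest too short: %d bytes", len(digest))
	}
	out[0] = flag
	out[1] = byte(fwd) // int8 to byte keeps the two's complement bit pattern
	out[2] = byte(rev)
	copy(out[3:], digest[:13]) // truncate the digest to 13 bytes
	return out, nil
}

func main() {
	// Filler digest; a real one would come from the hash function.
	digest := make([]byte, 32)
	for i := range digest {
		digest[i] = byte(i + 1)
	}

	flag := byte(0x20) // placeholder flag value (assumption)
	packed, err := packFragment(flag, 4, -4, digest)
	if err != nil {
		panic(err)
	}
	// Converting back to int8 recovers the signed overhang lengths.
	fmt.Println(int8(packed[1]), int8(packed[2])) // 4 -4
}
```

Storing the lengths as signed bytes is what lets a negative value mark a 3prime overhang without spending an extra byte.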
func RotateSequence ¶
func RotateSequence(sequence string) string
RotateSequence rotates circular sequences to a deterministic point.
Example ¶
package main

import (
	"fmt"
	"os"

	"github.com/koeng101/dnadesign/lib/bio"
	"github.com/koeng101/dnadesign/lib/seqhash"
)

func main() {
	file, _ := os.Open("../data/puc19.gbk")
	defer file.Close()
	parser := bio.NewGenbankParser(file)
	sequence, _ := parser.Next()

	sequenceLength := len(sequence.Sequence)
	testSequence := sequence.Sequence[sequenceLength/2:] + sequence.Sequence[0:sequenceLength/2]

	fmt.Println(seqhash.RotateSequence(sequence.Sequence) == seqhash.RotateSequence(testSequence))
}
Output: true
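One common way to pick a deterministic rotation point is to take the lexicographically smallest rotation. The documentation above does not say which point RotateSequence chooses, so the sketch below is an assumption for illustration only; `leastRotation` is a hypothetical helper that compares every rotation (quadratic time), whereas linear-time methods such as Booth's algorithm exist for long sequences.

```go
package main

import "fmt"

// leastRotation returns the lexicographically smallest rotation of seq by
// comparing every rotation in turn.
func leastRotation(seq string) string {
	if seq == "" {
		return seq
	}
	doubled := seq + seq // every rotation is a window of the doubled string
	best := seq
	for i := 1; i < len(seq); i++ {
		if candidate := doubled[i : i+len(seq)]; candidate < best {
			best = candidate
		}
	}
	return best
}

func main() {
	// All rotations of the same circular sequence give the same result.
	fmt.Println(leastRotation("GCAT")) // ATGC
	fmt.Println(leastRotation("TGCA")) // ATGC
	fmt.Println(leastRotation("ATGC")) // ATGC
}
```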
Types ¶
type Hash2MetadataKey ¶
type Hash2MetadataKey struct {
	SequenceType   SequenceType
	Circular       bool
	DoubleStranded bool
}
Hash2MetadataKey is a key for a seqhash v2 single letter metadata tag.
type SequenceType ¶
type SequenceType string
SequenceType is a string type that names the sequence types supported by the Seqhash algorithm.
const (
	DNA      SequenceType = "DNA"
	RNA      SequenceType = "RNA"
	PROTEIN  SequenceType = "PROTEIN"
	FRAGMENT SequenceType = "FRAGMENT"
)
func DecodeFlag ¶
func DecodeFlag(flag byte) (int, SequenceType, bool, bool)
DecodeFlag decodes the single byte flag into its constituent parts. Outputs: version, sequence type, circularity, doubleStranded. Used for seqhash v2.