seqhash

package

v0.0.0-...-f005bc5 Published: Dec 14, 2024 License: MIT Imports: 6 Imported by: 0

Repository

github.com/koeng101/dnadesign

Documentation ¶

Overview ¶

Package seqhash contains the reference seqhash algorithm.

There is a big problem with current sequence databases - they all use different identifiers and accession numbers. This means cross-referencing databases is a complicated exercise, especially as the quantity of databases increases, or if you need to compare "wild" DNA sequences.

Seqhash is a simple algorithm to produce consistent identifiers for any genetic sequence. The basic premise of the Seqhash algorithm is to hash sequences with the hash being a robust cross-database identifier. Sequences themselves shouldn't be used as a database index (often, they're too big), so a hash based off of a sequence is the next best thing.

Usability wise, you should be able to Seqhash any rotation of a sequence in any direction and get a consistent hash.

The Seqhash algorithm makes several opinionated design choices, primarily to make working with Seqhashes more consistent and pleasant. The Seqhash algorithm only uses a single hash function, Blake3, and only operates on DNA, RNA, and Protein sequences. These identifiers will be seen by human beings, so versioning and metadata are attached to the front of the hashes so that a human operator can quickly identify problems with hashing.

If the sequence is DNA or RNA, the Seqhash algorithm needs to know whether or not the nucleic acid is circular and/or double stranded. If circular, the sequence is rotated to a deterministic point. If double stranded, the sequence is compared to its reverse complement, and the lexicographically minimal sequence is taken (whether the min or the max is used doesn't matter; the choice just needs to be consistent).
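The double-stranded rule can be illustrated with a minimal sketch. It is not the package's implementation: `reverseComplement` and `canonicalStrand` are hypothetical helper names, and only the unambiguous bases A, T, G and C are handled to keep the example short.

```go
package main

import "fmt"

// complement maps each unambiguous DNA base to its Watson-Crick partner.
// Ambiguity codes are deliberately left out of this sketch.
var complement = map[byte]byte{'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

// reverseComplement returns the reverse complement of an uppercase ATGC string.
// (Hypothetical helper, not part of the seqhash package.)
func reverseComplement(seq string) string {
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		out[len(seq)-1-i] = complement[seq[i]]
	}
	return string(out)
}

// canonicalStrand returns the lexicographically smaller of seq and its
// reverse complement, so both strands map to the same string.
func canonicalStrand(seq string) string {
	rc := reverseComplement(seq)
	if rc < seq {
		return rc
	}
	return seq
}

func main() {
	// Both strands of the same duplex yield the same canonical string.
	fmt.Println(canonicalStrand("GTT") == canonicalStrand("AAC"))
}
```

Hashing the canonical strand instead of the input strand is what makes the identifier independent of the direction in which the sequence was read.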

If the sequence is RNA, the sequence will be converted to DNA before hashing. While the full Seqhash will still be different between RNA and DNA (due to the metadata string), the hash afterwards will be the same. This makes it easy to cross reference DNA and RNA sequences. This fact is important for the parts of DnaDesign that relate to storing and searching large quantities of sequences - deduplicating on those Seqhashes can save a lot of space.
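As a rough illustration of the RNA rule, the sketch below assumes that "converted to DNA" means replacing every U with T. `rnaToDNA` is a hypothetical name used only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// rnaToDNA upper-cases the sequence and replaces uracil with thymine.
// (Hypothetical helper; the package's own conversion may differ in detail.)
func rnaToDNA(seq string) string {
	return strings.ReplaceAll(strings.ToUpper(seq), "U", "T")
}

func main() {
	// An RNA sequence and its DNA counterpart give the same string to hash.
	fmt.Println(rnaToDNA("AUGC") == "ATGC")
}
```

Because both inputs collapse to the same string before hashing, only the metadata portion of the final seqhash tells the RNA and DNA records apart.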

For DNA or RNA sequences, only ATUGCYRSWKMBDHVNZ characters are allowed. For Proteins, only ACDEFGHIKLMNPQRSTVWYUO*BXZ characters are allowed in sequences. Selenocysteine (Sec; U) and pyrrolysine (Pyl; O) are included in the protein character set - usually U and O don't occur within protein sequences, but for certain organisms they do, and they are certainly relevant amino acids for those particular proteins.
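A character check against these alphabets could look like the sketch below. The two alphabet strings are copied from the text above; `firstInvalid` is a hypothetical helper, and the sketch assumes the input has already been upper-cased.

```go
package main

import (
	"fmt"
	"strings"
)

// Alphabets as listed in the package overview.
const (
	nucleotideAlphabet = "ATUGCYRSWKMBDHVNZ"
	proteinAlphabet    = "ACDEFGHIKLMNPQRSTVWYUO*BXZ"
)

// firstInvalid returns the byte index of the first character of seq that is
// not in alphabet, or -1 if every character is allowed.
// (Hypothetical helper, not part of the seqhash package.)
func firstInvalid(seq, alphabet string) int {
	return strings.IndexFunc(seq, func(r rune) bool {
		return !strings.ContainsRune(alphabet, r)
	})
}

func main() {
	fmt.Println(firstInvalid("ATGC", nucleotideAlphabet)) // all characters allowed
	fmt.Println(firstInvalid("ATGJ", nucleotideAlphabet)) // J is not a nucleotide code
	fmt.Println(firstInvalid("MKO*", proteinAlphabet))    // O and * are allowed in proteins
}
```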

Seqhash version 2 ¶

Version 1 seqhashes are deprecated.

Version 1 seqhashes were rather long, and version 2 seqhashes are built to be much shorter. The intended use case is handling sequences with LLM systems, since these systems' context window is a valuable resource, and smaller references allow the system to be more focused. Version 2 seqhashes are approximately 3x smaller than version 1 seqhashes. Officially, they are [16]byte arrays, but they can also be encoded with base58 to get a hash that can be used as a string across different systems. Here is a length comparison:

version 1: v1_DLD_f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1508a615f46350
version 2: C_5X6Hudy3K8ht7r4mvu9Gco

The metadata is now encoded in a 1 byte flag rather than the 7 rune metadata string used in version 1. Rather than use 256 bits for encoding the hash, we use 120 bits. Since seqhashes are not meant for security, this is good enough (50% collision with 1.3x10^18 hashes), while making them conveniently only 16 bytes long. Additionally, encoded prefixes are added to the front of the base58 encoded hash as a heuristic device for LLMs while processing batches of seqhashes.
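The collision figure can be checked with the usual birthday approximation, n ≈ sqrt(2 · ln 2 · 2^bits), which estimates how many random hashes are needed for a 50% chance of at least one collision. The sketch below is only an arithmetic check; `birthdayBound` is a name chosen for this example.

```go
package main

import (
	"fmt"
	"math"
)

// birthdayBound approximates the number of uniformly random hashes of the
// given bit width needed for a 50% chance of at least one collision.
func birthdayBound(bits float64) float64 {
	return math.Sqrt(2 * math.Ln2 * math.Pow(2, bits))
}

func main() {
	// 1 flag byte + 15 hash bytes (120 bits) = 16 bytes in total.
	fmt.Printf("%.2e\n", birthdayBound(120))
}
```

For 120 bits this comes out at roughly 1.36 x 10^18, in line with the figure quoted above.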

In addition, seqhashes can now encode fragments. Fragments are double stranded DNA that are the result of restriction digestion, with single stranded overhangs flanking both sides. These fragments can encode genetic parts - and an important part of any vector containing these parts would be the part seqhash, rather than the vector seqhash. This enhancement allows you to identify genetic parts regardless of their context.

Base58 is used rather than base64 so that seqhashes can easily be added into urls without a "/" in the identifier. It also happens to make smaller hashes than padded base64, because base64 chunks 3 bytes at a time - at 16 bytes, 2 padding bytes are added to make the seqhash length divisible by 3. Base58 chunks differently, and so doesn't encounter this problem.

Example (Basic) ¶

This example shows how to seqhash a sequence.

package main

import (
	"fmt"

	"github.com/koeng101/dnadesign/lib/seqhash"
)

func main() {
	sequence := "ATGC"
	sequenceType := seqhash.DNA
	circular := false
	doubleStranded := true

	sequenceSeqhash, _ := seqhash.EncodeHash2(seqhash.Hash2(sequence, sequenceType, circular, doubleStranded))
	fmt.Println(sequenceSeqhash)
}

Output:

C_5X6Hudy3K8ht7r4mvu9Gco

Index ¶

Variables
func DecodeHash2(encodedString string) ([16]byte, error)
func EncodeFlag(version int, sequenceType SequenceType, circularity bool, doubleStranded bool) byte
func EncodeHash2(hash [16]byte, err error) (string, error)
func Hash2(sequence string, sequenceType SequenceType, circular bool, doubleStranded bool) ([16]byte, error)
func Hash2Fragment(sequence string, fwdOverhangLength int8, revOverhangLength int8) ([16]byte, error)
func RotateSequence(sequence string) string
type Hash2MetadataKey
type SequenceType
- func DecodeFlag(flag byte) (int, SequenceType, bool, bool)

Constants ¶

This section is empty.

Variables ¶

var Hash2Metadata = map[Hash2MetadataKey]rune{
	{DNA, true, true}:        'A',
	{DNA, true, false}:       'B',
	{DNA, false, true}:       'C',
	{DNA, false, false}:      'D',
	{RNA, true, true}:        'E',
	{RNA, true, false}:       'F',
	{RNA, false, true}:       'G',
	{RNA, false, false}:      'H',
	{PROTEIN, false, false}:  'I',
	{PROTEIN, true, false}:   'J',
	{FRAGMENT, false, false}: 'K',
	{FRAGMENT, true, false}:  'L',
	{FRAGMENT, false, true}:  'M',
	{FRAGMENT, true, true}:   'N',
}

Hash2Metadata contains the seqhash v2 single letter metadata tags.
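To show how the tag can be read back, the sketch below copies the table above into local types and looks up the letter in front of an encoded seqhash. It assumes the encoded form is always "tag, underscore, base58 payload", as in the example `C_5X6Hudy3K8ht7r4mvu9Gco`; `metadataKey`, `metadata` and `describeTag` are local stand-ins, not the package's API.

```go
package main

import (
	"fmt"
	"strings"
)

type SequenceType string

const (
	DNA      SequenceType = "DNA"
	RNA      SequenceType = "RNA"
	PROTEIN  SequenceType = "PROTEIN"
	FRAGMENT SequenceType = "FRAGMENT"
)

// metadataKey mirrors the shape of Hash2MetadataKey for this sketch.
type metadataKey struct {
	SequenceType   SequenceType
	Circular       bool
	DoubleStranded bool
}

// metadata is a local copy of the table shown above.
var metadata = map[metadataKey]rune{
	{DNA, true, true}: 'A', {DNA, true, false}: 'B',
	{DNA, false, true}: 'C', {DNA, false, false}: 'D',
	{RNA, true, true}: 'E', {RNA, true, false}: 'F',
	{RNA, false, true}: 'G', {RNA, false, false}: 'H',
	{PROTEIN, false, false}: 'I', {PROTEIN, true, false}: 'J',
	{FRAGMENT, false, false}: 'K', {FRAGMENT, true, false}: 'L',
	{FRAGMENT, false, true}: 'M', {FRAGMENT, true, true}: 'N',
}

// describeTag splits an encoded seqhash at the first underscore and returns
// the metadata key for its one-letter tag together with the payload.
func describeTag(encoded string) (metadataKey, string, bool) {
	prefix, payload, found := strings.Cut(encoded, "_")
	if !found || len(prefix) != 1 {
		return metadataKey{}, "", false
	}
	for key, tag := range metadata {
		if tag == rune(prefix[0]) {
			return key, payload, true
		}
	}
	return metadataKey{}, "", false
}

func main() {
	key, payload, ok := describeTag("C_5X6Hudy3K8ht7r4mvu9Gco")
	fmt.Println(key.SequenceType, key.Circular, key.DoubleStranded, payload, ok)
}
```

The tag C resolves to linear, double stranded DNA, which agrees with the inputs used in the basic example.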

Functions ¶

func DecodeHash2 ¶

func DecodeHash2(encodedString string) ([16]byte, error)

DecodeHash2 decodes a seqhash into a [16]byte, including the metadata tag.

func EncodeFlag ¶

func EncodeFlag(version int, sequenceType SequenceType, circularity bool, doubleStranded bool) byte

EncodeFlag encodes the version, circularity, double-strandedness, and type into a single byte flag. Used for seqhash v2.
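Packing four fields into one byte can be sketched as below. The bit layout is an assumption made for illustration (version in bits 7-4, circularity in bit 3, double-strandedness in bit 2, sequence type in bits 1-0); the documentation does not state the real layout, and `packFlag` and `unpackFlag` are hypothetical stand-ins rather than the package's functions.

```go
package main

import "fmt"

// Assumed numeric codes for the four sequence types (illustration only).
const (
	typeDNA byte = iota
	typeRNA
	typeProtein
	typeFragment
)

// packFlag packs the fields into one byte using the assumed layout:
// bits 7-4 version, bit 3 circular, bit 2 double stranded, bits 1-0 type.
func packFlag(version int, seqType byte, circular, doubleStranded bool) byte {
	flag := byte(version&0x0F)<<4 | seqType&0x03
	if circular {
		flag |= 1 << 3
	}
	if doubleStranded {
		flag |= 1 << 2
	}
	return flag
}

// unpackFlag reverses packFlag under the same assumed layout.
func unpackFlag(flag byte) (version int, seqType byte, circular, doubleStranded bool) {
	return int(flag >> 4), flag & 0x03, flag&(1<<3) != 0, flag&(1<<2) != 0
}

func main() {
	flag := packFlag(2, typeDNA, false, true)
	fmt.Println(flag)
	fmt.Println(unpackFlag(flag))
}
```

Under this assumed layout, version 2 linear double stranded DNA packs to 36, which happens to equal the first byte of the Hash2 example output on this page. That agreement is suggestive only; the real layout may still differ.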

func EncodeHash2 ¶

func EncodeHash2(hash [16]byte, err error) (string, error)

EncodeHash2 encodes Hash2 as a base58 string. It also adds a single letter metadata tag that can be used as an easy heuristic for an LLM to identify misbehaving code.

func Hash2 ¶

func Hash2(sequence string, sequenceType SequenceType, circular bool, doubleStranded bool) ([16]byte, error)

Hash2 creates a version 2 seqhash.

Example ¶

package main

import (
	"fmt"

	"github.com/koeng101/dnadesign/lib/seqhash"
)

func main() {
	sequence := "ATGC"
	sequenceType := seqhash.DNA
	circular := false
	doubleStranded := true

	sequenceSeqhash, _ := seqhash.Hash2(sequence, sequenceType, circular, doubleStranded)
	fmt.Println(sequenceSeqhash)
}

Output:

[36 152 32 245 168 76 196 4 51 14 109 151 189 225 59 88]

func Hash2Fragment ¶

func Hash2Fragment(sequence string, fwdOverhangLength int8, revOverhangLength int8) ([16]byte, error)

Hash2Fragment creates a version 2 fragment seqhash. Fragment seqhashes are a special kind of seqhash used to identify fragments, usually released by restriction enzyme digestion, rather than complete DNA sequences. This is very useful for tracking genetic parts in a database: parts are abstracted away from their container vectors, so that fragments in many different vectors can be identified consistently.

fwdOverhangLength and revOverhangLength are the lengths of the forward and reverse overhangs. Sequences are hashed with their overhangs attached. Most of the time, both of these will equal 4, as the fragments are released by Type IIS restriction enzymes.

In order to make sure fwdOverhangLength and revOverhangLength fit in the hash, the hash is truncated at 13 bytes rather than 16, and both int8 values are inserted. So the bytes would be:

flag + fwdOverhangLength + revOverhangLength + [13]byte(hash)

fwdOverhangLength and revOverhangLength are both int8, and their negatives are used if the overhang is on the 3' strand rather than the 5' strand.
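The byte layout described above can be sketched as follows. `assembleFragmentHash` is a hypothetical helper that only shows how a flag byte, two int8 overhang lengths and a 13 byte digest fit into 16 bytes; how the package computes the flag and the digest is not shown here.

```go
package main

import "fmt"

// assembleFragmentHash lays out flag + fwdOverhangLength + revOverhangLength
// + 13 digest bytes into a [16]byte. (Hypothetical helper for illustration.)
func assembleFragmentHash(flag byte, fwd, rev int8, digest [13]byte) [16]byte {
	var out [16]byte
	out[0] = flag
	out[1] = byte(fwd) // a negative int8 is stored as its two's complement byte
	out[2] = byte(rev)
	copy(out[3:], digest[:])
	return out
}

func main() {
	var digest [13]byte // placeholder digest, all zeros
	out := assembleFragmentHash(0, 4, -4, digest)
	// Reading the overhang bytes back as int8 recovers the signed lengths.
	fmt.Println(int8(out[1]), int8(out[2]))
}
```

Storing the lengths as int8 keeps the sign, so a reader of the hash can tell a 5' overhang from a 3' overhang without any extra bytes.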

13 bytes is considered enough, because the number of fragments is limited by our ability to physically produce them, while other sequence types can be found in nature.

The fwdOverhang and revOverhang are the lengths of the overhangs of the input sequence. The hash, however, contains the forward and reverse overhang lengths of the deterministic sequence, i.e. the alphabetically lesser strand when comparing the uppercase forward and reverse complement strands. This means that if the input sequence is not less than its reverse complement (for example, GTT is greater than AAC), then the output hash will have the forward and reverse overhang lengths of the reverse complement, not the input strand.

func RotateSequence ¶

func RotateSequence(sequence string) string

RotateSequence rotates circular sequences to a deterministic point.

Example ¶

package main

import (
	"fmt"
	"os"

	"github.com/koeng101/dnadesign/lib/bio"
	"github.com/koeng101/dnadesign/lib/seqhash"
)

func main() {
	file, _ := os.Open("../data/puc19.gbk")
	defer file.Close()
	parser := bio.NewGenbankParser(file)
	sequence, _ := parser.Next()

	sequenceLength := len(sequence.Sequence)
	testSequence := sequence.Sequence[sequenceLength/2:] + sequence.Sequence[0:sequenceLength/2]

	fmt.Println(seqhash.RotateSequence(sequence.Sequence) == seqhash.RotateSequence(testSequence))
}

Output:

true
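One simple way to pick a deterministic point is to take the lexicographically smallest rotation. The sketch below does this naively in quadratic time; it is an illustration of the idea only, `leastRotation` is a hypothetical name, and the documentation does not say which rotation or algorithm RotateSequence actually uses.

```go
package main

import "fmt"

// leastRotation returns the lexicographically smallest rotation of seq.
// It compares every rotation, so it runs in O(n^2) time.
func leastRotation(seq string) string {
	if seq == "" {
		return seq
	}
	doubled := seq + seq // every rotation is a window of length len(seq)
	best := seq
	for i := 1; i < len(seq); i++ {
		if candidate := doubled[i : i+len(seq)]; candidate < best {
			best = candidate
		}
	}
	return best
}

func main() {
	// Any two rotations of the same circular sequence agree after rotation.
	fmt.Println(leastRotation("GCAT") == leastRotation("TGCA"))
}
```

For plasmid-sized inputs a linear-time method such as Booth's least rotation algorithm would be the more practical choice.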

Types ¶

type Hash2MetadataKey ¶

type Hash2MetadataKey struct {
	SequenceType   SequenceType
	Circular       bool
	DoubleStranded bool
}

Hash2MetadataKey is a key for a seqhash v2 single letter metadata tag.

type SequenceType ¶

type SequenceType string

SequenceType is a string type that names the sequence types supported by the Seqhash algorithm.

const (
	DNA      SequenceType = "DNA"
	RNA      SequenceType = "RNA"
	PROTEIN  SequenceType = "PROTEIN"
	FRAGMENT SequenceType = "FRAGMENT"
)

func DecodeFlag ¶

func DecodeFlag(flag byte) (int, SequenceType, bool, bool)

DecodeFlag decodes the single byte flag into its constituent parts. Going by the signature, the outputs are: version, sequence type (DNA, RNA, protein, or fragment), and two booleans for circularity and doubleStranded. Used for seqhash v2.
