mash

package

v0.31.1 Latest Latest Go to latest Published: Jan 31, 2024 License: MIT Imports: 2 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/bebop/poly

Links

Open Source Insights

Documentation ¶

Overview ¶

Package mash is for sketching sequence data to make it easier to compare to other sequence.

The package is named mash after the mash sketching algorithm, which is based on the MinHash algorithm.

Mash: fast genome and metagenome distance estimation using MinHash. Ondov, B.D., Treangen, T.J., Melsted, P. et al. Genome Biol 17, 132 (2016). https://doi.org/10.1186/s13059-016-0997-x

Mash Screen: high-throughput sequence containment estimation for genome discovery. Ondov, B., Starrett, G., Sappington, A. et al. Genome Biol 20, 232 (2019). https://doi.org/10.1186/s13059-019-1841-x

tl;dr for the papers above:

Comparing biological sequences is really hard because similar sequences can have frameshifts that make it impossible to measure similarity using more common metric distances like hamming distance or levenshtein distance.

Bioinformatics and nlp researchers have come up with tons of string comparison algorithms that are better suited for comparing biological sequences. For example poly already implements a few of them like the Needleman-Wunsch and Smith-Waterman algorithms in our "align" package.

Mash is a different approach to comparing biological sequences. It uses a technique called sketching to reduce the complexity of the sequence to a vector of hashes. The hashes are generated by sliding a window of size k along the sequence and hashing each kmer. The hash is then stored in a vector of size s. The vector is sorted and the smallest hash is kept. The process is repeated until the vector is full. The vector of hashes is the sketch.

The sketch is then compared to other sketches by counting the number of hashes that are the same between the two sketches. The number of hashes that are the same is divided by the size of the sketch to get a distance between 0 and 1.

Hash vectors can only be compared to other hash vectors that use the same sliding window size. Sketch size limits how many hashes can be stored in the vector and the return vector will always be of length of the sketch size and filled the smallest hashes that were generated and sorted from smallest to largest.

The larger the sketch size the more accurate the distance calculation will be but the longer it will take to calculate.

TTFN, Tim

Index ¶

type Mash
- func New(kmerSize int, sketchSize int) *Mash

Examples ¶

Mash

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Mash ¶

type Mash struct {
	KmerSize   int      // The kmer size is the size of the sliding window that is used to generate the hashes.
	SketchSize int      // The sketch size is the number of hashes to store.
	Sketches   []uint32 // The sketches are the hashes of the kmers that we can compare to other sketches.
}

Mash is a collection of hashes of kmers from a given sequence.

Example ¶

package main

import (
	"fmt"

	"github.com/bebop/poly/search/mash"
)

func main() {
	fingerprint1 := mash.New(17, 10)
	fingerprint1.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA")

	fingerprint2 := mash.New(17, 9)
	fingerprint2.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA")

	distance := fingerprint1.Distance(fingerprint2)

	fmt.Println(distance)

}

Output:

0

func New ¶

func New(kmerSize int, sketchSize int) *Mash

New initializes a new mash sketch.

func (*Mash) Distance ¶

func (mash *Mash) Distance(other *Mash) float64

Distance returns the Jaccard distance between two sketches (1 - similarity)

func (*Mash) Similarity ¶

func (mash *Mash) Similarity(other *Mash) float64

Similarity returns the Jaccard similarity between two sketches (number of matching hashes / sketch size)

func (*Mash) Sketch ¶

func (mash *Mash) Sketch(sequence string)

Sketch generates a mash sketch of the sequence.

Source Files ¶

View all Source files

mash.go

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL