mash

package
v0.31.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 31, 2024 License: MIT Imports: 2 Imported by: 0

Documentation

Overview

Package mash is for sketching sequence data to make it easier to compare to other sequence.

The package is named mash after the mash sketching algorithm, which is based on the MinHash algorithm.

Mash: fast genome and metagenome distance estimation using MinHash. Ondov, B.D., Treangen, T.J., Melsted, P. et al. Genome Biol 17, 132 (2016). https://doi.org/10.1186/s13059-016-0997-x

Mash Screen: high-throughput sequence containment estimation for genome discovery. Ondov, B., Starrett, G., Sappington, A. et al. Genome Biol 20, 232 (2019). https://doi.org/10.1186/s13059-019-1841-x

tl;dr for the papers above:

Comparing biological sequences is really hard because similar sequences can have frameshifts that make it impossible to measure similarity using more common metric distances like hamming distance or levenshtein distance.

Bioinformatics and nlp researchers have come up with tons of string comparison algorithms that are better suited for comparing biological sequences. For example poly already implements a few of them like the Needleman-Wunsch and Smith-Waterman algorithms in our "align" package.

Mash is a different approach to comparing biological sequences. It uses a technique called sketching to reduce the complexity of the sequence to a vector of hashes. The hashes are generated by sliding a window of size k along the sequence and hashing each kmer. The hash is then stored in a vector of size s. The vector is sorted and the smallest hash is kept. The process is repeated until the vector is full. The vector of hashes is the sketch.

The sketch is then compared to other sketches by counting the number of hashes that are the same between the two sketches. The number of hashes that are the same is divided by the size of the sketch to get a distance between 0 and 1.

Hash vectors can only be compared to other hash vectors that use the same sliding window size. Sketch size limits how many hashes can be stored in the vector and the return vector will always be of length of the sketch size and filled the smallest hashes that were generated and sorted from smallest to largest.

The larger the sketch size the more accurate the distance calculation will be but the longer it will take to calculate.

TTFN, Tim

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Mash

type Mash struct {
	KmerSize   int      // The kmer size is the size of the sliding window that is used to generate the hashes.
	SketchSize int      // The sketch size is the number of hashes to store.
	Sketches   []uint32 // The sketches are the hashes of the kmers that we can compare to other sketches.
}

Mash is a collection of hashes of kmers from a given sequence.

Example
package main

import (
	"fmt"

	"github.com/bebop/poly/search/mash"
)

func main() {
	fingerprint1 := mash.New(17, 10)
	fingerprint1.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA")

	fingerprint2 := mash.New(17, 9)
	fingerprint2.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA")

	distance := fingerprint1.Distance(fingerprint2)

	fmt.Println(distance)

}
Output:

0

func New

func New(kmerSize int, sketchSize int) *Mash

New initializes a new mash sketch.

func (*Mash) Distance

func (mash *Mash) Distance(other *Mash) float64

Distance returns the Jaccard distance between two sketches (1 - similarity)

func (*Mash) Similarity

func (mash *Mash) Similarity(other *Mash) float64

Similarity returns the Jaccard similarity between two sketches (number of matching hashes / sketch size)

func (*Mash) Sketch

func (mash *Mash) Sketch(sequence string)

Sketch generates a mash sketch of the sequence.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL