align

package
v0.31.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 31, 2024 License: MIT Imports: 1 Imported by: 0

Documentation

Overview

Package align is a package for aligning (comparing) DNA, RNA, and protein sequences.

Biology is fickle and full of quirks that make it hard to do even the most basic of tasks which we would normally take for granted when working with other kinds of data.

Comparing two biological sequences to see if they're roughly equivalent is one of those tasks.

Essentially two almost identical sequences with almost identical functionality can contain small insertions or deletions that shift the entire string such that a meaningful comparison via hamming distance or levenshtein distance becomes impossible.

For example:

Timothy Stiles ||||||| |||||| Timothy Stiles

is an easy match with hamming or levenshtein distance.

However, say we introduce a new character to the beginning of the sequence.

Timothy Stiles xxxxxxxxxxxxxxxx A Timothy Stiles

Now our edit distance via levenshtein is maximized at 16 and we wouldn't be able to tell that semantically these strings are almost identical.

This frame shifting seen above is incredibly common within biological sequences and alignment algorithms are designed in part to deal with these shifts so that when we compare two sequences like the two below we can get a more useful edit distance.

GAAAAAAT GAA----T

As of writing this package includes the two most basic algorithms for alignment, Needleman-Wunsch and Smith-Waterman. Needleman-Wunsch is used when you are looking for global alignment between two full-length sequences, while Smith-Waterman is better at smaller sequences with local similarities and handling sequences with long non-homologous regions. BLAST, on the other hand, takes advantage of more heuristic techniques to speed up alignment, and is better at finding similar sequences in large database, sacrificing precision for faster results.

Both are "dynamic programming algorithms" which is a fancy 1980's term for they use matrices. If you're familiar with kernel operations, linear filters, or whatever term ML researchers are using nowadays for, "slide a window over a matrix and determine that entry's values using its neighbor's values", then this should be pretty easy to grok.

If not these algorithms essentially compare every character in one sequence with another sequence and create an edit distance along with human readable string to show gaps like the previous example.

I'm not really an expert on alignment so if you want to learn more about this class of algorithms wikipedia has a decent overview.

https://en.wikipedia.org/wiki/Sequence_alignment

Even if I may not know the answer to your alignment questions please ask and I'll do my best to help!

TTFN, Tim

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func NeedlemanWunsch

func NeedlemanWunsch(stringA string, stringB string, scoring Scoring) (int, string, string, error)

NeedlemanWunsch performs global alignment between two strings using the Needleman-Wunsch algorithm. It returns the final score and the optimal alignments of the two strings in O(nm) time and O(nm) space. https://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm

Example
package main

import (
	"fmt"

	"github.com/bebop/poly/alphabet"
	"github.com/bebop/poly/search/align"
	"github.com/bebop/poly/search/align/matrix"
)

func main() {
	a := "GATTACA"
	b := "GCATGCU"

	m := [][]int{
		/*       A C G T U */
		/* A */ {1, -1, -1, -1, -1},
		/* C */ {-1, 1, -1, -1, -1},
		/* G */ {-1, -1, 1, -1, -1},
		/* T */ {-1, -1, -1, 1, -1},
		/* U */ {-1, -1, -1, -1, 1},
	}

	alphabet := alphabet.NewAlphabet([]string{"A", "C", "G", "T", "U"})
	subMatrix, err := matrix.NewSubstitutionMatrix(alphabet, alphabet, m)
	if err != nil {
		fmt.Println(err)
		return
	}

	scoring, err := align.NewScoring(subMatrix, -1)
	if err != nil {
		fmt.Println(err)
		return
	}
	score, alignA, alignB, err := align.NeedlemanWunsch(a, b, scoring)

	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Printf("score: %d, A: %s, B: %s", score, alignA, alignB)

}
Output:

score: 0, A: G-ATTACA, B: GCA-TGCU

func SmithWaterman

func SmithWaterman(stringA string, stringB string, scoring Scoring) (int, string, string, error)

SmithWaterman performs local alignment between two strings using the Smith-Waterman algorithm. It returns the max score and optimal local alignments between two strings alignments of the two strings in O(nm) time and O(nm) space. https://en.wikipedia.org/wiki/Smith-Waterman_algorithm

Example
package main

import (
	"fmt"

	"github.com/bebop/poly/alphabet"
	"github.com/bebop/poly/search/align"
	"github.com/bebop/poly/search/align/matrix"
)

func main() {
	a := "GATTACA"
	b := "GCATGCU"

	m := [][]int{
		/*       A C G T U */
		/* A */ {1, -1, -1, -1, -1},
		/* C */ {-1, 1, -1, -1, -1},
		/* G */ {-1, -1, 1, -1, -1},
		/* T */ {-1, -1, -1, 1, -1},
		/* U */ {-1, -1, -1, -1, 1},
	}

	alphabet := alphabet.NewAlphabet([]string{"A", "C", "G", "T", "U"})
	subMatrix, err := matrix.NewSubstitutionMatrix(alphabet, alphabet, m)
	if err != nil {
		fmt.Println(err)
		return
	}
	scoring, err := align.NewScoring(subMatrix, -1)
	if err != nil {
		fmt.Println(err)
		return
	}
	score, alignA, alignB, err := align.SmithWaterman(a, b, scoring)

	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Printf("score: %d, A: %s, B: %s", score, alignA, alignB)

}
Output:

score: 2, A: AT, B: AT
Example (Matrix_nuc_4)
package main

import (
	"fmt"

	"github.com/bebop/poly/alphabet"
	"github.com/bebop/poly/search/align"
	"github.com/bebop/poly/search/align/matrix"
)

func main() {
	a := "GATTACA"
	b := "GCATGCT"

	alphabet := alphabet.NewAlphabet([]string{"A", "C", "G", "T", "-"})
	subMatrix, err := matrix.NewSubstitutionMatrix(alphabet, alphabet, matrix.NUC_4)
	if err != nil {
		fmt.Println(err)
		return
	}
	scoring, err := align.NewScoring(subMatrix, -1)

	if err != nil {
		fmt.Println(err)
		return
	}
	score, alignA, alignB, err := align.SmithWaterman(a, b, scoring)

	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Printf("score: %d, A: %s, B: %s", score, alignA, alignB)

}
Output:

score: 15, A: GATTAC, B: GCATGC

Types

type Scoring

type Scoring struct {
	SubstitutionMatrix *matrix.SubstitutionMatrix
	GapPenalty         int
}

Scoring is a struct that holds the scoring matrix for match, mismatch, and gap penalties.

func NewScoring

func NewScoring(substitutionMatrix *matrix.SubstitutionMatrix, gapPenalty int) (Scoring, error)

NewScoring returns a new Scoring struct with default values for DNA.

func (Scoring) Score

func (s Scoring) Score(a, b byte) (int, error)

Directories

Path Synopsis
Package matrix provides a variety of alignment scoring matrices for sequence alignment.
Package matrix provides a variety of alignment scoring matrices for sequence alignment.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL