seq

package module

v0.0.0-...-72e89e6 Latest Latest Go to latest Published: Mar 3, 2014 License: MIT Imports: 7 Imported by: 53

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/TuftsBCB/seq

Links

Open Source Insights

README ¶

This package provides common operations for dealing with biological sequence data. The package is not exhaustive and sometimes only provides representations for essential data without operations. (For example, a representation of a multiple-sequence alignment without a multiple-sequence alignment algorithm.)

The sequence type defined here is used throughout other Go packages in the TuftsBCB group.

Documentation for this package can be found here:

http://godoc.org/github.com/TuftsBCB/seq

Installation

Once you have Go installed, just run

go get github.com/TuftsBCB/seq

to install the packages.

Documentation ¶

Overview ¶

Package seq provides common types and operations for dealing with biological sequence data, with a bias toward amino acid sequences. Types includes sequences, profiles, multiple sequence alignments and HMMs. Operations include sequence alignment (currently only Needleman-Wunsch global alignment), building frequency profiles with background probabilities and an implementation of the Viterbi algorithm to find the probability of the most likely alignment of a sequence to an HMM.

This package is currently a "kitchen sink" of operations on biological sequences. It isn't yet clear (to me) whether it should remain a kitchen sink.

Index ¶

Variables
type Alignment
- func NeedlemanWunsch(A, B []Residue, subst SubstMatrix) Alignment
type Alphabet
- func NewAlphabet(residues ...Residue) Alphabet
- func (a1 Alphabet) Equals(a2 Alphabet) bool
- func (a Alphabet) Index() [256]int
- func (a Alphabet) Len() int
- func (a *Alphabet) MarshalJSON() ([]byte, error)
- func (a Alphabet) String() string
- func (a *Alphabet) UnmarshalJSON(bs []byte) error
type DynamicTable
- func AllocTable(numNodes int, seqLen int) *DynamicTable
type EProbs
- func NewEProbs(alphabet Alphabet) EProbs
- func (ep EProbs) Lookup(r Residue) Prob
- func (ep *EProbs) Set(r Residue, p Prob)
type FrequencyProfile
- func NewFrequencyProfile(columns int) *FrequencyProfile
- func NewFrequencyProfileAlphabet(columns int, alphabet Alphabet) *FrequencyProfile
- func NewNullProfile() *FrequencyProfile
- func (fp *FrequencyProfile) Add(s Sequence)
- func (fp1 *FrequencyProfile) Combine(fp2 *FrequencyProfile)
- func (fp *FrequencyProfile) Len() int
- func (fp *FrequencyProfile) Profile(null *FrequencyProfile) *Profile
- func (fp *FrequencyProfile) String() string
type HMM
- func HMMCat(h1, h2 *HMM) *HMM
- func NewHMM(nodes []HMMNode, alphabet []Residue, null EProbs) *HMM
- func (hmm *HMM) Slice(start, end int) *HMM
- func (hmm *HMM) ViterbiScore(seq Sequence) Prob
- func (hmm *HMM) ViterbiScoreMem(seq Sequence, table *DynamicTable) Prob
type HMMNode
type HMMState
type MSA
- func NewMSA() MSA
- func (m *MSA) Add(adds Sequence)
- func (m *MSA) AddFasta(adds Sequence)
- func (m *MSA) AddFastaSlice(seqs []Sequence)
- func (m *MSA) AddSlice(seqs []Sequence)
- func (m MSA) Get(row int) Sequence
- func (m MSA) GetA2M(row int) Sequence
- func (m MSA) GetA3M(row int) Sequence
- func (m MSA) GetFasta(row int) Sequence
- func (m MSA) Len() int
- func (m *MSA) SetLen(n int)
- func (m MSA) Slice(start, end int) MSA
- func (m MSA) String() string
type Prob
- func NewProb(fstr string) (Prob, error)
- func (p1 Prob) Distance(p2 Prob) float64
- func (p Prob) IsMin() bool
- func (p1 Prob) Less(p2 Prob) bool
- func (p Prob) MarshalJSON() ([]byte, error)
- func (p Prob) Ratio() float64
- func (p Prob) String() string
- func (p *Prob) UnmarshalJSON(bs []byte) error
type Profile
- func NewProfile(columns int) *Profile
- func NewProfileAlphabet(columns int, alphabet Alphabet) *Profile
- func (p *Profile) Len() int
- func (p *Profile) String() string
type Residue
- func (r Residue) HMMState() HMMState
type Sequence
- func NewSequenceString(name, srs string) Sequence
- func (s Sequence) Bytes() []byte
- func (s Sequence) Copy() Sequence
- func (s Sequence) IsNull() bool
- func (s Sequence) Len() int
- func (s Sequence) Slice(start, end int) Sequence
type SubstMatrix
type TProbs

Examples ¶

NeedlemanWunsch

Constants ¶

This section is empty.

Variables ¶

View Source

var (

	// A standard BLOSUM62 substitution matrix for amino acid sequences.
	SubstBlosum62 = SubstMatrix{AlphaBlosum62, blosum62}

	// An identity substitution matrix for DNA sequences.
	SubstDNA = SubstMatrix{AlphaDNA, identity}

	// An identity substitution matrix for DNA sequences.
	SubstRNA = SubstMatrix{AlphaRNA, identity}
)

View Source

var AlphaBlosum62 = NewAlphabet(
	'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M',
	'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z', '-',
)

The default alphabet that corresponds to the BLOSUM62 matrix included in this package.

View Source

var AlphaDNA = NewAlphabet(
	'A', 'C', 'G', 'T', 'N', '-',
)

The default alphabet for DNA sequences.

View Source

var AlphaRNA = NewAlphabet(
	'A', 'C', 'G', 'U', 'N', '-',
)

The default alphabet for RNA sequences.

View Source

var MinProb = Prob(math.MaxFloat64)

The value representing a minimum emission/transition probability. Remember, max in negative log space is minimum probability.

Functions ¶

This section is empty.

Types ¶

type Alignment ¶

type Alignment struct {
	A []Residue // Reference
	B []Residue // Query
}

Alignment represents the result of aligning two sequences.

func NeedlemanWunsch ¶

func NeedlemanWunsch(A, B []Residue, subst SubstMatrix) Alignment

Performs the Needleman-Wunsch sequence alignment algorithm on a pair of sequences. A function `subst` should return alignment scores for pairs of residues. This package provides some functions suitable for this purpose, e.g., MatBlosum62, MatDNA, MatRNA, etc.

Example ¶

s1 := NewSequenceString("seq1", "GHIKLMNPQR")
s2 := NewSequenceString("seq1", "GAAAHIKLMN")
aligned := NeedlemanWunsch(s1.Residues, s2.Residues, SubstBlosum62)

fmt.Printf("%s\n", aligned.A)
fmt.Printf("%s\n", aligned.B)

Output:

G---HIKLMNPQR
GAAAHIKLMN---

type Alphabet ¶

type Alphabet []Residue

Alphabet corresponds to a set of residues, in a particular order, that capture all possible residues of a particular set of sequences. For example, this is used in frequency profiles and HMMs to specify which amino acids are represented in the probabilistic model.

In most cases, the ordering of the alphabet is significant. For example, the indices of an alphabet may be in correspondence with the indices of a column in a frequency profile.

func NewAlphabet ¶

func NewAlphabet(residues ...Residue) Alphabet

NewAlphabet creates an alphabet from the residues given.

func (Alphabet) Equals ¶

func (a1 Alphabet) Equals(a2 Alphabet) bool

Equals returns true if and only if a1 == a2.

func (Alphabet) Index ¶

func (a Alphabet) Index() [256]int

Index returns a constant-time mapping from ASCII to residue index in the alphabet. This depends on all residues in the alphabet being ASCII characters.

func (Alphabet) Len ¶

func (a Alphabet) Len() int

func (*Alphabet) MarshalJSON ¶

func (a *Alphabet) MarshalJSON() ([]byte, error)

func (Alphabet) String ¶

func (a Alphabet) String() string

func (*Alphabet) UnmarshalJSON ¶

func (a *Alphabet) UnmarshalJSON(bs []byte) error

type DynamicTable ¶

type DynamicTable struct {
	// contains filtered or unexported fields
}

DynamicTable represents a dynamic programming table used in sequence alignment algorithms like Viterbi.

func AllocTable ¶

func AllocTable(numNodes int, seqLen int) *DynamicTable

AllocTable returns a freshly allocated dynamic programming table for use in HMM alogirthms like Viterbi. It is indexed by HMM state, node index and sequence length, in that order. The total size of the table is equal to (#states * (#nodes + 1) * (seqLen + 1)).

The only states used are Match, Deletion and Insertion.

Each value is initialized to a minimum probability.

type EProbs ¶

type EProbs struct {
	Offset Residue
	Probs  []Prob
}

EProbs represents emission probabilities, as log-odds scores. The representation of EProbs should not be used by clients; it is exported only for convenience with marshalling APIs.

func NewEProbs ¶

func NewEProbs(alphabet Alphabet) EProbs

NewEProbs creates a new EProbs map from the given alphabet. Keys of the map are residues defined in the alphabet, and values are defaulted to the minimum probability.

func (EProbs) Lookup ¶

func (ep EProbs) Lookup(r Residue) Prob

Returns the emission probability for a particular residue.

func (*EProbs) Set ¶

func (ep *EProbs) Set(r Residue, p Prob)

Set sets the emission probability of the given residue.

type FrequencyProfile ¶

type FrequencyProfile struct {
	// The columns of a frequency profile.
	Freqs []map[Residue]int

	// The alphabet of the profile. The length of the alphabet should be
	// equal to the number of rows in the frequency profile.
	// There are no restrictions on the alphabet. (i.e., Gap characters are
	// allowed but they are not treated specially.)
	Alphabet Alphabet
}

FrequencyProfile represents a sequence profile in terms of raw frequencies. A FrequencyProfile is useful as an intermediate representation. It can be used to incrementally build a Profile.

func NewFrequencyProfile ¶

func NewFrequencyProfile(columns int) *FrequencyProfile

NewFrequencyProfile initializes a frequency profile with a default alphabet that is compatible with this package's BLOSUM62 matrix.

func NewFrequencyProfileAlphabet ¶

func NewFrequencyProfileAlphabet(
	columns int,
	alphabet Alphabet,
) *FrequencyProfile

NewFrequencyProfileAlphabet initializes a frequency profile with the given alphabet.

func NewNullProfile ¶

func NewNullProfile() *FrequencyProfile

NewNullProfile initializes a frequency profile that can be used to tabulate a null model. This is equivalent to calling NewFrequencyProfile with the number of columns set to 1.

func (*FrequencyProfile) Add ¶

func (fp *FrequencyProfile) Add(s Sequence)

Add adds the sequence to the given profile. The sequence must have length equivalent to the number of columns in the profile. The sequence must also only contain residues that are in the alphabet for the profile.

As a special case, if the alphabet contains the 'X' residue, then any unrecognized residues in the sequence with respect to the profile's alphabet will be considered as an 'X' residue.

func (*FrequencyProfile) Combine ¶

func (fp1 *FrequencyProfile) Combine(fp2 *FrequencyProfile)

Combine adds the given frequency profile to the current one. Both profiles must have the same number of columns.

func (*FrequencyProfile) Len ¶

func (fp *FrequencyProfile) Len() int

Len returns the number of columns in the frequency profile.

func (*FrequencyProfile) Profile ¶

func (fp *FrequencyProfile) Profile(null *FrequencyProfile) *Profile

Profile converts a raw frequency profile to a profile that uses a log-odds representation. The log-odds scores are computed with the given null model, which is itself just a raw frequency profile with a single column.

func (*FrequencyProfile) String ¶

func (fp *FrequencyProfile) String() string

type HMM ¶

type HMM struct {
	// An ordered list of HMM nodes.
	Nodes []HMMNode

	// The alphabet as defined by an ordering of residues.
	// Indices in this slice correspond to indices in match/insertion emissions.
	Alphabet Alphabet

	// NULL model. (Amino acid background frequencies.)
	// HMMER hmm files don't have this, but HHsuite hhm files do.
	// In the case of HHsuite, the NULL model is used for insertion emissions
	// in every node.
	Null EProbs
}

func HMMCat ¶

func HMMCat(h1, h2 *HMM) *HMM

HMMCat joins two HMMs together. The HMMs given are not modified. Both HMMs must have the same alphabet. The null emissions for the first HMM are used.

func NewHMM ¶

func NewHMM(nodes []HMMNode, alphabet []Residue, null EProbs) *HMM

NewHMM creates a new HMM from a list of nodes, an ordered alphabet and a set of null probabilities (which may be nil).

func (*HMM) Slice ¶

func (hmm *HMM) Slice(start, end int) *HMM

Slice returns a slice of the HMM given. A slice of an HMM returns only the HMM nodes (i.e., columns or match/delete states) in the region specified by the slice. Also, the transition probabilities of the last state are specially set: M->M = 0, M->I = *, M->D = *, I->M = 0, I->I = *, D->M = 0, and D->D = *. No other modifications are made.

func (*HMM) ViterbiScore ¶

func (hmm *HMM) ViterbiScore(seq Sequence) Prob

ViterbiScore returns the probability of the likeliest path through the HMM for the given sequence.

If you're running Viterbi in a performance critical section, ViterbiScoreMem may be appropriate.

Note that the state path is not computed.

func (*HMM) ViterbiScoreMem ¶

func (hmm *HMM) ViterbiScoreMem(seq Sequence, table *DynamicTable) Prob

ViterbiScoreMem is the same as ViterbiScore, except it does not allocate, which makes it faster in performance critical sections of code. This is done by passing a pre-allocated dynamic programming table created by AllocTable function.

Note that the caller must ensure that only one goroutine is calling ViterbiScoreMem with the same dynamic programming table.

type HMMNode ¶

type HMMNode struct {
	Residue             Residue
	NodeNum             int
	InsEmit             EProbs
	MatEmit             EProbs
	Transitions         TProbs
	NeffM, NeffI, NeffD Prob
}

HMMNode represents a single node in an HMM, including the reference residue, the node index, insertion emissions, match emissions, transition probabilities.

The NeffM, NeffI and NeffD aren't used, but are included since they exist in common HMM file formats.

type HMMState ¶

type HMMState int

const (
	Match HMMState = iota
	Deletion
	Insertion
	Begin
	End
)

HMM states in the Plan7 architecture.

type MSA ¶

type MSA struct {
	Entries []Sequence
	// contains filtered or unexported fields
}

func NewMSA ¶

func NewMSA() MSA

func (*MSA) Add ¶

func (m *MSA) Add(adds Sequence)

Add adds a new entry to the multiple sequence alignment. Sequences must be in A2M or A3M format. If your sequence is from a FASTA aligned format, use AddFasta.

Empty sequences are ignored.

func (*MSA) AddFasta ¶

func (m *MSA) AddFasta(adds Sequence)

AddFasta adds a FASTA formatted sequence to the MSA. If your sequences are in A2M or A3M format, use Add.

Empty sequences are ignored.

func (*MSA) AddFastaSlice ¶

func (m *MSA) AddFastaSlice(seqs []Sequence)

AddFastaSlice calls "AddFasta" for each sequence in the slice.

func (*MSA) AddSlice ¶

func (m *MSA) AddSlice(seqs []Sequence)

AddSlice calls "Add" for each sequence in the slice.

func (MSA) Get ¶

func (m MSA) Get(row int) Sequence

Get gets a copy of the sequence at the provided row in the MSA. The sequence is in the default format of the MSA representation. Currently, this is A2M. ('-' for deletes, '.' and a-z for inserts, and A-Z for matches.)

func (MSA) GetA2M ¶

func (m MSA) GetA2M(row int) Sequence

GetA2M gets a copy of the sequence at the provided row in the MSA in A2M format.

func (MSA) GetA3M ¶

func (m MSA) GetA3M(row int) Sequence

GetA3M gets a copy of the sequence at the provided row in the MSA in A3M format. This is the same as A2M format, except all '.' (period) are omitted.

func (MSA) GetFasta ¶

func (m MSA) GetFasta(row int) Sequence

GetFasta gets a copy of the sequence at the provided row in the MSA in aligned FASTA format. This is the same as A2M format, except all '.' (period) are changed to '-' residues.

func (MSA) Len ¶

func (m MSA) Len() int

Len returns the length of the alignment. (All entries in an MSA are guaranteed to have the same length.)

func (*MSA) SetLen ¶

func (m *MSA) SetLen(n int)

SetLen allows you to override the length of the MSA. Only use this if you know what you're doing. (The Add and AddFasta methods will update this for you automatically.)

func (MSA) Slice ¶

func (m MSA) Slice(start, end int) MSA

Slice returns a slice of the given MSA by slicing each sequence in the MSA. Slicing an empty MSA will panic.

Note that slicing works in terms of *match states* of the first sequence in the MSA.

func (MSA) String ¶

func (m MSA) String() string

type Prob ¶

type Prob float64

Prob represents a transition or emission probability.

func NewProb ¶

func NewProb(fstr string) (Prob, error)

NewProb creates a new probability value from a string (usually read from an hmm or hhm file). If the string is equivalent to the special value "*", then the probability returned is guaranteed to be minimal. Otherwise, the string is parsed as a float, and an error returned if parsing fails.

func (Prob) Distance ¶

func (p1 Prob) Distance(p2 Prob) float64

Distance returns the distance between `p1` and `p2`.

func (Prob) IsMin ¶

func (p Prob) IsMin() bool

IsMin returns true if the probability is minimal.

func (Prob) Less ¶

func (p1 Prob) Less(p2 Prob) bool

Less returns true if `p1` represents a smaller probability than `p2`.

func (Prob) MarshalJSON ¶

func (p Prob) MarshalJSON() ([]byte, error)

func (Prob) Ratio ¶

func (p Prob) Ratio() float64

Ratio returns the log probability as a ratio in the range [0, 1]. (This assumes that `p` is a log-odds score.)

func (Prob) String ¶

func (p Prob) String() string

String returns a string representation of the probability. When `p` is the minimum probability, then "*" is used. Otherwise, the full number is written.

func (*Prob) UnmarshalJSON ¶

func (p *Prob) UnmarshalJSON(bs []byte) error

type Profile ¶

type Profile struct {
	// The columns of a profile.
	Emissions []EProbs

	// The alphabet of the profile. The length of the alphabet should be
	// equal to the number of rows in the profile.
	// There are no restrictions on the alphabet. (i.e., Gap characters are
	// allowed but they are not treated specially.)
	Alphabet Alphabet
}

Profile represents a sequence profile in terms of log-odds scores.

func NewProfile ¶

func NewProfile(columns int) *Profile

NewProfile initializes a profile with a default alphabet that is compatible with this package's BLOSUM62 matrix. Emission probabilities are set to the minimum log-odds probability.

func NewProfileAlphabet ¶

func NewProfileAlphabet(columns int, alphabet Alphabet) *Profile

NewProfileAlphabet initializes a profile with the given alphabet. Emission probabilities are set to the minimum log-odds probability.

func (*Profile) Len ¶

func (p *Profile) Len() int

func (*Profile) String ¶

func (p *Profile) String() string

type Residue ¶

type Residue byte

A Residue corresponds to a single entry in a sequence.

func (Residue) HMMState ¶

func (r Residue) HMMState() HMMState

HMMState returns the HMMState of a particular residue in a sequence. Residues in [A-Z] are match states. Residues matching '-' are deletion states. Residues equal to '.' or in [a-z] are insertion states.

A residue corresponding to any other value will panic.

The pre-condition here is that 'r' is a residue from a sequence from an A2M format. (N.B. MSA's formed from A3M and FASTA formatted files are repsented as A2M format, so MSA's read from A3M/FASTA files are OK.)

type Sequence ¶

type Sequence struct {
	Name     string
	Residues []Residue
}

A Sequence corresponds to any kind of biological sequence: DNA, RNA, amino acid, secondary structure, etc.

func NewSequenceString ¶

func NewSequenceString(name, srs string) Sequence

NewSequenceString is a convenience function for constructing a sequence from a string. It is otherwise appropriate to create new Sequence values directly.

func (Sequence) Bytes ¶

func (s Sequence) Bytes() []byte

Bytes returns the sequence of residues as a byte slice.

func (Sequence) Copy ¶

func (s Sequence) Copy() Sequence

Copy returns a deep copy of the sequence.

func (Sequence) IsNull ¶

func (s Sequence) IsNull() bool

IsNull returns true if the name has zero length and the residues are nil.

func (Sequence) Len ¶

func (s Sequence) Len() int

Len returns the number of residues in the sequence.

func (Sequence) Slice ¶

func (s Sequence) Slice(start, end int) Sequence

Slice returns a slice of the sequence. The name stays the same, and the sequence of residues corresponds to a Go slice of the original. (This does not copy data, so that if the original or sliced sequence is changed, the other one will too. Use Sequence.Copy first if you need copy semantics.)

type SubstMatrix ¶

type SubstMatrix struct {
	Alphabet Alphabet
	Scores   [][]int
}

SubstMatrix corresponds to a substitution matrix and an alphabet that is used in sequence alignment algorithms. The matrix given should be square, with rows and columns equivalent to the length of the alphabet. (This means that if your substitution matrix includes gap penalties, then the alphabet should also have a gap character.)

type TProbs ¶

type TProbs struct {
	MM, MI, MD, IM, II, DM, DD Prob
}

TProbs represents transition probabilities, as log-odds scores. Note that ID and DI are omitted (Plan7).

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL