Documentation
¶
Overview ¶
Package seq provides common types and operations for dealing with biological sequence data, with a bias toward amino acid sequences. Types includes sequences, profiles, multiple sequence alignments and HMMs. Operations include sequence alignment (currently only Needleman-Wunsch global alignment), building frequency profiles with background probabilities and an implementation of the Viterbi algorithm to find the probability of the most likely alignment of a sequence to an HMM.
This package is currently a "kitchen sink" of operations on biological sequences. It isn't yet clear (to me) whether it should remain a kitchen sink.
Index ¶
- Variables
- type Alignment
- type Alphabet
- type DynamicTable
- type EProbs
- type FrequencyProfile
- type HMM
- type HMMNode
- type HMMState
- type MSA
- func (m *MSA) Add(adds Sequence)
- func (m *MSA) AddFasta(adds Sequence)
- func (m *MSA) AddFastaSlice(seqs []Sequence)
- func (m *MSA) AddSlice(seqs []Sequence)
- func (m MSA) Get(row int) Sequence
- func (m MSA) GetA2M(row int) Sequence
- func (m MSA) GetA3M(row int) Sequence
- func (m MSA) GetFasta(row int) Sequence
- func (m MSA) Len() int
- func (m *MSA) SetLen(n int)
- func (m MSA) Slice(start, end int) MSA
- func (m MSA) String() string
- type Prob
- type Profile
- type Residue
- type Sequence
- type SubstMatrix
- type TProbs
Examples ¶
Constants ¶
This section is empty.
Variables ¶
var ( // A standard BLOSUM62 substitution matrix for amino acid sequences. SubstBlosum62 = SubstMatrix{AlphaBlosum62, blosum62} // An identity substitution matrix for DNA sequences. SubstDNA = SubstMatrix{AlphaDNA, identity} // An identity substitution matrix for DNA sequences. SubstRNA = SubstMatrix{AlphaRNA, identity} )
var AlphaBlosum62 = NewAlphabet(
'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M',
'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z', '-',
)
The default alphabet that corresponds to the BLOSUM62 matrix included in this package.
var AlphaDNA = NewAlphabet(
'A', 'C', 'G', 'T', 'N', '-',
)
The default alphabet for DNA sequences.
var AlphaRNA = NewAlphabet(
'A', 'C', 'G', 'U', 'N', '-',
)
The default alphabet for RNA sequences.
var MinProb = Prob(math.MaxFloat64)
The value representing a minimum emission/transition probability. Remember, max in negative log space is minimum probability.
Functions ¶
This section is empty.
Types ¶
type Alignment ¶
Alignment represents the result of aligning two sequences.
func NeedlemanWunsch ¶
func NeedlemanWunsch(A, B []Residue, subst SubstMatrix) Alignment
Performs the Needleman-Wunsch sequence alignment algorithm on a pair of sequences. A function `subst` should return alignment scores for pairs of residues. This package provides some functions suitable for this purpose, e.g., MatBlosum62, MatDNA, MatRNA, etc.
Example ¶
s1 := NewSequenceString("seq1", "GHIKLMNPQR") s2 := NewSequenceString("seq1", "GAAAHIKLMN") aligned := NeedlemanWunsch(s1.Residues, s2.Residues, SubstBlosum62) fmt.Printf("%s\n", aligned.A) fmt.Printf("%s\n", aligned.B)
Output: G---HIKLMNPQR GAAAHIKLMN---
type Alphabet ¶
type Alphabet []Residue
Alphabet corresponds to a set of residues, in a particular order, that capture all possible residues of a particular set of sequences. For example, this is used in frequency profiles and HMMs to specify which amino acids are represented in the probabilistic model.
In most cases, the ordering of the alphabet is significant. For example, the indices of an alphabet may be in correspondence with the indices of a column in a frequency profile.
func NewAlphabet ¶
NewAlphabet creates an alphabet from the residues given.
func (Alphabet) Index ¶
Index returns a constant-time mapping from ASCII to residue index in the alphabet. This depends on all residues in the alphabet being ASCII characters.
func (*Alphabet) MarshalJSON ¶
func (*Alphabet) UnmarshalJSON ¶
type DynamicTable ¶
type DynamicTable struct {
// contains filtered or unexported fields
}
DynamicTable represents a dynamic programming table used in sequence alignment algorithms like Viterbi.
func AllocTable ¶
func AllocTable(numNodes int, seqLen int) *DynamicTable
AllocTable returns a freshly allocated dynamic programming table for use in HMM alogirthms like Viterbi. It is indexed by HMM state, node index and sequence length, in that order. The total size of the table is equal to (#states * (#nodes + 1) * (seqLen + 1)).
The only states used are Match, Deletion and Insertion.
Each value is initialized to a minimum probability.
type EProbs ¶
EProbs represents emission probabilities, as log-odds scores. The representation of EProbs should not be used by clients; it is exported only for convenience with marshalling APIs.
func NewEProbs ¶
NewEProbs creates a new EProbs map from the given alphabet. Keys of the map are residues defined in the alphabet, and values are defaulted to the minimum probability.
type FrequencyProfile ¶
type FrequencyProfile struct { // The columns of a frequency profile. Freqs []map[Residue]int // The alphabet of the profile. The length of the alphabet should be // equal to the number of rows in the frequency profile. // There are no restrictions on the alphabet. (i.e., Gap characters are // allowed but they are not treated specially.) Alphabet Alphabet }
FrequencyProfile represents a sequence profile in terms of raw frequencies. A FrequencyProfile is useful as an intermediate representation. It can be used to incrementally build a Profile.
func NewFrequencyProfile ¶
func NewFrequencyProfile(columns int) *FrequencyProfile
NewFrequencyProfile initializes a frequency profile with a default alphabet that is compatible with this package's BLOSUM62 matrix.
func NewFrequencyProfileAlphabet ¶
func NewFrequencyProfileAlphabet( columns int, alphabet Alphabet, ) *FrequencyProfile
NewFrequencyProfileAlphabet initializes a frequency profile with the given alphabet.
func NewNullProfile ¶
func NewNullProfile() *FrequencyProfile
NewNullProfile initializes a frequency profile that can be used to tabulate a null model. This is equivalent to calling NewFrequencyProfile with the number of columns set to 1.
func (*FrequencyProfile) Add ¶
func (fp *FrequencyProfile) Add(s Sequence)
Add adds the sequence to the given profile. The sequence must have length equivalent to the number of columns in the profile. The sequence must also only contain residues that are in the alphabet for the profile.
As a special case, if the alphabet contains the 'X' residue, then any unrecognized residues in the sequence with respect to the profile's alphabet will be considered as an 'X' residue.
func (*FrequencyProfile) Combine ¶
func (fp1 *FrequencyProfile) Combine(fp2 *FrequencyProfile)
Combine adds the given frequency profile to the current one. Both profiles must have the same number of columns.
func (*FrequencyProfile) Len ¶
func (fp *FrequencyProfile) Len() int
Len returns the number of columns in the frequency profile.
func (*FrequencyProfile) Profile ¶
func (fp *FrequencyProfile) Profile(null *FrequencyProfile) *Profile
Profile converts a raw frequency profile to a profile that uses a log-odds representation. The log-odds scores are computed with the given null model, which is itself just a raw frequency profile with a single column.
func (*FrequencyProfile) String ¶
func (fp *FrequencyProfile) String() string
type HMM ¶
type HMM struct { // An ordered list of HMM nodes. Nodes []HMMNode // The alphabet as defined by an ordering of residues. // Indices in this slice correspond to indices in match/insertion emissions. Alphabet Alphabet // NULL model. (Amino acid background frequencies.) // HMMER hmm files don't have this, but HHsuite hhm files do. // In the case of HHsuite, the NULL model is used for insertion emissions // in every node. Null EProbs }
func HMMCat ¶
HMMCat joins two HMMs together. The HMMs given are not modified. Both HMMs must have the same alphabet. The null emissions for the first HMM are used.
func NewHMM ¶
NewHMM creates a new HMM from a list of nodes, an ordered alphabet and a set of null probabilities (which may be nil).
func (*HMM) Slice ¶
Slice returns a slice of the HMM given. A slice of an HMM returns only the HMM nodes (i.e., columns or match/delete states) in the region specified by the slice. Also, the transition probabilities of the last state are specially set: M->M = 0, M->I = *, M->D = *, I->M = 0, I->I = *, D->M = 0, and D->D = *. No other modifications are made.
func (*HMM) ViterbiScore ¶
ViterbiScore returns the probability of the likeliest path through the HMM for the given sequence.
If you're running Viterbi in a performance critical section, ViterbiScoreMem may be appropriate.
Note that the state path is not computed.
func (*HMM) ViterbiScoreMem ¶
func (hmm *HMM) ViterbiScoreMem(seq Sequence, table *DynamicTable) Prob
ViterbiScoreMem is the same as ViterbiScore, except it does not allocate, which makes it faster in performance critical sections of code. This is done by passing a pre-allocated dynamic programming table created by AllocTable function.
Note that the caller must ensure that only one goroutine is calling ViterbiScoreMem with the same dynamic programming table.
type HMMNode ¶
type HMMNode struct { Residue Residue NodeNum int InsEmit EProbs MatEmit EProbs Transitions TProbs NeffM, NeffI, NeffD Prob }
HMMNode represents a single node in an HMM, including the reference residue, the node index, insertion emissions, match emissions, transition probabilities.
The NeffM, NeffI and NeffD aren't used, but are included since they exist in common HMM file formats.
type MSA ¶
type MSA struct { Entries []Sequence // contains filtered or unexported fields }
func (*MSA) Add ¶
Add adds a new entry to the multiple sequence alignment. Sequences must be in A2M or A3M format. If your sequence is from a FASTA aligned format, use AddFasta.
Empty sequences are ignored.
func (*MSA) AddFasta ¶
AddFasta adds a FASTA formatted sequence to the MSA. If your sequences are in A2M or A3M format, use Add.
Empty sequences are ignored.
func (*MSA) AddFastaSlice ¶
AddFastaSlice calls "AddFasta" for each sequence in the slice.
func (MSA) Get ¶
Get gets a copy of the sequence at the provided row in the MSA. The sequence is in the default format of the MSA representation. Currently, this is A2M. ('-' for deletes, '.' and a-z for inserts, and A-Z for matches.)
func (MSA) GetA2M ¶
GetA2M gets a copy of the sequence at the provided row in the MSA in A2M format.
func (MSA) GetA3M ¶
GetA3M gets a copy of the sequence at the provided row in the MSA in A3M format. This is the same as A2M format, except all '.' (period) are omitted.
func (MSA) GetFasta ¶
GetFasta gets a copy of the sequence at the provided row in the MSA in aligned FASTA format. This is the same as A2M format, except all '.' (period) are changed to '-' residues.
func (MSA) Len ¶
Len returns the length of the alignment. (All entries in an MSA are guaranteed to have the same length.)
func (*MSA) SetLen ¶
SetLen allows you to override the length of the MSA. Only use this if you know what you're doing. (The Add and AddFasta methods will update this for you automatically.)
type Prob ¶
type Prob float64
Prob represents a transition or emission probability.
func NewProb ¶
NewProb creates a new probability value from a string (usually read from an hmm or hhm file). If the string is equivalent to the special value "*", then the probability returned is guaranteed to be minimal. Otherwise, the string is parsed as a float, and an error returned if parsing fails.
func (Prob) MarshalJSON ¶
func (Prob) Ratio ¶
Ratio returns the log probability as a ratio in the range [0, 1]. (This assumes that `p` is a log-odds score.)
func (Prob) String ¶
String returns a string representation of the probability. When `p` is the minimum probability, then "*" is used. Otherwise, the full number is written.
func (*Prob) UnmarshalJSON ¶
type Profile ¶
type Profile struct { // The columns of a profile. Emissions []EProbs // The alphabet of the profile. The length of the alphabet should be // equal to the number of rows in the profile. // There are no restrictions on the alphabet. (i.e., Gap characters are // allowed but they are not treated specially.) Alphabet Alphabet }
Profile represents a sequence profile in terms of log-odds scores.
func NewProfile ¶
NewProfile initializes a profile with a default alphabet that is compatible with this package's BLOSUM62 matrix. Emission probabilities are set to the minimum log-odds probability.
func NewProfileAlphabet ¶
NewProfileAlphabet initializes a profile with the given alphabet. Emission probabilities are set to the minimum log-odds probability.
type Residue ¶
type Residue byte
A Residue corresponds to a single entry in a sequence.
func (Residue) HMMState ¶
HMMState returns the HMMState of a particular residue in a sequence. Residues in [A-Z] are match states. Residues matching '-' are deletion states. Residues equal to '.' or in [a-z] are insertion states.
A residue corresponding to any other value will panic.
The pre-condition here is that 'r' is a residue from a sequence from an A2M format. (N.B. MSA's formed from A3M and FASTA formatted files are repsented as A2M format, so MSA's read from A3M/FASTA files are OK.)
type Sequence ¶
A Sequence corresponds to any kind of biological sequence: DNA, RNA, amino acid, secondary structure, etc.
func NewSequenceString ¶
NewSequenceString is a convenience function for constructing a sequence from a string. It is otherwise appropriate to create new Sequence values directly.
func (Sequence) Slice ¶
Slice returns a slice of the sequence. The name stays the same, and the sequence of residues corresponds to a Go slice of the original. (This does not copy data, so that if the original or sliced sequence is changed, the other one will too. Use Sequence.Copy first if you need copy semantics.)
type SubstMatrix ¶
SubstMatrix corresponds to a substitution matrix and an alphabet that is used in sequence alignment algorithms. The matrix given should be square, with rows and columns equivalent to the length of the alphabet. (This means that if your substitution matrix includes gap penalties, then the alphabet should also have a gap character.)