Documentation ¶
Overview ¶
Package seq defines a *Seq* type and provides basic sequence operations, such as validating DNA/RNA/protein sequences and computing the reverse complement of a sequence.
This package was inspired by [biogo](https://code.google.com/p/biogo/source/browse/#git%2Falphabet).
IUPAC nucleotide code: ACGTURYSWKMBDHVN
http://droog.gs.washington.edu/parc/images/iupac.html
    code    base           complement
    A       A              T
    C       C              G
    G       G              C
    T/U     T              A

    M       A/C            K
    R       A/G            Y
    W       A/T            W
    S       C/G            S
    Y       C/T            R
    K       G/T            M

    V       A/C/G          B
    H       A/C/T          D
    D       A/G/T          H
    B       C/G/T          V

    X/N     A/C/G/T        X
    .       not A/C/G/T
    or -    gap
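The complement column can be turned into a lookup table for reverse-complementing a sequence. The following is a minimal, standalone sketch and not this package's implementation: the names `complement` and `revCom` are hypothetical, only upper-case letters are handled, and mapping 'N' to 'N' is an assumption made for illustration.

```go
package main

import "fmt"

// complement maps IUPAC nucleotide codes to their complements
// (upper case only; 'N' -> 'N' and '-' -> '-' are assumptions of this sketch).
var complement = map[byte]byte{
	'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'U': 'A',
	'M': 'K', 'R': 'Y', 'W': 'W', 'S': 'S', 'Y': 'R', 'K': 'M',
	'V': 'B', 'H': 'D', 'D': 'H', 'B': 'V',
	'N': 'N', '-': '-',
}

// revCom returns the reverse complement of s, or an error if s
// contains a letter that is missing from the lookup table.
func revCom(s []byte) ([]byte, error) {
	out := make([]byte, len(s))
	for i, b := range s {
		c, ok := complement[b]
		if !ok {
			return nil, fmt.Errorf("invalid base %q at position %d", b, i+1)
		}
		// Write complements from the far end to reverse in one pass.
		out[len(s)-1-i] = c
	}
	return out, nil
}

func main() {
	rc, err := revCom([]byte("ACGTRYMK"))
	fmt.Println(string(rc), err)
}
```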
IUPAC amino acid code
    A    Ala    Alanine
    B    Asx    Aspartic acid or Asparagine [2]
    C    Cys    Cysteine
    D    Asp    Aspartic Acid
    E    Glu    Glutamic Acid
    F    Phe    Phenylalanine
    G    Gly    Glycine
    H    His    Histidine
    I    Ile    Isoleucine
    J           Isoleucine or Leucine [4]
    K    Lys    Lysine
    L    Leu    Leucine
    M    Met    Methionine
    N    Asn    Asparagine
    O           pyrrolysine [6]
    P    Pro    Proline
    Q    Gln    Glutamine
    R    Arg    Arginine
    S    Ser    Serine
    T    Thr    Threonine
    U    Sec    selenocysteine [5,6]
    V    Val    Valine
    W    Trp    Tryptophan
    Y    Tyr    Tyrosine
    Z    Glx    Glutamine or Glutamic acid [2]

    X           unknown amino acid
    .           gaps
    *           End
Reference:
- http://www.bioinformatics.org/sms/iupac.html
- http://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html
- http://www.bioinformatics.org/sms2/iupac.html
- http://www.matrixscience.com/blog/non-standard-amino-acid-residues.html
- http://www.sbcs.qmul.ac.uk/iupac/AminoAcid/A2021.html#AA21
- https://en.wikipedia.org/wiki/Amino_acid
https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes
Index ¶
- Constants
- Variables
- func AmbBase2Bases0(b byte) ([]byte, error)
- func Bases2AmbBase(bs []byte) (byte, error)
- func Codes2AmbCode(codes []int) (int, error)
- func Degenerate2Seqs(s []byte) (dseqs [][]byte, err error)
- func Phred2Solexa(q float64) (float64, error)
- func QualityConvert(from, to QualityEncoding, quality []byte, force bool) ([]byte, error)
- func QualityValue(encoding QualityEncoding, quality []byte) ([]int, error)
- func Solexa2Phred(q float64) (float64, error)
- func SubLocation(length, start, end int) (int, int, bool)
- type Alphabet
- func (a *Alphabet) AllLetters() []byte
- func (a *Alphabet) AmbiguousLetters() []byte
- func (a *Alphabet) Clone() *Alphabet
- func (a *Alphabet) Gaps() []byte
- func (a *Alphabet) IsValid(s []byte) error
- func (a *Alphabet) IsValidLetter(b byte) bool
- func (a *Alphabet) Letters() []byte
- func (a *Alphabet) PairLetter(b byte) (byte, error)
- func (a *Alphabet) String() string
- func (a *Alphabet) Type() string
- type CodonTable
- func (t *CodonTable) Clone() CodonTable
- func (t *CodonTable) Get(codon []byte, allowUnknownCodon bool) (byte, error)
- func (t *CodonTable) Get2(codon string, allowUnknownCodon bool) (byte, error)
- func (t *CodonTable) Set(codon []byte, aminoAcid byte) error
- func (t *CodonTable) Set2(codon string, aminoAcid byte) error
- func (t CodonTable) String() string
- func (t CodonTable) StringWithAmbiguousCodons() string
- func (t *CodonTable) Translate(sequence []byte, frame int, trim bool, clean bool, allowUnknownCodon bool, ...) ([]byte, error)
- type QualityEncoding
- type Seq
- func (seq *Seq) AvgQual(asciiBase int) float64
- func (seq *Seq) BaseContent(list string) float64
- func (seq *Seq) BaseContentCaseSensitive(list string) float64
- func (seq *Seq) BaseCount(list string) int
- func (seq *Seq) BaseCountCaseSensitive(list string) int
- func (seq *Seq) Bases(gapLetters string) int
- func (seq *Seq) Clone() *Seq
- func (seq *Seq) Clone2() *Seq
- func (seq *Seq) Complement() *Seq
- func (seq *Seq) ComplementInplace() *Seq
- func (seq *Seq) Degenerate2Regexp() string
- func (seq *Seq) FormatSeq(width int) []byte
- func (seq *Seq) GC() float64
- func (seq *Seq) Length() int
- func (seq *Seq) ParseQual(asciiBase int)
- func (seq *Seq) RemoveGaps(letters string) *Seq
- func (seq *Seq) RemoveGapsInplace(letters string) *Seq
- func (seq *Seq) RevCom() *Seq
- func (seq *Seq) RevComInplace() *Seq
- func (seq *Seq) Reverse() *Seq
- func (seq *Seq) ReverseInplace() *Seq
- func (seq *Seq) Slider(window int, step int, circular bool, greedy bool) func() (*Seq, bool)
- func (seq *Seq) String() string
- func (seq *Seq) SubSeq(start int, end int) *Seq
- func (seq *Seq) SubSeqInplace(start int, end int) *Seq
- func (seq *Seq) Translate(transl_table int, frame int, trim bool, clean bool, allowUnknownCodon bool, ...) (*Seq, error)
Constants ¶
const NQualityEncoding int = 6
NQualityEncoding is the number of QualityEncoding + 1: 5 + 1 = 6
Variables ¶
var AlphabetGuessSeqLengthThreshold = 10000
AlphabetGuessSeqLengthThreshold is the length of the sequence prefix of the first FASTA record from which FastaRecord guesses the sequence type. A value of 0 means the whole sequence is used.
var AmbBase2Bases = map[byte][]byte{
'A': {'A'},
'a': {'A'},
'C': {'C'},
'c': {'C'},
'G': {'G'},
'g': {'G'},
'T': {'T'},
't': {'T'},
'U': {'T'},
'u': {'T'},
'M': {'A', 'C', 'M'},
'm': {'A', 'C', 'M'},
'R': {'A', 'G', 'R'},
'r': {'A', 'G', 'R'},
'W': {'A', 'T', 'W'},
'w': {'A', 'T', 'W'},
'S': {'C', 'G', 'S'},
's': {'C', 'G', 'S'},
'Y': {'C', 'T', 'Y'},
'y': {'C', 'T', 'Y'},
'K': {'G', 'T', 'K'},
'k': {'G', 'T', 'K'},
'V': {'A', 'C', 'G', 'M', 'R', 'S', 'V'},
'v': {'A', 'C', 'G', 'M', 'R', 'S', 'V'},
'H': {'A', 'C', 'T', 'M', 'W', 'Y', 'H'},
'h': {'A', 'C', 'T', 'M', 'W', 'Y', 'H'},
'D': {'A', 'G', 'T', 'R', 'W', 'K', 'D'},
'd': {'A', 'G', 'T', 'R', 'W', 'K', 'D'},
'B': {'C', 'G', 'T', 'S', 'Y', 'K', 'B'},
'b': {'C', 'G', 'T', 'S', 'Y', 'K', 'B'},
'N': {'A', 'C', 'M', 'G', 'R', 'S', 'V', 'T', 'W', 'Y', 'H', 'K', 'D', 'B', 'N'},
'n': {'A', 'C', 'M', 'G', 'R', 'S', 'V', 'T', 'W', 'Y', 'H', 'K', 'D', 'B', 'N'},
}
AmbBase2Bases maps each ambiguous base to the bases it represents. Looking a base up here is faster than calling AmbBase2Bases0.
var AmbCodes2Codes = map[int][]int{
1: {1},
2: {2},
4: {4},
8: {8},
3: {1, 2, 3},
5: {1, 4, 5},
9: {1, 8, 9},
6: {2, 4, 6},
10: {2, 8, 10},
12: {4, 8, 12},
7: {1, 2, 4, 3, 5, 6, 7},
11: {1, 2, 8, 3, 9, 10, 11},
13: {1, 4, 8, 5, 9, 12, 13},
14: {2, 4, 8, 6, 10, 12, 14},
15: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
}
AmbCodes2Codes is the integer-code version of AmbBase2Bases.
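Comparing the two maps suggests that each plain base is a single bit (A=1, C=2, G=4, T=8) and that an ambiguous code is the bitwise OR of the bases it covers. That bit assignment is inferred from the map entries, not stated by the package. Under that assumption, a conversion in the spirit of Bases2AmbBase could be sketched as follows (the names `baseCode`, `codeBase` and `basesToAmbBase` are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

// baseCode assigns one bit per plain base (assumed: A=1, C=2, G=4, T=8).
var baseCode = map[byte]int{'A': 1, 'C': 2, 'G': 4, 'T': 8}

// codeBase is the inverse lookup for all 15 non-zero bit combinations.
var codeBase = map[int]byte{
	1: 'A', 2: 'C', 4: 'G', 8: 'T',
	3: 'M', 5: 'R', 9: 'W', 6: 'S', 10: 'Y', 12: 'K',
	7: 'V', 11: 'H', 13: 'D', 14: 'B', 15: 'N',
}

// basesToAmbBase ORs the bit codes of the given bases together and
// returns the letter that stands for the combined set.
func basesToAmbBase(bases []byte) (byte, error) {
	code := 0
	for _, b := range bases {
		c, ok := baseCode[b]
		if !ok {
			return 0, fmt.Errorf("invalid DNA base %q", b)
		}
		code |= c
	}
	amb, ok := codeBase[code]
	if !ok {
		return 0, errors.New("empty base list")
	}
	return amb, nil
}

func main() {
	for _, bases := range []string{"AG", "CGT", "ACGT"} {
		amb, err := basesToAmbBase([]byte(bases))
		fmt.Printf("%s -> %c (err: %v)\n", bases, amb, err)
	}
}
```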
var CodonTables map[int]*CodonTable
CodonTables contains all the codon tables from NCBI:
    1: The Standard Code
    2: The Vertebrate Mitochondrial Code
    3: The Yeast Mitochondrial Code
    4: The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
    5: The Invertebrate Mitochondrial Code
    6: The Ciliate, Dasycladacean and Hexamita Nuclear Code
    9: The Echinoderm and Flatworm Mitochondrial Code
    10: The Euplotid Nuclear Code
    11: The Bacterial, Archaeal and Plant Plastid Code
    12: The Alternative Yeast Nuclear Code
    13: The Ascidian Mitochondrial Code
    14: The Alternative Flatworm Mitochondrial Code
    16: Chlorophycean Mitochondrial Code
    21: Trematode Mitochondrial Code
    22: Scenedesmus obliquus Mitochondrial Code
    23: Thraustochytrium Mitochondrial Code
    24: Pterobranchia Mitochondrial Code
    25: Candidate Division SR1 and Gracilibacteria Code
    26: Pachysolen tannophilus Nuclear Code
    27: Karyorelict Nuclear
    28: Condylostoma Nuclear
    29: Mesodinium Nuclear
    30: Peritrich Nuclear
    31: Blastocrithidia Nuclear
var ComplementSeqLenThreshold = 1000
ComplementSeqLenThreshold is the minimum sequence length at which a sequence is complemented in parallel.
var ComplementThreads = runtime.NumCPU()
ComplementThreads is the number of threads used to complement a sequence in parallel.
var DegenerateBaseMapNucl = map[byte]string{
'A': "A",
'T': "[TU]",
'U': "[TU]",
'C': "C",
'G': "G",
'R': "[AG]",
'Y': "[CTU]",
'M': "[AC]",
'K': "[GTU]",
'S': "[CG]",
'W': "[ATU]",
'H': "[ACTU]",
'B': "[CGTU]",
'V': "[ACG]",
'D': "[AGTU]",
'N': "[ACGTU]",
'a': "a",
't': "[tu]",
'u': "[tu]",
'c': "c",
'g': "g",
'r': "[ag]",
'y': "[ctu]",
'm': "[ac]",
'k': "[gtu]",
's': "[cg]",
'w': "[atu]",
'h': "[actu]",
'b': "[cgtu]",
'v': "[acg]",
'd': "[agtu]",
'n': "[acgtu]",
}
DegenerateBaseMapNucl maps nucleic acid degenerate bases to regular expressions.
var DegenerateBaseMapNucl2 = map[byte]string{
'A': "A",
'T': "TU",
'U': "TU",
'C': "C",
'G': "G",
'R': "AG",
'Y': "CTU",
'M': "AC",
'K': "GTU",
'S': "CG",
'W': "ATU",
'H': "ACTU",
'B': "CGTU",
'V': "ACG",
'D': "AGTU",
'N': "ACGTU",
'a': "a",
't': "tu",
'u': "tu",
'c': "c",
'g': "g",
'r': "ag",
'y': "ctu",
'm': "ac",
'k': "gtu",
's': "cg",
'w': "atu",
'h': "actu",
'b': "cgtu",
'v': "acg",
'd': "agtu",
'n': "acgtu",
}
DegenerateBaseMapNucl2 maps nucleic acid degenerate bases to all the bases they represent.
var DegenerateBaseMapProt = map[byte]string{
'A': "A",
'B': "[DN]",
'C': "C",
'D': "D",
'E': "E",
'F': "F",
'G': "G",
'H': "H",
'I': "I",
'J': "[IL]",
'K': "K",
'L': "L",
'M': "M",
'N': "N",
'P': "P",
'Q': "Q",
'R': "R",
'S': "S",
'T': "T",
'V': "V",
'W': "W",
'X': "[A-Z]",
'Y': "Y",
'Z': "[QE]",
'a': "a",
'b': "[dn]",
'c': "c",
'd': "d",
'e': "e",
'f': "f",
'g': "g",
'h': "h",
'i': "i",
'j': "[il]",
'k': "k",
'l': "l",
'm': "m",
'n': "n",
'p': "p",
'q': "q",
'r': "r",
's': "s",
't': "t",
'v': "v",
'w': "w",
'x': "[a-z]",
'y': "y",
'z': "[qe]",
}
DegenerateBaseMapProt maps protein degenerate codes to regular expressions.
var ErrInvalidCodon = errors.New("seq: invalid codon")
ErrInvalidCodon means the length of codon is not 3.
var ErrInvalidDNABase = errors.New("seq: invalid DNA base")
ErrInvalidDNABase means invalid DNA base
var ErrInvalidPhredQuality = errors.New("seq: invalid Phred quality")
ErrInvalidPhredQuality occurs for phred quality less than 0.
var ErrInvalidSolexaQuality = errors.New("seq: invalid Solexa quality")
ErrInvalidSolexaQuality occurs for solexa quality less than -5.
var ErrUnknownCodon = errors.New("seq: unknown codon")
ErrUnknownCodon means the codon is not in the codon table, or the codon contains bases other than A, C, T, G and U.
var ErrUnknownQualityEncoding = errors.New("unknown quality encoding")
ErrUnknownQualityEncoding is the error returned for an unknown quality encoding type.
var NMostCommonThreshold = 2
NMostCommonThreshold is the threshold of 'B' in top N most common quality for guessing Illumina 1.5.
var QUAL_MAP [256]float64
var ValidSeqLengthThreshold = 10000
ValidSeqLengthThreshold is the minimum sequence length at which a sequence is validated in parallel.
var ValidSeqThreads = runtime.NumCPU()
ValidSeqThreads is the number of threads used to validate a sequence in parallel.
var ValidateSeq = true
ValidateSeq decides whether sequences are validated.
var ValidateWholeSeq = true
ValidateWholeSeq determines whether all bases of a sequence are validated.
Functions ¶
func AmbBase2Bases0 ¶
AmbBase2Bases0 converts an ambiguous base to the bases it represents. It is slower than a lookup in AmbBase2Bases.
func Bases2AmbBase ¶
Bases2AmbBase converts a list of bases to the ambiguous base that represents them.
func Codes2AmbCode ¶
Codes2AmbCode converts a list of base codes to the code of the ambiguous base that represents them.
func Degenerate2Seqs ¶
Degenerate2Seqs expands a sequence containing degenerate bases into all possible sequences.
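The expansion is a Cartesian product over the bases each position can stand for. A standalone sketch of the idea, assuming upper-case DNA only and using a hypothetical `expansions` table and `expand` function (not the package's own code):

```go
package main

import "fmt"

// expansions lists the plain bases each (upper-case, DNA-only) code covers.
var expansions = map[byte]string{
	'A': "A", 'C': "C", 'G': "G", 'T': "T",
	'R': "AG", 'Y': "CT", 'M': "AC", 'K': "GT", 'S': "CG", 'W': "AT",
	'H': "ACT", 'B': "CGT", 'V': "ACG", 'D': "AGT", 'N': "ACGT",
}

// expand returns every plain sequence that the degenerate sequence s can
// represent, extending all partial sequences one position at a time.
func expand(s string) ([]string, error) {
	seqs := []string{""}
	for i := 0; i < len(s); i++ {
		bases, ok := expansions[s[i]]
		if !ok {
			return nil, fmt.Errorf("invalid DNA base %q at position %d", s[i], i+1)
		}
		next := make([]string, 0, len(seqs)*len(bases))
		for _, prefix := range seqs {
			for j := 0; j < len(bases); j++ {
				next = append(next, prefix+string(bases[j]))
			}
		}
		seqs = next
	}
	return seqs, nil
}

func main() {
	seqs, err := expand("ARY")
	fmt.Println(seqs, err)
}
```

The number of results grows as the product of the per-position choices, so long sequences with many degenerate bases can produce very large outputs.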
func Phred2Solexa ¶
Phred2Solexa converts Phred quality to Solexa quality.
func QualityConvert ¶
func QualityConvert(from, to QualityEncoding, quality []byte, force bool) ([]byte, error)
QualityConvert converts quality from one encoding to another. If force is true, scores greater than 40 are truncated to 40 when converting Illumina-1.8+ to Sanger.
func QualityValue ¶
func QualityValue(encoding QualityEncoding, quality []byte) ([]int, error)
QualityValue returns the quality values for the given encoding and quality string.
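Decoding a quality string amounts to subtracting the encoding's ASCII offset from each character. The sketch below is standalone and hypothetical (`qualityValues` is not a package function). It takes the offset and minimum score as plain integers: offset 33 with minimum 0 for Sanger-style encodings, offset 64 with minimum -5 for Solexa, following the ASCII ranges given in the QualityEncoding documentation.

```go
package main

import "fmt"

// qualityValues converts each quality character to an integer score by
// subtracting the ASCII offset, rejecting characters outside the range
// implied by minScore and the upper ASCII bound 126.
func qualityValues(quality []byte, offset, minScore int) ([]int, error) {
	values := make([]int, len(quality))
	for i, c := range quality {
		v := int(c) - offset
		if v < minScore || c > 126 {
			return nil, fmt.Errorf("quality character %q at position %d is out of range", c, i+1)
		}
		values[i] = v
	}
	return values, nil
}

func main() {
	sanger, err := qualityValues([]byte("I5!"), 33, 0)
	fmt.Println(sanger, err)
	solexa, err := qualityValues([]byte("h;"), 64, -5)
	fmt.Println(solexa, err)
}
```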
func Solexa2Phred ¶
Solexa2Phred converts Solexa quality to Phred quality.
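The two scales are related by the standard formulas Q_solexa = 10·log10(10^(Q_phred/10) − 1) and Q_phred = 10·log10(10^(Q_solexa/10) + 1). The following standalone sketch applies them with the range checks described for ErrInvalidPhredQuality and ErrInvalidSolexaQuality. The function names are hypothetical, and clamping the Solexa result at -5 is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// phredToSolexa converts a Phred score to the Solexa scale.
// Results below -5 (including -Inf for q == 0) are clamped to -5.
func phredToSolexa(q float64) (float64, error) {
	if q < 0 {
		return 0, errors.New("invalid Phred quality")
	}
	return math.Max(-5, 10*math.Log10(math.Pow(10, q/10)-1)), nil
}

// solexaToPhred converts a Solexa score to the Phred scale.
func solexaToPhred(q float64) (float64, error) {
	if q < -5 {
		return 0, errors.New("invalid Solexa quality")
	}
	return 10 * math.Log10(math.Pow(10, q/10)+1), nil
}

func main() {
	s, _ := phredToSolexa(30)
	p, _ := solexaToPhred(s)
	fmt.Printf("Phred 30 -> Solexa %.4f -> Phred %.4f\n", s, p)
}
```

The two scales agree closely for high scores and diverge only at the low end, where Solexa scores can be negative.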
func SubLocation ¶
SubLocation implements the package's sub-location strategy. The start and end arguments, and the returned start and end, are all 1-based. Negative positions count from the end of the sequence.
    1-based index    1 2 3 4 5 6 7 8 9 10
    negative index   0-9-8-7-6-5-4-3-2-1
    seq              A C G T N a c g t n
    1:1              A
    2:4                C G T
    -4:-2                        c g t
    -4:-1                        c g t n
    -1:-1                              n
    2:-2               C G T N a c g t
    1:-1             A C G T N a c g t n
    1:12             A C G T N a c g t n
    -12:-1           A C G T N a c g t n
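One way to reproduce the rows of this table is to translate negative positions into 1-based ones and then clamp to the sequence bounds. The sketch below is written only to match the examples shown and may differ from the package's SubLocation in edge cases. In particular, treating a start of 0 as 1 and rejecting an end of 0 are assumptions, and `subLocation` is a hypothetical name.

```go
package main

import "fmt"

// subLocation maps a 1-based (possibly negative) start/end pair onto a
// sequence of the given length. It returns the resolved 1-based inclusive
// range and whether that range is non-empty.
func subLocation(length, start, end int) (int, int, bool) {
	if length <= 0 {
		return 0, 0, false
	}
	// Negative positions count from the end: -1 is the last letter.
	if start < 0 {
		start = length + start + 1
	}
	if end < 0 {
		end = length + end + 1
	}
	// Clamp positions that fall outside the sequence.
	if start < 1 {
		start = 1
	}
	if end > length {
		end = length
	}
	if end < 1 || start > end {
		return 0, 0, false
	}
	return start, end, true
}

// sub applies subLocation to a string and returns the selected letters.
func sub(s string, start, end int) string {
	b, e, ok := subLocation(len(s), start, end)
	if !ok {
		return ""
	}
	return s[b-1 : e]
}

func main() {
	s := "ACGTNacgtn"
	fmt.Println(sub(s, 2, 4), sub(s, -4, -2), sub(s, 2, -2), sub(s, -12, -1))
}
```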
Types ¶
type Alphabet ¶
type Alphabet struct {
// contains filtered or unexported fields
}
A custom Alphabet can be defined. Note that **the letters are case sensitive**.
For example, DNA:
    DNA, _ = NewAlphabet(
        "DNA",
        false,
        []byte("acgtACGT"),
        []byte("tgcaTGCA"),
        []byte(" -"),
        []byte("nN"))
var (
	DNA          *Alphabet
	DNAredundant *Alphabet
	RNA          *Alphabet
	RNAredundant *Alphabet
	Protein      *Alphabet
	Unlimit      *Alphabet
)
The following alphabets are pre-defined:
    DNA           Deoxyribonucleotide code
    DNAredundant  DNA + ambiguity codes
    RNA           Ribonucleotide code
    RNAredundant  RNA + ambiguity codes
    Protein       Amino acid single-letter code
    Unlimit       Self-defined, including all 26 English letters
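Validation against an alphabet can be done with a 256-entry lookup table indexed by byte value, which keeps the per-letter check to a single array access. This is a standalone sketch under that assumption, with a hypothetical `alphabet` type. It is not how the package's Alphabet is necessarily implemented, and it omits pair letters.

```go
package main

import "fmt"

// alphabet records which byte values are acceptable letters.
type alphabet struct {
	name  string
	valid [256]bool
}

// newAlphabet marks regular letters, gap letters and ambiguous letters
// as valid. Letters are case sensitive, so both cases must be listed.
func newAlphabet(name string, letters, gaps, ambiguous []byte) *alphabet {
	a := &alphabet{name: name}
	for _, group := range [][]byte{letters, gaps, ambiguous} {
		for _, b := range group {
			a.valid[b] = true
		}
	}
	return a
}

// isValid returns an error describing the first invalid letter, if any.
func (a *alphabet) isValid(s []byte) error {
	for i, b := range s {
		if !a.valid[b] {
			return fmt.Errorf("invalid %s letter %q at position %d", a.name, b, i+1)
		}
	}
	return nil
}

func main() {
	dna := newAlphabet("DNA", []byte("acgtACGT"), []byte(" -"), []byte("nN"))
	fmt.Println(dna.isValid([]byte("ACGT-n")))
	fmt.Println(dna.isValid([]byte("ACGU")))
}
```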
func GuessAlphabet ¶
GuessAlphabet guesses the alphabet of the given sequence.
func GuessAlphabetLessConservatively ¶
GuessAlphabetLessConservatively changes DNA to DNAredundant and RNA to RNAredundant.
func NewAlphabet ¶
func NewAlphabet(t string, isUnlimit bool, letters []byte, pairs []byte, gap []byte, ambiguous []byte) (*Alphabet, error)
NewAlphabet is the constructor for type *Alphabet*.
func (*Alphabet) AmbiguousLetters ¶
AmbiguousLetters returns the ambiguous letters of the alphabet.
func (*Alphabet) IsValidLetter ¶
IsValidLetter reports whether a letter is valid in the alphabet.
func (*Alphabet) PairLetter ¶
PairLetter returns the pair letter of the given letter.
type CodonTable ¶
type CodonTable struct {
	ID         int
	Name       string
	InitCodons map[string]struct{} // upper-case of codon as string, map for fast querying
	StopCodons map[string]struct{} // upper-case of codon as string, map for fast querying
	// contains filtered or unexported fields
}
CodonTable represents a codon table
func NewCodonTable ¶
func NewCodonTable(id int, name string) *CodonTable
NewCodonTable constructs a CodonTable with the given ID and Name. The codon assignments themselves must then be set by calling Set or Set2.
func (*CodonTable) Clone ¶
func (t *CodonTable) Clone() CodonTable
Clone returns a deep copy of the CodonTable.
func (*CodonTable) Get ¶
func (t *CodonTable) Get(codon []byte, allowUnknownCodon bool) (byte, error)
Get returns the amino acid of the codon ([]byte). The codon can be DNA or RNA. When allowUnknownCodon is true, codons that are not in the codon table are translated to 'X', and "---" is translated to "-".
func (*CodonTable) Get2 ¶
func (t *CodonTable) Get2(codon string, allowUnknownCodon bool) (byte, error)
Get2 returns the amino acid of the codon (string), codon can be DNA or RNA.
func (*CodonTable) Set ¶
func (t *CodonTable) Set(codon []byte, aminoAcid byte) error
Set sets a codon of byte slice.
func (*CodonTable) Set2 ¶
func (t *CodonTable) Set2(codon string, aminoAcid byte) error
Set2 sets a codon of string.
func (CodonTable) String ¶
func (t CodonTable) String() string
String returns details of the CodonTable.
func (CodonTable) StringWithAmbiguousCodons ¶
func (t CodonTable) StringWithAmbiguousCodons() string
StringWithAmbiguousCodons returns details of the CodonTable, including ambiguous codons.
func (*CodonTable) Translate ¶
func (t *CodonTable) Translate(sequence []byte, frame int, trim bool, clean bool, allowUnknownCodon bool, markInitCodonAsM bool) ([]byte, error)
Translate translates a DNA/RNA sequence to an amino acid sequence. Available frames: 1, 2, 3, -1, -2, -3. If option trim is true, it removes all 'X' and '*' characters from the right end of the translation. If option clean is true, it changes all STOP codon positions from the '*' character to 'X' (an unknown residue). If option allowUnknownCodon is true, codons not in the codon table will be translated to 'X'. If option markInitCodonAsM is true, the initial codon at the beginning will be represented as 'M'.
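The interaction of the frame, trim, clean and allowUnknownCodon options can be illustrated with a deliberately tiny, standalone sketch. The codon table below contains only a handful of codons from the standard code, negative frames and markInitCodonAsM are left out, and the names `codonTable` and `translate` are hypothetical. It is not the package's implementation.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// codonTable is a tiny excerpt of the standard genetic code.
var codonTable = map[string]byte{
	"ATG": 'M', "GCC": 'A', "AAA": 'K', "TGG": 'W',
	"TAA": '*', "TAG": '*', "TGA": '*',
}

// translate reads codons starting at the given forward frame (1, 2 or 3).
// RNA input is accepted by mapping U to T first.
func translate(seq string, frame int, trim, clean, allowUnknownCodon bool) (string, error) {
	if frame < 1 || frame > 3 {
		return "", errors.New("this sketch only supports frames 1, 2 and 3")
	}
	s := strings.ReplaceAll(strings.ToUpper(seq), "U", "T")
	var sb strings.Builder
	for i := frame - 1; i+3 <= len(s); i += 3 {
		codon := s[i : i+3]
		aa, ok := codonTable[codon]
		if !ok {
			if !allowUnknownCodon {
				return "", fmt.Errorf("unknown codon %q", codon)
			}
			aa = 'X'
		}
		sb.WriteByte(aa)
	}
	protein := sb.String()
	if trim {
		// Drop trailing unknown residues and stop characters.
		protein = strings.TrimRight(protein, "X*")
	}
	if clean {
		// Replace remaining stop characters with the unknown residue.
		protein = strings.ReplaceAll(protein, "*", "X")
	}
	return protein, nil
}

func main() {
	p, err := translate("ATGGCCAAATGGTAA", 1, false, false, false)
	fmt.Println(p, err)
}
```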
type QualityEncoding ¶
type QualityEncoding int
QualityEncoding is the type of quality encoding
const (
	// Unknown quality encoding
	Unknown QualityEncoding = iota

	// Sanger format can encode a Phred quality score from 0 to 93 using
	// ASCII 33 to 126 (although in raw read data the Phred quality score
	// rarely exceeds 60, higher scores are possible in assemblies or read maps).
	Sanger

	// Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score
	// from -5 to 62 using ASCII 59 to 126 (although in raw read data Solexa
	// scores from -5 to 40 only are expected).
	Solexa

	// Illumina1p3 means Illumina 1.3+.
	// Starting with Illumina 1.3 and before Illumina 1.8, the format
	// encoded a Phred quality score from 0 to 62 using ASCII 64 to 126
	// (although in raw read data Phred scores from 0 to 40 only are expected).
	Illumina1p3

	// Illumina1p5 means Illumina 1.5+.
	// Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores
	// 0 to 2 have a slightly different meaning. The values 0 and 1 are
	// no longer used and the value 2, encoded by ASCII 66 "B", is used
	// also at the end of reads as a Read Segment Quality Control Indicator.
	Illumina1p5

	// Illumina1p8 means Illumina 1.8+.
	// Starting in Illumina 1.8, the quality scores have basically
	// returned to the use of the Sanger format (Phred+33)
	Illumina1p8
)
func GuessQualityEncoding ¶
func GuessQualityEncoding(quality []byte) []QualityEncoding
GuessQualityEncoding returns potential quality encodings.
func (QualityEncoding) IsSolexa ¶
func (qe QualityEncoding) IsSolexa() bool
IsSolexa tells whether the encoding is Solexa
func (QualityEncoding) QualityRange ¶
func (qe QualityEncoding) QualityRange() []int
QualityRange is the typical quality range
func (QualityEncoding) String ¶
func (qe QualityEncoding) String() string
type Seq ¶
Seq represents a FASTA/Q record
func NewSeqWithQual ¶
NewSeqWithQual creates a Seq with quality; it is used to store a FASTQ sequence.
func NewSeqWithQualWithoutValidation ¶
NewSeqWithQualWithoutValidation creates a Seq with quality without validating the sequence.
func NewSeqWithoutValidation ¶
NewSeqWithoutValidation creates a Seq without validating the sequence.
func (*Seq) BaseContent ¶
BaseContent returns base content for given bases. For example:
seq.BaseContent("gc")
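A case-insensitive base content calculation can be sketched as below. This standalone example assumes the result is a fraction between 0 and 1 of all letters in the sequence, which the documentation above does not spell out, and `baseContent` is a hypothetical free function and not the Seq method.

```go
package main

import (
	"fmt"
	"strings"
)

// baseContent returns the fraction of letters in seq that appear in list,
// ignoring case. An empty sequence yields 0.
func baseContent(seq, list string) float64 {
	if len(seq) == 0 {
		return 0
	}
	// Mark both the lower- and upper-case form of every listed base.
	var wanted [256]bool
	for _, b := range []byte(strings.ToLower(list) + strings.ToUpper(list)) {
		wanted[b] = true
	}
	n := 0
	for i := 0; i < len(seq); i++ {
		if wanted[seq[i]] {
			n++
		}
	}
	return float64(n) / float64(len(seq))
}

func main() {
	fmt.Println(baseContent("ACGTacgtNN", "gc"))
}
```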
func (*Seq) BaseContentCaseSensitive ¶
BaseContentCaseSensitive returns base content for given case sensitive bases.
func (*Seq) BaseCountCaseSensitive ¶
BaseCountCaseSensitive counts bases, case is not ignored.
func (*Seq) ComplementInplace ¶
ComplementInplace complements the sequence in place and returns it.
func (*Seq) Degenerate2Regexp ¶
Degenerate2Regexp transforms a sequence containing degenerate bases into a regular expression.
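The transformation is a per-letter substitution using a table like DegenerateBaseMapNucl. The standalone sketch below copies a few upper-case entries from that map. The names `degenerate`, `degenerateToRegexp` and `matches` are hypothetical, and quoting letters that are missing from the table is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// degenerate holds a subset of the nucleotide entries shown in
// DegenerateBaseMapNucl (upper case only).
var degenerate = map[byte]string{
	'A': "A", 'C': "C", 'G': "G", 'T': "[TU]", 'U': "[TU]",
	'R': "[AG]", 'Y': "[CTU]", 'N': "[ACGTU]",
}

// degenerateToRegexp replaces each known letter with its character class
// and quotes any other letter so it is matched literally.
func degenerateToRegexp(s string) string {
	var sb strings.Builder
	for i := 0; i < len(s); i++ {
		if re, ok := degenerate[s[i]]; ok {
			sb.WriteString(re)
		} else {
			sb.WriteString(regexp.QuoteMeta(string(s[i])))
		}
	}
	return sb.String()
}

// matches reports whether target contains a match for the pattern.
func matches(pattern, target string) bool {
	return regexp.MustCompile(pattern).MatchString(target)
}

func main() {
	pattern := degenerateToRegexp("ACRYN")
	fmt.Println(pattern, matches(pattern, "ACGTA"), matches(pattern, "ACCTA"))
}
```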
func (*Seq) RemoveGaps ¶
RemoveGaps returns a new sequence without gaps.
func (*Seq) RemoveGapsInplace ¶
RemoveGapsInplace removes gaps in place.
func (*Seq) RevComInplace ¶
RevComInplace reverse-complements the sequence in place.
func (*Seq) ReverseInplace ¶
ReverseInplace reverses the sequence content in place.
func (*Seq) Slider ¶
Slider returns a function for sliding along the sequence. The circular option is for circular genomes, and it overrides greedy. If circular is false and greedy is true, the last fragment is returned even when it is shorter than window.
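Returning a closure lets the caller pull one window at a time without materialising all fragments. The standalone sketch below covers only the non-circular case and works on plain strings. The `slider` function is hypothetical and the exact boundary behaviour of the package's Slider may differ.

```go
package main

import "fmt"

// slider returns an iterator function over windows of seq. Each call
// yields the next fragment and true, or "" and false when exhausted.
// With greedy set, one final fragment shorter than window is returned.
func slider(seq string, window, step int, greedy bool) func() (string, bool) {
	start := 0
	done := window <= 0 || step <= 0
	return func() (string, bool) {
		if done || start >= len(seq) {
			return "", false
		}
		end := start + window
		if end > len(seq) {
			if !greedy {
				done = true
				return "", false
			}
			// Emit a single short tail fragment, then stop.
			end = len(seq)
			done = true
		}
		frag := seq[start:end]
		start += step
		return frag, true
	}
}

func main() {
	next := slider("ACGTNacgtn", 4, 3, true)
	for frag, ok := next(); ok; frag, ok = next() {
		fmt.Println(frag)
	}
}
```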
func (*Seq) SubSeq ¶
SubSeq returns a subsequence. Both start and end are 1-based.
Examples:
    1-based index    1 2 3 4 5 6 7 8 9 10
    negative index   0-9-8-7-6-5-4-3-2-1
    seq              A C G T N a c g t n
    1:1              A
    2:4                C G T
    -4:-2                        c g t
    -4:-1                        c g t n
    -1:-1                              n
    2:-2               C G T N a c g t
    1:-1             A C G T N a c g t n
    1:12             A C G T N a c g t n
    -12:-1           A C G T N a c g t n
func (*Seq) SubSeqInplace ¶
SubSeqInplace returns a subsequence, modifying the sequence in place.
func (*Seq) Translate ¶
func (seq *Seq) Translate(transl_table int, frame int, trim bool, clean bool, allowUnknownCodon bool, markInitCodonAsM bool) (*Seq, error)
Translate translates the RNA/DNA sequence to an amino acid sequence. Available frames: 1, 2, 3, -1, -2, -3. If option trim is true, it removes all 'X' and '*' characters from the right end of the translation. If option clean is true, it changes all STOP codon positions from the '*' character to 'X' (an unknown residue). If option allowUnknownCodon is true, codons not in the codon table will be translated to 'X'. If option markInitCodonAsM is true, the initial codon at the beginning will be represented as 'M'.