Documentation ¶
Overview ¶
Organisation: move everything to do with the seq structure to the start. Then think about moving all the seqgrp code to its own file. A bigger change I should try: at the moment, we allocate every sequence individually. I could instead allocate one big block and set up each sequence as a slice pointing into it. Going further, I could use golang.org/x/exp/mmap and set up the slices so they point into the mapped file.
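The single-allocation idea could look something like the sketch below. It assumes each sequence is simply a []byte. The helper name packSeqs is hypothetical and not part of this package.

```go
package main

import "fmt"

// packSeqs copies every input sequence into one shared backing array and
// returns slices that point into it. Hypothetical helper for illustration.
func packSeqs(in [][]byte) [][]byte {
	total := 0
	for _, s := range in {
		total += len(s)
	}
	lump := make([]byte, total) // one allocation for all residues
	out := make([][]byte, len(in))
	off := 0
	for i, s := range in {
		n := copy(lump[off:], s)
		// The full slice expression caps the capacity, so an append on
		// one sequence cannot overwrite its neighbour in the lump.
		out[i] = lump[off : off+n : off+n]
		off += n
	}
	return out
}

func main() {
	seqs := packSeqs([][]byte{[]byte("ACGT"), []byte("AC-T")})
	for _, s := range seqs {
		fmt.Println(string(s))
	}
}
```

The trade-off is that no individual sequence can be freed while any other slice into the block is still alive.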
Index ¶
- Constants
- func EntropyFromArray(gapsAreChar bool, matrix [][]float32, entropy []float32, logbase int, ...)
- func ReadSeqs(fp io.Reader, seqgrp *SeqGrp, s_opts *Options) (n_dup int, err error)
- func WriteToF(outseq_fname string, seq_set []seq, s_opts *Options) (err error)
- type Options
- type SeqGrp
- func (seqgrp *SeqGrp) Compat(refseq []byte, gapsAreChar bool) []float32
- func (seqgrp *SeqGrp) Entropy(gapsAreChar bool, entropy []float32)
- func (seqgrp *SeqGrp) FindNdx(s string) int
- func (seqgrp *SeqGrp) GapFrac() []float32
- func (seqgrp *SeqGrp) GetCounts() *matrix.FMatrix2d
- func (seqgrp *SeqGrp) GetLen() int
- func (seqgrp *SeqGrp) GetLogBase(gapsAreChar bool) (nSym int)
- func (seqgrp *SeqGrp) GetMap(c byte) uint8
- func (seqgrp *SeqGrp) GetMapping(c uint8) uint8
- func (seqgrp *SeqGrp) GetNSeq() int
- func (seqgrp *SeqGrp) GetNSym() int
- func (seqgrp *SeqGrp) GetRevmap() []uint8
- func (seqgrp *SeqGrp) GetSeqSlc() []seq
- func (seqgrp *SeqGrp) GetSymUsed() [MaxSym]bool
- func (seqgrp *SeqGrp) GetType() SeqType
- func (seqgrp *SeqGrp) SetSymUsed(symSync ...*SymSync)
- func (seqgrp *SeqGrp) TypeKnwn() bool
- func (seqgrp SeqGrp) Upper() error
- func (seqgrp *SeqGrp) UsageFrac(gapsAreChar bool)
- func (seqgrp *SeqGrp) UsageSite()
- type SeqType
- type SymSync
Constants ¶
const (
MaxSym uint8 = 127
)
We only read ASCII characters, so anything bigger than this is not valid.
Variables ¶
This section is empty.
Functions ¶
func EntropyFromArray ¶
func EntropyFromArray(gapsAreChar bool, matrix [][]float32, entropy []float32, logbase int, gapMapping uint8)
EntropyFromArray is the inner routine for calculating entropy. It operates on the inner matrix, so it can be called from other routines which do not have the seqgrp, but do have a table of counts.
func WriteToF ¶
WriteToF takes a filename and a slice of sequences. It writes the sequences to the file. For each sequence, it should check if the sequence has been set to nil. What I could change: if we are removing gaps, we make a buffer which grows character by character via WriteByte(). I could size the buffer beforehand and grow it only as necessary. This function should also really act on a seqgrp.
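A minimal sketch of the pre-sized buffer idea, assuming "-" is the gap character and a sequence is a []byte. The name stripGaps is hypothetical. It is not how WriteToF is currently written.

```go
package main

import (
	"bytes"
	"fmt"
)

const gapChar = '-' // assumed gap symbol

// stripGaps returns seq without gap characters. The buffer is grown once,
// up front, because the output can never be longer than the input.
func stripGaps(seq []byte) []byte {
	var buf bytes.Buffer
	buf.Grow(len(seq))
	for _, c := range seq {
		if c != gapChar {
			buf.WriteByte(c)
		}
	}
	return buf.Bytes()
}

func main() {
	fmt.Println(string(stripGaps([]byte("A-C--GT"))))
}
```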
Types ¶
type Options ¶
type Options struct {
	Vbsty        int
	Dry_run      bool // Do not write any files
	Keep_gaps_rd bool // Keep gaps upon reading
	Rmv_gaps_wrt bool // Remove gaps on output
}
Options contains all the choices passed in from the caller.
type SeqGrp ¶
type SeqGrp struct {
// contains filtered or unexported fields
}
SeqGrp is a group of sequences, with some additional information such as what type (protein, nucleotide) and the number of symbols that have been used.
func Readfile ¶
Readfile takes a filename and reads the sequences from it, each in turn. It returns a SeqGrp, the number of duplicates and an error.
func Str2SeqGrp ¶
Str2SeqGrp takes some strings and returns them as a seqgrp. sIn is a slice of strings which are the sequences. prefix is an optional argument. Sequences need names/comments. If prefix is not given, sequences will be called "> s1", "> s2", ...
func (*SeqGrp) Compat ¶
Compat takes one sequence (a reference). It returns the frequency of each character from this sequence at each position in the alignment. Do you want to remove the reference sequence from the calculations? Usually yes.
func (*SeqGrp) Entropy ¶
Entropy calculates sequence entropy. It returns the result as a slice of the same length as the sequences. It needs to be told if gaps are characters, or should be ignored. If the sequence is a nucleotide or protein, we know what logarithm to use. If the sequence type is unknown, we use the number of different symbols as the log base. The caller allocates space for the result (entropy).
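As an illustration of the calculation, here is a sketch of per-site Shannon entropy computed from a table of counts. It assumes counts[i][k] holds how often symbol k occurs at site i, and it ignores gap handling. The name columnEntropy and the base of 4 used in main are assumptions for the example, not taken from the package.

```go
package main

import (
	"fmt"
	"math"
)

// columnEntropy fills entropy[i] with the Shannon entropy of site i,
// using logarithms to the given base. The caller allocates entropy,
// which must be at least as long as counts.
func columnEntropy(counts [][]float32, entropy []float32, logbase int) {
	lnBase := math.Log(float64(logbase))
	for i, col := range counts {
		var total float64
		for _, c := range col {
			total += float64(c)
		}
		var h float64
		for _, c := range col {
			if c > 0 && total > 0 {
				p := float64(c) / total
				h -= p * math.Log(p) / lnBase // log to the chosen base
			}
		}
		entropy[i] = float32(h)
	}
}

func main() {
	counts := [][]float32{
		{4, 0, 0, 0}, // fully conserved site
		{1, 1, 1, 1}, // every symbol equally likely
	}
	entropy := make([]float32, len(counts))
	columnEntropy(counts, entropy, 4)
	fmt.Println(entropy)
}
```

With the log base equal to the number of possible symbols, the entropy of a site lies between 0 (fully conserved) and 1 (all symbols equally frequent).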
func (*SeqGrp) FindNdx ¶
FindNdx returns the index of the sequence containing a string. Numbering starts from zero. We remove any ">", space or tab at the start.
func (*SeqGrp) GapFrac ¶
GapFrac looks in a SeqGrp and returns a slice with the fraction of gap characters at each position. If there are no gaps, there is no slice so we quietly return nil without signalling an error.
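The behaviour described above can be sketched as follows, assuming sequences are []byte, "-" is the gap character, and the first sequence sets the alignment length. The function gapFrac is a hypothetical stand-in, not the method itself.

```go
package main

import "fmt"

// gapFrac returns, for each site, the fraction of sequences that have a
// gap there. It returns nil if no gap is found anywhere.
func gapFrac(seqs [][]byte) []float32 {
	if len(seqs) == 0 {
		return nil
	}
	frac := make([]float32, len(seqs[0]))
	found := false
	for _, s := range seqs {
		for i, c := range s {
			if i < len(frac) && c == '-' {
				frac[i]++
				found = true
			}
		}
	}
	if !found {
		return nil // no gaps, so no slice
	}
	for i := range frac {
		frac[i] /= float32(len(seqs))
	}
	return frac
}

func main() {
	seqs := [][]byte{[]byte("A-GT"), []byte("A-G-"), []byte("ACGT"), []byte("AC--")}
	fmt.Println(gapFrac(seqs))
}
```

A caller has to check for nil before indexing into the result.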
func (*SeqGrp) GetLen ¶
GetLen returns the length of the first sequence. If we are reading a multiple sequence alignment, this should be the length of all sequences.
func (*SeqGrp) GetLogBase ¶
GetLogBase returns the base to be used for logarithms.
func (*SeqGrp) GetMap ¶
GetMap tells us where we are storing info about a symbol in our tallies. So, seq[i].GetMap() tells us where to put info about this character.
func (*SeqGrp) GetMapping ¶
GetMapping returns the mapping (row) for a specific character
func (*SeqGrp) GetSeqSlc ¶
func (seqgrp *SeqGrp) GetSeqSlc() []seq
GetSeqSlc returns the slice of sequences.
func (*SeqGrp) GetSymUsed ¶
GetSymUsed returns the normally non-exported symUsed
func (*SeqGrp) GetType ¶
GetType looks at a set of sequences and returns its best guess as to the type of file.
func (*SeqGrp) SetSymUsed ¶
SetSymUsed fills out the bool slice which says whether or not a symbol was used. Normally, this is just a loop over all sequences. If we are combining two seqgrps, then the symbols used in group A should also be marked used in group B and vice versa. If we get a second variadic argument, it is a channel to be used in combining.
func (*SeqGrp) UsageFrac ¶
UsageFrac converts counts to normalised frequencies. If letter 'A' occurs 2 times in five positions, its count entry will be changed from 2 to 2/5 = 0.4. If gapsAreChar is true, gaps ("-") are treated as a valid character type. Otherwise they are removed from the tallies. If gapsAreChar is not true, then
- a symbol's fraction is the fraction of non-gaps in which you find this symbol
- the gap's fraction is the fraction of the total number of residues in which one finds a gap.
This means that the fractions of non-gaps add up to 1, and then you have a bit more due to gaps. It also means that the data looks correct when you plot it out.
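The two normalisation rules can be illustrated on a single site. This is only a sketch: the function usageFrac, the gapNdx parameter and the column layout are assumptions made for the example, and the real method works on the whole counts matrix.

```go
package main

import "fmt"

// usageFrac normalises the counts of one site in place. gapNdx is the
// position of the gap count within col (an assumed layout).
func usageFrac(col []float32, gapNdx int, gapsAreChar bool) {
	var total float32
	for _, c := range col {
		total += c
	}
	if total == 0 {
		return
	}
	nonGap := total - col[gapNdx]
	for k := range col {
		switch {
		case gapsAreChar || k == gapNdx:
			col[k] /= total // fraction of all residues
		case nonGap > 0:
			col[k] /= nonGap // fraction of the non-gap residues
		}
	}
}

func main() {
	// Counts for A, C, G, T and gap at one site of five sequences.
	asChar := []float32{2, 1, 1, 0, 1}
	usageFrac(asChar, 4, true)
	fmt.Println(asChar)

	ignored := []float32{2, 1, 1, 0, 1}
	usageFrac(ignored, 4, false)
	fmt.Println(ignored)
}
```

In the second case the four symbol fractions sum to 1 and the gap entry adds a further 1/5 on top.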
func (*SeqGrp) UsageSite ¶
func (seqgrp *SeqGrp) UsageSite()
UsageSite counts how many of each symbol/character appear at each site in the alignment. counts.Mat has the shape [length_of_seq][number_of_types]. We store it as float32, since it will later usually be normalised and converted to fractions. The inaccuracy introduced by working with floats is no problem, and we avoid allocating a new matrix for the frequencies.
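A sketch of the counting step, assuming sequences are []byte of equal length and the set of symbols is known in advance. The function usageSite and its alphabet parameter are hypothetical. The real method takes its symbol mapping from the SeqGrp.

```go
package main

import "fmt"

// usageSite returns counts[site][symbol], where the symbol index is the
// position of the character in alphabet. Characters that are not in the
// alphabet, or not ASCII, are skipped.
func usageSite(seqs [][]byte, alphabet string) [][]float32 {
	if len(seqs) == 0 {
		return nil
	}
	var col [128]int // ASCII character -> column, -1 if unused
	for i := range col {
		col[i] = -1
	}
	for k := 0; k < len(alphabet); k++ {
		if alphabet[k] < 128 {
			col[alphabet[k]] = k
		}
	}
	counts := make([][]float32, len(seqs[0]))
	for i := range counts {
		counts[i] = make([]float32, len(alphabet))
	}
	for _, s := range seqs {
		for i, c := range s {
			if i >= len(counts) || c >= 128 {
				continue
			}
			if k := col[c]; k >= 0 {
				counts[i][k]++
			}
		}
	}
	return counts
}

func main() {
	seqs := [][]byte{[]byte("AC-"), []byte("AG-"), []byte("TC-")}
	for _, row := range usageSite(seqs, "ACGT-") {
		fmt.Println(row)
	}
}
```

Because the rows are already float32, a later normalisation step can overwrite them in place.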