seq

package
v0.0.0-...-160f2ba
Published: Feb 28, 2021 License: GPL-3.0 Imports: 13 Imported by: 0

Documentation

Overview

Organisation: move everything to do with the seq structure to the start, then think about moving all the seqgrp code to its own file. A big change I should try: at the moment, we allocate every sequence individually. I could allocate one big lump and set up pointers into it. Even more fun, I could use golang.org/x/exp/mmap and just set up slices so they point in there.
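The "one big lump" idea can be sketched as follows. This is only an illustration under assumptions, not the package's code: packSeqs and its layout are hypothetical, and sequences are taken to be plain byte slices.

```go
package main

import "fmt"

// packSeqs (hypothetical) copies every sequence into one shared backing
// array and returns sub-slices that point into it.
func packSeqs(in []string) [][]byte {
	total := 0
	for _, s := range in {
		total += len(s)
	}
	lump := make([]byte, total) // a single allocation for all residues
	out := make([][]byte, len(in))
	off := 0
	for i, s := range in {
		end := off + copy(lump[off:], s)
		// The three-index slice caps the capacity, so an append on one
		// sequence cannot overwrite the start of its neighbour.
		out[i] = lump[off:end:end]
		off = end
	}
	return out
}

func main() {
	for _, s := range packSeqs([]string{"ACGT", "AC-T", "A--T"}) {
		fmt.Printf("%s len=%d cap=%d\n", s, len(s), cap(s))
	}
}
```

The trade-off is that the whole lump stays alive as long as any one sequence is referenced.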

Constants

const (
	MaxSym uint8 = 127
)

We only read ASCII characters, so anything bigger than this is not valid.

Variables

This section is empty.

Functions

func EntropyFromArray

func EntropyFromArray(gapsAreChar bool,
	matrix [][]float32, entropy []float32, logbase int, gapMapping uint8)

EntropyFromArray is the inner routine for calculating entropy. It operates on the inner matrix, so it can be called from other routines which do not have the seqgrp, but do have a table of counts.
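As a rough illustration of a per-site entropy calculation on a table, here is a minimal sketch. It is not the package's implementation: entropyPerSite is a hypothetical name, the table is assumed to be indexed [site][symbol] and already normalised to fractions, and gap handling (gapsAreChar, gapMapping) is left out.

```go
package main

import (
	"fmt"
	"math"
)

// entropyPerSite (hypothetical) fills entropy[i] with the Shannon entropy
// of site i. fracs[i][j] is assumed to be the fraction of symbol j at
// site i, each row summing to 1. logbase must be greater than 1.
func entropyPerSite(fracs [][]float32, logbase int, entropy []float32) {
	norm := math.Log(float64(logbase))
	for i, row := range fracs {
		var h float64
		for _, p := range row {
			if p > 0 { // a symbol that never occurs contributes nothing
				h -= float64(p) * math.Log(float64(p)) / norm
			}
		}
		entropy[i] = float32(h)
	}
}

func main() {
	fracs := [][]float32{
		{1, 0, 0, 0},             // fully conserved site
		{0.25, 0.25, 0.25, 0.25}, // all four symbols equally likely
	}
	entropy := make([]float32, len(fracs))
	entropyPerSite(fracs, 4, entropy)
	fmt.Println(entropy)
}
```

With the log base equal to the number of symbols, a fully conserved site has entropy 0 and a uniformly mixed site has entropy close to 1.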

func ReadSeqs

func ReadSeqs(fp io.Reader, seqgrp *SeqGrp, s_opts *Options) (n_dup int, err error)

func WriteToF

func WriteToF(outseq_fname string, seq_set []seq, s_opts *Options) (err error)

WriteToF takes a filename and a slice of sequences. It writes the sequences to the file. For each sequence, it should check if the sequence has been set to nil. What I could change: If we are removing gaps, we make a buffer which grows character by character via WriteByte(). I could make a buffer beforehand and grow as necessary. This should also really act on a seqgrp.
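The buffer change mentioned above might look like the sketch below. It assumes a bytes.Buffer is the buffer in question; stripGaps and degap are hypothetical helpers, not functions of this package.

```go
package main

import (
	"bytes"
	"fmt"
)

// stripGaps (hypothetical) copies s into buf, leaving out '-' characters.
func stripGaps(buf *bytes.Buffer, s []byte) {
	buf.Reset()
	// Reserve space once. The degapped result is never longer than s.
	buf.Grow(len(s))
	for _, c := range s {
		if c != '-' {
			buf.WriteByte(c)
		}
	}
}

// degap (hypothetical) is a convenience wrapper for a single string.
func degap(s string) string {
	var buf bytes.Buffer
	stripGaps(&buf, []byte(s))
	return buf.String()
}

func main() {
	var buf bytes.Buffer // reused, so later sequences rarely reallocate
	for _, s := range []string{"AC--GT", "A-C-G-"} {
		stripGaps(&buf, []byte(s))
		fmt.Println(buf.String())
	}
	fmt.Println(degap("--TT--"))
}
```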

Types

type Options

type Options struct {
	Vbsty        int
	Dry_run      bool // Do not write any files
	Keep_gaps_rd bool // Keep gaps upon reading
	Rmv_gaps_wrt bool // Remove gaps on output
}

Options contains all the choices passed in from the caller.

type SeqGrp

type SeqGrp struct {
	// contains filtered or unexported fields
}

SeqGrp is a group of sequences, with some additional information such as what type (protein, nucleotide) and the number of symbols that have been used.

func Readfile

func Readfile(fname string, s_opts *Options) (*SeqGrp, int, error)

Readfile takes a filename and reads the sequences from it, each in turn. It returns a SeqGrp, the number of duplicates and an error.

func Str2SeqGrp

func Str2SeqGrp(sIn []string, prefix ...string) (seqgrp SeqGrp)

Str2SeqGrp takes some strings and returns them as a seqgrp. sIn is a slice of strings which are the sequences. prefix is an optional argument. Sequences need names/comments. If prefix is not given, sequences will be called "> s1", "> s2", ...
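The default naming can be illustrated with a small stand-alone sketch. defaultNames is hypothetical, and how a supplied prefix is actually used is an assumption here: it is taken to replace the "s" in the default names.

```go
package main

import "fmt"

// defaultNames (hypothetical) builds one comment line per sequence.
// Assumption: an optional prefix replaces the default "s".
func defaultNames(n int, prefix ...string) []string {
	p := "s"
	if len(prefix) > 0 {
		p = prefix[0]
	}
	names := make([]string, n)
	for i := range names {
		names[i] = fmt.Sprintf("> %s%d", p, i+1) // numbering starts at 1
	}
	return names
}

func main() {
	fmt.Println(defaultNames(2))
	fmt.Println(defaultNames(2, "test"))
}
```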

func (*SeqGrp) Compat

func (seqgrp *SeqGrp) Compat(refseq []byte, gapsAreChar bool) []float32

Compat takes one sequence (a reference). It returns the frequency of each character from this sequence at each position in the alignment. Do you want to remove the reference sequence from the calculations? Usually yes.
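The idea can be sketched without the SeqGrp internals. In this hypothetical compat function, the alignment is a slice of equal-length byte slices that does not include the reference, and gaps are treated like any other character.

```go
package main

import "fmt"

// compat (hypothetical) returns, for each position, the fraction of
// aligned sequences that carry the same character as the reference.
// Assumption: every sequence is at least as long as ref.
func compat(seqs [][]byte, ref []byte) []float32 {
	out := make([]float32, len(ref))
	if len(seqs) == 0 {
		return out
	}
	for i, r := range ref {
		n := 0
		for _, s := range seqs {
			if s[i] == r {
				n++
			}
		}
		out[i] = float32(n) / float32(len(seqs))
	}
	return out
}

func main() {
	seqs := [][]byte{[]byte("ACGT"), []byte("ACGA"), []byte("TCGA"), []byte("TCCA")}
	fmt.Println(compat(seqs, []byte("ACGT")))
}
```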

func (*SeqGrp) Entropy

func (seqgrp *SeqGrp) Entropy(gapsAreChar bool, entropy []float32)

Entropy calculates sequence entropy. It returns the result as a slice of the same length as the sequences. It needs to be told whether gaps are characters or should be ignored. If the sequence is a nucleotide or protein, we know what logarithm to use. If the sequence type is unknown, we use the number of different symbols as the log base. The caller allocates space for the result (entropy).

func (*SeqGrp) FindNdx

func (seqgrp *SeqGrp) FindNdx(s string) int

FindNdx returns the index of the sequence containing a string. Numbering starts from zero. We remove any ">", space or tab at the start.
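A lookup of this kind could be sketched as below. findNdx is a hypothetical stand-in that searches a plain slice of comment lines. Two points are assumptions: the leading characters are stripped from the search string, and a miss returns -1 (the documentation does not say what a miss returns).

```go
package main

import (
	"fmt"
	"strings"
)

// findNdx (hypothetical) returns the zero-based index of the first name
// that contains s, after stripping any leading '>', space or tab from s.
func findNdx(names []string, s string) int {
	s = strings.TrimLeft(s, "> \t")
	for i, name := range names {
		if strings.Contains(name, s) {
			return i
		}
	}
	return -1 // assumption: value used for "not found"
}

func main() {
	names := []string{"> s1 human", "> s2 mouse"}
	fmt.Println(findNdx(names, "> s2"), findNdx(names, "yeast"))
}
```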

func (*SeqGrp) GapFrac

func (seqgrp *SeqGrp) GapFrac() []float32

GapFrac looks in a SeqGrp and returns a slice with the fraction of gap characters at each position. If there are no gaps, there is no slice so we quietly return nil without signalling an error.
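The behaviour described here, including the quiet nil for a gap-free alignment, can be sketched like this. gapFrac is hypothetical and works on plain byte slices that are assumed to have equal length.

```go
package main

import "fmt"

// gapFrac (hypothetical) returns the fraction of '-' at each position,
// or nil when the alignment contains no gaps at all.
// Assumption: all sequences have the same length as the first one.
func gapFrac(seqs [][]byte) []float32 {
	if len(seqs) == 0 {
		return nil
	}
	frac := make([]float32, len(seqs[0]))
	found := false
	for _, s := range seqs {
		for i, c := range s {
			if c == '-' {
				frac[i]++
				found = true
			}
		}
	}
	if !found {
		return nil // no gaps: nothing to report, and not an error
	}
	n := float32(len(seqs))
	for i := range frac {
		frac[i] /= n
	}
	return frac
}

func main() {
	seqs := [][]byte{[]byte("A-C"), []byte("A-C"), []byte("AAC"), []byte("A--")}
	fmt.Println(gapFrac(seqs))
}
```

A caller then only needs a nil check to find out whether gaps are present.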

func (*SeqGrp) GetCounts

func (seqgrp *SeqGrp) GetCounts() *matrix.FMatrix2d

GetCounts gives us the normally non-exported counts

func (*SeqGrp) GetLen

func (seqgrp *SeqGrp) GetLen() int

GetLen returns the length of the first sequence. If we are reading a multiple sequence alignment, this should be the length of all sequences.

func (*SeqGrp) GetLogBase

func (seqgrp *SeqGrp) GetLogBase(gapsAreChar bool) (nSym int)

GetLogBase returns the base to be used for logarithms

func (*SeqGrp) GetMap

func (seqgrp *SeqGrp) GetMap(c byte) uint8

GetMap tells us where we are storing info about a symbol in our tallies. So, seq[i].GetMap() tells us where to put info about this character.

func (*SeqGrp) GetMapping

func (seqgrp *SeqGrp) GetMapping(c uint8) uint8

GetMapping returns the mapping (row) for a specific character

func (*SeqGrp) GetNSeq

func (seqgrp *SeqGrp) GetNSeq() int

GetNSeq returns the number of sequences

func (*SeqGrp) GetNSym

func (seqgrp *SeqGrp) GetNSym() int

GetNSym returns the number of symbols used in a seqgrp. Used in testing.

func (*SeqGrp) GetRevmap

func (seqgrp *SeqGrp) GetRevmap() []uint8

GetRevmap returns the non-exported revmap

func (*SeqGrp) GetSeqSlc

func (seqgrp *SeqGrp) GetSeqSlc() []seq

GetSeqSlc returns the slice of sequences

func (*SeqGrp) GetSymUsed

func (seqgrp *SeqGrp) GetSymUsed() [MaxSym]bool

GetSymUsed returns the normally non-exported symUsed

func (*SeqGrp) GetType

func (seqgrp *SeqGrp) GetType() SeqType

GetType looks at a set of sequences and returns its best guess as to the type of file.

func (*SeqGrp) SetSymUsed

func (seqgrp *SeqGrp) SetSymUsed(symSync ...*SymSync)

SetSymUsed fills out the bool slice which says whether or not a symbol was used. Normally, this is just a loop over all sequences. If we are combining two seqgrps, then the symbols used in group A should also be marked as used in group B and vice versa. If the optional variadic argument is given, it carries the channel to be used in combining.
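The marking loop and the idea of sharing symbols between two groups can be sketched as follows. This swaps in a plain merge function for the channel and sync.Once mechanism of SymSync, so it shows the intended result only, not how the package synchronises. symUsed and merge are hypothetical names.

```go
package main

import "fmt"

const maxSym = 127 // mirrors MaxSym

// symUsed (hypothetical) marks every symbol that occurs in seqs.
func symUsed(seqs [][]byte) [maxSym]bool {
	var used [maxSym]bool
	for _, s := range seqs {
		for _, c := range s {
			if c < maxSym { // skip anything outside the table
				used[c] = true
			}
		}
	}
	return used
}

// merge (hypothetical) returns the union of two tables, so a symbol
// used in either group counts as used in both.
func merge(a, b [maxSym]bool) [maxSym]bool {
	for i := range a {
		a[i] = a[i] || b[i]
	}
	return a
}

func main() {
	a := symUsed([][]byte{[]byte("AC")})
	b := symUsed([][]byte{[]byte("GT")})
	both := merge(a, b)
	fmt.Println(both['A'], both['T'], both['N'])
}
```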

func (*SeqGrp) TypeKnwn

func (seqgrp *SeqGrp) TypeKnwn() bool

TypeKnwn tells us if we have decided what kind of sequence we have.

func (SeqGrp) Upper

func (seqgrp SeqGrp) Upper() error

func (*SeqGrp) UsageFrac

func (seqgrp *SeqGrp) UsageFrac(gapsAreChar bool)

UsageFrac converts counts to normalised frequencies. If letter 'A' occurs 2 times in five positions, its count entry will be changed from 2 to 2/5 = 0.4. If gapsAreChar is true, gaps ("-") are treated as a valid character type. Otherwise they are removed from the tallies. If gapsAreChar is not true, then

a symbol's fraction is the fraction of non-gaps
            in which you find this symbol
the gap's fraction is the fraction of the total
            number of residues in which one finds a gap.

This means that the fractions of non-gaps add up to 1, and then you have a bit more due to gaps. It also means that the data looks correct when you plot it out.
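For the case where gapsAreChar is false, the arithmetic for a single site can be sketched like this. usageFrac here is a hypothetical stand-alone function; the real method works on the group's internal count matrix.

```go
package main

import "fmt"

// usageFrac (hypothetical) normalises the tallies of one site in place
// and returns the gap fraction. counts holds the non-gap symbols only;
// gaps is the tally for '-'.
func usageFrac(counts []float32, gaps float32) float32 {
	var nonGap float32
	for _, c := range counts {
		nonGap += c
	}
	total := nonGap + gaps
	if total == 0 {
		return 0
	}
	if nonGap > 0 {
		for i := range counts {
			counts[i] /= nonGap // fraction among the non-gap residues
		}
	}
	return gaps / total // fraction among all residues
}

func main() {
	counts := []float32{3, 1} // for example 3 x 'A' and 1 x 'C'
	gapFrac := usageFrac(counts, 4)
	fmt.Println(counts, gapFrac)
}
```

With 3 A, 1 C and 4 gaps, the symbol fractions are 0.75 and 0.25 (summing to 1) and the gap fraction is 0.5 on top of that.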

func (*SeqGrp) UsageSite

func (seqgrp *SeqGrp) UsageSite()

UsageSite counts how many of each symbol/character appear at each site in the alignment. counts.Mat looks like [length_of_seq][number_of_types]. We store it as a float32, since it will later usually be normalised and converted to a fraction. Inaccuracy introduced by working with floats is no problem and we can avoid allocating a new matrix for the frequencies.
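A per-site tally with the [length_of_seq][number_of_types] layout might be built as in this sketch. usageSite is hypothetical, the column order (first appearance) is an assumption, and the sequences are taken to have equal length.

```go
package main

import "fmt"

const maxSym = 127 // mirrors MaxSym

// usageSite (hypothetical) returns counts[site][column] together with
// the symbol-to-column table. Columns are assigned in order of first
// appearance; unused symbols keep column -1.
func usageSite(seqs [][]byte) ([][]float32, [maxSym]int) {
	var col [maxSym]int
	for i := range col {
		col[i] = -1
	}
	nSym := 0
	for _, s := range seqs {
		for _, c := range s {
			if c < maxSym && col[c] < 0 {
				col[c] = nSym
				nSym++
			}
		}
	}
	if len(seqs) == 0 {
		return nil, col
	}
	counts := make([][]float32, len(seqs[0]))
	for i := range counts {
		counts[i] = make([]float32, nSym)
	}
	for _, s := range seqs {
		for i, c := range s {
			if c < maxSym {
				counts[i][col[c]]++ // float32, ready to be normalised later
			}
		}
	}
	return counts, col
}

func main() {
	counts, col := usageSite([][]byte{[]byte("AC"), []byte("AG"), []byte("TC")})
	fmt.Println(counts, col['A'], col['C'], col['G'], col['T'])
}
```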

type SeqType

type SeqType byte

A marker to say what type of sequence we have, protein, DNA, ...

const (
	Unchecked SeqType = iota // Has not been looked at yet
	Unknown                  // Really unknown, not a protein or nucleotide
	Protein                  //
	DNA                      //
	RNA                      //
	Ntide                    // Nucleotide
)

type SymSync

type SymSync struct {
	Once  sync.Once
	UChan chan [MaxSym]bool
}
