kmers

package

v0.0.0-...-d3d09aa Published: Mar 5, 2021 License: GPL-3.0 Imports: 9 Imported by: 0

Repository

github.com/cmdoret/strangerseq

Documentation ¶

Index ¶

func Build2dSlice(rows int, cols int) [][]float64
func RandGCWeightSeq(seqlen int, bases []string, cumWeights []float64) string
func RandSeqs(nseq int, seqlen int, bases []string, gc float64) []string
func ScoreSeqs(seqs []string, genome *Genome) ([]float64, []float64)
func SeqGC(seq string) int
type Chain
type Genome
- func NewGenome(path string, k int, gcWeight float64, similar bool, FixedGC float64) *Genome
type SeqsAndScores
type SortByScore

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Build2dSlice ¶

func Build2dSlice(rows int, cols int) [][]float64

Build2dSlice builds a 2D slice of float64 with the given number of rows and columns.
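
As an illustration, a rows x cols matrix can be allocated as below. This is a minimal sketch, not the package's source: build2D is a hypothetical stand-in, and zero-initialization is an assumption that follows from how Go's make works.

```go
package main

import "fmt"

// build2D is a hypothetical stand-in for Build2dSlice. It allocates a
// rows x cols matrix of float64; every cell starts at 0.
func build2D(rows, cols int) [][]float64 {
	m := make([][]float64, rows)
	for i := range m {
		// Each row gets its own backing array, so rows do not alias.
		m[i] = make([]float64, cols)
	}
	return m
}

func main() {
	m := build2D(2, 3)
	m[1][2] = 0.5
	fmt.Println(len(m), len(m[0]), m[1][2]) // prints "2 3 0.5"
}
```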

func RandGCWeightSeq ¶

func RandGCWeightSeq(seqlen int, bases []string, cumWeights []float64) string

RandGCWeightSeq generates a random sequence weighted by GC content. cumWeights must be cumulative base weights, with 1 as the maximum value.
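
To make the cumulative-weight requirement concrete, here is a sketch of sampling bases from cumulative weights. The helper names (pickBase, randWeightedSeq), the base order A, C, G, T and the 60% GC weights are assumptions chosen for illustration; the package may implement this differently.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// pickBase returns the first base whose cumulative weight exceeds r,
// where r is a draw in [0, 1). Hypothetical helper.
func pickBase(r float64, bases []string, cumWeights []float64) string {
	for i, w := range cumWeights {
		if r < w {
			return bases[i]
		}
	}
	// Guard against a last weight slightly below 1 due to rounding.
	return bases[len(bases)-1]
}

// randWeightedSeq draws seqlen bases independently from cumWeights.
func randWeightedSeq(rng *rand.Rand, seqlen int, bases []string, cumWeights []float64) string {
	var sb strings.Builder
	for i := 0; i < seqlen; i++ {
		sb.WriteString(pickBase(rng.Float64(), bases, cumWeights))
	}
	return sb.String()
}

func main() {
	bases := []string{"A", "C", "G", "T"}
	// Assumed weights for 60% GC: A=0.2, C=0.3, G=0.3, T=0.2,
	// accumulated left to right so the last entry is 1.
	cum := []float64{0.2, 0.5, 0.8, 1.0}
	rng := rand.New(rand.NewSource(1))
	fmt.Println(randWeightedSeq(rng, 20, bases, cum))
}
```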

func RandSeqs ¶

func RandSeqs(nseq int, seqlen int, bases []string, gc float64) []string

RandSeqs generates random sequences with a target GC content. NOTE (TODO): change randstring; it is currently not weighted.

func ScoreSeqs ¶

func ScoreSeqs(seqs []string, genome *Genome) ([]float64, []float64)

ScoreSeqs assigns two scores to each sequence in a list. Scores vary between 0 and 1. The first score only takes k-mer frequency into account, while the second score is adjusted for GC content divergence from the target genome. Rare k-mers increase the score, and deviation from the genome GC content decreases it.

func SeqGC ¶

func SeqGC(seq string) int

SeqGC returns the number of GC bases in a sequence. Does not handle IUPAC ambiguous bases.
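
A GC count of this kind can be sketched as follows. seqGC is a hypothetical stand-in; counting lowercase g and c is an assumption, since the documentation does not say how case is handled. As documented, ambiguity codes such as S (G or C) are not counted.

```go
package main

import "fmt"

// seqGC counts G and C bases in seq. Hypothetical stand-in for SeqGC.
// Assumption: lowercase g/c are counted too. IUPAC ambiguity codes
// (S, N, ...) are ignored.
func seqGC(seq string) int {
	n := 0
	for i := 0; i < len(seq); i++ {
		switch seq[i] {
		case 'G', 'C', 'g', 'c':
			n++
		}
	}
	return n
}

func main() {
	seq := "ACGTGGCC"
	gc := seqGC(seq)
	// prints "6 GC bases, fraction 0.75"
	fmt.Printf("%d GC bases, fraction %.2f\n", gc, float64(gc)/float64(len(seq)))
}
```

Because the function returns a count rather than a fraction, callers divide by the sequence length to obtain a GC content between 0 and 1.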

Types ¶

type Chain ¶

type Chain struct {
	Matrix [][]float64    // Markov state transition matrix Lmers -> alphabet
	Lidx   map[string]int // Correspondence between lmers (l=k-1) and Chain's rows
	Bidx   map[string]int // Correspondence between bases and Chain's cols
}

Chain contains a Markov chain of order l, where l = k-1, giving transition probabilities for the next base. It also has two maps matching lmers and bases to the row and column indices of the matrix.
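
The sketch below shows how the two index maps would be used to look up a transition probability. The chain type mirrors the documented fields, but the prob method, the exampleChain helper and all matrix values are made up for illustration.

```go
package main

import "fmt"

// chain mirrors the documented Chain fields (local copy for the example).
type chain struct {
	Matrix [][]float64
	Lidx   map[string]int
	Bidx   map[string]int
}

// prob returns P(next | lmer) by going through both index maps.
// Hypothetical helper, not part of the documented API.
func (c chain) prob(lmer, next string) float64 {
	return c.Matrix[c.Lidx[lmer]][c.Bidx[next]]
}

// exampleChain builds a chain for k = 2, so l-mers are single bases.
// The probabilities are invented; each row sums to 1.
func exampleChain() chain {
	idx := map[string]int{"A": 0, "C": 1, "G": 2, "T": 3}
	return chain{
		Matrix: [][]float64{
			{0.4, 0.1, 0.1, 0.4},     // after A
			{0.25, 0.25, 0.25, 0.25}, // after C
			{0.25, 0.25, 0.25, 0.25}, // after G
			{0.1, 0.4, 0.4, 0.1},     // after T
		},
		Lidx: idx,
		Bidx: idx,
	}
}

func main() {
	c := exampleChain()
	fmt.Println(c.prob("T", "G")) // prints "0.4"
}
```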

type Genome ¶

type Genome struct {
	GC       float64        // GC content between 0 and 1
	KmerSize int            // Length of kmers to consider
	Kmers    map[string]int // All kmers and their frequencies
	Bases    []string
	GCWeight float64 // Importance given to GC content of simulated sequences
	Chain    Chain   // Struct containing a Markov chain.
	Similar  bool    // Should generated sequences use frequent k-mers (instead of rare k-mers)?
}

Genome holds k-mer information about a genome and a Markov state transition matrix of order l = k-1. Transition probabilities are the chance of moving to the next base B given the previous l bases.

func NewGenome ¶

func NewGenome(path string, k int, gcWeight float64, similar bool, FixedGC float64) *Genome

NewGenome constructs a Genome object based on a FASTA file and predefined k-mer size.

func (*Genome) FastaToProfile ¶

func (g *Genome) FastaToProfile(file string)

FastaToProfile parses a FASTA file, fills the k-mer profile and Markov chain of a Genome struct, and sets its GC content.

func (*Genome) FillChain ¶

func (g *Genome) FillChain()

FillChain populates transition probabilities in the Markov chain of order l based on the Genome k-mer profile. Laplace (additive) smoothing is used to avoid getting stuck in a state.
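
The smoothing step can be sketched as follows for a single row of the matrix. transitionRow is a hypothetical helper, and the pseudocount of 1 per base is an assumption (the documentation does not state the value used). Each k-mer is read as an l-mer followed by the next base.

```go
package main

import "fmt"

// transitionRow converts the counts of the k-mers that extend lmer into
// transition probabilities. Adding 1 to every count (add-one smoothing)
// keeps unseen transitions at a small non-zero probability.
func transitionRow(lmer string, bases []string, kmers map[string]int) []float64 {
	row := make([]float64, len(bases))
	total := 0.0
	for i, b := range bases {
		row[i] = float64(kmers[lmer+b]) + 1 // missing k-mers count as 0
		total += row[i]
	}
	for i := range row {
		row[i] /= total
	}
	return row
}

func main() {
	bases := []string{"A", "C", "G", "T"}
	kmers := map[string]int{"AA": 3, "AC": 1}
	// Smoothed counts are 4, 2, 1, 1 (total 8).
	fmt.Println(transitionRow("A", bases, kmers)) // prints "[0.5 0.25 0.125 0.125]"
}
```

Without the pseudocount, an l-mer that never occurs in the genome would yield a row of zeros, and a generator reaching that state could not continue.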

func (*Genome) GenSeqs ¶

func (g *Genome) GenSeqs(nseq int, seqlen int) []string

GenSeqs uses the Markov chain of a Genome object to generate fixed-length sequences. It also adjusts transition probabilities according to the GC deviation of the sequence and the weight attributed to GC content.
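
The core of such a generator is a walk along the chain, sketched below. This is an illustration only: the GC-based adjustment of the transition probabilities is deliberately left out because its formula is not documented here, and sample, generate and the rows map are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sample picks a base from a probability row given a draw u in [0, 1).
func sample(u float64, row []float64, bases []string) string {
	acc := 0.0
	for i, p := range row {
		acc += p
		if u < acc {
			return bases[i]
		}
	}
	return bases[len(bases)-1] // rounding guard
}

// generate extends seed until it is seqlen bases long. Each new base is
// conditioned on the previous l bases. Assumes len(seed) >= l and that
// rows has an entry for every l-mer that can occur.
func generate(draw func() float64, seed string, seqlen, l int, bases []string, rows map[string][]float64) string {
	seq := []byte(seed)
	for len(seq) < seqlen {
		lmer := string(seq[len(seq)-l:])
		seq = append(seq, sample(draw(), rows[lmer], bases)...)
	}
	return string(seq)
}

func main() {
	bases := []string{"A", "C", "G", "T"}
	// Invented order-1 chain (k = 2): one row per preceding base.
	rows := map[string][]float64{
		"A": {0.4, 0.1, 0.1, 0.4},
		"C": {0.25, 0.25, 0.25, 0.25},
		"G": {0.25, 0.25, 0.25, 0.25},
		"T": {0.1, 0.4, 0.4, 0.1},
	}
	rng := rand.New(rand.NewSource(42))
	fmt.Println(generate(rng.Float64, "A", 30, 1, bases, rows))
}
```

Passing the random source as a func() float64 keeps the walk easy to test with a fixed draw.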

func (*Genome) GenerateKmers ¶

func (g *Genome) GenerateKmers(k int) []string

GenerateKmers initializes a list of all kmers in alphabetical order. Implemented using recursion.
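
A recursive enumeration of this kind might look like the sketch below. allKmers is a hypothetical free function (the documented method takes only k and presumably reads the bases from the Genome); alphabetical output relies on the bases being passed in sorted order.

```go
package main

import "fmt"

// allKmers returns every string of length k over bases. If bases is
// sorted, the result is in alphabetical order.
func allKmers(k int, bases []string) []string {
	if k == 0 {
		return []string{""} // base case: the single empty k-mer
	}
	suffixes := allKmers(k-1, bases)
	out := make([]string, 0, len(bases)*len(suffixes))
	for _, b := range bases {
		for _, s := range suffixes {
			out = append(out, b+s)
		}
	}
	return out
}

func main() {
	kmers := allKmers(2, []string{"A", "C", "G", "T"})
	fmt.Println(len(kmers), kmers[:5]) // prints "16 [AA AC AG AT CA]"
}
```

The list grows as 4^k for a DNA alphabet, so this approach is only practical for small k.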

func (*Genome) GetKmers ¶

func (g *Genome) GetKmers(seq string)

GetKmers adds occurrences of kmers in input sequences to the kmer profile of a Genome instance.
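
Counting k-mers with a sliding window can be sketched as below. countKmers is a hypothetical free function that takes the map and k explicitly, whereas the documented method presumably uses the Genome's Kmers and KmerSize fields.

```go
package main

import "fmt"

// countKmers adds every overlapping k-mer of seq to counts. Calling it
// on several sequences accumulates into the same profile.
func countKmers(seq string, k int, counts map[string]int) {
	for i := 0; i+k <= len(seq); i++ {
		counts[seq[i:i+k]]++
	}
}

func main() {
	counts := make(map[string]int)
	countKmers("ACGACG", 3, counts)
	// Windows: ACG, CGA, GAC, ACG.
	fmt.Println(counts["ACG"], len(counts)) // prints "2 3"
}
```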

func (*Genome) SeedSeq ¶

func (g *Genome) SeedSeq() string

SeedSeq picks a k-mer using k-mer frequencies as probability weights. It uses inverse frequencies if the Similar field of the receiver genome is set to false. Note that SeedSeq does not directly take GC content into account when picking a k-mer.
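
The weighting can be sketched as follows. seedWeights and pickSeed are hypothetical helpers, and the inverse-frequency form 1/(count+1) is an assumption chosen to avoid division by zero; the package may invert frequencies differently.

```go
package main

import (
	"fmt"
	"math/rand"
)

// seedWeights returns normalized sampling weights for kmers. With
// similar = true, weights follow the counts; otherwise they follow
// 1/(count+1) so that rare k-mers are favored (assumed formula).
// Assumes the weights do not all sum to zero.
func seedWeights(kmers []string, counts map[string]int, similar bool) []float64 {
	w := make([]float64, len(kmers))
	total := 0.0
	for i, km := range kmers {
		c := float64(counts[km])
		if similar {
			w[i] = c
		} else {
			w[i] = 1 / (c + 1)
		}
		total += w[i]
	}
	for i := range w {
		w[i] /= total
	}
	return w
}

// pickSeed maps a draw u in [0, 1) to a k-mer using the weights.
func pickSeed(u float64, kmers []string, weights []float64) string {
	acc := 0.0
	for i, p := range weights {
		acc += p
		if u < acc {
			return kmers[i]
		}
	}
	return kmers[len(kmers)-1] // rounding guard
}

func main() {
	kmers := []string{"AA", "CC"}
	counts := map[string]int{"AA": 3, "CC": 1}
	rng := rand.New(rand.NewSource(7))
	frequent := seedWeights(kmers, counts, true)
	rare := seedWeights(kmers, counts, false)
	fmt.Println(frequent, pickSeed(rng.Float64(), kmers, frequent))
	fmt.Println(rare, pickSeed(rng.Float64(), kmers, rare))
}
```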

type SeqsAndScores ¶

type SeqsAndScores struct {
	Seqs       []string
	KmerScores []float64
	FullScores []float64
}

SeqsAndScores bundles sequences with their scores. It is used to define a sorting interface that sorts sequences according to their (full) scores.

type SortByScore ¶

type SortByScore SeqsAndScores

func (SortByScore) Len ¶

func (sbs SortByScore) Len() int

func (SortByScore) Less ¶

func (sbs SortByScore) Less(i, j int) bool

func (SortByScore) Swap ¶

func (sbs SortByScore) Swap(i, j int)
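
The three methods above are the ones required by sort.Interface from the standard library. The sketch below shows how such a type is typically wired up, using local copies of the types. Ascending order on FullScores and swapping all three slices together are assumptions about the method bodies; sortByFullScore is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// Local copies of the documented types, for illustration only.
type seqsAndScores struct {
	Seqs       []string
	KmerScores []float64
	FullScores []float64
}

type byScore seqsAndScores

func (s byScore) Len() int { return len(s.Seqs) }

// Less orders by full score, lowest first (assumed direction).
func (s byScore) Less(i, j int) bool { return s.FullScores[i] < s.FullScores[j] }

// Swap moves all three parallel slices together so that each sequence
// keeps its own scores.
func (s byScore) Swap(i, j int) {
	s.Seqs[i], s.Seqs[j] = s.Seqs[j], s.Seqs[i]
	s.KmerScores[i], s.KmerScores[j] = s.KmerScores[j], s.KmerScores[i]
	s.FullScores[i], s.FullScores[j] = s.FullScores[j], s.FullScores[i]
}

// sortByFullScore sorts d in place; the slices share their backing
// arrays with the converted value.
func sortByFullScore(d seqsAndScores) {
	sort.Sort(byScore(d))
}

func main() {
	d := seqsAndScores{
		Seqs:       []string{"a", "b", "c"},
		KmerScores: []float64{0.8, 0.2, 0.4},
		FullScores: []float64{0.9, 0.1, 0.5},
	}
	sortByFullScore(d)
	fmt.Println(d.Seqs, d.FullScores) // prints "[b c a] [0.1 0.5 0.9]"
}
```

To sort highest score first, the value can be wrapped with sort.Reverse before calling sort.Sort.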
