repeatgenome

package

v0.0.0-...-69c58a0 Published: Nov 17, 2015 License: ISC Imports: 18 Imported by: 0

Repository

github.com/mmcco/jh-bio

Documentation ¶

Index ¶

func DebugSeq()
func Less(a, b []byte) bool
func TSRevComp(seq []byte) []byte
type Chroms
type ClassID
type ClassNode
- func (classNode *ClassNode) Size() uint64
type ClassNodes
- func (classNodes ClassNodes) Write(filename string) error
type ClassTree
- func (classTree *ClassTree) PrintBranches()
- func (classTree *ClassTree) PrintTree()
type Config
type JSONNode
type KRespPair
type Kmer
- func (kmer Kmer) ClassID() ClassID
- func (kmer Kmer) Int() uint64
- func (kmer *Kmer) SetClassID(classID ClassID)
- func (kmer *Kmer) SetInt(kmerInt uint64)
type KmerInts
type Kmers
- func (kmers Kmers) Len() int
- func (kmers Kmers) Less(i, j int) bool
- func (kmers Kmers) Swap(i, j int)
type MRespPair
type MinInts
- func (minInts MinInts) Len() int
- func (minInts MinInts) Less(i, j int) bool
- func (minInts MinInts) Swap(i, j int)
type PKmers
- func (pkmers PKmers) Len() int
- func (pkmers PKmers) Less(i, j int) bool
- func (pkmers PKmers) Swap(i, j int)
type ReadResponse
- func (readResp ReadResponse) HangingSize() uint64
type ReadSAM
- func GetReadSAMs(readsDirPath string) (error, []ReadSAM)
type ReadSAMRepeat
type ReadSAMResponse
type ReducePair
type Repeat
- func (repeat *Repeat) Print()
- func (repeat *Repeat) Size() uint64
type RepeatGenome
- func New(config Config) (error, *RepeatGenome)
- func (rg *RepeatGenome) AvgPossPercentGenome(resps []ReadResponse, strict bool) float64
- func (rg *RepeatGenome) GetClassChan(reads [][]byte, useLCA bool) chan ReadResponse
- func (rg *RepeatGenome) GetKmerMap() (int, int, map[uint64]*Repeat)
- func (rg *RepeatGenome) GetMatchSpans() map[string]matchSpans
- func (rg *RepeatGenome) GetMinMap() (int, int, map[uint32]*Repeat)
- func (rg *RepeatGenome) GetProcReads() (error, [][]byte)
- func (rg *RepeatGenome) GetReads() (error, [][]byte)
- func (rg *RepeatGenome) KmerClassifyRead(readSAM ReadSAM, kmerMap map[uint64]*Repeat, wg *sync.WaitGroup, ...)
- func (rg *RepeatGenome) KmerClassifyReadVerb(readSAM ReadSAM, kmerMap map[uint64]*Repeat, wg *sync.WaitGroup, ...)
- func (rg *RepeatGenome) KmersGBSize() float64
- func (rg *RepeatGenome) LCA_ClassifyReads(readTextSeqs [][]byte, responseChan chan ReadResponse)
- func (rg *RepeatGenome) MinClassifyRead(readSAM ReadSAM, minMap map[uint32]*Repeat, wg *sync.WaitGroup, ...)
- func (rg *RepeatGenome) MinClassifyReadVerb(readSAM ReadSAM, minMap map[uint32]*Repeat, wg *sync.WaitGroup, ...)
- func (rg *RepeatGenome) PercentRepeats() float64
- func (rg *RepeatGenome) PercentTrueClassifications(responses []ReadSAMResponse, useStrict bool) float64
- func (refGenome *RepeatGenome) PrintChromInfo()
- func (rg *RepeatGenome) QuickClassifyReads(readTextSeqs [][]byte, responseChan chan ReadResponse)
- func (rg *RepeatGenome) ReadKraken(infile *os.File) error
- func (rg *RepeatGenome) RepeatIsCorrect(readSAMRepeat ReadSAMRepeat, strict bool) bool
- func (rg *RepeatGenome) RunDebugTests()
- func (repeatGenome *RepeatGenome) Size() uint64
- func (rg *RepeatGenome) SplitChromsK() (chan KRespPair, chan KRespPair)
- func (rg *RepeatGenome) SplitChromsM() (chan MRespPair, chan MRespPair)
- func (rg *RepeatGenome) WriteClassJSON(useCumSize, printLeaves bool) error
- func (rg *RepeatGenome) WriteKraken() error
- func (rg *RepeatGenome) WriteStatData() error
type Repeats
- func (repeats Repeats) Write(filename string) error
type ResponsePair
type Seq
- func GetSeq(textSeq []byte) Seq
- func (seq Seq) GetBase(i uint64) uint8
- func (seq Seq) Print()
- func (seq Seq) Subseq(a, b uint64) Seq
type Seqs

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func DebugSeq ¶

func DebugSeq()

func Less ¶

func Less(a, b []byte) bool

Returns a bool describing whether the first TextSeq is lexicographically smaller than the second.

func TSRevComp ¶

func TSRevComp(seq []byte) []byte

Returns the reverse complement of the supplied TextSeq. This drains memory and should therefore not be used outside of debugging and printing.
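As an illustration of why this allocates, here is a minimal sketch of a reverse complement over a lowercase text sequence. The name revCompSketch and the choice to map unrecognized bytes to 'n' are assumptions of this sketch, not the package's implementation.

```go
package main

import "fmt"

// revCompSketch returns a newly allocated reverse complement of a
// lowercase text sequence. Hypothetical helper: bytes other than
// a, c, g, t are mapped to 'n' (an assumption of this sketch).
func revCompSketch(seq []byte) []byte {
	out := make([]byte, len(seq)) // one fresh allocation per call
	for i, b := range seq {
		var c byte
		switch b {
		case 'a':
			c = 't'
		case 'c':
			c = 'g'
		case 'g':
			c = 'c'
		case 't':
			c = 'a'
		default:
			c = 'n'
		}
		// Write complements back to front to reverse the sequence.
		out[len(seq)-1-i] = c
	}
	return out
}

func main() {
	fmt.Println(string(revCompSketch([]byte("aacg")))) // cgtt
}
```

Because every call allocates a slice as long as its input, a function of this shape is best kept out of hot paths, which matches the warning above.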

Types ¶

type Chroms ¶

type Chroms map[string](map[string][]byte)

A 2-dimensional map used to represent a newly-parsed FASTA-formatted reference genome.

type ClassID ¶

type ClassID uint16

A type synonym representing a ClassNode by ID. Used to space-efficiently store a read's classification.

type ClassNode ¶

type ClassNode struct {
	Name     string
	ID       ClassID
	Class    []string
	Parent   *ClassNode
	Children []*ClassNode
	Repeat   *Repeat
}

ClassNode.Name - This ClassNode's fully qualified name, excluding root.

ClassNode.ID - A unique ID starting at 0 that we assign (not included in RepeatMasker output). Root has ID 0.

ClassNode.Class - This ClassNode's name cut on "/". This likely isn't necessary, and may be removed in the future.

ClassNode.Parent - A pointer to this ClassNode's parent in the ancestry tree. It should be nil for root and only for root.

ClassNode.Children - A slice containing pointers to all of this ClassNode's children in the tree.

ClassNode.Repeat - A pointer to this ClassNode's corresponding Repeat, if it has one. This field is of dubious value.

func (*ClassNode) Size ¶

func (classNode *ClassNode) Size() uint64

Returns the sum of the sizes of all repeat instances in the supplied ClassNode's subtree.
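A subtree sum of this kind can be sketched recursively. The sizeNode type below is a simplified, hypothetical stand-in for ClassNode in which ownSize plays the role of the node's own Repeat size (zero when it has none).

```go
package main

import "fmt"

// sizeNode is a simplified stand-in for ClassNode (hypothetical).
type sizeNode struct {
	ownSize  uint64 // size of this node's own repeat instances, if any
	children []*sizeNode
}

// subtreeSize adds the node's own size to the sizes of all descendants.
func (n *sizeNode) subtreeSize() uint64 {
	if n == nil {
		return 0
	}
	total := n.ownSize
	for _, child := range n.children {
		total += child.subtreeSize()
	}
	return total
}

func main() {
	leafA := &sizeNode{ownSize: 5}
	leafB := &sizeNode{ownSize: 7}
	mid := &sizeNode{ownSize: 10, children: []*sizeNode{leafA, leafB}}
	root := &sizeNode{children: []*sizeNode{mid, {ownSize: 3}}}
	fmt.Println(root.subtreeSize()) // 25
}
```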

type ClassNodes ¶

type ClassNodes []*ClassNode

func (ClassNodes) Write ¶

func (classNodes ClassNodes) Write(filename string) error

type ClassTree ¶

type ClassTree struct {
	ClassNodes map[string](*ClassNode)
	NodesByID  []*ClassNode
	Root       *ClassNode
}

ClassTree.ClassNodes - Maps a fully qualified class name (excluding root) to that class's ClassNode struct, if it exists. This is slower than ClassTree.NodesByID, and should only be used when necessary.

ClassTree.NodesByID - A slice of pointers to all ClassNode structs, indexed by ID. This should be the default means of accessing a ClassNode.

ClassTree.Root - A pointer to the ClassTree's root, which has name "root" and ID 0. We explicitly create this - it isn't present in the RepeatMasker output.

func (*ClassTree) PrintBranches ¶

func (classTree *ClassTree) PrintBranches()

Doesn't print leaves. Prevents the terminal from being flooded with Unknowns, Others, and Simple Repeats.

func (*ClassTree) PrintTree ¶

func (classTree *ClassTree) PrintTree()

type Config ¶

type Config struct {
	Name       string
	Debug      bool
	CPUProfile bool
	MemProfile bool
	WriteLib   bool
	ForceGen   bool
	WriteStats bool
}

A value of type Config is passed to the New() function, which constructs and returns a new RepeatGenome.

type JSONNode ¶

type JSONNode struct {
	Name     string      `json:"name"`
	Size     uint64      `json:"size"`
	Children []*JSONNode `json:"children"`
}

Used only for recursively writing the JSON representation of the ClassTree.

type KRespPair ¶

type KRespPair struct {
	KmerInt uint64
	Repeat  *Repeat
}

type Kmer ¶

type Kmer [10]byte

This is what is stored by the main Kraken data structure, RepeatGenome.Kmers. The first eight bytes are the integer representation of the kmer's sequence (type KmerInt). The last two bytes are the class ID (type ClassID).
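The 10-byte layout can be sketched with encoding/binary. The type kmerSketch and its method names are hypothetical, and little-endian byte order is an assumption of this sketch; the byte order the package actually uses is not stated here.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// kmerSketch mirrors the documented layout: bytes 0-7 hold the kmer
// integer, bytes 8-9 hold the class ID. Byte order is assumed.
type kmerSketch [10]byte

func (k *kmerSketch) setInt(v uint64) {
	binary.LittleEndian.PutUint64(k[0:8], v)
}

func (k kmerSketch) intVal() uint64 {
	return binary.LittleEndian.Uint64(k[0:8])
}

func (k *kmerSketch) setClassID(id uint16) {
	binary.LittleEndian.PutUint16(k[8:10], id)
}

func (k kmerSketch) classID() uint16 {
	return binary.LittleEndian.Uint16(k[8:10])
}

func main() {
	var k kmerSketch
	k.setInt(27)
	k.setClassID(42)
	fmt.Println(k.intVal(), k.classID()) // 27 42
}
```

Packing both values into a fixed-size array keeps each element at exactly 10 bytes, with no padding of the kind a struct holding a uint64 and a uint16 would normally incur.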

func (Kmer) ClassID ¶

func (kmer Kmer) ClassID() ClassID

A more declarative and modifiable accessor function. While it would almost certainly be inlined, this is such a performance-critical operation that this function isn't currently used.

func (Kmer) Int ¶

func (kmer Kmer) Int() uint64

A more declarative and modifiable accessor function. While it would almost certainly be inlined, this is such a performance-critical operation that this function isn't currently used.

func (*Kmer) SetClassID ¶

func (kmer *Kmer) SetClassID(classID ClassID)

func (*Kmer) SetInt ¶

func (kmer *Kmer) SetInt(kmerInt uint64)

type KmerInts ¶

type KmerInts []uint64

A two-bits-per-base sequence of up to 31 bases, with low-order bits occupied first.

00 = 'a'
01 = 'c'
10 = 'g'
11 = 't'

The definition of KmerInt was previously here, but I reverted to uint64 for simplicity.
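The two-bit code above can be sketched as follows. The function name encodeKmer is hypothetical, and the reading of "low-order bits occupied first" used here (a k-base sequence fills the low 2k bits, with the last base in the lowest two bits) is an assumption of this sketch.

```go
package main

import "fmt"

// baseBits maps a lowercase base to its two-bit code from the table above.
func baseBits(b byte) (uint64, bool) {
	switch b {
	case 'a':
		return 0, true
	case 'c':
		return 1, true
	case 'g':
		return 2, true
	case 't':
		return 3, true
	}
	return 0, false
}

// encodeKmer packs up to 31 bases into the low-order bits of a uint64.
// Hypothetical helper; the bit placement is an assumption of this sketch.
func encodeKmer(seq []byte) (uint64, error) {
	if len(seq) > 31 {
		return 0, fmt.Errorf("sequence of %d bases exceeds 31", len(seq))
	}
	var v uint64
	for _, b := range seq {
		bits, ok := baseBits(b)
		if !ok {
			return 0, fmt.Errorf("invalid base %q", b)
		}
		v = v<<2 | bits // shift earlier bases up, append the new one
	}
	return v, nil
}

func main() {
	v, err := encodeKmer([]byte("acgt"))
	fmt.Println(v, err) // 27 <nil>
}
```

With this placement, integer order on equal-length kmers matches lexicographic order on their text, which is convenient for sorting.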

type Kmers ¶

type Kmers []Kmer

func (Kmers) Len ¶

func (kmers Kmers) Len() int

func (Kmers) Less ¶

func (kmers Kmers) Less(i, j int) bool

func (Kmers) Swap ¶

func (kmers Kmers) Swap(i, j int)

type MRespPair ¶

type MRespPair struct {
	MinInt uint32
	Repeat *Repeat
}

type MinInts ¶

type MinInts []uint32

A two-bits-per-base sequence of up to 15 bases, with low-order bits occupied first.

The definition of MinInt was previously here, but I reverted to uint32 for simplicity.

func (MinInts) Len ¶

func (minInts MinInts) Len() int

func (MinInts) Less ¶

func (minInts MinInts) Less(i, j int) bool

func (MinInts) Swap ¶

func (minInts MinInts) Swap(i, j int)

type PKmers ¶

type PKmers []*Kmer

func (PKmers) Len ¶

func (pkmers PKmers) Len() int

Needed for sort.Interface.

func (PKmers) Less ¶

func (pkmers PKmers) Less(i, j int) bool

func (PKmers) Swap ¶

func (pkmers PKmers) Swap(i, j int)
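Len, Less, and Swap together are what sort.Interface requires. A minimal sketch on a slice of pointers follows; the item type and the ordering by integer value are hypothetical, since the ordering PKmers actually uses is not documented here.

```go
package main

import (
	"fmt"
	"sort"
)

// item and pItems are hypothetical stand-ins for Kmer and PKmers.
type item struct{ val uint64 }
type pItems []*item

func (p pItems) Len() int           { return len(p) }
func (p pItems) Less(i, j int) bool { return p[i].val < p[j].val }
func (p pItems) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }

// sortedVals sorts through the pointer slice, so only pointers are
// swapped and the underlying items never move.
func sortedVals(vals []uint64) []uint64 {
	ps := make(pItems, len(vals))
	for i := range vals {
		ps[i] = &item{val: vals[i]}
	}
	sort.Sort(ps)
	out := make([]uint64, len(ps))
	for i, p := range ps {
		out[i] = p.val
	}
	return out
}

func main() {
	fmt.Println(sortedVals([]uint64{3, 1, 2})) // [1 2 3]
}
```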

type ReadResponse ¶

type ReadResponse struct {
	Seq       []byte
	ClassNode *ClassNode
}

The type sent back from read-classifying goroutines of RepeatGenome.ClassifyReads()

func (ReadResponse) HangingSize ¶

func (readResp ReadResponse) HangingSize() uint64

Returns the number of base pairs from which the supplied read could have originated, assuming that its classification was correct. This is done in terms of Kraken-Q logic, meaning that there is at least one kmer shared between the repeat reference and the read. Therefore, the read must overlap a repeat reference from the classified subtree by at least k bases. This function is used to calculate the probability of correct classification assuming random selection, and the amount to which a classification narrows a read's potential origin.

type ReadSAM ¶

type ReadSAM struct {
	TextSeq  []byte
	SeqName  string
	StartInd uint64
}

func GetReadSAMs ¶

func GetReadSAMs(readsDirPath string) (error, []ReadSAM)

Passes all file names in the dir to parseReadSAMs and returns the concatenated results.

type ReadSAMRepeat ¶

type ReadSAMRepeat struct {
	ReadSAM ReadSAM
	Repeat  *Repeat
}

type ReadSAMResponse ¶

type ReadSAMResponse struct {
	ReadSAM   ReadSAM
	ClassNode *ClassNode
}

type ReducePair ¶

type ReducePair struct {
	LcaPtr *ClassID
	Set    Kmers
}

type Repeat ¶

type Repeat struct {
	ID        uint64
	Name      string
	ClassList []string
	ClassNode *ClassNode
	Instances []*bioutils.Match
}

Repeat.ID - A unique ID that we assign (not included in RepeatMasker output). Because these are assigned in the order in which they are encountered in <genome name>.fa.out, they are not compatible across even different versions of the same reference genome. This may change.

Repeat.Name - The repeat's fully qualified name, excluding root.

Repeat.ClassList - A slice of this Repeat's class ancestry from the top of the tree down, excluding root.

Repeat.ClassNode - A pointer to the ClassNode which corresponds to this repeat.

Repeat.Instances - A slice of pointers to all matches that are instances of this repeat.

func (*Repeat) Print ¶

func (repeat *Repeat) Print()

func (*Repeat) Size ¶

func (repeat *Repeat) Size() uint64

Returns the sum of the sizes of all of a repeat sequence type's instances.

type RepeatGenome ¶

type RepeatGenome struct {
	Name string

	Kmers      Kmers
	MinOffsets []int64
	MinCounts  []uint32
	SortedMins MinInts
	Matches    bioutils.Matches
	ClassTree  ClassTree
	Repeats    Repeats
	RepeatMap  map[string]*Repeat
	// contains filtered or unexported fields
}

RepeatGenome.Name - The name of the reference genome, such as "dm3" or "hg38". This is used to name created directories, and to find directories and files that may be read from, such as a stored Kraken library and reference sequences.

RepeatGenome.chroms - A 2-dimensional map mapping a chromosome name to a map of its sequence names to their sequences (in text form). Actual 2-dimensional mapping is currently impossible because of RepeatMasker's 1-dimensional output.

RepeatGenome.Kmers - A slice of all Kmers, sorted primarily by minimizer and secondarily by lexicographical value.

RepeatGenome.MinOffsets - Maps a minimizer to its offset in the Kmers slice, or -1 if no kmers of this minimizer were stored.

RepeatGenome.MinCounts - Maps a minimizer to the number of stored kmers associated with it.

RepeatGenome.SortedMins - A sorted slice of all minimizers of stored kmers.

RepeatGenome.Matches - All matches, indexed by their assigned IDs.

RepeatGenome.ClassTree - Contains all information used for LCA determination and read classification. It may eventually be collapsed into RepeatGenome, as accessing it is rather verbose.

RepeatGenome.Repeats - A slice of all repeats, indexed by their assigned IDs.

RepeatGenome.RepeatMap - Maps a fully qualified repeat name, excluding root, to its struct.
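As an illustration of how MinOffsets and MinCounts could work together to find the stored kmers that share a minimizer, here is a minimal sketch. The minIndex type and its methods are hypothetical, kmers are reduced to plain uint64 values, and each bucket is assumed to be sorted so that binary search applies.

```go
package main

import (
	"fmt"
	"sort"
)

// minIndex is a hypothetical, reduced model of the Kmers, MinOffsets,
// and MinCounts fields: kmers are grouped by minimizer, and the two
// side slices are indexed by minimizer value.
type minIndex struct {
	kmers      []uint64
	minOffsets []int64  // start of each minimizer's bucket, or -1
	minCounts  []uint32 // number of kmers in each bucket
}

// bucket returns the kmers stored under the given minimizer.
func (ix minIndex) bucket(min uint32) []uint64 {
	if int(min) >= len(ix.minOffsets) {
		return nil
	}
	off := ix.minOffsets[min]
	if off < 0 {
		return nil // no kmers stored for this minimizer
	}
	return ix.kmers[off : off+int64(ix.minCounts[min])]
}

// contains binary-searches the minimizer's bucket for the kmer.
func (ix minIndex) contains(min uint32, kmer uint64) bool {
	b := ix.bucket(min)
	i := sort.Search(len(b), func(i int) bool { return b[i] >= kmer })
	return i < len(b) && b[i] == kmer
}

func main() {
	ix := minIndex{
		kmers:      []uint64{5, 9, 2, 7, 8},
		minOffsets: []int64{0, -1, 2},
		minCounts:  []uint32{2, 0, 3},
	}
	fmt.Println(ix.contains(0, 9), ix.contains(1, 9), ix.contains(2, 7)) // true false true
}
```

Restricting the search to one bucket is what makes the minimizer grouping pay off: a lookup touches only a small, contiguous region of the kmer slice.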

func New ¶

func New(config Config) (error, *RepeatGenome)

func (*RepeatGenome) AvgPossPercentGenome ¶

func (rg *RepeatGenome) AvgPossPercentGenome(resps []ReadResponse, strict bool) float64

Returns the average percent of the genome a read from the given set could have originated from, assuming their classification was correct. This is used to estimate how much the classification assisted us in locating reads' origins. The more specific and helpful the classifications are, the lower the percentage will be. Uses a cumulative average to prevent overflow.
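A cumulative (running) average updates the mean one value at a time instead of accumulating a large sum. A minimal sketch, with the hypothetical name cumulativeAvg:

```go
package main

import "fmt"

// cumulativeAvg computes the mean incrementally using
// avg_n = avg_(n-1) + (x_n - avg_(n-1)) / n,
// so no running total is ever held. Returns 0 for empty input.
func cumulativeAvg(xs []float64) float64 {
	var avg float64
	for i, x := range xs {
		avg += (x - avg) / float64(i+1)
	}
	return avg
}

func main() {
	fmt.Println(cumulativeAvg([]float64{2, 4, 6})) // 4
}
```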

func (*RepeatGenome) GetClassChan ¶

func (rg *RepeatGenome) GetClassChan(reads [][]byte, useLCA bool) chan ReadResponse

Dispatches as many read-classifying goroutines as there are CPUs, giving each a subslice of the slice of reads provided. Each read-classifying goroutine is given a unique response chan. These are then merged into a single response chan, which is the return value. The useLCA parameter determines whether to use Quick or LCA read classification logic.
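The dispatch-and-merge pattern described above can be sketched as a fan-out followed by a fan-in. Everything named here (response, classifyChunk, classChan) is hypothetical, and the classification itself is a placeholder.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// response is a hypothetical stand-in for ReadResponse.
type response struct {
	seq   []byte
	class string
}

// classifyChunk is a placeholder worker: it labels every read and
// closes its own channel when its chunk is done.
func classifyChunk(reads [][]byte, out chan<- response) {
	defer close(out)
	for _, r := range reads {
		out <- response{seq: r, class: "unclassified"}
	}
}

// classChan splits reads into roughly one chunk per CPU, starts a
// worker per chunk with its own channel, and merges all of them.
func classChan(reads [][]byte) <-chan response {
	n := runtime.NumCPU()
	chunk := (len(reads) + n - 1) / n // ceiling division
	merged := make(chan response)
	var wg sync.WaitGroup
	for start := 0; start < len(reads); start += chunk {
		end := start + chunk
		if end > len(reads) {
			end = len(reads)
		}
		c := make(chan response)
		go classifyChunk(reads[start:end], c)
		wg.Add(1)
		go func(c <-chan response) { // forward one worker's output
			defer wg.Done()
			for r := range c {
				merged <- r
			}
		}(c)
	}
	go func() { // close the merged channel once every worker is drained
		wg.Wait()
		close(merged)
	}()
	return merged
}

func main() {
	reads := [][]byte{[]byte("acgt"), []byte("ttga"), []byte("ccat")}
	count := 0
	for range classChan(reads) {
		count++
	}
	fmt.Println(count) // 3
}
```

Results arrive on the merged channel in no particular order, so a caller that needs to pair responses with reads has to rely on the sequence carried in each response.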

func (*RepeatGenome) GetKmerMap ¶

func (rg *RepeatGenome) GetKmerMap() (int, int, map[uint64]*Repeat)

func (*RepeatGenome) GetMatchSpans ¶

func (rg *RepeatGenome) GetMatchSpans() map[string]matchSpans

func (*RepeatGenome) GetMinMap ¶

func (rg *RepeatGenome) GetMinMap() (int, int, map[uint32]*Repeat)

func (*RepeatGenome) GetProcReads ¶

func (rg *RepeatGenome) GetProcReads() (error, [][]byte)

A rather hairy function that classifies all reads in ./<genome-name>-reads/*.proc if any exist. .proc files are our own creation for ease of parsing and testing. They contain one lowercase read sequence per line, and nothing else. We have a script that will convert FASTQ files to .proc files: github.com/mmcco/bioinformatics/blob/master/scripts/format-FASTA-reads.py. This is generally really easy to do. However, we will use a FASTQ reader when we get past the initial testing phase. This could be done concurrently, considering how many disk accesses there are.

func (*RepeatGenome) GetReads ¶

func (rg *RepeatGenome) GetReads() (error, [][]byte)

func (*RepeatGenome) KmerClassifyRead ¶

func (rg *RepeatGenome) KmerClassifyRead(readSAM ReadSAM, kmerMap map[uint64]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)

func (*RepeatGenome) KmerClassifyReadVerb ¶

func (rg *RepeatGenome) KmerClassifyReadVerb(readSAM ReadSAM, kmerMap map[uint64]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)

func (*RepeatGenome) KmersGBSize ¶

func (rg *RepeatGenome) KmersGBSize() float64

Returns the size in gigabytes of the supplied RepeatGenome's Kmers field.

func (*RepeatGenome) LCA_ClassifyReads ¶

func (rg *RepeatGenome) LCA_ClassifyReads(readTextSeqs [][]byte, responseChan chan ReadResponse)

Classifies each read in a slice of reads, stored as type []byte. The read and its classification are returned through responseChan. In the future, the reads may be of type Seq. However, this currently seems to be the fastest way of doing things. This version returns the LCA of all recognized kmers' classifications.
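Taking the LCA of several classifications can be sketched on a tree with parent pointers, as ClassNode has. The lcaNode type and the functions below are hypothetical, and all nodes are assumed to belong to the same tree.

```go
package main

import "fmt"

// lcaNode is a reduced, hypothetical stand-in for ClassNode.
type lcaNode struct {
	name   string
	parent *lcaNode // nil only for the root
}

// depth counts the edges between n and the root.
func depth(n *lcaNode) int {
	d := 0
	for ; n.parent != nil; n = n.parent {
		d++
	}
	return d
}

// lca returns the lowest common ancestor of a and b.
func lca(a, b *lcaNode) *lcaNode {
	da, db := depth(a), depth(b)
	for ; da > db; da-- { // lift the deeper node to the same level
		a = a.parent
	}
	for ; db > da; db-- {
		b = b.parent
	}
	for a != b { // climb in lockstep until the paths meet
		a, b = a.parent, b.parent
	}
	return a
}

// lcaAll folds lca over a set of nodes; nil for an empty set.
func lcaAll(nodes []*lcaNode) *lcaNode {
	if len(nodes) == 0 {
		return nil
	}
	res := nodes[0]
	for _, n := range nodes[1:] {
		res = lca(res, n)
	}
	return res
}

func main() {
	root := &lcaNode{name: "root"}
	ltr := &lcaNode{name: "LTR", parent: root}
	gypsy := &lcaNode{name: "LTR/Gypsy", parent: ltr}
	copia := &lcaNode{name: "LTR/Copia", parent: ltr}
	fmt.Println(lcaAll([]*lcaNode{gypsy, copia}).name) // LTR
}
```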

func (*RepeatGenome) MinClassifyRead ¶

func (rg *RepeatGenome) MinClassifyRead(readSAM ReadSAM, minMap map[uint32]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)

func (*RepeatGenome) MinClassifyReadVerb ¶

func (rg *RepeatGenome) MinClassifyReadVerb(readSAM ReadSAM, minMap map[uint32]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)

func (*RepeatGenome) PercentRepeats ¶

func (rg *RepeatGenome) PercentRepeats() float64

Returns the percent of a RepeatGenome's reference bases that are contained in a repeat instance. It makes the assumption that no base is contained in more than one repeat instance.

func (*RepeatGenome) PercentTrueClassifications ¶

func (rg *RepeatGenome) PercentTrueClassifications(responses []ReadSAMResponse, useStrict bool) float64

Determines whether a read overlaps any repeat instances in the given ClassNode's subtree. If the argument strict is true, the read must be entirely contained in a reference repeat instance (classic Kraken logic). Otherwise, the read must overlap a reference repeat instance by at least k bases.
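The strict and non-strict rules can be sketched on coordinate intervals. The span type, the function name, and the half-open [start, end) convention are assumptions of this sketch, not the package's API.

```go
package main

import "fmt"

// span is a hypothetical half-open interval [start, end) on a sequence.
type span struct{ start, end uint64 }

// readMatches applies the two rules: in strict mode the read must lie
// entirely inside the repeat instance; otherwise the two must share at
// least k bases.
func readMatches(read, repeat span, k uint64, strict bool) bool {
	if strict {
		return read.start >= repeat.start && read.end <= repeat.end
	}
	lo, hi := read.start, read.end
	if repeat.start > lo {
		lo = repeat.start
	}
	if repeat.end < hi {
		hi = repeat.end
	}
	// Check hi > lo first so the unsigned subtraction cannot wrap.
	return hi > lo && hi-lo >= k
}

func main() {
	repeat := span{90, 200}
	fmt.Println(readMatches(span{100, 150}, repeat, 31, true)) // true
	fmt.Println(readMatches(span{80, 130}, repeat, 31, true))  // false
	fmt.Println(readMatches(span{80, 130}, repeat, 31, false)) // true
}
```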

func (*RepeatGenome) PrintChromInfo ¶

func (refGenome *RepeatGenome) PrintChromInfo()

func (*RepeatGenome) QuickClassifyReads ¶

func (rg *RepeatGenome) QuickClassifyReads(readTextSeqs [][]byte, responseChan chan ReadResponse)

Classifies each read in a slice of reads, stored as type []byte. The read and its classification are returned through responseChan. In the future, the reads may be of type Seq. However, this currently seems to be the fastest way of doing things. This version uses the first recognized kmer for classification - the Kraken-Q technique.

func (*RepeatGenome) ReadKraken ¶

func (rg *RepeatGenome) ReadKraken(infile *os.File) error

Has a lot of error handling, but the logic is pretty simple.

func (*RepeatGenome) RepeatIsCorrect ¶

func (rg *RepeatGenome) RepeatIsCorrect(readSAMRepeat ReadSAMRepeat, strict bool) bool

func (*RepeatGenome) RunDebugTests ¶

func (rg *RepeatGenome) RunDebugTests()

func (*RepeatGenome) Size ¶

func (repeatGenome *RepeatGenome) Size() uint64

Returns the total number of bases in a RepeatGenome's reference chromosomes.

func (*RepeatGenome) SplitChromsK ¶

func (rg *RepeatGenome) SplitChromsK() (chan KRespPair, chan KRespPair)

func (*RepeatGenome) SplitChromsM ¶

func (rg *RepeatGenome) SplitChromsM() (chan MRespPair, chan MRespPair)

func (*RepeatGenome) WriteClassJSON ¶

func (rg *RepeatGenome) WriteClassJSON(useCumSize, printLeaves bool) error

Writes a JSON representation of the class tree. Used by the JavaScript visualization, among other things. Currently, each node is associated with a value "size", the number of kmers associated with it. useCumSize determines whether the kmer count is cumulative, counting all kmers in its subtree.
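The shape of the output can be sketched by marshaling a small tree with the same field tags as JSONNode. The type jsonNodeSketch and the helper encodeTree are hypothetical, and the sizes are made-up illustration values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonNodeSketch copies the documented JSONNode fields and tags.
type jsonNodeSketch struct {
	Name     string            `json:"name"`
	Size     uint64            `json:"size"`
	Children []*jsonNodeSketch `json:"children"`
}

// encodeTree marshals the tree rooted at root; encoding/json recurses
// through the Children pointers on its own.
func encodeTree(root *jsonNodeSketch) (string, error) {
	b, err := json.Marshal(root)
	return string(b), err
}

func main() {
	root := &jsonNodeSketch{
		Name:     "root",
		Size:     3,
		Children: []*jsonNodeSketch{{Name: "LTR", Size: 3}},
	}
	out, err := encodeTree(root)
	fmt.Println(out, err)
}
```

Note that a node with a nil Children slice is written as "children":null rather than an empty array; a consumer that expects arrays would need the slice initialized to an empty, non-nil value.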

func (*RepeatGenome) WriteKraken ¶

func (rg *RepeatGenome) WriteKraken() error

func (*RepeatGenome) WriteStatData ¶

func (rg *RepeatGenome) WriteStatData() error

type Repeats ¶

type Repeats []*Repeat

func (Repeats) Write ¶

func (repeats Repeats) Write(filename string) error

type ResponsePair ¶

type ResponsePair struct {
	Kmer   Kmer
	MinInt uint32
}

The type returned by RepeatGenome.getMatchKmers(), which processes raw kmers. The LCA contained in the Kmer value is not the Kmer's final LCA, but simply the ClassNode ID of the match this instance of the Kmer came from.

type Seq ¶

type Seq struct {
	Bytes []byte
	Len   uint64
}

Each base is represented by two bits. High-order bits are occupied first. Remember that Seq.Len is the number of bases contained, while len(Seq.Bytes) is the number of bytes necessary to represent them.
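The packed layout can be sketched as follows. The seqSketch type and function names are hypothetical, and the base codes (a=0, c=1, g=2, t=3) are borrowed from the KmerInts table as an assumption.

```go
package main

import "fmt"

// seqSketch mirrors the documented Seq fields (hypothetical copy).
type seqSketch struct {
	bytes  []byte // four bases per byte, high-order bits first
	length uint64 // number of bases, not bytes
}

// packSeq converts a text sequence to the two-bit form.
func packSeq(text []byte) (seqSketch, error) {
	s := seqSketch{
		bytes:  make([]byte, (len(text)+3)/4), // round up to whole bytes
		length: uint64(len(text)),
	}
	for i, b := range text {
		var bits byte
		switch b {
		case 'a', 'A':
			bits = 0
		case 'c', 'C':
			bits = 1
		case 'g', 'G':
			bits = 2
		case 't', 'T':
			bits = 3
		default:
			return seqSketch{}, fmt.Errorf("invalid base %q", b)
		}
		// Base 0 of each byte goes in bits 7-6, base 3 in bits 1-0.
		s.bytes[i/4] |= bits << uint(6-2*(i%4))
	}
	return s, nil
}

// base returns the two-bit code of the i-th base (zero-indexed).
func (s seqSketch) base(i uint64) uint8 {
	return (s.bytes[i/4] >> (6 - 2*(i%4))) & 3
}

func main() {
	s, err := packSeq([]byte("acgta"))
	fmt.Println(len(s.bytes), s.length, s.base(2), err) // 2 5 2 <nil>
}
```

Five bases need two bytes here, which is why the base count has to be stored separately: the unused low bits of the last byte are indistinguishable from trailing 'a' bases.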

func GetSeq ¶

func GetSeq(textSeq []byte) Seq

Converts a TextSeq to the more memory-efficient Seq type. Upper- and lower-case base bytes are currently supported, but stable code should immediately convert to lower-case. The logic works and is sane, but could be altered in the future for brevity and efficiency.

func (Seq) GetBase ¶

func (seq Seq) GetBase(i uint64) uint8

Returns the i-th base of the Seq (zero-indexed).

func (Seq) Print ¶

func (seq Seq) Print()

func (Seq) Subseq ¶

func (seq Seq) Subseq(a, b uint64) Seq

Return the subsequence of the supplied Seq from a (inclusive) to b (exclusive), like a slice.

type Seqs ¶

type Seqs []Seq
