Documentation ¶
Index ¶
- func DebugSeq()
- func Less(a, b []byte) bool
- func TSRevComp(seq []byte) []byte
- type Chroms
- type ClassID
- type ClassNode
- type ClassNodes
- type ClassTree
- type Config
- type JSONNode
- type KRespPair
- type Kmer
- type KmerInts
- type Kmers
- type MRespPair
- type MinInts
- type PKmers
- type ReadResponse
- type ReadSAM
- type ReadSAMRepeat
- type ReadSAMResponse
- type ReducePair
- type Repeat
- type RepeatGenome
- func (rg *RepeatGenome) AvgPossPercentGenome(resps []ReadResponse, strict bool) float64
- func (rg *RepeatGenome) GetClassChan(reads [][]byte, useLCA bool) chan ReadResponse
- func (rg *RepeatGenome) GetKmerMap() (int, int, map[uint64]*Repeat)
- func (rg *RepeatGenome) GetMatchSpans() map[string]matchSpans
- func (rg *RepeatGenome) GetMinMap() (int, int, map[uint32]*Repeat)
- func (rg *RepeatGenome) GetProcReads() (error, [][]byte)
- func (rg *RepeatGenome) GetReads() (error, [][]byte)
- func (rg *RepeatGenome) KmerClassifyRead(readSAM ReadSAM, kmerMap map[uint64]*Repeat, wg *sync.WaitGroup, ...)
- func (rg *RepeatGenome) KmerClassifyReadVerb(readSAM ReadSAM, kmerMap map[uint64]*Repeat, wg *sync.WaitGroup, ...)
- func (rg *RepeatGenome) KmersGBSize() float64
- func (rg *RepeatGenome) LCA_ClassifyReads(readTextSeqs [][]byte, responseChan chan ReadResponse)
- func (rg *RepeatGenome) MinClassifyRead(readSAM ReadSAM, minMap map[uint32]*Repeat, wg *sync.WaitGroup, ...)
- func (rg *RepeatGenome) MinClassifyReadVerb(readSAM ReadSAM, minMap map[uint32]*Repeat, wg *sync.WaitGroup, ...)
- func (rg *RepeatGenome) PercentRepeats() float64
- func (rg *RepeatGenome) PercentTrueClassifications(responses []ReadSAMResponse, useStrict bool) float64
- func (refGenome *RepeatGenome) PrintChromInfo()
- func (rg *RepeatGenome) QuickClassifyReads(readTextSeqs [][]byte, responseChan chan ReadResponse)
- func (rg *RepeatGenome) ReadKraken(infile *os.File) error
- func (rg *RepeatGenome) RepeatIsCorrect(readSAMRepeat ReadSAMRepeat, strict bool) bool
- func (rg *RepeatGenome) RunDebugTests()
- func (repeatGenome *RepeatGenome) Size() uint64
- func (rg *RepeatGenome) SplitChromsK() (chan KRespPair, chan KRespPair)
- func (rg *RepeatGenome) SplitChromsM() (chan MRespPair, chan MRespPair)
- func (rg *RepeatGenome) WriteClassJSON(useCumSize, printLeaves bool) error
- func (rg *RepeatGenome) WriteKraken() error
- func (rg *RepeatGenome) WriteStatData() error
- type Repeats
- type ResponsePair
- type Seq
- type Seqs
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
Types ¶
type Chroms ¶
A 2-dimensional map used to represent a newly-parsed FASTA-formatted reference genome.
type ClassID ¶
type ClassID uint16
A type synonym representing a ClassNode by ID. Used to space-efficiently store a read's classification.
type ClassNode ¶
type ClassNode struct { Name string ID ClassID Class []string Parent *ClassNode Children []*ClassNode Repeat *Repeat }
ClassNode.Name - This ClassNode's fully qualified name, excluding
root.
ClassNode.ID - A unique ID starting at 0 that we assign (not
included in RepeatMasker output). Root has ID 0.
ClassNode.Class - This ClassNode's name cut on "/". This likely
isn't necessary, and may be removed in the future.
ClassNode.Parent - A pointer to this ClassNode's parent in the
ancestry tree. It should be nil for root and only for root.
ClassNode.Children - A slice containing pointers to all of this
ClassNode's children in the tree.
ClassNode.Repeat - A pointer to this ClassNode's corresponding
Repeat, if it has one. This field is of dubious value.
type ClassNodes ¶
type ClassNodes []*ClassNode
func (ClassNodes) Write ¶
func (classNodes ClassNodes) Write(filename string) error
type ClassTree ¶
ClassTree.ClassNodes - Maps a fully qualified class name (excluding
root) to that class's ClassNode struct, if it exists. This is slower than ClassTree.NodesByID, and should only be used when necessary.
ClassTree.NodesByID - A slice of pointers to all ClassNode structs,
indexed by ID. This should be the default means of accessing a ClassNode.
ClassTree.Root - A pointer to the ClassTree's root, which has name
"root" and ID 0. We explicitly create this - it isn't present in the RepeatMasker output.
func (*ClassTree) PrintBranches ¶
func (classTree *ClassTree) PrintBranches()
Doesn't print leaves. Prevents the terminal from being flooded with Unknowns, Others, and Simple Repeats.
type Config ¶
type Config struct { Name string Debug bool CPUProfile bool MemProfile bool WriteLib bool ForceGen bool WriteStats bool }
A value of type Config is passed to the New() function, which constructs and returns a new RepeatGenome.
type JSONNode ¶
type JSONNode struct { Name string `json:"name"` Size uint64 `json:"size"` Children []*JSONNode `json:"children"` }
Used only for recursively writing the JSON representation of the ClassTree.
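As a minimal sketch of how a tree of JSONNode values serializes, the example below copies the struct shown above and marshals a small tree with the standard encoding/json package. The helper marshalTree and the class names and sizes are hypothetical; this is not the package's WriteClassJSON implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// JSONNode mirrors the struct documented above.
type JSONNode struct {
	Name     string      `json:"name"`
	Size     uint64      `json:"size"`
	Children []*JSONNode `json:"children"`
}

// marshalTree (hypothetical helper) serializes a node and, through the
// Children pointers, its whole subtree.
func marshalTree(root *JSONNode) (string, error) {
	b, err := json.Marshal(root)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Made-up class names and kmer counts, for illustration only.
	root := &JSONNode{Name: "root", Size: 3, Children: []*JSONNode{
		{Name: "LINE", Size: 2},
		{Name: "SINE", Size: 1},
	}}
	s, err := marshalTree(root)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Note that a nil Children slice is encoded as JSON null rather than an empty array, which a consumer of the output may need to handle.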
type Kmer ¶
type Kmer [10]byte
This is what is stored by the main Kraken data structure, RepeatGenome.Kmers. The first eight bytes are the integer representation of the kmer's sequence (type KmerInt). The last two bytes are the class ID (type ClassID).
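The layout can be sketched as follows: a 10-byte array holding a 64-bit sequence integer followed by a 16-bit class ID. The helpers packKmer, kmerInt, and kmerClassID are hypothetical, and little-endian byte order is an assumption; the documentation does not state which byte order the package uses.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type ClassID uint16
type Kmer [10]byte

// packKmer (hypothetical) stores seq in the first eight bytes and id in the
// last two. Little-endian order is assumed for illustration.
func packKmer(seq uint64, id ClassID) Kmer {
	var k Kmer
	binary.LittleEndian.PutUint64(k[0:8], seq)
	binary.LittleEndian.PutUint16(k[8:10], uint16(id))
	return k
}

// kmerInt reads back the sequence integer from the first eight bytes.
func kmerInt(k Kmer) uint64 {
	return binary.LittleEndian.Uint64(k[0:8])
}

// kmerClassID reads back the class ID from the last two bytes.
func kmerClassID(k Kmer) ClassID {
	return ClassID(binary.LittleEndian.Uint16(k[8:10]))
}

func main() {
	k := packKmer(0x1234, 7)
	fmt.Println(kmerInt(k), kmerClassID(k))
}
```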
func (Kmer) ClassID ¶
A more declarative and modifiable accessor function. While it would almost certainly be inlined, this is such a performance-critical operation that this function isn't currently used.
func (Kmer) Int ¶
A more declarative and modifiable accessor function. While it would almost certainly be inlined, this is such a performance-critical operation that this function isn't currently used.
func (*Kmer) SetClassID ¶
type KmerInts ¶
type KmerInts []uint64
A two-bits-per-base sequence of up to 31 bases, with low-order bits
occupied first. 00 = 'a', 01 = 'c', 10 = 'g', 11 = 't'. The definition of KmerInt was previously here, but I reverted to uint64 for simplicity.
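A minimal sketch of this encoding, using the base codes listed above. The function encodeKmer is hypothetical, and it reads "low-order bits occupied first" as placing the first base in the two lowest bits; that interpretation is an assumption.

```go
package main

import "fmt"

// encodeKmer (hypothetical) packs up to 31 lowercase bases into a uint64,
// two bits per base, with the first base in the lowest-order bits.
func encodeKmer(seq []byte) (uint64, error) {
	if len(seq) > 31 {
		return 0, fmt.Errorf("sequence of %d bases exceeds 31", len(seq))
	}
	var k uint64
	for i, b := range seq {
		var code uint64
		switch b {
		case 'a':
			code = 0
		case 'c':
			code = 1
		case 'g':
			code = 2
		case 't':
			code = 3
		default:
			return 0, fmt.Errorf("invalid base %q at position %d", b, i)
		}
		k |= code << (2 * uint(i))
	}
	return k, nil
}

func main() {
	k, err := encodeKmer([]byte("acgt"))
	if err != nil {
		panic(err)
	}
	fmt.Println(k) // prints 228 (binary 11100100)
}
```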
type MinInts ¶
type MinInts []uint32
A two-bits-per-base sequence of up to 15 bases, with low-order bits
occupied first. The definition of MinInt was previously here, but I reverted to uint32 for simplicity.
type ReadResponse ¶
The type sent back from read-classifying goroutines of RepeatGenome.ClassifyReads()
func (ReadResponse) HangingSize ¶
func (readResp ReadResponse) HangingSize() uint64
Returns the number of base pairs from which the supplied read could have originated, assuming that its classification was correct. This is done in terms of Kraken-Q logic, meaning that there is at least one kmer shared between the repeat reference and the read. Therefore, the read must overlap a repeat reference from the classified subtree by at least k bases. This function is used to calculate the probability of correct classification assuming random selection, and the amount to which a classification narrows a read's potential origin.
type ReadSAM ¶
func GetReadSAMs ¶
Passes all file names in the dir to parseReadSAMs and returns the concatenated results.
type ReadSAMRepeat ¶
type ReadSAMResponse ¶
type ReducePair ¶
type Repeat ¶
type Repeat struct { ID uint64 Name string ClassList []string ClassNode *ClassNode Instances []*bioutils.Match }
Repeat.ID - A unique ID that we assign (not included in
RepeatMasker output). Because these are assigned in the order in which they are encountered in <genome name>.fa.out, they are not compatible across even different versions of the same reference genome. This may change.
Repeat.Name - The repeat's fully qualified name, excluding root.
Repeat.ClassList - A slice of this Repeat's class ancestry from the
top of the tree down, excluding root.
Repeat.ClassNode - A pointer to the ClassNode which corresponds to
this repeat.
Repeat.Instances - A slice of pointers to all matches that are
instances of this repeat.
type RepeatGenome ¶
type RepeatGenome struct { Name string Kmers Kmers MinOffsets []int64 MinCounts []uint32 SortedMins MinInts Matches bioutils.Matches ClassTree ClassTree Repeats Repeats RepeatMap map[string]*Repeat // contains filtered or unexported fields }
RepeatGenome.Name - The name of the reference genome, such as "dm3"
or "hg38". This is used to name created directories, and to find directories and files that may be read from, such as a stored Kraken library and reference sequences.
RepeatGenome.chroms - A 2-dimensional map mapping a chromosome name
to a map of its sequence names to their sequences (in text form). Actual 2-dimensional mapping is currently impossible because of RepeatMasker's 1-dimensional output.
RepeatGenome.Kmers - A slice of all Kmers, sorted primarily by
minimizer and secondarily by lexicographical value.
RepeatGenome.MinOffsets - Maps a minimizer to its offset in the
Kmers slice, or -1 if no kmers of this minimizer were stored.
RepeatGenome.MinCounts - Maps a minimizer to the number of stored
kmers associated with it.
RepeatGenome.SortedMins - A sorted slice of all minimizers of
stored kmers.
RepeatGenome.Matches - All matches, indexed by their assigned IDs.
RepeatGenome.ClassTree - Contains all information used for LCA
determination and read classification. It may eventually be collapsed into RepeatGenome, as accessing it is rather verbose.
RepeatGenome.Repeats - A slice of all repeats, indexed by their
assigned IDs.
RepeatGenome.RepeatMap - Maps a fully qualified repeat name,
excluding root, to its struct.
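The relationship between Kmers, MinOffsets, and MinCounts described above can be sketched as a lookup of the block of kmers that share a minimizer. The function kmersForMin is hypothetical, kmers are simplified to uint64 values instead of the 10-byte Kmer type, and indexing MinOffsets and MinCounts directly by minimizer value is an assumption drawn from the field descriptions.

```go
package main

import "fmt"

// kmersForMin (hypothetical) returns the contiguous block of kmers stored
// for the given minimizer, or nil if none were stored (offset -1).
func kmersForMin(kmers []uint64, minOffsets []int64, minCounts []uint32, min uint32) []uint64 {
	if int(min) >= len(minOffsets) {
		return nil
	}
	off := minOffsets[min]
	if off < 0 {
		return nil // no kmers of this minimizer were stored
	}
	return kmers[off : off+int64(minCounts[min])]
}

func main() {
	// Toy data: minimizer 0 owns two kmers, minimizer 1 none, minimizer 2 one.
	kmers := []uint64{10, 11, 20}
	minOffsets := []int64{0, -1, 2}
	minCounts := []uint32{2, 0, 1}

	fmt.Println(kmersForMin(kmers, minOffsets, minCounts, 0))
	fmt.Println(kmersForMin(kmers, minOffsets, minCounts, 1))
	fmt.Println(kmersForMin(kmers, minOffsets, minCounts, 2))
}
```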
func New ¶
func New(config Config) (error, *RepeatGenome)
func (*RepeatGenome) AvgPossPercentGenome ¶
func (rg *RepeatGenome) AvgPossPercentGenome(resps []ReadResponse, strict bool) float64
Returns the average percent of the genome from which a read in the given set could have originated, assuming its classification was correct. This is used to estimate how much the classification assisted us in locating reads' origins. The more specific and helpful the classifications are, the lower the percentage will be. Uses a cumulative average to prevent overflow.
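A cumulative (running) average avoids accumulating one large sum, since the mean is updated in place after each value. The sketch below shows the general technique with a hypothetical cumulativeMean function; it is not taken from the package source.

```go
package main

import "fmt"

// cumulativeMean (hypothetical) updates the mean incrementally:
// mean_n = mean_(n-1) + (x_n - mean_(n-1)) / n.
// No running total is kept, so the intermediate value never exceeds
// the range of the inputs.
func cumulativeMean(xs []float64) float64 {
	var mean float64
	for i, x := range xs {
		mean += (x - mean) / float64(i+1)
	}
	return mean
}

func main() {
	fmt.Println(cumulativeMean([]float64{1, 2, 3, 4})) // prints 2.5
}
```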
func (*RepeatGenome) GetClassChan ¶
func (rg *RepeatGenome) GetClassChan(reads [][]byte, useLCA bool) chan ReadResponse
Dispatches as many read-classifying goroutines as there are CPUs, giving each a subslice of the slice of reads provided. Each read-classifying goroutine is given a unique response chan. These are then merged into a single response chan, which is the return value. The useLCA parameter determines whether to use Quick or LCA read classification logic.
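The dispatch-and-merge pattern described here can be sketched as below. The ReadResponse stand-in, the classifyAll function, and the classify callback are all hypothetical simplifications; the real method chooses between Quick and LCA logic through useLCA.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// ReadResponse is a simplified stand-in for the package's type.
type ReadResponse struct {
	Seq   []byte
	Class string
}

// classifyAll (hypothetical) starts one worker per CPU, gives each a
// subslice of reads and its own channel, and merges those channels into one.
func classifyAll(reads [][]byte, classify func([]byte) string) chan ReadResponse {
	numWorkers := runtime.NumCPU()
	merged := make(chan ReadResponse)
	var wg sync.WaitGroup

	for w := 0; w < numWorkers; w++ {
		start := w * len(reads) / numWorkers
		end := (w + 1) * len(reads) / numWorkers
		workerChan := make(chan ReadResponse)

		// Worker: classifies its subslice and closes its own channel.
		go func(sub [][]byte, out chan ReadResponse) {
			defer close(out)
			for _, r := range sub {
				out <- ReadResponse{Seq: r, Class: classify(r)}
			}
		}(reads[start:end], workerChan)

		// Forwarder: copies this worker's responses to the merged channel.
		wg.Add(1)
		go func(in chan ReadResponse) {
			defer wg.Done()
			for resp := range in {
				merged <- resp
			}
		}(workerChan)
	}

	// Close the merged channel once every forwarder has finished.
	go func() {
		wg.Wait()
		close(merged)
	}()
	return merged
}

func main() {
	reads := [][]byte{[]byte("acgt"), []byte("ttga"), []byte("ccca")}
	n := 0
	for range classifyAll(reads, func([]byte) string { return "root" }) {
		n++
	}
	fmt.Println(n, "responses")
}
```

Responses arrive in no particular order, so a caller that needs the original read order would have to carry an index in the response.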
func (*RepeatGenome) GetKmerMap ¶
func (rg *RepeatGenome) GetKmerMap() (int, int, map[uint64]*Repeat)
func (*RepeatGenome) GetMatchSpans ¶
func (rg *RepeatGenome) GetMatchSpans() map[string]matchSpans
func (*RepeatGenome) GetProcReads ¶
func (rg *RepeatGenome) GetProcReads() (error, [][]byte)
A rather hairy function that classifies all reads in ./<genome-name>-reads/*.proc if any exist. .proc files are our own creation for ease of parsing and testing. They contain one lowercase read sequence per line, and nothing else. We have a script that will convert FASTQ files to .proc files: github.com/mmcco/bioinformatics/blob/master/scripts/format-FASTA-reads.py This is generally really easy to do. However, we will use a FASTQ reader when we get past the initial testing phase. This could be done concurrently, considering how many disk accesses there are.
func (*RepeatGenome) GetReads ¶
func (rg *RepeatGenome) GetReads() (error, [][]byte)
func (*RepeatGenome) KmerClassifyRead ¶
func (rg *RepeatGenome) KmerClassifyRead(readSAM ReadSAM, kmerMap map[uint64]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)
func (*RepeatGenome) KmerClassifyReadVerb ¶
func (rg *RepeatGenome) KmerClassifyReadVerb(readSAM ReadSAM, kmerMap map[uint64]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)
func (*RepeatGenome) KmersGBSize ¶
func (rg *RepeatGenome) KmersGBSize() float64
Returns the size in gigabytes of the supplied RepeatGenome's Kmers field.
func (*RepeatGenome) LCA_ClassifyReads ¶
func (rg *RepeatGenome) LCA_ClassifyReads(readTextSeqs [][]byte, responseChan chan ReadResponse)
Classifies each read in a slice of reads, stored as type []byte. The read and its classification are returned through responseChan. In the future, the reads may be of type Seq. However, this currently seems to be the fastest way of doing things. This version returns the LCA of all recognized kmers' classifications.
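Taking the LCA of several classifications can be sketched with the Parent pointers documented on ClassNode. The reduced ClassNode struct and the depth, lca, and lcaAll helpers are hypothetical, and all nodes are assumed to belong to the same tree.

```go
package main

import "fmt"

// ClassNode is reduced to the fields this sketch needs.
type ClassNode struct {
	Name   string
	Parent *ClassNode
}

// depth counts the edges from n up to the root (whose Parent is nil).
func depth(n *ClassNode) int {
	d := 0
	for n.Parent != nil {
		n = n.Parent
		d++
	}
	return d
}

// lca returns the lowest common ancestor of a and b.
func lca(a, b *ClassNode) *ClassNode {
	da, db := depth(a), depth(b)
	for da > db {
		a, da = a.Parent, da-1
	}
	for db > da {
		b, db = b.Parent, db-1
	}
	for a != b {
		a, b = a.Parent, b.Parent
	}
	return a
}

// lcaAll folds lca over every node; it returns nil for an empty slice.
func lcaAll(nodes []*ClassNode) *ClassNode {
	if len(nodes) == 0 {
		return nil
	}
	res := nodes[0]
	for _, n := range nodes[1:] {
		res = lca(res, n)
	}
	return res
}

func main() {
	// Made-up class names, for illustration only.
	root := &ClassNode{Name: "root"}
	line := &ClassNode{Name: "LINE", Parent: root}
	l1 := &ClassNode{Name: "LINE/L1", Parent: line}
	l2 := &ClassNode{Name: "LINE/L2", Parent: line}
	sine := &ClassNode{Name: "SINE", Parent: root}

	fmt.Println(lcaAll([]*ClassNode{l1, l2}).Name)
	fmt.Println(lcaAll([]*ClassNode{l1, l2, sine}).Name)
}
```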
func (*RepeatGenome) MinClassifyRead ¶
func (rg *RepeatGenome) MinClassifyRead(readSAM ReadSAM, minMap map[uint32]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)
func (*RepeatGenome) MinClassifyReadVerb ¶
func (rg *RepeatGenome) MinClassifyReadVerb(readSAM ReadSAM, minMap map[uint32]*Repeat, wg *sync.WaitGroup, c chan ReadSAMRepeat)
func (*RepeatGenome) PercentRepeats ¶
func (rg *RepeatGenome) PercentRepeats() float64
Returns the percent of a RepeatGenome's reference bases that are contained in a repeat instance. It makes the assumption that no base is contained in more than one repeat instance.
func (*RepeatGenome) PercentTrueClassifications ¶
func (rg *RepeatGenome) PercentTrueClassifications(responses []ReadSAMResponse, useStrict bool) float64
Determines whether a read overlaps any repeat instances in the given ClassNode's subtree. If the argument strict is true, the read must be entirely contained in a reference repeat instance (classic Kraken logic). Otherwise, the read must overlap a reference repeat instance by at least k bases.
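The difference between the strict and non-strict criteria can be sketched as an interval test against a single repeat instance. The span type, half-open coordinates, and the overlapLen and readMatches helpers are hypothetical; the real method checks every instance in the classified subtree.

```go
package main

import "fmt"

// span is a hypothetical half-open interval [start, end) on one chromosome.
type span struct{ start, end uint64 }

// overlapLen returns the number of bases shared by a and b.
func overlapLen(a, b span) uint64 {
	lo, hi := a.start, a.end
	if b.start > lo {
		lo = b.start
	}
	if b.end < hi {
		hi = b.end
	}
	if hi <= lo {
		return 0
	}
	return hi - lo
}

// readMatches applies the strict rule (read entirely inside the repeat)
// or the relaxed rule (at least k overlapping bases, assuming k > 0).
func readMatches(read, repeat span, k uint64, strict bool) bool {
	if strict {
		return read.start >= repeat.start && read.end <= repeat.end
	}
	return overlapLen(read, repeat) >= k
}

func main() {
	repeat := span{100, 300}
	read := span{90, 190} // hangs 10 bases off the left edge
	fmt.Println(readMatches(read, repeat, 31, true))
	fmt.Println(readMatches(read, repeat, 31, false))
}
```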
func (*RepeatGenome) PrintChromInfo ¶
func (refGenome *RepeatGenome) PrintChromInfo()
func (*RepeatGenome) QuickClassifyReads ¶
func (rg *RepeatGenome) QuickClassifyReads(readTextSeqs [][]byte, responseChan chan ReadResponse)
Classifies each read in a slice of reads, stored as type []byte. The read and its classification are returned through responseChan. In the future, the reads may be of type Seq. However, this currently seems to be the fastest way of doing things. This version uses the first recognized kmer for classification - the Kraken-Q technique.
func (*RepeatGenome) ReadKraken ¶
func (rg *RepeatGenome) ReadKraken(infile *os.File) error
Has a lot of error handling, but the logic is fairly simple.
func (*RepeatGenome) RepeatIsCorrect ¶
func (rg *RepeatGenome) RepeatIsCorrect(readSAMRepeat ReadSAMRepeat, strict bool) bool
func (*RepeatGenome) RunDebugTests ¶
func (rg *RepeatGenome) RunDebugTests()
func (*RepeatGenome) Size ¶
func (repeatGenome *RepeatGenome) Size() uint64
Returns the total number of bases in a RepeatGenome's reference chromosomes.
func (*RepeatGenome) SplitChromsK ¶
func (rg *RepeatGenome) SplitChromsK() (chan KRespPair, chan KRespPair)
func (*RepeatGenome) SplitChromsM ¶
func (rg *RepeatGenome) SplitChromsM() (chan MRespPair, chan MRespPair)
func (*RepeatGenome) WriteClassJSON ¶
func (rg *RepeatGenome) WriteClassJSON(useCumSize, printLeaves bool) error
Writes a JSON representation of the class tree. Used by the JavaScript visualization, among other things. Currently, each node is associated with a value "size", the number of kmers associated with it. useCumSize determines whether the kmer count is cumulative, counting all kmers in its subtree.
func (*RepeatGenome) WriteKraken ¶
func (rg *RepeatGenome) WriteKraken() error
func (*RepeatGenome) WriteStatData ¶
func (rg *RepeatGenome) WriteStatData() error
type ResponsePair ¶
The type returned by RepeatGenome.getMatchKmers(), which processes raw kmers. The LCA contained in the Kmer value is not the Kmer's final LCA, but simply the ClassNode ID of the match this instance of the Kmer came from.
type Seq ¶
Each base is represented by two bits. High-order bits are occupied first. Remember that Seq.Len is the number of bases contained, while len(Seq.Bytes) is the number of bytes necessary to represent them.
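The distinction between Seq.Len and len(Seq.Bytes) can be sketched as follows. The field types, the packSeq function, and the base codes (borrowed from the KmerInts description) are assumptions; only the field names Len and Bytes appear in the documentation.

```go
package main

import "fmt"

// Seq is a stand-in with assumed field types.
type Seq struct {
	Bytes []byte
	Len   uint64
}

// packSeq (hypothetical) stores four bases per byte, filling the
// high-order bits of each byte first.
func packSeq(text []byte) (Seq, error) {
	s := Seq{Bytes: make([]byte, (len(text)+3)/4), Len: uint64(len(text))}
	for i, b := range text {
		var code byte
		switch b {
		case 'a', 'A':
			code = 0
		case 'c', 'C':
			code = 1
		case 'g', 'G':
			code = 2
		case 't', 'T':
			code = 3
		default:
			return Seq{}, fmt.Errorf("invalid base %q at position %d", b, i)
		}
		shift := uint(6 - 2*(i%4)) // 6, 4, 2, 0 within each byte
		s.Bytes[i/4] |= code << shift
	}
	return s, nil
}

func main() {
	s, err := packSeq([]byte("acgta"))
	if err != nil {
		panic(err)
	}
	// Five bases need two bytes; the second byte is only partly used.
	fmt.Printf("%08b %d %d\n", s.Bytes[0], s.Len, len(s.Bytes)) // prints 00011011 5 2
}
```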
func GetSeq ¶
Converts a TextSeq to the more memory-efficient Seq type. Upper- and lower-case base bytes are currently supported, but stable code should immediately convert to lower-case. The logic works and is sane, but could be altered in the future for brevity and efficiency.