Documentation ¶
Overview ¶
Adapted from Ant Zucaro's matchr package:
https://github.com/antzucaro/matchr
matchr is GPLv2 licensed.
Index ¶
- Constants
- func BytesToU32(seq []byte) uint32
- func BytesToU64(seq []byte) uint64
- func CanonicalRepr32(seq uint32) uint32
- func CanonicalRepr64(seq uint64) uint64
- func MinKey(seq uint64) uint8
- func Minimize(seq uint64) uint32
- func NeedlemanWunsch(a, b []byte) float64
- func ReadFASTA(reader io.Reader) (error, map[string][]byte)
- func RevComp32(seq uint32) uint32
- func RevComp64(seq uint64) uint64
- func SmithWaterman(a, b []byte) float64
- func U32ToBytes(minInt uint32) []byte
- func U64ToBytes(kmerInt uint64) []byte
- type Match
- type Matches
Constants ¶
const GAP_SCORE float64 = -0.5
const K = 31
K is the k-mer size (in base pairs).
const KMask = (1 << (2 * K)) - 1
KMask and MMask contain 2*K and 2*M consecutive right-aligned 1 bits respectively, i.e. two bits per base (e.g. "0000011111111", eight 1 bits, for a size of 4 bases).
const M = 15
M is the minimizer size (in base pairs), which must be <= K.
const MATCH_SCORE float64 = 1.0
const MISMATCH_SCORE float64 = -2.0
const MMask = (1 << (2 * M)) - 1
Variables ¶
This section is empty.
Functions ¶
func BytesToU32 ¶
func BytesToU64 ¶
func CanonicalRepr32 ¶
The canonical representation of a sequence is the lexicographically smaller of its positive-strand and its reverse complement.
func CanonicalRepr64 ¶
The canonical representation of a sequence is the lexicographically smaller of its positive-strand and its reverse complement.
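The canonical representation described above can be sketched as follows. This is not the package's implementation: revCompSketch and canonicalSketch are hypothetical names, they take the sequence length k as a parameter (the package functions take only seq), and they assume a 2-bit encoding of a=0, c=1, g=2, t=3, under which complementing a base is a bitwise NOT of its two bits and numeric order matches lexicographic order.

```go
package main

import "fmt"

// revCompSketch returns the reverse complement of the k bases packed into
// the low 2*k bits of seq (assumed encoding: a=0, c=1, g=2, t=3).
func revCompSketch(seq uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		// Complement the lowest base and append it to the result, so the
		// last base of seq becomes the first base of rc.
		rc = (rc << 2) | (^seq & 3)
		seq >>= 2
	}
	return rc
}

// canonicalSketch returns the smaller of seq and its reverse complement.
func canonicalSketch(seq uint64, k int) uint64 {
	if rc := revCompSketch(seq, k); rc < seq {
		return rc
	}
	return seq
}

func main() {
	// "ttta" is 11 11 11 00 = 252; its reverse complement "taaa" is 192.
	fmt.Println(revCompSketch(252, 4))   // 192
	fmt.Println(canonicalSketch(252, 4)) // 192
	// "acgt" (27) is its own reverse complement.
	fmt.Println(canonicalSketch(27, 4)) // 27
}
```

Because a sequence and its reverse complement map to the same canonical value, both strands of a read can be looked up under a single key.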
func MinKey ¶
Returns the index of the uint64's minimizer. The positive-strand index of the minimizer is minKey % K. If the MinKey is >= K, the minimizer is the reverse complement of the indexed minimizer.
func Minimize ¶
Returns the minimizer of a uint64 using canonical representation (testing both the positive strand and its reverse complement).
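One way to read the description above: slide an m-base window across the k-mer and across its reverse complement, and keep the smallest window value. The sketch below does exactly that. minimizeSketch and revCompSketch are hypothetical helpers with explicit k and m parameters, the a=0, c=1, g=2, t=3 encoding is an assumption, and the package's own Minimize may break ties or order windows differently.

```go
package main

import "fmt"

// revCompSketch returns the reverse complement of the k bases packed into
// the low 2*k bits of seq (assumed encoding: a=0, c=1, g=2, t=3).
func revCompSketch(seq uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = (rc << 2) | (^seq & 3)
		seq >>= 2
	}
	return rc
}

// minimizeSketch returns the numerically smallest m-base window found on
// either strand of the k-base sequence seq. It assumes m <= k and m <= 16.
func minimizeSketch(seq uint64, k, m int) uint32 {
	mask := uint64(1)<<(2*uint(m)) - 1 // 2*m right-aligned one bits
	best := ^uint32(0)
	for _, strand := range []uint64{seq, revCompSketch(seq, k)} {
		// There are k-m+1 windows per strand.
		for i := 0; i <= k-m; i++ {
			w := uint32((strand >> (2 * uint(i))) & mask)
			if w < best {
				best = w
			}
		}
	}
	return best
}

func main() {
	// "ttta" (252): the reverse complement "taaa" contains "aa" = 0.
	fmt.Println(minimizeSketch(252, 4, 2)) // 0
	// "acgt" (27): the smallest 2-base window is "ac" = 1.
	fmt.Println(minimizeSketch(27, 4, 2)) // 1
}
```

Since both strands are scanned, a k-mer and its reverse complement always yield the same minimizer in this sketch.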
func NeedlemanWunsch ¶
func ReadFASTA ¶
Ignores FASTA's semicolon comment syntax for now. A possible later speedup is to accumulate line slices and call bytes.Join() once at the end.
func SmithWaterman ¶
func U32ToBytes ¶
func U64ToBytes ¶
Types ¶
type Match ¶
type Match struct {
    SW_Score      int32
    PercDiv       float64
    PercDel       float64
    PercIns       float64
    SeqName       string
    SeqStart      uint64
    SeqEnd        uint64
    SeqRemains    uint64
    IsRevComp     bool
    RepeatClass   []string
    RepeatStart   int64
    RepeatEnd     int64
    RepeatRemains int64
    InsertionID   uint64

    // these are generated, not parsed
    RepeatName string
    ID         uint64
}
RepeatMasker is a program that takes as input a set of reference repeat sequences and a reference genome. It outputs "matches", specific instances of supplied reference repeats in the supplied reference genome. These are stored in the file <genome-name>.fa.out, and are parsed line-by-line into values of this type.
Match.SW_Score - Smith-Waterman score, describing the likeness of this match to the repeat reference sequence.
Match.PercDiv - "% substitutions in matching region compared to the consensus" - RepeatMasker docs
Match.PercDel - "% of bases opposite a gap in the query sequence (deleted bp)" - RepeatMasker docs
Match.PercIns - "% of bases opposite a gap in the repeat consensus (inserted bp)" - RepeatMasker docs
Match.SeqName - The name (without ".fa") of the reference genome FASTA file this match came from. It is typically the chromosome name, such as "chr2L". This is inherently unsound, as RepeatMasker gives only a 1-dimensional qualification, but FASTA-formatted reference genomes are 2-dimensional, using both filename and sequence name. Reference genome FASTA files generally contain only a single sequence, with the same name as the file. If this is not the case when parsing a FASTA reference genome, we print an ominous warning to stdout and use the sequence name.
Match.SeqStart - The match's start index (inclusive and zero-indexed) in the reference genome. Note that RepeatMasker's output is one-indexed.
Match.SeqEnd - The end index (exclusive and zero-indexed) in the reference genome.
Match.SeqRemains - The number of bases past the end of the match in the relevant reference sequence.
Match.IsRevComp - Whether the match was for the reverse complement of the reference repeat sequence. In this case, we manually adjust some location fields, as RepeatMasker's output gives indexes for the reverse complement of the reference sequence. This allows us to treat all matches with the same logic.
Match.RepeatClass - The repeat's full ancestry in a slice of strings. This includes its repeat class and repeat name, which are listed separately in the RepeatMasker output file. Root is implicit and excluded.
Match.RepeatStart - The start index (inclusive and zero-indexed) of this match in the repeat consensus sequence. A signed integer is used because the value can be negative in unusual cases.
Match.RepeatEnd - The end index (exclusive and zero-indexed) of this match in the consensus repeat sequence. A signed integer is used, in agreement with Match.RepeatStart.
Match.RepeatRemains - The number of bases at the end of the consensus repeat sequence that this match excludes.
Match.InsertionID - A numerical ID that is the same only for matches of the same long terminal repeat (LTR) instance. The sequences classified as <LTR name>_I or <LTR name>_int are the internal sequences of LTRs. These are less well defined than the core LTR sequence. These IDs begin at 1.
The fields below are not parsed, but rather calculated:
Match.RepeatName - Simply Match.RepeatClass's items concatenated. It is used for quick printing, and may not remain in the long term.
Match.ClassNode - A pointer to the match's corresponding ClassNode in RepeatGenome.ClassTree.
Match.ID - A unique ID, used as a quick way of referencing and indexing a repeat. Pointers can generally be used instead, so this field may not remain in the long term.
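The index conventions for Match.SeqStart and Match.SeqEnd can be illustrated with a small conversion sketch. It assumes RepeatMasker reports one-indexed positions with an inclusive end, so only the start needs shifting. The span type and fromOneIndexed function are hypothetical and are not part of this package.

```go
package main

import "fmt"

// span is a hypothetical zero-indexed, half-open interval [Start, End),
// mirroring the convention described for Match.SeqStart and Match.SeqEnd.
type span struct {
	Start, End uint64
}

// Len returns the number of bases covered by the span.
func (s span) Len() uint64 { return s.End - s.Start }

// fromOneIndexed converts a one-indexed, inclusive range (assumed to be the
// RepeatMasker convention) into a zero-indexed, half-open span.
func fromOneIndexed(start1, end1 uint64) (span, error) {
	if start1 == 0 || end1 < start1 {
		return span{}, fmt.Errorf("invalid one-indexed range %d..%d", start1, end1)
	}
	// The inclusive one-indexed end equals the exclusive zero-indexed end.
	return span{Start: start1 - 1, End: end1}, nil
}

func main() {
	s, err := fromOneIndexed(1, 468)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s.Start, s.End, s.Len()) // 0 468 468
}
```

With half-open intervals the length is simply End minus Start, and a span maps directly onto a Go slice expression such as seq[s.Start:s.End].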