bioutils

package
v0.0.0-...-69c58a0

Warning: This package is not in the latest version of its module.

Published: Nov 17, 2015 License: ISC Imports: 8 Imported by: 0

Documentation

Overview

Adapted from Ant Zucaro's matchr package:

https://github.com/antzucaro/matchr

matchr is GPLv2 licensed.

Constants

const GAP_SCORE float64 = -0.5
const K = 31

K is the k-mer size (in base pairs).

const KMask = (1 << (2 * K)) - 1

KMask and MMask contain 2*K and 2*M consecutive right-aligned 1 bits respectively, because each base occupies two bits (e.g. "0000011111111" for k=4).

const M = 15

M is the minimizer size (in base pairs), which must be <= K.

const MATCH_SCORE float64 = 1.0
const MISMATCH_SCORE float64 = -2.0
const MMask = (1 << (2 * M)) - 1

Variables

This section is empty.

Functions

func BytesToU32

func BytesToU32(seq []byte) uint32

func BytesToU64

func BytesToU64(seq []byte) uint64
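
The conversion functions are undocumented, but their signatures suggest packing bases into an integer two bits at a time. The sketch below illustrates that idea under loudly stated assumptions: the mapping a=0, c=1, g=2, t=3 and the placement of the first base in the most significant occupied bits are guesses, and packBases and unpackBases are hypothetical names, not the package's API.

```go
package main

import "fmt"

// baseCode maps a nucleotide to a 2-bit code. The mapping
// (a=0, c=1, g=2, t=3) is an assumption, not taken from the package.
func baseCode(b byte) (uint64, error) {
	switch b {
	case 'a', 'A':
		return 0, nil
	case 'c', 'C':
		return 1, nil
	case 'g', 'G':
		return 2, nil
	case 't', 'T':
		return 3, nil
	}
	return 0, fmt.Errorf("not a nucleotide: %q", b)
}

// packBases packs up to 32 bases into a uint64, with the first base
// ending up in the most significant occupied bits.
func packBases(seq []byte) (uint64, error) {
	if len(seq) > 32 {
		return 0, fmt.Errorf("sequence too long: %d bases", len(seq))
	}
	var packed uint64
	for _, b := range seq {
		code, err := baseCode(b)
		if err != nil {
			return 0, err
		}
		packed = packed<<2 | code
	}
	return packed, nil
}

// unpackBases reverses packBases for a sequence of n bases.
func unpackBases(packed uint64, n int) []byte {
	const alphabet = "acgt"
	seq := make([]byte, n)
	for i := n - 1; i >= 0; i-- {
		seq[i] = alphabet[packed&3]
		packed >>= 2
	}
	return seq
}

func main() {
	packed, err := packBases([]byte("gattaca"))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%014b\n", packed)
	fmt.Println(string(unpackBases(packed, 7)))
}
```

With this mapping, integer order on packed values of equal length matches lexicographic order on the underlying sequences, which is what makes integer comparison usable for canonical forms.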

func CanonicalRepr32

func CanonicalRepr32(seq uint32) uint32

The canonical representation of a sequence is the lexicographically smaller of the positive-strand sequence and its reverse complement.

func CanonicalRepr64

func CanonicalRepr64(seq uint64) uint64

The canonical representation of a sequence is the lexicographically smaller of the positive-strand sequence and its reverse complement.
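
To make the definition concrete, here is a small sketch that works on plain byte slices rather than packed integers. The names revComp and canonical are hypothetical helpers for illustration only; the package's own functions operate on uint32 and uint64 values.

```go
package main

import (
	"bytes"
	"fmt"
)

// revComp returns the reverse complement of a lowercase DNA sequence.
// Bytes other than a, c, g, t are mapped to 'n'.
func revComp(seq []byte) []byte {
	out := make([]byte, len(seq))
	for i, b := range seq {
		var c byte
		switch b {
		case 'a':
			c = 't'
		case 'c':
			c = 'g'
		case 'g':
			c = 'c'
		case 't':
			c = 'a'
		default:
			c = 'n'
		}
		out[len(seq)-1-i] = c
	}
	return out
}

// canonical returns the lexicographically smaller of seq and its
// reverse complement.
func canonical(seq []byte) []byte {
	rc := revComp(seq)
	if bytes.Compare(rc, seq) < 0 {
		return rc
	}
	return seq
}

func main() {
	for _, s := range []string{"ttga", "acgt", "gggg"} {
		fmt.Printf("%s -> %s\n", s, canonical([]byte(s)))
	}
}
```

Because a sequence and its reverse complement share one canonical form, both strands of the same locus map to the same key.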

func MinKey

func MinKey(seq uint64) uint8

Returns the index of the minimizer within the k-mer stored in a uint64. The positive-strand index of the minimizer is the returned value modulo K (minKey % K). If the returned value is >= K, the minimizer is the reverse complement of the m-mer at that index.
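
The encoding described above can be unpacked as follows. This is only a sketch of how a caller might interpret the returned value; decodeMinKey is a hypothetical helper, and K is copied from the constants section.

```go
package main

import "fmt"

// K is the k-mer size, copied from the constants listed above.
const K = 31

// decodeMinKey splits a MinKey-style value into the positive-strand
// index of the minimizer and a flag saying whether the minimizer is the
// reverse complement of the m-mer at that index.
func decodeMinKey(key uint8) (index uint8, isRevComp bool) {
	return key % K, key >= K
}

func main() {
	for _, key := range []uint8{5, 36} {
		index, rc := decodeMinKey(key)
		fmt.Printf("key=%d index=%d revcomp=%t\n", key, index, rc)
	}
}
```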

func Minimize

func Minimize(seq uint64) uint32

Returns the minimizer of a k-mer stored in a uint64, using the canonical representation (both the positive-strand form and its reverse complement are tested).
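
One way to compute such a minimizer is to slide an m-base window across the k-mer and keep the smallest canonical window. The sketch below does that under assumptions that the source does not confirm: bases are packed two bits each with a=0, c=1, g=2, t=3, the first base sits in the most significant occupied bits, and "smallest" means plain integer order. The functions revCompBits and minimizer are hypothetical and take k and m as parameters so that small cases are easy to follow.

```go
package main

import "fmt"

// revCompBits reverse-complements the low n bases of x, assuming
// a=0, c=1, g=2, t=3 so that the complement of a code is 3 - code.
func revCompBits(x uint64, n int) uint64 {
	var rc uint64
	for i := 0; i < n; i++ {
		rc = rc<<2 | (3 - x&3)
		x >>= 2
	}
	return rc
}

// minimizer returns the smallest canonical m-mer among the k-m+1
// windows of a k-mer packed into kmer.
func minimizer(kmer uint64, k, m int) uint64 {
	mask := uint64(1)<<(2*m) - 1
	best := ^uint64(0)
	for i := 0; i <= k-m; i++ {
		// Window i, counted from the left (most significant) end.
		w := (kmer >> (2 * (k - m - i))) & mask
		if rc := revCompBits(w, m); rc < w {
			w = rc
		}
		if w < best {
			best = w
		}
	}
	return best
}

func main() {
	// "ttac" packs to 0b11110001 under the assumed mapping.
	fmt.Println(minimizer(0b11110001, 4, 2))
}
```

In the "ttac" example the winning window is "tt", whose reverse complement "aa" packs to the smallest possible value.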

func NeedlemanWunsch

func NeedlemanWunsch(a, b []byte) float64
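
NeedlemanWunsch is undocumented here. As a rough illustration, the sketch below computes a textbook global alignment score using the score values from the constants section. It assumes a linear gap penalty, and needlemanWunschScore is a hypothetical function, so its results are not guaranteed to match the package's.

```go
package main

import (
	"fmt"
	"math"
)

// Score values copied from the constants listed above.
const (
	matchScore    = 1.0
	mismatchScore = -2.0
	gapScore      = -0.5
)

// needlemanWunschScore returns the optimal global alignment score of a
// and b with a linear gap penalty, keeping only two rows of the table.
func needlemanWunschScore(a, b []byte) float64 {
	prev := make([]float64, len(b)+1)
	curr := make([]float64, len(b)+1)
	for j := range prev {
		prev[j] = float64(j) * gapScore // align a prefix of b to gaps
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = float64(i) * gapScore // align a prefix of a to gaps
		for j := 1; j <= len(b); j++ {
			sub := mismatchScore
			if a[i-1] == b[j-1] {
				sub = matchScore
			}
			diag := prev[j-1] + sub
			up := prev[j] + gapScore
			left := curr[j-1] + gapScore
			curr[j] = math.Max(diag, math.Max(up, left))
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(needlemanWunschScore([]byte("acgt"), []byte("acgt")))
	fmt.Println(needlemanWunschScore([]byte("acgt"), []byte("agt")))
}
```

With these score values, two gaps (-1.0) cost less than one mismatch (-2.0), so an optimal alignment under a linear gap model never contains a mismatch column.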

func ReadFASTA

func ReadFASTA(reader io.Reader) (error, map[string][]byte)

Currently ignores the semicolon comment syntax. A possible later optimization is to accumulate slices and call bytes.Join() once at the end.

func RevComp32

func RevComp32(seq uint32) uint32

Returns the reverse complement of a sequence stored in a uint32.

func RevComp64

func RevComp64(seq uint64) uint64

Returns the reverse complement of a sequence stored in a uint64.
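
For packed sequences, a reverse complement can be computed without a per-base loop. The sketch below is one common bit-twiddling approach, not necessarily what the package does. It assumes two bits per base with a=0, c=1, g=2, t=3 (so complementing is a bitwise NOT) and takes the sequence length k as an explicit parameter; revComp64 here is a hypothetical function with a different signature from RevComp64.

```go
package main

import (
	"fmt"
	"math/bits"
)

// revComp64 reverse-complements a k-base sequence stored in the low
// 2*k bits of x (k <= 32).
func revComp64(x uint64, k uint) uint64 {
	// Complement every base: with a=0, c=1, g=2, t=3 this is code XOR 3.
	x = ^x
	// Reverse the order of the 32 two-bit groups:
	// swap neighbouring groups, then nibbles, then whole bytes.
	x = (x>>2)&0x3333333333333333 | (x&0x3333333333333333)<<2
	x = (x>>4)&0x0F0F0F0F0F0F0F0F | (x&0x0F0F0F0F0F0F0F0F)<<4
	x = bits.ReverseBytes64(x)
	// The sequence now sits in the high bits; move it back down.
	return x >> (64 - 2*k)
}

func main() {
	// "ac" packs to 0b0001 under the assumed mapping.
	fmt.Printf("%04b\n", revComp64(0b0001, 2))
}
```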

func SmithWaterman

func SmithWaterman(a, b []byte) float64
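
SmithWaterman is likewise undocumented. The sketch below shows a textbook local alignment score with the score values from the constants section, assuming a linear gap penalty. The function smithWatermanScore is hypothetical and may differ from the package's behavior.

```go
package main

import (
	"fmt"
	"math"
)

// Score values copied from the constants listed above.
const (
	matchScore    = 1.0
	mismatchScore = -2.0
	gapScore      = -0.5
)

// smithWatermanScore returns the best local alignment score of a and b.
// Cell scores are clamped at zero so an alignment can start anywhere.
func smithWatermanScore(a, b []byte) float64 {
	prev := make([]float64, len(b)+1)
	curr := make([]float64, len(b)+1)
	best := 0.0
	for i := 1; i <= len(a); i++ {
		curr[0] = 0
		for j := 1; j <= len(b); j++ {
			sub := mismatchScore
			if a[i-1] == b[j-1] {
				sub = matchScore
			}
			score := math.Max(prev[j-1]+sub, math.Max(prev[j]+gapScore, curr[j-1]+gapScore))
			score = math.Max(0, score)
			curr[j] = score
			if score > best {
				best = score
			}
		}
		prev, curr = curr, prev
	}
	return best
}

func main() {
	fmt.Println(smithWatermanScore([]byte("ttacgtt"), []byte("ggacgcc")))
}
```

Unlike the global case, the result is the maximum over the whole table rather than the bottom-right cell.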

func U32ToBytes

func U32ToBytes(minInt uint32) []byte

func U64ToBytes

func U64ToBytes(kmerInt uint64) []byte

Types

type Match

type Match struct {
	SW_Score      int32
	PercDiv       float64
	PercDel       float64
	PercIns       float64
	SeqName       string
	SeqStart      uint64
	SeqEnd        uint64
	SeqRemains    uint64
	IsRevComp     bool
	RepeatClass   []string
	RepeatStart   int64
	RepeatEnd     int64
	RepeatRemains int64
	InsertionID   uint64

	// these are generated, not parsed
	RepeatName string
	ID         uint64
}

RepeatMasker is a program that takes as input a set of reference repeat sequences and a reference genome. It outputs "matches", specific instances of supplied reference repeats in the supplied reference genome. These are stored in the file <genome-name>.fa.out, and are parsed line-by-line into values of this type.

Match.SW_Score - Smith-Waterman score, describing the likeness of this match to the repeat reference sequence.

Match.PercDiv - "% substitutions in matching region compared to the consensus" - RepeatMasker docs

Match.PercDel - "% of bases opposite a gap in the query sequence (deleted bp)" - RepeatMasker docs

Match.PercIns - "% of bases opposite a gap in the repeat consensus (inserted bp)" - RepeatMasker docs

Match.SeqName - The name (without ".fa") of the reference genome FASTA file this match came from. It is typically the chromosome name, such as "chr2L". This is inherently ambiguous: RepeatMasker identifies a sequence by a single name, whereas FASTA-formatted reference genomes are identified by both a filename and a sequence name. Reference genome FASTA files generally contain only a single sequence, with the same name as the file. If this is not the case when parsing a FASTA reference genome, a warning is printed to stdout and the sequence name is used.

Match.SeqStart - The match's start index (inclusive and zero-indexed) in the reference genome. Note that RepeatMasker's output is one-indexed.

Match.SeqEnd - The end index (exclusive and zero-indexed) in the reference genome.

Match.SeqRemains - The number of bases past the end of the match in the relevant reference sequence.

Match.IsRevComp - Whether the match was for the reverse complement of the reference repeat sequence. In this case, some location fields are adjusted manually, as RepeatMasker's output gives indexes for the reverse complement of the reference sequence. This allows all matches to be treated with the same logic.

Match.RepeatClass - The repeat's full ancestry in a slice of strings. This includes its repeat class and repeat name, which are listed separately in the RepeatMasker output file. Root is implicit and excluded.

Match.RepeatStart - The start index (inclusive and zero-indexed) of this match in the repeat consensus sequence. A signed integer is used because the value can be negative in unusual cases.

Match.RepeatEnd - The end index (exclusive and zero-indexed) of this match in the consensus repeat sequence. A signed integer is used, in agreement with Match.RepeatStart.

Match.RepeatRemains - The number of bases at the end of the consensus repeat sequence that this match excludes.

Match.InsertionID - A numerical ID that is the same only for matches of the same long terminal repeat (LTR) instance. The sequences classified as <LTR name>_I or <LTR name>_int are the internal sequences of LTRs. These are less well defined than the core LTR sequence. These IDs begin at 1.

The fields below are not parsed, but rather calculated:

Match.RepeatName - Simply Match.RepeatClass's items concatenated. It is used for quick printing, and will not necessarily remain in the long term.

Match.ClassNode - A pointer to the match's corresponding ClassNode in RepeatGenome.ClassTree.

Match.ID - A unique ID, used as a quick way of referencing and indexing a repeat. Pointers can generally be used instead, so this field will not necessarily remain in the long term.

func (Match) Size

func (match Match) Size() uint64

Returns the size in bases of a repeat instance.
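
The coordinate convention described for SeqStart and SeqEnd makes the size a simple difference. The sketch below assumes that Size is SeqEnd minus SeqStart and that RepeatMasker's end position is one-indexed and inclusive; the type span and the function fromOneIndexed are hypothetical stand-ins, not part of the package.

```go
package main

import "fmt"

// span is a hypothetical, trimmed-down stand-in for Match.
type span struct {
	SeqStart uint64 // inclusive, zero-indexed
	SeqEnd   uint64 // exclusive, zero-indexed
}

// fromOneIndexed converts a one-indexed, inclusive [begin, end] pair
// into the zero-indexed, half-open convention. begin must be >= 1.
func fromOneIndexed(begin, end uint64) span {
	return span{SeqStart: begin - 1, SeqEnd: end}
}

// size returns the number of bases covered by the span.
func (s span) size() uint64 {
	return s.SeqEnd - s.SeqStart
}

func main() {
	s := fromOneIndexed(6563, 6781)
	fmt.Println(s.SeqStart, s.SeqEnd, s.size())
}
```

Half-open intervals keep the length computation free of off-by-one corrections and map directly onto Go slice expressions such as seq[s.SeqStart:s.SeqEnd].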

type Matches

type Matches []Match

func ParseMatches

func ParseMatches(reader io.Reader) (error, Matches)
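
ParseMatches is undocumented, so the following is only a sketch of how one line of a RepeatMasker .out file might be turned into the first few Match fields. It assumes the whitespace-separated column order score, divergence, deletions, insertions, sequence name, begin, end, (remaining), strand, with "C" marking a complement match. The type matchLine, the function parseMatchLine, and the sample line are hypothetical, and the repeat-side columns are left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// matchLine is a hypothetical, trimmed-down stand-in for Match.
type matchLine struct {
	SWScore    int32
	PercDiv    float64
	PercDel    float64
	PercIns    float64
	SeqName    string
	SeqStart   uint64 // inclusive, zero-indexed
	SeqEnd     uint64 // exclusive, zero-indexed
	SeqRemains uint64
	IsRevComp  bool
}

func parseMatchLine(line string) (matchLine, error) {
	f := strings.Fields(line)
	if len(f) < 9 {
		return matchLine{}, fmt.Errorf("expected at least 9 fields, got %d", len(f))
	}
	score, err := strconv.ParseInt(f[0], 10, 32)
	if err != nil {
		return matchLine{}, err
	}
	// The helpers below stop parsing after the first error.
	parseFloat := func(s string) float64 {
		if err != nil {
			return 0
		}
		var v float64
		v, err = strconv.ParseFloat(s, 64)
		return v
	}
	parseUint := func(s string) uint64 {
		if err != nil {
			return 0
		}
		var v uint64
		// Fields such as "(22462)" carry parentheses in the file.
		v, err = strconv.ParseUint(strings.Trim(s, "()"), 10, 64)
		return v
	}
	m := matchLine{SWScore: int32(score), SeqName: f[4], IsRevComp: f[8] == "C"}
	m.PercDiv = parseFloat(f[1])
	m.PercDel = parseFloat(f[2])
	m.PercIns = parseFloat(f[3])
	begin := parseUint(f[5])
	m.SeqEnd = parseUint(f[6]) // one-indexed inclusive == zero-indexed exclusive
	m.SeqRemains = parseUint(f[7])
	if err != nil {
		return matchLine{}, err
	}
	if begin == 0 {
		return matchLine{}, fmt.Errorf("begin must be one-indexed, got 0")
	}
	m.SeqStart = begin - 1
	return m, nil
}

func main() {
	line := "1306 15.6 6.2 0.0 chr2L 6563 6781 (22462) C MER7A DNA/MER2_type (0) 336 103 1"
	m, err := parseMatchLine(line)
	fmt.Printf("%+v %v\n", m, err)
}
```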

func (Matches) Write

func (matches Matches) Write(filename string) error

Writes a representation of the Matches to the supplied filename.
