bioutils

package

Version: v0.0.0-...-69c58a0 | Published: Nov 17, 2015 | License: ISC | Imports: 8 | Imported by: 0

Repository

github.com/mmcco/jh-bio

Documentation ¶

Overview ¶

Adapted from Ant Zucaro's matchr package:

https://github.com/antzucaro/matchr

matchr is GPLv2 licensed.

Index ¶

Constants
func BytesToU32(seq []byte) uint32
func BytesToU64(seq []byte) uint64
func CanonicalRepr32(seq uint32) uint32
func CanonicalRepr64(seq uint64) uint64
func MinKey(seq uint64) uint8
func Minimize(seq uint64) uint32
func NeedlemanWunsch(a, b []byte) float64
func ReadFASTA(reader io.Reader) (error, map[string][]byte)
func RevComp32(seq uint32) uint32
func RevComp64(seq uint64) uint64
func SmithWaterman(a, b []byte) float64
func U32ToBytes(minInt uint32) []byte
func U64ToBytes(kmerInt uint64) []byte
type Match
- func (match Match) Size() uint64
type Matches
- func ParseMatches(reader io.Reader) (error, Matches)
- func (matches Matches) Write(filename string) error

Constants ¶

const GAP_SCORE float64 = -0.5

const K = 31

K is the k-mer size (in base pairs).

const KMask = (1 << (2 * K)) - 1

KMask and MMask contain 2*K and 2*M consecutive right-aligned 1 bits respectively, two bits per base (e.g. "0000011111111", eight 1 bits, for K=4).

const M = 15

M is the minimizer size (in base pairs), which must be <= K.

const MATCH_SCORE float64 = 1.0

const MISMATCH_SCORE float64 = -2.0

const MMask = (1 << (2 * M)) - 1
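As a quick check of the mask definitions, the sketch below mirrors the four constants locally and counts the set bits in each mask. The helpers maskWidth and lowBits are hypothetical names introduced for illustration; they are not part of bioutils.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Constants mirrored from the package documentation.
const (
	K     = 31
	M     = 15
	KMask = (1 << (2 * K)) - 1
	MMask = (1 << (2 * M)) - 1
)

// maskWidth reports how many 1 bits a mask contains.
// Hypothetical helper, not part of bioutils.
func maskWidth(mask uint64) int {
	return bits.OnesCount64(mask)
}

// lowBits keeps only the bits selected by mask, which is how such a
// mask would typically be used to trim a packed sequence.
// Hypothetical helper, not part of bioutils.
func lowBits(x, mask uint64) uint64 {
	return x & mask
}

func main() {
	fmt.Println(maskWidth(KMask))                   // 62, i.e. 2*K
	fmt.Println(maskWidth(MMask))                   // 30, i.e. 2*M
	fmt.Printf("%#x\n", lowBits(^uint64(0), MMask)) // 0x3fffffff
}
```

Each mask is twice as wide as the corresponding base count because a base occupies two bits.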

Variables ¶

This section is empty.

Functions ¶

func BytesToU32 ¶

func BytesToU32(seq []byte) uint32

func BytesToU64 ¶

func BytesToU64(seq []byte) uint64

func CanonicalRepr32 ¶

func CanonicalRepr32(seq uint32) uint32

The canonical representation of a sequence is the lexicographically smaller of its positive-strand form and its reverse complement.

func CanonicalRepr64 ¶

func CanonicalRepr64(seq uint64) uint64

The canonical representation of a sequence is the lexicographically smaller of its positive-strand form and its reverse complement.
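The following is a minimal sketch of the idea, not the package's implementation. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3) with the first base in the most significant position, under which integer order and lexicographic order agree for sequences of equal length. The names encode, decode, revComp and canonical are hypothetical, and a small length of 4 is used instead of K for readability.

```go
package main

import "fmt"

// encode packs a DNA string into the low 2*len(s) bits of a uint64.
// Assumed encoding: A=0, C=1, G=2, T=3; any other byte is treated as A.
func encode(s string) uint64 {
	var x uint64
	for i := 0; i < len(s); i++ {
		var code uint64
		switch s[i] {
		case 'C':
			code = 1
		case 'G':
			code = 2
		case 'T':
			code = 3
		}
		x = x<<2 | code
	}
	return x
}

// decode is the inverse of encode for a sequence of n bases.
func decode(x uint64, n int) string {
	b := make([]byte, n)
	for i := n - 1; i >= 0; i-- {
		b[i] = "ACGT"[x&3]
		x >>= 2
	}
	return string(b)
}

// revComp reverses the order of the n bases in x and complements each
// one. Under the assumed encoding the complement of a code c is 3-c,
// which equals ^c & 3.
func revComp(x uint64, n int) uint64 {
	var rc uint64
	for i := 0; i < n; i++ {
		rc = rc<<2 | (^x & 3)
		x >>= 2
	}
	return rc
}

// canonical returns the smaller of x and its reverse complement.
func canonical(x uint64, n int) uint64 {
	if rc := revComp(x, n); rc < x {
		return rc
	}
	return x
}

func main() {
	fmt.Println(decode(canonical(encode("TTGA"), 4), 4)) // TCAA
	fmt.Println(decode(canonical(encode("ACGT"), 4), 4)) // ACGT (its own reverse complement)
}
```

Choosing one representative per strand pair means a sequence and its reverse complement map to the same key.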

func MinKey ¶

func MinKey(seq uint64) uint8

MinKey returns the index of the minimizer of the k-mer stored in seq. The positive-strand index of the minimizer is MinKey % K. If MinKey is >= K, the minimizer is the reverse complement of the sequence at that index.

func Minimize ¶

func Minimize(seq uint64) uint32

Minimize returns the minimizer of a k-mer stored in a uint64, using the canonical representation (it tests both the positive strand and its reverse complement).
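The relationship between Minimize and MinKey can be sketched as a single scan over every window of m bases, comparing each window and its reverse complement against the best value seen so far. This is an illustration under assumptions, not the package's code: it assumes the 2-bit encoding A=0, C=1, G=2, T=3, counts window indexes from the first base, and uses hypothetical names (encode, revComp, minimize) with small k and m passed as parameters.

```go
package main

import "fmt"

// encode packs a DNA string into the low 2*len(s) bits of a uint64.
// Assumed encoding: A=0, C=1, G=2, T=3; any other byte is treated as A.
func encode(s string) uint64 {
	var x uint64
	for i := 0; i < len(s); i++ {
		var code uint64
		switch s[i] {
		case 'C':
			code = 1
		case 'G':
			code = 2
		case 'T':
			code = 3
		}
		x = x<<2 | code
	}
	return x
}

// revComp returns the reverse complement of the n bases stored in x.
func revComp(x uint64, n int) uint64 {
	var rc uint64
	for i := 0; i < n; i++ {
		rc = rc<<2 | (^x & 3)
		x >>= 2
	}
	return rc
}

// minimize returns the smallest m-base window of the k-base sequence x,
// considering both strands, together with a key. The key is the window's
// positive-strand index, plus k if the reverse complement was chosen,
// so key % k recovers the index and key >= k flags the reverse strand.
func minimize(x uint64, k, m int) (best uint64, key int) {
	mask := uint64(1)<<uint(2*m) - 1
	best = ^uint64(0)
	for i := 0; i+m <= k; i++ {
		shift := uint(2 * (k - m - i))
		w := (x >> shift) & mask // window starting at base i
		if w < best {
			best, key = w, i
		}
		if rc := revComp(w, m); rc < best {
			best, key = rc, i+k
		}
	}
	return best, key
}

func main() {
	// Windows of GTTAC: GTT, TTA, TAC. The smallest candidate is AAC,
	// the reverse complement of GTT at index 0.
	best, key := minimize(encode("GTTAC"), 5, 3)
	fmt.Println(best, key, key%5, key >= 5) // 1 5 0 true
}
```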

func NeedlemanWunsch ¶

func NeedlemanWunsch(a, b []byte) float64

func ReadFASTA ¶

func ReadFASTA(reader io.Reader) (error, map[string][]byte)

ReadFASTA ignores semicolon syntax for now. A possible later speed-up is to accumulate slices and call bytes.Join() at the end.
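For orientation, a FASTA reader of this shape can be sketched as below. This is not the package's implementation: readFASTA and readFASTAString are hypothetical names, the whole header line after '>' is assumed to be the sequence name, semicolon lines get no special treatment, and the result is returned in the conventional Go order (value, error) rather than the (error, map) order documented for ReadFASTA.

```go
package main

import (
	"bufio"
	"bytes"
	"errors"
	"fmt"
	"io"
	"strings"
)

// readFASTA collects the sequence lines under each '>' header into a
// single byte slice keyed by the header text. Illustrative only.
func readFASTA(r io.Reader) (map[string][]byte, error) {
	seqs := make(map[string][]byte)
	scanner := bufio.NewScanner(r)
	var name string
	haveName := false
	for scanner.Scan() {
		line := bytes.TrimSpace(scanner.Bytes())
		switch {
		case len(line) == 0:
			continue // skip blank lines
		case line[0] == '>':
			name = string(line[1:])
			haveName = true
			seqs[name] = nil
		case !haveName:
			return nil, errors.New("sequence data before first header")
		default:
			// append copies the bytes, so reusing the scanner buffer is safe.
			seqs[name] = append(seqs[name], line...)
		}
	}
	return seqs, scanner.Err()
}

// readFASTAString is a convenience wrapper for in-memory input.
func readFASTAString(s string) (map[string][]byte, error) {
	return readFASTA(strings.NewReader(s))
}

func main() {
	seqs, err := readFASTAString(">chr1\nACGT\nTTGA\n>chr2\nGGCC\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(seqs["chr1"])) // ACGTTTGA
	fmt.Println(string(seqs["chr2"])) // GGCC
}
```

Note that bufio.Scanner rejects lines longer than its buffer (64 KiB by default), so input with very long lines would need scanner.Buffer or a bufio.Reader instead.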

func RevComp32 ¶

func RevComp32(seq uint32) uint32

Returns the reverse complement of a sequence stored in a uint32.

func RevComp64 ¶

func RevComp64(seq uint64) uint64

Returns the reverse complement of a sequence stored in a uint64.

func SmithWaterman ¶

func SmithWaterman(a, b []byte) float64
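The documentation gives no details for SmithWaterman beyond its signature and the GAP_SCORE, MATCH_SCORE and MISMATCH_SCORE constants. The sketch below shows the textbook local-alignment recurrence with a linear gap penalty using those three values; whether the package uses exactly this scheme is an assumption, and smithWaterman and the lower-case constant names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Values mirrored from the package constants.
const (
	gapScore      float64 = -0.5
	matchScore    float64 = 1.0
	mismatchScore float64 = -2.0
)

// smithWaterman returns the best local alignment score between a and b
// using a linear gap penalty. Only two rows of the score matrix are
// kept, since each cell depends on the previous row and the current one.
func smithWaterman(a, b []byte) float64 {
	prev := make([]float64, len(b)+1)
	cur := make([]float64, len(b)+1)
	best := 0.0
	for i := 1; i <= len(a); i++ {
		cur[0] = 0
		for j := 1; j <= len(b); j++ {
			sub := mismatchScore
			if a[i-1] == b[j-1] {
				sub = matchScore
			}
			s := math.Max(0, prev[j-1]+sub)   // align a[i-1] with b[j-1], or restart
			s = math.Max(s, prev[j]+gapScore)  // gap in b
			s = math.Max(s, cur[j-1]+gapScore) // gap in a
			cur[j] = s
			if s > best {
				best = s
			}
		}
		prev, cur = cur, prev
	}
	return best
}

func main() {
	fmt.Println(smithWaterman([]byte("ACGT"), []byte("ACGT"))) // 4
	fmt.Println(smithWaterman([]byte("ACGT"), []byte("ACT")))  // 2.5
	fmt.Println(smithWaterman([]byte("AAAA"), []byte("TTTT"))) // 0
}
```

In the second call the best alignment matches A, C and T and pays one gap penalty for the unmatched G, giving 3 - 0.5 = 2.5.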

func U32ToBytes ¶

func U32ToBytes(minInt uint32) []byte

func U64ToBytes ¶

func U64ToBytes(kmerInt uint64) []byte

Types ¶

type Match ¶

type Match struct {
	SW_Score      int32
	PercDiv       float64
	PercDel       float64
	PercIns       float64
	SeqName       string
	SeqStart      uint64
	SeqEnd        uint64
	SeqRemains    uint64
	IsRevComp     bool
	RepeatClass   []string
	RepeatStart   int64
	RepeatEnd     int64
	RepeatRemains int64
	InsertionID   uint64

	// these are generated, not parsed
	RepeatName string
	ID         uint64
}

RepeatMasker is a program that takes as input a set of reference repeat sequences and a reference genome. It outputs "matches", specific instances of supplied reference repeats in the supplied reference genome. These are stored in the file <genome-name>.fa.out, and are parsed line-by-line into values of this type.

Match.SW_Score - Smith-Waterman score, describing the likeness of this match to the repeat reference sequence.

Match.PercDiv - "% substitutions in matching region compared to the consensus" - RepeatMasker docs

Match.PercDel - "% of bases opposite a gap in the query sequence (deleted bp)" - RepeatMasker docs

Match.PercIns - "% of bases opposite a gap in the repeat consensus (inserted bp)" - RepeatMasker docs

Match.SeqName - The name (without ".fa") of the reference genome FASTA file this match came from. It is typically the chromosome name, such as "chr2L". This is inherently unsound, as RepeatMasker gives only a 1-dimensional qualification, but FASTA-formatted reference genomes are 2-dimensional, using both filename and sequence name. Reference genome FASTA files generally contain only a single sequence, with the same name as the file. If this is not the case when parsing a FASTA reference genome, we print an ominous warning to stdout and use the sequence name.

Match.SeqStart - The match's start index (inclusive and zero-indexed) in the reference genome. Note that RepeatMasker's output is one-indexed.

Match.SeqEnd - The end index (exclusive and zero-indexed) in the reference genome.

Match.SeqRemains - The number of bases past the end of the match in the relevant reference sequence.
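The index conventions above can be illustrated with a small conversion. The sketch assumes the one-indexed coordinates in RepeatMasker's output are inclusive at both ends, and that the size of a match is its end minus its start; span, fromOneIndexed and size are hypothetical names, not part of bioutils.

```go
package main

import "fmt"

// span is a hypothetical stand-in for the SeqStart/SeqEnd pair of a Match.
type span struct {
	Start uint64 // inclusive, zero-indexed
	End   uint64 // exclusive, zero-indexed
}

// fromOneIndexed converts a one-indexed, inclusive [begin, end] pair
// into a zero-indexed, half-open span. It assumes begin >= 1.
func fromOneIndexed(begin, end uint64) span {
	// Shifting to zero-indexing subtracts 1 from both ends; making the
	// end exclusive adds 1 back, so only the start changes.
	return span{Start: begin - 1, End: end}
}

// size returns the number of bases covered by the span.
func (s span) size() uint64 {
	return s.End - s.Start
}

func main() {
	s := fromOneIndexed(101, 150)
	fmt.Println(s.Start, s.End, s.size()) // 100 150 50
}
```

With half-open intervals the length is a plain subtraction, and adjacent matches share a boundary value without overlapping.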

Match.IsRevComp - Whether the match was for the reverse complement of the reference repeat sequence. In this case, we manually adjust some location fields, as RepeatMasker's output gives indexes for the reverse complement of the reference sequence. This allows us to treat all matches with the same logic.

Match.RepeatClass - The repeat's full ancestry in a slice of strings. This includes its repeat class and repeat name, which are listed separately in the RepeatMasker output file. Root is implicit and excluded.

Match.RepeatStart - The start index (inclusive and zero-indexed) of this match in the repeat consensus sequence. A signed integer is used because the value can be negative in unusual cases.

Match.RepeatEnd - The end index (exclusive and zero-indexed) of this match in the consensus repeat sequence. A signed integer is used, in agreement with Match.RepeatStart.

Match.RepeatRemains - The number of bases at the end of the consensus repeat sequence that this match excludes.

Match.InsertionID - A numerical ID that is the same only for matches of the same long terminal repeat (LTR) instance. The sequences classified as <LTR name>_I or <LTR name>_int are the internal sequences of LTRs. These are less well defined than the core LTR sequence. These IDs begin at 1.

The fields below are not parsed, but rather calculated:

Match.RepeatName - Simply Match.RepeatClass's items concatenated. It is used for quick printing, and will not necessarily remain in the long term.

Match.ClassNode - A pointer to the match's corresponding ClassNode in RepeatGenome.ClassTree. Note that the struct definition shown above has no ClassNode field.

Match.ID - A unique ID, used as a quick way of referencing and indexing a repeat. Pointers can generally be used, so this will not necessarily remain in the long term.

func (Match) Size ¶

func (match Match) Size() uint64

Returns the size in bases of a repeat instance.

type Matches ¶

type Matches []Match

func ParseMatches ¶

func ParseMatches(reader io.Reader) (error, Matches)

func (Matches) Write ¶

func (matches Matches) Write(filename string) error

Writes a representation of the Matches to the supplied filename.
