fasta

package
v0.0.0-...-ad47f17 Latest Latest
Published: Oct 10, 2024 License: Apache-2.0 Imports: 15 Imported by: 0

Documentation

Overview

Package fasta contains code for parsing (optionally indexed) FASTA files. See http://www.htslib.org/doc/faidx.html. Briefly, FASTA files consist of a number of named sequences that may be interrupted by newlines. For example:

>chr7
ACGTAC
GAGGAC
GCG
>chr8
ACGT

Note: A sequence name is defined as the stretch of characters immediately after '>', up to but excluding the first space. Any text that appears after a space is ignored. For example, '>chr1 A viral sequence' becomes 'chr1'.
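The naming rule above can be illustrated with a short, standard-library-only sketch. The helper seqName is hypothetical and is not part of this package; it only shows the rule as stated, assuming a plain space is the sole separator.

```go
package main

import (
	"fmt"
	"strings"
)

// seqName is a hypothetical helper (not part of package fasta). It returns
// the text after '>' up to, but excluding, the first space.
func seqName(header string) string {
	header = strings.TrimPrefix(header, ">")
	if i := strings.IndexByte(header, ' '); i >= 0 {
		return header[:i]
	}
	return header
}

func main() {
	fmt.Println(seqName(">chr1 A viral sequence")) // prints "chr1"
	fmt.Println(seqName(">chr7"))                  // prints "chr7"
}
```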


Constants

This section is empty.

Variables

This section is empty.

Functions

func FaiToReferenceLengths

func FaiToReferenceLengths(index io.Reader) (map[string]uint64, error)

FaiToReferenceLengths reads a FASTA index (.fai) file and returns a map from reference name to reference length. It does not require reading the FASTA file itself.

func GenerateIndex

func GenerateIndex(out io.Writer, in io.Reader) (err error)

GenerateIndex generates an index (*.fai) from a FASTA file. The index can later be passed to NewIndexed() for fast random access to the FASTA file.

The index format is defined by "samtools faidx" (http://www.htslib.org/doc/faidx.html).

func OptClean

func OptClean(o *opts)

OptClean specifies that returned FASTA sequences should be cleaned as described in biosimd.CleanASCIISeq*. It is equivalent to OptEncoding(CleanASCII).

Types

type Encoding

type Encoding byte
const (
	// RawASCII encoding preserves the original bytes, including case.
	RawASCII Encoding = iota
	// CleanASCII encoding capitalizes all lowercase 'a'/'c'/'g'/'t', and
	// converts all non-ACGT characters to 'N'.
	CleanASCII
	// Seq8 encoding is 'A'/'a' = 1, 'C'/'c' = 2, 'G'/'g' = 4, 'T'/'t' = 8,
	// anything else = 15.  This plays well with BAM/PAM files.
	Seq8
	// TODO(cchang): Add 'Base5' encoding, where 'A'/'a' = 0, 'C'/'c' = 1,
	// 'G'/'g' = 2, 'T'/'t' = 3, anything else = 4.
	EncodingLimit
)
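The mappings described in the comments above can be sketched as plain Go functions. These are illustrative helpers written against the documented behavior only; the names cleanASCII and seq8 are hypothetical, and the package itself may implement the encodings differently (for example via biosimd).

```go
package main

import "fmt"

// cleanASCII mirrors the documented CleanASCII mapping: lowercase a/c/g/t
// are capitalized, and every non-ACGT byte becomes 'N'.
func cleanASCII(seq string) string {
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		switch seq[i] {
		case 'A', 'a':
			out[i] = 'A'
		case 'C', 'c':
			out[i] = 'C'
		case 'G', 'g':
			out[i] = 'G'
		case 'T', 't':
			out[i] = 'T'
		default:
			out[i] = 'N'
		}
	}
	return string(out)
}

// seq8 mirrors the documented Seq8 mapping: A=1, C=2, G=4, T=8 (either
// case), and anything else = 15.
func seq8(seq string) []byte {
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		switch seq[i] {
		case 'A', 'a':
			out[i] = 1
		case 'C', 'c':
			out[i] = 2
		case 'G', 'g':
			out[i] = 4
		case 'T', 't':
			out[i] = 8
		default:
			out[i] = 15
		}
	}
	return out
}

func main() {
	fmt.Println(cleanASCII("acgtNxRy")) // prints "ACGTNNNN"
	fmt.Println(seq8("AcGtN"))          // prints "[1 2 4 8 15]"
}
```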

type Fasta

type Fasta interface {
	// Get returns a substring of the given sequence name at the given
	// coordinates, which are treated as a 0-based half-open interval
	// [start, end). Get is thread-safe.
	Get(seqName string, start, end uint64) (string, error)

	// Len returns the length of the given sequence.
	Len(seqName string) (uint64, error)

	// SeqNames returns the names of all sequences, in the order of appearance in
	// the FASTA file.
	SeqNames() []string
}

Fasta represents FASTA-formatted data, consisting of a set of named sequences.
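To make the 0-based half-open convention of Get concrete, here is a minimal in-memory sketch that satisfies the interface as documented. The type memFasta is hypothetical and exists only for illustration; the values returned by New and NewIndexed are the package's own types.

```go
package main

import "fmt"

// Fasta repeats the interface documented above so the sketch is
// self-contained.
type Fasta interface {
	Get(seqName string, start, end uint64) (string, error)
	Len(seqName string) (uint64, error)
	SeqNames() []string
}

// memFasta is a hypothetical in-memory implementation used for illustration.
type memFasta struct {
	names []string          // names in order of appearance
	seqs  map[string]string // name -> full sequence
}

func (f *memFasta) Get(seqName string, start, end uint64) (string, error) {
	s, ok := f.seqs[seqName]
	if !ok {
		return "", fmt.Errorf("sequence %q not found", seqName)
	}
	if start > end || end > uint64(len(s)) {
		return "", fmt.Errorf("invalid range [%d, %d) for %q (length %d)",
			start, end, seqName, len(s))
	}
	return s[start:end], nil // start is included, end is excluded
}

func (f *memFasta) Len(seqName string) (uint64, error) {
	s, ok := f.seqs[seqName]
	if !ok {
		return 0, fmt.Errorf("sequence %q not found", seqName)
	}
	return uint64(len(s)), nil
}

func (f *memFasta) SeqNames() []string { return f.names }

func main() {
	var f Fasta = &memFasta{
		names: []string{"chr7", "chr8"},
		seqs:  map[string]string{"chr7": "ACGTACGAGGACGCG", "chr8": "ACGT"},
	}
	sub, err := f.Get("chr7", 2, 8)
	if err != nil {
		panic(err)
	}
	fmt.Println(sub) // prints "GTACGA": bases at positions 2 through 7
}
```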

func New

func New(r io.Reader, opts ...Opt) (Fasta, error)

New creates a new Fasta that holds all the FASTA data from the given reader in memory. When an index is available, pass OptIndex to make reading much faster.

func NewIndexed

func NewIndexed(fasta io.ReadSeeker, index io.Reader, opts ...Opt) (Fasta, error)

NewIndexed creates a new Fasta that can perform efficient random lookups using the provided index, without reading the data into memory.

Note: Callers that expect to read many or all of the FASTA file sequences should use New(..., OptIndex(...)) instead.

type Opt

type Opt func(*opts)

Opt is an optional argument to New and NewIndexed.
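The signature func(*opts) follows Go's functional-options pattern: each option is a function that mutates a private settings struct. The sketch below illustrates the pattern in isolation. Every name in it (options, option, withEncoding, withIndex, withClean, build) and the fields of the struct are hypothetical, since the package's opts type is unexported and its fields are not documented here.

```go
package main

import "fmt"

type encoding byte

const (
	rawASCII encoding = iota
	cleanASCII
)

// options is a hypothetical settings struct; the real opts fields are unknown.
type options struct {
	enc   encoding
	index []byte
}

// option mirrors the shape of Opt: a function that edits the settings.
type option func(*options)

// withEncoding returns an option, in the style of OptEncoding.
func withEncoding(e encoding) option {
	return func(o *options) { o.enc = e }
}

// withIndex returns an option, in the style of OptIndex.
func withIndex(idx []byte) option {
	return func(o *options) { o.index = idx }
}

// withClean is itself an option, in the style of OptClean, so it is passed
// without being called.
func withClean(o *options) { o.enc = cleanASCII }

// build applies options in order, as a constructor accepting ...option would.
func build(opts ...option) options {
	var o options
	for _, apply := range opts {
		apply(&o)
	}
	return o
}

func main() {
	o := build(withClean, withIndex([]byte("chr8\t4\t30\t4\t5\n")))
	fmt.Println(o.enc == cleanASCII, len(o.index) > 0) // prints "true true"
}
```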

func OptEncoding

func OptEncoding(e Encoding) Opt

OptEncoding specifies the encoding of the in-memory FASTA sequences.

func OptIndex

func OptIndex(index []byte) Opt

OptIndex makes New read the FASTA file using the provided index, like NewIndexed. Unlike NewIndexed, New with OptIndex is optimized for reading all sequences in the FASTA file rather than a small, random subset. Callers that plan to read many or all FASTA sequences should use this (though, as always, profile in your application).
