fasta

package
v1.0.1
Warning

This package is not in the latest version of its module.

Published: May 1, 2024 License: BSD-3-Clause Imports: 15 Imported by: 10

Documentation

Overview

Package fasta provides functions for reading, writing, and manipulating fasta files.

Constants

This section is empty.

Variables

var (
	ErrSeekStartOutsideChr = errors.New("requested start position greater than requested chromosome length, nil output")
	ErrSeekEndOutsideChr   = errors.New("requested bases past end of chr, output truncated")
)

Functions

func AllAreEqual

func AllAreEqual(alpha []Fasta, beta []Fasta) bool

AllAreEqual returns true if every entry in alpha passes IsEqual against the entry at the same index in beta. The comparison is sensitive to the order of records in each slice.

func AllAreEqualIgnoreOrder

func AllAreEqualIgnoreOrder(alpha []Fasta, beta []Fasta) bool

AllAreEqualIgnoreOrder returns true if every entry in alpha passes IsEqual against a corresponding entry in beta, regardless of the order of records in either slice.

func AllToUpper

func AllToUpper(records []Fasta)

AllToUpper converts all bases to uppercase in all sequences in a slice of fasta records.

func AlnPosToRefPos

func AlnPosToRefPos(record Fasta, AlnPos int) int

AlnPosToRefPos returns the reference position associated with a given AlnPos for an input Fasta. If the AlnPos corresponds to a gap, it gives the preceding reference position. Positions are 0-based. Consider using AlnPosToRefPosCounter instead if tracking refStart and alnStart will be beneficial, e.g. when working through entire chromosomes.

func AlnPosToRefPosCounter

func AlnPosToRefPosCounter(record Fasta, AlnPos int, refStart int, alnStart int) int

AlnPosToRefPosCounter is like AlnPosToRefPos, but can begin midway through a chromosome at a refPosition/alnPosition pair, defined with the input variables refStart and alnStart.
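
The coordinate conversion described above can be sketched on a plain string, where '-' marks a gap in the reference row. This is a standalone illustration, not the package's implementation: the function names are hypothetical, the sequence is a string rather than []dna.Base, and returning -1 for a gap that precedes the first reference base is an assumption of the sketch.

```go
package main

import "fmt"

// alnPosToRefPosFrom resumes counting from a known pair: alignment column
// alnStart holds reference position refStart. Passing (-1, -1) starts from the
// beginning of the row (a convention of this sketch, not of the package).
func alnPosToRefPosFrom(ref string, alnPos, refStart, alnStart int) int {
	refPos := refStart
	for i := alnStart + 1; i <= alnPos && i < len(ref); i++ {
		if ref[i] != '-' {
			refPos++
		}
	}
	return refPos
}

// alnPosToRefPos returns the 0-based reference position at alignment column
// alnPos. For a gap column it returns the preceding reference position.
func alnPosToRefPos(ref string, alnPos int) int {
	return alnPosToRefPosFrom(ref, alnPos, -1, -1)
}

func main() {
	ref := "AC--GT"
	fmt.Println(alnPosToRefPos(ref, 1))           // column 1 is 'C': prints 1
	fmt.Println(alnPosToRefPos(ref, 3))           // column 3 is a gap after 'C': prints 1
	fmt.Println(alnPosToRefPos(ref, 4))           // column 4 is 'G': prints 2
	fmt.Println(alnPosToRefPosFrom(ref, 5, 2, 4)) // resume at 'G': prints 3
}
```

Resuming from a known pair avoids rescanning the row from column 0 on every call, which is the benefit the counter variants are described as offering for whole chromosomes.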

func AlnPosToRefPosCounterSeq added in v1.0.1

func AlnPosToRefPosCounterSeq(record []dna.Base, AlnPos int, refStart int, alnStart int) int

AlnPosToRefPosCounterSeq is AlnPosToRefPosCounter, but the input record is just the sequence of the fasta struct.

func AssemblyStats

func AssemblyStats(infile string, countLowerAsGaps bool) (int, int, int, int, int)

AssemblyStats takes the path to a fasta file and a flag for whether lower case letters should count as assembly gaps. Five ints are returned, which encode: the N50 size, half the size of the genome, size of the genome, size of the largest contig, and the number of contigs.

func BinFasta

func BinFasta(genome []Fasta, binNum int) map[int][]Fasta

BinFasta takes in a slice of fastas and breaks it up into x number of fastas with relatively equal sequence in each, where x equals the number of bins specified.

func BinGenomeNoBreaks

func BinGenomeNoBreaks(genome []Fasta, binNum int, minSize int) map[int][]Fasta

BinGenomeNoBreaks takes an entire genome, sorted from largest to smallest contig, and divides it into bins without splitting any contig: large contigs become a fasta on their own, while smaller contigs are combined into a single fasta. The user specifies the number of bins. To combine any contigs, the genome must have more contigs than bins; if each contig is to get its own record, the number of bins must equal the number of contigs. Each empty bin is first filled with the next contig encountered; once contig number binNum+1 is reached, contigs are added to the smallest of those bins. The minSize option lets the user specify a minimum length of sequence to go into each bin; in that case the number of bins returned depends on minSize and binNum is ignored.

func CalculateN50

func CalculateN50(contigList []int, halfGenome int) int

CalculateN50 takes a slice of contig lengths and the size of half the genome. It returns the N50 size.
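
As a reminder of what the statistic means, here is a minimal standalone sketch of an N50 calculation. It is not the package's code: the name n50 is hypothetical, and both the sorting step and the use of ">=" at the half-genome threshold are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// n50 returns the length of the contig at which the running total of contig
// lengths, taken from largest to smallest, first reaches halfGenome.
func n50(contigs []int, halfGenome int) int {
	sorted := append([]int(nil), contigs...) // copy so the caller's slice is not reordered
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	sum := 0
	for _, c := range sorted {
		sum += c
		if sum >= halfGenome {
			return c
		}
	}
	return 0 // no contig reaches halfGenome, e.g. empty input
}

func main() {
	contigs := []int{20, 80, 50, 30, 20}
	total := 0
	for _, c := range contigs {
		total += c
	}
	// Sorted: 80, 50, 30, 20, 20. The running total reaches 100 at the 50 contig.
	fmt.Println(n50(contigs, total/2)) // prints 50
}
```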

func GoReadToChan

func GoReadToChan(filename string) <-chan Fasta

GoReadToChan reads fasta records from an input filename and returns a channel of Fasta structs.

func IsEqual

func IsEqual(alpha Fasta, beta Fasta) bool

IsEqual returns true if two input Fasta structs have an equal name and sequence.

func IsFasta

func IsFasta(filename string) bool

IsFasta returns true if the input filename has a fasta file extension. Input filename may have a .gz suffix.

func MakeContigList

func MakeContigList(records []Fasta, countLowerAsGaps bool) []int

MakeContigList takes a slice of fasta sequences and a flag for whether lower case letters should count as gaps. A slice of contig sizes is the return value.

func NumSegregatingSites

func NumSegregatingSites(aln []Fasta) int

NumSegregatingSites returns the number of sites in an alignment block that are segregating.
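
A segregating site is an alignment column in which the sequences do not all agree. The sketch below counts such columns over plain strings. It is standalone and hypothetical: it compares characters literally, so how the package itself treats gaps, Ns, or lowercase bases is not represented here.

```go
package main

import "fmt"

// numSegregatingSites counts alignment columns in which at least one row
// differs from the first row. Rows are assumed to have equal length.
func numSegregatingSites(aln []string) int {
	if len(aln) == 0 {
		return 0
	}
	count := 0
	for col := 0; col < len(aln[0]); col++ {
		for row := 1; row < len(aln); row++ {
			if aln[row][col] != aln[0][col] {
				count++
				break // one disagreement is enough for this column
			}
		}
	}
	return count
}

func main() {
	aln := []string{
		"ACGT",
		"ACGA",
		"TCGT",
	}
	// Columns 0 (A/A/T) and 3 (T/A/T) segregate.
	fmt.Println(numSegregatingSites(aln)) // prints 2
}
```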

func PairwiseMutationDistanceInRange

func PairwiseMutationDistanceInRange(seq1 Fasta, seq2 Fasta, alnStart int, alnEnd int) int

PairwiseMutationDistanceInRange calculates the number of mutations between two Fasta sequences from a specified start and end alignment column. Segregating sites are counted as 1, as are INDELs regardless of length.

func PairwiseMutationDistanceReferenceWindow

func PairwiseMutationDistanceReferenceWindow(seq1 Fasta, seq2 Fasta, alnStart int, windowSize int) (int, bool, int)

PairwiseMutationDistanceReferenceWindow takes two input fasta sequences and calculates the number of mutations in a reference window of a given size. Segregating sites are counted as 1, as are INDELs regardless of length. alnStart indicates the beginning alignment column for distance evaluation, and windowSize is the number of reference bases to compare. There are three return values: the first is the pairwise mutation distance; the second is reachedEnd, a bool that is true for incomplete windows; the third is alignmentEnd, the last alignment column evaluated.

func ReadToChan

func ReadToChan(file *fileio.EasyReader, data chan<- Fasta, wg *sync.WaitGroup)

ReadToChan is a helper function of GoReadToChan.

func ReadToString

func ReadToString(filename string) map[string]string

ReadToString reads a fasta file to a map of sequence strings keyed by the record name.

func RefPosToAlnPos

func RefPosToAlnPos(record Fasta, RefPos int) int

RefPosToAlnPos returns the alignment position associated with a given reference position for an input MultiFa. Positions are 0-based.

func RefPosToAlnPosCounter

func RefPosToAlnPosCounter(record Fasta, RefPos int, refStart int, alnStart int) int

RefPosToAlnPosCounter is like RefPosToAlnPos, but can begin midway through a chromosome at a refPosition/alnPosition pair, defined by the input variables refStart and alnStart.
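
To illustrate why resuming from a known pair helps, the sketch below converts an increasing list of reference positions in a single pass over a gapped row. It is a standalone illustration: refPosToAlnPosFrom is a hypothetical name, the row is a string with '-' for gaps, and the conventions (the starting pair must point at a real base, -1 when the position is not found) are assumptions of the sketch, not documented package behavior.

```go
package main

import "fmt"

// refPosToAlnPosFrom returns the 0-based alignment column holding reference
// base refPos, resuming from a known pair: reference base refStart sits at
// column alnStart. It returns -1 if the row ends before refPos is reached.
func refPosToAlnPosFrom(ref string, refPos, refStart, alnStart int) int {
	seen := refStart
	for i := alnStart; i < len(ref); i++ {
		if i > alnStart && ref[i] != '-' {
			seen++ // each non-gap column advances the reference position
		}
		if ref[i] != '-' && seen == refPos {
			return i
		}
	}
	return -1
}

func main() {
	ref := "AC--GT"
	refStart, alnStart := 0, 0 // reference base 0 ('A') is at column 0
	for _, refPos := range []int{0, 2, 3} {
		alnPos := refPosToAlnPosFrom(ref, refPos, refStart, alnStart)
		fmt.Println(refPos, "->", alnPos)
		refStart, alnStart = refPos, alnPos // resume from here next time
	}
}
```

With the pair carried forward, each call only scans the columns between the previous answer and the next one.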

func ReverseComplement

func ReverseComplement(record Fasta)

ReverseComplement the sequence in a fasta record.
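
For reference, reverse complementing reverses the sequence and swaps each base for its partner (A with T, C with G). The standalone sketch below does this in place on a []byte, which mirrors the fact that the package function has no return value; the byte representation, the helper names, and the handling of lowercase and non-ACGT characters are assumptions of the sketch.

```go
package main

import "fmt"

// complement maps a base to its partner, preserving case. Other bytes, such
// as N or gaps, pass through unchanged.
func complement(b byte) byte {
	switch b {
	case 'A':
		return 'T'
	case 'T':
		return 'A'
	case 'C':
		return 'G'
	case 'G':
		return 'C'
	case 'a':
		return 't'
	case 't':
		return 'a'
	case 'c':
		return 'g'
	case 'g':
		return 'c'
	}
	return b
}

// reverseComplement rewrites seq in place by swapping complemented bases from
// both ends toward the middle.
func reverseComplement(seq []byte) {
	for i, j := 0, len(seq)-1; i <= j; i, j = i+1, j-1 {
		seq[i], seq[j] = complement(seq[j]), complement(seq[i])
	}
}

func main() {
	seq := []byte("AACGtN")
	reverseComplement(seq)
	fmt.Println(string(seq)) // prints "NaCGTT"
}
```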

func ReverseComplementAll

func ReverseComplementAll(records []Fasta)

ReverseComplementAll sequences in a slice of fasta records.

func ScanN added in v1.0.1

func ScanN(aln []Fasta, queryName string) [][]int

ScanN takes in a multiFa alignment, scans the user-specified sequence for a user-specified pattern (N for now), and returns the positions in reference sequence coordinates.

func SeekByIndex

func SeekByIndex(sr *Seeker, chr, start, end int) ([]dna.Base, error)

SeekByIndex returns a portion of a fasta sequence identified by chromosome index (order in file). Input start and end should be 0-based start-closed end-open.

func SeekByName

func SeekByName(sr *Seeker, chr string, start, end int) ([]dna.Base, error)

SeekByName returns a portion of a fasta sequence identified by chromosome name. Input start and end should be 0-based start-closed end-open.

func SortByName

func SortByName(seqs []Fasta)

SortByName sorts fasta records lexicographically.

func SortBySeq

func SortBySeq(seqs []Fasta)

SortBySeq sorts fasta records by sequence.

func ToChromInfo

func ToChromInfo(records []Fasta) []chromInfo.ChromInfo

ToChromInfo converts a []Fasta into a []ChromInfo. Useful for applications that do not require the entire fasta sequence to be kept in memory, but just the name, size, and order of fasta records.

func ToMap

func ToMap(ref []Fasta) map[string][]dna.Base

ToMap converts a slice of fasta records (e.g. the output of the Read function) to a map of sequences keyed by sequence name.

func ToUpper

func ToUpper(fa Fasta)

ToUpper converts all bases in a fasta sequence to uppercase.

func Write

func Write(filename string, records []Fasta)

Write a fasta to input filename. Output fastas have line length of 50.

func WriteAssemblyStats

func WriteAssemblyStats(assemblyName string, outfile string, N50 int, halfGenome int, genomeLength int, largestContig int, numContigs int)

WriteAssemblyStats takes the name of an assembly, a path to an output file, and stats for: the N50 size, half the size of the genome, size of the genome, size of the largest contig, and the number of contigs. The stats are written to the output file with human-readable labels.

func WriteFasta

func WriteFasta(file io.Writer, rec Fasta, lineLength int)

WriteFasta writes a single fasta record to an io.Writer.

func WriteToFileHandle

func WriteToFileHandle(file io.Writer, records []Fasta, lineLength int)

WriteToFileHandle writes a slice of fasta records to a given io.Writer instead of creating a new io.Writer as is done in the Write function.

Types

type Fasta

type Fasta struct {
	Name string
	Seq  []dna.Base
}

Fasta stores the name and sequence of each '>' delimited record in a fasta file.

func Copy

func Copy(f Fasta) Fasta

Copy returns a memory copy of an input fasta struct.

func CopyAll

func CopyAll(f []Fasta) []Fasta

CopyAll returns a memory copy of a slice of input fasta structs.

func CopySubset

func CopySubset(records []Fasta, start int, end int) []Fasta

CopySubset returns a copy of a multiFa from a specified start and end position.

func CreateAllGaps

func CreateAllGaps(name string, numGaps int) Fasta

CreateAllGaps creates a fasta record where the sequence is all gaps of length numGaps.

func CreateAllNs

func CreateAllNs(name string, numN int) Fasta

CreateAllNs creates a fasta record where the sequence is all Ns of length numN.

func DistColumn

func DistColumn(records []Fasta) []Fasta

DistColumn returns alignment columns with no gaps or lowercase letters.

func Extract

func Extract(f Fasta, start int, end int, name string) Fasta

Extract will subset a sequence in a fasta file and return a new fasta record with the same name and a subset of the sequence. Input start and end are left-closed right-open.

func ExtractMulti

func ExtractMulti(records []Fasta, start int, end int) []Fasta

ExtractMulti extracts a subsequence from a fasta file for every entry in a multiFa alignment.

func NextFasta

func NextFasta(file *fileio.EasyReader) (Fasta, bool)

NextFasta reads a single fasta record from an input EasyReader. Returns true when the file is fully read.

func NextFastaForced

func NextFastaForced(file *fileio.EasyReader) (Fasta, bool)

NextFastaForced functions identically to NextFasta, but any invalid characters in the sequence will be masked to N.

func Read

func Read(filename string) []Fasta

Read in a fasta file to a []Fasta struct. All sequence records must be preceded by a name line starting with '>'. Each record must have a unique sequence name.

func ReadForced

func ReadForced(filename string) []Fasta

ReadForced functions identically to Read, but any invalid characters in the sequence will be masked to N.

func Remove

func Remove(slice []Fasta, i int) []Fasta

Remove fasta record with index i from slice of fasta.

func RemoveGaps

func RemoveGaps(records []Fasta) []Fasta

RemoveGaps from all fasta records in a slice.

func RemoveMissingMult

func RemoveMissingMult(records []Fasta) []Fasta

RemoveMissingMult removes any entries comprised only of gaps in a multiple alignment block.

func SegregatingSites

func SegregatingSites(aln []Fasta) []Fasta

SegregatingSites takes in a multiFa alignment and returns a new alignment containing only the columns with segregating sites.

func TrimName

func TrimName(fa Fasta) Fasta

TrimName retains the first space delimited field of a fasta name.
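
Fasta name lines often carry a description after the identifier, and trimming keeps only the identifier. The sketch below shows the idea on a plain string. It is standalone and hypothetical: it splits on the first space character only, so whether the package also treats tabs as delimiters is not covered.

```go
package main

import (
	"fmt"
	"strings"
)

// trimName keeps only the text before the first space in a fasta name.
func trimName(name string) string {
	if i := strings.IndexByte(name, ' '); i >= 0 {
		return name[:i]
	}
	return name // no space: the name is already a single field
}

func main() {
	fmt.Println(trimName("chr1 Homo sapiens chromosome 1")) // prints "chr1"
	fmt.Println(trimName("chrM"))                           // prints "chrM"
}
```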

type FastaMap

type FastaMap map[string][]dna.Base

FastaMap stores fasta sequences as a map keyed by the sequence name instead of a slice. This allows for easy fasta lookups of chromosomes provided by other files (e.g. BED files). A FastaMap can be generated using the ToMap function (e.g. fasta.ToMap(fasta.Read("filename"))).
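
The lookup pattern described above might look like the following standalone sketch. The record type and toMap function are simplified, hypothetical stand-ins (strings instead of []dna.Base), and the behavior on duplicate names is an assumption of the sketch.

```go
package main

import "fmt"

// record is a simplified stand-in for the package's Fasta struct.
type record struct {
	Name string
	Seq  string
}

// toMap keys each sequence by its record name. If names repeat, later records
// overwrite earlier ones in this sketch.
func toMap(records []record) map[string]string {
	m := make(map[string]string, len(records))
	for _, r := range records {
		m[r.Name] = r.Seq
	}
	return m
}

func main() {
	genome := toMap([]record{
		{Name: "chr1", Seq: "ACGTACGT"},
		{Name: "chr2", Seq: "GGGCCC"},
	})
	// A BED-style interval: chromosome name, 0-based start, exclusive end.
	chrom, start, end := "chr2", 1, 4
	if seq, ok := genome[chrom]; ok {
		fmt.Println(seq[start:end]) // prints "GGC"
	}
}
```

A map turns each by-name lookup into a single hash access, where a slice would need a linear search for every interval.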

type Index

type Index struct {
	// contains filtered or unexported fields
}

Index stores the byte offset for each fasta sequence, allowing for efficient random access.

func CreateIndex

func CreateIndex(filename string) Index

CreateIndex for a fasta file for efficient random access.

func (Index) String

func (idx Index) String() string

String method for Index enables easy writing with the fmt package.

type Seeker

type Seeker struct {
	// contains filtered or unexported fields
}

Seeker enables random access of fasta sequences using a pre-computed index.

func NewSeeker

func NewSeeker(fasta, index string) *Seeker

NewSeeker opens a fasta file and an fai index file and enables seek functionality so the entire fasta file does not need to be present in memory.

If you input an empty string for 'index', NewSeeker tries to find the index file as 'fasta'.fai.

func (*Seeker) Close

func (rs *Seeker) Close() error

Close the Seeker.
