genome

package
v0.5.0
Published: Dec 18, 2024 License: MIT Imports: 12 Imported by: 0

Documentation

Constants

This section is empty.

Variables

var BufferSize = 65536 // os.Getpagesize()

BufferSize is the size of the reading and writing buffers.

var ErrBrokenFile = errors.New("genome data: broken file")

ErrBrokenFile means the file is not complete.

var ErrEmptySeq = errors.New("genome data: empty seq")

ErrEmptySeq means the sequence is empty.

var ErrInvalidFileFormat = errors.New("genome data: invalid binary format")

ErrInvalidFileFormat means invalid file format.

var ErrInvalidTwoBitData = errors.New("genome data: invalid two-bit data")

ErrInvalidTwoBitData means the length of the two-bit sequence slice does not match the number of bases.

var ErrVersionMismatch = errors.New("genome data: version mismatch")

ErrVersionMismatch means the version of the files does not match the version of the program.

var GenomeIndexFileExt = ".idx"

GenomeIndexFileExt is the file extension of the genome data index file.

var Magic = [8]byte{'.', 'g', 'e', 'n', 'o', 'm', 'e', 's'}

Magic number for checking file format

var MagicIdx = [8]byte{'.', 'g', 'e', 'n', 'o', 'm', 'e', 'i'}

Magic number for the index file
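
The role of a magic number can be illustrated with a minimal sketch. It assumes the magic number occupies the first 8 bytes of the file, which this page does not state. `checkMagic` and `checkHeader` are hypothetical helpers, and the lowercase variables are local stand-ins for the package variables, not the package's own code.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// Local stand-ins mirroring the package variables documented above.
var magic = [8]byte{'.', 'g', 'e', 'n', 'o', 'm', 'e', 's'}
var errInvalidFileFormat = errors.New("genome data: invalid binary format")
var errBrokenFile = errors.New("genome data: broken file")

// checkMagic (hypothetical) reads the first 8 bytes from r and compares
// them with the expected magic number.
func checkMagic(r io.Reader, want [8]byte) error {
	var got [8]byte
	if _, err := io.ReadFull(r, got[:]); err != nil {
		// io.ReadFull reports io.EOF or io.ErrUnexpectedEOF on short input.
		if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
			return errBrokenFile
		}
		return err
	}
	if got != want { // arrays are comparable in Go
		return errInvalidFileFormat
	}
	return nil
}

// checkHeader (hypothetical) is a convenience wrapper for in-memory data.
func checkHeader(data []byte) error {
	return checkMagic(bytes.NewReader(data), magic)
}

func main() {
	fmt.Println(checkHeader([]byte(".genomes\x00\x01"))) // matching magic
	fmt.Println(checkHeader([]byte(".genomei\x00\x01"))) // index-file magic
	fmt.Println(checkHeader([]byte(".gen")))             // truncated header
}
```

Callers of the real package would presumably compare returned errors against ErrInvalidFileFormat or ErrBrokenFile with errors.Is.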

var MainVersion uint8 = 0

MainVersion is used for checking compatibility.

var MinorVersion uint8 = 1

MinorVersion is the minor version number; it is less important than MainVersion.

var PoolGenome = &sync.Pool{New: func() interface{} {
	return &Genome{
		ID:  make([]byte, 0, 128),
		Seq: make([]byte, 0, 20<<20),

		GenomeSize: 0,
		SeqSizes:   make([]int, 0, 128),

		Done: make(chan int),
	}
}}

PoolGenome is the object pool for Genome
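
The get, use, recycle cycle behind an object pool like this one can be sketched as follows. `record`, `pool`, `getRecord`, and `recycle` are hypothetical stand-ins for Genome, PoolGenome, and RecycleGenome. Resetting the object before putting it back is an assumption about how recycling works, not a description of the package's code.

```go
package main

import (
	"fmt"
	"sync"
)

// record is a simplified stand-in for Genome with only two fields.
type record struct {
	ID  []byte
	Seq []byte
}

// reset truncates the slices but keeps their capacity for reuse.
func (r *record) reset() {
	r.ID = r.ID[:0]
	r.Seq = r.Seq[:0]
}

// pool pre-allocates buffers in New, as PoolGenome does for Genome.
var pool = &sync.Pool{New: func() interface{} {
	return &record{
		ID:  make([]byte, 0, 128),
		Seq: make([]byte, 0, 1<<10),
	}
}}

// getRecord fetches an object from the pool and fills it.
func getRecord(id, seq string) *record {
	r := pool.Get().(*record)
	r.ID = append(r.ID, id...)
	r.Seq = append(r.Seq, seq...)
	return r
}

// recycle resets the object and hands it back to the pool.
func recycle(r *record) {
	r.reset()
	pool.Put(r)
}

func main() {
	r := getRecord("chr1", "ACGT")
	fmt.Printf("%s %s\n", r.ID, r.Seq)
	recycle(r) // r must not be used after this point
}
```

The point of the pattern is that large buffers, such as the 20 MiB Seq capacity above, are allocated once and reused instead of being reallocated for every genome.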

Functions

func RecycleGenome

func RecycleGenome(g *Genome)

RecycleGenome recycles a Genome.

func RecycleTwoBit

func RecycleTwoBit(b2 *[]byte)

RecycleTwoBit recycles the 2bit-packed sequence.

func Seq2TwoBit

func Seq2TwoBit(s []byte) *[]byte

Seq2TwoBit converts a DNA sequence to 2bit-packed sequence.

func TwoBit2Seq

func TwoBit2Seq(b2 []byte, bases int) ([]byte, error)

TwoBit2Seq converts a 2bit-packed sequence to DNA.
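
The idea behind 2-bit packing can be shown with a minimal sketch. The code table (A=0, C=1, G=2, T=3), the bit order within a byte, and the handling of other characters are all assumptions of this sketch, since this page does not document the package's actual layout. `pack` and `unpack` are hypothetical names, and unlike Seq2TwoBit they work with plain slices rather than pooled `*[]byte` values.

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidTwoBit = errors.New("invalid two-bit data")

// Assumed code table for this sketch: A=0, C=1, G=2, T=3.
var code2base = [4]byte{'A', 'C', 'G', 'T'}

// base2code maps a base to its 2-bit code; anything else falls back to 0 (A).
func base2code(b byte) byte {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't':
		return 3
	}
	return 0
}

// pack stores four bases per byte, first base in the highest two bits.
func pack(seq []byte) []byte {
	out := make([]byte, (len(seq)+3)/4)
	for i, b := range seq {
		shift := uint(6 - 2*(i%4))
		out[i/4] |= base2code(b) << shift
	}
	return out
}

// unpack restores the given number of bases and rejects a slice whose
// length does not match that number.
func unpack(b2 []byte, bases int) ([]byte, error) {
	if bases < 0 || len(b2) != (bases+3)/4 {
		return nil, errInvalidTwoBit
	}
	seq := make([]byte, bases)
	for i := range seq {
		shift := uint(6 - 2*(i%4))
		seq[i] = code2base[(b2[i/4]>>shift)&3]
	}
	return seq, nil
}

func main() {
	packed := pack([]byte("ACGTAC")) // 6 bases fit in 2 bytes
	fmt.Printf("% x\n", packed)
	seq, err := unpack(packed, 6)
	fmt.Println(string(seq), err)
}
```

The number of bases has to be passed separately because the last byte may be only partly filled. That is why TwoBit2Seq takes a `bases` argument and why a length mismatch is reported as an error.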

Types

type Genome

type Genome struct {
	ID  []byte // genome ID
	Seq []byte // sequence, bases

	GenomeSize int       // bases of all sequences
	Len        int       // length of concatenated sequences
	NumSeqs    int       // number of sequences
	SeqSizes   []int     // sizes of sequences
	SeqIDs     []*[]byte // IDs of all sequences

	// only used in index building
	Kmers     *[]uint64 // lexichash mask result
	Locses    *[][]int  // lexichash mask result
	TwoBit    *[]byte   // bit-packed sequence
	StartTime time.Time

	GenomeIdx int // only for collecting Batch+Genome Index of split genome chunks, not saved in index

	// seed positions to write to the file
	Locs       *[]uint32
	ExtraKmers *[]*[]uint64 // 3*n. (kmer, loc)

	// for making sure both genome and key-value data have been written
	Done chan int

	// offset of sequence, only used when calling SubSeq more than once
	SeqOffSet int64
}

Genome represents either a reference sequence to insert or a matched subsequence.

func (*Genome) Reset

func (r *Genome) Reset()

Reset resets the Genome.

func (Genome) String

func (r Genome) String() string

type Reader

type Reader struct {
	Index []uint64 // index data of all genome records, (offset, nbases)
	// contains filtered or unexported fields
}

Reader is for fast extracting of subsequence of any sequence in the data file.
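
One way to read the `(offset, nbases)` comment on Index is as a flat slice of pairs. The sketch below assumes that layout, with `index[2*i]` as the offset and `index[2*i+1]` as the number of bases of genome `i`. This page does not confirm the layout, and `entry` and `lookup` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical view of one (offset, nbases) pair.
type entry struct {
	Offset uint64 // assumed: byte offset of the record in the data file
	NBases uint64 // assumed: number of bases in the record
}

// lookup assumes the flat slice stores pairs back to back.
func lookup(index []uint64, i int) (entry, error) {
	if len(index)%2 != 0 {
		return entry{}, errors.New("index length is not a multiple of 2")
	}
	if i < 0 || i >= len(index)/2 {
		return entry{}, fmt.Errorf("genome index %d out of range [0, %d)", i, len(index)/2)
	}
	return entry{Offset: index[2*i], NBases: index[2*i+1]}, nil
}

func main() {
	index := []uint64{16, 4000, 1040, 5200} // two made-up records
	e, err := lookup(index, 1)
	fmt.Println(e.Offset, e.NBases, err)
}
```

With such an index, fetching a genome needs one seek to the stored offset rather than a scan of the file, which fits the description of Reader as a fast subsequence extractor.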

func NewReader

func NewReader(file string) (*Reader, error)

NewReader returns a reader from a genome file. The reader is recycled after calling Close().

func (*Reader) Close

func (r *Reader) Close() error

Close closes and recycles the reader.

func (*Reader) GenomeInfo added in v0.4.0

func (r *Reader) GenomeInfo(idx int) (*Genome, error)

GenomeInfo returns the genome information of a genome (idx is 0-based). Please call RecycleGenome() after using the result.

func (*Reader) Seq

func (r *Reader) Seq(idx int) (*Genome, error)

Seq returns the sequence of the genome with the given index (0-based).

func (*Reader) SubSeq

func (r *Reader) SubSeq(idx int, start int, end int) (*Genome, error)

SubSeq returns the subsequence of a genome (idx is 0-based), from start to end (both are 0-based and included). Please call RecycleGenome() after using the result.

func (*Reader) SubSeq2

func (r *Reader) SubSeq2(idx int, seqid []byte, start int, end int) (*Genome, int, error)

SubSeq2 returns the subsequence of one genome (idx is 0-based), from start to end (both are 0-based and included). It also returns the actual end position (0-based). Please call RecycleGenome() after using the result.
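
The coordinate convention (0-based, both ends included) differs from Go's half-open slices, so a small sketch may help. `subSeq` is a hypothetical function operating on an in-memory sequence. Clamping an out-of-range end is an assumption, inferred from SubSeq2 returning an "actual end position", and is not confirmed behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// subSeq returns seq[start..end] with both ends 0-based and inclusive.
// An end beyond the last base is clamped (an assumption of this sketch),
// and the end position actually used is returned as well.
func subSeq(seq []byte, start, end int) ([]byte, int, error) {
	if len(seq) == 0 {
		return nil, 0, errors.New("empty sequence")
	}
	if start < 0 || start > end || start >= len(seq) {
		return nil, 0, fmt.Errorf("invalid range %d-%d for length %d", start, end, len(seq))
	}
	if end > len(seq)-1 {
		end = len(seq) - 1
	}
	// An inclusive end maps to Go's half-open slice bound end+1.
	return seq[start : end+1], end, nil
}

func main() {
	seq := []byte("ACGTACGTAC") // 10 bases, positions 0..9

	s, end, _ := subSeq(seq, 2, 5)
	fmt.Println(string(s), end) // 4 bases: positions 2, 3, 4, 5

	s, end, _ = subSeq(seq, 7, 100)
	fmt.Println(string(s), end) // end clamped to the last position
}
```

A range from start to end therefore covers `end - start + 1` bases, not `end - start`.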

func (*Reader) SubSeq3 added in v0.5.0

func (r *Reader) SubSeq3(idx int, start int, end int, g *Genome) (*Genome, error)

SubSeq3 returns the subsequence of a genome (idx is 0-based), from start to end (both are 0-based and included). Please call RecycleGenome() after using the result.

type Writer

type Writer struct {
	// contains filtered or unexported fields
}

Writer saves a list of DNA sequences into 2bit-encoded format, along with its genome information.

func NewWriter

func NewWriter(file string, batch uint32) (*Writer, error)

NewWriter creates a new Writer. Batch is the batch id for this data file.

func (*Writer) Close

func (w *Writer) Close() error

Close writes the index file and finishes the writing.

func (*Writer) Write

func (w *Writer) Write(s *Genome) error

Write writes one genome. After calling this, you need to call RecycleGenome to recycle the genome.
