Documentation ¶
Index ¶
- Constants
- Variables
- func CleanupDB(db *DB, pool *redCompressPool)
- func Exec(cmd *exec.Cmd) error
- func IsLowComplexity(residues []byte, offset, window int) bool
- func PrintFlagDefaults()
- func ReadOriginalSeqs(fileName string, ignore []byte) (chan ReadOriginalSeq, error)
- func Reduce(seq []byte) []byte
- func SeqIdentity(seq1, seq2 []byte) int
- func StartCompressReducedWorkers(db *DB) redCompressPool
- func Translate(sequence []byte) [][]byte
- func TranslateQuerySeqs(query *bytes.Reader, action SearchOperator) (*bytes.Reader, error)
- func Vprint(s string)
- func Vprintf(format string, v ...interface{})
- func Vprintln(s string)
- type CoarseDB
- func (coarsedb *CoarseDB) Add(oseq []byte) (int, *CoarseSeq)
- func (coarsedb *CoarseDB) CoarseSeqGet(i uint) *CoarseSeq
- func (coarsedb *CoarseDB) Expand(comdb *CompressedDB, id, start, end int) ([]OriginalSeq, error)
- func (coarsedb *CoarseDB) LoadSeqs() (err error)
- func (coarsedb *CoarseDB) NumSequences() int
- func (coarsedb *CoarseDB) ReadCoarseSeq(id int) (*CoarseSeq, error)
- type CoarseSeq
- type CompressedDB
- func (comdb *CompressedDB) NumSequences() int
- func (comdb *CompressedDB) ReadNextSeq(coarsedb *CoarseDB, seqFile io.Reader, orgSeqId int) (OriginalSeq, error)
- func (comdb *CompressedDB) ReadSeq(coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)
- func (comdb *CompressedDB) ReadSeqFromCompressedSource(coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)
- func (comdb *CompressedDB) SeqGet(coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)
- func (comdb *CompressedDB) Write(cseq CompressedSeq)
- type CompressedSeq
- type DB
- type DBConf
- type EditScript
- type LinkToCoarse
- type LinkToCompressed
- type OriginalSeq
- type ReadOriginalSeq
- type ReducedSeq
- type SearchOperator
- type SeedLoc
- type Seeds
- type Sequence
Constants ¶
const (
	FileCoarseFasta      = "coarse.fasta"
	FileCoarseFastaIndex = "coarse.fasta.index"
	FileCoarseLinks      = "coarse.links"
	FileCoarsePlainLinks = "coarse.links.plain"
	FileCoarseLinksIndex = "coarse.links.index"
	FileCoarseSeeds      = "coarse.seeds"
	FileCoarsePlainSeeds = "coarse.seeds.plain"
)
Hard-coded file names for different pieces of a mica database.
const (
	FileCompressed = "compressed"
	FileIndex      = "compressed.index"
)
const (
	FileParams      = "params"
	FileBlastCoarse = "blastdb-coarse"
	FileDmndCoarse  = "blastdb-dmnd"
	FileBlastFine   = "blastdb-fine"
)
const (
	ModSubstitution = iota
	ModDeletion
	ModInsertion
)
Variables ¶
var (
	DefaultQueryDBConf = &DBConf{
		MinMatchLen:         40,
		MatchKmerSize:       4,
		GappedWindowSize:    25,
		UngappedWindowSize:  10,
		ExtSeqIdThreshold:   60,
		MatchSeqIdThreshold: 70,
		MatchExtend:         30,
		MapSeedSize:         6,
		ExtSeedSize:         0,
		LowComplexity:       10,
		SeedLowComplexity:   6,
		SavePlain:           false,
		ReadOnly:            true,
		SaveCompressed:      false,
		BlastMakeBlastDB:    "makeblastdb",
		Dmnd:                "diamond",
		BlastDBSize:         0,
	}
	DefaultDBConf = &DBConf{
		MinMatchLen:         40,
		MatchKmerSize:       4,
		GappedWindowSize:    25,
		UngappedWindowSize:  10,
		ExtSeqIdThreshold:   60,
		MatchSeqIdThreshold: 70,
		MatchExtend:         30,
		MapSeedSize:         6,
		ExtSeedSize:         0,
		LowComplexity:       10,
		SeedLowComplexity:   6,
		SavePlain:           false,
		ReadOnly:            true,
		SaveCompressed:      false,
		BlastMakeBlastDB:    "makeblastdb",
		Dmnd:                "diamond",
		BlastDBSize:         0,
	}
)
var (
	SeedAlphaSize        = len(blosum.Alphabet62)
	SeedAlphaNums        = make([]int, 26)
	ReverseSeedAlphaNums = make([]byte, 26)
)
SeedAlphaNums is a map that assigns *valid* amino acid residues contiguous values so that base-N arithmetic can be performed on them (where N = SeedAlphaSize). Invalid amino acid residues map to -1 and will produce a panic.
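To make the base-N arithmetic concrete, here is a minimal sketch of hashing a K-mer into a table index. The names `alphabet`, `alphaNums` and `hashKmer` are hypothetical, the 20-letter alphabet is only a stand-in for `blosum.Alphabet62`, and the most-significant-residue-first ordering is an assumption rather than something this package documents.

```go
package main

import "fmt"

// alphabet is a hypothetical stand-in for blosum.Alphabet62.
const alphabet = "ACDEFGHIKLMNPQRSTVWY"

var (
	alphaSize = len(alphabet)
	// alphaNums maps 'A'..'Z' to a contiguous value, or -1 if invalid.
	alphaNums = make([]int, 26)
)

func init() {
	for i := range alphaNums {
		alphaNums[i] = -1
	}
	for i := 0; i < len(alphabet); i++ {
		alphaNums[alphabet[i]-'A'] = i
	}
}

// hashKmer interprets kmer as a number written in base alphaSize.
// It panics on residues outside the alphabet.
func hashKmer(kmer []byte) int {
	hash := 0
	for _, r := range kmer {
		if r < 'A' || r > 'Z' || alphaNums[r-'A'] < 0 {
			panic(fmt.Sprintf("invalid residue %q", r))
		}
		hash = hash*alphaSize + alphaNums[r-'A']
	}
	return hash
}

func main() {
	fmt.Println(hashKmer([]byte("AAA")), hashKmer([]byte("AC")), hashKmer([]byte("YY")))
}
```

With this scheme every K-mer of length K lands in the range 0 to alphaSize^K - 1, which is why a seed table can be a plain slice of that length.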
var (
Verbose = false
)
Functions ¶
func CleanupDB ¶
func CleanupDB(db *DB, pool *redCompressPool)
When the program ends (either by SIGTERM or when all of the input sequences are compressed), CleanupDB is executed. It writes all CPU/memory profiles if they're enabled, waits for the compression workers to finish, saves the database to disk and closes all file handles.
func Exec ¶
Exec runs a command created with 'Command' in the os/exec package, and converts anything reported to stderr to a Go error value.
Note that if the command returns successfully, the error is guaranteed to be nil.
func IsLowComplexity ¶
IsLowComplexity detects whether the residue at the given offset is in a region of low complexity, where low complexity is defined as a window where every residue is the same (no variation in composition).
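The definition can be sketched as follows. The helper `lowComplexity` is hypothetical, and it assumes the window begins at `offset`; the documentation does not say whether the real function anchors the window at the offset, centers it, or handles sequence edges differently.

```go
package main

import "fmt"

// lowComplexity reports whether residues[offset:offset+window] consists of
// a single repeated residue. Windows that do not fit are treated as not
// low complexity (an assumption of this sketch).
func lowComplexity(residues []byte, offset, window int) bool {
	if window <= 0 || offset < 0 || offset+window > len(residues) {
		return false
	}
	first := residues[offset]
	for _, r := range residues[offset+1 : offset+window] {
		if r != first {
			return false
		}
	}
	return true
}

func main() {
	seq := []byte("AAAAAGHK")
	fmt.Println(lowComplexity(seq, 0, 5)) // window "AAAAA"
	fmt.Println(lowComplexity(seq, 3, 5)) // window "AAGHK"
}
```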
func PrintFlagDefaults ¶
func PrintFlagDefaults()
func ReadOriginalSeqs ¶
func ReadOriginalSeqs(fileName string, ignore []byte) (chan ReadOriginalSeq, error)
ReadOriginalSeqs reads a FASTA formatted file and returns a channel that each new sequence is sent to.
func SeqIdentity ¶
SeqIdentity computes the sequence identity of two byte slices. The number returned is an integer in the range 0-100, inclusive. SeqIdentity returns zero if the lengths of both seq1 and seq2 are zero.
If the lengths of seq1 and seq2 are not equal, SeqIdentity will panic.
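A minimal sketch of this contract is shown below. The function `seqIdentity` is hypothetical; it assumes identity is the percentage of positions holding the same byte, truncated to an integer, which the documentation does not spell out.

```go
package main

import "fmt"

// seqIdentity returns the percentage (0-100) of positions at which seq1 and
// seq2 hold the same byte. It returns 0 for two empty slices and panics if
// the lengths differ.
func seqIdentity(seq1, seq2 []byte) int {
	if len(seq1) != len(seq2) {
		panic(fmt.Sprintf("sequence lengths differ: %d != %d", len(seq1), len(seq2)))
	}
	if len(seq1) == 0 {
		return 0
	}
	same := 0
	for i := range seq1 {
		if seq1[i] == seq2[i] {
			same++
		}
	}
	// Integer division truncates; rounding behavior is an assumption.
	return same * 100 / len(seq1)
}

func main() {
	fmt.Println(seqIdentity([]byte("ACDE"), []byte("ACDF")))
	fmt.Println(seqIdentity(nil, nil))
}
```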
func StartCompressReducedWorkers ¶
func StartCompressReducedWorkers(db *DB) redCompressPool
StartCompressReducedWorkers initializes a pool of compression workers.
The redCompressPool returned can be used to compress sequences concurrently.
func TranslateQuerySeqs ¶
Types ¶
type CoarseDB ¶
type CoarseDB struct {
	Seqs  []*CoarseSeq
	Seeds Seeds

	// File pointers to each file in the "coarse" part of a mica database.
	FileFasta      *os.File
	FileFastaIndex *os.File
	FileSeeds      *os.File
	FileLinks      *os.File
	FileLinksIndex *os.File
	// contains filtered or unexported fields
}
CoarseDB represents a set of unique sequences that comprise the "coarse" database. Sequences in the coarse database, combined with information in the compressed database, are used to re-create the original sequences.
func (*CoarseDB) Add ¶
Add takes an original sequence, converts it to a coarse sequence, and adds it as a new coarse sequence to the coarse database. Seeds are also generated for each K-mer in the sequence. The resulting coarse sequence is returned along with its sequence identifier.
func (*CoarseDB) CoarseSeqGet ¶
CoarseSeqGet is a thread-safe way to retrieve a sequence with index `i` from the coarse database.
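How the thread safety is achieved is not documented here. One common approach, shown as a sketch under that assumption, is to guard the sequence slice with a `sync.RWMutex` so that concurrent readers do not race with appends. The `coarseStore` type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// coarseStore is a hypothetical, simplified coarse database.
type coarseStore struct {
	mu   sync.RWMutex
	seqs []string
}

// add appends seq under the write lock and returns its index.
func (s *coarseStore) add(seq string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.seqs = append(s.seqs, seq)
	return len(s.seqs) - 1
}

// get returns the sequence at index i under the read lock.
func (s *coarseStore) get(i uint) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if i >= uint(len(s.seqs)) {
		panic(fmt.Sprintf("coarse sequence %d out of range", i))
	}
	return s.seqs[i]
}

func main() {
	store := &coarseStore{}
	var wg sync.WaitGroup
	for _, s := range []string{"MKV", "LAGG", "AC"} {
		wg.Add(1)
		go func(seq string) {
			defer wg.Done()
			store.add(seq)
		}(s)
	}
	wg.Wait()
	fmt.Println(len(store.get(0)) > 0)
}
```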
func (*CoarseDB) Expand ¶
func (coarsedb *CoarseDB) Expand(comdb *CompressedDB, id, start, end int) ([]OriginalSeq, error)
Expand will follow all links to compressed sequences for the coarse sequence at index `id` and return a slice of decompressed sequences.
func (*CoarseDB) NumSequences ¶
NumSequences returns the number of sequences in the coarse database based on the file size of the coarse database index.
func (*CoarseDB) ReadCoarseSeq ¶
ReadCoarseSeq reads the coarse sequence with identifier 'id' from disk, using the fasta index. (If a coarse sequence has already been read, it is returned from cache to save trips to disk.)
TODO: Note that this does *not* recover links typically found in a coarse sequence, although it probably should to avoid doing it in CoarseDB.Expand.
type CoarseSeq ¶
type CoarseSeq struct {
	*Sequence
	Links *LinkToCompressed
	// contains filtered or unexported fields
}
CoarseSeq embeds a Sequence and serves as a typing mechanism to distinguish reference Sequences in the compressed database from original Sequences from the input FASTA file.
func (*CoarseSeq) AddLink ¶
func (rseq *CoarseSeq) AddLink(link *LinkToCompressed)
func (*CoarseSeq) NewSubSequence ¶
type CompressedDB ¶
type CompressedDB struct {
	// File pointers to be used in reading/writing compressed databases.
	File  *os.File
	Index *os.File

	CompressedSource bool
	// contains filtered or unexported fields
}
A CompressedDB corresponds to a list of all original sequences compressed by replacing regions of sequences that are redundant with pointers to similar regions in the coarse database. Each pointer includes an offset and an edit script, which allows complete recovery of the original sequence.
N.B. A compressed database doesn't keep an in-memory representation of all compressed sequences. In particular, writing to a compressed database always corresponds to writing a compressed sequence to disk, and reading from a compressed database always corresponds to reading a sequence from disk (unless it has been cached in 'seqCache').
func (*CompressedDB) NumSequences ¶
func (comdb *CompressedDB) NumSequences() int
NumSequences returns the number of sequences in the compressed database using the file size of the index.
func (*CompressedDB) ReadNextSeq ¶
func (comdb *CompressedDB) ReadNextSeq(coarsedb *CoarseDB, seqFile io.Reader, orgSeqId int) (OriginalSeq, error)
func (*CompressedDB) ReadSeq ¶
func (comdb *CompressedDB) ReadSeq(coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)
func (*CompressedDB) ReadSeqFromCompressedSource ¶
func (comdb *CompressedDB) ReadSeqFromCompressedSource(coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)
func (*CompressedDB) SeqGet ¶
func (comdb *CompressedDB) SeqGet(coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)
SeqGet reads a sequence from the compressed database and decompresses it using the coarse database provided. The decompressed sequence is then added to the cache.
If the sequence has already been decompressed, the decompressed sequence from cache is returned.
SeqGet will panic if it is called while a compressed database is open for writing.
func (*CompressedDB) Write ¶
func (comdb *CompressedDB) Write(cseq CompressedSeq)
Write queues a new compressed sequence to be written to disk.
type CompressedSeq ¶
type CompressedSeq struct {
	// A sequence number.
	Id int

	// Name is an uncompressed string from the original FASTA header.
	Name string

	// Links is an ordered list of links to portions of the reference
	// database. When all links are followed, the concatenation of each
	// sequence corresponding to each link equals the entire original sequence.
	Links []LinkToCoarse
}
CompressedSeq corresponds to the components of a compressed sequence.
func CompressReduced ¶
func CompressReduced(db *DB, redSeqId int, redSeq *ReducedSeq, mem *memory) CompressedSeq
CompressReduced will convert a 4-character alphabet sequence into a compressed sequence. The process involves finding commonality in the original sequence with other sequences in the coarse database, and linking those common sub-sequences to sub-sequences in the coarse database.
N.B. `mem` is used in alignment and seed lookups to prevent allocation. Think of it as a goroutine-specific memory arena.
func NewCompressedSeq ¶
func NewCompressedSeq(id int, name string) CompressedSeq
NewCompressedSeq creates a CompressedSeq value using the id and name provided. The Links slice is initialized but empty.
func (*CompressedSeq) Add ¶
func (cseq *CompressedSeq) Add(link LinkToCoarse)
Add will add a LinkToCoarse to the end of the CompressedSeq's Links list.
func (CompressedSeq) Decompress ¶
func (cseq CompressedSeq) Decompress(coarse *CoarseDB) (OriginalSeq, error)
Decompress decompresses a particular compressed sequence using the given coarse database. Namely, all of the links are followed and all of the edit scripts are "applied" to recover the original sequence.
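The link-following step can be sketched for the simplest case, where every link has an empty diff and the linked coarse subsequence is copied verbatim. The `link` type and `decompress` function are hypothetical; half-open `[start, end)` ranges are an assumption, and applying edit scripts is deliberately left out because their encoding is not described here.

```go
package main

import (
	"fmt"
	"strings"
)

// link is a hypothetical, diff-free version of LinkToCoarse.
type link struct {
	coarseID   uint
	start, end uint16 // assumed half-open range [start, end)
}

// decompress concatenates the coarse subsequences named by links, in order.
func decompress(coarse []string, links []link) (string, error) {
	var b strings.Builder
	for _, lk := range links {
		if lk.coarseID >= uint(len(coarse)) {
			return "", fmt.Errorf("coarse sequence %d not found", lk.coarseID)
		}
		seq := coarse[lk.coarseID]
		if lk.start > lk.end || int(lk.end) > len(seq) {
			return "", fmt.Errorf("bad range [%d, %d) for coarse sequence %d",
				lk.start, lk.end, lk.coarseID)
		}
		b.WriteString(seq[lk.start:lk.end])
	}
	return b.String(), nil
}

func main() {
	coarse := []string{"MKVLAGG", "ACDEF"}
	links := []link{{0, 0, 3}, {1, 1, 4}, {0, 5, 7}}
	fmt.Println(decompress(coarse, links))
}
```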
func (CompressedSeq) String ¶
func (cseq CompressedSeq) String() string
type DB ¶
type DB struct {
	// An embedded configuration.
	*DBConf

	// The path to the directory on disk.
	Path string

	// The name of the database. This corresponds to the basename of the path.
	Name string

	// The compressed database component.
	ComDB *CompressedDB

	// The coarse database component.
	CoarseDB *CoarseDB
	// contains filtered or unexported fields
}
A DB represents a mica database, which has three main components: a coarse database, a compressed database and a configuration file.
A DB can be opened either for writing/appending (compression) or for reading (decompression).
func NewReadDB ¶
NewReadDB opens a mica database for reading. An error is returned if there is a problem accessing any of the files on disk.
Also, if the 'makeblastdb' or 'blastp' executables are not found, then an error is returned.
func NewWriteDB ¶
NewWriteDB creates a new mica database, and prepares it for writing (or opens an existing database and prepares it for appending if 'appnd' is set).
An error is returned if there is a problem accessing any of the files in the database.
It is an error to open a database for writing that already exists if 'appnd' is not set.
'conf' should be a database configuration, typically defined (initially) from command line parameters. Note that if 'appnd' is set, then the configuration will be read from disk; only options explicitly set via the command line will be overwritten.
func (*DB) ReadClose ¶
func (db *DB) ReadClose()
ReadClose closes all appropriate files after reading from a database.
func (*DB) Save ¶
Save will write the contents of the database to disk. This should be called after compression is complete.
After the database is saved, a blastp database is created from the coarse database.
N.B. The compressed database is written as each sequence is processed, so this call will only save the coarse database. This may take a *very* long time if the database is not read only (since the seeds table has to be written).
func (*DB) WriteClose ¶
func (db *DB) WriteClose()
WriteClose closes all appropriate files after writing to a database.
type DBConf ¶
type DBConf struct {
	MinMatchLen         int
	MatchKmerSize       int
	GappedWindowSize    int
	UngappedWindowSize  int
	ExtSeqIdThreshold   int
	MatchSeqIdThreshold int
	MatchExtend         int
	MapSeedSize         int
	ExtSeedSize         int
	LowComplexity       int
	SeedLowComplexity   int
	SavePlain           bool
	ReadOnly            bool
	SaveCompressed      bool
	BlastMakeBlastDB    string
	Dmnd                string
	BlastDBSize         uint64
}
type EditScript ¶
type EditScript struct {
// contains filtered or unexported fields
}
func NewEditScript ¶
func NewEditScript(alignment [2][]byte) *EditScript
func NewEditScriptParse ¶
func NewEditScriptParse(editScript string) (*EditScript, error)
func (*EditScript) Apply ¶
func (diff *EditScript) Apply(fromSeq []byte) []byte
func (*EditScript) String ¶
func (diff *EditScript) String() string
type LinkToCoarse ¶
type LinkToCoarse struct {
	// Diff, when "applied" to the portion of the reference sequence indicated
	// by this link, will yield the original sequence corresponding to this
	// link precisely. If Diff is empty, then the subsequence of the reference
	// sequence indicated here is equivalent to the corresponding piece of
	// the original sequence.
	Diff                   string
	CoarseSeqId            uint
	CoarseStart, CoarseEnd uint16
}
LinkToCoarse represents a component of a compressed original sequence that allows perfect reconstruction (i.e., decompression) of the original sequence.
func NewLinkToCoarse ¶
func NewLinkToCoarse(coarseSeqId, coarseStart, coarseEnd uint, alignment [2][]byte) LinkToCoarse
func NewLinkToCoarseNoDiff ¶
func NewLinkToCoarseNoDiff(coarseSeqId, coarseStart, coarseEnd uint) LinkToCoarse
func (LinkToCoarse) String ¶
func (lk LinkToCoarse) String() string
type LinkToCompressed ¶
type LinkToCompressed struct {
	OrgSeqId               uint32
	CoarseStart, CoarseEnd uint16
	Next                   *LinkToCompressed
}
LinkToCompressed represents a link from a reference sequence to a compressed original sequence. It serves as a bridge from a BLAST hit in the coarse database to the corresponding original sequence that is redundant to the specified residue range in the reference sequence.
func NewLinkToCompressed ¶
func NewLinkToCompressed(orgSeqId uint32, coarseStart, coarseEnd uint16) *LinkToCompressed
func (LinkToCompressed) String ¶
func (lk LinkToCompressed) String() string
type OriginalSeq ¶
type OriginalSeq struct {
*Sequence
}
OriginalSeq embeds a Sequence and serves as a typing mechanism to distinguish reference Sequences in the compressed database from original Sequences from the input FASTA file.
func NewFastaOriginalSeq ¶
func NewFastaOriginalSeq(id int, s seq.Sequence) *OriginalSeq
func NewOriginalSeq ¶
func NewOriginalSeq(id int, name string, residues []byte) *OriginalSeq
func (*OriginalSeq) NewSubSequence ¶
func (oseq *OriginalSeq) NewSubSequence(start, end uint) *OriginalSeq
type ReadOriginalSeq ¶
type ReadOriginalSeq struct {
	Seq *OriginalSeq
	Err error
}
ReadOriginalSeq is the value sent over `chan ReadOriginalSeq` when a new sequence is read from a FASTA file.
type ReducedSeq ¶
type ReducedSeq struct {
*Sequence
}
ReducedSeq embeds a Sequence and serves as a typing mechanism to distinguish reduced-alphabet (DNA) Sequences from amino acid Sequences.
func NewReducedSeq ¶
func NewReducedSeq(oseq *OriginalSeq) *ReducedSeq
func (*ReducedSeq) NewSubSequence ¶
func (rseq *ReducedSeq) NewSubSequence(start, end uint) *ReducedSeq
type SeedLoc ¶
type SeedLoc struct {
	// Index into the coarse database sequence slice.
	SeqInd uint32

	// Index into the coarse sequence corresponding to `SeqInd`.
	ResInd uint16

	Next *SeedLoc
}
SeedLoc represents the information required to translate a seed to a slice of residues from the coarse database. Namely, the index of the sequence in the coarse database and the index of the residue where the seed starts in that sequence.
Every SeedLoc also contains a pointer to the next seed location. This design was chosen so that each SeedLoc is independently allocated (as opposed to using a slice, which incurs a lot of allocation overhead when expanding the slice, and has the potential for pinning memory).
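The linked-list design can be illustrated with a short sketch. The `seedLoc`, `push` and `length` names are hypothetical, and prepending new locations at the head is an assumption; the package may append at the tail instead.

```go
package main

import "fmt"

// seedLoc is a hypothetical mirror of SeedLoc: one independently
// allocated node per seed location.
type seedLoc struct {
	seqInd uint32
	resInd uint16
	next   *seedLoc
}

// push allocates a single node and links it in front of head.
// No existing node is copied or moved, unlike growing a slice.
func push(head *seedLoc, seqInd uint32, resInd uint16) *seedLoc {
	return &seedLoc{seqInd: seqInd, resInd: resInd, next: head}
}

// length walks the list and counts its nodes.
func length(head *seedLoc) int {
	n := 0
	for loc := head; loc != nil; loc = loc.next {
		n++
	}
	return n
}

func main() {
	var head *seedLoc
	head = push(head, 0, 12)
	head = push(head, 3, 40)
	for loc := head; loc != nil; loc = loc.next {
		fmt.Println(loc.seqInd, loc.resInd)
	}
	fmt.Println(length(head))
}
```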
func NewSeedLoc ¶
type Seeds ¶
type Seeds struct {
	// Table of lists of seed locations. Its length is always equivalent
	// to (SeedAlphaSize)^(SeedSize).
	Locs []*SeedLoc

	SeedSize int
	// contains filtered or unexported fields
}
Seeds is a list of lists of seed locations. The index into the seeds table corresponds to a hash of a particular K-mer. The list found at each row in the seed table corresponds to all locations in the coarse database in which the K-mer occurs.
func NewSeeds ¶
NewSeeds creates a new table of seed location lists. The table is initialized with enough memory to hold lists for all possible K-mers. Namely, the length of seeds is equivalent to 20^(K) where 20 is the number of amino acids (size of alphabet) and K is equivalent to the length of each K-mer.
func (*Seeds) Add ¶
Add will create seed locations for all K-mers in corSeq and add them to the seeds table.
func (Seeds) Lookup ¶
Lookup returns a list of all seed locations corresponding to a particular K-mer.
`mem` is a pointer to a slice of seed locations, where a seed location is a tuple of (sequence index, residue index). `mem` is used to prevent unnecessary allocation. A pointer to this slice is returned.
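The allocation-avoidance idea behind `mem` can be sketched as follows. The `loc`, `node` and `lookup` names are hypothetical, and this version returns the slice itself rather than a pointer to it; the point is only that truncating and refilling a caller-owned slice reuses its backing array across calls.

```go
package main

import "fmt"

// loc is a hypothetical (sequence index, residue index) tuple.
type loc struct {
	seqInd uint32
	resInd uint16
}

// node is one entry in a linked list of seed locations.
type node struct {
	loc
	next *node
}

// lookup copies every location in the list into *mem, reusing its
// backing array. No allocation occurs while the capacity is sufficient.
func lookup(head *node, mem *[]loc) []loc {
	*mem = (*mem)[:0]
	for n := head; n != nil; n = n.next {
		*mem = append(*mem, n.loc)
	}
	return *mem
}

func main() {
	list := &node{loc: loc{1, 5}, next: &node{loc: loc{2, 9}}}
	mem := make([]loc, 0, 16)
	for i := 0; i < 3; i++ {
		hits := lookup(list, &mem)
		fmt.Println(len(hits), cap(mem))
	}
}
```

Because each call overwrites the previous contents, a caller that needs to keep earlier results has to copy them before looking up the next K-mer.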