mica

package module

Version: v0.0.0-...-bae6f69 (latest) | Published: Feb 10, 2018 | License: GPL-2.0 | Imports: 19 | Imported by: 0

Repository

github.com/ndaniels/MICA

README ¶

ABOUT

MICA (Metagenomic Inquiry Compressive Acceleration) is a family of programs for performing compressively-accelerated metagenomic sequence searches based on BLASTX and DIAMOND. MICA also includes compressively accelerated versions of the BLASTP family of tools (including PSI-BLAST and DELTA-BLAST), as well as a compression tool (mica-compress) for creating searchable, compressed databases based on an input FASTA file.

If you use MICA, please cite:

Daniels N, Gallant A, Peng J, Cowen L, Baym M, Berger B "Compressive Genomics for Protein Databases." Bioinformatics 29.13 (2013): i283-i290.
Yu YW, Daniels N, Danko DC, Berger B "Entropy-scaling search of massive biological data." (2015) Submitted.
Buchfink B, Xie C, Huson DH "Fast and sensitive protein alignment using DIAMOND." Nature Methods 12.1 (2015): 59-60.

MICA is licensed under the GNU public license version 2.0. If you would like to license MICA in an environment where the GNU public license is unacceptable (such as inclusion in a non-GPL software package) commercial MICA licensing is available through MIT office of Technology Transfer. Contact bab@mit.edu for more information. Contact ndaniels@csail.mit.edu for issues involving the code.

COMPATIBILITY NOTE

Between publication of DIAMOND and MICA, DIAMOND made some changes that affected both database and output compatibility. Specifically, DIAMOND now outputs only a .daa file, which can be converted to BLAST-style tabular output or SAM output via diamond view. The results in Yu et al. depend on the November 2014 version of DIAMOND, which output a BLAST tabular file directly. Thus, the database provided depends on particular versions of DIAMOND and MICA. Specifically, the paper results were produced using this version of MICA and this version of DIAMOND (a linux-x64 binary)

Current MICA works with current DIAMOND, and can optionally convert the .daa file to BLAST tabular format directly. MICA can also perform its fine search step using BLASTX or BLASTP, which results in a small run-time performance penalty, but allows any BLASTX options to be passed along. This can be useful for incorporating MICA into an existing BLASTX pipeline that depends on options not supported by DIAMOND.

DIAMOND vs. BLASTX

MICA performs a two-stage search: it searches cluster representatives first ("coarse search"), expands those representatives that are reasonably close to a query, and finally performs a second search ("fine search") on just the sequences represented by those coarse hits. MICA relies on DIAMOND for its coarse search, but allows the choice of DIAMOND or BLASTX for its fine search. DIAMOND is faster, but accepts fewer options than BLASTX. BLASTX imposes a moderate run-time penalty (~20-40%) compared to DIAMOND but allows all of its options to be passed through via the --blast-args flag, so any of BLAST's output formats and filtering options are available. To use DIAMOND for fine search, specify the argument --dmnd-fine with an output file (such as --dmnd-fine=results.txt). To use BLAST for fine search, omit this argument and use output redirection (> results.txt).
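The coarse-then-fine idea can be illustrated with a toy Go sketch. This is not MICA's implementation: the cluster type, the matches scoring function, and both thresholds are hypothetical stand-ins for the DIAMOND/BLASTX alignment steps, chosen only to show how a coarse miss lets the search skip a whole cluster.

```go
package main

import "fmt"

// cluster is a hypothetical stand-in for one coarse representative and the
// original sequences it represents.
type cluster struct {
	representative string
	members        []string
}

// matches counts positions at which a and b agree, over the shorter length.
// It is a toy score, standing in for a real alignment.
func matches(a, b string) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	count := 0
	for i := 0; i < n; i++ {
		if a[i] == b[i] {
			count++
		}
	}
	return count
}

// twoStageSearch runs a permissive coarse pass over representatives, then a
// stricter fine pass over only the members of clusters that passed.
func twoStageSearch(query string, db []cluster, coarseMin, fineMin int) []string {
	var hits []string
	for _, c := range db {
		if matches(query, c.representative) < coarseMin {
			continue // coarse miss: the whole cluster is skipped
		}
		for _, m := range c.members {
			if matches(query, m) >= fineMin {
				hits = append(hits, m)
			}
		}
	}
	return hits
}

func main() {
	db := []cluster{
		{"MKTAYIAK", []string{"MKTAYIAK", "MKTAYIAR", "MKSAYLAR"}},
		{"GGGGGGGG", []string{"GGGGGGGG", "GGGGAGGG"}},
	}
	fmt.Println(twoStageSearch("MKTAYIAR", db, 5, 7))
}
```

The coarse threshold is deliberately looser than the fine one, so that a representative can admit members that match the query better than the representative itself does.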

QUICK EXAMPLE

Assuming you have Go, DIAMOND and BLAST+ installed, here is a quick example of how to perform a compressively accelerated MICA search using a compressed database that has already been created.

# Install MICA
go get github.com/ndaniels/MICA/...

# Download and extract the database. It is large and could take a while.
# Make sure to check for a newer version!
wget http://giant.csail.mit.edu/gems/nr-20140917-mica.tgz
tar zxf nr-20140917-mica.tgz

# Search.
mica-xsearch --dmnd-fine=result.txt nr-20140917-mica query.fasta

There are more examples covering more use cases further down.

INSTALLATION

The easiest way to install is to download binaries compiled for your operating system. No other dependencies are required (sans BLAST+ and DIAMOND, which should already be in your PATH). They can be downloaded here: http://gems.csail.mit.edu/

Compiling from source is also easy; compiling MICA only requires that git and Go are installed. If Go is not already available via your package manager, it can be installed from source by following the directions here: http://golang.org/doc/install

Once Go is installed, you'll need to set your GOPATH, which is where MICA (and other Go packages) will be installed. We recommend running

mkdir $HOME/go

And adding the following to your ~/.profile or equivalent:

export GOPATH="$HOME/go"
export PATH="$PATH:$GOPATH/bin"

Finally, run the following command to download, compile and install MICA:

go get github.com/ndaniels/MICA/...

The MICA executables should be installed in $GOPATH/bin.

MICA has been tested against Go 1.4.2.

EXECUTABLES

There are seven binary executables in the MICA suite, also available as binaries for users without Go installed. They are:

mica-xsearch      A compressively accelerated translated search (like BLASTX),
                    which can use DIAMOND or BLASTX for fine search.
mica-psearch      A compressively accelerated protein search (like BLASTP),
                    which can use DIAMOND or BLASTP for fine search.

mica-compress     Compresses FASTA input files (such as nr.fasta or
                      nr.gz) into a compressed database for quick searching.

mica-decompress   A rarely-needed inverse of mica-compress.

mica-search       A compressively accelerated version of BLASTP.

mica-psisearch    A compressively accelerated version of PSI-BLAST.

mica-deltasearch  A compressively accelerated version of DELTA-BLAST.

Every executable can be run with the --help flag to get a list of command line options.

PREREQUISITES

MICA boosts BLAST+ and DIAMOND protein-database search, and as such it is not completely self-contained. It relies on BLAST+ and DIAMOND.

To use MICA, you must already have BLAST+ 2.2 or later installed, so that the BLAST binaries are in your PATH. DELTA-BLAST requires BLAST+ 2.2.26 or later and we recommend 2.2.27. DELTA-BLAST also requires an RPS database configured per NCBI's instructions. You must also have DIAMOND installed (tested with DIAMOND 0.7.9) so that the diamond binary is in your PATH.

We provide binaries for Mac OS X (64-bit Intel, tested on OS X 10.10.3 and built with Go 1.4.2) and Linux (64-bit Intel/AMD, tested on Linux kernel 3.13.0 and Go 1.4.2). With Go installed, MICA might work on Microsoft Windows, but this is untested and unsupported.

You do not need the Go compiler installed to use the binary distributions of MICA.

ADDITIONAL FILES

As compression is compute-intensive, we provide an already-compressed database based on NCBI's NR from September 17, 2014, which we will update quarterly. Since the MICA compressed database format is actually a directory structure, we provide it as a gzipped tar archive (.tgz), which should be unarchived with tar zxf nr-20140917-mica.tgz.

The result will be a directory, 'nr-20140917-mica', which contains the various files necessary for MICA to run.

Should you wish to create your own compressed database, you would use the mica-compress binary. The database we provide was created with:

mica-compress --match-seq-id-threshold 60 --ext-seed-size 0 \
                  --ext-seq-id-threshold 50 --max-seeds 20 -p 40 \
                  nr-20140917-mica nr.fasta

Several of the command-line arguments are tuning parameters that affect the run-time performance of compression.

The --max-seeds argument caps the size of the seeds table to, in this case, 20 gigabytes. Compressing large databases can require a great deal of RAM. A significantly smaller cap will harm compression.

The --ext-seed-size argument allows for larger k-mer seeds without the memory overhead associated with the larger size, by greedily requiring the additional residues to be exact matches.

The --match-seq-id-threshold argument sets the sequence identity percentage required for a match during compression.

The --ext-seq-id-threshold argument sets the sequence identity percentage required for a single instance of extension during compression.

The -p argument simply sets the number of processor cores used during compression, and bears no relevance to the resulting compressed database.

In this case, the input file is nr.fasta, and the output name for the compressed database is nr-20140917-mica. Note that the compressed database is actually a directory that will be created by mica-compress.

USAGE

Run mica-compress -help, mica-xsearch -help, mica-psearch -help, mica-deltasearch -help, mica-search -help, or mica-psisearch -help for detailed help as to command-line arguments.

EXAMPLES

To perform a compressively accelerated DIAMOND search, you might do:

mica-xsearch --dmnd-fine=result.txt \
                     /path/to/mica_database /path/to/query.fasta

where:

result.txt is the local file path to output the DIAMOND results

/path/to/mica_database is the local file path to your MICA 
compressed database (it will be the path to nr-20140917-mica if you are 
using the provided September, 2014 database)

/path/to/query.fasta is simply the local file path to the FASTA file you 
wish to use as a query.

To perform a compressively accelerated BLASTX search, you might do:

mica-xsearch /path/to/mica_database /path/to/query.fasta \
  --blast-args -evalue 1e-7 -outfmt 6 > result.txt

where:

/path/to/mica_database is the local file path to the MICA 
compressed database,

/path/to/query.fasta is the local file path to the FASTA file you wish to 
use as a query,

1e-7 is the BLAST e-value you wish to use,

-outfmt 6 is the standard BLAST argument to produce tabular output, and

result.txt is where you wish the results to appear.

Arguments the user wishes to pass to the underlying BLAST program, if BLAST is used for fine search, such as adjusting the output format or the E-value threshold, may be passed via the --blast-args flag.

For example, to specify XML output, one might run:

mica-xsearch /path/to/mica_database /path/to/query.fasta \
                --blast-args -outfmt 5

Where -outfmt 5 is, as indicated in the NCBI BLAST+ user guide, the command-line argument for XML output.

REPORTING BUGS

If you find any bugs or have any problems using MICA, please submit a bug report on our issue tracker:

https://github.com/ndaniels/MICA/issues

Documentation ¶

Index ¶

Constants
Variables
func CleanupDB(db *DB, pool *redCompressPool)
func Exec(cmd *exec.Cmd) error
func IsLowComplexity(residues []byte, offset, window int) bool
func PrintFlagDefaults()
func ReadOriginalSeqs(fileName string, ignore []byte) (chan ReadOriginalSeq, error)
func Reduce(seq []byte) []byte
func SeqIdentity(seq1, seq2 []byte) int
func StartCompressReducedWorkers(db *DB) redCompressPool
func Translate(sequence []byte) [][]byte
func TranslateQuerySeqs(query *bytes.Reader, action SearchOperator) (*bytes.Reader, error)
func Vprint(s string)
func Vprintf(format string, v ...interface{})
func Vprintln(s string)
type CoarseDB
- func (coarsedb *CoarseDB) Add(oseq []byte) (int, *CoarseSeq)
- func (coarsedb *CoarseDB) CoarseSeqGet(i uint) *CoarseSeq
- func (coarsedb *CoarseDB) Expand(comdb *CompressedDB, id, start, end int) ([]OriginalSeq, error)
- func (coarsedb *CoarseDB) LoadSeqs() (err error)
- func (coarsedb *CoarseDB) NumSequences() int
- func (coarsedb *CoarseDB) ReadCoarseSeq(id int) (*CoarseSeq, error)
type CoarseSeq
- func NewCoarseSeq(id int, name string, residues []byte) *CoarseSeq
- func NewFastaCoarseSeq(id int, s seq.Sequence) *CoarseSeq
- func (rseq *CoarseSeq) AddLink(link *LinkToCompressed)
- func (rseq *CoarseSeq) NewSubSequence(start, end uint) *CoarseSeq
type CompressedDB
- func (comdb *CompressedDB) NumSequences() int
- func (comdb *CompressedDB) ReadNextSeq(coarsedb *CoarseDB, seqFile io.Reader, orgSeqId int) (OriginalSeq, error)
- func (comdb *CompressedDB) ReadSeq(coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)
- func (comdb *CompressedDB) ReadSeqFromCompressedSource(coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)
- func (comdb *CompressedDB) SeqGet(coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)
- func (comdb *CompressedDB) Write(cseq CompressedSeq)
type CompressedSeq
- func CompressReduced(db *DB, redSeqId int, redSeq *ReducedSeq, mem *memory) CompressedSeq
- func NewCompressedSeq(id int, name string) CompressedSeq
- func (cseq *CompressedSeq) Add(link LinkToCoarse)
- func (cseq CompressedSeq) Decompress(coarse *CoarseDB) (OriginalSeq, error)
- func (cseq CompressedSeq) String() string
type DB
- func NewReadDB(dir string) (*DB, error)
- func NewWriteDB(appnd bool, conf *DBConf, dir string) (*DB, error)
- func (db *DB) ReadClose()
- func (db *DB) Save() error
- func (db *DB) WriteClose()
type DBConf
- func LoadDBConf(r io.Reader) (conf *DBConf, err error)
- func (conf *DBConf) DeepCopy() *DBConf
- func (flagConf *DBConf) FlagMerge(fileConf *DBConf) (*DBConf, error)
- func (dbConf DBConf) Write(w io.Writer) error
type EditScript
- func NewEditScript(alignment [2][]byte) *EditScript
- func NewEditScriptParse(editScript string) (*EditScript, error)
- func (diff *EditScript) Apply(fromSeq []byte) []byte
- func (diff *EditScript) String() string
type LinkToCoarse
- func NewLinkToCoarse(coarseSeqId, coarseStart, coarseEnd uint, alignment [2][]byte) LinkToCoarse
- func NewLinkToCoarseNoDiff(coarseSeqId, coarseStart, coarseEnd uint) LinkToCoarse
- func (lk LinkToCoarse) String() string
type LinkToCompressed
- func NewLinkToCompressed(orgSeqId uint32, coarseStart, coarseEnd uint16) *LinkToCompressed
- func (lk LinkToCompressed) String() string
type OriginalSeq
- func NewFastaOriginalSeq(id int, s seq.Sequence) *OriginalSeq
- func NewOriginalSeq(id int, name string, residues []byte) *OriginalSeq
- func (oseq *OriginalSeq) NewSubSequence(start, end uint) *OriginalSeq
type ReadOriginalSeq
type ReducedSeq
- func NewReducedSeq(oseq *OriginalSeq) *ReducedSeq
- func (rseq *ReducedSeq) NewSubSequence(start, end uint) *ReducedSeq
type SearchOperator
type SeedLoc
- func NewSeedLoc(seqInd uint32, resInd uint16) *SeedLoc
type Seeds
- func NewSeeds(seedSize, lowComplexityWindow int) Seeds
- func (ss *Seeds) Add(coarseSeqIndex int, corSeq *CoarseSeq)
- func (ss Seeds) Lookup(kmer []byte, mem *[][2]uint) [][2]uint
- func (ss *Seeds) MaybeWipe(seedTableSizeGB float64)
- func (ss Seeds) NumSeeds() int64
type Sequence
- func (s *Sequence) FastaSeq() seq.Sequence
- func (seq *Sequence) Len() int
- func (seq *Sequence) String() string

Constants ¶

const (
	FileCoarseFasta      = "coarse.fasta"
	FileCoarseFastaIndex = "coarse.fasta.index"
	FileCoarseLinks      = "coarse.links"
	FileCoarsePlainLinks = "coarse.links.plain"
	FileCoarseLinksIndex = "coarse.links.index"
	FileCoarseSeeds      = "coarse.seeds"
	FileCoarsePlainSeeds = "coarse.seeds.plain"
)

Hard-coded file names for different pieces of a mica database.

const (
	FileCompressed = "compressed"
	FileIndex      = "compressed.index"
)

const (
	FileParams      = "params"
	FileBlastCoarse = "blastdb-coarse"
	FileDmndCoarse  = "blastdb-dmnd"
	FileBlastFine   = "blastdb-fine"
)

const (
	ModSubstitution = iota
	ModDeletion
	ModInsertion
)

Variables ¶

var (
	DefaultQueryDBConf = &DBConf{
		MinMatchLen:         40,
		MatchKmerSize:       4,
		GappedWindowSize:    25,
		UngappedWindowSize:  10,
		ExtSeqIdThreshold:   60,
		MatchSeqIdThreshold: 70,
		MatchExtend:         30,
		MapSeedSize:         6,
		ExtSeedSize:         0,
		LowComplexity:       10,
		SeedLowComplexity:   6,
		SavePlain:           false,
		ReadOnly:            true,
		SaveCompressed:      false,
		BlastMakeBlastDB:    "makeblastdb",
		Dmnd:                "diamond",
		BlastDBSize:         0,
	}

	DefaultDBConf = &DBConf{
		MinMatchLen:         40,
		MatchKmerSize:       4,
		GappedWindowSize:    25,
		UngappedWindowSize:  10,
		ExtSeqIdThreshold:   60,
		MatchSeqIdThreshold: 70,
		MatchExtend:         30,
		MapSeedSize:         6,
		ExtSeedSize:         0,
		LowComplexity:       10,
		SeedLowComplexity:   6,
		SavePlain:           false,
		ReadOnly:            true,
		SaveCompressed:      false,
		BlastMakeBlastDB:    "makeblastdb",
		Dmnd:                "diamond",
		BlastDBSize:         0,
	}
)

var (
	SeedAlphaSize        = len(blosum.Alphabet62)
	SeedAlphaNums        = make([]int, 26)
	ReverseSeedAlphaNums = make([]byte, 26)
)

SeedAlphaNums is a map to assign *valid* amino acid residues contiguous values so that base-N arithmetic can be performed on them. (Where N = SeedAlphaSize.) Invalid amino acid residues map to -1 and will produce a panic.
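As a rough illustration of the base-N arithmetic described above, the sketch below hashes a K-mer into a single integer. It assumes a plain 20-letter amino acid alphabet; the package derives its alphabet from blosum.Alphabet62, which may be larger, and the names alphabet, alphaNums and hashKmer are hypothetical.

```go
package main

import "fmt"

// alphabet is an assumed 20-letter amino acid alphabet; MICA derives its
// alphabet from blosum.Alphabet62, which may differ.
const alphabet = "ACDEFGHIKLMNPQRSTVWY"

// alphaNums maps 'A'..'Z' to a contiguous value, or -1 for invalid residues.
var alphaNums = buildAlphaNums()

func buildAlphaNums() [26]int {
	var nums [26]int
	for i := range nums {
		nums[i] = -1
	}
	for i := 0; i < len(alphabet); i++ {
		nums[alphabet[i]-'A'] = i
	}
	return nums
}

// hashKmer interprets kmer as a base-N number, where N = len(alphabet).
// It panics on a residue outside the alphabet.
func hashKmer(kmer []byte) int {
	hash := 0
	for _, r := range kmer {
		if r < 'A' || r > 'Z' || alphaNums[r-'A'] < 0 {
			panic(fmt.Sprintf("invalid residue %q", r))
		}
		hash = hash*len(alphabet) + alphaNums[r-'A']
	}
	return hash
}

func main() {
	fmt.Println(hashKmer([]byte("AAA")), hashKmer([]byte("ACD")), hashKmer([]byte("YYY")))
}
```

Because the values are contiguous, every K-mer of length K maps to a distinct index in the range 0 to N^K - 1, which is what allows a flat table to be indexed directly by the hash.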

var (
	Verbose = false
)

Functions ¶

func CleanupDB ¶

func CleanupDB(db *DB, pool *redCompressPool)

When the program ends (either by SIGTERM or when all of the input sequences are compressed), 'cleanup' is executed. It writes all CPU/memory profiles if they're enabled, waits for the compression workers to finish, saves the database to disk and closes all file handles.

func Exec ¶

func Exec(cmd *exec.Cmd) error

Exec runs a command created with 'Command' in the os/exec package, and converts anything reported to stderr to a Go error value.

Note that if the command returns successfully, the error is guaranteed to be nil.
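One common way to get this behavior with os/exec is to capture stderr in a buffer and turn it into an error only when the command fails. The sketch below shows that pattern; runWithStderr and shell are hypothetical helpers rather than the package's Exec, and the example assumes a POSIX sh is available on PATH.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// runWithStderr runs cmd and, on failure, returns an error that carries
// whatever the command wrote to stderr. On success it always returns nil,
// even if the command printed warnings to stderr.
func runWithStderr(cmd *exec.Cmd) error {
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		msg := strings.TrimSpace(stderr.String())
		if msg == "" {
			return err // nothing on stderr: fall back to the exit error
		}
		return errors.New(msg)
	}
	return nil
}

// shell builds a command that runs script with sh; it assumes a POSIX shell
// is available on PATH.
func shell(script string) *exec.Cmd {
	return exec.Command("sh", "-c", script)
}

func main() {
	err := runWithStderr(shell("echo 'no such database' >&2; exit 1"))
	fmt.Println(err)
}
```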

func IsLowComplexity ¶

func IsLowComplexity(residues []byte, offset, window int) bool

IsLowComplexity detects whether the residue at the given offset is in a region of low complexity, where low complexity is defined as a window where every residue is the same (no variation in composition).
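A minimal sketch of this kind of check follows. It assumes that "in a region of low complexity" means the residue sits inside a run of at least window identical residues; how the package actually positions its window around the offset is not stated here, so inLowComplexityRun is a hypothetical illustration only.

```go
package main

import "fmt"

// inLowComplexityRun reports whether the residue at offset lies inside a run
// of at least window identical residues. Out-of-range offsets and
// non-positive windows report false.
func inLowComplexityRun(residues []byte, offset, window int) bool {
	if offset < 0 || offset >= len(residues) || window <= 0 {
		return false
	}
	start, end := offset, offset
	// Extend the run to the left and right while the residue repeats.
	for start > 0 && residues[start-1] == residues[offset] {
		start--
	}
	for end+1 < len(residues) && residues[end+1] == residues[offset] {
		end++
	}
	return end-start+1 >= window
}

func main() {
	seq := []byte("MKAAAAAAGT")
	fmt.Println(inLowComplexityRun(seq, 4, 5), inLowComplexityRun(seq, 0, 5))
}
```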

func PrintFlagDefaults ¶

func PrintFlagDefaults()

func ReadOriginalSeqs ¶

func ReadOriginalSeqs(
	fileName string,
	ignore []byte,
) (chan ReadOriginalSeq, error)

ReadOriginalSeqs reads a FASTA formatted file and returns a channel that each new sequence is sent to.
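The channel-of-results pattern used here can be sketched as follows. The record and readResult types and the readFasta function are hypothetical simplifications (the package's own types are OriginalSeq and ReadOriginalSeq), and the parser assumes short lines that fit in bufio.Scanner's default buffer.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// record is a hypothetical FASTA entry.
type record struct {
	Name     string
	Residues []byte
}

// readResult pairs a record with an error, so that one channel can carry
// both sequences and read failures.
type readResult struct {
	Seq *record
	Err error
}

// readFasta parses r in a goroutine and sends one readResult per sequence.
// The channel is closed when input ends or after an error is sent.
func readFasta(r io.Reader) chan readResult {
	out := make(chan readResult)
	go func() {
		defer close(out)
		var cur *record
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			line := strings.TrimSpace(scanner.Text())
			switch {
			case line == "":
				continue
			case strings.HasPrefix(line, ">"):
				if cur != nil {
					out <- readResult{Seq: cur}
				}
				cur = &record{Name: line[1:]}
			case cur == nil:
				out <- readResult{Err: fmt.Errorf("residues before first header: %q", line)}
				return
			default:
				cur.Residues = append(cur.Residues, line...)
			}
		}
		if err := scanner.Err(); err != nil {
			out <- readResult{Err: err}
			return
		}
		if cur != nil {
			out <- readResult{Seq: cur}
		}
	}()
	return out
}

// readFastaString is a convenience wrapper for in-memory input.
func readFastaString(s string) chan readResult {
	return readFasta(strings.NewReader(s))
}

func main() {
	for res := range readFastaString(">a\nMK\nTA\n>b\nGG\n") {
		if res.Err != nil {
			fmt.Println("error:", res.Err)
			return
		}
		fmt.Printf("%s %s\n", res.Seq.Name, res.Seq.Residues)
	}
}
```

Sending errors over the same channel lets the consumer handle sequences one at a time without holding the whole file in memory, and closing the channel is what ends the consumer's range loop.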

func Reduce ¶

func Reduce(seq []byte) []byte

func SeqIdentity ¶

func SeqIdentity(seq1, seq2 []byte) int

SeqIdentity computes the Sequence identity of two byte slices. The number returned is an integer in the range 0-100, inclusive. SeqIdentity returns zero if the lengths of both seq1 and seq2 are zero.

If the lengths of seq1 and seq2 are not equal, SeqIdentity will panic.
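The documented contract can be sketched in a few lines. This version assumes identity is the truncated integer percentage of positions at which the two slices hold the same byte; how the package rounds, or whether it treats gap characters specially, is not stated here, so seqIdentity below is only an illustration.

```go
package main

import "fmt"

// seqIdentity returns the percentage (0-100) of positions at which seq1 and
// seq2 hold the same byte. It returns 0 for two empty slices and panics if
// the lengths differ. Integer division truncates the result.
func seqIdentity(seq1, seq2 []byte) int {
	if len(seq1) != len(seq2) {
		panic("seqIdentity: sequences must have equal length")
	}
	if len(seq1) == 0 {
		return 0
	}
	same := 0
	for i := range seq1 {
		if seq1[i] == seq2[i] {
			same++
		}
	}
	return same * 100 / len(seq1)
}

func main() {
	fmt.Println(seqIdentity([]byte("ACDE"), []byte("ACDF")))
}
```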

func StartCompressReducedWorkers ¶

func StartCompressReducedWorkers(db *DB) redCompressPool

StartCompressReducedWorkers initializes a pool of compression workers.

The redCompressPool returned can be used to compress sequences concurrently.

func Translate ¶

func Translate(sequence []byte) [][]byte

func TranslateQuerySeqs ¶

func TranslateQuerySeqs(
	query *bytes.Reader, action SearchOperator) (*bytes.Reader, error)

func Vprint ¶

func Vprint(s string)

func Vprintf ¶

func Vprintf(format string, v ...interface{})

func Vprintln ¶

func Vprintln(s string)

Types ¶

type CoarseDB ¶

type CoarseDB struct {
	Seqs  []*CoarseSeq
	Seeds Seeds

	// File pointers to each file in the "coarse" part of a mica database.
	FileFasta      *os.File
	FileFastaIndex *os.File
	FileSeeds      *os.File
	FileLinks      *os.File
	FileLinksIndex *os.File
	// contains filtered or unexported fields
}

CoarseDB represents a set of unique sequences that comprise the "coarse" database. Sequences in the coarse database, combined with information in the compressed database, are used to re-create the original sequences.

func (*CoarseDB) Add ¶

func (coarsedb *CoarseDB) Add(oseq []byte) (int, *CoarseSeq)

Add takes an original sequence, converts it to a coarse sequence, and adds it as a new coarse sequence to the coarse database. Seeds are also generated for each K-mer in the sequence. The resulting coarse sequence is returned along with its sequence identifier.

func (*CoarseDB) CoarseSeqGet ¶

func (coarsedb *CoarseDB) CoarseSeqGet(i uint) *CoarseSeq

CoarseSeqGet is a thread-safe way to retrieve a sequence with index `i` from the coarse database.

func (*CoarseDB) Expand ¶

func (coarsedb *CoarseDB) Expand(
	comdb *CompressedDB, id, start, end int) ([]OriginalSeq, error)

Expand will follow all links to compressed sequences for the coarse sequence at index `id` and return a slice of decompressed sequences.

func (*CoarseDB) LoadSeqs ¶

func (coarsedb *CoarseDB) LoadSeqs() (err error)

func (*CoarseDB) NumSequences ¶

func (coarsedb *CoarseDB) NumSequences() int

NumSequences returns the number of sequences in the coarse database based on the file size of the coarse database index.

func (*CoarseDB) ReadCoarseSeq ¶

func (coarsedb *CoarseDB) ReadCoarseSeq(id int) (*CoarseSeq, error)

ReadCoarseSeq reads the coarse sequence with identifier 'id' from disk, using the fasta index. (If a coarse sequence has already been read, it is returned from cache to save trips to disk.)

TODO: Note that this does *not* recover links typically found in a coarse sequence, although it probably should to avoid doing it in CoarseDB.Expand.

type CoarseSeq ¶

type CoarseSeq struct {
	*Sequence
	Links *LinkToCompressed
	// contains filtered or unexported fields
}

CoarseSeq embeds a Sequence and serves as a typing mechanism to distinguish reference Sequences in the compressed database from original Sequences from the input FASTA file.

func NewCoarseSeq ¶

func NewCoarseSeq(id int, name string, residues []byte) *CoarseSeq

func NewFastaCoarseSeq ¶

func NewFastaCoarseSeq(id int, s seq.Sequence) *CoarseSeq

func (*CoarseSeq) AddLink ¶

func (rseq *CoarseSeq) AddLink(link *LinkToCompressed)

func (*CoarseSeq) NewSubSequence ¶

func (rseq *CoarseSeq) NewSubSequence(start, end uint) *CoarseSeq

type CompressedDB ¶

type CompressedDB struct {
	// File pointers to be used in reading/writing compressed databases.
	File  *os.File
	Index *os.File

	CompressedSource bool
	// contains filtered or unexported fields
}

A CompressedDB corresponds to a list of all original sequences compressed by replacing regions of sequences that are redundant with pointers to similar regions in the coarse database. Each pointer includes an offset and an edit script, which allows complete recovery of the original sequence.

N.B. A compressed database doesn't keep an in memory representation of all compressed sequences. In particular, writing to a compressed database always corresponds to writing a compressed sequence to disk. And reading from a compressed database always corresponds to reading a sequence from disk (unless it has been cached in 'seqCache').

func (*CompressedDB) NumSequences ¶

func (comdb *CompressedDB) NumSequences() int

NumSequences returns the number of sequences in the compressed database using the file size of the index.

func (*CompressedDB) ReadNextSeq ¶

func (comdb *CompressedDB) ReadNextSeq(
	coarsedb *CoarseDB, seqFile io.Reader, orgSeqId int) (OriginalSeq, error)

func (*CompressedDB) ReadSeq ¶

func (comdb *CompressedDB) ReadSeq(
	coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)

func (*CompressedDB) ReadSeqFromCompressedSource ¶

func (comdb *CompressedDB) ReadSeqFromCompressedSource(coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)

func (*CompressedDB) SeqGet ¶

func (comdb *CompressedDB) SeqGet(
	coarsedb *CoarseDB, orgSeqId int) (OriginalSeq, error)

SeqGet reads a sequence from the compressed database and decompresses it using the coarse database provided. The decompressed sequence is then added to cache.

If the sequence has already been decompressed, the decompressed sequence from cache is returned.

SeqGet will panic if it is called while a compressed database is open for writing.

func (*CompressedDB) Write ¶

func (comdb *CompressedDB) Write(cseq CompressedSeq)

Write queues a new compressed sequence to be written to disk.

type CompressedSeq ¶

type CompressedSeq struct {
	// A sequence number.
	Id int

	// Name is an uncompressed string from the original FASTA header.
	Name string

	// Links is an ordered list of links to portions of the reference
	// database. When all links are followed, the concatenation of each
	// sequence corresponding to each link equals the entire original sequence.
	Links []LinkToCoarse
}

CompressedSeq corresponds to the components of a compressed sequence.

func CompressReduced ¶

func CompressReduced(db *DB, redSeqId int,
	redSeq *ReducedSeq, mem *memory) CompressedSeq

CompressReduced will convert a 4-character alphabet sequence into a compressed sequence. The process involves finding commonality in the original sequence with other sequences in the coarse database, and linking those common sub-sequences to sub-sequences in the coarse database.

N.B. `mem` is used in alignment and seed lookups to prevent allocation. Think of it as a goroutine-specific memory arena.

func NewCompressedSeq ¶

func NewCompressedSeq(id int, name string) CompressedSeq

NewCompressedSeq creates a CompressedSeq value using the name provided. The Link slice is initialized but empty.

func (*CompressedSeq) Add ¶

func (cseq *CompressedSeq) Add(link LinkToCoarse)

Add will add a LinkToCoarse to the end of the CompressedSeq's Links list.

func (CompressedSeq) Decompress ¶

func (cseq CompressedSeq) Decompress(coarse *CoarseDB) (OriginalSeq, error)

Decompress decompresses a particular compressed sequence using the given coarse sequence. Namely, all of the links are followed and all of the edit scripts are "applied" to recover the original sequence.
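The link-following step can be illustrated with a simplified sketch. The link and substitution types below are hypothetical and support substitutions only, whereas the package's EditScript also covers insertions and deletions; the real on-disk encoding is not shown here.

```go
package main

import "fmt"

// substitution is a hypothetical one-residue edit; MICA's EditScript also
// handles insertions and deletions, which this sketch leaves out.
type substitution struct {
	pos     int  // position within the linked coarse region
	residue byte // residue found in the original sequence
}

// link points at the half-open region [start, end) of one coarse sequence
// and carries the edits that turn that region back into original residues.
type link struct {
	coarseID   int
	start, end int
	edits      []substitution
}

// decompress follows each link in order, applies its edits to a copy of the
// coarse region, and concatenates the results.
func decompress(coarse [][]byte, links []link) []byte {
	var original []byte
	for _, lk := range links {
		// Copy the region so the coarse database itself is never modified.
		region := append([]byte(nil), coarse[lk.coarseID][lk.start:lk.end]...)
		for _, e := range lk.edits {
			region[e.pos] = e.residue
		}
		original = append(original, region...)
	}
	return original
}

func main() {
	coarse := [][]byte{[]byte("MKTAYIAKQR"), []byte("GGSLVPRGSH")}
	links := []link{
		{coarseID: 0, start: 0, end: 5, edits: []substitution{{pos: 2, residue: 'S'}}},
		{coarseID: 1, start: 3, end: 8},
	}
	fmt.Printf("%s\n", decompress(coarse, links))
}
```

A link with no edits corresponds to a region of the original sequence that is identical to the coarse region it points at.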

func (CompressedSeq) String ¶

func (cseq CompressedSeq) String() string

type DB ¶

type DB struct {
	// An embedded configuration.
	*DBConf

	// The path to the directory on disk.
	Path string

	// The name of the database. This corresponds to the basename of the path.
	Name string

	// The compressed database component.
	ComDB *CompressedDB

	// The coarse database component.
	CoarseDB *CoarseDB
	// contains filtered or unexported fields
}

A DB represents a mica database, which has three main components: a coarse database, a compressed database and a configuration file.

A DB can be opened either for writing/appending (compression) or for reading (decompression).

func NewReadDB ¶

func NewReadDB(dir string) (*DB, error)

NewReadDB opens a mica database for reading. An error is returned if there is a problem accessing any of the files on disk.

Also, if the 'makeblastdb' or 'blastp' executables are not found, then an error is returned.

func NewWriteDB ¶

func NewWriteDB(appnd bool, conf *DBConf, dir string) (*DB, error)

NewWriteDB creates a new mica database, and prepares it for writing (or opens an existing database and prepares it for appending if 'appnd' is set).

An error is returned if there is a problem accessing any of the files in the database.

It is an error to open a database for writing that already exists if 'appnd' is not set.

'conf' should be a database configuration, typically defined (initially) from command line parameters. Note that if 'appnd' is set, then the configuration will be read from disk---only options explicitly set via the command line will be overwritten.
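One way to implement "only explicitly set options override the stored configuration" with the standard flag package is FlagSet.Visit, which visits only flags that were actually set. The sketch below shows that technique under stated assumptions: the conf type is a hypothetical two-field subset of DBConf, and mergeExplicit is not the package's FlagMerge.

```go
package main

import (
	"flag"
	"fmt"
)

// conf is a hypothetical two-field subset of a database configuration.
type conf struct {
	MatchSeqIdThreshold int
	ExtSeqIdThreshold   int
}

// mergeExplicit starts from the configuration stored on disk and overrides
// only the fields whose flags were explicitly set on the command line.
func mergeExplicit(fileConf conf, args []string) (conf, error) {
	fs := flag.NewFlagSet("compress", flag.ContinueOnError)
	flagConf := conf{}
	fs.IntVar(&flagConf.MatchSeqIdThreshold, "match-seq-id-threshold", 70, "identity required for a match")
	fs.IntVar(&flagConf.ExtSeqIdThreshold, "ext-seq-id-threshold", 60, "identity required for an extension")
	if err := fs.Parse(args); err != nil {
		return conf{}, err
	}
	merged := fileConf
	// Visit calls the function only for flags that were set explicitly,
	// so flag defaults never clobber values read from disk.
	fs.Visit(func(f *flag.Flag) {
		switch f.Name {
		case "match-seq-id-threshold":
			merged.MatchSeqIdThreshold = flagConf.MatchSeqIdThreshold
		case "ext-seq-id-threshold":
			merged.ExtSeqIdThreshold = flagConf.ExtSeqIdThreshold
		}
	})
	return merged, nil
}

func main() {
	onDisk := conf{MatchSeqIdThreshold: 70, ExtSeqIdThreshold: 55}
	merged, err := mergeExplicit(onDisk, []string{"--match-seq-id-threshold=60"})
	fmt.Println(merged, err)
}
```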

func (*DB) ReadClose ¶

func (db *DB) ReadClose()

ReadClose closes all appropriate files after reading from a database.

func (*DB) Save ¶

func (db *DB) Save() error

Save will write the contents of the database to disk. This should be called after compression is complete.

After the database is saved, a blastp database is created from the coarse database.

N.B. The compressed database is written as each sequence is processed, so this call will only save the coarse database. This may take a *very* long time if the database is not read only (since the seeds table has to be written).

func (*DB) WriteClose ¶

func (db *DB) WriteClose()

WriteClose closes all appropriate files after writing to a database.

type DBConf ¶

type DBConf struct {
	MinMatchLen         int
	MatchKmerSize       int
	GappedWindowSize    int
	UngappedWindowSize  int
	ExtSeqIdThreshold   int
	MatchSeqIdThreshold int
	MatchExtend         int
	MapSeedSize         int
	ExtSeedSize         int
	LowComplexity       int
	SeedLowComplexity   int
	SavePlain           bool
	ReadOnly            bool
	SaveCompressed      bool
	BlastMakeBlastDB    string
	Dmnd                string
	BlastDBSize         uint64
}

func LoadDBConf ¶

func LoadDBConf(r io.Reader) (conf *DBConf, err error)

func (*DBConf) DeepCopy ¶

func (conf *DBConf) DeepCopy() *DBConf

func (*DBConf) FlagMerge ¶

func (flagConf *DBConf) FlagMerge(fileConf *DBConf) (*DBConf, error)

func (DBConf) Write ¶

func (dbConf DBConf) Write(w io.Writer) error

type EditScript ¶

type EditScript struct {
	// contains filtered or unexported fields
}

func NewEditScript ¶

func NewEditScript(alignment [2][]byte) *EditScript

func NewEditScriptParse ¶

func NewEditScriptParse(editScript string) (*EditScript, error)

func (*EditScript) Apply ¶

func (diff *EditScript) Apply(fromSeq []byte) []byte

func (*EditScript) String ¶

func (diff *EditScript) String() string

type LinkToCoarse ¶

type LinkToCoarse struct {
	// Diff, when "applied" to the portion of the reference sequence indicated
	// by this link, will yield the original sequence corresponding to this
	// link precisely. If Diff is empty, then the subsequence of the reference
	// sequence indicated here is equivalent to the corresponding piece of
	// the original sequence.
	Diff                   string
	CoarseSeqId            uint
	CoarseStart, CoarseEnd uint16
}

LinkToCoarse represents a component of a compressed original sequence that allows perfect reconstruction (i.e., decompression) of the original sequence.

func NewLinkToCoarse ¶

func NewLinkToCoarse(coarseSeqId, coarseStart, coarseEnd uint,
	alignment [2][]byte) LinkToCoarse

func NewLinkToCoarseNoDiff ¶

func NewLinkToCoarseNoDiff(
	coarseSeqId, coarseStart, coarseEnd uint) LinkToCoarse

func (LinkToCoarse) String ¶

func (lk LinkToCoarse) String() string

type LinkToCompressed ¶

type LinkToCompressed struct {
	OrgSeqId               uint32
	CoarseStart, CoarseEnd uint16
	Next                   *LinkToCompressed
}

LinkToCompressed represents a link from a reference sequence to a compressed original sequence. It serves as a bridge from a BLAST hit in the coarse database to the corresponding original sequence that is redundant to the specified residue range in the reference sequence.

func NewLinkToCompressed ¶

func NewLinkToCompressed(
	orgSeqId uint32, coarseStart, coarseEnd uint16) *LinkToCompressed

func (LinkToCompressed) String ¶

func (lk LinkToCompressed) String() string

type OriginalSeq ¶

type OriginalSeq struct {
	*Sequence
}

OriginalSeq embeds a Sequence and serves as a typing mechanism to distinguish original Sequences from the input FASTA file from reference Sequences in the compressed database.

func NewFastaOriginalSeq ¶

func NewFastaOriginalSeq(id int, s seq.Sequence) *OriginalSeq

func NewOriginalSeq ¶

func NewOriginalSeq(id int, name string, residues []byte) *OriginalSeq

func (*OriginalSeq) NewSubSequence ¶

func (oseq *OriginalSeq) NewSubSequence(start, end uint) *OriginalSeq

type ReadOriginalSeq ¶

type ReadOriginalSeq struct {
	Seq *OriginalSeq
	Err error
}

ReadOriginalSeq is the value sent over `chan ReadOriginalSeq` when a new sequence is read from a FASTA file.

type ReducedSeq ¶

type ReducedSeq struct {
	*Sequence
}

ReducedSeq embeds a Sequence and serves as a typing mechanism to distinguish reduced-alphabet (DNA) Sequences from amino acid Sequences.

func NewReducedSeq ¶

func NewReducedSeq(oseq *OriginalSeq) *ReducedSeq

func (*ReducedSeq) NewSubSequence ¶

func (rseq *ReducedSeq) NewSubSequence(start, end uint) *ReducedSeq

type SearchOperator ¶

type SearchOperator func(*bytes.Reader) (*bytes.Reader, error)

type SeedLoc ¶

type SeedLoc struct {
	// Index into the coarse database sequence slice.
	SeqInd uint32

	// Index into the coarse sequence corresponding to `SeqInd`.
	ResInd uint16

	Next *SeedLoc
}

SeedLoc represents the information required to translate a seed to a slice of residues from the coarse database. Namely, the index of the sequence in the coarse database and the index of the residue where the seed starts in that sequence.

Every SeedLoc also contains a pointer to the next seed location. This design was chosen so that each SeedLoc is independently allocated (as opposed to using a slice, which incurs a lot of allocation overhead when expanding the slice, and has the potential for pinning memory).

func NewSeedLoc ¶

func NewSeedLoc(seqInd uint32, resInd uint16) *SeedLoc

type Seeds ¶

type Seeds struct {
	// Table of lists of seed locations. Its length is always equivalent
	// to (SeedAlphaSize)^(SeedSize).
	Locs []*SeedLoc

	SeedSize int
	// contains filtered or unexported fields
}

Seeds is a list of lists of seed locations. The index into the seeds table corresponds to a hash of particular K-mer. The list found at each row in the seed table corresponds to all locations in the coarse database in which the K-mer occurs.
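The table-of-linked-lists layout can be sketched as below. The lowercase types are hypothetical miniatures of SeedLoc and Seeds; in particular, this sketch prepends new locations to each list, which is an assumption and may not match the order the package uses.

```go
package main

import "fmt"

// seedLoc is one occurrence of a K-mer: which coarse sequence, and where.
type seedLoc struct {
	seqInd uint32
	resInd uint16
	next   *seedLoc
}

// seedTable maps a K-mer hash (the slice index) to a linked list of locations.
type seedTable struct {
	locs []*seedLoc
}

// newSeedTable allocates one (initially empty) list head per possible hash.
func newSeedTable(size int) *seedTable {
	return &seedTable{locs: make([]*seedLoc, size)}
}

// add prepends a location to the list for the given hash in O(1).
func (t *seedTable) add(hash int, seqInd uint32, resInd uint16) {
	t.locs[hash] = &seedLoc{seqInd: seqInd, resInd: resInd, next: t.locs[hash]}
}

// lookup walks the list and returns (sequence index, residue index) pairs,
// most recently added first.
func (t *seedTable) lookup(hash int) [][2]uint {
	var out [][2]uint
	for loc := t.locs[hash]; loc != nil; loc = loc.next {
		out = append(out, [2]uint{uint(loc.seqInd), uint(loc.resInd)})
	}
	return out
}

func main() {
	table := newSeedTable(8) // 8 rows is an arbitrary size for the demo
	table.add(3, 0, 5)
	table.add(3, 2, 7)
	fmt.Println(table.lookup(3), len(table.lookup(4)))
}
```

Each node is allocated independently, so adding a location never requires copying an existing list.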

func NewSeeds ¶

func NewSeeds(seedSize, lowComplexityWindow int) Seeds

NewSeeds creates a new table of seed location lists. The table is initialized with enough memory to hold lists for all possible K-mers. Namely, the length of seeds is equivalent to 20^(K) where 20 is the number of amino acids (size of alphabet) and K is equivalent to the length of each K-mer.

func (*Seeds) Add ¶

func (ss *Seeds) Add(coarseSeqIndex int, corSeq *CoarseSeq)

Add will create seed locations for all K-mers in corSeq and add them to the seeds table.

func (Seeds) Lookup ¶

func (ss Seeds) Lookup(kmer []byte, mem *[][2]uint) [][2]uint

Lookup returns a list of all seed locations corresponding to a particular K-mer.

`mem` is a pointer to a slice of seed locations, where a seed location is a tuple of (sequence index, residue index). `mem` is used to prevent unnecessary allocation. A pointer to this slice is returned.

func (*Seeds) MaybeWipe ¶

func (ss *Seeds) MaybeWipe(seedTableSizeGB float64)

MaybeWipe completely wipes the seeds table if the memory of the seeds table exceeds seedTableSizeGB (which is the number of gigabytes).

func (Seeds) NumSeeds ¶

func (ss Seeds) NumSeeds() int64

NumSeeds computes the number of seeds currently in the seeds table. Since the seeds table is typically big, this is an expensive operation.

type Sequence ¶

type Sequence struct {
	Name     string
	Residues []byte
	Offset   uint
	Id       int
}

Sequence is the underlying (i.e., embedded) type of reference and original Sequences used in cablast.

func (*Sequence) FastaSeq ¶

func (s *Sequence) FastaSeq() seq.Sequence

FastaSeq returns a new seq.Sequence from TuftsBCB/seq.

func (*Sequence) Len ¶

func (seq *Sequence) Len() int

Len returns the number of residues in this Sequence.

func (*Sequence) String ¶

func (seq *Sequence) String() string

String returns a string (fasta) representation of this Sequence. If this Sequence is a subSequence, then the range of the subSequence (with respect to the original Sequence) is also printed.

Directories ¶

Path	Synopsis
blosum
cmd
mica-compress
mica-decompress
mica-deltasearch
mica-psearch
mica-psisearch
mica-reindexer
mica-search
mica-xsearch
