scorch

package

v0.8.2 Published: Mar 20, 2020 License: Apache-2.0 Imports: 30 Imported by: 0

Repository

github.com/mohsensamiei/bleve

README ¶

scorch

Definitions

Batch

A collection of Documents to mutate in the index.

Document

Has a unique identifier (arbitrary bytes).
Is comprised of a list of fields.

Field

Has a name (string).
Has a type (text, number, date, geopoint).
Has a value (depending on type).
Can be indexed, stored, or both.
If indexed, can be analyzed.
If indexed, can optionally store term vectors.

Scope

Scorch MUST implement the bleve.index API without requiring any changes to this API.

Scorch MAY introduce new interfaces, which can be discovered to allow use of new capabilities not in the current API.

Implementation

The scorch implementation starts with the concept of a segmented index.

A segment is simply a slice, subset, or portion of the entire index. A segmented index is one which is composed of one or more segments. Although segments are created in a particular order, knowing this ordering is not required to achieve correct semantics when querying. Because correctness does not depend on segment order, you can (and should) search all the segments concurrently when searching an index.

Internal Wrapper

In order to accommodate the existing APIs while also improving the implementation, the scorch implementation includes some wrapper functionality that must be described.

_id field

In scorch, field 0 is prearranged to be named _id. All documents have a value for this field, which is the document's external identifier. In this version the field MUST be both indexed AND stored. The scorch wrapper adds this field, as it will not be present in the Document from the calling bleve code.

NOTE: If a document already contains a field _id, it will be replaced. If this is problematic, the caller must ensure such a scenario does not happen.

Proposed Structures

type Segment interface {

  Dictionary(field string) TermDictionary

}

type TermDictionary interface {

  PostingsList(term string, excluding PostingsList) PostingsList

}

type PostingsList interface {

  Next() Posting

  And(other PostingsList) PostingsList
  Or(other PostingsList) PostingsList

}

type Posting interface {
  Number() uint64

  Frequency() uint64
  Norm() float64

  Locations() Locations
}

type Locations interface {
  Start() uint64
  End() uint64
  Pos() uint64
  ArrayPositions() ...
}

type DeletedDocs {

}

type SegmentSnapshot struct {
  segment Segment
  deleted PostingsList
}

type IndexSnapshot struct {
  segment []SegmentSnapshot
}

Open questions: What about errors? What about memory management or context? Should the postings list have a separate iterator, to separate stateful from stateless behavior?

Mutating the Index

The bleve.index API has methods for directly making individual mutations (Update/Delete/SetInternal/DeleteInternal); however, for this first implementation, we assume that all of these calls can simply be turned into a Batch of size 1. This may be highly inefficient, but it will be correct. This decision is made based on the fact that Couchbase FTS always uses Batches.

NOTE: As a side-effect of this decision, it should be clear that performance tuning may depend on the batch size, which may in turn require changes in FTS.

From this point forward, only Batch mutations will be discussed.

Sequence of Operations:

1. For each document in the batch, search through all existing segments. The goal is to build up a per-segment bitset which tells us which documents in that segment are obsoleted by the addition of the new segment we're currently building. NOTE: we're not ready for this change to take effect yet, so rather than mutating anything, these searches simply return bitsets, which we can apply later. Logically, this is something like:
```
  foreach segment {
    dict := segment.Dictionary("_id")
    postings := empty postings list
    foreach docID {
      postings = postings.Or(dict.PostingsList(docID, nil))
    }
  }
```
   NOTE: it is illustrated above as nested for loops, but some or all of these could be done concurrently. The end result is that for each segment, we have a (possibly empty) bitset.
2. Also concurrent with 1, the documents in the batch are analyzed. This analysis proceeds using the existing analyzer pool.
3. (after 2 completes) Analyzed documents are fed into a function which builds a new Segment representing this information.
4. We now have everything we need to update the state of the system to include this new snapshot:

   - Acquire a lock
   - Create a new IndexSnapshot
   - For each SegmentSnapshot in the current IndexSnapshot, take the deleted PostingsList and OR it with the new postings list computed for that Segment in step 1. Construct a new SegmentSnapshot for the segment using this new deleted PostingsList. Append this SegmentSnapshot to the new IndexSnapshot.
   - Create a new SegmentSnapshot wrapping our new segment with nil deleted docs.
   - Append the new SegmentSnapshot to the new IndexSnapshot
   - Release the lock
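
The sequence above can be sketched in Go. This is a minimal illustration under stated assumptions, not scorch's actual code: the names `bitset`, `segment`, `obsoletedBy`, and `introduce` are hypothetical, a plain `[]bool` stands in for a real postings list or bitmap, analysis is skipped, and the lock is left to the caller.

```go
package main

import "fmt"

// bitset marks deleted documents in one segment (index = local doc position).
// A nil bitset means "nothing deleted". Hypothetical stand-in for a bitmap.
type bitset []bool

// or returns the union of a and b; either may be nil.
func or(a, b bitset) bitset {
	if a == nil && b == nil {
		return nil
	}
	n := len(a)
	if len(b) > n {
		n = len(b)
	}
	out := make(bitset, n)
	for i := range out {
		out[i] = (i < len(a) && a[i]) || (i < len(b) && b[i])
	}
	return out
}

// segment is a hypothetical immutable segment holding only external IDs.
type segment struct{ ids []string }

// obsoletedBy returns which docs in s are replaced by the batch (step 1).
// It mutates nothing; the result is applied later.
func (s *segment) obsoletedBy(batchIDs []string) bitset {
	var out bitset
	for i, id := range s.ids {
		for _, b := range batchIDs {
			if id == b {
				if out == nil {
					out = make(bitset, len(s.ids))
				}
				out[i] = true
			}
		}
	}
	return out
}

type segmentSnapshot struct {
	seg     *segment
	deleted bitset
}

type indexSnapshot struct{ segments []segmentSnapshot }

// introduce builds a new snapshot (step 4). Existing segments are reused with
// updated deleted bitsets; the new segment is appended with nil deleted docs.
// The caller is assumed to hold the index lock.
func introduce(cur *indexSnapshot, batchIDs []string) *indexSnapshot {
	next := &indexSnapshot{}
	for _, ss := range cur.segments {
		next.segments = append(next.segments, segmentSnapshot{
			seg:     ss.seg,
			deleted: or(ss.deleted, ss.seg.obsoletedBy(batchIDs)),
		})
	}
	next.segments = append(next.segments, segmentSnapshot{seg: &segment{ids: batchIDs}})
	return next
}

func main() {
	snap := &indexSnapshot{}
	for _, batch := range [][]string{{"A", "B", "C"}, {"B"}, {"C"}} {
		snap = introduce(snap, batch)
	}
	for i, ss := range snap.segments {
		fmt.Println(i, ss.seg.ids, ss.deleted)
	}
}
```

Because `introduce` never modifies the current snapshot or its segments, a reader holding the old snapshot keeps a stable view while the new one is built.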

An ASCII art example:

0 - Empty Index

No segments

IndexSnapshot
  segments []
  deleted []


1 - Index Batch [ A B C ]

segment       0
numbers   [ 1 2 3 ]
_id       [ A B C ]

IndexSnapshot
  segments [ 0 ]
  deleted [ nil ]


2 - Index Batch [ B' ]

segment       0           1
numbers   [ 1 2 3 ]     [ 1 ]
_id       [ A B C ]     [ B ]

Compute bitset segment-0-deleted-by-1:
          [ 0 1 0 ]

OR it with previous (nil) (call it 0-1)
          [ 0 1 0 ]

IndexSnapshot
  segments [  0    1 ]
  deleted  [ 0-1 nil ]

3 - Index Batch [ C' ]

segment       0           1      2
numbers   [ 1 2 3 ]     [ 1 ]  [ 1 ]
_id       [ A B C ]     [ B ]  [ C ]

Compute bitset segment-0-deleted-by-2:
          [ 0 0 1 ]

OR it with previous ([ 0 1 0 ]) (call it 0-12)
          [ 0 1 1 ]

Compute bitset segment-1-deleted-by-2:
          [ 0 ]

OR it with previous (nil)
          still just nil


IndexSnapshot
  segments [  0    1    2 ]
  deleted  [ 0-12 nil  nil ]

Open questions: Is there an opportunity to stop early when a doc is found in one segment? Also, is there a more efficient way to find bits for long lists of IDs?

Searching

In the bleve.index API all searching starts by getting an IndexReader, which represents a snapshot of the index at a point in time.

As described in the section above, our index implementation maintains a pointer to the current IndexSnapshot. When a caller gets an IndexReader, they get a copy of this pointer, and can use it as long as they like. The IndexSnapshot contains SegmentSnapshots, which only contain pointers to immutable segments. The deleted posting lists associated with a segment change over time, but the particular deleted posting list in YOUR snapshot is immutable. This gives a stable view of the data.

Term Search

Term search is the only searching primitive exposed in today's bleve.index API. This ultimately could limit our ability to take advantage of the indexing improvements, but it also means it will be easier to get a first version of this working.

A term search for term T in field F will look something like this:

  searchResultPostings = empty
  foreach segment {
    dict := segment.Dictionary(F)
    segmentResultPostings = dict.PostingsList(T, segmentSnapshotDeleted)
    // make segmentLocal numbers into global numbers, and flip bits in searchResultPostings
  }

The searchResultPostings will be a new implementation of the TermFieldReader interface.

As a reminder this interface is:

// TermFieldReader is the interface exposing the enumeration of documents
// containing a given term in a given field. Documents are returned in byte
// lexicographic order over their identifiers.
type TermFieldReader interface {
	// Next returns the next document containing the term in this field, or nil
	// when it reaches the end of the enumeration.  The preAlloced TermFieldDoc
	// is optional, and when non-nil, will be used instead of allocating memory.
	Next(preAlloced *TermFieldDoc) (*TermFieldDoc, error)

	// Advance resets the enumeration at specified document or its immediate
	// follower.
	Advance(ID IndexInternalID, preAlloced *TermFieldDoc) (*TermFieldDoc, error)

	// Count returns the number of documents containing the term in this field.
	Count() uint64
	Close() error
}

At first glance this appears problematic: we have no way to return documents in order of their identifiers. But it turns out the wording is perhaps too strong, or a bit ambiguous. Originally, this referred to the external identifiers, but with the introduction of a distinction between internal/external identifiers, returning them in order of their internal identifiers is also acceptable. ASIDE: the reason for this is that most callers just use Next() and literally don't care what the order is; they could be in any order and it would be fine. There is only one searcher that cares, and that is the ConjunctionSearcher, which relies on Next/Advance having very specific semantics. Later in this document we will have a proposal to split into multiple interfaces:

- The weakest interface, which only supports Next(), with no ordering at all.
- Ordered, supporting Advance().
- And/Or'able, capable of efficiently doing these ops internally with like interfaces (if not capable, it can always fall back to external walking).

But, the good news is that we don't even have to do that for our first implementation. As long as the global numbers we use for internal identifiers are consistent within this IndexSnapshot, then Next() will be ordered by ascending document number, and Advance() will still work correctly.
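
To make the Next()/Advance() contract concrete, here is a small sketch of a reader over ascending global document numbers. The type `docNumReader` and its simplified signatures are hypothetical; the real TermFieldReader returns *TermFieldDoc values and errors, which are omitted here.

```go
package main

import (
	"fmt"
	"sort"
)

// docNumReader is a hypothetical reader over global doc numbers that are
// unique and sorted ascending within one snapshot.
type docNumReader struct {
	nums []uint64
	pos  int
}

// Next returns the next doc number, or ok=false at the end of the enumeration.
func (r *docNumReader) Next() (num uint64, ok bool) {
	if r.pos >= len(r.nums) {
		return 0, false
	}
	num = r.nums[r.pos]
	r.pos++
	return num, true
}

// Advance resets the enumeration to the first doc number >= target and
// returns it. Binary search is valid only because nums is sorted.
func (r *docNumReader) Advance(target uint64) (uint64, bool) {
	r.pos = sort.Search(len(r.nums), func(i int) bool { return r.nums[i] >= target })
	return r.Next()
}

func main() {
	r := &docNumReader{nums: []uint64{2, 5, 9}}
	n, ok := r.Advance(4)
	fmt.Println(n, ok)
	n, ok = r.Next()
	fmt.Println(n, ok)
}
```

A conjunction over two such readers can leapfrog with Advance() precisely because both enumerate the same snapshot-wide number space in the same order.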

NOTE: there is another place where we rely on the ordering of these hits, and that is in the "_id" sort order. Previously this was the natural order, and a NOOP for the collector, now it must be implemented by actually sorting on the "_id" field. We probably should introduce at least a marker interface to detect this.

An ASCII art example:

Let's start with the IndexSnapshot we ended with earlier:

3 - Index Batch [ C' ]

  segment       0           1      2
  numbers   [ 1 2 3 ]     [ 1 ]  [ 1 ]
  _id       [ A B C ]     [ B ]  [ C ]

  Compute bitset segment-0-deleted-by-2:
            [ 0 0 1 ]

  OR it with previous ([ 0 1 0 ]) (call it 0-12)
            [ 0 1 1 ]

  Compute bitset segment-1-deleted-by-2:
            [ 0 ]

  OR it with previous (nil)
            still just nil


  IndexSnapshot
    segments [  0    1    2 ]
    deleted  [ 0-12 nil  nil ]

Now let's search for the term 'cat' in the field 'desc' and let's assume that Document C (both versions) would match it.

Concurrently:

  - Segment 0
   - Get Term Dictionary For Field 'desc'
   - From it get Postings List for term 'cat' EXCLUDING 0-12
   - raw segment matches [ 0 0 1 ] but excluding [ 0 1 1 ] gives [ 0 0 0 ]
  - Segment 1
   - Get Term Dictionary For Field 'desc'
   - From it get Postings List for term 'cat' excluding nil
   - [ 0 ]
  - Segment 2
   - Get Term Dictionary For Field 'desc'
   - From it get Postings List for term 'cat' excluding nil
   - [ 1 ]

Map local bitsets into global number space (global meaning cross-segment but still unique to this snapshot)

The IndexSnapshot should already have a mapping, something like:
0 - Offset 0
1 - Offset 3 (because segment 0 had 3 docs)
2 - Offset 4 (because segment 1 had 1 doc)

This maps to search result bitset:

[ 0 0 0 0 1 ]

Caller would call Next() and get doc number 5 (assuming 1-based indexing for now)

Caller could then ask to get term locations, stored fields, external doc ID for document number 5.  Internally in the IndexSnapshot, we can now convert that back, and realize doc number 5 comes from segment 2, 5-4=1 so we're looking for doc number 1 in segment 2.  That happens to be C...
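
The offset arithmetic in this example can be sketched as follows. The helper names `offsets`, `toGlobal`, and `toLocal` are hypothetical, and the sketch keeps the 1-based numbering used in the example above.

```go
package main

import (
	"fmt"
	"sort"
)

// offsets returns, for each segment, the number of documents in all segments
// before it. Segment sizes 3, 1, 1 give [0 3 4], as in the example.
func offsets(sizes []uint64) []uint64 {
	out := make([]uint64, len(sizes))
	var running uint64
	for i, n := range sizes {
		out[i] = running
		running += n
	}
	return out
}

// toGlobal converts a 1-based local doc number in segment seg to a global one.
func toGlobal(offs []uint64, seg int, local uint64) uint64 {
	return offs[seg] + local
}

// toLocal maps a 1-based global doc number (must be >= 1 and in range) back
// to its segment and 1-based local number.
func toLocal(offs []uint64, global uint64) (seg int, local uint64) {
	// Find the first segment whose offset is >= global; the owner precedes it.
	i := sort.Search(len(offs), func(i int) bool { return offs[i] >= global })
	seg = i - 1
	return seg, global - offs[seg]
}

func main() {
	offs := offsets([]uint64{3, 1, 1})
	fmt.Println(offs)
	fmt.Println(toGlobal(offs, 2, 1))
	fmt.Println(toLocal(offs, 5))
}
```

Since the offsets depend on which segments a snapshot holds, global numbers are only meaningful within the snapshot that produced them.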

Future improvements

In the future, interfaces to detect these non-serially operating TermFieldReaders could expose their own And() and Or() up to the higher level Conjunction/Disjunction searchers. Doing this alone offers some win, but also means there would be greater burden on the Searcher code rewriting logical expressions for maximum performance.

Another related topic is that of peak memory usage. With serially operating TermFieldReaders it was necessary to start them all at the same time and operate in unison. However, with these non-serially operating TermFieldReaders we have the option of doing a few at a time, consolidating them, disposing of the intermediaries, and then doing a few more. For very complex queries with many clauses this could reduce peak memory usage.

Memory Tracking

All segments must be able to produce two statistics: an estimate of their explicit memory usage, and their actual size on disk (if any). For in-memory segments, disk usage could be zero, and the memory usage represents the entire information content. For mmap-based disk segments, the memory could be as low as the size of the tracking structure itself (say just a few pointers).

This would allow the implementation to throttle or block incoming mutations when a threshold memory usage has been (or would be) exceeded.
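
As a rough sketch of how the two statistics could feed such a throttle, assume a hypothetical `sizer` interface and a `wouldExceed` check; neither name comes from scorch, and the 64-byte figure for an mmap-backed segment is purely illustrative.

```go
package main

import "fmt"

// sizer is a hypothetical interface exposing the two statistics.
type sizer interface {
	MemoryBytes() uint64 // estimate of explicit memory usage
	DiskBytes() uint64   // actual size on disk, 0 if not persisted
}

// memSegment lives entirely in memory.
type memSegment struct{ bytes uint64 }

func (m memSegment) MemoryBytes() uint64 { return m.bytes }
func (m memSegment) DiskBytes() uint64   { return 0 }

// mmapSegment is backed by a file; only a small tracking structure is counted.
type mmapSegment struct{ fileBytes uint64 }

func (m mmapSegment) MemoryBytes() uint64 { return 64 } // illustrative constant
func (m mmapSegment) DiskBytes() uint64   { return m.fileBytes }

// wouldExceed reports whether admitting `incoming` more bytes would push the
// total estimated memory usage past `threshold`. The comparison is arranged
// to avoid uint64 overflow.
func wouldExceed(segs []sizer, incoming, threshold uint64) bool {
	var total uint64
	for _, s := range segs {
		total += s.MemoryBytes()
	}
	return incoming > threshold || total > threshold-incoming
}

func main() {
	segs := []sizer{memSegment{bytes: 900}, mmapSegment{fileBytes: 1 << 20}}
	fmt.Println(wouldExceed(segs, 100, 1024)) // 964 + 100 > 1024
	fmt.Println(wouldExceed(segs, 10, 1024))  // 964 + 10 <= 1024
}
```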

Persistence

Obviously, we want to support (but maybe not require) asynchronous persistence of segments. My expectation is that segments are initially built in memory. At some point they are persisted to disk. This poses some interesting challenges.

At runtime, the state of an index (its IndexSnapshot) is not only the contents of the segments, but also the bitmasks of deleted documents. These bitmasks indirectly encode an ordering in which the segments were added. The reason is that the bitmasks encode which items have been obsoleted by other (subsequent, i.e. newer) segments. In the runtime implementation we compute bitmask deltas and then merge them at the same time we bring the new segment in. One idea is that we could take a similar approach on disk. When we persist a segment, we persist the bitmask deltas of segments known to exist at that time, and eventually these can get merged up into a base segment deleted bitmask.

This also relates to the topic of rollback, addressed next...

Rollback

One desirable property in the Couchbase ecosystem is the ability to rollback to some previous (though typically not long ago) state. One idea for keeping this property in this design is to protect some of the most recent segments from merging. Then, if necessary, they could be "undone" to reveal previous states of the system. In these scenarios "undone" has to properly undo the deleted bitmasks on the other segments. Again, the current thinking is that rather than "undo" anything, it could be work that was deferred in the first place, thus making it easier to logically undo.

Another possibly related approach would be to tie this into our existing snapshot mechanism. Perhaps simulating a slow reader (holding onto index snapshots) for some period of time, can be the mechanism to achieve the desired end goal.

Internal Storage

The bleve.index API has support for "internal storage": the ability to store information under a separate name space.

This is not used for high volume storage, so it is tempting to think we could just put a small k/v store alongside the rest of the index. But the reality is that this storage is used to maintain key information related to the rollback scenario. Because of this, it's crucial that ordering and overwriting of key/value pairs correspond with actual segment persistence in the index. Based on this, I believe it's important to put the internal key/value pairs inside the segments themselves. This also means that they must follow a similar "deleted" bitmask approach to obsolete values in older segments. That seems to substantially increase the complexity of the solution: because of the separate name space, it would appear to require its own bitmask. Further, keys aren't numeric, which then implies yet another mapping from internal key to number, etc.

More thought is required here.

Merging

The segmented index approach requires merging to prevent the number of segments from growing too large.

Recent experience with LSMs has taught us that having the correct merge strategy can make a huge difference in the overall performance of the system. In particular, a simple merge strategy which merges segments too aggressively can lead to high write amplification and unnecessarily rendering cached data useless.

A few simple principles have been identified.

- Roughly, we merge multiple smaller segments into a single larger one.
- The larger a segment gets, the less likely we should be to ever merge it.
- Segments with large numbers of deleted/obsoleted items are good candidates, as the merge will result in a space savings.
- Segments with all items deleted/obsoleted can be dropped.

Merging of a segment should be able to proceed even if that segment is held by an ongoing snapshot, it should only delay the removal of it.
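
These principles can be illustrated with a deliberately naive planner. This is an assumption-laden sketch, not scorch's merge planner: `segInfo`, `plan`, and the choice to rank candidates by live document count are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// segInfo is a hypothetical summary of one segment.
type segInfo struct {
	id      int
	total   uint64 // documents ever written to the segment
	deleted uint64 // documents since deleted/obsoleted
}

// live is the number of documents a merge would actually have to copy.
func (s segInfo) live() uint64 { return s.total - s.deleted }

// plan returns the segments that can simply be dropped (no live docs) and up
// to maxMerge segments to merge into one. Ranking by live count favors both
// small segments and segments with many deletions, and leaves large segments
// alone for as long as possible.
func plan(segs []segInfo, maxMerge int) (drop, merge []segInfo) {
	var candidates []segInfo
	for _, s := range segs {
		if s.live() == 0 {
			drop = append(drop, s)
		} else {
			candidates = append(candidates, s)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].live() < candidates[j].live()
	})
	if len(candidates) > maxMerge {
		candidates = candidates[:maxMerge]
	}
	if len(candidates) < 2 {
		return drop, nil // a merge needs at least two inputs
	}
	return drop, candidates
}

func main() {
	segs := []segInfo{{0, 1000, 0}, {1, 10, 10}, {2, 20, 15}, {3, 8, 0}}
	drop, merge := plan(segs, 2)
	fmt.Println("drop:", drop)
	fmt.Println("merge:", merge)
}
```

A real strategy would also weigh write amplification and cache effects, which this sketch ignores.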

Documentation ¶

Index ¶

Constants
Variables
func NewScorch(storeName string, config map[string]interface{}, ...) (index.Index, error)
type DocValueReader
- func (dvr *DocValueReader) VisitDocValues(id index.IndexInternalID, visitor index.DocumentFieldTermVisitor) (err error)
type Event
type EventKind
type IndexSnapshot
- func (i *IndexSnapshot) AddRef()
- func (i *IndexSnapshot) Close() error
- func (i *IndexSnapshot) DecRef() (err error)
- func (i *IndexSnapshot) DocCount() (uint64, error)
- func (i *IndexSnapshot) DocIDReaderAll() (index.DocIDReader, error)
- func (i *IndexSnapshot) DocIDReaderOnly(ids []string) (index.DocIDReader, error)
- func (i *IndexSnapshot) DocValueReader(fields []string) (index.DocValueReader, error)
- func (i *IndexSnapshot) Document(id string) (rv *document.Document, err error)
- func (i *IndexSnapshot) DocumentVisitFieldTerms(id index.IndexInternalID, fields []string, ...) error
- func (i *IndexSnapshot) DumpAll() chan interface{}
- func (i *IndexSnapshot) DumpDoc(id string) chan interface{}
- func (i *IndexSnapshot) DumpFields() chan interface{}
- func (i *IndexSnapshot) ExternalID(id index.IndexInternalID) (string, error)
- func (i *IndexSnapshot) FieldDict(field string) (index.FieldDict, error)
- func (i *IndexSnapshot) FieldDictContains(field string) (index.FieldDictContains, error)
- func (i *IndexSnapshot) FieldDictFuzzy(field string, term string, fuzziness int, prefix string) (index.FieldDict, error)
- func (i *IndexSnapshot) FieldDictOnly(field string, onlyTerms [][]byte, includeCount bool) (index.FieldDict, error)
- func (i *IndexSnapshot) FieldDictPrefix(field string, termPrefix []byte) (index.FieldDict, error)
- func (i *IndexSnapshot) FieldDictRange(field string, startTerm []byte, endTerm []byte) (index.FieldDict, error)
- func (i *IndexSnapshot) FieldDictRegexp(field string, termRegex string) (index.FieldDict, error)
- func (i *IndexSnapshot) Fields() ([]string, error)
- func (i *IndexSnapshot) GetInternal(key []byte) ([]byte, error)
- func (i *IndexSnapshot) Internal() map[string][]byte
- func (i *IndexSnapshot) InternalID(id string) (rv index.IndexInternalID, err error)
- func (i *IndexSnapshot) Segments() []*SegmentSnapshot
- func (i *IndexSnapshot) Size() int
- func (i *IndexSnapshot) TermFieldReader(term []byte, field string, includeFreq, includeNorm, includeTermVectors bool) (index.TermFieldReader, error)
type IndexSnapshotDocIDReader
- func (i *IndexSnapshotDocIDReader) Advance(ID index.IndexInternalID) (index.IndexInternalID, error)
- func (i *IndexSnapshotDocIDReader) Close() error
- func (i *IndexSnapshotDocIDReader) Next() (index.IndexInternalID, error)
- func (i *IndexSnapshotDocIDReader) Size() int
type IndexSnapshotFieldDict
- func (i *IndexSnapshotFieldDict) Close() error
- func (i *IndexSnapshotFieldDict) Contains(key []byte) (bool, error)
- func (i *IndexSnapshotFieldDict) Len() int
- func (i *IndexSnapshotFieldDict) Less(a, b int) bool
- func (i *IndexSnapshotFieldDict) Next() (*index.DictEntry, error)
- func (i *IndexSnapshotFieldDict) Pop() interface{}
- func (i *IndexSnapshotFieldDict) Push(x interface{})
- func (i *IndexSnapshotFieldDict) Swap(a, b int)
type IndexSnapshotTermFieldReader
- func (i *IndexSnapshotTermFieldReader) Advance(ID index.IndexInternalID, preAlloced *index.TermFieldDoc) (*index.TermFieldDoc, error)
- func (i *IndexSnapshotTermFieldReader) Close() error
- func (i *IndexSnapshotTermFieldReader) Count() uint64
- func (i *IndexSnapshotTermFieldReader) Next(preAlloced *index.TermFieldDoc) (*index.TermFieldDoc, error)
- func (s *IndexSnapshotTermFieldReader) Optimize(kind string, octx index.OptimizableContext) (index.OptimizableContext, error)
- func (i *IndexSnapshotTermFieldReader) Size() int
type OptimizeTFRConjunction
- func (o *OptimizeTFRConjunction) Finish() (index.Optimized, error)
type OptimizeTFRConjunctionUnadorned
- func (o *OptimizeTFRConjunctionUnadorned) Finish() (rv index.Optimized, err error)
type OptimizeTFRDisjunctionUnadorned
- func (o *OptimizeTFRDisjunctionUnadorned) Finish() (rv index.Optimized, err error)
type RollbackPoint
- func (r *RollbackPoint) GetInternal(key []byte) []byte
type Scorch
- func (s *Scorch) AddEligibleForRemoval(epoch uint64)
- func (s *Scorch) Advanced() (store.KVStore, error)
- func (s *Scorch) Analyze(d *document.Document) *index.AnalysisResult
- func (s *Scorch) Batch(batch *index.Batch) (err error)
- func (s *Scorch) Close() (err error)
- func (s *Scorch) Delete(id string) error
- func (s *Scorch) DeleteInternal(key []byte) error
- func (s *Scorch) LoadSnapshot(epoch uint64) (rv *IndexSnapshot, err error)
- func (s *Scorch) MemoryUsed() (memUsed uint64)
- func (s *Scorch) Open() error
- func (s *Scorch) Reader() (index.IndexReader, error)
- func (s *Scorch) ReportBytesWritten(bytesWritten uint64)
- func (s *Scorch) Rollback(to *RollbackPoint) error
- func (s *Scorch) RollbackPoints() ([]*RollbackPoint, error)
- func (s *Scorch) RootBoltSnapshotEpochs() ([]uint64, error)
- func (s *Scorch) SetInternal(key, val []byte) error
- func (s *Scorch) Stats() json.Marshaler
- func (s *Scorch) StatsMap() map[string]interface{}
- func (s *Scorch) Update(doc *document.Document) error
type SegmentSnapshot
- func (s *SegmentSnapshot) Close() error
- func (s *SegmentSnapshot) Count() uint64
- func (s *SegmentSnapshot) Deleted() *roaring.Bitmap
- func (s *SegmentSnapshot) DocID(num uint64) ([]byte, error)
- func (s *SegmentSnapshot) DocNumbers(docIDs []string) (*roaring.Bitmap, error)
- func (s *SegmentSnapshot) DocNumbersLive() *roaring.Bitmap
- func (s *SegmentSnapshot) Fields() []string
- func (s *SegmentSnapshot) FullSize() int64
- func (s *SegmentSnapshot) Id() uint64
- func (s SegmentSnapshot) LiveSize() int64
- func (s *SegmentSnapshot) Segment() segment.Segment
- func (s *SegmentSnapshot) Size() (rv int)
- func (s *SegmentSnapshot) VisitDocument(num uint64, visitor segment.DocumentFieldValueVisitor) error
type Stats
- func (s *Stats) MarshalJSON() ([]byte, error)
- func (s *Stats) ToMap() map[string]interface{}

Constants ¶

const Name = "scorch"

const Version uint8 = 2

Variables ¶

var DefaultChunkFactor uint32 = 1024

var DefaultMemoryPressurePauseThreshold uint64 = math.MaxUint64

var DefaultMinSegmentsForInMemoryMerge = 2

DefaultMinSegmentsForInMemoryMerge represents the default number of in-memory zap segments that persistSnapshotMaybeMerge() needs to see in an IndexSnapshot before it decides to merge and persist those segments

var DefaultPersisterNapTimeMSec int = 0 // ms

DefaultPersisterNapTimeMSec is kept at zero, as this helps in direct persistence of segments with the default safe batch option. If the default safe batch option results in a high number of files on disk, then users may initialise this configuration parameter with higher values so that the persister will nap a bit within its work loop, favouring better in-memory merging of segments and resulting in fewer segment files on disk. But that may come with an indexing performance overhead. Unsafe batch users are advised to override this to a higher value for better performance, especially with high data density.

var DefaultPersisterNapUnderNumFiles int = 1000

DefaultPersisterNapUnderNumFiles helps in controlling the pace of the persister. At times of slow merger progress with heavy file merging operations, it's better to slow the persister down so that the merger can catch up within a range defined by this parameter. Fewer files on disk (as per the merge plan) would result in keeping the file handle usage under limit, a faster disk merger and a healthier index. It's been observed that such a loosely sync'ed introducer-persister-merger trio results in better overall performance.

var ErrClosed = fmt.Errorf("scorch closed")

var EventKindBatchIntroduction = EventKind(6)

EventKindBatchIntroduction is fired when Batch() completes.

var EventKindBatchIntroductionStart = EventKind(5)

EventKindBatchIntroductionStart is fired when Batch() is invoked which introduces a new segment.

var EventKindClose = EventKind(2)

EventKindClose is fired when a scorch index has been fully closed.

var EventKindCloseStart = EventKind(1)

EventKindCloseStart is fired when a Scorch.Close() has begun.

var EventKindMergerProgress = EventKind(3)

EventKindMergerProgress is fired when the merger has completed a round of merge processing.

var EventKindPersisterProgress = EventKind(4)

EventKindPersisterProgress is fired when the persister has completed a round of persistence processing.

var NumSnapshotsToKeep = 1

NumSnapshotsToKeep represents how many recent, old snapshots to keep around per Scorch instance. Useful for apps that require rollback'ability.

var OptimizeConjunction = true

var OptimizeConjunctionUnadorned = true

var OptimizeDisjunctionUnadorned = true

var OptimizeDisjunctionUnadornedMinChildCardinality = uint64(256)

var OptimizeTFRConjunctionUnadornedField = "*"

var OptimizeTFRConjunctionUnadornedTerm = []byte("<conjunction:unadorned>")

var OptimizeTFRDisjunctionUnadornedField = "*"

var OptimizeTFRDisjunctionUnadornedTerm = []byte("<disjunction:unadorned>")

var RegistryAsyncErrorCallbacks = map[string]func(error){}

RegistryAsyncErrorCallbacks should be treated as read-only after process init()'ialization.

var RegistryEventCallbacks = map[string]func(Event){}

RegistryEventCallbacks should be treated as read-only after process init()'ialization.

var TermSeparator byte = 0xff

var TermSeparatorSplitSlice = []byte{TermSeparator}

Functions ¶

func NewScorch ¶

func NewScorch(storeName string,
	config map[string]interface{},
	analysisQueue *index.AnalysisQueue) (index.Index, error)

Types ¶

type DocValueReader ¶ added in v0.8.0

type DocValueReader struct {
	// contains filtered or unexported fields
}

func (*DocValueReader) VisitDocValues ¶ added in v0.8.0

func (dvr *DocValueReader) VisitDocValues(id index.IndexInternalID,
	visitor index.DocumentFieldTermVisitor) (err error)

type Event ¶

type Event struct {
	Kind     EventKind
	Scorch   *Scorch
	Duration time.Duration
}

Event represents the information provided in an OnEvent() callback.

type EventKind ¶

type EventKind int

EventKind represents an event code for OnEvent() callbacks.

type IndexSnapshot ¶

type IndexSnapshot struct {
	// contains filtered or unexported fields
}

func (*IndexSnapshot) AddRef ¶

func (i *IndexSnapshot) AddRef()

func (*IndexSnapshot) Close ¶ added in v0.8.0

func (i *IndexSnapshot) Close() error

func (*IndexSnapshot) DecRef ¶

func (i *IndexSnapshot) DecRef() (err error)

func (*IndexSnapshot) DocCount ¶

func (i *IndexSnapshot) DocCount() (uint64, error)

func (*IndexSnapshot) DocIDReaderAll ¶

func (i *IndexSnapshot) DocIDReaderAll() (index.DocIDReader, error)

func (*IndexSnapshot) DocIDReaderOnly ¶

func (i *IndexSnapshot) DocIDReaderOnly(ids []string) (index.DocIDReader, error)

func (*IndexSnapshot) DocValueReader ¶ added in v0.8.0

func (i *IndexSnapshot) DocValueReader(fields []string) (
	index.DocValueReader, error)

func (*IndexSnapshot) Document ¶

func (i *IndexSnapshot) Document(id string) (rv *document.Document, err error)

func (*IndexSnapshot) DocumentVisitFieldTerms ¶

func (i *IndexSnapshot) DocumentVisitFieldTerms(id index.IndexInternalID,
	fields []string, visitor index.DocumentFieldTermVisitor) error

func (*IndexSnapshot) DumpAll ¶ added in v0.8.0

func (i *IndexSnapshot) DumpAll() chan interface{}

func (*IndexSnapshot) DumpDoc ¶ added in v0.8.0

func (i *IndexSnapshot) DumpDoc(id string) chan interface{}

func (*IndexSnapshot) DumpFields ¶ added in v0.8.0

func (i *IndexSnapshot) DumpFields() chan interface{}

func (*IndexSnapshot) ExternalID ¶

func (i *IndexSnapshot) ExternalID(id index.IndexInternalID) (string, error)

func (*IndexSnapshot) FieldDict ¶

func (i *IndexSnapshot) FieldDict(field string) (index.FieldDict, error)

func (*IndexSnapshot) FieldDictContains ¶ added in v0.8.0

func (i *IndexSnapshot) FieldDictContains(field string) (index.FieldDictContains, error)

func (*IndexSnapshot) FieldDictFuzzy ¶ added in v0.8.0

func (i *IndexSnapshot) FieldDictFuzzy(field string,
	term string, fuzziness int, prefix string) (index.FieldDict, error)

func (*IndexSnapshot) FieldDictOnly ¶ added in v0.8.0

func (i *IndexSnapshot) FieldDictOnly(field string,
	onlyTerms [][]byte, includeCount bool) (index.FieldDict, error)

func (*IndexSnapshot) FieldDictPrefix ¶

func (i *IndexSnapshot) FieldDictPrefix(field string,
	termPrefix []byte) (index.FieldDict, error)

func (*IndexSnapshot) FieldDictRange ¶

func (i *IndexSnapshot) FieldDictRange(field string, startTerm []byte,
	endTerm []byte) (index.FieldDict, error)

func (*IndexSnapshot) FieldDictRegexp ¶ added in v0.8.0

func (i *IndexSnapshot) FieldDictRegexp(field string,
	termRegex string) (index.FieldDict, error)

func (*IndexSnapshot) Fields ¶

func (i *IndexSnapshot) Fields() ([]string, error)

func (*IndexSnapshot) GetInternal ¶

func (i *IndexSnapshot) GetInternal(key []byte) ([]byte, error)

func (*IndexSnapshot) Internal ¶

func (i *IndexSnapshot) Internal() map[string][]byte

func (*IndexSnapshot) InternalID ¶

func (i *IndexSnapshot) InternalID(id string) (rv index.IndexInternalID, err error)

func (*IndexSnapshot) Segments ¶

func (i *IndexSnapshot) Segments() []*SegmentSnapshot

func (*IndexSnapshot) Size ¶ added in v0.8.0

func (i *IndexSnapshot) Size() int

func (*IndexSnapshot) TermFieldReader ¶

func (i *IndexSnapshot) TermFieldReader(term []byte, field string, includeFreq,
	includeNorm, includeTermVectors bool) (index.TermFieldReader, error)

type IndexSnapshotDocIDReader ¶

type IndexSnapshotDocIDReader struct {
	// contains filtered or unexported fields
}

func (*IndexSnapshotDocIDReader) Advance ¶

func (i *IndexSnapshotDocIDReader) Advance(ID index.IndexInternalID) (index.IndexInternalID, error)

func (*IndexSnapshotDocIDReader) Close ¶

func (i *IndexSnapshotDocIDReader) Close() error

func (*IndexSnapshotDocIDReader) Next ¶

func (i *IndexSnapshotDocIDReader) Next() (index.IndexInternalID, error)

func (*IndexSnapshotDocIDReader) Size ¶ added in v0.8.0

func (i *IndexSnapshotDocIDReader) Size() int

type IndexSnapshotFieldDict ¶

type IndexSnapshotFieldDict struct {
	// contains filtered or unexported fields
}

func (*IndexSnapshotFieldDict) Close ¶

func (i *IndexSnapshotFieldDict) Close() error

func (*IndexSnapshotFieldDict) Contains ¶ added in v0.8.0

func (i *IndexSnapshotFieldDict) Contains(key []byte) (bool, error)

func (*IndexSnapshotFieldDict) Len ¶

func (i *IndexSnapshotFieldDict) Len() int

func (*IndexSnapshotFieldDict) Less ¶

func (i *IndexSnapshotFieldDict) Less(a, b int) bool

func (*IndexSnapshotFieldDict) Next ¶

func (i *IndexSnapshotFieldDict) Next() (*index.DictEntry, error)

func (*IndexSnapshotFieldDict) Pop ¶

func (i *IndexSnapshotFieldDict) Pop() interface{}

func (*IndexSnapshotFieldDict) Push ¶

func (i *IndexSnapshotFieldDict) Push(x interface{})

func (*IndexSnapshotFieldDict) Swap ¶

func (i *IndexSnapshotFieldDict) Swap(a, b int)

type IndexSnapshotTermFieldReader ¶

type IndexSnapshotTermFieldReader struct {
	// contains filtered or unexported fields
}

func (*IndexSnapshotTermFieldReader) Advance ¶

func (i *IndexSnapshotTermFieldReader) Advance(ID index.IndexInternalID, preAlloced *index.TermFieldDoc) (*index.TermFieldDoc, error)

func (*IndexSnapshotTermFieldReader) Close ¶

func (i *IndexSnapshotTermFieldReader) Close() error

func (*IndexSnapshotTermFieldReader) Count ¶

func (i *IndexSnapshotTermFieldReader) Count() uint64

func (*IndexSnapshotTermFieldReader) Next ¶

func (i *IndexSnapshotTermFieldReader) Next(preAlloced *index.TermFieldDoc) (*index.TermFieldDoc, error)

func (*IndexSnapshotTermFieldReader) Optimize ¶ added in v0.8.0

func (s *IndexSnapshotTermFieldReader) Optimize(kind string,
	octx index.OptimizableContext) (index.OptimizableContext, error)

func (*IndexSnapshotTermFieldReader) Size ¶ added in v0.8.0

func (i *IndexSnapshotTermFieldReader) Size() int

type OptimizeTFRConjunction ¶ added in v0.8.0

type OptimizeTFRConjunction struct {
	// contains filtered or unexported fields
}

func (*OptimizeTFRConjunction) Finish ¶ added in v0.8.0

func (o *OptimizeTFRConjunction) Finish() (index.Optimized, error)

type OptimizeTFRConjunctionUnadorned ¶ added in v0.8.0

type OptimizeTFRConjunctionUnadorned struct {
	// contains filtered or unexported fields
}

func (*OptimizeTFRConjunctionUnadorned) Finish ¶ added in v0.8.0

func (o *OptimizeTFRConjunctionUnadorned) Finish() (rv index.Optimized, err error)

Finish of an unadorned conjunction optimization will compute a termFieldReader with an "actual" bitmap that represents the constituent bitmaps AND'ed together. This termFieldReader cannot provide any freq-norm or termVector associated information.

type OptimizeTFRDisjunctionUnadorned ¶ added in v0.8.0

type OptimizeTFRDisjunctionUnadorned struct {
	// contains filtered or unexported fields
}

func (*OptimizeTFRDisjunctionUnadorned) Finish ¶ added in v0.8.0

func (o *OptimizeTFRDisjunctionUnadorned) Finish() (rv index.Optimized, err error)

Finish of an unadorned disjunction optimization will compute a termFieldReader with an "actual" bitmap that represents the constituent bitmaps OR'ed together. This termFieldReader cannot provide any freq-norm or termVector associated information.

type RollbackPoint ¶

type RollbackPoint struct {
	// contains filtered or unexported fields
}

func (*RollbackPoint) GetInternal ¶

func (r *RollbackPoint) GetInternal(key []byte) []byte

type Scorch ¶

type Scorch struct {
	// contains filtered or unexported fields
}

func (*Scorch) AddEligibleForRemoval ¶

func (s *Scorch) AddEligibleForRemoval(epoch uint64)

func (*Scorch) Advanced ¶

func (s *Scorch) Advanced() (store.KVStore, error)

func (*Scorch) Analyze ¶

func (s *Scorch) Analyze(d *document.Document) *index.AnalysisResult

func (*Scorch) Batch ¶

func (s *Scorch) Batch(batch *index.Batch) (err error)

Batch applies a batch of changes to the index atomically.

func (*Scorch) Close ¶

func (s *Scorch) Close() (err error)

func (*Scorch) Delete ¶

func (s *Scorch) Delete(id string) error

func (*Scorch) DeleteInternal ¶

func (s *Scorch) DeleteInternal(key []byte) error

func (*Scorch) LoadSnapshot ¶

func (s *Scorch) LoadSnapshot(epoch uint64) (rv *IndexSnapshot, err error)

LoadSnapshot loads the index snapshot with the specified epoch. NOTE: this is currently ONLY intended to be used by the command-line tool.

func (*Scorch) MemoryUsed ¶

func (s *Scorch) MemoryUsed() (memUsed uint64)

func (*Scorch) Open ¶

func (s *Scorch) Open() error

func (*Scorch) Reader ¶

func (s *Scorch) Reader() (index.IndexReader, error)

Reader returns a low-level accessor on the index data. Close it to release associated resources.

func (*Scorch) ReportBytesWritten ¶ added in v0.8.0

func (s *Scorch) ReportBytesWritten(bytesWritten uint64)

func (*Scorch) Rollback ¶

func (s *Scorch) Rollback(to *RollbackPoint) error

Rollback atomically and durably (if unsafeBatch is unset) brings the store back to the point in time as represented by the RollbackPoint. Rollback() should only be passed a RollbackPoint that came from the same store using the RollbackPoints() API.

func (*Scorch) RollbackPoints ¶

func (s *Scorch) RollbackPoints() ([]*RollbackPoint, error)

RollbackPoints returns an array of rollback points available for the application to roll back to, with more recent rollback points (higher epochs) coming first.
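
The calling pattern for RollbackPoints and Rollback might look like the sketch below. Because the real types need an opened index, it uses toy stand-ins: point, rollbacker, fakeStore, and rollbackToTag are hypothetical names, and the "tag" internal key is an assumed application convention, not something scorch defines. The sketch relies only on the documented newest-first ordering and on GetInternal being available on each rollback point.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// point is a toy stand-in for a rollback point; it models only GetInternal.
type point struct{ internal map[string][]byte }

func (p *point) GetInternal(key []byte) []byte { return p.internal[string(key)] }

// rollbacker mirrors the shape of the two documented methods, rewritten
// against the toy point type (hypothetical interface).
type rollbacker interface {
	RollbackPoints() ([]*point, error)
	Rollback(to *point) error
}

// rollbackToTag rolls back to the newest point whose internal value for key
// equals want. Points are assumed to arrive newest first.
func rollbackToTag(r rollbacker, key, want []byte) error {
	points, err := r.RollbackPoints()
	if err != nil {
		return err
	}
	for _, p := range points {
		if bytes.Equal(p.GetInternal(key), want) {
			// Pass back a point obtained from the same store.
			return r.Rollback(p)
		}
	}
	return errors.New("no matching rollback point")
}

// fakeStore is an in-memory rollbacker used only to exercise the sketch.
type fakeStore struct {
	points  []*point
	current *point
}

func (f *fakeStore) RollbackPoints() ([]*point, error) { return f.points, nil }
func (f *fakeStore) Rollback(to *point) error          { f.current = to; return nil }

func main() {
	p1 := &point{internal: map[string][]byte{"tag": []byte("v3")}}
	p2 := &point{internal: map[string][]byte{"tag": []byte("v2")}}
	p3 := &point{internal: map[string][]byte{"tag": []byte("v2")}}
	s := &fakeStore{points: []*point{p1, p2, p3}} // newest first

	err := rollbackToTag(s, []byte("tag"), []byte("v2"))
	fmt.Println(err, s.current == p2) // <nil> true
}
```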

func (*Scorch) RootBoltSnapshotEpochs ¶

func (s *Scorch) RootBoltSnapshotEpochs() ([]uint64, error)

func (*Scorch) SetInternal ¶

func (s *Scorch) SetInternal(key, val []byte) error

func (*Scorch) Stats ¶

func (s *Scorch) Stats() json.Marshaler

func (*Scorch) StatsMap ¶

func (s *Scorch) StatsMap() map[string]interface{}

func (*Scorch) Update ¶

func (s *Scorch) Update(doc *document.Document) error

type SegmentSnapshot ¶

type SegmentSnapshot struct {
	// contains filtered or unexported fields
}

func (*SegmentSnapshot) Close ¶

func (s *SegmentSnapshot) Close() error

func (*SegmentSnapshot) Count ¶

func (s *SegmentSnapshot) Count() uint64

func (*SegmentSnapshot) Deleted ¶

func (s *SegmentSnapshot) Deleted() *roaring.Bitmap

func (*SegmentSnapshot) DocID ¶ added in v0.8.0

func (s *SegmentSnapshot) DocID(num uint64) ([]byte, error)

func (*SegmentSnapshot) DocNumbers ¶

func (s *SegmentSnapshot) DocNumbers(docIDs []string) (*roaring.Bitmap, error)

func (*SegmentSnapshot) DocNumbersLive ¶

func (s *SegmentSnapshot) DocNumbersLive() *roaring.Bitmap

DocNumbersLive returns a bitmap containing doc numbers for all live docs

func (*SegmentSnapshot) Fields ¶

func (s *SegmentSnapshot) Fields() []string

func (*SegmentSnapshot) FullSize ¶

func (s *SegmentSnapshot) FullSize() int64

func (*SegmentSnapshot) Id ¶

func (s *SegmentSnapshot) Id() uint64

func (SegmentSnapshot) LiveSize ¶

func (s SegmentSnapshot) LiveSize() int64

func (*SegmentSnapshot) Segment ¶

func (s *SegmentSnapshot) Segment() segment.Segment

func (*SegmentSnapshot) Size ¶ added in v0.8.0

func (s *SegmentSnapshot) Size() (rv int)

func (*SegmentSnapshot) VisitDocument ¶

func (s *SegmentSnapshot) VisitDocument(num uint64, visitor segment.DocumentFieldValueVisitor) error

type Stats ¶

type Stats struct {
	TotUpdates uint64
	TotDeletes uint64

	TotBatches        uint64
	TotBatchesEmpty   uint64
	TotBatchIntroTime uint64
	MaxBatchIntroTime uint64

	CurRootEpoch       uint64
	LastPersistedEpoch uint64
	LastMergedEpoch    uint64

	TotOnErrors uint64

	TotAnalysisTime uint64
	TotIndexTime    uint64

	TotIndexedPlainTextBytes uint64

	TotTermSearchersStarted  uint64
	TotTermSearchersFinished uint64

	TotIntroduceLoop       uint64
	TotIntroduceSegmentBeg uint64
	TotIntroduceSegmentEnd uint64
	TotIntroducePersistBeg uint64
	TotIntroducePersistEnd uint64
	TotIntroduceMergeBeg   uint64
	TotIntroduceMergeEnd   uint64
	TotIntroduceRevertBeg  uint64
	TotIntroduceRevertEnd  uint64

	TotIntroducedItems         uint64
	TotIntroducedSegmentsBatch uint64
	TotIntroducedSegmentsMerge uint64

	TotPersistLoopBeg          uint64
	TotPersistLoopErr          uint64
	TotPersistLoopProgress     uint64
	TotPersistLoopWait         uint64
	TotPersistLoopWaitNotified uint64
	TotPersistLoopEnd          uint64

	TotPersistedItems    uint64
	TotItemsToPersist    uint64
	TotPersistedSegments uint64

	TotPersisterSlowMergerPause  uint64
	TotPersisterSlowMergerResume uint64

	TotPersisterNapPauseCompleted uint64
	TotPersisterMergerNapBreak    uint64

	TotFileMergeLoopBeg uint64
	TotFileMergeLoopErr uint64
	TotFileMergeLoopEnd uint64

	TotFileMergePlan     uint64
	TotFileMergePlanErr  uint64
	TotFileMergePlanNone uint64
	TotFileMergePlanOk   uint64

	TotFileMergePlanTasks              uint64
	TotFileMergePlanTasksDone          uint64
	TotFileMergePlanTasksErr           uint64
	TotFileMergePlanTasksSegments      uint64
	TotFileMergePlanTasksSegmentsEmpty uint64

	TotFileMergeSegmentsEmpty uint64
	TotFileMergeSegments      uint64
	TotFileSegmentsAtRoot     uint64
	TotFileMergeWrittenBytes  uint64

	TotFileMergeZapBeg  uint64
	TotFileMergeZapEnd  uint64
	TotFileMergeZapTime uint64
	MaxFileMergeZapTime uint64

	TotFileMergeIntroductions        uint64
	TotFileMergeIntroductionsDone    uint64
	TotFileMergeIntroductionsSkipped uint64

	CurFilesIneligibleForRemoval     uint64
	TotSnapshotsRemovedFromMetaStore uint64

	TotMemMergeBeg          uint64
	TotMemMergeErr          uint64
	TotMemMergeDone         uint64
	TotMemMergeZapBeg       uint64
	TotMemMergeZapEnd       uint64
	TotMemMergeZapTime      uint64
	MaxMemMergeZapTime      uint64
	TotMemMergeSegments     uint64
	TotMemorySegmentsAtRoot uint64
}

Stats tracks statistics about the index. Fields prefixed like CurXxxx are gauges (can go up and down), and fields prefixed like TotXxxx are monotonically increasing counters.
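
The gauge/counter naming convention suggests how a monitoring client could interpret two successive snapshots. The sketch below is one such interpretation under stated assumptions: the delta helper is hypothetical, and it takes plain map[string]uint64 values for simplicity rather than the map[string]interface{} that StatsMap returns. MaxXxxx fields fit neither prefix, so this sketch skips them.

```go
package main

import (
	"fmt"
	"strings"
)

// delta compares two snapshots: Tot* counters are reported as the change
// since prev, Cur* gauges are reported at their latest value, and anything
// else (for example Max* fields) is left out.
func delta(prev, cur map[string]uint64) map[string]uint64 {
	out := make(map[string]uint64, len(cur))
	for name, v := range cur {
		switch {
		case strings.HasPrefix(name, "Tot"):
			out[name] = v - prev[name] // counter: only increases
		case strings.HasPrefix(name, "Cur"):
			out[name] = v // gauge: may go up or down
		}
	}
	return out
}

func main() {
	prev := map[string]uint64{"TotBatches": 10, "CurRootEpoch": 4}
	cur := map[string]uint64{"TotBatches": 25, "CurRootEpoch": 9, "MaxBatchIntroTime": 120}

	fmt.Println(delta(prev, cur)) // map[CurRootEpoch:9 TotBatches:15]
}
```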

func (*Stats) MarshalJSON ¶

func (s *Stats) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler and, in contrast to standard JSON marshaling, provides atomic safety.

func (*Stats) ToMap ¶ added in v0.8.0

func (s *Stats) ToMap() map[string]interface{}

ToMap atomically populates the returned map.

Directories ¶

Path	Synopsis
mergeplan	Package mergeplan provides a segment merge planning approach that's inspired by Lucene's TieredMergePolicy.java and descriptions like http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
segment
zap