docs

package
v0.0.0-...-6ec82fb
Warning

This package is not in the latest version of its module.

Published: Feb 14, 2025 License: BSD-3-Clause Imports: 6 Imported by: 0

Documentation

Overview

Package docs implements a corpus of text documents identified by document IDs. It allows retrieving the documents by ID as well as retrieving documents that are new since a previous scan.


Constants

This section is empty.

Variables

This section is empty.

Functions

func Latest

func Latest[T Entry](src Source[T]) timed.DBTime

Latest returns the latest known DBTime marked old by the source's DocWatcher.

func LatestFunc

func LatestFunc[T Entry](src Source[T]) func() timed.DBTime

LatestFunc returns a function that returns the latest known DBTime marked old by the source's DocWatcher.

func Restart

func Restart[T Entry](src Source[T])

Restart causes the next call to Sync to behave as if it has never synced any data from src. As a result, all data will be reconverted to doc form and re-added. Docs that have not changed since they were last added to the corpus will appear unmodified; others will be marked new in the corpus.

func Sync

func Sync[T Entry, S Source[T]](dc *Corpus, src S)

Sync reads new embeddable values from src and adds the documents to the corpus dc.

Sync uses [Source.DocWatcher] to save its position across multiple calls.

Sync logs status and unexpected problems to lg, presumably the logger given to New, since Sync itself takes no logger parameter.
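
The interaction between Sync, Latest, and Restart can be sketched with a toy model. This is only an illustration under stated assumptions: dbTime, entry, source, syncOnce, latest, and restart are hypothetical stand-ins, the corpus is a plain map, and the saved position is a single field rather than a timed.Watcher, whose API is not shown on this page.

```go
package main

import "fmt"

// dbTime is a hypothetical stand-in for timed.DBTime.
type dbTime int64

// entry is a hypothetical source record; lastWritten plays the role
// of Entry.LastWritten.
type entry struct {
	id, text    string
	lastWritten dbTime
}

// source is a toy stand-in for a Source together with its watcher position.
// The real package keeps the position in a timed.Watcher.
type source struct {
	entries []entry // assumed to be sorted by lastWritten
	old     dbTime  // latest DBTime marked old
}

// latest models Latest: it reports the latest DBTime marked old.
func latest(src *source) dbTime { return src.old }

// restart models Restart: it forgets the saved position.
func restart(src *source) { src.old = 0 }

// syncOnce models Sync: it adds every entry newer than the saved position
// to the corpus and advances the position. It returns the number of entries
// processed, which the real Sync does not do.
func syncOnce(corpus map[string]string, src *source) int {
	n := 0
	for _, e := range src.entries {
		if e.lastWritten <= src.old {
			continue // already synced on an earlier call
		}
		corpus[e.id] = e.text
		src.old = e.lastWritten
		n++
	}
	return n
}

func main() {
	corpus := map[string]string{}
	src := &source{entries: []entry{{"a", "alpha", 1}, {"b", "beta", 2}}}

	fmt.Println(syncOnce(corpus, src), latest(src)) // 2 2: first pass sees both entries
	fmt.Println(syncOnce(corpus, src), latest(src)) // 0 2: nothing new since last sync
	restart(src)
	fmt.Println(syncOnce(corpus, src), latest(src)) // 2 2: everything is reconverted
}
```

In this model a restart does not clear the corpus; it only makes the next sync revisit every entry, which matches the description of Restart above.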

Types

type Corpus

type Corpus struct {
	// contains filtered or unexported fields
}

A Corpus is the collection of documents stored in a database.

func New

func New(lg *slog.Logger, db storage.DB) *Corpus

New returns a new Corpus representing the documents stored in db.

func (*Corpus) Add

func (c *Corpus) Add(id, title, text string)

Add adds a document with the given id, title, and text. If the document already exists in the corpus with the same title and text, Add is a no-op. Otherwise, if the document already exists in the corpus, it is replaced.
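
The no-op and replace rules can be illustrated with a small in-memory model. It is a sketch, not the real implementation: corpus, doc, newCorpus, and add are hypothetical names, DBTime is a plain counter, and the assumption that a replaced document receives a fresh DBTime is inferred from the Doc field comment rather than stated by Add itself.

```go
package main

import "fmt"

// doc is a hypothetical stand-in for Doc; DBTime is a plain counter here.
type doc struct {
	DBTime          int64
	ID, Title, Text string
}

// corpus is a toy in-memory model, not the real database-backed Corpus.
type corpus struct {
	clock int64
	docs  map[string]doc
}

func newCorpus() *corpus { return &corpus{docs: map[string]doc{}} }

// add mirrors the documented rules for Add: an identical document is left
// untouched, anything else is inserted or replaced.
func (c *corpus) add(id, title, text string) {
	if old, ok := c.docs[id]; ok && old.Title == title && old.Text == text {
		return // same title and text: no-op
	}
	c.clock++ // assumption: every write gets a new, larger DBTime
	c.docs[id] = doc{DBTime: c.clock, ID: id, Title: title, Text: text}
}

func main() {
	c := newCorpus()
	const id = "https://example.com/a"

	c.add(id, "A", "first")
	t1 := c.docs[id].DBTime
	c.add(id, "A", "first") // unchanged, so nothing is written
	t2 := c.docs[id].DBTime
	c.add(id, "A", "second") // changed text, so the document is replaced
	t3 := c.docs[id].DBTime

	fmt.Println(t1 == t2, t3 > t2, c.docs[id].Text) // true true second
}
```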

func (*Corpus) Delete

func (c *Corpus) Delete(id string)

Delete deletes a document with the given id. If the document does not exist in the corpus, Delete is a no-op.

func (*Corpus) DocWatcher

func (c *Corpus) DocWatcher(name string) *timed.Watcher[*Doc]

DocWatcher returns a new timed.Watcher with the given name. It picks up where any previous Watcher of the same name left off.

func (*Corpus) Docs

func (c *Corpus) Docs(prefix string) iter.Seq[*Doc]

Docs returns an iterator over all documents in the corpus with IDs starting with a given prefix. The documents are ordered by ID.

func (*Corpus) DocsAfter

func (c *Corpus) DocsAfter(dbtime timed.DBTime, prefix string) iter.Seq[*Doc]

DocsAfter returns an iterator over all documents with DBTime greater than dbtime and with IDs starting with the prefix. The documents are ordered by DBTime.

func (*Corpus) Get

func (c *Corpus) Get(id string) (doc *Doc, ok bool)

Get returns the document with the given id. It returns nil, false if no document is found. It returns doc, true otherwise.
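
Get follows the usual Go comma-ok pattern, and Delete tolerates missing IDs. The sketch below shows both on a toy map-backed corpus; corpus, doc, get, and del are hypothetical names, not the package's own types.

```go
package main

import "fmt"

// doc is a hypothetical stand-in for Doc.
type doc struct{ ID, Title, Text string }

// corpus is a toy map-backed model of Corpus.
type corpus map[string]*doc

// get models Corpus.Get: (nil, false) for a missing id, (doc, true) otherwise.
func (c corpus) get(id string) (*doc, bool) {
	d, ok := c[id]
	return d, ok
}

// del models Corpus.Delete: deleting a missing id is a no-op.
func (c corpus) del(id string) { delete(c, id) }

func main() {
	c := corpus{"x": &doc{ID: "x", Title: "X", Text: "hello"}}

	if d, ok := c.get("x"); ok {
		fmt.Println("found:", d.Title) // found: X
	}

	c.del("x")
	c.del("x") // the second delete does nothing

	d, ok := c.get("x")
	fmt.Println(d == nil, ok) // true false
}
```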

type Doc

type Doc struct {
	DBTime timed.DBTime // DBTime when Doc was written
	ID     string       // document identifier (such as a URL)
	Title  string       // title of document
	Text   string       // text of document
}

A Doc is a single document in the Corpus.

type Entry

type Entry interface {
	// LastWritten returns the DBTime this piece of data was last written
	// to its data source.
	LastWritten() timed.DBTime
}

Entry is a timed entry in a Source.

type Source

type Source[T Entry] interface {
	// DocWatcher returns the watcher to use to keep track
	// of last [Sync] for this data source.
	DocWatcher() *timed.Watcher[T]
	// ToDocs converts the data to an iterator of [*Doc] values
	// that can be stored in a [Corpus].
	// It returns (nil, false) if the data should not be stored
	// in the [Corpus].
	ToDocs(T) (iter.Seq[*Doc], bool)
}

Source is a data source to pull into a Corpus.
