apoco

package

v0.9.0 Published: Feb 15, 2022 License: MIT Imports: 21 Imported by: 0

Repository

git.sr.ht/~flobar/apoco

Documentation ¶

Index ¶

func AgreeingOCRs(t T, i, n int) (float64, bool)
func ApplyOCRToCorrection(ocr, sug string) string
func CandidateAgreeingOCR(t T, i, n int) (float64, bool)
func CandidateHistPatternConf(t T, i, n int) (float64, bool)
func CandidateHistPatternConfLog(t T, i, n int) (float64, bool)
func CandidateIsLexiconEntry(cand gofiler.Candidate) bool
func CandidateLen(t T, i, n int) (float64, bool)
func CandidateLevDist(t T, i, n int) (float64, bool)
func CandidateMatchesOCR(t T, i, n int) (float64, bool)
func CandidateOCRPatternConf(t T, i, n int) (float64, bool)
func CandidateOCRPatternConfLog(t T, i, n int) (float64, bool)
func CandidateProfilerWeight(t T, i, n int) (float64, bool)
func CandidateWLevDist(t T, i, n int) (float64, bool)
func CandidatesContainsLexiconEntry(cands []gofiler.Candidate) bool
func DocumentLexicality(t T, i, n int) (float64, bool)
func EachDocument(ctx context.Context, in <-chan T, f func(*Document, []T) error) error
func EachLine(ctx context.Context, in <-chan T, f func([]T) error) error
func EachToken(ctx context.Context, in <-chan T, f func(T) error) error
func EachTrigram(str string, fn func(string))
func Log(f string, args ...interface{})
func LogEnabled() bool
func OCRLevDist(t T, i, n int) (float64, bool)
func OCRMaxCharConf(t T, i, n int) (float64, bool)
func OCRMinCharConf(t T, i, n int) (float64, bool)
func OCRTokenLen(t T, i, n int) (float64, bool)
func OCRWLevDist(t T, i, n int) (float64, bool)
func Pipe(ctx context.Context, fns ...StreamFunc) error
func RankingCandidateConfDiffToNext(t T, i, n int) (float64, bool)
func RankingConf(t T, i, n int) (float64, bool)
func RankingConfDiffToNext(t T, i, n int) (float64, bool)
func ReadProfile(name string) (gofiler.Profile, error)
func RunProfiler(ctx context.Context, exe, config string, ts ...T) (gofiler.Profile, error)
func SendTokens(ctx context.Context, out chan<- T, tokens ...T) error
func SetLog(enable bool)
func WriteProfile(name string, profile gofiler.Profile) error
type Char
- func (char Char) String() string
type Chars
- func (chars Chars) Chars() string
- func (chars Chars) Confs() []float64
- func (chars Chars) String() string
type Correction
type Document
- func (d *Document) AddUnigram(token string)
- func (d *Document) LookupOCRPattern(a, b []rune) (float64, bool)
- func (d *Document) Unigram(str string) float64
type FeatureFunc
type FeatureSet
- func NewFeatureSet(names ...string) (FeatureSet, error)
- func (fs FeatureSet) Calculate(xs []float64, t T, n int) []float64
- func (fs FeatureSet) Names(names []string, typ string, nocr int) []string
type FreqList
- func (f *FreqList) EachTrigram(str string, fn func(float64))
type LMConfig
type Model
- func ReadModel(name string, lms map[string]LMConfig, create bool) (*Model, error)
- func (m *Model) Get(mod string, nocr int) (*ml.LR, FeatureSet, error)
- func (m *Model) Put(mod string, nocr int, lr *ml.LR, fs []string)
- func (m *Model) Write(name string) (err error)
type ModelData
type Ranking
type Split
type StreamFunc
- func AddShortTokensToProfile(max int) StreamFunc
- func Combine(ctx context.Context, fns ...StreamFunc) StreamFunc
- func ConnectCandidates() StreamFunc
- func ConnectCorrections(p ml.Predictor, fs FeatureSet, n int) StreamFunc
- func ConnectLanguageModel(lm map[string]*FreqList) StreamFunc
- func ConnectMergesWithGT() StreamFunc
- func ConnectProfile(profile gofiler.Profile) StreamFunc
- func ConnectRankings(p ml.Predictor, fs FeatureSet, n int) StreamFunc
- func ConnectSplitCandidates() StreamFunc
- func ConnectUnigrams() StreamFunc
- func FilterBad(min int) StreamFunc
- func FilterLexiconEntries() StreamFunc
- func FilterNonLexiconEntries() StreamFunc
- func FilterShort(min int) StreamFunc
- func MarkSplits(n int) StreamFunc
- func Normalize() StreamFunc
- func Tee(fns ...func(T) error) StreamFunc
type T
- func ReadToken(ctx context.Context, in <-chan T) (T, bool, error)
- func (t T) ContainsLexiconEntry() bool
- func (t T) GT() string
- func (t T) IsLexiconEntry() bool
- func (t T) String() string

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func AgreeingOCRs ¶

func AgreeingOCRs(t T, i, n int) (float64, bool)

AgreeingOCRs returns the number of OCRs that agree with the master OCR token.

func ApplyOCRToCorrection ¶

func ApplyOCRToCorrection(ocr, sug string) string

ApplyOCRToCorrection applies the casing of the master OCR string to the correction's candidate suggestion and prepends and appends any punctuation of the master OCR to the suggestion.
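The description above can be illustrated with a small, self-contained sketch. It does not call the package; `applyOCR` and `splitPunct` are hypothetical helpers, and the casing rules (all-caps, leading capital, otherwise unchanged) as well as the use of `unicode.IsPunct` are assumptions about what "casing" and "punctuation" mean here. The real function may behave differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitPunct splits s into leading punctuation, core and trailing punctuation.
func splitPunct(s string) (prefix, core, suffix string) {
	rest := strings.TrimLeftFunc(s, unicode.IsPunct)
	prefix = s[:len(s)-len(rest)]
	core = strings.TrimRightFunc(rest, unicode.IsPunct)
	suffix = rest[len(core):]
	return prefix, core, suffix
}

// applyOCR is a hypothetical stand-in for ApplyOCRToCorrection: it copies
// the casing of the OCR token onto the suggestion and re-attaches the OCR
// token's surrounding punctuation.
func applyOCR(ocr, sug string) string {
	prefix, core, suffix := splitPunct(ocr)
	runes := []rune(core)
	switch {
	case len(runes) == 0:
		// Only punctuation: there is no casing to transfer.
	case len(runes) > 1 && core == strings.ToUpper(core) && core != strings.ToLower(core):
		// Assumption: an all-caps OCR token yields an all-caps suggestion.
		sug = strings.ToUpper(sug)
	case unicode.IsUpper(runes[0]):
		// Assumption: a leading capital is copied to the suggestion.
		if s := []rune(sug); len(s) > 0 {
			s[0] = unicode.ToUpper(s[0])
			sug = string(s)
		}
	}
	return prefix + sug + suffix
}

func main() {
	fmt.Println(applyOCR("(Haus,", "haus"))
	fmt.Println(applyOCR("VND.", "und"))
}
```

Working on runes rather than bytes keeps the sketch correct for non-ASCII tokens, which are common in historical text.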

func CandidateAgreeingOCR ¶

func CandidateAgreeingOCR(t T, i, n int) (float64, bool)

CandidateAgreeingOCR returns the number of OCR tokens that agree with the specific profiler candidate of the token.

func CandidateHistPatternConf ¶

func CandidateHistPatternConf(t T, i, n int) (float64, bool)

CandidateHistPatternConf returns the product of the confidences of the primary OCR characters for the assumed historical rewrite pattern of the connected candidate.

func CandidateHistPatternConfLog ¶ added in v0.0.17

func CandidateHistPatternConfLog(t T, i, n int) (float64, bool)

CandidateHistPatternConfLog returns the sum of the logarithms of the confidences of the primary OCR characters for the assumed historical rewrite pattern of the connected candidate.

func CandidateIsLexiconEntry ¶ added in v0.0.50

func CandidateIsLexiconEntry(cand gofiler.Candidate) bool

CandidateIsLexiconEntry returns true if the given candidate represents a lexicon entry, i.e. it contains no OCR- and/or historic patterns.

func CandidateLen ¶

func CandidateLen(t T, i, n int) (float64, bool)

CandidateLen returns the length of the connected profiler candidate.

func CandidateLevDist ¶ added in v0.0.54

func CandidateLevDist(t T, i, n int) (float64, bool)

CandidateLevDist returns the Levenshtein distance between the OCR token and the token's connected profiler candidate. For the master OCR the corresponding Distance from the profiler candidate is used, whereas for support OCRs the Levenshtein distance is calculated.
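For readers unfamiliar with the metric, the following is a minimal sketch of the plain (unweighted) Levenshtein distance on runes. The function `levenshtein` is a hypothetical helper written for this illustration; it is not the package's implementation and says nothing about how the profiler computes its own Distance value.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the number of single-rune insertions, deletions and
// substitutions needed to turn a into b (two-row dynamic programming).
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	cur := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev, cur = cur, prev
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(levenshtein("vnd", "und"))
}
```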

func CandidateMatchesOCR ¶

func CandidateMatchesOCR(t T, i, n int) (float64, bool)

CandidateMatchesOCR returns true if the corresponding OCR matches the connected candidate and false otherwise.

func CandidateOCRPatternConf ¶

func CandidateOCRPatternConf(t T, i, n int) (float64, bool)

CandidateOCRPatternConf returns the product of the confidences of the primary OCR characters for the assumed OCR error pattern of the connected candidate. TODO: rename to CandidateErrPatternConf

func CandidateOCRPatternConfLog ¶ added in v0.0.16

func CandidateOCRPatternConfLog(t T, i, n int) (float64, bool)

CandidateOCRPatternConfLog returns the sum of the logarithms of the confidences of the primary OCR characters for the assumed OCR error pattern of the connected candidate.
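The product-based and log-based pattern confidence features carry the same information in different scales. The sketch below contrasts the two on plain slices of confidences. `product` and `logSum` are hypothetical helpers, and the use of the natural logarithm is an assumption; the documentation does not state the base. One common reason to prefer the log form is that long products of small values underflow to zero, while the sum of logarithms stays representable.

```go
package main

import (
	"fmt"
	"math"
)

// product multiplies all confidences.
func product(confs []float64) float64 {
	p := 1.0
	for _, c := range confs {
		p *= c
	}
	return p
}

// logSum adds the natural logarithms of all confidences.
// A confidence of 0 yields -Inf.
func logSum(confs []float64) float64 {
	var s float64
	for _, c := range confs {
		s += math.Log(c)
	}
	return s
}

func main() {
	confs := []float64{0.9, 0.8, 0.5}
	fmt.Println(product(confs), logSum(confs))

	// Many small confidences: the product underflows, the log sum does not.
	many := make([]float64, 400)
	for i := range many {
		many[i] = 0.1
	}
	fmt.Println(product(many), logSum(many))
}
```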

func CandidateProfilerWeight ¶

func CandidateProfilerWeight(t T, i, n int) (float64, bool)

CandidateProfilerWeight returns the profiler confidence value for the token's candidate.

func CandidateWLevDist ¶ added in v0.0.54

func CandidateWLevDist(t T, i, n int) (float64, bool)

CandidateWLevDist returns the weighted Levenshtein distance between the OCR token and the token's connected profiler candidate. For the master OCR the corresponding Distance from the profiler candidate is used, whereas for support OCRs the Levenshtein distance is calculated.

func CandidatesContainsLexiconEntry ¶ added in v0.0.50

func CandidatesContainsLexiconEntry(cands []gofiler.Candidate) bool

CandidatesContainsLexiconEntry returns true if any of the given candidates represents a lexicon entry.

func DocumentLexicality ¶

func DocumentLexicality(t T, i, n int) (float64, bool)

DocumentLexicality returns the (global) lexicality of the given token's document. Using this feature only makes sense if the training data contains more than one training document.

func EachDocument ¶ added in v0.0.59

func EachDocument(ctx context.Context, in <-chan T, f func(*Document, []T) error) error

EachDocument iterates over the tokens grouping them together based on their language models. The given callback function is called for each group of tokens. This function assumes that the tokens are connected with their Document.

func EachLine ¶ added in v0.0.46

func EachLine(ctx context.Context, in <-chan T, f func([]T) error) error

EachLine calls the given callback function for each line.

func EachToken ¶

func EachToken(ctx context.Context, in <-chan T, f func(T) error) error

EachToken iterates over the tokens in the input channel and calls the callback function for each token.

func EachTrigram ¶ added in v0.0.18

func EachTrigram(str string, fn func(string))

func Log ¶ added in v0.0.21

func Log(f string, args ...interface{})

Log logs the given message if logging is enabled. This function uses log.Printf for logging, so it is safe to be used concurrently.

func LogEnabled ¶ added in v0.0.29

func LogEnabled() bool

LogEnabled returns true if logging is currently enabled.

func OCRLevDist ¶ added in v0.0.54

func OCRLevDist(t T, i, n int) (float64, bool)

OCRLevDist returns the Levenshtein distance between the secondary OCRs and the primary OCR.

func OCRMaxCharConf ¶

func OCRMaxCharConf(t T, i, n int) (float64, bool)

OCRMaxCharConf returns the maximal character confidence of the master OCR token.

func OCRMinCharConf ¶

func OCRMinCharConf(t T, i, n int) (float64, bool)

OCRMinCharConf returns the minimal character confidence of the master OCR token.

func OCRTokenLen ¶

func OCRTokenLen(t T, i, n int) (float64, bool)

OCRTokenLen returns the length of the OCR token. It operates on any configuration.

func OCRWLevDist ¶ added in v0.0.54

func OCRWLevDist(t T, i, n int) (float64, bool)

OCRWLevDist returns the weighted Levenshtein distance between the secondary OCRs and the primary OCR.

func Pipe ¶

func Pipe(ctx context.Context, fns ...StreamFunc) error

Pipe pipes multiple stream funcs together, making sure to run all of them concurrently. The first function in the list (the reader) is called with a nil input channel. The last function is always called with a nil output channel. To clarify: the first function must never read from its input channel and the last function must never write to its output channel.

StreamFunctions should transform the input tokens to output tokens. They must never close any channels. They should use the SendTokens, ReadToken and EachToken utility functions to ensure proper handling of context cancelation.

func RankingCandidateConfDiffToNext ¶

func RankingCandidateConfDiffToNext(t T, i, n int) (float64, bool)

RankingCandidateConfDiffToNext returns the top ranked candidate's weight minus the weight of the next (or 0).

func RankingConf ¶

func RankingConf(t T, i, n int) (float64, bool)

RankingConf returns the confidence of the best ranked correction candidate for the given token.

func RankingConfDiffToNext ¶

func RankingConfDiffToNext(t T, i, n int) (float64, bool)

RankingConfDiffToNext returns the difference of the best ranked correction candidate's confidence to the next. If only one correction candidate is available, the next ranking's confidence is assumed to be 0.
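The rule "the next ranking's confidence is assumed to be 0" is easy to state in code. The following sketch works on a plain slice of confidences sorted in descending order. `confDiffToNext` is a hypothetical helper, and returning `false` for an empty ranking is an assumption, not documented behavior.

```go
package main

import "fmt"

// confDiffToNext returns the best confidence minus the second best one.
// confs must be sorted in descending order.
func confDiffToNext(confs []float64) (float64, bool) {
	if len(confs) == 0 {
		return 0, false // assumption: the feature does not apply without rankings
	}
	next := 0.0 // a missing second candidate counts as confidence 0
	if len(confs) > 1 {
		next = confs[1]
	}
	return confs[0] - next, true
}

func main() {
	fmt.Println(confDiffToNext([]float64{0.75, 0.25}))
	fmt.Println(confDiffToNext([]float64{0.75}))
}
```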

func ReadProfile ¶ added in v0.0.17

func ReadProfile(name string) (gofiler.Profile, error)

ReadProfile reads the profile from a gzipped JSON file.

func RunProfiler ¶ added in v0.0.17

func RunProfiler(ctx context.Context, exe, config string, ts ...T) (gofiler.Profile, error)

RunProfiler runs the profiler over the given tokens (using the token entries at index 0) with the given executable and config file. The profiler's output is logged to stderr.

func SendTokens ¶

func SendTokens(ctx context.Context, out chan<- T, tokens ...T) error

SendTokens writes tokens into the given output channel. This function should always be used to write tokens into output channels.

func SetLog ¶ added in v0.0.19

func SetLog(enable bool)

SetLog enables or disables logging. This function is not safe for concurrent usage and should be used once at application start.

func WriteProfile ¶ added in v0.0.17

func WriteProfile(name string, profile gofiler.Profile) error

WriteProfile writes the profile as a gzipped JSON file.
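As an illustration of the gzipped JSON file format mentioned for ReadProfile and WriteProfile, the following sketch round-trips a value through such a file using only the standard library. `writeGzJSON`, `readGzJSON` and `roundTrip` are hypothetical helpers, and the `map[string]int` payload is a placeholder, not the layout of a real gofiler.Profile.

```go
package main

import (
	"compress/gzip"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// writeGzJSON encodes v as JSON and writes it gzip-compressed to name.
func writeGzJSON(name string, v interface{}) (err error) {
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	defer func() {
		// Report a failing Close unless an earlier error is already set.
		if cerr := f.Close(); err == nil {
			err = cerr
		}
	}()
	zw := gzip.NewWriter(f)
	if err = json.NewEncoder(zw).Encode(v); err != nil {
		return err
	}
	return zw.Close() // flushes the remaining compressed data
}

// readGzJSON reads a gzip-compressed JSON file into v.
func readGzJSON(name string, v interface{}) error {
	f, err := os.Open(name)
	if err != nil {
		return err
	}
	defer f.Close()
	zr, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer zr.Close()
	return json.NewDecoder(zr).Decode(v)
}

// roundTrip writes in to a temporary file and reads it back.
func roundTrip(in map[string]int) (map[string]int, error) {
	dir, err := os.MkdirTemp("", "profile")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	name := filepath.Join(dir, "profile.json.gz")
	if err := writeGzJSON(name, in); err != nil {
		return nil, err
	}
	out := make(map[string]int)
	if err := readGzJSON(name, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	fmt.Println(roundTrip(map[string]int{"vnd": 3}))
}
```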

Types ¶

type Char ¶

type Char struct {
	Conf float64 // confidence of the rune
	Char rune    // rune
}

Char represents an OCR char with its confidence.

func (Char) String ¶

func (char Char) String() string

type Chars ¶

type Chars []Char

Chars represents the master OCR chars with the respective confidences.

func (Chars) Chars ¶ added in v0.0.45

func (chars Chars) Chars() string

Chars converts a char array to a string containing the chars.

func (Chars) Confs ¶ added in v0.0.9

func (chars Chars) Confs() []float64

Confs returns the confidences as array.

func (Chars) String ¶

func (chars Chars) String() string

type Correction ¶

type Correction struct {
	Candidate *gofiler.Candidate
	Conf      float64
}

Correction represents a correction decision for tokens.

type Document ¶ added in v0.0.36

type Document struct {
	LM         map[string]*FreqList // Global language models.
	Unigrams   FreqList             // Document-wise unigram model.
	Profile    gofiler.Profile      // Document-wise profile.
	OCRPats    map[string]float64   // Error patterns (from the profiler).
	Group      string               // File group or directory of the document.
	Lexicality float64              // Lexicality score.
}

Document represents the token's document.

func (*Document) AddUnigram ¶ added in v0.0.36

func (d *Document) AddUnigram(token string)

AddUnigram adds the token to the language model's unigram map.

func (*Document) LookupOCRPattern ¶ added in v0.0.54

func (d *Document) LookupOCRPattern(a, b []rune) (float64, bool)

func (*Document) Unigram ¶ added in v0.0.36

func (d *Document) Unigram(str string) float64

Unigram looks up the given token in the unigram list (or 0 if the unigram is not present).

type FeatureFunc ¶

type FeatureFunc func(t T, i, n int) (float64, bool)

FeatureFunc defines the function a feature needs to implement. A feature func gets a token and a configuration (the current OCR-index i and the total number of parallel OCRs n). The function then should return the feature value for the given token and whether this feature applies for the given configuration (i and n).

type FeatureSet ¶

type FeatureSet []FeatureFunc

FeatureSet is just a list of feature funcs.

func NewFeatureSet ¶

func NewFeatureSet(names ...string) (FeatureSet, error)

NewFeatureSet creates a new feature set from the list of feature function names.

Feature function names can have optional arguments. The arguments of a feature function must be given in a comma-separated list enclosed in `()`. For example `feature`, `feature()`, `feature(arg1,arg2)` are all valid feature function names.
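The naming syntax can be made concrete with a small parser sketch. `parseFeature` is a hypothetical helper written for this illustration; trimming whitespace around arguments and the exact error cases are assumptions, and the package's own parser may accept or reject different inputs.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFeature splits a feature function name such as "feature(arg1,arg2)"
// into the bare name and its arguments.
func parseFeature(s string) (name string, args []string, err error) {
	open := strings.IndexByte(s, '(')
	if open < 0 {
		// No argument list at all: the whole string is the name.
		if s == "" || strings.ContainsAny(s, "),") {
			return "", nil, fmt.Errorf("invalid feature name: %q", s)
		}
		return s, nil, nil
	}
	if open == 0 || !strings.HasSuffix(s, ")") {
		return "", nil, fmt.Errorf("invalid feature name: %q", s)
	}
	name = s[:open]
	inner := s[open+1 : len(s)-1]
	if strings.TrimSpace(inner) == "" {
		return name, nil, nil // "feature()" has no arguments
	}
	for _, arg := range strings.Split(inner, ",") {
		// Assumption: whitespace around arguments is not significant.
		args = append(args, strings.TrimSpace(arg))
	}
	return name, args, nil
}

func main() {
	for _, s := range []string{"feature", "feature()", "feature(arg1,arg2)"} {
		name, args, err := parseFeature(s)
		fmt.Println(name, args, err)
	}
}
```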

func (FeatureSet) Calculate ¶

func (fs FeatureSet) Calculate(xs []float64, t T, n int) []float64

Calculate calculates the feature vector for the given feature functions for the given token and the given number of OCRs and appends it to the given vector. Any given feature function that does not apply to the given configuration (and returns false as its second return parameter for the configuration) is omitted and not appended to the resulting feature vector.
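How the second return value of a feature function shapes the resulting vector can be seen in the following sketch. All types and functions here (`token`, `featureFunc`, `featureSet`, `ocrLen`, `agreeing`, `calculate`) are simplified, hypothetical stand-ins. In particular, the iteration order (each feature evaluated for every OCR index `i` from 0 to `n-1`) is an assumption about Calculate, not something the documentation states.

```go
package main

import "fmt"

// token holds the aligned OCR strings; ocrs[0] is the master OCR.
type token struct{ ocrs []string }

type featureFunc func(t token, i, n int) (float64, bool)

type featureSet []featureFunc

// ocrLen applies to every OCR index and returns the token length in runes.
func ocrLen(t token, i, n int) (float64, bool) {
	return float64(len([]rune(t.ocrs[i]))), true
}

// agreeing counts the support OCRs equal to the master OCR. It applies only
// to the master index and only if there is at least one support OCR.
func agreeing(t token, i, n int) (float64, bool) {
	if i != 0 || n < 2 {
		return 0, false
	}
	var c float64
	for j := 1; j < n; j++ {
		if t.ocrs[j] == t.ocrs[0] {
			c++
		}
	}
	return c, true
}

// calculate appends the value of every applicable feature to xs and skips
// the features that report false for a configuration.
func (fs featureSet) calculate(xs []float64, t token, n int) []float64 {
	for _, f := range fs {
		for i := 0; i < n; i++ {
			if v, ok := f(t, i, n); ok {
				xs = append(xs, v)
			}
		}
	}
	return xs
}

func main() {
	fs := featureSet{ocrLen, agreeing}
	t := token{ocrs: []string{"vnd", "und"}}
	fmt.Println(fs.calculate(nil, t, 2))
}
```

Appending to a caller-supplied slice, as the documented signature does, lets the caller reuse one buffer across many tokens.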

func (FeatureSet) Names ¶ added in v0.0.21

func (fs FeatureSet) Names(names []string, typ string, nocr int) []string

Names returns the names of the features including the features for different values of OCRs. This function panics if the length of the feature set differs from the length of the given feature names.

type FreqList ¶

type FreqList struct {
	FreqList map[string]int `json:"freqList"`
	Total    int            `json:"total"`
}

FreqList is a simple frequency map.

func (*FreqList) EachTrigram ¶ added in v0.0.56

func (f *FreqList) EachTrigram(str string, fn func(float64))

EachTrigram calls the given callback function for each trigram in the given string.

type LMConfig ¶ added in v0.0.62

type LMConfig struct {
	Path string `json:"path"`
}

LMConfig configures the path to a language model csv file.

type Model ¶

type Model struct {
	Models             map[string]map[int]ModelData // Models map the name and nocr to the model data.
	GlobalHistPatterns map[string]float64           // Historical pattern frequencies from the profiler.
	GlobalOCRPatterns  map[string]float64           // OCR pattern frequencies from the profiler.
	LM                 map[string]*FreqList         // Language models.
}

Model holds the different models for the different training runs for a different number of OCRs. It is used to save and load the models for the automatic postcorrection.

func ReadModel ¶

func ReadModel(name string, lms map[string]LMConfig, create bool) (*Model, error)

ReadModel reads a model from a gob-encoded, gzipped input file. If the given file does not exist, the corresponding language models are loaded and a new model is returned. If create is set to false, no new model will be created and the model must be read from an existing file.

func (*Model) Get ¶

func (m *Model) Get(mod string, nocr int) (*ml.LR, FeatureSet, error)

Get loads the model and the corresponding feature set for the given configuration.

func (*Model) Put ¶

func (m *Model) Put(mod string, nocr int, lr *ml.LR, fs []string)

Put inserts the weights and the corresponding feature set for the given configuration into this model.

func (*Model) Write ¶

func (m *Model) Write(name string) (err error)

Write writes the model as a gob-encoded, gzipped file to the given path, overwriting any previously existing model.
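The combination of gob encoding and gzip compression can be sketched in memory with the standard library. `modelData` here is a hypothetical, simplified struct (the real ModelData holds an *ml.LR), and `writeModel`, `readModel` and `roundTrip` are illustrative helpers, not the package's functions.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/gob"
	"fmt"
	"io"
)

// modelData is a hypothetical, simplified stand-in for ModelData.
type modelData struct {
	Features []string
	Weights  []float64
}

// writeModel gob-encodes m and writes it gzip-compressed to w.
func writeModel(w io.Writer, m map[string]modelData) error {
	zw := gzip.NewWriter(w)
	if err := gob.NewEncoder(zw).Encode(m); err != nil {
		return err
	}
	return zw.Close() // flushes the remaining compressed data
}

// readModel reverses writeModel.
func readModel(r io.Reader) (map[string]modelData, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	var m map[string]modelData
	if err := gob.NewDecoder(zr).Decode(&m); err != nil {
		return nil, err
	}
	return m, nil
}

// roundTrip encodes m into a buffer and decodes it again.
func roundTrip(m map[string]modelData) (map[string]modelData, error) {
	var buf bytes.Buffer
	if err := writeModel(&buf, m); err != nil {
		return nil, err
	}
	return readModel(&buf)
}

func main() {
	in := map[string]modelData{
		"rr": {Features: []string{"OCRTokenLen"}, Weights: []float64{0.5}},
	}
	fmt.Println(roundTrip(in))
}
```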

type ModelData ¶ added in v0.0.7

type ModelData struct {
	Features []string // Feature names used to train the model.
	Model    *ml.LR   // The trained model.
}

ModelData holds a linear regression model.

type Ranking ¶

type Ranking struct {
	Candidate *gofiler.Candidate
	Prob      float64
}

Ranking maps correction candidates of tokens to their predicted probabilities.

type Split ¶ added in v0.0.46

type Split struct {
	Candidates []gofiler.Candidate
	Tokens     []T
	Valid      bool
}

type StreamFunc ¶

type StreamFunc func(context.Context, <-chan T, chan<- T) error

StreamFunc is a type def for stream functions. A stream function is used to transform tokens from the input channel to the output channel. They should be used with the Pipe function to chain multiple functions together.

func AddShortTokensToProfile ¶ added in v0.0.46

func AddShortTokensToProfile(max int) StreamFunc

AddShortTokensToProfile returns a stream function that adds fake profiler interpretations for short tokens into the token's profile. Short tokens are tokens with at most max Unicode runes.

func Combine ¶ added in v0.0.14

func Combine(ctx context.Context, fns ...StreamFunc) StreamFunc

Combine lets you combine stream functions. All functions are run concurrently in their own error group.

func ConnectCandidates ¶

func ConnectCandidates() StreamFunc

ConnectCandidates returns a stream function that connects tokens with their respective candidates to the stream. Tokens with no candidates or tokens with only a modern interpretation are filtered from the stream.

func ConnectCorrections ¶

func ConnectCorrections(p ml.Predictor, fs FeatureSet, n int) StreamFunc

ConnectCorrections connects the tokens with the decider's correction decisions.

func ConnectLanguageModel ¶ added in v0.0.37

func ConnectLanguageModel(lm map[string]*FreqList) StreamFunc

ConnectLanguageModel connects the document of the tokens to a language model.

func ConnectMergesWithGT ¶ added in v0.0.46

func ConnectMergesWithGT() StreamFunc

func ConnectProfile ¶ added in v0.0.29

func ConnectProfile(profile gofiler.Profile) StreamFunc

ConnectProfile returns a stream function that connects the tokens with the profile.

func ConnectRankings ¶

func ConnectRankings(p ml.Predictor, fs FeatureSet, n int) StreamFunc

ConnectRankings connects the tokens of the input stream with their respective rankings.

func ConnectSplitCandidates ¶ added in v0.0.46

func ConnectSplitCandidates() StreamFunc

ConnectSplitCandidates returns a stream function that connects tokens with their respective candidates to the stream. Tokens with no candidates or tokens with only a modern interpretation are filtered from the stream.

func ConnectUnigrams ¶ added in v0.0.29

func ConnectUnigrams() StreamFunc

ConnectUnigrams adds the unigrams to the token's language model.

func FilterBad ¶

func FilterBad(min int) StreamFunc

FilterBad returns a stream function that filters tokens with not enough OCR and/or GT tokens.

func FilterLexiconEntries ¶

func FilterLexiconEntries() StreamFunc

FilterLexiconEntries returns a stream function that filters all tokens that are lexicon entries from the stream.

func FilterNonLexiconEntries ¶ added in v0.0.59

func FilterNonLexiconEntries() StreamFunc

FilterNonLexiconEntries returns a stream function that filters all tokens that are not lexicon entries from the stream.

func FilterShort ¶

func FilterShort(min int) StreamFunc

FilterShort returns a stream function that filters short master OCR tokens from the input stream. Short tokens are tokens with fewer than min Unicode characters.

func MarkSplits ¶ added in v0.0.60

func MarkSplits(n int) StreamFunc

MarkSplits detects possible splits between the primary and a secondary OCR (denoted by the given index n). A split is detected if two or more primary OCR tokens are aligned to the same secondary OCR token.

func Normalize ¶

func Normalize() StreamFunc

Normalize returns a stream function that trims all leading and trailing punctuation from the tokens, converts them to lowercase and replaces any whitespace (in the case of merges due to alignment) with a '_'.
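The three normalization steps can be sketched on a single string. `normalize` is a hypothetical helper for illustration; it uses `unicode.IsPunct` and `unicode.IsSpace` as assumed definitions of punctuation and whitespace, which may not match the package exactly (for instance, symbols such as `$` are not punctuation under `unicode.IsPunct`).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalize trims surrounding punctuation, lowercases the result and
// replaces every whitespace rune with an underscore.
func normalize(s string) string {
	s = strings.TrimFunc(s, unicode.IsPunct)
	s = strings.ToLower(s)
	return strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return '_'
		}
		return r
	}, s)
}

func main() {
	fmt.Println(normalize("(Haus,"))
	fmt.Println(normalize("zu Haus."))
}
```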

func Tee ¶ added in v0.0.14

func Tee(fns ...func(T) error) StreamFunc

Tee calls all the given callback functions for each token. After all functions have been called, if the output channel is not nil, the token is sent to the output channel.

type T ¶ added in v0.0.14

type T struct {
	Document *Document   // Document this token belongs to.
	Payload  interface{} // Token payload; either *gofiler.Candidate, []Ranking, Correction or Split
	Cor      string      // Correction for the token
	File     string      // The file of the token
	ID       string      // ID of the token in its file
	Chars    Chars       // Master OCR chars including their confidences
	Tokens   []string    // Master and support OCRs and gt
	EOL, SOL bool        // End of line and start of line marker.
	IsSplit  bool        // Marks possible split tokens between the primary and secondary OCR.
}

T represents aligned OCR-tokens.

func ReadToken ¶

func ReadToken(ctx context.Context, in <-chan T) (T, bool, error)

ReadToken reads one token from the given channel. This function should always be used to read single tokens from input channels.

func (T) ContainsLexiconEntry ¶ added in v0.0.50

func (t T) ContainsLexiconEntry() bool

ContainsLexiconEntry returns true if any of the suggestions of the token is a lexicon entry.

func (T) GT ¶ added in v0.0.52

func (t T) GT() string

func (T) IsLexiconEntry ¶ added in v0.0.14

func (t T) IsLexiconEntry() bool

IsLexiconEntry returns true if this token is a normal lexicon entry for its connected language model.

func (T) String ¶ added in v0.0.14

func (t T) String() string

Directories ¶

Path	Synopsis
align
mets
ml
node	Package node provides helper functions to work with queryxml.Node pointers.
pagexml
snippets