paperminer

Version: v0.0.1 | Published: May 9, 2024 | License: BSD-3-Clause | Imports: 6 | Imported by: 0

README

Amend Paperless documents with extracted information

Paperminer is a system for amending documents stored in Paperless-ngx with additional information ("facts") extracted from the documents themselves or other sources.

The hansmi/dossier package is called to parse PDF documents (other formats could be implemented).

The Go programming language's plugin package comes with a number of caveats which make it unsuitable. Compile-time plugins via the hansmi/staticplug package are used instead. It's therefore necessary to set up your own build. An example for a program with a plugin can be found in the example/myminer directory.

Plugins may use dossier sketches to look for specific regular expressions at absolute or relative positions on pages. The sketchfacts package is often sufficient even though it ignores pages beyond the first. Custom logic can produce document facts from the findings.

Plugins may also extract arbitrary document pages and implement their own data extraction. External APIs may also be involved.

Normalizing extracted text before parsing it further is generally recommended, and not just for dates and times: remove extraneous whitespace, separators, and similar noise. Regular expressions should also be written flexibly where possible, because OCR-derived text often differs slightly from the original.

Useful packages for writing document facters:

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func GlobalPluginRegistry

func GlobalPluginRegistry() *staticplug.Registry

GlobalPluginRegistry returns a pointer to a global plugin registry.

func MustRegisterPlugin

func MustRegisterPlugin(p staticplug.Plugin)

Types

type DocumentFacter

type DocumentFacter interface {
	staticplug.Plugin

	// DocumentFacts is invoked after a document has been parsed into
	// structured text. The return value can be nil to report that no suitable
	// facts were found.
	DocumentFacts(context.Context, DocumentFacterOptions) (*Facts, error)
}

type DocumentFacterOptions

type DocumentFacterOptions struct {
	Logger   *zap.Logger
	Document *dossier.Document
}

type Facts

type Facts struct {
	Reporter *string `json:"reporter"`

	Title         *string    `json:"title,omitempty"`
	Created       *time.Time `json:"created,omitempty"`
	DocumentType  *string    `json:"document_type,omitempty"`
	Correspondent *string    `json:"correspondent,omitempty"`
	StoragePath   *string    `json:"storage_path,omitempty"`

	SetTags   []string `json:"set_tags,omitempty"`
	UnsetTags []string `json:"unset_tags,omitempty"`
}

func (*Facts) IsEmpty

func (f *Facts) IsEmpty() bool

IsEmpty reports whether no fact property has been set.

func (*Facts) String

func (f *Facts) String() string
