gogosseract

v0.0.11-0ad3421
Published: Nov 4, 2023 | License: Apache-2.0 | Imports: 10 | Imported by: 1

README

gogosseract


A reimplementation of https://github.com/otiai10/gosseract without CGo. It runs Tesseract, compiled to WASM with Emscripten, via Wazero.

The WASM is generated from my personal fork of robertknight's well-written tesseract-wasm project.

Note that Tesseract is only compiled with support for the Tesseract LSTM neural network OCR engine, and not for "classic" Tesseract.

Training Data

Tesseract requires training data in order to accurately recognize text. The official source is https://github.com/tesseract-ocr/tessdata_fast. You can either download the file at runtime or embed it in your Go binary at compile time with go:embed.
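The runtime-download strategy can be sketched with the standard library alone. The helper names here (trainingDataURL, downloadTrainingData) are hypothetical, and the URL layout is an assumption: it presumes each language file sits at the root of the tessdata_fast repository's main branch.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
)

// trainingDataURL builds a download URL for a language's traineddata file.
// Assumption: files live at the root of the tessdata_fast main branch.
func trainingDataURL(lang string) string {
	return "https://github.com/tesseract-ocr/tessdata_fast/raw/main/" + lang + ".traineddata"
}

// downloadTrainingData fetches the file at url and returns its full contents.
func downloadTrainingData(ctx context.Context, client *http.Client, url string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, fmt.Errorf("building request: %w", err)
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("downloading training data: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("downloading training data: unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```

The returned bytes can be wrapped in bytes.NewReader for Config.TrainingData, or passed directly as PoolConfig.TrainingDataBytes. For the compile-time strategy, a `//go:embed eng.traineddata` directive above a `var trainingData []byte` declaration (with the embed package imported) gives the same []byte without any network access.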

Accuracy

Tesseract can work better if the input images are preprocessed. See https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html for tips.
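As one illustration of preprocessing, the sketch below binarizes an image with a fixed global threshold using only the Go standard library. It is a minimal example, not part of gogosseract: the function names are hypothetical, the cutoff value needs tuning per image set, and it assumes the Tesseract WASM build can decode PNG input.

```go
package main

import (
	"bytes"
	"image"
	"image/color"
	"image/png"
	"io"
)

// toBlackOrWhite maps a luminance value to pure black (0) or pure white (255).
func toBlackOrWhite(luma, cutoff uint8) uint8 {
	if luma >= cutoff {
		return 255
	}
	return 0
}

// binarize converts src to grayscale and applies a fixed global threshold.
func binarize(src image.Image, cutoff uint8) *image.Gray {
	b := src.Bounds()
	dst := image.NewGray(b)
	for y := b.Min.Y; y < b.Max.Y; y++ {
		for x := b.Min.X; x < b.Max.X; x++ {
			g := color.GrayModel.Convert(src.At(x, y)).(color.Gray)
			dst.SetGray(x, y, color.Gray{Y: toBlackOrWhite(g.Y, cutoff)})
		}
	}
	return dst
}

// encodePNG serializes img so it can be handed to an API that takes an io.Reader.
func encodePNG(img image.Image) (io.Reader, error) {
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		return nil, err
	}
	return &buf, nil
}
```

The reader returned by encodePNG(binarize(img, 128)) could then be passed wherever an image io.Reader is expected. Tesseract also binarizes internally, so whether this helps depends on the input.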

Examples

Using Tesseract to parse text from an image.

    cfg := gogosseract.Config{
        Language: "eng",
        TrainingData: trainingDataFile,
    }
    // While Tesseract's output is very useful for debugging, you have the option to silence or redirect it
    cfg.Stderr = io.Discard
    cfg.Stdout = io.Discard
    // Compile the Tesseract WASM and run it, loading in the TrainingData and setting any Config Variables provided
    tess, err := gogosseract.New(ctx, cfg)
    handleErr(err)
    // Load the image, without parsing it.
    err = tess.LoadImage(ctx, imageFile, gogosseract.LoadImageOptions{})
    handleErr(err)

    text, err := tess.GetText(ctx, func(progress int32) { log.Printf("Tesseract parsing is %d%% complete.", progress) })
    handleErr(err)
    // Closing the Tesseract instance will clean up everything used by Tesseract and its WASM module
    handleErr(tess.Close(ctx))

Using a Pool of Tesseract workers for thread safe concurrent image parsing.

    cfg := gogosseract.Config{
        Language: "eng",
        TrainingData: trainingDataFile,
    }
    // Create 10 Tesseract instances that can process image requests concurrently.
    pool, err := gogosseract.NewPool(ctx, 10, gogosseract.PoolConfig{Config: cfg})
    handleErr(err)
    defer pool.Close()

    // ParseImage loads the image and waits until the Tesseract worker sends back your result.
    hocr, err := pool.ParseImage(ctx, img, gogosseract.ParseImageOptions{
        IsHOCR: true,
    })
    handleErr(err)

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	wasm.CompileConfig
	// Languages Tesseract scans for. Defaults to "eng".
	Language string
	// Training Data Tesseract uses. Required. Must support the provided language. See https://github.com/tesseract-ocr/tessdata_fast for more details.
	TrainingData io.Reader
	// Variables are optionally passed into Tesseract as variable config options. Some options are listed at http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
	Variables map[string]string
	// WASMCache is an optional wazero.CompilationCache used for running multiple Tesseract instances more efficiently.
	WASMCache wazero.CompilationCache
}

type LoadImageOptions

type LoadImageOptions struct {
	// RemoveUnderlines uses Leptonica (C img lib) to remove the underlines from the given image. This copies the image several times.
	RemoveUnderlines bool
}

type ParseImageOptions

type ParseImageOptions struct {
	LoadImageOptions
	// IsHOCR makes a GetHOCR request instead of the default GetText
	IsHOCR bool
	// Called whenever Tesseract's parsing progresses, gives a percentage.
	ProgressCB func(int32)
}

type Pool

type Pool struct {
	// contains filtered or unexported fields
}

func NewPool

func NewPool(ctx context.Context, count uint, cfg PoolConfig) (_ *Pool, err error)

NewPool creates a pool of Tesseract clients for safe, efficient concurrent use.

func (*Pool) Close

func (p *Pool) Close()

Close shuts down the Pool, closes the Tesseract workers, and waits for the goroutines to end.

func (*Pool) ParseImage

func (p *Pool) ParseImage(ctx context.Context, img io.Reader, opts ParseImageOptions) (string, error)

ParseImage loads an image into our Tesseract object and gets back text from it. Both actions are executed on an available worker. Set a timeout with context.WithTimeout to handle the case where all workers are busy.
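The interaction between busy workers and a context deadline can be illustrated with a small stand-in pool. This is a sketch of the general pattern, not gogosseract's actual implementation: workerPool, parse, and the string-based work function are all hypothetical.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// request carries one unit of work and a channel for its result.
type request struct {
	input string
	reply chan string
}

// workerPool is a hypothetical stand-in for a pool of OCR workers.
type workerPool struct {
	requests chan request
	wg       sync.WaitGroup
}

// newWorkerPool starts count goroutines that each handle one request at a time.
func newWorkerPool(count int, work func(string) string) *workerPool {
	p := &workerPool{requests: make(chan request)}
	for i := 0; i < count; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for req := range p.requests {
				req.reply <- work(req.input) // reply is buffered, so this never blocks
			}
		}()
	}
	return p
}

// parse hands input to the first free worker, giving up when ctx is done.
func (p *workerPool) parse(ctx context.Context, input string) (string, error) {
	req := request{input: input, reply: make(chan string, 1)}
	select {
	case p.requests <- req:
	case <-ctx.Done():
		return "", ctx.Err() // every worker stayed busy until the deadline
	}
	select {
	case out := <-req.reply:
		return out, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// parseWithTimeout wraps parse in a context.WithTimeout deadline.
func (p *workerPool) parseWithTimeout(input string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return p.parse(ctx, input)
}

// shutdown stops accepting work and waits for the workers to exit.
func (p *workerPool) shutdown() {
	close(p.requests)
	p.wg.Wait()
}
```

With the real Pool, the equivalent is to create the context with context.WithTimeout and pass it to ParseImage, then check the returned error for context.DeadlineExceeded.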

type PoolConfig

type PoolConfig struct {
	Config
	// TrainingDataBytes is Config.TrainingData, but as a []byte for concurrency's sake.
	// Multiple Tesseract workers can't read from a single io.Reader, so they can't benefit from streaming the data.
	// For convenience you only need to set either Config.TrainingData or TrainingDataBytes.
	TrainingDataBytes []byte
}
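The reason a []byte is friendlier to concurrency than a single io.Reader can be shown in a few lines. In this sketch (helper names are hypothetical), the stream is drained once, and each worker then gets its own bytes.Reader with an independent read offset over the same shared slice.

```go
package main

import (
	"bytes"
	"io"
)

// bufferOnce drains a streaming source into memory exactly once.
func bufferOnce(src io.Reader) ([]byte, error) {
	return io.ReadAll(src)
}

// perWorkerReaders returns n independent readers over the same bytes.
// Each bytes.Reader tracks its own offset; the slice itself is shared, not copied.
func perWorkerReaders(data []byte, n int) []io.Reader {
	readers := make([]io.Reader, n)
	for i := range readers {
		readers[i] = bytes.NewReader(data)
	}
	return readers
}
```

If the training data is already in memory, for example from go:embed, setting TrainingDataBytes directly avoids the extra read.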

type Tesseract

type Tesseract struct {
	// contains filtered or unexported fields
}

func New

func New(ctx context.Context, cfg Config) (t *Tesseract, err error)

New creates a new Tesseract instance that is ready for use. The Tesseract WASM is initialized with the given training data, language, and variable options. A Tesseract object is NOT safe for concurrent use.

func (*Tesseract) ClearImage

func (t *Tesseract) ClearImage(ctx context.Context) error

ClearImage clears the image from within Tesseract. LoadImage calls this for you.

func (*Tesseract) Close

func (t *Tesseract) Close(ctx context.Context) error

Close shuts down all the resources associated with the Tesseract object.

func (*Tesseract) GetHOCR

func (t *Tesseract) GetHOCR(ctx context.Context, progressCB func(int32)) (string, error)

GetHOCR parses a previously loaded image for HOCR text. progressCB is called with a percentage for tracking Tesseract's recognition progress.
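hOCR output embeds word bounding boxes in HTML attributes, so a caller usually needs to post-process it. The sketch below pulls words and boxes out with a regular expression. It assumes word spans look like Tesseract's usual hOCR (class, then id, then a title starting with bbox) and contain plain text; the word type and parseHOCRWords are hypothetical helpers, and a full HTML parser would be more robust.

```go
package main

import (
	"html"
	"regexp"
	"strconv"
)

// word is one recognized word with its bounding box in image pixels.
type word struct {
	Text           string
	X0, Y0, X1, Y1 int
}

// wordPattern matches spans such as:
//   <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 96'>Hello</span>
var wordPattern = regexp.MustCompile(
	`<span class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([^<]*)</span>`)

// parseHOCRWords extracts every plain-text ocrx_word span from hocr.
func parseHOCRWords(hocr string) []word {
	var words []word
	for _, m := range wordPattern.FindAllStringSubmatch(hocr, -1) {
		var box [4]int
		valid := true
		for i := range box {
			n, err := strconv.Atoi(m[i+1])
			if err != nil {
				valid = false // only possible if a coordinate overflows int
				break
			}
			box[i] = n
		}
		if !valid {
			continue
		}
		words = append(words, word{
			Text: html.UnescapeString(m[5]),
			X0:   box[0], Y0: box[1], X1: box[2], Y1: box[3],
		})
	}
	return words
}
```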

func (*Tesseract) GetText

func (t *Tesseract) GetText(ctx context.Context, progressCB func(int32)) (string, error)

GetText parses a previously loaded image for text. progressCB is called with a percentage for tracking Tesseract's recognition progress.
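A progress callback that logs on every call can be noisy. One possible approach is to wrap the reporting function so that it fires only after progress has advanced by a given step. The helper below is a hypothetical sketch and assumes the callback is invoked from a single goroutine, since it keeps unsynchronized state.

```go
package main

import "log"

// newProgressCallback returns a func(int32) that calls report only when
// progress has advanced by at least step points since the last report.
// 100 is always reported once so the final state is never dropped.
func newProgressCallback(step int32, report func(percent int32)) func(int32) {
	last := int32(-1) // nothing reported yet
	return func(percent int32) {
		if percent == last {
			return
		}
		if last >= 0 && percent < 100 && percent < last+step {
			return
		}
		last = percent
		report(percent)
	}
}

// logProgress is one possible report function.
func logProgress(percent int32) {
	log.Printf("Tesseract parsing is %d%% complete.", percent)
}
```

The result of newProgressCallback(25, logProgress) has the func(int32) shape that GetText and GetHOCR accept as progressCB.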

func (*Tesseract) LoadImage

func (t *Tesseract) LoadImage(ctx context.Context, img io.Reader, opts LoadImageOptions) error

LoadImage clears any previously loaded images, and loads the provided img into Tesseract WASM for parsing. Unfortunately the image is fully copied to memory a few times. Leptonica parses it into a Pix object and Tesseract copies that Pix object internally. Keep that in mind when working with large images.
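Because the image is copied several times, it can be worth rejecting oversized inputs before they reach LoadImage. The sketch below reads only the image header with the standard library and then returns a reader that replays the consumed bytes. The helper names and the pixel budget are hypothetical, and only PNG and JPEG headers are registered here.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	_ "image/jpeg" // register JPEG header decoding
	_ "image/png"  // register PNG header decoding
	"io"
)

// withinBudget reports whether a width x height image stays within maxPixels.
// Dividing instead of multiplying avoids integer overflow on huge dimensions.
func withinBudget(width, height, maxPixels int) bool {
	return width > 0 && height > 0 && width <= maxPixels/height
}

// checkImageSize reads only the image header, then returns a reader that
// replays the consumed bytes followed by the rest of img.
func checkImageSize(img io.Reader, maxPixels int) (io.Reader, error) {
	var header bytes.Buffer
	cfg, format, err := image.DecodeConfig(io.TeeReader(img, &header))
	if err != nil {
		return nil, fmt.Errorf("reading image header: %w", err)
	}
	if !withinBudget(cfg.Width, cfg.Height, maxPixels) {
		return nil, fmt.Errorf("%s image is %dx%d, over the %d pixel budget",
			format, cfg.Width, cfg.Height, maxPixels)
	}
	return io.MultiReader(&header, img), nil
}
```

The io.TeeReader records every byte that image.DecodeConfig consumes, so the io.MultiReader hands the complete, unmodified image to the next consumer.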

Directories

Path Synopsis
internal
gen
Code generated by wazero-emscripten-embind, DO NOT EDIT.
