gosseract

Version: v0.0.0-...-cbec72f

This package is not in the latest version of its module.
Published: Jun 27, 2023 License: MIT Imports: 8 Imported by: 0

README

gosseract OCR

A Go OCR package that uses the Tesseract C++ library.

OCR Server

Do you just want an OCR server, or want to see a working example of this package? There is a ready-made server application that is very easy to deploy.

👉 https://github.com/otiai10/ocrserver

Example

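package main

import (
	"fmt"

	"github.com/otiai10/gosseract/v2"
)

func main() {
	client := gosseract.NewClient()
	// Close frees the underlying Tesseract API, so always defer it.
	defer client.Close()

	// SetImage and Text both return errors; check them instead of discarding.
	if err := client.SetImage("path/to/image.png"); err != nil {
		fmt.Println("set image:", err)
		return
	}
	text, err := client.Text()
	if err != nil {
		fmt.Println("ocr:", err)
		return
	}
	fmt.Println(text)
	// Hello, World!
}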

Installation

  1. Install tesseract-ocr, including its library and headers
  2. go get -t github.com/otiai10/gosseract/v2

See the Dockerfile for a step-by-step setup. Or, if you want a ready-made environment right away, try docker run -it --rm otiai10/gosseract.

Test

If you have tesseract-ocr installed locally, you can just run

% go test .

Otherwise, if you DON'T want to install tesseract-ocr locally, run ./test/runtime, which uses Docker and Vagrant to test the source code on several runtimes.

% ./test/runtime --driver docker
% ./test/runtime --driver vagrant

Check ./test/runtimes for more information about runtime tests.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func ClearPersistentCache

func ClearPersistentCache()

ClearPersistentCache clears any library-level memory caches. There are a variety of expensive-to-load constant data structures (mostly language dictionaries) that are cached globally – surviving the Init() and End() of individual TessBaseAPI's. This function allows the clearing of these caches.

func GetAvailableLanguages

func GetAvailableLanguages() ([]string, error)

GetAvailableLanguages returns a list of the languages available in the default tessdata path.

func Version

func Version() string

Version returns the version of Tesseract-OCR

Types

type BoundingBox

type BoundingBox struct {
	Box                                image.Rectangle
	Word                               string
	Confidence                         float64
	BlockNum, ParNum, LineNum, WordNum int
}

BoundingBox contains the position, confidence and UTF8 text of the recognized word
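A common follow-up is to drop low-confidence words before using the result. The sketch below is only illustrative: it redeclares BoundingBox locally so it compiles without Tesseract, `filterByConfidence` is a hypothetical helper that is not part of this package, the sample data is hand-written, and the threshold assumes confidence is reported on Tesseract's usual 0 to 100 scale.

```go
package main

import (
	"fmt"
	"image"
)

// BoundingBox is a local copy of the documented struct, so this sketch
// builds without cgo or a Tesseract installation.
type BoundingBox struct {
	Box                                image.Rectangle
	Word                               string
	Confidence                         float64
	BlockNum, ParNum, LineNum, WordNum int
}

// filterByConfidence is a hypothetical helper (not part of gosseract).
// It keeps only the boxes whose Confidence is at least threshold.
func filterByConfidence(boxes []BoundingBox, threshold float64) []BoundingBox {
	var kept []BoundingBox
	for _, b := range boxes {
		if b.Confidence >= threshold {
			kept = append(kept, b)
		}
	}
	return kept
}

func main() {
	// Hand-written sample data, not real OCR output.
	boxes := []BoundingBox{
		{Box: image.Rect(10, 10, 60, 30), Word: "Hello,", Confidence: 96.5},
		{Box: image.Rect(70, 10, 130, 30), Word: "W0rld", Confidence: 41.2},
	}
	for _, b := range filterByConfidence(boxes, 80) {
		fmt.Printf("%q at %v (confidence %.1f)\n", b.Word, b.Box, b.Confidence)
	}
}
```

In real use the slice would come from GetBoundingBoxes or GetBoundingBoxesVerbose rather than from literals.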

type Client

type Client struct {

	// Trim specifies whether to trim unnecessary characters from the result string.
	// OCR results often contain unnecessary characters, such as newlines, at the head or foot of the string.
	// If `Trim` is set, this client removes those characters from the result.
	Trim bool

	// TessdataPrefix is the directory path to `tessdata`.
	// By default it is `/usr/local/share/tessdata/` or a similar location.
	// TODO: Implement and test
	TessdataPrefix string

	// Languages are the languages to be detected. If not specified, it defaults to "eng".
	Languages []string

	// Variables is a pool of values to be passed to "tesseract::TessBaseAPI->SetVariable" later, once the API is initialized.
	// TODO: Think if it should be public, or private property.
	Variables map[SettableVariable]string

	// Config is a file path to the configuration for Tesseract
	// See http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
	// TODO: Fix link to official page
	ConfigFilePath string
	// contains filtered or unexported fields
}

Client is an argument builder for tesseract::TessBaseAPI.

func NewClient

func NewClient() *Client

NewClient constructs a new Client. The caller is responsible for closing this client.

Example
client := NewClient()
// Never forget to defer Close. The caller is responsible for closing this client.
defer client.Close()

func (*Client) Close

func (client *Client) Close() (err error)

Close frees allocated API. This MUST be called for ANY client constructed by "NewClient" function.

func (*Client) DetectOrientationScript

func (client *Client) DetectOrientationScript() (out *OSDResult, err error)

DetectOrientationScript detects the orientation and script of the text.

func (*Client) DisableOutput

func (client *Client) DisableOutput() error

DisableOutput ...

func (*Client) GetBoundingBoxes

func (client *Client) GetBoundingBoxes(level PageIteratorLevel) (out []BoundingBox, err error)

GetBoundingBoxes returns bounding boxes for each matched element at the given page iterator level.

func (*Client) GetBoundingBoxesVerbose

func (client *Client) GetBoundingBoxesVerbose() (out []BoundingBox, err error)

GetBoundingBoxesVerbose returns word-level bounding boxes with block_num, par_num, line_num and word_num, based on the formatted TSV output of the C++ API. Reference: `TessBaseAPI::GetTSVText`.
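To make the TSV mapping concrete, here is a rough sketch of turning one TSV row into a BoundingBox. It assumes Tesseract's 12-column TSV layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text). `parseTSVRow` is a hypothetical helper written for illustration, the struct is a local copy, and the package's own parsing may differ.

```go
package main

import (
	"fmt"
	"image"
	"strconv"
	"strings"
)

// BoundingBox is a local copy of the documented struct.
type BoundingBox struct {
	Box                                image.Rectangle
	Word                               string
	Confidence                         float64
	BlockNum, ParNum, LineNum, WordNum int
}

// parseTSVRow is a hypothetical helper (not part of gosseract). It assumes
// the columns: level, page_num, block_num, par_num, line_num, word_num,
// left, top, width, height, conf, text.
func parseTSVRow(row string) (BoundingBox, error) {
	cols := strings.Split(row, "\t")
	if len(cols) != 12 {
		return BoundingBox{}, fmt.Errorf("expected 12 columns, got %d", len(cols))
	}
	// Columns 2 through 9 are integers: the four *_num fields, then the box.
	nums := make([]int, 8)
	for i := range nums {
		n, err := strconv.Atoi(cols[i+2])
		if err != nil {
			return BoundingBox{}, fmt.Errorf("column %d: %w", i+2, err)
		}
		nums[i] = n
	}
	conf, err := strconv.ParseFloat(cols[10], 64)
	if err != nil {
		return BoundingBox{}, fmt.Errorf("confidence: %w", err)
	}
	left, top, width, height := nums[4], nums[5], nums[6], nums[7]
	return BoundingBox{
		Box:        image.Rect(left, top, left+width, top+height),
		Word:       cols[11],
		Confidence: conf,
		BlockNum:   nums[0],
		ParNum:     nums[1],
		LineNum:    nums[2],
		WordNum:    nums[3],
	}, nil
}

func main() {
	// Hand-written sample row, not real OCR output.
	row := "5\t1\t1\t1\t1\t2\t70\t10\t60\t20\t95.5\tWorld!"
	box, err := parseTSVRow(row)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", box)
}
```

Note that TSV stores width and height, while image.Rectangle stores the max corner, hence the `left+width` and `top+height` conversion.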

func (*Client) HOCRText

func (client *Client) HOCRText() (out string, err error)

HOCRText initializes tesseract::TessBaseAPI, executes OCR and returns the result as hOCR text. See https://en.wikipedia.org/wiki/HOCR for more information about hOCR.

func (*Client) SetBlacklist

func (client *Client) SetBlacklist(blacklist string) error

SetBlacklist sets the blacklisted characters. See the official documentation for blacklists at https://tesseract-ocr.github.io/tessdoc/ImproveQuality#dictionaries-word-lists-and-patterns

func (*Client) SetConfigFile

func (client *Client) SetConfigFile(fpath string) error

SetConfigFile sets the file path to the config file.

func (*Client) SetImage

func (client *Client) SetImage(imagepath string) error

SetImage sets the path of the image file to run OCR on.

Example
client := NewClient()
defer client.Close()

client.SetImage("./test/data/001-helloworld.png")
// See "ExampleClient_Text" for more practical usecase ;)

func (*Client) SetImageFromBytes

func (client *Client) SetImageFromBytes(data []byte) error

SetImageFromBytes sets the image data to run OCR on.

func (*Client) SetLanguage

func (client *Client) SetLanguage(langs ...string) error

SetLanguage sets the languages to use. English is the default.

func (*Client) SetPageSegMode

func (client *Client) SetPageSegMode(mode PageSegMode) error

SetPageSegMode sets the "Page Segmentation Mode" (PSM) used to detect the layout of characters. See the official documentation for PSM at https://tesseract-ocr.github.io/tessdoc/ImproveQuality#page-segmentation-method and https://github.com/otiai10/gosseract/issues/52 for more information.

func (*Client) SetTessdataPrefix

func (client *Client) SetTessdataPrefix(prefix string) error

SetTessdataPrefix sets path to the models directory. Environment variable TESSDATA_PREFIX is used as default.

func (*Client) SetVariable

func (client *Client) SetVariable(key SettableVariable, value string) error

SetVariable sets a parameter, corresponding to tesseract::TessBaseAPI->SetVariable. See the official documentation at https://zdenop.github.io/tesseract-doc/classtesseract_1_1_tess_base_a_p_i.html#a2e09259c558c6d8e0f7e523cbaf5adf5. Because `api->SetVariable` must be called after `api->Init`, this method cannot detect an unknown variable key at the time it is called. Check `client.setVariablesToInitializedAPI` for more information.
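The deferred behaviour described above can be pictured with a small stand-alone sketch. Everything in it is hypothetical and only models the idea: `fakeAPI` stands in for an initialized TessBaseAPI (whose C++ SetVariable returns false when the name lookup fails), and `lazyClient` pools key/value pairs until `run` applies them. It is not the package's actual implementation.

```go
package main

import "fmt"

// fakeAPI is a hypothetical stand-in for an initialized TessBaseAPI.
type fakeAPI struct {
	known map[string]bool
}

// SetVariable reports whether the key was accepted, mimicking the bool
// result of the C++ call.
func (a *fakeAPI) SetVariable(key, value string) bool {
	return a.known[key]
}

// lazyClient is a hypothetical client that only records variables.
type lazyClient struct {
	variables map[string]string
}

// SetVariable stores the pair. There is no initialized API yet, so an
// unknown key cannot be rejected here.
func (c *lazyClient) SetVariable(key, value string) error {
	if c.variables == nil {
		c.variables = make(map[string]string)
	}
	c.variables[key] = value
	return nil
}

// run applies the pooled variables to an initialized API. This is the
// first point at which a bad key can surface as an error.
func (c *lazyClient) run(api *fakeAPI) error {
	for key, value := range c.variables {
		if !api.SetVariable(key, value) {
			return fmt.Errorf("failed to set variable %q to %q", key, value)
		}
	}
	return nil
}

func main() {
	client := &lazyClient{}
	fmt.Println("set:", client.SetVariable("no_such_variable", "1"))

	api := &fakeAPI{known: map[string]bool{"tessedit_char_whitelist": true}}
	fmt.Println("run:", client.run(api))
}
```

The practical consequence is that a typo in a variable key is reported when OCR runs, for example from Text, rather than from SetVariable itself.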

func (*Client) SetWhitelist

func (client *Client) SetWhitelist(whitelist string) error

SetWhitelist sets the whitelisted characters. See the official documentation for whitelists at https://tesseract-ocr.github.io/tessdoc/ImproveQuality#dictionaries-word-lists-and-patterns

Example
if os.Getenv("TESS_LSTM_DISABLED") == "1" {
	os.Exit(0)
}

client := NewClient()
defer client.Close()
client.SetImage("./test/data/002-confusing.png")

client.SetWhitelist("IO-")
text1, _ := client.Text()

client.SetWhitelist("10-")
text2, _ := client.Text()

fmt.Println(text1, text2)

func (*Client) Text

func (client *Client) Text() (out string, err error)

Text initializes tesseract::TessBaseAPI, executes OCR and returns the detected text as a string.

Example
client := NewClient()
defer client.Close()

client.SetImage("./test/data/001-helloworld.png")

text, err := client.Text()
fmt.Println(text, err)

func (*Client) Version

func (client *Client) Version() string

Version provides the version of Tesseract used by this client.

type Content

type Content struct {
	ID    string `xml:"id,attr"`
	Title string `xml:"title,attr"`
	Class string `xml:"class,attr"`
	Par   Par    `xml:"p"`
}

Content represents `<div class='ocr_carea' />`

type Line

type Line struct {
	ID    string `xml:"id,attr"`
	Title string `xml:"title,attr"`
	Class string `xml:"class,attr"`
	Words []Word `xml:"span"`
}

Line represents `<span class='ocr_line' />`

type OSDResult

type OSDResult struct {
	// OrientationDegree is the detected clockwise rotation of the input
	// image in degrees (0, 90, 180, 270).
	OrientationDegree int
	// OrientationConfidence is a confidence indicator for the orientation
	// for which "15.0 is reasonably confident".
	OrientationConfidence float64
	ScriptName            *string
	ScriptConfidence      float64
}
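As a rough illustration of consuming an OSDResult, the sketch below derives a corrective rotation and a printable script name. It takes the field comments at face value (OrientationDegree is the clockwise rotation of the input, and 15.0 is a reasonable confidence floor). `uprightRotation` and `scriptOrUnknown` are hypothetical helpers, and the struct is a local copy so the code runs without Tesseract.

```go
package main

import "fmt"

// OSDResult is a local copy of the documented struct.
type OSDResult struct {
	OrientationDegree     int
	OrientationConfidence float64
	ScriptName            *string
	ScriptConfidence      float64
}

// uprightRotation is a hypothetical helper. If the orientation confidence
// reaches minConfidence, it returns the clockwise rotation that would undo
// the detected one. Otherwise it returns false.
func uprightRotation(osd OSDResult, minConfidence float64) (int, bool) {
	if osd.OrientationConfidence < minConfidence {
		return 0, false
	}
	return (360 - osd.OrientationDegree) % 360, true
}

// scriptOrUnknown is a hypothetical helper that guards against the nil
// ScriptName pointer.
func scriptOrUnknown(osd OSDResult) string {
	if osd.ScriptName == nil {
		return "unknown"
	}
	return *osd.ScriptName
}

func main() {
	// Hand-written sample values, not real detection output.
	latin := "Latin"
	osd := OSDResult{
		OrientationDegree:     90,
		OrientationConfidence: 21.3,
		ScriptName:            &latin,
		ScriptConfidence:      8.1,
	}
	if deg, ok := uprightRotation(osd, 15.0); ok {
		fmt.Printf("rotate %d degrees clockwise, script %s\n", deg, scriptOrUnknown(osd))
	}
}
```

Because ScriptName is a pointer, checking it for nil before dereferencing is the safe default.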

func (OSDResult) String

func (osd OSDResult) String() string

type Page

type Page struct {
	ID      string  `xml:"id,attr"`
	Title   string  `xml:"title,attr"`
	Class   string  `xml:"class,attr"`
	Content Content `xml:"div"`
}

Page represents `<div class='ocr_page' />`

type PageIteratorLevel

type PageIteratorLevel int

PageIteratorLevel maps directly to Tesseract's enum tesseract::PageIteratorLevel, which represents the hierarchy of the page elements used in ResultIterator. See https://github.com/tesseract-ocr/tesseract/blob/a18620cfea33d03032b71fe1b9fc424777e34252/ccstruct/publictypes.h#L219-L225

const (
	// RIL_BLOCK - Block of text/image/separator line.
	RIL_BLOCK PageIteratorLevel = iota
	// RIL_PARA - Paragraph within a block.
	RIL_PARA
	// RIL_TEXTLINE - Line within a paragraph.
	RIL_TEXTLINE
	// RIL_WORD - Word within a textline.
	RIL_WORD
	// RIL_SYMBOL - Symbol/character within a word.
	RIL_SYMBOL
)

type PageSegMode

type PageSegMode int

PageSegMode represents tesseract::PageSegMode. See https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method and https://github.com/tesseract-ocr/tesseract/blob/a18620cfea33d03032b71fe1b9fc424777e34252/ccstruct/publictypes.h#L158-L183 for more information.

const (
	// PSM_OSD_ONLY - Orientation and script detection (OSD) only.
	PSM_OSD_ONLY PageSegMode = iota
	// PSM_AUTO_OSD - Automatic page segmentation with OSD.
	PSM_AUTO_OSD
	// PSM_AUTO_ONLY - Automatic page segmentation, but no OSD, or OCR.
	PSM_AUTO_ONLY
	// PSM_AUTO - (DEFAULT) Fully automatic page segmentation, but no OSD.
	PSM_AUTO
	// PSM_SINGLE_COLUMN - Assume a single column of text of variable sizes.
	PSM_SINGLE_COLUMN
	// PSM_SINGLE_BLOCK_VERT_TEXT - Assume a single uniform block of vertically aligned text.
	PSM_SINGLE_BLOCK_VERT_TEXT
	// PSM_SINGLE_BLOCK - Assume a single uniform block of text.
	PSM_SINGLE_BLOCK
	// PSM_SINGLE_LINE - Treat the image as a single text line.
	PSM_SINGLE_LINE
	// PSM_SINGLE_WORD - Treat the image as a single word.
	PSM_SINGLE_WORD
	// PSM_CIRCLE_WORD - Treat the image as a single word in a circle.
	PSM_CIRCLE_WORD
	// PSM_SINGLE_CHAR - Treat the image as a single character.
	PSM_SINGLE_CHAR
	// PSM_SPARSE_TEXT - Find as much text as possible in no particular order.
	PSM_SPARSE_TEXT
	// PSM_SPARSE_TEXT_OSD - Sparse text with orientation and script det.
	PSM_SPARSE_TEXT_OSD
	// PSM_RAW_LINE - Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
	PSM_RAW_LINE

	// PSM_COUNT - Just a number of enum entries. This is NOT a member of PSM ;)
	PSM_COUNT
)
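Since the constants are declared with iota, each mode's numeric value follows its position in the list, which lines up with the numbers accepted by the tesseract command line's `--psm` option. The sketch below redeclares the constants locally to show this. `isValid` is a hypothetical helper illustrating why PSM_COUNT is useful as an upper bound.

```go
package main

import "fmt"

// PageSegMode and its constants are local copies of the documented ones.
type PageSegMode int

const (
	PSM_OSD_ONLY PageSegMode = iota
	PSM_AUTO_OSD
	PSM_AUTO_ONLY
	PSM_AUTO
	PSM_SINGLE_COLUMN
	PSM_SINGLE_BLOCK_VERT_TEXT
	PSM_SINGLE_BLOCK
	PSM_SINGLE_LINE
	PSM_SINGLE_WORD
	PSM_CIRCLE_WORD
	PSM_SINGLE_CHAR
	PSM_SPARSE_TEXT
	PSM_SPARSE_TEXT_OSD
	PSM_RAW_LINE

	PSM_COUNT
)

// isValid is a hypothetical helper: a mode is usable only if it lies in
// the range [PSM_OSD_ONLY, PSM_COUNT).
func isValid(mode PageSegMode) bool {
	return mode >= PSM_OSD_ONLY && mode < PSM_COUNT
}

func main() {
	fmt.Println(int(PSM_AUTO), int(PSM_SINGLE_BLOCK), int(PSM_SINGLE_LINE)) // 3 6 7
	fmt.Println(isValid(PSM_RAW_LINE), isValid(PSM_COUNT))                  // true false
}
```

With the real package, a mode chosen this way would be passed to SetPageSegMode, for example PSM_SINGLE_LINE for an image that contains one line of text.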

type Par

type Par struct {
	ID       string `xml:"id,attr"`
	Title    string `xml:"title,attr"`
	Class    string `xml:"class,attr"`
	Language string `xml:"lang,attr"`
	Lines    []Line `xml:"span"`
}

Par represents `<p class='ocr_par' />`

type SettableVariable

type SettableVariable string

SettableVariable represents available strings for TessBaseAPI::SetVariable. See https://groups.google.com/forum/#!topic/tesseract-ocr/eHTBzrBiwvQ and https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/tesseractclass.h

const (
	// DEBUG_FILE - File to send output to.
	DEBUG_FILE SettableVariable = "debug_file"
	// TESSEDIT_CHAR_WHITELIST - Whitelist of chars to recognize
	// There is a known issue in 4.00 with LSTM
	// https://github.com/tesseract-ocr/tesseract/issues/751
	TESSEDIT_CHAR_WHITELIST SettableVariable = "tessedit_char_whitelist"
	// TESSEDIT_CHAR_BLACKLIST - Blacklist of chars not to recognize
	// There is a known issue in 4.00 with LSTM
	// https://github.com/tesseract-ocr/tesseract/issues/751
	TESSEDIT_CHAR_BLACKLIST SettableVariable = "tessedit_char_blacklist"
)

The following variables can be used with TessBaseAPI::SetVariable. If any are missing (I know there are many), please add them.

type Word

type Word struct {
	ID         string `xml:"id,attr"`
	Title      string `xml:"title,attr"`
	Class      string `xml:"class,attr"`
	Characters string `xml:",chardata"`
}

Word represents `<span class='ocr_word' />`
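The xml struct tags on Page, Content, Par, Line and Word suggest they are meant for decoding hOCR with encoding/xml. The sketch below shows how that could look, using local copies of the types and a small hand-written hOCR fragment. Real HOCRText output is wrapped in a full XHTML document and may need extra handling, and `parsePage` and `words` are hypothetical helpers.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Local copies of the documented hOCR types.
type Word struct {
	ID         string `xml:"id,attr"`
	Title      string `xml:"title,attr"`
	Class      string `xml:"class,attr"`
	Characters string `xml:",chardata"`
}

type Line struct {
	ID    string `xml:"id,attr"`
	Title string `xml:"title,attr"`
	Class string `xml:"class,attr"`
	Words []Word `xml:"span"`
}

type Par struct {
	ID       string `xml:"id,attr"`
	Title    string `xml:"title,attr"`
	Class    string `xml:"class,attr"`
	Language string `xml:"lang,attr"`
	Lines    []Line `xml:"span"`
}

type Content struct {
	ID    string `xml:"id,attr"`
	Title string `xml:"title,attr"`
	Class string `xml:"class,attr"`
	Par   Par    `xml:"p"`
}

type Page struct {
	ID      string  `xml:"id,attr"`
	Title   string  `xml:"title,attr"`
	Class   string  `xml:"class,attr"`
	Content Content `xml:"div"`
}

// sampleHOCR is a hand-written fragment, not real Tesseract output.
const sampleHOCR = `<div class='ocr_page' id='page_1' title='bbox 0 0 200 50'>
  <div class='ocr_carea' id='block_1_1' title='bbox 10 10 130 30'>
    <p class='ocr_par' id='par_1_1' lang='eng' title='bbox 10 10 130 30'>
      <span class='ocr_line' id='line_1_1' title='bbox 10 10 130 30'>
        <span class='ocrx_word' id='word_1_1' title='bbox 10 10 60 30'>Hello,</span>
        <span class='ocrx_word' id='word_1_2' title='bbox 70 10 130 30'>World!</span>
      </span>
    </p>
  </div>
</div>`

// parsePage is a hypothetical helper that decodes one ocr_page fragment.
func parsePage(fragment string) (Page, error) {
	var page Page
	err := xml.Unmarshal([]byte(fragment), &page)
	return page, err
}

// words is a hypothetical helper that flattens a page into its word texts.
func words(page Page) []string {
	var out []string
	for _, line := range page.Content.Par.Lines {
		for _, w := range line.Words {
			out = append(out, w.Characters)
		}
	}
	return out
}

func main() {
	page, err := parsePage(sampleHOCR)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(page.Content.Par.Language, words(page))
}
```

Page and Content each hold a single child rather than a slice, so with these types only one content area and one paragraph are captured per page.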
