hocr

package
v0.1.4 Latest
Warning

This package is not in the latest version of its module.

Published: Aug 22, 2023 | License: GPL-3.0 | Imports: 14 | Imported by: 5

Documentation

Overview

Package hocr contains structures and functions for parsing and analysing hOCR files.

Constants

This section is empty.

Variables

This section is empty.

Functions

func BoxCoords

func BoxCoords(s string) ([4]int, error)

BoxCoords parses a bbox coordinate string and returns its four integer coordinates.
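
As a rough illustration of what parsing a bbox string involves, the sketch below reads a `bbox x0 y0 x1 y1` property out of a hOCR title string. The helper name `parseBBox` and the input format it accepts are assumptions made for this example; the exact string BoxCoords expects, and how it is implemented, are not shown in this documentation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBBox is a hypothetical helper, not the package's BoxCoords.
// It looks for a "bbox x0 y0 x1 y1" property in a hOCR title string,
// where properties are separated by semicolons.
func parseBBox(title string) ([4]int, error) {
	for _, prop := range strings.Split(title, ";") {
		fields := strings.Fields(prop)
		if len(fields) != 5 || fields[0] != "bbox" {
			continue
		}
		var coords [4]int
		for i, f := range fields[1:] {
			n, err := strconv.Atoi(f)
			if err != nil {
				return [4]int{}, fmt.Errorf("bad bbox coordinate %q: %w", f, err)
			}
			coords[i] = n
		}
		return coords, nil
	}
	return [4]int{}, fmt.Errorf("no bbox found in %q", title)
}

func main() {
	c, err := parseBBox("bbox 36 92 580 122; x_wconf 93")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c) // prints [36 92 580 122]
}
```

Returning a fixed-size `[4]int`, as the BoxCoords signature does, lets callers compare or index coordinates without length checks.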

func GetAvgConf

func GetAvgConf(hocrfn string) (float64, error)

GetAvgConf calculates the average confidence of a hOCR file from the confidence embedded in each word.
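
The sketch below shows one way such an average could be computed, assuming each word carries its confidence as an `x_wconf` property in its title attribute, as the hOCR format allows. The helpers `wordConf` and `avgConf` are hypothetical and work on title strings rather than on a file, so this is not the package's implementation of GetAvgConf.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// wordConf is a hypothetical helper: it reads the x_wconf property
// from a word's title string, e.g. "bbox 36 92 96 116; x_wconf 93".
func wordConf(title string) (float64, bool) {
	for _, prop := range strings.Split(title, ";") {
		fields := strings.Fields(prop)
		if len(fields) == 2 && fields[0] == "x_wconf" {
			if v, err := strconv.ParseFloat(fields[1], 64); err == nil {
				return v, true
			}
		}
	}
	return 0, false
}

// avgConf averages the confidences found in the given word titles,
// skipping any title that carries no usable x_wconf value.
func avgConf(titles []string) (float64, error) {
	var sum float64
	var n int
	for _, t := range titles {
		if c, ok := wordConf(t); ok {
			sum += c
			n++
		}
	}
	if n == 0 {
		return 0, fmt.Errorf("no word confidences found")
	}
	return sum / float64(n), nil
}

func main() {
	titles := []string{
		"bbox 36 92 96 116; x_wconf 90",
		"bbox 109 92 179 116; x_wconf 80",
	}
	avg, err := avgConf(titles)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(avg) // prints 85
}
```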

func GetLineBasics

func GetLineBasics(hocrfn string) (line.Details, error)

GetLineBasics parses a hOCR file and returns a corresponding line.Details, without any image extracts.

func GetLineDetails

func GetLineDetails(hocrfn string) (line.Details, error)

GetLineDetails parses a hOCR file and returns a corresponding line.Details, including image extracts for each line.

func GetLineDetailsCustomImg added in v0.1.4

func GetLineDetailsCustomImg(hocrfn string, imgfn string) (line.Details, error)

GetLineDetailsCustomImg is a variant of GetLineDetails that uses the provided image path for line image extracts, rather than the image name embedded in the hOCR file.
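
For context, hOCR pages usually record their source image as an `image "..."` property in the page element's title attribute. The sketch below extracts that name from a title string; `imageName` is a hypothetical helper written for this example and is not part of the package, whose own lookup logic is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// imageName is a hypothetical helper: it pulls the quoted file name out of
// the `image "..."` property of an ocr_page title string.
// Limitation: a file name containing a semicolon would be split incorrectly.
func imageName(title string) (string, bool) {
	for _, prop := range strings.Split(title, ";") {
		prop = strings.TrimSpace(prop)
		if !strings.HasPrefix(prop, "image ") {
			continue
		}
		rest := strings.TrimSpace(strings.TrimPrefix(prop, "image "))
		name := strings.Trim(rest, `"'`)
		if name != "" {
			return name, true
		}
	}
	return "", false
}

func main() {
	title := `image "page1.png"; bbox 0 0 640 480; ppageno 0`
	if name, ok := imageName(title); ok {
		fmt.Println(name) // prints page1.png
	}
}
```

When the embedded name is missing, relative to another directory, or points at a differently processed image, supplying the path explicitly through GetLineDetailsCustomImg avoids depending on it.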

func GetText

func GetText(hocrfn string) (string, error)

GetText parses a hOCR file and extracts the text from it.

func GetWordConfs

func GetWordConfs(hocrfn string) ([]float64, error)

GetWordConfs is a utility function that parses a hOCR file and returns a slice containing the confidence of each word in it.

func LineText

func LineText(l OcrLine) string

LineText extracts the text from an OcrLine.
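
To illustrate how a line's text relates to its Words and Text fields, here is one plausible approach: join the words' text with spaces, and fall back to the line's own character data when no word has text. The function `joinLineText` and the trimmed-down local types are assumptions for this sketch; the package's LineText may apply different rules.

```go
package main

import (
	"fmt"
	"strings"
)

// Trimmed local copies of the package's types, keeping only the fields
// this sketch needs, so that the example stands alone.
type OcrWord struct {
	Text string
}

type OcrLine struct {
	Words []OcrWord
	Text  string
}

// joinLineText is a hypothetical stand-in for LineText. It prefers the
// text of the line's words and falls back to the line's own chardata.
func joinLineText(l OcrLine) string {
	var words []string
	for _, w := range l.Words {
		if t := strings.TrimSpace(w.Text); t != "" {
			words = append(words, t)
		}
	}
	if len(words) > 0 {
		return strings.Join(words, " ")
	}
	return strings.TrimSpace(l.Text)
}

func main() {
	l := OcrLine{Words: []OcrWord{{Text: "Hello"}, {Text: "world"}}}
	fmt.Println(joinLineText(l)) // prints Hello world
}
```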

Types

type Hocr

type Hocr struct {
	Pages []Page `xml:"body>div"`
}

func Parse

func Parse(b []byte) (Hocr, error)

Parse parses the contents of a hOCR file, passed as a byte slice, into a Hocr structure.
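
The struct tags on Hocr, Page, OcrLine and OcrWord describe how hOCR markup maps onto these types. The sketch below copies those types locally (OcrChar is omitted for brevity) and decodes a small hand-written sample with the standard library's encoding/xml. It is meant only to show what the tags select; `parseSample` is a hypothetical function, and Parse itself may do more than a plain unmarshal.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Local copies of the documented types, so the example is self-contained.
type Hocr struct {
	Pages []Page `xml:"body>div"`
}

type Page struct {
	Lines []OcrLine `xml:"div>p>span"`
	Title string    `xml:"title,attr"`
}

type OcrLine struct {
	Class string    `xml:"class,attr"`
	Id    string    `xml:"id,attr"`
	Title string    `xml:"title,attr"`
	Words []OcrWord `xml:"span"`
	Text  string    `xml:",chardata"`
}

type OcrWord struct {
	Class string `xml:"class,attr"`
	Id    string `xml:"id,attr"`
	Title string `xml:"title,attr"`
	Text  string `xml:",chardata"`
}

// sample is a minimal, hand-written hOCR document (an assumption for
// this example, not output from a real OCR run).
const sample = `<html>
 <body>
  <div class="ocr_page" title='image "page1.png"; bbox 0 0 640 480'>
   <div class="ocr_carea">
    <p class="ocr_par">
     <span class="ocr_line" id="line_1" title="bbox 36 92 580 122">
      <span class="ocrx_word" id="word_1" title="bbox 36 92 96 116; x_wconf 90">Hello</span>
      <span class="ocrx_word" id="word_2" title="bbox 109 92 179 116; x_wconf 80">world</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>`

// parseSample is a hypothetical stand-in for Parse: a plain xml.Unmarshal.
func parseSample(b []byte) (Hocr, error) {
	var h Hocr
	err := xml.Unmarshal(b, &h)
	return h, err
}

func main() {
	h, err := parseSample([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range h.Pages {
		for _, l := range p.Lines {
			for _, w := range l.Words {
				fmt.Println(l.Id, w.Id, w.Text)
			}
		}
	}
}
```

The `body>div` path selects each page element, and `div>p>span` walks from a page through its content area and paragraph down to the line spans, which is why Page.Lines is flat rather than nested by paragraph.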

type OcrChar

type OcrChar struct {
	Class string    `xml:"class,attr"`
	Id    string    `xml:"id,attr"`
	Title string    `xml:"title,attr"`
	Chars []OcrChar `xml:"span"`
	Text  string    `xml:",chardata"`
}

type OcrLine

type OcrLine struct {
	Class string    `xml:"class,attr"`
	Id    string    `xml:"id,attr"`
	Title string    `xml:"title,attr"`
	Words []OcrWord `xml:"span"`
	Text  string    `xml:",chardata"`
}

type OcrWord

type OcrWord struct {
	Class string    `xml:"class,attr"`
	Id    string    `xml:"id,attr"`
	Title string    `xml:"title,attr"`
	Chars []OcrChar `xml:"span"`
	Text  string    `xml:",chardata"`
}

type Page added in v0.1.4

type Page struct {
	Lines []OcrLine `xml:"div>p>span"`
	Title string    `xml:"title,attr"`
}
