alto

package
v2.5.4+incompatible
Published: Sep 7, 2018 License: Apache-2.0 Imports: 11 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Block

type Block struct {
	Rect
	Lines []Line `xml:"line"`
}

A Block contains lines and the rectangle around them

type Doc

type Doc struct {
	Page Page `xml:"page"`
}

Doc is the outermost element in the pdftotext HTML output; for our purposes it should always contain exactly one page

type Flow

type Flow struct {
	Blocks []Block `xml:"block"`
}

Flow is just a container of blocks, theoretically grouped in a meaningful way (though this isn't always the case with PDFs that we see)

type Line

type Line struct {
	Rect
	Words []Word `xml:"word"`
}

A Line contains the individual word elements and the rectangle around them

type Page

type Page struct {
	Flows  []Flow  `xml:"flow"`
	Width  float64 `xml:"width,attr"`
	Height float64 `xml:"height,attr"`
}

Page holds the outer <page> wrapper around all the <flow> elements

type Rect

type Rect struct {
	XMin float64 `xml:"xMin,attr"`
	YMin float64 `xml:"yMin,attr"`
	XMax float64 `xml:"xMax,attr"`
	YMax float64 `xml:"yMax,attr"`
}

Rect is a common structure embedded in most pdftotext elements

func (Rect) Height

func (r Rect) Height() float64

Height returns the Y difference

func (Rect) Width

func (r Rect) Width() float64

Width returns the X difference

type Transformer

type Transformer struct {
	PDFFilename        string
	ALTOOutputFilename string
	ScaleFactor        float64
	ImageNumber        int

	// Logger can be set up manually for customized logging, otherwise it just
	// gets set to the default logger
	Logger *logger.Logger
	// contains filtered or unexported fields
}

Transformer holds the data needed to convert a PDF into ALTO-compatible XML. The conversion halts at the first error.

func New

func New(pdfFile, altoFile string, pdfDPI int, imgNo int) *Transformer

New sets up a new transformer to convert a PDF to ALTO XML

func (*Transformer) Transform

func (t *Transformer) Transform() error

Transform takes the PDF file and runs it through pdftotext, then strips extraneous data from the generated HTML file, and finally writes an ALTO-like XML file to ALTOOutputFilename. If the returned error is non-nil, the ALTO XML will not have been created.

type Word

type Word struct {
	Rect
	Text string `xml:",chardata"`
}

A Word is the most granular element we get, containing a rectangle around the text and the text itself
