alto

package

Version: v2.5.4+incompatible (latest) | Published: Sep 7, 2018 | License: Apache-2.0 | Imports: 11 | Imported by: 0


Repository

github.com/uoregon-libraries/newspaper-curation-app


Documentation ¶

Index ¶

type Block
type Doc
type Flow
type Line
type Page
type Rect
- func (r Rect) Height() float64
- func (r Rect) Width() float64
type Transformer
- func New(pdfFile, altoFile string, pdfDPI int, imgNo int) *Transformer
- func (t *Transformer) Transform() error
type Word

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Block ¶

type Block struct {
	Rect
	Lines []Line `xml:"line"`
}

A Block contains lines and a rectangle around them

type Doc ¶

type Doc struct {
	Page Page `xml:"page"`
}

Doc is the outermost element in the pdftotext HTML output; for our purposes it should always contain exactly one page

type Flow ¶

type Flow struct {
	Blocks []Block `xml:"block"`
}

Flow is a container of blocks that are, in theory, grouped in a meaningful way (though the PDFs we see don't always bear this out)

type Line ¶

type Line struct {
	Rect
	Words []Word `xml:"word"`
}

A Line contains the individual word elements, and a rectangle

type Page ¶

type Page struct {
	Flows  []Flow  `xml:"flow"`
	Width  float64 `xml:"width,attr"`
	Height float64 `xml:"height,attr"`
}

Page holds the outer <page> wrapper around all the <flow> elements

type Rect ¶

type Rect struct {
	XMin float64 `xml:"xMin,attr"`
	YMin float64 `xml:"yMin,attr"`
	XMax float64 `xml:"xMax,attr"`
	YMax float64 `xml:"yMax,attr"`
}

Rect is a common structure embedded in most pdftotext elements

func (Rect) Height ¶

func (r Rect) Height() float64

Height returns the difference between the rectangle's Y coordinates (YMax and YMin)

func (Rect) Width ¶

func (r Rect) Width() float64

Width returns the difference between the rectangle's X coordinates (XMax and XMin)
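
As a rough illustration of the two accessors, the sketch below redeclares Rect locally (with field names and tags copied from the listing above) and assumes that Height and Width are computed as max minus min. The method bodies are not shown in this documentation, so treat them as an assumption rather than the package's actual code.

```go
package main

import "fmt"

// Rect is a local copy of the documented struct, redeclared so the
// example compiles on its own.
type Rect struct {
	XMin float64 `xml:"xMin,attr"`
	YMin float64 `xml:"yMin,attr"`
	XMax float64 `xml:"xMax,attr"`
	YMax float64 `xml:"yMax,attr"`
}

// Height is assumed here to be YMax - YMin.
func (r Rect) Height() float64 { return r.YMax - r.YMin }

// Width is assumed here to be XMax - XMin.
func (r Rect) Width() float64 { return r.XMax - r.XMin }

func main() {
	r := Rect{XMin: 10, YMin: 20, XMax: 110, YMax: 45}
	fmt.Printf("width=%.1f height=%.1f\n", r.Width(), r.Height())
}
```

Because both methods use value receivers, they can be called on any type that embeds Rect, such as Block, Line, or Word.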

type Transformer ¶

type Transformer struct {
	PDFFilename        string
	ALTOOutputFilename string
	ScaleFactor        float64
	ImageNumber        int

	// Logger can be set up manually for customized logging, otherwise it just
	// gets set to the default logger
	Logger *logger.Logger
	// contains filtered or unexported fields
}

Transformer holds onto various data needed to convert a PDF into ALTO-compatible XML, halting the process at the first error

func New ¶

func New(pdfFile, altoFile string, pdfDPI int, imgNo int) *Transformer

New sets up a new transformer to convert a PDF to ALTO XML
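
This documentation does not say how the pdfDPI argument relates to the ScaleFactor field. One plausible reading, offered purely as an assumption, is that ScaleFactor converts PDF points (72 per inch) into pixels at the given DPI. The helper names scaleFactor and toPixels below are hypothetical and are not part of the package.

```go
package main

import (
	"fmt"
	"math"
)

// pointsPerInch is the default PDF user-space unit: 1/72 of an inch.
const pointsPerInch = 72.0

// scaleFactor is a hypothetical helper: it assumes the scale is simply
// the target DPI divided by 72 points per inch.
func scaleFactor(dpi int) float64 {
	return float64(dpi) / pointsPerInch
}

// toPixels converts a coordinate in points to the nearest whole pixel
// at the given DPI, under the same assumption.
func toPixels(points float64, dpi int) int {
	return int(math.Round(points * scaleFactor(dpi)))
}

func main() {
	fmt.Println(scaleFactor(144))   // scale for a 144 DPI image
	fmt.Println(toPixels(36, 300)) // half an inch at 300 DPI
}
```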

func (*Transformer) Transform ¶

func (t *Transformer) Transform() error

Transform runs the PDF file through pdftotext, strips extraneous data from the generated HTML file, and finally writes an ALTO-like XML file to ALTOOutputFilename. If the returned error is non-nil, the ALTO XML file will not have been created.

type Word ¶

type Word struct {
	Rect
	Text string `xml:",chardata"`
}

A Word is the most granular element we get, containing a rectangle around the text and the text itself
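
To show how the struct tags above fit together, the following sketch redeclares the documented types locally and decodes a small hand-written fragment with encoding/xml. The sample markup is made up for illustration and only imitates the doc/page/flow/block/line/word nesting implied by the tags; the decode helper is hypothetical, and this is not how the package itself reads pdftotext output.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Local copies of the documented types, with tags taken from the listing.
type Rect struct {
	XMin float64 `xml:"xMin,attr"`
	YMin float64 `xml:"yMin,attr"`
	XMax float64 `xml:"xMax,attr"`
	YMax float64 `xml:"yMax,attr"`
}

type Word struct {
	Rect
	Text string `xml:",chardata"`
}

type Line struct {
	Rect
	Words []Word `xml:"word"`
}

type Block struct {
	Rect
	Lines []Line `xml:"line"`
}

type Flow struct {
	Blocks []Block `xml:"block"`
}

type Page struct {
	Flows  []Flow  `xml:"flow"`
	Width  float64 `xml:"width,attr"`
	Height float64 `xml:"height,attr"`
}

type Doc struct {
	Page Page `xml:"page"`
}

// sample is an invented fragment, not real pdftotext output.
const sample = `<doc>
  <page width="612" height="792">
    <flow>
      <block xMin="72" yMin="90" xMax="200" yMax="104">
        <line xMin="72" yMin="90" xMax="200" yMax="104">
          <word xMin="72" yMin="90" xMax="120" yMax="104">Daily</word>
          <word xMin="126" yMin="90" xMax="200" yMax="104">Gazette</word>
        </line>
      </block>
    </flow>
  </page>
</doc>`

// decode is a hypothetical helper that unmarshals a fragment into a Doc.
func decode(s string) (Doc, error) {
	var d Doc
	err := xml.Unmarshal([]byte(s), &d)
	return d, err
}

func main() {
	d, err := decode(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	// Walk the hierarchy down to the individual words.
	for _, f := range d.Page.Flows {
		for _, b := range f.Blocks {
			for _, l := range b.Lines {
				for _, w := range l.Words {
					fmt.Printf("%q starts at x=%.0f\n", w.Text, w.XMin)
				}
			}
		}
	}
}
```

The embedded Rect is what lets each element pick up its xMin, yMin, xMax, and yMax attributes without repeating the fields on every type.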
