Documentation
Types
type Doc
type Doc struct {
	Page Page `xml:"page"`
}
Doc is the outermost element in the pdftotext HTML output. For our purposes it should always contain exactly one page.
type Flow
type Flow struct {
	Blocks []Block `xml:"block"`
}
Flow is just a container of blocks. In theory the blocks are grouped in a meaningful way, though this isn't always the case with the PDFs we see.
type Page
type Page struct {
	Flows  []Flow  `xml:"flow"`
	Width  float64 `xml:"width,attr"`
	Height float64 `xml:"height,attr"`
}
Page holds the outer <page> wrapper around all the <flow> elements.
type Rect
type Rect struct {
	XMin float64 `xml:"xMin,attr"`
	YMin float64 `xml:"yMin,attr"`
	XMax float64 `xml:"xMax,attr"`
	YMax float64 `xml:"yMax,attr"`
}
Rect holds the bounding-box attributes (xMin, yMin, xMax, yMax). It is a common structure embedded in most pdftotext elements.
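As a rough illustration of how these struct tags line up with the XML, the sketch below declares local copies of the documented types and unmarshals a small fragment with encoding/xml. The sample input is invented for the example, and Block is a hypothetical stand-in (its real definition is not shown on this page) that is assumed to embed Rect.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Rect mirrors the documented bounding-box attributes.
type Rect struct {
	XMin float64 `xml:"xMin,attr"`
	YMin float64 `xml:"yMin,attr"`
	XMax float64 `xml:"xMax,attr"`
	YMax float64 `xml:"yMax,attr"`
}

// Block is a hypothetical stand-in: the real Block type is not shown in
// these docs. It is assumed here to embed Rect and nothing else.
type Block struct {
	Rect
}

// Flow, Page and Doc are local copies of the documented types.
type Flow struct {
	Blocks []Block `xml:"block"`
}

type Page struct {
	Flows  []Flow  `xml:"flow"`
	Width  float64 `xml:"width,attr"`
	Height float64 `xml:"height,attr"`
}

type Doc struct {
	Page Page `xml:"page"`
}

// sample is invented input, not captured pdftotext output.
const sample = `<doc>
  <page width="612.000000" height="792.000000">
    <flow>
      <block xMin="72.000000" yMin="90.500000" xMax="300.000000" yMax="120.500000"/>
    </flow>
  </page>
</doc>`

// parseDoc decodes the XML fragment into a Doc.
func parseDoc(data []byte) (Doc, error) {
	var d Doc
	err := xml.Unmarshal(data, &d)
	return d, err
}

func main() {
	d, err := parseDoc([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	b := d.Page.Flows[0].Blocks[0]
	fmt.Printf("page %.0fx%.0f, block width %.1f, height %.1f\n",
		d.Page.Width, d.Page.Height, b.XMax-b.XMin, b.YMax-b.YMin)
}
```

Because Rect is embedded, encoding/xml treats its attribute fields as if they belonged to the outer struct, so the block's coordinates are reachable directly as b.XMin and so on.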
type Transformer
type Transformer struct {
	PDFFilename        string
	ALTOOutputFilename string
	ScaleFactor        float64
	ImageNumber        int

	// Logger can be set up manually for customized logging, otherwise it just
	// gets set to the default logger
	Logger *logger.Logger
	// contains filtered or unexported fields
}
Transformer holds the data needed to convert a PDF into ALTO-compatible XML. Processing halts at the first error.
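The page does not show how Transformer halts at the first error. One common Go idiom for this is a "sticky error": the struct remembers the first failure, and every later step becomes a no-op. The sketch below illustrates that idiom only. The pipeline type, the step names, and the convert helper are all hypothetical and are not taken from this package.

```go
package main

import (
	"errors"
	"fmt"
)

// pipeline is a hypothetical multi-step converter that remembers its
// first error and skips all later steps once one has occurred.
type pipeline struct {
	err   error
	steps []string // names of the steps that actually ran
}

// run executes fn unless an earlier step has already failed.
func (p *pipeline) run(name string, fn func() error) {
	if p.err != nil {
		return // an earlier step failed; do nothing
	}
	p.steps = append(p.steps, name)
	if err := fn(); err != nil {
		p.err = fmt.Errorf("%s: %w", name, err)
	}
}

// convert runs three illustrative steps and simulates a failure in the
// step named failAt (pass "" for a clean run).
func convert(failAt string) ([]string, error) {
	p := &pipeline{}
	step := func(name string) {
		p.run(name, func() error {
			if name == failAt {
				return errors.New("simulated failure")
			}
			return nil
		})
	}
	step("extract")
	step("clean")
	step("write")
	return p.steps, p.err
}

func main() {
	steps, err := convert("clean")
	fmt.Println("steps run:", steps)
	fmt.Println("error:", err)
}
```

With this shape the caller checks a single error at the end, and a failure in an early step guarantees that the final write step never runs.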
func New
func New(pdfFile, altoFile string, pdfDPI int, imgNo int) *Transformer
New sets up a new Transformer to convert a PDF to ALTO XML.
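New takes a DPI value, while Transformer exposes a ScaleFactor field. The page does not say how the two relate. Since the default PDF user-space unit is 1/72 inch, one plausible reading is that coordinates in points are scaled by DPI/72 to get pixel positions. The sketch below works through that arithmetic under this assumption. The scaleFactor and toPixels helpers are hypothetical and are not part of this package.

```go
package main

import (
	"fmt"
	"math"
)

// pointsPerInch is the default PDF user-space unit: 72 points per inch.
const pointsPerInch = 72.0

// scaleFactor is an ASSUMED relationship between DPI and scale factor;
// the package docs do not confirm how ScaleFactor is actually derived.
func scaleFactor(dpi int) float64 {
	return float64(dpi) / pointsPerInch
}

// toPixels converts a coordinate in PDF points to the nearest pixel at
// the given DPI.
func toPixels(points float64, dpi int) int {
	return int(math.Round(points * float64(dpi) / pointsPerInch))
}

func main() {
	// A US Letter page is 612 x 792 points.
	fmt.Println(scaleFactor(300))
	fmt.Println(toPixels(612, 300), toPixels(792, 300))
}
```

Under this assumption a 612 x 792 point page rendered at 300 DPI maps to 2550 x 3300 pixels.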
func (*Transformer) Transform
func (t *Transformer) Transform() error
Transform runs the PDF file through pdftotext, strips extraneous data from the generated HTML file, and writes an ALTO-like XML file to ALTOOutputFilename. If it returns a non-nil error, the ALTO XML file will not have been created.