extractor

package

v0.0.0-...-a2e00f7 Published: Nov 21, 2024 License: MIT Imports: 17 Imported by: 0

Repository

github.com/carmel/unipdf

README ¶

TEXT EXTRACTION CODE

There are two directions:

reading
depth

In English text,

the reading direction is left to right, increasing X in the PDF coordinate system.
the depth direction is top to bottom, decreasing Y in the PDF coordinate system.

HOW TEXT IS EXTRACTED

text_page.go makeTextPage() is the top level text extraction function. It returns an ordered list of textParas which are described below.

A page's textMarks are obtained from its content stream. They are in the order they occur in the content stream.
The textMarks are grouped into word fragments called textWords by scanning through the textMarks and splitting on space characters and the gaps between marks.
The textWords are grouped into rectangular regions based on their bounding boxes' proximities to other textWords. These rectangular regions are called textParas. (In the current implementation there is an intermediate step where the textWords are divided into containers called wordBags.)
The textWords in each textPara are arranged into textLines (textWords of similar depth).
Within each textLine, textWords are sorted in reading order and each one that starts a whole word is marked by setting its newWord flag to true. (See textLine.text().)
All the textParas on a page are checked to see if they are arranged as cells within a table and, if they are, they are combined into textTables and a textPara containing the textTable replaces the textParas containing the cells.
The textParas, some of which may be tables, are sorted into reading order (the order in which they are read, not in the reading direction).
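
The second step above (splitting textMarks into textWords on space characters and gaps) can be sketched roughly as follows. This is only an illustration: the `mark` type, the `groupWords` function and the fixed `maxGap` threshold are hypothetical names chosen for this sketch, not the package's internal types or thresholds.

```go
package main

import (
	"fmt"
	"strings"
)

// mark is a hypothetical stand-in for a text mark: a piece of text with the
// horizontal extent of its bounding box.
type mark struct {
	text   string
	x0, x1 float64 // left and right edges in the reading direction
}

// groupWords scans marks in content-stream order and starts a new word
// fragment whenever it sees a space mark or a horizontal gap wider than
// maxGap (an assumed, fixed threshold).
func groupWords(marks []mark, maxGap float64) []string {
	var words []string
	var cur strings.Builder
	prevEnd := 0.0
	flush := func() {
		if cur.Len() > 0 {
			words = append(words, cur.String())
			cur.Reset()
		}
	}
	for _, m := range marks {
		if strings.TrimSpace(m.text) == "" {
			flush() // a space character ends the current fragment
			continue
		}
		if cur.Len() > 0 && m.x0-prevEnd > maxGap {
			flush() // a wide gap also ends the current fragment
		}
		cur.WriteString(m.text)
		prevEnd = m.x1
	}
	flush()
	return words
}

func main() {
	marks := []mark{
		{"H", 0, 5}, {"i", 5, 7}, {" ", 7, 9},
		{"P", 9, 14}, {"D", 14, 19}, {"F", 30, 35},
	}
	// The gap before "F" is 11 units, wider than the threshold of 3.
	fmt.Println(groupWords(marks, 3))
}
```

Note that the result is a list of word fragments, not whole words: "PD" and "F" stay separate here, which is why a later step marks the fragments that start a whole word.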

The entire order of extracted text from a page is expressed in paraList.writeText().

This function iterates through the textParas, which are sorted in reading order.
For each textPara with a table, it iterates through the table cell textParas. (See textPara.writeCellText().)
For each (top level or table cell) textPara, it iterates through the textLines.
For each textLine, it iterates through the textWords inserting a space before each one that has the newWord flag set.
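
The nested iteration described above can be sketched as follows. The `word`, `line` and `para` types are simplified, hypothetical stand-ins for textWord, textLine and textPara, and two details are assumptions of this sketch rather than documented behavior: no space is written before the first word of a line, and each line ends with a newline.

```go
package main

import (
	"fmt"
	"strings"
)

// word, line and para are hypothetical, simplified stand-ins for textWord,
// textLine and textPara.
type word struct {
	text    string
	newWord bool // true if this fragment starts a whole word
}

type line []word

type para struct {
	lines []line
	cells []para // non-empty when the para holds a table
}

// writePara writes the table cells of p (if any) and then its lines to sb.
func writePara(sb *strings.Builder, p para) {
	for _, c := range p.cells {
		writePara(sb, c)
	}
	for _, l := range p.lines {
		for i, w := range l {
			// Assumption: skip the space before the first word of a line.
			if w.newWord && i > 0 {
				sb.WriteByte(' ')
			}
			sb.WriteString(w.text)
		}
		sb.WriteByte('\n') // assumption: one output line per textLine
	}
}

// writeText walks the paras, which are assumed to be in reading order already.
func writeText(paras []para) string {
	var sb strings.Builder
	for _, p := range paras {
		writePara(&sb, p)
	}
	return sb.String()
}

func main() {
	paras := []para{
		{lines: []line{{{"ex", true}, {"tract", false}, {"text", true}}}},
	}
	fmt.Print(writeText(paras))
}
```

The fragments "ex" and "tract" are joined without a space because only "ex" and "text" carry the newWord flag.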

`textWord` creation

makeTextWords() combines textMarks into textWords, word fragments.
textWords are the atoms of the text extraction code.

`textPara` creation

dividePage() combines textWords that are close to each other into groups in rectangular regions called wordBags.
wordBag.arrangeText() arranges the textWords in each rectangular region into textLines, which are groups of textWords of about the same depth, sorted left to right.
textLine.markWordBoundaries() marks the textWords in each textLine that start whole words.
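
As a rough illustration of the line-arranging step, the sketch below groups words of about the same depth and sorts each group left to right. The `word` type, the `arrangeLines` function and the `tol` depth tolerance are hypothetical; the package's own grouping criteria are not described in this document.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// word is a hypothetical stand-in for a textWord.
type word struct {
	text  string
	x     float64 // left edge; increases in the reading direction
	depth float64 // distance from the top of the page; increases downwards
}

// arrangeLines groups words whose depth is within tol of the first word of a
// line, then sorts the words of each line left to right.
func arrangeLines(words []word, tol float64) [][]word {
	sorted := append([]word(nil), words...) // leave the caller's slice untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].depth < sorted[j].depth
	})
	var lines [][]word
	for _, w := range sorted {
		n := len(lines)
		if n > 0 && math.Abs(w.depth-lines[n-1][0].depth) <= tol {
			lines[n-1] = append(lines[n-1], w)
		} else {
			lines = append(lines, []word{w})
		}
	}
	for _, l := range lines {
		sort.Slice(l, func(i, j int) bool { return l[i].x < l[j].x })
	}
	return lines
}

func main() {
	words := []word{
		{"world", 50, 10.2},
		{"Hello", 0, 10},
		{"next", 0, 25},
	}
	for _, l := range arrangeLines(words, 1) {
		for _, w := range l {
			fmt.Print(w.text, " ")
		}
		fmt.Println()
	}
}
```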

TODO

Handle diagonal text.
Get R to L text extraction working.
Get top to bottom text extraction working.

Documentation ¶

Overview ¶

Package extractor is used for quickly extracting PDF content through a simple interface. Currently offers functionality for extracting textual content.

Index ¶

Constants
type Extractor
- func New(page *model.PdfPage) (*Extractor, error)
- func NewFromContents(contents string, resources *model.PdfPageResources) (*Extractor, error)
type ImageExtractOptions
type ImageMark
type PageImages
type PageText
type RenderMode
type TableCell
type TextMark
- func (tm TextMark) String() string
type TextMarkArray
type TextTable

Constants ¶

const TOL = 1.0e-6

TOL is the tolerance for coordinates to be considered equal. It is big enough to cover all rounding errors and small enough that TOL point differences on a page aren't visible.
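
A tolerance like this is typically used in place of exact floating-point equality. The sketch below shows one way that might look; `approxEqual` is a hypothetical helper for illustration, not a function exported by the package, and the constant is redeclared locally so the example compiles on its own.

```go
package main

import (
	"fmt"
	"math"
)

// TOL mirrors the package constant documented above.
const TOL = 1.0e-6

// approxEqual is a hypothetical helper that treats two coordinates as equal
// when they differ by no more than TOL.
func approxEqual(a, b float64) bool {
	return math.Abs(a-b) <= TOL
}

func main() {
	// Variables are used so the arithmetic happens in float64 at run time,
	// not as exact constant arithmetic at compile time.
	a, b := 0.1, 0.2
	fmt.Println(a+b == 0.3)            // exact comparison
	fmt.Println(approxEqual(a+b, 0.3)) // comparison within TOL
}
```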

Types ¶

type Extractor ¶

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor stores and offers functionality for extracting content from PDF pages.

func New ¶

func New(page *model.PdfPage) (*Extractor, error)

New returns an Extractor instance for extracting content from the input PDF page.

func NewFromContents ¶

func NewFromContents(contents string, resources *model.PdfPageResources) (*Extractor, error)

NewFromContents creates a new extractor from contents and page resources.

func (*Extractor) ExtractPageImages ¶

func (e *Extractor) ExtractPageImages(options *ImageExtractOptions) (*PageImages, error)

ExtractPageImages returns the image contents of the page extractor, including the data, position and size of each image. A set of options to control page image extraction can be passed in. The options parameter can be nil for the default options. By default, inline stencil masks are not extracted.

func (*Extractor) ExtractPageText ¶

func (e *Extractor) ExtractPageText() (*PageText, int, int, error)

ExtractPageText returns the text contents of `e` (an Extractor for a page) as a PageText.

TODO(peterwilliams97): The stats complicate this function signature and aren't very useful. Replace with a function like Extract() (*PageText, error)

func (*Extractor) ExtractText ¶

func (e *Extractor) ExtractText() (string, error)

ExtractText processes and extracts all text data in content streams and returns it as a string. It takes into account character encodings in the PDF file, which are decoded by CharcodeBytesToUnicode. Characters that can't be decoded are replaced with MissingCodeRune ('\ufffd' = �).

func (*Extractor) ExtractTextWithStats ¶

func (e *Extractor) ExtractTextWithStats() (extracted string, numChars int, numMisses int, err error)

ExtractTextWithStats works like ExtractText but returns the number of characters in the output (`numChars`) and the number of characters that were not decoded (`numMisses`).
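
Because undecodable characters appear in the output as '\ufffd', a similar pair of counts can be estimated from an already extracted string. The sketch below is a hypothetical post-hoc check, not how ExtractTextWithStats computes its statistics; `countMisses` and the local `missingCodeRune` constant are names introduced here for illustration.

```go
package main

import "fmt"

// missingCodeRune has the value the text above gives for MissingCodeRune.
const missingCodeRune = '\ufffd'

// countMisses counts the runes in text and how many of them are the
// replacement rune. Ranging over a string also yields '\ufffd' for invalid
// UTF-8 bytes, so those are counted as misses too.
func countMisses(text string) (numChars, numMisses int) {
	for _, r := range text {
		numChars++
		if r == missingCodeRune {
			numMisses++
		}
	}
	return numChars, numMisses
}

func main() {
	numChars, numMisses := countMisses("ab\ufffdc")
	fmt.Printf("%d of %d characters were not decoded\n", numMisses, numChars)
}
```

A high ratio of misses to characters is a hint that the page uses a font encoding that could not be mapped to Unicode.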

type ImageExtractOptions ¶

type ImageExtractOptions struct {
	IncludeInlineStencilMasks bool
}

ImageExtractOptions contains options for controlling image extraction from PDF pages.

type ImageMark ¶

type ImageMark struct {
	Image *model.Image

	// Dimensions of the image as displayed in the PDF.
	Width  float64
	Height float64

	// Position of the image in PDF coordinates (lower left corner).
	X float64
	Y float64

	// Angle in degrees, if rotated.
	Angle float64
}

ImageMark represents an image drawn on a page and its position. All coordinates are in device coordinates.

type PageImages ¶

type PageImages struct {
	Images []ImageMark
}

PageImages represents extracted images on a PDF page with spatial information: display position and size.

type PageText ¶

type PageText struct {
	// contains filtered or unexported fields
}

PageText represents the layout of text on a device page.

func (PageText) Marks ¶

func (pt PageText) Marks() *TextMarkArray

Marks returns the TextMark collection for a page. It represents all the text on the page.

func (PageText) String ¶

func (pt PageText) String() string

String returns a string describing `pt`.

func (PageText) Tables ¶

func (pt PageText) Tables() []TextTable

Tables returns the tables extracted from the page.

func (PageText) Text ¶

func (pt PageText) Text() string

Text returns the extracted page text.

func (PageText) ToText ¶

func (pt PageText) ToText() string

ToText returns the page text as a single string.

Deprecated: This function is deprecated and will be removed in a future major version. Please use Text() instead.

type RenderMode ¶

type RenderMode int

RenderMode specifies the text rendering mode (Tmode), which determines whether showing text shall cause glyph outlines to be stroked, filled, used as a clipping boundary, or some combination of the three. Stroking, filling, and clipping shall have the same effects for a text object as they do for a path object (see 8.5.3, "Path-Painting Operators" and 8.5.4, "Clipping Path Operators").

const (
	RenderModeStroke RenderMode = 1 << iota // Stroke
	RenderModeFill                          // Fill
	RenderModeClip                          // Clip
)

Render mode type.
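
Since the constants are declared with `1 << iota`, each one occupies its own bit and the values can be combined and tested with bitwise operators. The sketch below redeclares the type locally so it compiles on its own; the `has` helper is hypothetical and not part of the package.

```go
package main

import "fmt"

// RenderMode and its constants are copied from the declarations above.
type RenderMode int

const (
	RenderModeStroke RenderMode = 1 << iota // Stroke
	RenderModeFill                          // Fill
	RenderModeClip                          // Clip
)

// has is a hypothetical helper that reports whether every bit of flag is set
// in mode.
func has(mode, flag RenderMode) bool {
	return mode&flag == flag
}

func main() {
	mode := RenderModeFill | RenderModeStroke // fill, then stroke
	fmt.Println(has(mode, RenderModeFill), has(mode, RenderModeClip))
}
```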

type TableCell ¶

type TableCell struct {
	// Text is the extracted text.
	Text string
	// Marks is the TextMarks corresponding to the text in Text.
	Marks TextMarkArray
}

TableCell is a cell in a TextTable.

type TextMark ¶

type TextMark struct {
	// Text is the extracted text.
	Text string
	// Original is the text in the PDF. It has not been decoded like `Text`.
	Original string
	// BBox is the bounding box of the text.
	BBox model.PdfRectangle
	// Font is the font the text was drawn with.
	Font *model.PdfFont
	// FontSize is the font size the text was drawn with.
	FontSize float64
	// Offset is the offset of the start of TextMark.Text in the extracted text. If you do this
	//   text, textMarks := pageText.Text(), pageText.Marks()
	//   marks := textMarks.Elements()
	// then marks[i].Offset is the offset of marks[i].Text in text.
	Offset int
	// Meta is set true for spaces and line breaks that we insert in the extracted text. We insert
	// spaces (line breaks) when we see characters that are over a threshold horizontal (vertical)
	// distance apart. See wordJoiner (lineJoiner) in PageText.computeViews().
	Meta bool
	// FillColor is the fill color of the text.
	// The color is nil for spaces and line breaks (i.e. the Meta field is true).
	FillColor color.Color
	// StrokeColor is the stroke color of the text.
	// The color is nil for spaces and line breaks (i.e. the Meta field is true).
	StrokeColor color.Color
}

TextMark represents extracted text on a page with information regarding textual content, formatting (font and size) and positioning. It is the smallest unit of text on a PDF page, typically a single character.

getBBox() in test_text.go shows how to compute bounding boxes of substrings of extracted text. The following code extracts the text on PDF page `page` into `text` then finds the bounding box `bbox` of substring `term` in `text`.

ex, err := New(page)
// handle errors
pageText, _, _, err := ex.ExtractPageText()
// handle errors
text := pageText.Text()
textMarks := pageText.Marks()

start := strings.Index(text, term)
// handle start == -1 (term not found in text)
end := start + len(term)
spanMarks, err := textMarks.RangeOffset(start, end)
// handle errors
bbox, ok := spanMarks.BBox()
// handle !ok

func (TextMark) String ¶

func (tm TextMark) String() string

String returns a string describing `tm`.

type TextMarkArray ¶

type TextMarkArray struct {
	// contains filtered or unexported fields
}

TextMarkArray is a collection of TextMarks.

func (*TextMarkArray) Append ¶

func (ma *TextMarkArray) Append(mark TextMark)

Append appends `mark` to the mark array.

func (*TextMarkArray) BBox ¶

func (ma *TextMarkArray) BBox() (model.PdfRectangle, bool)

BBox returns the smallest axis-aligned rectangle that encloses all the TextMarks in `ma`.

func (*TextMarkArray) Elements ¶

func (ma *TextMarkArray) Elements() []TextMark

Elements returns the TextMarks in `ma`.

func (*TextMarkArray) Len ¶

func (ma *TextMarkArray) Len() int

Len returns the number of TextMarks in `ma`.

func (*TextMarkArray) RangeOffset ¶

func (ma *TextMarkArray) RangeOffset(start, end int) (*TextMarkArray, error)

RangeOffset returns the TextMarks in `ma` that overlap text[start:end] in the extracted text. These are the marks tm for which `start` <= tm.Offset + len(tm.Text) && tm.Offset < `end`, where `start` and `end` are offsets in the extracted text. NOTE: TextMarks can contain multiple characters, e.g. "ffi" for the ﬃ ligature, so the first and last elements of the returned TextMarkArray may only partially overlap text[start:end].
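
The selection condition can be illustrated with a small stand-alone sketch. The `mark` type and the `rangeOffset` function below are hypothetical and use a plain linear scan; they apply the condition exactly as it is written above (so a mark that ends exactly at `start` also passes) and say nothing about how the package implements the lookup.

```go
package main

import "fmt"

// mark is a hypothetical, trimmed stand-in for TextMark.
type mark struct {
	Text   string
	Offset int // byte offset of Text in the extracted text
}

// rangeOffset returns the marks that satisfy the documented condition
// start <= tm.Offset+len(tm.Text) && tm.Offset < end.
func rangeOffset(marks []mark, start, end int) []mark {
	var out []mark
	for _, tm := range marks {
		if start <= tm.Offset+len(tm.Text) && tm.Offset < end {
			out = append(out, tm)
		}
	}
	return out
}

func main() {
	// The extracted text is "affix"; the ligature is a single mark "ffi".
	marks := []mark{{"a", 0}, {"ffi", 1}, {"x", 4}}
	// Asking for text[2:4] ("fi") returns the whole "ffi" mark, which only
	// partially overlaps the requested range.
	for _, tm := range rangeOffset(marks, 2, 4) {
		fmt.Printf("%q at offset %d\n", tm.Text, tm.Offset)
	}
}
```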

func (TextMarkArray) String ¶

func (ma TextMarkArray) String() string

String returns a string describing `ma`.

type TextTable ¶

type TextTable struct {
	W, H  int
	Cells [][]TableCell
}

TextTable represents a table. Cells are ordered top-to-bottom, left-to-right. Cells[y] is the (0-offset) y'th row in the table. Cells[y][x] is the cell in the (0-offset) x'th column of that row.
