pdf

package module
v1.0.0 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 5, 2022 License: BSD-3-Clause Imports: 17 Imported by: 0

README

PDF Reader

PDF Reader is a simple Go library for reading PDF files. It extracts text either as plain text or as formatted text that carries font and position information. The library originated as https://github.com/rsc/pdf and was then forked and improved as https://github.com/ledongthuc/pdf. Cloudresty has forked ledongthuc's library with the aim of maintaining and improving it further.

  Features

  • Get plain text content (without format)
  • Get content (including all font and formatting information)

 

Install Read PDF Go library

go get -u github.com/cloudresty/pdf

 

Read PDF Go Library Examples

Go Read PDF - Plain Text
package main

import (
	"bytes"
	"fmt"

	"github.com/cloudresty/pdf"
)

//
// Main Function
//

func main() {

	pdf.DebugOn = true

	//
	// Read PDF File
	//

	pdfText, errReadPDF := readPDF("file.pdf")

	if errReadPDF != nil {
		panic(errReadPDF)
	}

	fmt.Println(pdfText)
	
}

//
// Read PDF Function
//

func readPDF(path string) (string, error) {

	//
	// Open PDF File
	//

	f, r, errPDFOpen := pdf.Open(path)

	if errPDFOpen != nil {
		return "", errPDFOpen
	}

	// Defer the close only after Open has succeeded.
	defer f.Close()

	//
	// Extract Plain Text
	//

	var buf bytes.Buffer

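	b, errGetPlainText := r.GetPlainText()

	if errGetPlainText != nil {
		return "", errGetPlainText
	}

	// Copy the extracted text into the buffer and report read failures.
	if _, errRead := buf.ReadFrom(b); errRead != nil {
		return "", errRead
	}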

	return buf.String(), nil

}
Go Read PDF - Formatted Text
//
// Read PDF Function (standalone)
//

func readPDF(path string) (string, error) {

	//
	// Open PDF File
	//

	f, r, errPDFOpen := pdf.Open(path)

	if errPDFOpen != nil {
		return "", errPDFOpen
	}

	// Defer the close only after Open has succeeded.
	defer f.Close()

	//
	// Page Count
	//
	
	totalPage := r.NumPage()

	//
	// Loop Through Pages
	//

	for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {

		p := r.Page(pageIndex)

		if p.V.IsNull() {
			continue
		}

		//
		// Extract Formatted Text
		//

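		var lastTextStyle pdf.Text
		texts := p.Content().Text

		for _, text := range texts {

			// isSameSentence is a caller-supplied helper (not part of this
			// package) that reports whether text continues the current run.
			if isSameSentence(text, lastTextStyle) {

				lastTextStyle.S = lastTextStyle.S + text.S

			} else {

				// Print the finished run, skipping the empty initial value.
				if lastTextStyle.S != "" {
					fmt.Printf("Font: %s, Font-size: %f, x: %f, y: %f, content: %s \n", lastTextStyle.Font, lastTextStyle.FontSize, lastTextStyle.X, lastTextStyle.Y, lastTextStyle.S)
				}
				lastTextStyle = text

			}

		}

		// Print the last run on the page, which the loop leaves pending.
		if lastTextStyle.S != "" {
			fmt.Printf("Font: %s, Font-size: %f, x: %f, y: %f, content: %s \n", lastTextStyle.Font, lastTextStyle.FontSize, lastTextStyle.X, lastTextStyle.Y, lastTextStyle.S)
		}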

	}

	return "", nil

}
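The formatted-text example calls isSameSentence, which this package does not provide. The following is a minimal sketch of one possible helper, assuming that pieces of text sharing a font, a font size, and a baseline belong to the same run. The local Text type only mirrors the fields of pdf.Text so that the sketch runs without the dependency; in real code use pdf.Text directly.

```go
package main

import "fmt"

// Text mirrors the fields of pdf.Text so this sketch compiles on its own.
type Text struct {
	Font     string
	FontSize float64
	X, Y, W  float64
	S        string
}

// isSameSentence is a hypothetical helper, not part of the package. It treats
// two pieces of text as one run when they share font, size, and baseline.
func isSameSentence(current, last Text) bool {
	return current.Font == last.Font &&
		current.FontSize == last.FontSize &&
		current.Y == last.Y
}

func main() {
	a := Text{Font: "Helvetica", FontSize: 12, X: 72, Y: 700, S: "Hel"}
	b := Text{Font: "Helvetica", FontSize: 12, X: 90, Y: 700, S: "lo"}
	c := Text{Font: "Helvetica", FontSize: 12, X: 72, Y: 686, S: "next line"}

	fmt.Println(isSameSentence(b, a)) // same font, size, and baseline
	fmt.Println(isSameSentence(c, b)) // different baseline
}
```

Exact comparison of Y is a simplification. Documents that place glyphs with small vertical offsets may need a tolerance instead.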
Go Read PDF - Text Grouped by Rows
package main

import (
	"fmt"
	"os"

	"github.com/cloudresty/pdf"
)

//
// Main Function
//

func main() {

	//
	// Read Local PDF File
	//

	content, errReadPDF := readPDF(os.Args[1])

	if errReadPDF != nil {
		panic(errReadPDF)
	}

	fmt.Println(content)

}

//
// Read PDF Function
//

func readPDF(path string) (string, error) {

	//
	// Open PDF File
	//

	f, r, errReadPDF := pdf.Open(path)

	if errReadPDF != nil {
		return "", errReadPDF
	}

	// Defer the close only after Open has succeeded.
	defer func() {
		_ = f.Close()
	}()

	//
	// Page Count
	//

	totalPage := r.NumPage()

	//
	// Loop Through Pages
	//

	for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {

		p := r.Page(pageIndex)

		if p.V.IsNull() {
			continue
		}

		//
		// Get Rows
		//

		rows, errGetRows := p.GetTextByRow()

		if errGetRows != nil {
			return "", errGetRows
		}

		for _, row := range rows {

			fmt.Println("Row:", row.Position)

			for _, word := range row.Content {

				fmt.Println(word.S)

			}

		}

	}

	return "", nil
}

Documentation

Overview

Package pdf implements reading of PDF files.

PDF is Adobe's Portable Document Format, ubiquitous on the internet. A PDF document is a complex data format built on a fairly simple structure. This package exposes the simple structure along with some wrappers to extract basic information. If more complex information is needed, it is possible to extract that information by interpreting the structure exposed by this package.

Specifically, a PDF is a data structure built from Values, each of which has one of the following Kinds:

Null, for the null object.
Integer, for an integer.
Real, for a floating-point number.
Bool, for a boolean value.
Name, for a name constant (as in /Helvetica).
String, for a string constant.
Dict, for a dictionary of name-value pairs.
Array, for an array of values.
Stream, for an opaque data stream and associated header dictionary.

The accessors on Value—Int64, Float64, Bool, Name, and so on—return a view of the data as the given type. When there is no appropriate view, the accessor returns a zero result. For example, the Name accessor returns the empty string if called on a Value v for which v.Kind() != Name. Returning zero values this way, especially from the Dict and Array accessors, which themselves return Values, makes it possible to traverse a PDF quickly without writing any error checking. On the other hand, it means that mistakes can go unreported.
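To illustrate why zero-value accessors allow traversal without error checks, here is a minimal sketch built on a hypothetical, much-reduced stand-in for Value. The value type below is an assumption made for illustration and models only dictionaries and integers. With the real package the equivalent chain would start from r.Trailer() and use the Key and Int64 accessors documented below.

```go
package main

import "fmt"

// value is a hypothetical stand-in for pdf.Value, not the package's type.
type value struct {
	dict  map[string]value
	n     int64
	isInt bool
}

// Key returns the entry for name, or the zero value (a null) when v is not a
// dictionary or the key is absent. Indexing a nil map is safe in Go.
func (v value) Key(name string) value { return v.dict[name] }

// Int64 returns the integer, or 0 when v is not an integer.
func (v value) Int64() int64 {
	if !v.isInt {
		return 0
	}
	return v.n
}

// IsNull reports whether v is the zero value.
func (v value) IsNull() bool { return v.dict == nil && !v.isInt }

func main() {
	trailer := value{dict: map[string]value{
		"Root": {dict: map[string]value{
			"Pages": {dict: map[string]value{
				"Count": {n: 3, isInt: true},
			}},
		}},
	}}

	// Every step returns a usable value, so the chain needs no checks.
	fmt.Println(trailer.Key("Root").Key("Pages").Key("Count").Int64()) // 3

	// A wrong key silently yields null and then 0: the mistake goes unreported.
	missing := trailer.Key("Root").Key("Outlines").Key("Count")
	fmt.Println(missing.Int64(), missing.IsNull()) // 0 true
}
```

The second lookup shows the trade-off named above: a misspelled or absent key is indistinguishable from a stored zero unless the caller checks IsNull.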

The basic structure of the PDF file is exposed as the graph of Values.

Most richer data structures in a PDF file are dictionaries with specific interpretations of the name-value pairs. The Font and Page wrappers make the interpretation of a specific Value as the corresponding type easier. They are only helpers, though: they are implemented only in terms of the Value API and could be moved outside the package. Equally important, traversal of other PDF data structures can be implemented in other packages as needed.

Index

Constants

This section is empty.

Variables

var DebugOn = false

DebugOn enables logging of debug messages to stdout. Set it to true if problems arise during reading.

var ErrInvalidPassword = fmt.Errorf("encrypted PDF: invalid password")

Functions

func Interpret

func Interpret(strm Value, do func(stk *Stack, op string))

Interpret interprets the content in a stream as a basic PostScript program, pushing values onto a stack and then calling the do function to execute operators. The do function may push or pop values from the stack as needed to implement op.

Interpret handles the operators "dict", "currentdict", "begin", "end", "def", and "pop" itself.

Interpret is not a full-blown PostScript interpreter. Its job is to handle the very limited PostScript found in certain supporting file formats embedded in PDF files, such as cmap files that describe the mapping from font code points to Unicode code points.

There is no support for executable blocks, among other limitations.

Types

type Column

type Column struct {
	Position int64
	Content  TextVertical
}

Column represents the contents of a column.

type Columns

type Columns []*Column

Columns is a list of columns.

type Content

type Content struct {
	Text []Text
	Rect []Rect
}

Content describes the basic content on a page: the text and any drawn rectangles.

type Font

type Font struct {
	V Value
	// contains filtered or unexported fields
}

A Font represents a font in a PDF file. The methods interpret a Font dictionary stored in V.

func (Font) BaseFont

func (f Font) BaseFont() string

BaseFont returns the font's name (BaseFont property).

func (Font) Encoder

func (f Font) Encoder() TextEncoding

Encoder returns the encoding between font code point sequences and UTF-8.

func (Font) FirstChar

func (f Font) FirstChar() int

FirstChar returns the code point of the first character in the font.

func (Font) LastChar

func (f Font) LastChar() int

LastChar returns the code point of the last character in the font.

func (Font) Width

func (f Font) Width(code int) float64

Width returns the width of the given code point.

func (Font) Widths

func (f Font) Widths() []float64

Widths returns the widths of the glyphs in the font. In a well-formed PDF, len(f.Widths()) == f.LastChar()+1 - f.FirstChar().
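The relationship between FirstChar, LastChar, and Widths can be sketched with a small lookup. widthOf is a hypothetical helper written for illustration, not the package's implementation of Width, and the font data is made up. It assumes the width table is indexed by code minus FirstChar and that codes outside the table have width 0.

```go
package main

import "fmt"

// widthOf is a hypothetical helper. It returns the width for code from a
// table whose first entry belongs to firstChar, or 0 when code is out of range.
func widthOf(firstChar int, widths []float64, code int) float64 {
	i := code - firstChar
	if i < 0 || i >= len(widths) {
		return 0
	}
	return widths[i]
}

func main() {
	// A made-up font covering codes 65..67 ('A'..'C').
	firstChar, lastChar := 65, 67
	widths := []float64{667, 667, 722}

	// The well-formedness condition stated above.
	fmt.Println(len(widths) == lastChar+1-firstChar)

	fmt.Println(widthOf(firstChar, widths, 67)) // width of code 67
	fmt.Println(widthOf(firstChar, widths, 90)) // outside the table
}
```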

type Outline

type Outline struct {
	Title string    // title for this element
	Child []Outline // child elements
}

An Outline is a tree describing the outline (also known as the table of contents) of a document.
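Because an Outline is a plain recursive struct, a table of contents can be produced with a depth-first walk. The sketch below uses a local copy of the struct so that it runs without the dependency; flatten is a hypothetical helper, and it assumes the root carries no title of its own, so only the root's children are emitted.

```go
package main

import (
	"fmt"
	"strings"
)

// Outline mirrors pdf.Outline so this sketch compiles on its own.
type Outline struct {
	Title string
	Child []Outline
}

// flatten is a hypothetical helper. It walks the tree depth-first and returns
// one line per entry, indented by two spaces per nesting level.
func flatten(root Outline) []string {
	var lines []string
	var walk func(o Outline, depth int)
	walk = func(o Outline, depth int) {
		lines = append(lines, strings.Repeat("  ", depth)+o.Title)
		for _, c := range o.Child {
			walk(c, depth+1)
		}
	}
	for _, c := range root.Child {
		walk(c, 0)
	}
	return lines
}

func main() {
	root := Outline{Child: []Outline{
		{Title: "Introduction"},
		{Title: "Methods", Child: []Outline{{Title: "Setup"}}},
	}}
	for _, line := range flatten(root) {
		fmt.Println(line)
	}
}
```

With the real package, the root would come from the Reader's Outline method.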

type Page

type Page struct {
	V Value
}

A Page represents a single page in a PDF file. The methods interpret a Page dictionary stored in V.

func (Page) Content

func (p Page) Content() Content

Content returns the page's content.

func (Page) Font

func (p Page) Font(name string) Font

Font returns the font with the given name associated with the page.

func (Page) Fonts

func (p Page) Fonts() []string

Fonts returns a list of the fonts associated with the page.

func (Page) GetPlainText

func (p Page) GetPlainText(fonts map[string]*Font) (result string, err error)

GetPlainText returns all of the page's text without formatting. fonts can be passed in to improve parsing performance, or left nil.

func (Page) GetTextByColumn

func (p Page) GetTextByColumn() (Columns, error)

GetTextByColumn returns all of the page's text grouped by column.

func (Page) GetTextByRow

func (p Page) GetTextByRow() (Rows, error)

GetTextByRow returns all of the page's text grouped by row.

func (Page) Resources

func (p Page) Resources() Value

Resources returns the resources dictionary associated with the page.

type Point

type Point struct {
	X float64
	Y float64
}

A Point represents an X, Y pair.

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

A Reader is a single PDF file open for reading.

func NewReader

func NewReader(f io.ReaderAt, size int64) (*Reader, error)

NewReader opens a file for reading, using the data in f with the given total size.

func NewReaderEncrypted

func NewReaderEncrypted(f io.ReaderAt, size int64, pw func() string) (*Reader, error)

NewReaderEncrypted opens a file for reading, using the data in f with the given total size. If the PDF is encrypted, NewReaderEncrypted calls pw repeatedly to obtain passwords to try. If pw returns the empty string, NewReaderEncrypted stops trying to decrypt the file and returns an error.
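The pw callback contract (called repeatedly, with the empty string meaning "give up") maps naturally onto a closure over a candidate list. passwordSource below is a hypothetical helper written for illustration; only the callback's type, func() string, comes from the signature above.

```go
package main

import "fmt"

// passwordSource is a hypothetical helper. It returns a callback that yields
// each candidate once, in order, and then "" to signal that none are left.
func passwordSource(candidates []string) func() string {
	i := 0
	return func() string {
		if i >= len(candidates) {
			return ""
		}
		pw := candidates[i]
		i++
		return pw
	}
}

func main() {
	pw := passwordSource([]string{"secret", "hunter2"})

	// With the real package the callback would be passed along, for example:
	//   r, err := pdf.NewReaderEncrypted(f, size, pw)
	fmt.Printf("%q\n", pw())
	fmt.Printf("%q\n", pw())
	fmt.Printf("%q\n", pw()) // candidates exhausted
}
```

Note that an empty candidate in the list would end the search early, since the empty string is the stop signal.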

func Open

func Open(file string) (*os.File, *Reader, error)

Open opens a file for reading.

func (*Reader) GetPlainText

func (r *Reader) GetPlainText() (reader io.Reader, err error)

GetPlainText returns all the text in the PDF file.

func (*Reader) NumPage

func (r *Reader) NumPage() int

NumPage returns the number of pages in the PDF file.

func (*Reader) Outline

func (r *Reader) Outline() Outline

Outline returns the document outline. The Outline returned is the root of the outline tree and typically has no Title itself. That is, the children of the returned root are the top-level entries in the outline.

func (*Reader) Page

func (r *Reader) Page(num int) Page

Page returns the page for the given page number. Page numbers are indexed starting at 1, not 0. If the page is not found, Page returns a Page with p.V.IsNull().

func (*Reader) Trailer

func (r *Reader) Trailer() Value

Trailer returns the file's Trailer value.

type Rect

type Rect struct {
	Min, Max Point
}

A Rect represents a rectangle.

type Row

type Row struct {
	Position int64
	Content  TextHorizontal
}

Row represents the contents of a row.

type Rows

type Rows []*Row

Rows is a list of rows.

type Stack

type Stack struct {
	// contains filtered or unexported fields
}

A Stack represents a stack of values.

func (*Stack) Len

func (stk *Stack) Len() int

func (*Stack) Pop

func (stk *Stack) Pop() Value

func (*Stack) Push

func (stk *Stack) Push(v Value)

type Text

type Text struct {
	Font     string  // the font used
	FontSize float64 // the font size, in points (1/72 of an inch)
	X        float64 // the X coordinate, in points, increasing left to right
	Y        float64 // the Y coordinate, in points, increasing bottom to top
	W        float64 // the width of the text, in points
	S        string  // the actual UTF-8 text
}

A Text represents a single piece of text drawn on a page.

type TextEncoding

type TextEncoding interface {
	// Decode returns the UTF-8 text corresponding to
	// the sequence of code points in raw.
	Decode(raw string) (text string)
}

A TextEncoding represents a mapping between font code points and UTF-8 text.

type TextHorizontal

type TextHorizontal []Text

TextHorizontal implements sort.Interface for sorting a slice of Text values in horizontal order, left to right, and then top to bottom within a column.

func (TextHorizontal) Len

func (x TextHorizontal) Len() int

func (TextHorizontal) Less

func (x TextHorizontal) Less(i, j int) bool

func (TextHorizontal) Swap

func (x TextHorizontal) Swap(i, j int)

type TextVertical

type TextVertical []Text

TextVertical implements sort.Interface for sorting a slice of Text values in vertical order, top to bottom, and then left to right within a line.
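The ordering described for TextVertical can be sketched with sort.SliceStable on a local mirror of the Text fields involved. This is an illustration of the stated ordering under one assumption, not the package's Less implementation: since Y increases from bottom to top, "top to bottom" is taken to mean descending Y, with ascending X breaking ties.

```go
package main

import (
	"fmt"
	"sort"
)

// Text mirrors the pdf.Text fields needed here so the sketch runs on its own.
type Text struct {
	X, Y float64
	S    string
}

// sortTopToBottom is a hypothetical helper. It orders texts by descending Y
// (top of the page first), then by ascending X within the same Y.
func sortTopToBottom(texts []Text) {
	sort.SliceStable(texts, func(i, j int) bool {
		if texts[i].Y != texts[j].Y {
			return texts[i].Y > texts[j].Y
		}
		return texts[i].X < texts[j].X
	})
}

func main() {
	texts := []Text{
		{X: 100, Y: 700, S: "world"},
		{X: 72, Y: 686, S: "second line"},
		{X: 72, Y: 700, S: "hello"},
	}
	sortTopToBottom(texts)
	for _, t := range texts {
		fmt.Println(t.S)
	}
}
```

With the real package, sort.Sort(pdf.TextVertical(texts)) would apply the package's own ordering to a []pdf.Text.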

func (TextVertical) Len

func (x TextVertical) Len() int

func (TextVertical) Less

func (x TextVertical) Less(i, j int) bool

func (TextVertical) Swap

func (x TextVertical) Swap(i, j int)

type Value

type Value struct {
	// contains filtered or unexported fields
}

A Value is a single PDF value, such as an integer, dictionary, or array. The zero Value is a PDF null (Kind() == Null, IsNull() == true).

func (Value) Bool

func (v Value) Bool() bool

Bool returns v's boolean value. If v.Kind() != Bool, Bool returns false.

func (Value) Float64

func (v Value) Float64() float64

Float64 returns v's float64 value, converting from integer if necessary. If v.Kind() != Real and v.Kind() != Integer, Float64 returns 0.

func (Value) Index

func (v Value) Index(i int) Value

Index returns the i'th element in the array v. If v.Kind() != Array or if i is outside the array bounds, Index returns a null Value.

func (Value) Int64

func (v Value) Int64() int64

Int64 returns v's int64 value. If v.Kind() != Integer, Int64 returns 0.

func (Value) IsNull

func (v Value) IsNull() bool

IsNull reports whether the value is a null. It is equivalent to Kind() == Null.

func (Value) Key

func (v Value) Key(key string) Value

Key returns the value associated with the given name key in the dictionary v. Like the result of the Name method, the key should not include a leading slash. If v is a stream, Key applies to the stream's header dictionary. If v.Kind() != Dict and v.Kind() != Stream, Key returns a null Value.

func (Value) Keys

func (v Value) Keys() []string

Keys returns a sorted list of the keys in the dictionary v. If v is a stream, Keys applies to the stream's header dictionary. If v.Kind() != Dict and v.Kind() != Stream, Keys returns nil.

func (Value) Kind

func (v Value) Kind() ValueKind

Kind reports the kind of value underlying v.

func (Value) Len

func (v Value) Len() int

Len returns the length of the array v. If v.Kind() != Array, Len returns 0.

func (Value) Name

func (v Value) Name() string

Name returns v's name value. If v.Kind() != Name, Name returns the empty string. The returned name does not include the leading slash: if v corresponds to the name written using the syntax /Helvetica, Name() == "Helvetica".

func (Value) RawString

func (v Value) RawString() string

RawString returns v's string value. If v.Kind() != String, RawString returns the empty string.

func (Value) Reader

func (v Value) Reader() io.ReadCloser

Reader returns the data contained in the stream v. If v.Kind() != Stream, Reader returns a ReadCloser that responds to all reads with a “stream not present” error.

func (Value) String

func (v Value) String() string

String returns a textual representation of the value v. Note that String is not the accessor for values with Kind() == String. To access such values, see RawString, Text, and TextFromUTF16.

func (Value) Text

func (v Value) Text() string

Text returns v's string value interpreted as a “text string” (defined in the PDF spec) and converted to UTF-8. If v.Kind() != String, Text returns the empty string.

func (Value) TextFromUTF16

func (v Value) TextFromUTF16() string

TextFromUTF16 returns v's string value interpreted as big-endian UTF-16 and then converted to UTF-8. If v.Kind() != String or if the data is not valid UTF-16, TextFromUTF16 returns the empty string.
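The conversion described here can be sketched with the standard unicode/utf16 package. decodeUTF16BE is a hypothetical helper, not the package's implementation. It rejects only odd-length input; unlike the behavior documented above, it does not return the empty string for other invalid sequences, because utf16.Decode substitutes U+FFFD for unpaired surrogates.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// decodeUTF16BE is a hypothetical helper. It interprets raw as big-endian
// UTF-16 code units and returns the UTF-8 equivalent, or "" for odd lengths.
func decodeUTF16BE(raw string) string {
	if len(raw)%2 != 0 {
		return ""
	}
	units := make([]uint16, 0, len(raw)/2)
	for i := 0; i < len(raw); i += 2 {
		// High byte first: that is what big-endian means.
		units = append(units, uint16(raw[i])<<8|uint16(raw[i+1]))
	}
	return string(utf16.Decode(units))
}

func main() {
	fmt.Printf("%q\n", decodeUTF16BE("\x00H\x00i")) // two code units
	fmt.Printf("%q\n", decodeUTF16BE("\x00H\x00")) // odd length is rejected
}
```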

type ValueKind

type ValueKind int

A ValueKind specifies the kind of data underlying a Value.

const (
	Null ValueKind = iota
	Bool
	Integer
	Real
	String
	Name
	Dict
	Array
	Stream
)

The PDF value kinds.

Notes

Bugs

  • The package is incomplete, although it has been used successfully on some large real-world PDF files.

  • There is no support for closing open PDF files. If you drop all references to a Reader, the underlying reader will eventually be garbage collected.

  • The library makes no attempt at efficiency. A value cache maintained in the Reader would probably help significantly.

  • The support for reading encrypted files is weak.

  • The Value API does not support error reporting. The intent is to allow users to set an error reporting callback in Reader, but that code has not been implemented.

Directories

Path Synopsis
Pdfpasswd searches for the password for an encrypted PDF by trying all strings over a given alphabet up to a given length.
