pdfium

package module

v0.3.0 (latest) Published: Jan 29, 2022 License: MIT Imports: 2 Imported by: 7

Repository

github.com/klippa-app/go-pdfium

README ¶

go-pdfium

🚀 Easy PDF rendering and text extraction using Go and pdfium 🚀

A fast, multi-threaded and easy to use PDF renderer / text extractor for Go applications.

Features

Choice between single-threaded and multi-threaded operation
Get page count
Get plain text of a page
Get structured text of a page (text, angle, position, size, font information)
Render 1 or multiple pages into a Go image.Image using either DPI or pixel size
Render the image above directly as a jpeg or png into a file path or byte array
Get page size in either points or pixel size (when rendered in a specific DPI)
High test coverage ⭐

pdfium

This project uses the pdfium C++ library by Google (https://pdfium.googlesource.com/pdfium/) to process the PDF documents.

Single/Multi-threading

Since pdfium is not a multithreaded C++ library, we cannot make it multi-threaded just by calling it from multiple goroutines.

This library allows you to call pdfium in a single or multi-threaded way.

We have implemented multi-threading using HashiCorp's Go Plugin System, which allows us to launch separate pdfium worker processes and route requests through the different workers. This also makes pdfium somewhat safer to use, as a segfault or memory corruption in a worker is less likely to take down your main Go application. The plugin system handles communication between the processes using gRPC, but you won't see any of that when using this library: from the outside it looks like normal Go code.

Single-threading works by directly calling the pdfium library from the same process. Single-threaded might be preferred if the caller is managing the workers themselves and does not want the overhead of another process. Be aware that since pdfium is C++, we can't handle segfaults caused by pdfium, which may cause your process to be killed.

Be aware that pdfium can use a lot of memory depending on the size of the PDF and the requests you make, so choose the number of workers you configure with care.
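The idea of routing requests through a fixed set of workers, each of which is only ever used by one caller at a time, can be sketched with a buffered channel. This is a minimal in-process illustration, not the library's implementation (which uses separate processes and gRPC); the worker and pool types and the runJobs helper are hypothetical names invented for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// worker stands in for one pdfium worker. Like pdfium itself, it is NOT
// safe for concurrent use. (Hypothetical type for illustration only.)
type worker struct {
	id    int
	calls int // unsynchronized on purpose; the pool guarantees exclusive access
}

// pool hands out workers one at a time through a buffered channel.
type pool struct {
	idle chan *worker
}

// newPool creates a pool with size idle workers.
func newPool(size int) *pool {
	p := &pool{idle: make(chan *worker, size)}
	for i := 0; i < size; i++ {
		p.idle <- &worker{id: i}
	}
	return p
}

// do claims a worker, runs fn with exclusive access to it, and then
// returns the worker to the pool.
func (p *pool) do(fn func(w *worker)) {
	w := <-p.idle // blocks until a worker is free
	defer func() { p.idle <- w }()
	fn(w)
}

// runJobs sends n jobs through a pool of the given size and returns the
// total number of calls handled by all workers.
func runJobs(size, n int) int {
	p := newPool(size)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.do(func(w *worker) { w.calls++ })
		}()
	}
	wg.Wait()

	total := 0
	for i := 0; i < size; i++ {
		total += (<-p.idle).calls
	}
	return total
}

func main() {
	fmt.Println(runJobs(2, 100)) // prints 100
}
```

Because every request must first claim a worker, the pool size caps both concurrency and memory use, which is why the number of workers matters.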

Prerequisites

To use this Go library, you need the actual pdfium library installed and available through pkg-config.

Get the library

You can try to compile pdfium yourself, but you can also use pre-compiled binaries, for example from: https://github.com/bblanchon/pdfium-binaries/releases

If you use a pre-compiled library, make sure to extract it somewhere logical, for example /opt/pdfium.

Configure pkg-config

Create/edit file /usr/lib/pkgconfig/pdfium.pc

prefix={path}
libdir={path}/lib
includedir={path}/include

Name: pdfium
Description: pdfium
Version: 4849
Requires:

Libs: -L${libdir} -lpdfium
Cflags: -I${includedir}

Replace {path} with the path you extracted/compiled pdfium in.

Make sure you extend your library path when running:

export LD_LIBRARY_PATH={path}/lib

You can do this globally or just in your editor.

On Ubuntu, you can set this globally by adding the export line to ~/.profile. The change takes effect after you log in again; to test it in the current terminal, run source ~/.profile.
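A missing library path usually only shows up as a loader error at startup, so it can help to check the environment early. The following is a minimal sketch, assuming pdfium was extracted to /opt/pdfium; containsDir is a hypothetical helper. Since the dynamic loader can also find libraries through other mechanisms, it only prints a warning.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// containsDir reports whether dir appears in a path list such as the
// value of LD_LIBRARY_PATH. Entries are compared after cleaning, so a
// trailing slash does not matter.
func containsDir(pathList, dir string) bool {
	want := filepath.Clean(dir)
	for _, entry := range filepath.SplitList(pathList) {
		if entry != "" && filepath.Clean(entry) == want {
			return true
		}
	}
	return false
}

func main() {
	libDir := "/opt/pdfium/lib" // assumed install location
	if !containsDir(os.Getenv("LD_LIBRARY_PATH"), libDir) {
		fmt.Printf("warning: %s is not in LD_LIBRARY_PATH\n", libDir)
	}
}
```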

Getting started

To get started, make sure that you create a separate package in your application that will start the worker.

The examples below can also be found in the examples folder.

Single-threaded

For single-threaded implementations, we just have to initialize the library.

pdfium/renderer/renderer.go

package renderer

import (
	"github.com/klippa-app/go-pdfium"
	"github.com/klippa-app/go-pdfium/single_threaded"
)

var Pdfium pdfium.Pdfium

func init() {
	// Init the pdfium library and return the instance to open documents.
	Pdfium = single_threaded.Init()
}

Multi-threaded

Worker package

This package has to be named main to make it available as a binary. The plugin system will use this to start new pdfium workers. Example:

pdfium/worker/main.go

package main

import (
	"github.com/klippa-app/go-pdfium/multi_threaded/worker"
)

func main() {
	worker.StartWorker()
}

Worker configuration

To actually start workers, you have to initialize the pdfium library somewhere. This also allows you to start workers dynamically when needed. The best place to do this is in the init() of a package that is going to call the pdfium library. Example:

pdfium/renderer/renderer.go

package renderer

import (
	"github.com/klippa-app/go-pdfium"
	"github.com/klippa-app/go-pdfium/multi_threaded"
)

var Pdfium pdfium.Pdfium

func init() {
	// Init the pdfium library and return the instance to open documents.
	// You can tweak these configs to your need. Be aware that workers can use quite some memory.
	Pdfium = multi_threaded.Init(multi_threaded.Config{
		MinIdle:  1, // Makes sure that at least x workers are always available
		MaxIdle:  1, // Makes sure that at most x workers are ever available
		MaxTotal: 1, // Maximum number of workers in total, allows the number of workers to grow when needed, items between total max and idle max are automatically cleaned up, while idle workers are kept alive so they can be used directly.
		Command: multi_threaded.Command{
			BinPath: "go",                                     // Only do this while developing, on production put the actual binary path in here. You should not want the Go runtime on production.
			Args:    []string{"run", "pdfium/worker/main.go"}, // This is a reference to the worker package, this can be left empty when using a direct binary path.
		},
	})
}

Get page count

package renderer

import (
	"io/ioutil"
	"log"

	"github.com/klippa-app/go-pdfium"
	"github.com/klippa-app/go-pdfium/requests"
)

var Pdfium pdfium.Pdfium

// Insert the single/multi-threaded init() here.

func main() {
	filePath := "example.pdf"
	pageCount, err := getPageCount(filePath)
	if err != nil {
		log.Fatal(err)
	}

	log.Printf("The PDF %s has %d pages", filePath, pageCount)
}

func getPageCount(filePath string) (int, error) {
	// Load the PDF file into a byte array.
	pdfBytes, err := ioutil.ReadFile(filePath)
	if err != nil {
		return 0, err
	}

	// Open the PDF using pdfium (and claim a worker)
	doc, err := Pdfium.NewDocument(&pdfBytes)
	if err != nil {
		return 0, err
	}

	// Always close the document, this will release the worker and its resources
	defer doc.Close()

	pageCount, err := doc.GetPageCount(&requests.GetPageCount{})
	if err != nil {
		return 0, err
	}

	return pageCount.PageCount, nil
}

Render a page

package renderer

import (
	"image/png"
	"io/ioutil"
	"log"
	"os"

	"github.com/klippa-app/go-pdfium"
	"github.com/klippa-app/go-pdfium/requests"
)

var Pdfium pdfium.Pdfium

// Insert the single/multi-threaded init() here.

func main() {
	filePath := "example.pdf"
	output := "example.pdf.png"
	err := renderPage(filePath, 1, output)
	if err != nil {
		log.Fatal(err)
	}
}

func renderPage(filePath string, page int, output string) error {
	// Load the PDF file into a byte array.
	pdfBytes, err := ioutil.ReadFile(filePath)
	if err != nil {
		return err
	}

	// Open the PDF using pdfium (and claim a worker)
	doc, err := Pdfium.NewDocument(&pdfBytes)
	if err != nil {
		return err
	}

	// Always close the document, this will release the worker and its resources
	defer doc.Close()

	// Render the page in DPI 200.
	pageRender, err := doc.RenderPageInDPI(&requests.RenderPageInDPI{
		DPI:  200,      // The DPI to render the page in.
		Page: page - 1, // The page to render, 0-indexed.
	})
	if err != nil {
		return err
	}

	// Write the output to a file.
	f, err := os.Create(output)
	if err != nil {
		return err
	}
	defer f.Close()

	err = png.Encode(f, pageRender.Image)
	if err != nil {
		return err
	}

	return nil
}
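The example above converts a 1-based page number to the 0-based index that the request expects by subtracting one. A small helper can do that conversion and validate the number against the page count before a render request is made. This is a sketch only: pageIndex is a hypothetical helper and not part of go-pdfium.

```go
package main

import (
	"errors"
	"fmt"
)

// pageIndex converts a 1-based page number into a 0-based page index,
// returning an error when the number falls outside 1..pageCount.
func pageIndex(pageNumber, pageCount int) (int, error) {
	if pageCount <= 0 {
		return 0, errors.New("document has no pages")
	}
	if pageNumber < 1 || pageNumber > pageCount {
		return 0, fmt.Errorf("page %d is out of range 1..%d", pageNumber, pageCount)
	}
	return pageNumber - 1, nil
}

func main() {
	idx, err := pageIndex(1, 3)
	fmt.Println(idx, err) // prints 0 <nil>
}
```

In the render example, the page count could come from GetPageCount and the resulting index would be passed as the Page field.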

About Klippa

Founded in 2015, Klippa's goal is to digitize & automate administrative processes with modern technologies. We help clients enhance the effectiveness of their organization by using machine learning and OCR. Since 2015, more than a thousand happy clients have used Klippa's software solutions. Klippa currently has an international team of 50 people, with offices in Groningen, Amsterdam and Brasov.

License

The MIT License (MIT)

Documentation ¶

Index ¶

type Document
type NewDocumentOption
- func OpenDocumentWithPasswordOption(password string) NewDocumentOption
type Pdfium

Types ¶

type Document ¶

type Document interface {
	// GetPageCount returns the amount of pages for the document.
	GetPageCount(request *requests.GetPageCount) (*responses.GetPageCount, error)

	// GetPageText returns the text of a given page in plain text.
	GetPageText(request *requests.GetPageText) (*responses.GetPageText, error)

	// GetPageTextStructured returns the text of a given page in a structured way,
	// with coordinates and font information.
	GetPageTextStructured(request *requests.GetPageTextStructured) (*responses.GetPageTextStructured, error)

	// RenderPageInDPI renders a given page in the given DPI.
	RenderPageInDPI(request *requests.RenderPageInDPI) (*responses.RenderPage, error)

	// RenderPagesInDPI renders the given pages in the given DPI.
	RenderPagesInDPI(request *requests.RenderPagesInDPI) (*responses.RenderPages, error)

	// RenderPageInPixels renders a given page in the given pixel size.
	RenderPageInPixels(request *requests.RenderPageInPixels) (*responses.RenderPage, error)

	// RenderPagesInPixels renders the given pages in the given pixel sizes.
	RenderPagesInPixels(request *requests.RenderPagesInPixels) (*responses.RenderPages, error)

	// GetPageSize returns the size of the page in points.
	GetPageSize(request *requests.GetPageSize) (*responses.GetPageSize, error)

	// GetPageSizeInPixels returns the size of a page in pixels when rendered in the given DPI.
	GetPageSizeInPixels(request *requests.GetPageSizeInPixels) (*responses.GetPageSizeInPixels, error)

	// RenderToFile allows you to call one of the other render functions
	// and output the resulting image into a file.
	RenderToFile(request *requests.RenderToFile) (*responses.RenderToFile, error)

	// Close closes the document, releases the resources and gives back the worker to the pool.
	Close()
}
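GetPageSize reports sizes in points, while GetPageSizeInPixels and the DPI-based render functions work in pixels. The two are related through the definition of a point as 1/72 of an inch, so pixels = points × DPI / 72. The sketch below shows that arithmetic; pointsToPixels is a hypothetical helper, and rounding to the nearest pixel is an assumption, so the library's own results may differ by a pixel.

```go
package main

import (
	"fmt"
	"math"
)

// pointsPerInch is the number of PDF points in one inch.
const pointsPerInch = 72.0

// pointsToPixels converts a length in points to pixels at the given DPI,
// rounding to the nearest whole pixel (the rounding mode is an assumption).
func pointsToPixels(points float64, dpi int) int {
	return int(math.Round(points * float64(dpi) / pointsPerInch))
}

func main() {
	// A US Letter page is 612 x 792 points.
	fmt.Println(pointsToPixels(612, 200), pointsToPixels(792, 200)) // prints 1700 2200
}
```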

type NewDocumentOption ¶

type NewDocumentOption interface {
	AlterOpenDocumentRequest(*requests.OpenDocument)
}

func OpenDocumentWithPasswordOption ¶

func OpenDocumentWithPasswordOption(password string) NewDocumentOption

OpenDocumentWithPasswordOption can be used as a NewDocumentOption when your PDF is password-protected.
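Based on the signatures above, the option is passed as an extra argument, for example Pdfium.NewDocument(&pdfBytes, pdfium.OpenDocumentWithPasswordOption("secret")). The sketch below shows how an option interface of this shape can alter a request before it is sent. It uses local stand-in types so that it runs without pdfium: openDocument, its Password field, withPassword and buildRequest are all assumptions made for illustration and do not describe the real requests.OpenDocument.

```go
package main

import "fmt"

// openDocument is a stand-in for requests.OpenDocument. The Password
// field is an assumption made for this sketch.
type openDocument struct {
	Password *string
}

// documentOption mirrors the shape of NewDocumentOption.
type documentOption interface {
	AlterOpenDocumentRequest(*openDocument)
}

// passwordOption sets the password on the request.
type passwordOption struct {
	password string
}

func (o passwordOption) AlterOpenDocumentRequest(r *openDocument) {
	r.Password = &o.password
}

// withPassword plays the role of OpenDocumentWithPasswordOption.
func withPassword(password string) documentOption {
	return passwordOption{password: password}
}

// buildRequest applies each option in order to a fresh request, the way
// a NewDocument implementation might do before opening the file.
func buildRequest(opts ...documentOption) *openDocument {
	req := &openDocument{}
	for _, opt := range opts {
		opt.AlterOpenDocumentRequest(req)
	}
	return req
}

func main() {
	req := buildRequest(withPassword("secret"))
	fmt.Println(*req.Password) // prints secret
}
```

The variadic opts parameter keeps the common call short while leaving room for further options without changing the NewDocument signature.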

type Pdfium ¶

type Pdfium interface {
	// NewDocument returns a pdfium Document from the given PDF.
	NewDocument(file *[]byte, opts ...NewDocumentOption) (Document, error)
}

Source Files ¶

pdfium.go

Directories ¶

Path
errors
examples
examples/multi_threaded
examples/multi_threaded/worker
examples/single_threaded
internal
internal/commons
internal/implementation
multi_threaded
multi_threaded/worker
requests
responses
shared_tests
single_threaded