docconv

package module
v1.1.2 Latest Latest
Published: Feb 18, 2021 License: MIT Imports: 21 Imported by: 0

README

docconv

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.

Note for returning users: the Go import path for this package has been moved to code.sajari.com/docconv.

Installation

If you haven't set up Go before, you first need to install Go.

To fetch and build the code:

$ go get code.sajari.com/docconv/...

This will also build the command line tool docd into $GOPATH/bin. Make sure that $GOPATH/bin is in your PATH environment variable.

Dependencies

tidy, wv, poppler-utils, unrtf, https://github.com/JalfResi/justext

Example installation of the dependencies (uses apt-get, so this will not work on all systems):

$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext

Optional dependencies

To add image support to the docconv library you first need to install and build gosseract.

Now you can add -tags ocr to any go command when building/fetching/testing docconv to include support for processing images:

$ go get -tags ocr code.sajari.com/docconv/...

The build may fail on macOS if tesseract is not installed. You can fix this by installing tesseract via brew:

$ brew install tesseract

docd tool

The docd tool runs in one of three ways:

  1. a service on port 8888 (by default)

    Documents can be sent as a multipart POST request and the plain text (body) and meta information are then returned as a JSON object.

  2. a service exposed from within a Docker container

    This also runs as a service, but from within a Docker container. There are three build scripts: debian, alpine and appengine.

    The debian version uses the Debian package repository, so the installed dependency versions can vary between builds. The alpine version uses a heavily cut-down Linux distribution to produce a container of about 40MB. It also locks the dependency versions for consistency, but may miss out on future updates. The appengine version is a flex-based custom runtime for Google Cloud.

  3. via the command line.

    Documents can be sent as an argument, e.g.

    $ docd -input document.pdf

Optional flags

  • addr - the bind address for the HTTP server, default is ":8888"
  • log-level
    • 0: errors & critical info
    • 1: includes 0 and logs each request as well
    • 2: includes 1 and logs the response payloads
  • readability-length-low - sets the readability length low if the ?readability=1 parameter is set
  • readability-length-high - sets the readability length high if the ?readability=1 parameter is set
  • readability-stopwords-low - sets the readability stopwords low if the ?readability=1 parameter is set
  • readability-stopwords-high - sets the readability stopwords high if the ?readability=1 parameter is set
  • readability-max-link-density - sets the readability max link density if the ?readability=1 parameter is set
  • readability-max-heading-distance - sets the readability max heading distance if the ?readability=1 parameter is set
  • readability-use-classes - comma separated list of readability classes to use if the ?readability=1 parameter is set
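
The addr and log-level flags above could be modelled with the standard flag package roughly as follows. This is a minimal sketch, not the docd source: the config type, parseFlags and the two helper methods are hypothetical names, and the default log level of 0 is an assumption (only the ":8888" default for addr is stated above).

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// config holds the two flags sketched here (hypothetical type, not part of docd).
type config struct {
	addr     string
	logLevel int
}

// parseFlags parses args into a config without touching global flag state.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("docd-sketch", flag.ContinueOnError)
	fs.StringVar(&c.addr, "addr", ":8888", "bind address for the HTTP server")
	// Default of 0 is an assumption; the README does not state it.
	fs.IntVar(&c.logLevel, "log-level", 0, "0: errors, 1: also requests, 2: also response payloads")
	err := fs.Parse(args)
	return c, err
}

// Each level includes the ones below it, so comparisons use >=.
func (c config) logsRequests() bool { return c.logLevel >= 1 }
func (c config) logsPayloads() bool { return c.logLevel >= 2 }

func main() {
	c, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("addr=%s requests=%t payloads=%t\n", c.addr, c.logsRequests(), c.logsPayloads())
}
```

Treating the levels as cumulative thresholds matches the description that each level includes the previous one.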

How to start the service

$ # This will only log errors and critical info
$ docd -log-level 0

$ # This will run on port 8000 and log each request
$ docd -addr :8000 -log-level 1

Example usage (code)

Some basic code is shown below, but normally you would accept the file by HTTP or open it from the file system.

This should be enough to get you started though.

Use case 1: run locally

Note: this assumes you have the dependencies installed.

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	res, err := docconv.ConvertPath("your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Use case 2: request over the network

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv/client"
)

func main() {
	// Create a new client, using the default endpoint (localhost:8888)
	c := client.New()

	res, err := client.ConvertPath(c, "your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func ConvertDoc

func ConvertDoc(r io.Reader) (string, map[string]string, error)

ConvertDoc converts an MS Word .doc to text.

func ConvertDocx

func ConvertDocx(r io.Reader) (string, map[string]string, error)

ConvertDocx converts an MS Word docx file to text.

func ConvertHTML

func ConvertHTML(r io.Reader, readability bool) (string, map[string]string, error)

ConvertHTML converts HTML into text.

func ConvertImage

func ConvertImage(r io.Reader) (string, map[string]string, error)

ConvertImage converts images to text. Requires gosseract (ocr build tag).

func ConvertODT

func ConvertODT(r io.Reader) (string, map[string]string, error)

ConvertODT converts an ODT file to text.

func ConvertPDF

func ConvertPDF(r io.Reader) (string, map[string]string, error)

func ConvertPDFText

func ConvertPDFText(path string) (BodyResult, MetaResult, error)

func ConvertPages

func ConvertPages(r io.Reader) (string, map[string]string, error)

ConvertPages converts a Pages file to text.

func ConvertPathReadability

func ConvertPathReadability(path string, readability bool) ([]byte, error)

ConvertPathReadability converts a local path to text, with the given readability option.

func ConvertPptx added in v1.1.2

func ConvertPptx(r io.Reader) (string, map[string]string, error)

ConvertPptx converts an MS PowerPoint pptx file to text.

func ConvertRTF

func ConvertRTF(r io.Reader) (string, map[string]string, error)

ConvertRTF converts RTF files to text.

func ConvertURL

func ConvertURL(input io.Reader, readability bool) (string, map[string]string, error)

ConvertURL fetches the HTML page at the URL given in the io.Reader.

func ConvertXML

func ConvertXML(r io.Reader) (string, map[string]string, error)

ConvertXML converts an XML file to text.

func DocxXMLToText

func DocxXMLToText(r io.Reader) (string, error)

DocxXMLToText converts Docx XML into plain text.

func HTMLReadability

func HTMLReadability(r io.Reader) []byte

HTMLReadability extracts the readable text in an HTML document.

func HTMLToText

func HTMLToText(input io.Reader) string

HTMLToText converts HTML to plain text.

func MimeTypeByExtension

func MimeTypeByExtension(filename string) string

MimeTypeByExtension returns a mimetype for the given extension, or application/octet-stream if none can be determined.
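
To illustrate the fallback behaviour described above, here is a minimal sketch built on the standard library's mime package. It is not docconv's implementation, and the types docconv maps each extension to may differ. mimeTypeByExtension is a hypothetical stand-in.

```go
package main

import (
	"fmt"
	"mime"
	"path/filepath"
	"strings"
)

// mimeTypeByExtension looks up the MIME type for filename's extension and
// falls back to application/octet-stream when nothing is known.
func mimeTypeByExtension(filename string) string {
	ext := strings.ToLower(filepath.Ext(filename))
	if t := mime.TypeByExtension(ext); t != "" {
		// Drop parameters such as "; charset=utf-8" and keep the bare type.
		if bare, _, err := mime.ParseMediaType(t); err == nil {
			return bare
		}
		return t
	}
	return "application/octet-stream"
}

func main() {
	for _, name := range []string{"report.pdf", "index.html", "data.unknownext"} {
		fmt.Printf("%s -> %s\n", name, mimeTypeByExtension(name))
	}
}
```

The returned value is the kind of string that Convert accepts as its mimeType argument.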

func SetImageLanguages

func SetImageLanguages(...string)

SetImageLanguages sets the languages parameter passed to gosseract.

func Tidy

func Tidy(r io.Reader, xmlIn bool) ([]byte, error)

Tidy attempts to tidy up XML. Errors & warnings are deliberately suppressed as underlying tools throw warnings very easily.

func XMLToMap

func XMLToMap(r io.Reader) (map[string]string, error)

XMLToMap converts XML to a nested string map.

func XMLToText

func XMLToText(r io.Reader, breaks []string, skip []string, strict bool) (string, error)

XMLToText converts XML to plain text given how to treat elements.

Types

type BodyResult

type BodyResult struct {
	// contains filtered or unexported fields
}

type HTMLReadabilityOptions

type HTMLReadabilityOptions struct {
	LengthLow             int
	LengthHigh            int
	StopwordsLow          float64
	StopwordsHigh         float64
	MaxLinkDensity        float64
	MaxHeadingDistance    int
	ReadabilityUseClasses string
}

HTMLReadabilityOptions is a type which defines parameters that are passed to the justext package. TODO: Improve this!

var HTMLReadabilityOptionsValues HTMLReadabilityOptions

HTMLReadabilityOptionsValues are the global settings used for HTMLReadability. TODO: Remove this from global state.

type LocalFile

type LocalFile struct {
	*os.File
	// contains filtered or unexported fields
}

LocalFile is a type which wraps an *os.File. See NewLocalFile for more details.

func NewLocalFile

func NewLocalFile(r io.Reader) (*LocalFile, error)

NewLocalFile ensures that there is a file which contains the data provided by r. If r is actually an instance of *os.File then this file is used, otherwise a temporary file is created and the data from r copied into it. Callers must call Done() when the LocalFile is no longer needed to ensure all resources are cleaned up.

func (*LocalFile) Done

func (l *LocalFile) Done()

Done cleans up all resources.

type MetaResult

type MetaResult struct {
	// contains filtered or unexported fields
}

Meta data

type Response

type Response struct {
	Body  string            `json:"body"`
	Meta  map[string]string `json:"meta"`
	MSecs uint32            `json:"msecs"`
	Error string            `json:"error"`
}

Response payload sent back to the requestor
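
A caller that receives this payload as JSON can decode it into a struct with the same tags. The sketch below redeclares the struct locally so that it compiles without the package. The sample payload is made up for illustration, decodeResponse is a hypothetical helper, and treating a non-empty error field as a failure is an assumption about how the service reports problems.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Response mirrors the struct shown above.
type Response struct {
	Body  string            `json:"body"`
	Meta  map[string]string `json:"meta"`
	MSecs uint32            `json:"msecs"`
	Error string            `json:"error"`
}

// decodeResponse parses raw JSON and turns a non-empty Error field into a Go error.
func decodeResponse(raw []byte) (*Response, error) {
	var res Response
	if err := json.Unmarshal(raw, &res); err != nil {
		return nil, fmt.Errorf("decoding response: %w", err)
	}
	if res.Error != "" {
		return &res, errors.New(res.Error)
	}
	return &res, nil
}

func main() {
	// Illustrative payload, not captured from a real server.
	raw := []byte(`{"body":"Hello","meta":{"Author":"Jane"},"msecs":12,"error":""}`)
	res, err := decodeResponse(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d ms, author %q: %s\n", res.MSecs, res.Meta["Author"], res.Body)
}
```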

func Convert

func Convert(r io.Reader, mimeType string, readability bool) (*Response, error)

Convert converts a file to plain text.

func ConvertPath

func ConvertPath(path string) (*Response, error)

ConvertPath converts a local path to text.

Directories

Package client defines types and functions for interacting with docconv HTTP servers.
Package TSP is a generated protocol buffer package.
Package snappy implements the snappy block-based compression format.
