fileconversion

package module

Version: v0.0.0-...-1b64e2d (latest), Published: Oct 30, 2019, License: Unlicense, Imports: 52, Imported by: 0

Repository

github.com/charlesnjau/fileconversion

README ¶

fileconversion

This is a Go library to convert various file formats into plaintext and provide related useful functions.

This library is used for https://intelx.io and was successfully tested on over 184 million individual files. It is partly written from scratch, partly forked from open source projects, and partly a rewrite of existing code. Many existing libraries lack stability and functionality, and this library addresses that.

We welcome any contributions - please open issues for any feature requests, bugs, and other related issues.

It supports the following file formats for plaintext conversion:

Word: DOC, DOCX, RTF, ODT
Excel: XLS, XLSX, ODS
PowerPoint: PPTX
PDF
Ebook: EPUB, MOBI
Website: HTML

Functions for compressed and container files:

Decompress files: GZ, BZ, BZ2, XZ
Extract files from containers: ZIP, RAR, 7Z, TAR

Picture related functions:

Check if pictures are excessively large
Compress (and convert) pictures to JPEG: GIF, JPEG, PNG, BMP, TIFF
Resize and compress pictures
Extract pictures from PDF files

To download this library:

go get -u github.com/IntelligenceX/fileconversion

And then use it like:

package main

import (
	"bytes"
	"fmt"
	"os"

	"github.com/IntelligenceX/fileconversion"
)

const sizeLimit = 2 * 1024 * 1024 // 2 MB

func main() {
	// extract text from an XLSX file
	file, err := os.Open("Test.xlsx")
	if err != nil {
		fmt.Printf("Error opening file: %s\n", err)
		return
	}

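	defer file.Close()

	// the converter needs the full size of the input file
	stat, err := file.Stat()
	if err != nil {
		fmt.Printf("Error reading file info: %s\n", err)
		return
	}

	buffer := bytes.NewBuffer(make([]byte, 0, sizeLimit))

	// limit the output to sizeLimit bytes; -1 means no row limit
	if _, err := fileconversion.XLSX2Text(file, stat.Size(), buffer, sizeLimit, -1); err != nil {
		fmt.Printf("Error converting file: %s\n", err)
		return
	}

	fmt.Println(buffer.String())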
}

Functions

The package exports the following functions:

XLSX2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64, rowLimit int) (written int64, err error)
DOCX2Text(file io.ReaderAt, size int64) (string, error)
EPUB2Text(file io.ReaderAt, size int64, limit int64) (string, error)
HTML2Text(reader io.Reader) (pageText string, err error)
HTML2TextAndLinks(reader io.Reader, baseURL string) (pageText string, links []string, err error)
Mobi2Text(file io.ReadSeeker) (string, error)
ODS2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)
ODT2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)
PDFListContentStreams(f io.ReadSeeker, w io.Writer, size int64) (written int64, err error)
PPTX2Text(file io.ReaderAt, size int64) (string, error)
RTF2Text(inputRtf string) string
XLS2Text(reader io.ReadSeeker, writer io.Writer, size int64) (written int64, err error)

Picture functions:

IsExcessiveLargePicture(Picture []byte) (excessive bool, err error)
CompressJPEG(Picture []byte, quality int) (compressed []byte)
ResizeCompressPicture(Picture []byte, Quality int, MaxWidth, MaxHeight uint) (compressed []byte, err error)
PDFExtractImages(input io.ReadSeeker) (images []ImageResult, err error)

Compression and container file functions:

DecompressFile(data []byte) (decompressed []byte, valid bool)
ContainerExtractFiles(data []byte, callback func(name string, size int64, date time.Time, data []byte))

Dependencies

This library uses other Go packages. Run the following commands to download them:

go get -u github.com/nwaples/rardecode
go get -u github.com/saracen/go7z
go get -u github.com/ulikunitz/xz
go get -u github.com/mattetti/filebuffer
go get -u github.com/richardlehane/mscfb
go get -u github.com/taylorskalyo/goreader/epub
go get -u github.com/PuerkitoBio/goquery
go get -u github.com/ssor/bom
go get -u github.com/levigross/exp-html
go get -u github.com/neofight/mobi/convert
go get -u github.com/neofight/mobi/headers
go get -u github.com/unidoc/unipdf
go get -u github.com/nfnt/resize
go get -u github.com/tealeg/xlsx
go get -u gopkg.in/xmlpath.v2

Tests

There are no functional tests. The only test functions are used manually for debugging.

Forks

Other packages were tested and found to be either insufficient or unstable. Many of the packages listed below were found to be unstable, to cause crashes, and to exhaust memory due to bad programming, bad input sanitizing and bad memory management.

html2text is forked from https://github.com/jaytaylor/html2text
odf is forked from https://github.com/knieriem/odf
ole2 is forked and partly rewritten from https://github.com/extrame/ole2
xls is forked from https://github.com/sergeilem/xls which is a fork from https://github.com/extrame/xls
doc is forked from https://github.com/EndFirstCorp/doc2txt
docx is forked from https://github.com/guylaor/goword
mobi is forked from https://github.com/neofight/mobi
odt is forked from https://github.com/lu4p/cat
pptx is forked from https://github.com/mr-tim/rol-o-decks
rtf is forked from https://github.com/J45k4/rtf-go

License

This is free and unencumbered software released into the public domain.

Note that this package includes, or consists partly of, forks or rewrites of existing open source code. Use at your own risk. Intelligence X does not provide any warranty for this library or any parts of it.

Documentation ¶

Index ¶

func CompressJPEG(Picture []byte, quality int) (compressed []byte)
func ContainerExtractFiles(data []byte, ...)
func DOC2Text(r io.Reader) (io.Reader, error)
func DOCX2Text(file io.ReaderAt, size int64) (string, error)
func DecompressFile(data []byte) (decompressed []byte, valid bool)
func EPUB2Text(file io.ReaderAt, size int64, limit int64) (string, error)
func HTML2Text(reader io.Reader) (pageText string, err error)
func HTML2TextAndLinks(reader io.Reader, baseURL string) (pageText string, links []string, err error)
func InitPDFLicense(key, name string)
func IsExcessiveLargePicture(Picture []byte) (excessive bool, err error)
func IsFileDOC(data []byte) bool
func IsFileDOCX(data []byte) bool
func IsFileMOBI(data []byte) bool
func IsFilePPT(data []byte) bool
func IsFilePPTX(data []byte) bool
func IsFileRTF(data []byte) bool
func IsFileXLS(data []byte) bool
func IsFileXLSX(data []byte) bool
func IsFileZIP(data []byte) bool
func Mobi2Text(file io.ReadSeeker) (string, error)
func ODS2Cells(file io.ReaderAt, size int64) (cells []string, err error)
func ODS2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)
func ODT2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)
func PDFGetCreationDate(f io.ReadSeeker) (date time.Time, valid bool)
func PDFListContentStreams(f io.ReadSeeker, w io.Writer, size int64) (written int64, err error)
func PPTX2Text(file io.ReaderAt, size int64) (string, error)
func RTF2Text(inputRtf string) string
func ResizeCompressPicture(Picture []byte, Quality int, MaxWidth, MaxHeight uint) (compressed []byte, err error)
func XLS2Cells(reader io.ReadSeeker) (cells []string, err error)
func XLS2Text(reader io.ReadSeeker, writer io.Writer, size int64) (written int64, err error)
func XLSX2Cells(file io.ReaderAt, size int64, rowLimit int) (cells []string, err error)
func XLSX2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64, rowLimit int) (written int64, err error)
type ImageResult
- func PDFExtractImages(input io.ReadSeeker) (images []ImageResult, err error)
type PPTXDocument
- func (doc PPTXDocument) AsText() (text string)
type PPTXSlide
type SlideNumberSorter
- func (a SlideNumberSorter) Len() int
- func (a SlideNumberSorter) Less(i, j int) bool
- func (a SlideNumberSorter) Swap(i, j int)
type WordDocument
- func WordParse(doc string) (WordDocument, error)
- func (w WordDocument) AsText() string
type WordParagraph
type WordRow
type WordStyle

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CompressJPEG ¶

func CompressJPEG(Picture []byte, quality int) (compressed []byte)

CompressJPEG compresses a JPEG picture according to the given quality. Warning: If the image claims to be large (in terms of width & height), this may use a lot of memory. Use IsExcessiveLargePicture first.

func ContainerExtractFiles ¶

func ContainerExtractFiles(data []byte, callback func(name string, size int64, date time.Time, data []byte))

ContainerExtractFiles extracts files from supported containers: ZIP, RAR, 7Z, TAR

func DOC2Text ¶

func DOC2Text(r io.Reader) (io.Reader, error)

DOC2Text reads a Microsoft Word .doc binary file from a standard io.Reader and returns a reader (actually a bytes.Buffer) which will output the plain text found in the .doc file.

func DOCX2Text ¶

func DOCX2Text(file io.ReaderAt, size int64) (string, error)

DOCX2Text extracts the text of a Word document. Size is the full size of the input file.

func DecompressFile ¶

func DecompressFile(data []byte) (decompressed []byte, valid bool)

DecompressFile decompresses data. It supports: GZ, BZ, BZ2, XZ
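
As an illustration of the (decompressed, valid) return shape, here is a minimal sketch for the GZ case only, built on the standard library. decompressGZ and compressGZ are hypothetical helpers; how the package itself detects and handles each format is not shown in this documentation.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipMagic is the two-byte signature that starts every gzip stream.
var gzipMagic = []byte{0x1F, 0x8B}

// decompressGZ (hypothetical) returns the decompressed data, or valid=false
// if the input is not a readable gzip stream.
func decompressGZ(data []byte) (decompressed []byte, valid bool) {
	if !bytes.HasPrefix(data, gzipMagic) {
		return nil, false
	}
	reader, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, false
	}
	defer reader.Close()

	decompressed, err = io.ReadAll(reader)
	if err != nil {
		return nil, false
	}
	return decompressed, true
}

// compressGZ (hypothetical) builds gzip input for the demonstration.
func compressGZ(plain []byte) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(plain); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	packed := compressGZ([]byte("hello"))
	if plain, ok := decompressGZ(packed); ok {
		fmt.Printf("%s\n", plain)
	}
}
```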

func EPUB2Text ¶

func EPUB2Text(file io.ReaderAt, size int64, limit int64) (string, error)

EPUB2Text converts an EPUB ebook to text

func HTML2Text ¶

func HTML2Text(reader io.Reader) (pageText string, err error)

HTML2Text extracts the text from the HTML

func HTML2TextAndLinks ¶

func HTML2TextAndLinks(reader io.Reader, baseURL string) (pageText string, links []string, err error)

HTML2TextAndLinks extracts the text from the HTML and all links from <a> and <img> tags of an HTML document. If the base URL is provided, relative links will be converted to absolute ones.
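
Converting a relative link to an absolute one against a base URL can be sketched with net/url. resolveLink is a hypothetical helper written for illustration; the package may resolve links differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink (hypothetical) resolves href against baseURL. If no base URL
// is given, or either value cannot be parsed, href is returned unchanged.
func resolveLink(baseURL, href string) string {
	if baseURL == "" {
		return href
	}
	base, err := url.Parse(baseURL)
	if err != nil {
		return href
	}
	ref, err := url.Parse(href)
	if err != nil {
		return href
	}
	return base.ResolveReference(ref).String()
}

func main() {
	base := "https://example.com/docs/page.html"
	for _, href := range []string{"img/logo.png", "/about", "https://other.example/x"} {
		fmt.Println(resolveLink(base, href))
	}
}
```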

func InitPDFLicense ¶

func InitPDFLicense(key, name string)

InitPDFLicense initializes the PDF license

func IsExcessiveLargePicture ¶

func IsExcessiveLargePicture(Picture []byte) (excessive bool, err error)

IsExcessiveLargePicture checks if the picture has reasonable width and height, preventing a potential DoS when decoding it. This protects against the following problem: if the image claims to be large (in terms of width & height), jpeg.Decode may use a lot of memory, see https://github.com/golang/go/issues/10532.
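
One way to perform such a check without decoding the pixel data is image.DecodeConfig, which reads only the image header. The sketch below is an assumption about the general technique, not the package's code: isExcessivelyLarge, samplePNG and the pixel limit passed in are all hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	_ "image/gif"  // register GIF header decoding
	_ "image/jpeg" // register JPEG header decoding
	"image/png"    // registers PNG and is used to build the sample
)

// isExcessivelyLarge (hypothetical) reports whether the claimed dimensions
// exceed maxPixels. Only the header is read, so no pixel buffer is allocated.
func isExcessivelyLarge(picture []byte, maxPixels int64) (bool, error) {
	config, _, err := image.DecodeConfig(bytes.NewReader(picture))
	if err != nil {
		return false, err
	}
	return int64(config.Width)*int64(config.Height) > maxPixels, nil
}

// samplePNG (hypothetical) encodes a blank grayscale PNG for the demo.
func samplePNG(width, height int) []byte {
	var buf bytes.Buffer
	img := image.NewGray(image.Rect(0, 0, width, height))
	if err := png.Encode(&buf, img); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	picture := samplePNG(20, 10)
	excessive, err := isExcessivelyLarge(picture, 100)
	fmt.Println(excessive, err)
}
```

Only if the check passes would the caller go on to decode, resize or compress the picture.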

func IsFileDOC ¶

func IsFileDOC(data []byte) bool

IsFileDOC checks if the data indicates a DOC file. DOC has multiple signatures according to https://filesignatures.net/index.php?search=doc&mode=EXT, including D0 CF 11 E0 A1 B1 1A E1.

func IsFileDOCX ¶

func IsFileDOCX(data []byte) bool

IsFileDOCX checks if the data indicates a DOCX file. DOCX has a signature of 50 4B 03 04.

func IsFileMOBI ¶

func IsFileMOBI(data []byte) bool

IsFileMOBI checks if the data indicates a MOBI file

func IsFilePPT ¶

func IsFilePPT(data []byte) bool

IsFilePPT checks if the data indicates a PPT file. PPT has multiple signatures according to https://www.filesignatures.net/index.php?page=search&search=PPT&mode=EXT, including D0 CF 11 E0 A1 B1 1A E1. This overlaps with other formats (including DOC and XLS).

func IsFilePPTX ¶

func IsFilePPTX(data []byte) bool

IsFilePPTX checks if the data indicates a PPTX file. PPTX has a signature of 50 4B 03 04. Warning: This collides with ZIP, DOCX and other zip-based files.

func IsFileRTF ¶

func IsFileRTF(data []byte) bool

IsFileRTF checks if the data indicates an RTF file. RTF has a signature of 7B 5C 72 74 66 31, or as a string "{\rtf1".

func IsFileXLS ¶

func IsFileXLS(data []byte) bool

IsFileXLS checks if the data indicates an XLS file. XLS has a signature of D0 CF 11 E0 A1 B1 1A E1.

func IsFileXLSX ¶

func IsFileXLSX(data []byte) bool

IsFileXLSX checks if the data indicates an XLSX file. XLSX has a signature of 50 4B 03 04. Warning: This collides with ZIP, DOCX and other zip-based files.

func IsFileZIP ¶

func IsFileZIP(data []byte) bool

IsFileZIP checks if the data indicates a ZIP file. Many file formats like DOCX, XLSX, PPTX and APK are actually ZIP files. The signature is 50 4B 03 04.
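
The signature checks documented for the IsFile functions amount to comparing the first bytes of the data against a known prefix. The following is a minimal sketch of that idea using the signatures quoted above; guessFamily is a hypothetical helper and does not reflect any additional checks the package may perform.

```go
package main

import (
	"bytes"
	"fmt"
)

var (
	// sigZIP is shared by ZIP, DOCX, XLSX, PPTX and other zip-based files.
	sigZIP = []byte{0x50, 0x4B, 0x03, 0x04}
	// sigOLE2 is shared by the legacy DOC, XLS and PPT formats.
	sigOLE2 = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}
	// sigRTF is the byte sequence 7B 5C 72 74 66 31.
	sigRTF = []byte(`{\rtf1`)
)

// guessFamily (hypothetical) reports which signature family the data starts
// with. A prefix alone cannot tell DOCX from XLSX, or DOC from XLS.
func guessFamily(data []byte) string {
	switch {
	case bytes.HasPrefix(data, sigZIP):
		return "zip"
	case bytes.HasPrefix(data, sigOLE2):
		return "ole2"
	case bytes.HasPrefix(data, sigRTF):
		return "rtf"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(guessFamily([]byte("PK\x03\x04rest of archive")))
	fmt.Println(guessFamily([]byte(`{\rtf1\ansi Hello}`)))
	fmt.Println(guessFamily([]byte("plain text")))
}
```

Because the signatures collide, telling the zip-based formats apart requires looking inside the container after the prefix check.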

func Mobi2Text ¶

func Mobi2Text(file io.ReadSeeker) (string, error)

Mobi2Text converts a MOBI ebook to text

func ODS2Cells ¶

func ODS2Cells(file io.ReaderAt, size int64) (cells []string, err error)

ODS2Cells converts an ODS file to individual cells. Size is the full size of the input file.

func ODS2Text ¶

func ODS2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)

ODS2Text extracts the text of an OpenDocument Spreadsheet. Size is the full size of the input file.

func ODT2Text ¶

func ODT2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)

ODT2Text extracts the text of an OpenDocument Text file. Size is the full size of the input file.

func PDFGetCreationDate ¶

func PDFGetCreationDate(f io.ReadSeeker) (date time.Time, valid bool)

PDFGetCreationDate tries to get the creation date

func PDFListContentStreams ¶

func PDFListContentStreams(f io.ReadSeeker, w io.Writer, size int64) (written int64, err error)

PDFListContentStreams writes all text streams in a PDF to the writer. It returns the number of characters it attempted to write (excluding "Page N" and newlines) and an error, if any. The return value can be used to determine whether any text was extracted. The parameter size is the maximum number of bytes (not characters) to write out.

func PPTX2Text ¶

func PPTX2Text(file io.ReaderAt, size int64) (string, error)

PPTX2Text extracts the text of a PowerPoint document. Size is the full size of the input file.

func RTF2Text ¶

func RTF2Text(inputRtf string) string

RTF2Text removes rtf characters from string and returns the new string.

func ResizeCompressPicture ¶

func ResizeCompressPicture(Picture []byte, Quality int, MaxWidth, MaxHeight uint) (compressed []byte, err error)

ResizeCompressPicture scales a picture down and compresses it. It accepts GIF, JPEG, PNG as input but output will always be JPEG. Quality specifies the output JPEG quality 0-100. Anything below 75 will noticeably reduce the picture quality. Warning: If the image claims to be large (in terms of width & height), this may use a lot of memory. Use IsExcessiveLargePicture first. Scaling a picture down is optional and only done if MaxWidth and MaxHeight are not 0. Even without rescaling, this function is useful to convert a picture into JPEG.

func XLS2Cells ¶

func XLS2Cells(reader io.ReadSeeker) (cells []string, err error)

XLS2Cells converts an XLS file to individual cells

func XLS2Text ¶

func XLS2Text(reader io.ReadSeeker, writer io.Writer, size int64) (written int64, err error)

XLS2Text extracts text from an Excel sheet. It returns the number of bytes written. The parameter size is the maximum number of bytes (not characters) to write out. The whole Excel file is required even for partial text extraction. This function returns no error and 0 bytes written if the file is corrupted or invalid.

func XLSX2Cells ¶

func XLSX2Cells(file io.ReaderAt, size int64, rowLimit int) (cells []string, err error)

XLSX2Cells converts an XLSX file to individual cells. Size is the full size of the input file. rowLimit defines how many rows per sheet to extract; -1 means unlimited. This exists as protection against some XLSX files that may use an excessive amount of memory.

func XLSX2Text ¶

func XLSX2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64, rowLimit int) (written int64, err error)

XLSX2Text extracts the text of an Excel sheet. Size is the full size of the input file. Limit is the output limit in bytes. rowLimit defines how many rows per sheet to extract; -1 means unlimited. This exists as protection against some XLSX files that may use an excessive amount of memory.

Types ¶

type ImageResult ¶

type ImageResult struct {
	Image image.Image
	Name  string
}

ImageResult contains an extracted image

func PDFExtractImages ¶

func PDFExtractImages(input io.ReadSeeker) (images []ImageResult, err error)

PDFExtractImages extracts all images from a PDF file

type PPTXDocument ¶

type PPTXDocument struct {
	Slides []PPTXSlide
}

PPTXDocument is a PPTX document loaded into memory

func (PPTXDocument) AsText ¶

func (doc PPTXDocument) AsText() (text string)

AsText returns the text on all slides

type PPTXSlide ¶

type PPTXSlide struct {
	SlideNumber int
	//ThumbnailBase64 string
	TextContent string
}

PPTXSlide is a single PPTX slide

type SlideNumberSorter ¶

type SlideNumberSorter []PPTXSlide

SlideNumberSorter is used for sorting

func (SlideNumberSorter) Len ¶

func (a SlideNumberSorter) Len() int

func (SlideNumberSorter) Less ¶

func (a SlideNumberSorter) Less(i, j int) bool

func (SlideNumberSorter) Swap ¶

func (a SlideNumberSorter) Swap(i, j int)

type WordDocument ¶

type WordDocument struct {
	Paragraphs []WordParagraph
}

WordDocument is a full word doc

func WordParse ¶

func WordParse(doc string) (WordDocument, error)

WordParse parses a word file

func (WordDocument) AsText ¶

func (w WordDocument) AsText() string

AsText returns all text in the document

type WordParagraph ¶

type WordParagraph struct {
	Style WordStyle `xml:"pPr>pStyle"`
	Rows  []WordRow `xml:"r"`
}

WordParagraph is a single paragraph

type WordRow ¶

type WordRow struct {
	Text string `xml:"t"`
}

WordRow is a single run of text (an "r" element) within a paragraph

type WordStyle ¶

type WordStyle struct {
	Val string `xml:"val,attr"`
}

WordStyle holds the paragraph style name (the "val" attribute of a "pStyle" element)
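
The struct tags on WordParagraph, WordRow and WordStyle follow encoding/xml conventions, so a single paragraph element can be unmarshalled directly into them. The sketch below uses local copies of the three types and a hand-written sample fragment; parseParagraph and paragraphText are hypothetical helpers, and how WordParse itself walks a full document is not shown here.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Local copies of the documented types, for illustration only.
type WordStyle struct {
	Val string `xml:"val,attr"`
}

type WordRow struct {
	Text string `xml:"t"`
}

type WordParagraph struct {
	Style WordStyle `xml:"pPr>pStyle"`
	Rows  []WordRow `xml:"r"`
}

// sampleParagraph is a hand-written WordprocessingML paragraph fragment.
const sampleParagraph = `<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">` +
	`<w:pPr><w:pStyle w:val="Heading1"/></w:pPr>` +
	`<w:r><w:t>Hello, </w:t></w:r>` +
	`<w:r><w:t>world</w:t></w:r>` +
	`</w:p>`

// parseParagraph (hypothetical) unmarshals one paragraph element.
func parseParagraph(fragment string) (WordParagraph, error) {
	var paragraph WordParagraph
	err := xml.Unmarshal([]byte(fragment), &paragraph)
	return paragraph, err
}

// paragraphText (hypothetical) joins the text of all runs in a paragraph.
func paragraphText(paragraph WordParagraph) string {
	var sb strings.Builder
	for _, row := range paragraph.Rows {
		sb.WriteString(row.Text)
	}
	return sb.String()
}

func main() {
	paragraph, err := parseParagraph(sampleParagraph)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("style=%s text=%q\n", paragraph.Style.Val, paragraphText(paragraph))
}
```

Because the tags carry no namespace, encoding/xml matches elements and attributes by local name, which is why the w: prefix in the sample needs no special handling.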

Directories ¶

Path	Synopsis
html2text
odf
ods	This package implements rudimentary support for reading Open Document Spreadsheet files.
ole2
xls	The xls package parses 97-2004 Microsoft xls files (".xls" suffix, NOT ".xlsx" suffix).
