fileconversion

package module
v0.0.0-...-dfdb317 Latest
Published: Oct 30, 2024 License: Unlicense Imports: 49 Imported by: 0

README

fileconversion

This is a Go library to convert various file formats into plaintext and provide related useful functions.

This library is used for https://intelx.io and was successfully tested on more than 184 million individual files. It is partly written from scratch, partly forked from open source and partly a rewrite of existing code. Many existing libraries lack stability and functionality, and this library addresses that.

We welcome any contributions - please open issues for any feature requests, bugs, and other related issues.

It supports the following file formats for plaintext conversion:

  • Word: DOC, DOCX, RTF, ODT
  • Excel: XLS, XLSX, ODS
  • PowerPoint: PPTX
  • PDF
  • Ebook: EPUB, MOBI
  • Website: HTML

Functions for compressed and container files:

  • Decompress files: GZ, BZ, BZ2, XZ
  • Extract files from containers: ZIP, RAR, 7Z, TAR

Picture related functions:

  • Check if pictures are excessively large
  • Compress (and convert) pictures to JPEG: GIF, JPEG, PNG, BMP, TIFF
  • Resize and compress pictures
  • Extract pictures from PDF files

To download this library:

go get -u github.com/IntelligenceX/fileconversion

And then use it like:

package main

import (
	"bytes"
	"fmt"
	"os"

	"github.com/IntelligenceX/fileconversion"
)

const sizeLimit = 2 * 1024 * 1024 // 2 MB

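func main() {
	// Extract text from an XLSX file.
	file, err := os.Open("Test.xlsx")
	if err != nil {
		fmt.Printf("Error opening file: %s\n", err)
		return
	}
	defer file.Close()

	// XLSX2Text needs the full size of the input file.
	stat, err := file.Stat()
	if err != nil {
		fmt.Printf("Error reading file info: %s\n", err)
		return
	}

	buffer := bytes.NewBuffer(make([]byte, 0, sizeLimit))

	// Write at most sizeLimit bytes of text; -1 means no per-sheet row limit.
	if _, err := fileconversion.XLSX2Text(file, stat.Size(), buffer, sizeLimit, -1); err != nil {
		fmt.Printf("Error extracting text: %s\n", err)
		return
	}

	fmt.Println(buffer.String())
}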

Functions

The package exports the following functions:

XLSX2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64, rowLimit int) (written int64, err error)
DOCX2Text(file io.ReaderAt, size int64) (string, error)
EPUB2Text(file io.ReaderAt, size int64, limit int64) (string, error)
HTML2Text(reader io.Reader) (pageText string, err error)
HTML2TextAndLinks(reader io.Reader, baseURL string) (pageText string, links []string, err error)
Mobi2Text(file io.ReadSeeker) (string, error)
ODS2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)
ODT2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)
PDFListContentStreams(f io.ReadSeeker, w io.Writer, size int64) (written int64, err error)
PPTX2Text(file io.ReaderAt, size int64) (string, error)
RTF2Text(inputRtf string) string
XLS2Text(reader io.ReadSeeker, writer io.Writer, size int64) (written int64, err error)

Picture functions:

IsExcessiveLargePicture(Picture []byte) (excessive bool, err error)
CompressJPEG(Picture []byte, quality int) (compressed []byte)
ResizeCompressPicture(Picture []byte, Quality int, MaxWidth, MaxHeight uint) (compressed []byte, err error)
PDFExtractImages(input io.ReadSeeker) (images []ImageResult, err error)

Compression and container file functions:

DecompressFile(data []byte) (decompressed []byte, valid bool)
ContainerExtractFiles(data []byte, callback func(name string, size int64, date time.Time, data []byte))

Dependencies

This library uses other Go packages. Run the following commands to download them:

go get -u github.com/nwaples/rardecode
go get -u github.com/saracen/go7z
go get -u github.com/ulikunitz/xz
go get -u github.com/mattetti/filebuffer
go get -u github.com/richardlehane/mscfb
go get -u github.com/taylorskalyo/goreader/epub
go get -u github.com/PuerkitoBio/goquery
go get -u github.com/ssor/bom
go get -u github.com/levigross/exp-html
go get -u github.com/neofight/mobi/convert
go get -u github.com/neofight/mobi/headers
go get -u github.com/unidoc/unipdf
go get -u github.com/nfnt/resize
go get -u github.com/tealeg/xlsx
go get -u gopkg.in/xmlpath.v2

Tests

There are no functional tests. The only test functions are used manually for debugging.

Forks

Other packages were tested and found either insufficient or unstable. Many of them were found to be unstable, to cause crashes, and to exhaust memory due to bad programming, bad input sanitizing and bad memory management.

License

This is free and unencumbered software released into the public domain.

Note that this package includes, or consists partly of, forks or rewrites of existing open source code. Use at your own risk. Intelligence X does not provide any warranty for this library or any parts of it.

Documentation

Functions

func CompressJPEG

func CompressJPEG(Picture []byte, quality int) (compressed []byte)

CompressJPEG compresses a JPEG picture according to the given quality. Warning: If the image claims to be large (in terms of width and height), this may use a lot of memory. Use IsExcessiveLargePicture first.

func ContainerExtractFiles

func ContainerExtractFiles(data []byte, callback func(name string, size int64, date time.Time, data []byte))

ContainerExtractFiles extracts files from supported containers: ZIP, RAR, 7Z, TAR
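
The callback style can be illustrated with the standard library alone. The following is a minimal sketch for ZIP only, not the library's implementation: extractZIP, buildZIP and listNames are hypothetical helpers, and the callback simply mirrors the parameter list shown in the signature above.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"sort"
	"time"
)

// extractZIP passes every regular file in an in-memory ZIP archive to callback.
// Hypothetical helper for illustration; not the library's implementation.
func extractZIP(data []byte, callback func(name string, size int64, date time.Time, data []byte)) error {
	r, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return err
	}
	for _, f := range r.File {
		if f.FileInfo().IsDir() {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			continue // skip entries that cannot be opened
		}
		content, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			continue // skip entries that cannot be read completely
		}
		callback(f.Name, int64(len(content)), f.Modified, content)
	}
	return nil
}

// buildZIP creates a small archive in memory for the demonstration.
func buildZIP(files map[string]string) []byte {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for name, body := range files {
		fw, _ := w.Create(name)
		fw.Write([]byte(body))
	}
	w.Close()
	return buf.Bytes()
}

// listNames collects the extracted file names in sorted order.
func listNames(archive []byte) []string {
	var names []string
	extractZIP(archive, func(name string, size int64, date time.Time, data []byte) {
		names = append(names, name)
	})
	sort.Strings(names)
	return names
}

func main() {
	archive := buildZIP(map[string]string{"a.txt": "alpha", "b.txt": "beta"})
	err := extractZIP(archive, func(name string, size int64, date time.Time, data []byte) {
		fmt.Printf("%s: %d bytes\n", name, size)
	})
	if err != nil {
		fmt.Println("not a ZIP archive:", err)
	}
	fmt.Println(listNames(archive))
}
```

This sketch reads each entry fully into memory, so real code handling untrusted archives would likely want a per-file size cap.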

func DOC2Text

func DOC2Text(r io.Reader) (io.Reader, error)

DOC2Text reads a Microsoft Word .doc binary file from a standard io.Reader and returns a reader (actually a bytes.Buffer) which will output the plain text found in the .doc file.

func DOCX2Text

func DOCX2Text(file io.ReaderAt, size int64) (string, error)

DOCX2Text extracts the text of a Word document. Size is the full size of the input file.

func DecompressFile

func DecompressFile(data []byte) (decompressed []byte, valid bool)

DecompressFile decompresses data. It supports: GZ, BZ, BZ2, XZ
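
As a rough illustration of the (decompressed, valid) return shape, here is a minimal sketch that detects GZ and BZ2 by their magic bytes using only the standard library. It is an assumption about how such detection can work, not the library's code: decompress and gzipBytes are hypothetical names, and XZ is left out because the standard library has no XZ reader.

```go
package main

import (
	"bytes"
	"compress/bzip2"
	"compress/gzip"
	"fmt"
	"io"
)

// decompress detects GZ and BZ2 data by magic bytes and decompresses it.
// valid is false when the format is unknown or the stream is corrupt.
func decompress(data []byte) (decompressed []byte, valid bool) {
	var r io.Reader
	switch {
	case bytes.HasPrefix(data, []byte{0x1F, 0x8B}): // gzip magic
		gz, err := gzip.NewReader(bytes.NewReader(data))
		if err != nil {
			return nil, false
		}
		defer gz.Close()
		r = gz
	case bytes.HasPrefix(data, []byte("BZh")): // bzip2 magic
		r = bzip2.NewReader(bytes.NewReader(data))
	default:
		return nil, false
	}
	out, err := io.ReadAll(r)
	if err != nil {
		return nil, false
	}
	return out, true
}

// gzipBytes compresses s in memory for the demonstration.
func gzipBytes(s string) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write([]byte(s))
	w.Close()
	return buf.Bytes()
}

func main() {
	out, ok := decompress(gzipBytes("hello"))
	fmt.Println(string(out), ok)

	_, ok = decompress([]byte("plain text"))
	fmt.Println(ok)
}
```

The sketch places no cap on the decompressed size; code that handles untrusted input would probably wrap the reader in io.LimitReader.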

func EPUB2Text

func EPUB2Text(file io.ReaderAt, size int64, limit int64) (string, error)

EPUB2Text converts an EPUB ebook to text

func HTML2Text

func HTML2Text(reader io.Reader) (pageText string, err error)

HTML2Text extracts the text from the HTML

func HTML2TextAndLinks

func HTML2TextAndLinks(reader io.Reader, baseURL string) (pageText string, links []string, err error)

HTML2TextAndLinks extracts the text from the HTML and all links from the <a> and <img> tags of an HTML document. If the base URL is provided, relative links will be converted to absolute ones.
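
Converting a relative link to an absolute one can be sketched with net/url from the standard library. This only illustrates the idea and is not necessarily how the library does it; absoluteLink is a hypothetical helper and the example.com URLs are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteLink resolves href against baseURL. If baseURL is empty or either
// value fails to parse, href is returned unchanged.
func absoluteLink(baseURL, href string) string {
	if baseURL == "" {
		return href
	}
	base, err := url.Parse(baseURL)
	if err != nil {
		return href
	}
	ref, err := url.Parse(href)
	if err != nil {
		return href
	}
	return base.ResolveReference(ref).String()
}

func main() {
	base := "https://example.com/docs/page.html"
	for _, href := range []string{"intro.html", "../img/logo.png", "https://other.example/x"} {
		fmt.Println(absoluteLink(base, href))
	}
}
```

Links that are already absolute pass through unchanged, and an empty base URL leaves relative links as they are.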

func InitPDFLicense

func InitPDFLicense(key, name string)

InitPDFLicense initializes the PDF license

func IsExcessiveLargePicture

func IsExcessiveLargePicture(Picture []byte) (excessive bool, err error)

IsExcessiveLargePicture checks if the picture has a reasonable width and height, preventing a potential DoS when decoding it. This protects against the following problem: If the image claims to be large (in terms of width and height), jpeg.Decode may use a lot of memory, see https://github.com/golang/go/issues/10532.
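
A check of this kind can be sketched with image.DecodeConfig, which reads only the image header and does not allocate the pixel buffer. The sketch below is an assumption about the approach, not the library's code: isExcessive and makePNG are hypothetical, and the pixel threshold is passed in because the library's actual limit is not documented here.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	_ "image/gif"  // register the GIF decoder
	_ "image/jpeg" // register the JPEG decoder
	"image/png"    // registers the PNG decoder and provides Encode for the demo
)

// isExcessive reports whether the claimed width*height exceeds maxPixels.
// Only the header is parsed, so a huge claimed size costs almost no memory.
func isExcessive(picture []byte, maxPixels int64) (excessive bool, err error) {
	cfg, _, err := image.DecodeConfig(bytes.NewReader(picture))
	if err != nil {
		return false, err
	}
	return int64(cfg.Width)*int64(cfg.Height) > maxPixels, nil
}

// makePNG creates a blank grayscale PNG of the given size for the demonstration.
func makePNG(w, h int) []byte {
	var buf bytes.Buffer
	png.Encode(&buf, image.NewGray(image.Rect(0, 0, w, h)))
	return buf.Bytes()
}

func main() {
	pic := makePNG(100, 50) // 5000 pixels

	tooBig, err := isExcessive(pic, 1000)
	fmt.Println(tooBig, err)

	tooBig, err = isExcessive(pic, 10000)
	fmt.Println(tooBig, err)
}
```

Running the check before image.Decode, jpeg.Decode or any resize step keeps a small file with an enormous declared size from triggering a large allocation.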

func IsFileDOC

func IsFileDOC(data []byte) bool

IsFileDOC checks if the data indicates a DOC file. DOC has multiple signatures according to https://filesignatures.net/index.php?search=doc&mode=EXT, among them D0 CF 11 E0 A1 B1 1A E1.

func IsFileDOCX

func IsFileDOCX(data []byte) bool

IsFileDOCX checks if the data indicates a DOCX file. DOCX has a signature of 50 4B 03 04.

func IsFilePPT

func IsFilePPT(data []byte) bool

IsFilePPT checks if the data indicates a PPT file. PPT has multiple signatures according to https://www.filesignatures.net/index.php?page=search&search=PPT&mode=EXT, among them D0 CF 11 E0 A1 B1 1A E1. This overlaps with others (including DOC and XLS).

func IsFilePPTX

func IsFilePPTX(data []byte) bool

IsFilePPTX checks if the data indicates a PPTX file. PPTX has a signature of 50 4B 03 04. Warning: This collides with ZIP, DOCX and other zip-based files.
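
Because the signature alone cannot tell zip-based formats apart, one common way to disambiguate is to look at the archive's entry names. The sketch below assumes the conventional part names written by typical office software (word/document.xml, xl/workbook.xml, ppt/presentation.xml); it is not part of this package, and ooxmlKind and zipWith are hypothetical helpers.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
)

// ooxmlKind inspects the ZIP directory and reports "docx", "xlsx", "pptx",
// "zip" for any other valid archive, or "" if data is not a readable ZIP.
func ooxmlKind(data []byte) string {
	r, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return ""
	}
	for _, f := range r.File {
		switch f.Name {
		case "word/document.xml":
			return "docx"
		case "xl/workbook.xml":
			return "xlsx"
		case "ppt/presentation.xml":
			return "pptx"
		}
	}
	return "zip"
}

// zipWith builds an in-memory archive containing empty entries with the given names.
func zipWith(names ...string) []byte {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for _, n := range names {
		w.Create(n)
	}
	w.Close()
	return buf.Bytes()
}

func main() {
	fmt.Println(ooxmlKind(zipWith("[Content_Types].xml", "word/document.xml")))
	fmt.Println(ooxmlKind(zipWith("[Content_Types].xml", "ppt/presentation.xml")))
	fmt.Println(ooxmlKind(zipWith("readme.txt")))
}
```

A stricter check would read [Content_Types].xml instead of relying on path conventions, at the cost of parsing XML.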

func IsFileRTF

func IsFileRTF(data []byte) bool

IsFileRTF checks if the data indicates an RTF file. RTF has a signature of 7B 5C 72 74 66 31, or as a string "{\rtf1".

func IsFileXLS

func IsFileXLS(data []byte) bool

IsFileXLS checks if the data indicates an XLS file. XLS has a signature of D0 CF 11 E0 A1 B1 1A E1.

func IsFileXLSX

func IsFileXLSX(data []byte) bool

IsFileXLSX checks if the data indicates an XLSX file. XLSX has a signature of 50 4B 03 04. Warning: This collides with ZIP, DOCX and other zip-based files.

func IsFileZIP

func IsFileZIP(data []byte) bool

IsFileZIP checks if the data indicates a ZIP file. Many file formats like DOCX, XLSX, PPTX and APK are actually ZIP files. The signature is 50 4B 03 04.
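
The signature checks described for the IsFile* functions amount to comparing the first bytes of the data against known magic values. The following sketch shows that technique with the three signatures quoted above; sniff and the variable names are hypothetical and not part of the package.

```go
package main

import (
	"bytes"
	"fmt"
)

var (
	sigZIP = []byte{0x50, 0x4B, 0x03, 0x04}                         // ZIP, DOCX, XLSX, PPTX, APK, ...
	sigOLE = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1} // DOC, XLS, PPT, ...
	sigRTF = []byte(`{\rtf1`)                                       // 7B 5C 72 74 66 31
)

// sniff returns a coarse family name based on the leading bytes only.
// Formats that share a signature cannot be told apart at this level.
func sniff(data []byte) string {
	switch {
	case bytes.HasPrefix(data, sigZIP):
		return "zip"
	case bytes.HasPrefix(data, sigOLE):
		return "ole"
	case bytes.HasPrefix(data, sigRTF):
		return "rtf"
	}
	return "unknown"
}

func main() {
	fmt.Println(sniff([]byte("PK\x03\x04rest of archive")))
	fmt.Println(sniff([]byte(`{\rtf1\ansi Hello}`)))
	fmt.Println(sniff([]byte("plain text")))
}
```

bytes.HasPrefix returns false for inputs shorter than the signature, so no separate length check is needed.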

func ODS2Cells

func ODS2Cells(file io.ReaderAt, size int64) (cells []string, err error)

ODS2Cells converts an ODS file to individual cells. Size is the full size of the input file.

func ODS2Text

func ODS2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)

ODS2Text extracts the text of an OpenDocument Spreadsheet. Size is the full size of the input file.

func ODT2Text

func ODT2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)

ODT2Text extracts the text of an OpenDocument Text file. Size is the full size of the input file.

func PDFGetCreationDate

func PDFGetCreationDate(f io.ReadSeeker) (date time.Time, valid bool)

PDFGetCreationDate tries to get the creation date

func PDFListContentStreams

func PDFListContentStreams(f io.ReadSeeker, w io.Writer, size int64) (written int64, err error)

PDFListContentStreams writes all text streams in a PDF to the writer. It returns the number of characters it attempted to write (excluding "Page N" and new-lines) and an error, if any. The return value can be used to determine whether any text was extracted. The parameter size is the maximum number of bytes (not characters) to write out.

func PPTX2Text

func PPTX2Text(file io.ReaderAt, size int64) (string, error)

PPTX2Text extracts the text of a PowerPoint document. Size is the full size of the input file.

func RTF2Text

func RTF2Text(inputRtf string) string

RTF2Text removes RTF characters from the string and returns the new string.

func ResizeCompressPicture

func ResizeCompressPicture(Picture []byte, Quality int, MaxWidth, MaxHeight uint) (compressed []byte, err error)

ResizeCompressPicture scales a picture down and compresses it. It accepts GIF, JPEG, PNG as input but the output will always be JPEG. Quality specifies the output JPEG quality 0-100. Anything below 75 will noticeably reduce the picture quality. Warning: If the image claims to be large (in terms of width and height), this may use a lot of memory. Use IsExcessiveLargePicture first. Scaling a picture down is optional and only done if MaxWidth and MaxHeight are not 0. Even without rescaling, this function is useful to convert a picture into JPEG.
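
The conversion-only case (no rescaling) can be sketched with the standard library: decode whatever format is registered, then re-encode as JPEG at the requested quality. This is an illustration under that assumption and omits the resize step entirely; toJPEG and samplePNG are hypothetical helpers.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	_ "image/gif" // register the GIF decoder
	"image/jpeg"  // registers the JPEG decoder and provides Encode
	"image/png"   // registers the PNG decoder and provides Encode for the demo
)

// toJPEG decodes a GIF, JPEG or PNG picture and re-encodes it as JPEG.
func toJPEG(picture []byte, quality int) ([]byte, error) {
	img, _, err := image.Decode(bytes.NewReader(picture))
	if err != nil {
		return nil, err
	}
	var out bytes.Buffer
	if err := jpeg.Encode(&out, img, &jpeg.Options{Quality: quality}); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}

// samplePNG creates a small blank PNG for the demonstration.
func samplePNG() []byte {
	var buf bytes.Buffer
	png.Encode(&buf, image.NewRGBA(image.Rect(0, 0, 8, 8)))
	return buf.Bytes()
}

func main() {
	out, err := toJPEG(samplePNG(), 75)
	if err != nil {
		fmt.Println("conversion failed:", err)
		return
	}
	// JPEG data starts with the marker FF D8.
	fmt.Printf("%d bytes, header %X %X\n", len(out), out[0], out[1])
}
```

Since image.Decode allocates the full pixel buffer, a dimension check on untrusted input belongs before this call.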

func XLS2Cells

func XLS2Cells(reader io.ReadSeeker) (cells []string, err error)

XLS2Cells converts an XLS file to individual cells

func XLS2Text

func XLS2Text(reader io.ReadSeeker, writer io.Writer, size int64) (written int64, err error)

XLS2Text extracts text from an Excel sheet. It returns the number of bytes written. The parameter size is the maximum number of bytes (not characters) to write out. The whole Excel file is required even for partial text extraction. If the file is corrupted or invalid, this function returns 0 bytes written and no error.

func XLSX2Cells

func XLSX2Cells(file io.ReaderAt, size int64, rowLimit int) (cells []string, err error)

XLSX2Cells converts an XLSX file to individual cells. Size is the full size of the input file. rowLimit defines how many rows per sheet to extract; -1 means unlimited. This exists as protection against some XLSX files that may use an excessive amount of memory.

func XLSX2Text

func XLSX2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64, rowLimit int) (written int64, err error)

XLSX2Text extracts the text of an Excel sheet. Size is the full size of the input file. Limit is the output limit in bytes. rowLimit defines how many rows per sheet to extract; -1 means unlimited. This exists as protection against some XLSX files that may use an excessive amount of memory.
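
To make the idea of a byte-based output limit concrete, here is a minimal sketch of a writer that forwards at most a fixed number of bytes. It shows one possible way to enforce such a limit and is not the library's implementation; limitWriter and capped are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// limitWriter forwards at most remaining bytes to w and silently drops the rest.
type limitWriter struct {
	w         io.Writer
	remaining int64
}

func (l *limitWriter) Write(p []byte) (int, error) {
	if l.remaining <= 0 {
		return len(p), nil // report success so callers keep going
	}
	chunk := p
	if int64(len(chunk)) > l.remaining {
		chunk = chunk[:l.remaining]
	}
	n, err := l.w.Write(chunk)
	l.remaining -= int64(n)
	if err != nil {
		return n, err
	}
	return len(p), nil
}

// capped writes one cell per line through a limitWriter and returns what got through.
func capped(limit int64, cells ...string) string {
	var buf bytes.Buffer
	lw := &limitWriter{w: &buf, remaining: limit}
	for _, c := range cells {
		fmt.Fprintln(lw, c)
	}
	return buf.String()
}

func main() {
	fmt.Printf("%q\n", capped(10, "alpha", "beta", "gamma")) // prints "alpha\nbeta"
}
```

Because the limit counts bytes rather than characters, a cut can land in the middle of a multi-byte UTF-8 sequence; callers that need valid text at the boundary would have to trim the tail.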

Types

type ImageResult

type ImageResult struct {
	Image image.Image
	Name  string
}

ImageResult contains an extracted image

func PDFExtractImages

func PDFExtractImages(input io.ReadSeeker) (images []ImageResult, err error)

PDFExtractImages extracts all images from a PDF file

type PPTXDocument

type PPTXDocument struct {
	Slides []PPTXSlide
}

PPTXDocument is a PPTX document loaded into memory

func (PPTXDocument) AsText

func (doc PPTXDocument) AsText() (text string)

AsText returns the text on all slides

type PPTXSlide

type PPTXSlide struct {
	SlideNumber int
	//ThumbnailBase64 string
	TextContent string
}

PPTXSlide is a single PPTX slide

type SlideNumberSorter

type SlideNumberSorter []PPTXSlide

SlideNumberSorter is used for sorting

func (SlideNumberSorter) Len

func (a SlideNumberSorter) Len() int

func (SlideNumberSorter) Less

func (a SlideNumberSorter) Less(i, j int) bool

func (SlideNumberSorter) Swap

func (a SlideNumberSorter) Swap(i, j int)
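
The three methods Len, Less and Swap are what sort.Interface requires, so a SlideNumberSorter can be handed to sort.Sort. The sketch below redeclares the two types locally so it runs on its own; the method bodies and the sortedText helper are assumptions, since the source shows only the signatures.

```go
package main

import (
	"fmt"
	"sort"
)

// PPTXSlide mirrors the documented struct (without the commented-out field).
type PPTXSlide struct {
	SlideNumber int
	TextContent string
}

// SlideNumberSorter orders slides by ascending SlideNumber (assumed ordering).
type SlideNumberSorter []PPTXSlide

func (a SlideNumberSorter) Len() int           { return len(a) }
func (a SlideNumberSorter) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a SlideNumberSorter) Less(i, j int) bool { return a[i].SlideNumber < a[j].SlideNumber }

// sortedText sorts the slides in place and joins their text, one slide per line.
func sortedText(slides []PPTXSlide) string {
	sort.Sort(SlideNumberSorter(slides))
	text := ""
	for _, s := range slides {
		text += s.TextContent + "\n"
	}
	return text
}

func main() {
	slides := []PPTXSlide{
		{SlideNumber: 3, TextContent: "Summary"},
		{SlideNumber: 1, TextContent: "Title"},
		{SlideNumber: 2, TextContent: "Agenda"},
	}
	fmt.Print(sortedText(slides))
}
```

Sorting like this is useful when slides are read from a container in arbitrary order and the text needs to come out in presentation order.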

type WordDocument

type WordDocument struct {
	Paragraphs []WordParagraph
}

WordDocument is a full word doc

func WordParse

func WordParse(doc string) (WordDocument, error)

WordParse parses a word file

func (WordDocument) AsText

func (w WordDocument) AsText() string

AsText returns all text in the document

type WordParagraph

type WordParagraph struct {
	Style WordStyle `xml:"pPr>pStyle"`
	Rows  []WordRow `xml:"r"`
}

WordParagraph is a single paragraph

type WordRow

type WordRow struct {
	Text string `xml:"t"`
}

WordRow ...

type WordStyle

type WordStyle struct {
	Val string `xml:"val,attr"`
}

WordStyle ...
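
The struct tags on WordParagraph, WordRow and WordStyle suggest decoding with encoding/xml. The sketch below redeclares those three types with the documented tags and decodes a tiny hand-written WordprocessingML fragment. The wordBody wrapper, the parse helper and the sample XML are assumptions made for illustration; how WordParse itself locates paragraphs is not shown in the source.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

type WordStyle struct {
	Val string `xml:"val,attr"`
}

type WordRow struct {
	Text string `xml:"t"`
}

type WordParagraph struct {
	Style WordStyle `xml:"pPr>pStyle"`
	Rows  []WordRow `xml:"r"`
}

// wordBody is a hypothetical wrapper that collects every paragraph in the body.
type wordBody struct {
	Paragraphs []WordParagraph `xml:"body>p"`
}

const sample = `<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Hello</w:t></w:r></w:p>
<w:p><w:r><w:t>First</w:t></w:r><w:r><w:t>Second</w:t></w:r></w:p>
</w:body>
</w:document>`

// parse decodes the document XML into paragraphs.
func parse(doc string) ([]WordParagraph, error) {
	var b wordBody
	if err := xml.Unmarshal([]byte(doc), &b); err != nil {
		return nil, err
	}
	return b.Paragraphs, nil
}

func main() {
	paragraphs, err := parse(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, p := range paragraphs {
		var sb strings.Builder
		for _, r := range p.Rows {
			sb.WriteString(r.Text)
		}
		fmt.Printf("[%s] %s\n", p.Style.Val, sb.String())
	}
}
```

Tags without a namespace match elements and attributes by local name, which is why "r", "t" and "val,attr" pick up w:r, w:t and w:val here.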

Directories

Path Synopsis
odf
ods
This package implements rudimentary support for reading Open Document Spreadsheet files.
xls package use to parse the 97 -2004 microsoft xls file(".xls" suffix, NOT ".xlsx" suffix )
