Documentation ¶
Overview ¶
Package totext uses https://github.com/sajari/docconv to extract text from different file types.
Index ¶
- Variables
- func CaptureHTML(browser *rod.Browser, inputURL string) (content string, err error)
- func CleanUpHTML(content string) string
- func ConvertDocToText(filepath string) (content string, metadata map[string]string, err error)
- func ConvertDocxToText(filepath string) (content string, metadata map[string]string, err error)
- func ConvertHTMLToText(filepath string, skipPrettifyError bool) (content string, metadata map[string]string, err error)
- func ConvertOdtToText(filepath string) (content string, metadata map[string]string, err error)
- func ConvertPDFToText(filepath string) (content string, metadata map[string]string, err error)
- func ConvertPagesToText(filepath string) (content string, metadata map[string]string, err error)
- func ConvertRTFToText(filepath string) (content string, metadata map[string]string, err error)
- func ConvertURLToText(browser *rod.Browser, inputURL string, skipPrettifyError bool) (htmlFilename, content string, metadata map[string]string, err error)
- func CreateHTMLFilename(u *url.URL) string
- func DeleteFile(filepath string) error
- func FilterNonReadableCharacter(input string) string
- func GetFilename(filepath string) string
- func IsAbsPath(filepath string) bool
- func IsContentTypeHTML(contentType string) bool
- func IsHostnameValid(hostname string) bool
- func IsMIMETypeMatched(fileExt FileExtension, mime MIME) bool
- func ParseURLAndValidate(inputURL string) (*url.URL, error)
- func PrettifyHTML(filepath string) (err error)
- func ReadText(filepath string) (string, error)
- func SetCwd(filepath string) error
- func Version(appName string) *cobra.Command
- func WriteText(filepath string, content string) error
- type FileExtension
- type MIME
Constants ¶
This section is empty.
Variables ¶
var (
	AppVersion string
	BuildDate  string
	CommitHash string
	Author     string
)
Build information
Functions ¶
func CaptureHTML ¶ added in v0.3.0
CaptureHTML fetches the HTML page at the given URL and returns its complete HTML content
func CleanUpHTML ¶ added in v0.3.0
CleanUpHTML cleans up the HTML content and extracts the text content
func ConvertDocToText ¶
ConvertDocToText takes the path to a Microsoft Word .doc file and returns its text content and metadata
Dependencies:
Debian/Ubuntu: sudo apt install wv
macOS: brew install wv
func ConvertDocxToText ¶
ConvertDocxToText takes the path to a Microsoft Word .docx file and returns its text content and metadata
func ConvertHTMLToText ¶ added in v0.3.0
func ConvertHTMLToText(filepath string, skipPrettifyError bool) (content string, metadata map[string]string, err error)
ConvertHTMLToText takes the path to an HTML file and returns its text content and metadata
func ConvertOdtToText ¶
ConvertOdtToText takes the path to an OpenDocument Text (.odt) file and returns its text content and metadata
func ConvertPDFToText ¶
ConvertPDFToText takes the path to a PDF file and returns its text content and metadata
Dependencies:
Debian/Ubuntu: sudo apt install poppler-utils
macOS: brew install poppler
func ConvertPagesToText ¶
ConvertPagesToText takes the path to an Apple Pages (.pages) file and returns its text content and metadata
func ConvertRTFToText ¶
ConvertRTFToText takes the path to an RTF file and returns its text content and metadata
Dependencies:
Debian/Ubuntu: sudo apt install unrtf
macOS: brew install unrtf
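The Dependencies notes for ConvertDocToText, ConvertPDFToText and ConvertRTFToText name system packages that must be installed before those functions can work. A caller may want to check for them up front. The sketch below is not part of this package: `missingTools` is a hypothetical helper, and the binary names `wvText`, `pdftotext` and `unrtf` are assumptions about which executables the wv, poppler and unrtf packages put on PATH.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools returns the names from tools that cannot be found on PATH.
// Hypothetical helper for illustration; it is not exported by this package.
func missingTools(tools []string) []string {
	var missing []string
	for _, name := range tools {
		if _, err := exec.LookPath(name); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Assumed binary names for the wv, poppler and unrtf packages.
	tools := []string{"wvText", "pdftotext", "unrtf"}
	if missing := missingTools(tools); len(missing) > 0 {
		fmt.Println("missing external tools:", missing)
		return
	}
	fmt.Println("all external tools found")
}
```

Running such a check at startup gives a clearer error than a failed conversion later on.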
func ConvertURLToText ¶ added in v0.3.0
func ConvertURLToText(browser *rod.Browser, inputURL string, skipPrettifyError bool) (htmlFilename, content string, metadata map[string]string, err error)
ConvertURLToText fetches the HTML page at the given URL and returns its text content and metadata
func CreateHTMLFilename ¶ added in v0.3.0
CreateHTMLFilename generates a filename for the HTML file from the URL
func FilterNonReadableCharacter ¶ added in v0.9.0
FilterNonReadableCharacter filters out non-readable characters from the input string
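Which characters the package treats as non-readable is not specified. One plausible reading, shown here as a sketch, is to keep printable runes plus tabs and newlines and drop everything else. `filterNonPrintable` is a hypothetical name, and the rule it applies is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// filterNonPrintable drops runes that are neither printable nor
// tab/newline. Assumed rule; the package's own filter may differ.
func filterNonPrintable(input string) string {
	return strings.Map(func(r rune) rune {
		if r == '\n' || r == '\t' || unicode.IsPrint(r) {
			return r
		}
		return -1 // a negative return value removes the rune
	}, input)
}

func main() {
	fmt.Printf("%q\n", filterNonPrintable("a\x00b\tc\n"))
}
```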
func GetFilename ¶
GetFilename returns the filename of a file
func IsContentTypeHTML ¶ added in v0.6.0
IsContentTypeHTML checks if the content type is HTML
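A check like this usually has to tolerate parameters such as `charset`. The sketch below shows one way to do that with the standard library's `mime.ParseMediaType`. `isHTMLContentType` is a hypothetical stand-in and may not match how IsContentTypeHTML is implemented.

```go
package main

import (
	"fmt"
	"mime"
)

// isHTMLContentType parses a Content-Type header value and reports
// whether its media type is text/html. Illustrative only.
func isHTMLContentType(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false // empty or malformed header value
	}
	return mediaType == "text/html"
}

func main() {
	fmt.Println(isHTMLContentType("text/html; charset=utf-8"))
	fmt.Println(isHTMLContentType("application/pdf"))
}
```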
func IsHostnameValid ¶ added in v0.3.0
IsHostnameValid validates the hostname
func IsMIMETypeMatched ¶ added in v0.2.0
func IsMIMETypeMatched(fileExt FileExtension, mime MIME) bool
IsMIMETypeMatched reports whether the MIME type (for example, one read from a *multipart.FileHeader) matches the given file extension
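The comparison can be pictured as a lookup from extension to expected MIME type. The sketch below redeclares the two types locally and fills the table with the constant values listed under Types. The table `mimeByExtension` and the function `isMIMETypeMatched` are assumptions for illustration, not the package's implementation.

```go
package main

import "fmt"

// Local copies of the package's string types, for a standalone example.
type FileExtension string
type MIME string

// mimeByExtension pairs each extension with the MIME value documented
// for it. The lookup-table approach itself is an assumption.
var mimeByExtension = map[FileExtension]MIME{
	"doc":   "application/msword",
	"docx":  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	"html":  "text/html",
	"json":  "application/json",
	"md":    "text/markdown",
	"odt":   "application/vnd.oasis.opendocument.text",
	"pages": "application/vnd.apple.pages",
	"pdf":   "application/pdf",
	"rtf":   "application/rtf",
	"txt":   "text/plain",
}

// isMIMETypeMatched reports whether m is the MIME type expected for ext.
func isMIMETypeMatched(ext FileExtension, m MIME) bool {
	want, ok := mimeByExtension[ext]
	return ok && want == m
}

func main() {
	fmt.Println(isMIMETypeMatched("pdf", "application/pdf"))
	fmt.Println(isMIMETypeMatched("pdf", "text/plain"))
}
```

A check of this kind is useful for rejecting uploads whose declared content type disagrees with the filename.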
func ParseURLAndValidate ¶ added in v0.3.0
ParseURLAndValidate parses the URL and validates the scheme, hostname and content type
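The scheme and hostname checks can be sketched with `net/url` alone. The example below assumes only `http` and `https` are accepted, which this page does not state, and it leaves out the content-type check because that requires a network request. `parseAndValidate` is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// parseAndValidate parses inputURL and rejects it unless it has an
// http/https scheme (assumed) and a non-empty hostname.
func parseAndValidate(inputURL string) (*url.URL, error) {
	u, err := url.Parse(inputURL)
	if err != nil {
		return nil, fmt.Errorf("parse URL: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Hostname() == "" {
		return nil, errors.New("missing hostname")
	}
	return u, nil
}

func main() {
	for _, raw := range []string{"https://example.com/a", "ftp://example.com", "example.com/page"} {
		if _, err := parseAndValidate(raw); err != nil {
			fmt.Println(raw, "->", err)
			continue
		}
		fmt.Println(raw, "-> ok")
	}
}
```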
func PrettifyHTML ¶ added in v0.3.0
PrettifyHTML prettifies the HTML content using the prettier library
Dependencies:
npm init
npm install --save-dev --save-exact prettier
Types ¶
type FileExtension ¶
type FileExtension string
FileExtension is the file extension type
const (
	DOC   FileExtension = "doc"
	DOCX  FileExtension = "docx"
	HTML  FileExtension = "html"
	JSON  FileExtension = "json"
	MD    FileExtension = "md"
	ODT   FileExtension = "odt"
	PAGES FileExtension = "pages"
	PDF   FileExtension = "pdf"
	RTF   FileExtension = "rtf"
	TXT   FileExtension = "txt"
)
File types
func GetFileExtension ¶
func GetFileExtension(filepath string) FileExtension
GetFileExtension returns the file extension of a file
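Because the FileExtension constants carry no leading dot, a helper of this kind presumably strips it. The sketch below shows that idea with `path/filepath`. `fileExtension` is a hypothetical name, and lowercasing the result is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// fileExtension returns the extension of path without the leading dot,
// lowercased (assumed) so "report.PDF" yields "pdf".
func fileExtension(path string) string {
	return strings.ToLower(strings.TrimPrefix(filepath.Ext(path), "."))
}

func main() {
	fmt.Println(fileExtension("docs/report.PDF"))
	fmt.Println(fileExtension("archive.tar.gz"))
}
```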
type MIME ¶ added in v0.2.0
type MIME string
MIME types
const (
	MimeDOC   MIME = "application/msword"
	MimeDOCX  MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
	MimeHTML  MIME = "text/html"
	MimeJSON  MIME = "application/json"
	MimeMD    MIME = "text/markdown"
	MimeODT   MIME = "application/vnd.oasis.opendocument.text"
	MimePAGES MIME = "application/vnd.apple.pages"
	MimePDF   MIME = "application/pdf"
	MimeRTF   MIME = "application/rtf"
	MimeTXT   MIME = "text/plain"
)
MIME types