gophetch

Version: v0.0.29
Published: Dec 25, 2023 | License: Apache-2.0 | Imports: 17 | Imported by: 0

README

gophetch

GoPhetch is a library for parsing and extracting metadata and other details from HTML.

This is alpha software and is not ready for production use.

Documentation

Overview

Package gophetch is a library for fetching and extracting metadata from HTML pages.

Functions

func ExtractDomain

func ExtractDomain(rawURL string) (string, error)

ExtractDomain extracts the domain from a given URL string.
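To illustrate what domain extraction involves, here is a minimal sketch built only on net/url. It is not the package's implementation: the helper name extractDomainSketch, the choice to return the hostname without its port, and the error for host-less input are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// extractDomainSketch is a hypothetical stand-in, not gophetch's ExtractDomain.
// It parses rawURL and returns its hostname with any port stripped.
func extractDomainSketch(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	host := u.Hostname()
	if host == "" {
		// Relative references such as "/a/b" parse fine but carry no host.
		return "", errors.New("no host in URL: " + rawURL)
	}
	return host, nil
}

func main() {
	domain, err := extractDomainSketch("https://www.example.com:8080/a/b?q=1")
	fmt.Println(domain, err)
}
```

Whether the real function keeps the port or strips a leading "www." is not stated in the documentation, so check its behaviour before relying on either.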

func ExtractSrcset

func ExtractSrcset(srcset string, relativeURL *url.URL) ([]string, []string)

ExtractSrcset attempts to match all srcset URLs including their descriptors, accounting for commas within the URLs.
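The sketch below shows one way to split a srcset value without breaking on commas inside URLs. It is an assumption-laden illustration, not the package's algorithm: the names parseSrcsetSketch, isDescriptor, and mustParseURL are hypothetical, and the two returned slices are assumed to hold URLs and descriptors respectively. It tokenizes on whitespace rather than commas, so it only handles candidates separated by a comma followed by whitespace.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// isDescriptor reports whether s looks like a width ("640w") or
// pixel-density ("1.5x") descriptor.
func isDescriptor(s string) bool {
	if len(s) < 2 {
		return false
	}
	unit := s[len(s)-1]
	if unit != 'w' && unit != 'x' {
		return false
	}
	_, err := strconv.ParseFloat(s[:len(s)-1], 64)
	return err == nil
}

// parseSrcsetSketch is a hypothetical stand-in for ExtractSrcset. Because URLs
// cannot contain whitespace, splitting on whitespace keeps commas that are
// part of a URL intact; only a trailing separator comma is trimmed.
func parseSrcsetSketch(srcset string, base *url.URL) (urls, descriptors []string) {
	tokens := strings.Fields(srcset)
	for i := 0; i < len(tokens); i++ {
		raw := strings.TrimSuffix(tokens[i], ",")
		if raw == "" {
			continue
		}
		desc := ""
		if i+1 < len(tokens) {
			next := strings.TrimSuffix(tokens[i+1], ",")
			if isDescriptor(next) {
				desc = next
				i++ // consume the descriptor token
			}
		}
		ref, err := url.Parse(raw)
		if err != nil {
			continue
		}
		if base != nil {
			ref = base.ResolveReference(ref)
		}
		urls = append(urls, ref.String())
		descriptors = append(descriptors, desc)
	}
	return urls, descriptors
}

// mustParseURL is a small helper for the example only.
func mustParseURL(s string) *url.URL {
	u, err := url.Parse(s)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	base := mustParseURL("https://example.com/page")
	urls, descs := parseSrcsetSketch("/img/a,b.jpg 1x, /img/c.jpg 2x", base)
	fmt.Println(urls, descs)
}
```

A srcset written without whitespace after the separating comma (for example "a.jpg 1x,b.jpg 2x") is not handled by this simplification.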

Types

type Extractor

type Extractor struct {
	Rules  map[string]rules.Rule
	Errors []error
}

Extractor is the struct that encapsulates the rules used to extract metadata from HTML.

func NewExtractor

func NewExtractor() *Extractor

NewExtractor creates a new Extractor struct with the default rules.

func (*Extractor) ApplySiteSpecificRules

func (e *Extractor) ApplySiteSpecificRules(site sites.Site)

ApplySiteSpecificRules applies the custom rules for the given site.

func (*Extractor) ExtractMetadata

func (e *Extractor) ExtractMetadata(node *html.Node, targetURL *url.URL) (metadata.Metadata, error)

ExtractMetadata extracts metadata from the given HTML node. The targetURL parameter is used to fix relative paths.
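"Fixing relative paths" generally means resolving each path found in the page against the page's own URL. The following sketch shows that step with net/url alone; the helper name resolveSketch is hypothetical, and the package may resolve paths differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveSketch resolves ref (possibly relative) against targetURL and
// returns the absolute URL. Already-absolute refs are returned unchanged.
func resolveSketch(targetURL, ref string) (string, error) {
	base, err := url.Parse(targetURL)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(r).String(), nil
}

func main() {
	abs, err := resolveSketch("https://example.com/blog/post.html", "../img/cover.png")
	fmt.Println(abs, err)
}
```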

func (*Extractor) ExtractRule

func (e *Extractor) ExtractRule(node *html.Node, targetURL *url.URL, rule rules.Rule) (rules.ExtractResult, error)

func (*Extractor) ExtractRuleByKey

func (e *Extractor) ExtractRuleByKey(node *html.Node, targetURL *url.URL, key string) (rules.ExtractResult, error)

type Gophetch

type Gophetch struct {
	Parser       *Parser
	Extractor    *Extractor
	Fetchers     []fetchers.HTMLFetcher
	SiteRegistry map[string]sites.Site
}

Gophetch is the main struct that encapsulates the parser, extractor, and fetchers.

func New

func New(fetchers ...fetchers.HTMLFetcher) *Gophetch

New creates a new Gophetch struct with the provided fetchers.

func (*Gophetch) FetchAndParse

func (g *Gophetch) FetchAndParse(targetURL string) (Result, error)

FetchAndParse accepts a target URL string as its parameter. It initiates an HTTP request to fetch the HTML content from the specified URL, parses the fetched HTML to extract metadata, and encapsulates the extracted metadata, along with the response data, into a Result struct which is then returned. This method is useful when the HTML content needs to be fetched from the internet before parsing.

func (*Gophetch) ReadAndParse

func (g *Gophetch) ReadAndParse(r io.Reader, targetURL string) (Result, error)

ReadAndParse accepts two parameters: an io.Reader containing the HTML to be parsed, and a target URL string. It reads the HTML content from the provided io.Reader, parses it to extract metadata, and encapsulates the extracted metadata, along with the response data, into a Result struct which is then returned. This method is useful when the HTML content is already available and does not need to be fetched from the internet.

func (*Gophetch) RegisterSite

func (g *Gophetch) RegisterSite(site sites.Site)

RegisterSite registers a site with the Gophetch instance. This allows the Gophetch instance to apply site-specific rules when extracting metadata from the HTML content.

type Headers

type Headers map[string][]string

Headers is a map of HTTP headers.
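Because Headers has the same underlying type as net/http's http.Header, values convert between the two without copying. The sketch below redeclares Headers locally so that it is self-contained; the helper firstValue is hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// Headers is redeclared here for the example; it mirrors the documented type.
type Headers map[string][]string

// firstValue returns the first value stored for name. It relies on
// http.Header.Get, which canonicalizes name ("content-type" becomes
// "Content-Type"), so it only finds keys stored in canonical form.
func firstValue(h Headers, name string) string {
	return http.Header(h).Get(name)
}

func main() {
	raw := http.Header{}
	raw.Set("Content-Type", "text/html; charset=utf-8")

	h := Headers(raw) // plain type conversion, no copy
	fmt.Println(firstValue(h, "content-type"))
}
```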

type ImageFetcher

type ImageFetcher interface {
	NewImageFromURL(url string, maxSize int) (*media.Media, error)
}

ImageFetcher is an interface for fetching images.

type ImageInliner

type ImageInliner struct {
	ShouldInline ShouldInlineFunc
	// contains filtered or unexported fields
}

ImageInliner is responsible for fetching and replacing images in HTML documents.

func NewImageInliner

func NewImageInliner(opts ImageInlinerOptions) *ImageInliner

NewImageInliner creates a new ImageInliner from the given options, which include the fetcher, upload function, and inline strategy.

func (*ImageInliner) InlineImages

func (inliner *ImageInliner) InlineImages(readableHTML string) (string, error)

InlineImages replaces image URLs with either base64 inline versions or cloud URLs, depending on the configured InlineStrategy.
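A "base64 inline version" of an image is normally a data URI placed in the src attribute. The sketch below shows that transformation with the standard library; dataURI and inlineSrc are hypothetical helpers, and the naive string replacement stands in for whatever HTML rewriting the package performs.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// dataURI encodes data as a base64 data URI with the given MIME type.
func dataURI(mimeType string, data []byte) string {
	return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(data)
}

// inlineSrc replaces every src="<src>" attribute in htmlStr with a data URI.
// Plain string replacement is used for brevity; it assumes double-quoted
// attributes and an exact match on the URL.
func inlineSrc(htmlStr, src, mimeType string, data []byte) string {
	old := `src="` + src + `"`
	repl := `src="` + dataURI(mimeType, data) + `"`
	return strings.ReplaceAll(htmlStr, old, repl)
}

func main() {
	out := inlineSrc(`<img src="a.png">`, "a.png", "image/png", []byte("hi"))
	fmt.Println(out)
}
```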

type ImageInlinerOptions

type ImageInlinerOptions struct {
	// ShouldInlineFunc is the function to use for determining whether an image should be inlined. Default is to inline
	// if image size is less than 100KB or if dimensions are smaller than 800x600 (based on the MaxInlinedSize, MaxWidth,
	// and MaxHeight options).
	ShouldInlineFunc ShouldInlineFunc
	// Fetcher is the ImageFetcher to use for fetching images.
	Fetcher ImageFetcher
	// UploadFunc is the function to use for uploading images to cloud storage.
	UploadFunc UploadFunc
	// InlineStrategy is the storage strategy to use. Default is InlineAll.
	InlineStrategy InlineStrategy
	// SrcsetStrategy is the strategy to use for handling srcset attributes. Default is SrcsetSmallestImage.
	SrcsetStrategy SrcsetStrategy
	// MaxContentSize is the maximum size in bytes for images to be processed and uploaded. Default is 10MB.
	MaxContentSize int64
	// MaxInlinedSize is the maximum size in bytes for images to be processed in a hybrid strategy. Default is 100KB.
	MaxInlinedSize int
	// MaxWidth is the maximum width in pixels for images to be processed in a hybrid strategy. Default is 800.
	MaxWidth int
	// MaxHeight is the maximum height in pixels for images to be processed in a hybrid strategy. Default is 600.
	MaxHeight int
	// MediaProxyURL is the URL to prefix to the image URLs when using the InlineMediaProxy strategy.
	MediaProxyURL string
	// RelativeURL is the URL to use to fix relative URLs by making them absolute.
	RelativeURL *url.URL
}

ImageInlinerOptions are options for creating a new ImageInliner.
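The default inlining rule described in the ShouldInlineFunc comment (inline when the image is under 100KB or smaller than 800x600) can be expressed as a small predicate. The sketch below is an illustration only: mediaInfo and its fields are hypothetical stand-ins for *media.Media, whose fields are not shown in this documentation, and 100KB is assumed to mean 100*1024 bytes.

```go
package main

import "fmt"

// mediaInfo is a hypothetical stand-in for *media.Media.
type mediaInfo struct {
	SizeBytes int
	Width     int
	Height    int
}

// shouldInlineFunc mirrors the shape of ShouldInlineFunc for the stand-in type.
type shouldInlineFunc func(mediaInfo) bool

// defaultShouldInlineSketch builds a predicate that inlines an image when its
// size is below maxSize OR both of its dimensions are below maxW x maxH.
func defaultShouldInlineSketch(maxSize, maxW, maxH int) shouldInlineFunc {
	return func(m mediaInfo) bool {
		smallFile := m.SizeBytes < maxSize
		smallDims := m.Width < maxW && m.Height < maxH
		return smallFile || smallDims
	}
}

func main() {
	shouldInline := defaultShouldInlineSketch(100*1024, 800, 600)
	fmt.Println(shouldInline(mediaInfo{SizeBytes: 50 * 1024, Width: 2000, Height: 2000}))
	fmt.Println(shouldInline(mediaInfo{SizeBytes: 500 * 1024, Width: 1920, Height: 1080}))
}
```

Supplying a custom ShouldInlineFunc in the options presumably replaces this rule entirely.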

type InlineStrategy

type InlineStrategy int

InlineStrategy represents the different strategies for inlining images.

const (
	// InlineAll indicates that all images should be inlined.
	InlineAll InlineStrategy = iota

	// InlineNone indicates that no images should be inlined, and all images should be uploaded to
	// cloud storage using the upload function.
	InlineNone

	// InlineHybrid indicates a hybrid approach to inlining images where images are inlined if they are smaller
	// than the MaxInlinedSize, MaxWidth, and MaxHeight options, and uploaded to cloud storage otherwise.
	InlineHybrid

	// InlineMediaProxy indicates that all images should be inlined, but the URLs should be prefixed with the proxy URL.
	InlineMediaProxy
)
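Since the constants are declared with iota, InlineAll has the value 0, which is why an ImageInlinerOptions value with no InlineStrategy set defaults to InlineAll. The sketch below redeclares the constants locally and maps each strategy to an action; the function actionFor and its string results are hypothetical and only summarize the comments above.

```go
package main

import "fmt"

// InlineStrategy and its constants are redeclared for a self-contained example.
type InlineStrategy int

const (
	InlineAll InlineStrategy = iota
	InlineNone
	InlineHybrid
	InlineMediaProxy
)

// actionFor summarizes what each strategy does with one image. The small flag
// stands for "passes the size and dimension limits" and only matters for
// InlineHybrid.
func actionFor(s InlineStrategy, small bool) string {
	switch s {
	case InlineAll:
		return "inline"
	case InlineNone:
		return "upload"
	case InlineHybrid:
		if small {
			return "inline"
		}
		return "upload"
	case InlineMediaProxy:
		return "proxy"
	default:
		return "unknown"
	}
}

func main() {
	var zero InlineStrategy // zero value equals InlineAll
	fmt.Println(zero == InlineAll, actionFor(InlineHybrid, false))
}
```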

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser is the struct that encapsulates the HTML parser.

func NewParser

func NewParser() *Parser

NewParser creates a new Parser struct.

func (*Parser) Headers

func (p *Parser) Headers() Headers

Headers returns the HTTP headers as a map.

func (*Parser) IsHTML

func (p *Parser) IsHTML() bool

IsHTML returns true if the response is HTML, false otherwise.

func (*Parser) MimeType

func (p *Parser) MimeType() string

MimeType returns the MIME type of the response.
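How the Parser derives these values is not documented here. A common approach, sketched below with the standard mime package, is to parse the Content-Type header and compare the media type. The helper names are hypothetical, and treating application/xhtml+xml as HTML is an assumption.

```go
package main

import (
	"fmt"
	"mime"
)

// mimeTypeOf returns the lower-cased media type from a Content-Type header
// value, dropping parameters such as charset. It returns "" if parsing fails.
func mimeTypeOf(contentType string) string {
	mt, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return mt
}

// isHTMLSketch reports whether the Content-Type value denotes an HTML document.
func isHTMLSketch(contentType string) bool {
	mt := mimeTypeOf(contentType)
	return mt == "text/html" || mt == "application/xhtml+xml"
}

func main() {
	ct := "text/html; charset=utf-8"
	fmt.Println(mimeTypeOf(ct), isHTMLSketch(ct))
}
```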

func (*Parser) Node

func (p *Parser) Node() *html.Node

Node returns the parsed HTML as an html.Node struct.

func (*Parser) Parse

func (p *Parser) Parse(reader io.Reader, resp *http.Response, targetURL string) error

Parse parses the HTML content from the provided io.Reader and encapsulates the parsed HTML in an html.Node struct. It also parses the HTTP headers from the provided http.Response struct. The targetURL parameter is used to fix relative paths.

func (*Parser) URL

func (p *Parser) URL() *url.URL

URL returns the target URL as a url.URL struct.

type RealImageFetcher

type RealImageFetcher struct{}

RealImageFetcher is the ImageFetcher implementation that performs actual image fetches.

func (*RealImageFetcher) NewImageFromURL

func (r *RealImageFetcher) NewImageFromURL(url string, maxSize int) (*media.Media, error)

NewImageFromURL fetches an image from the given URL.

type Result

type Result struct {
	HTMLNode    *html.Node
	Headers     map[string][]string
	IsHTML      bool
	Metadata    metadata.Metadata
	MimeType    string
	Response    *http.Response
	StatusCode  int
	FetcherName string
}

Result is the struct that encapsulates the extracted metadata, along with the response data.

type ShouldInlineFunc

type ShouldInlineFunc func(*media.Media) bool

ShouldInlineFunc is the function signature used for determining whether an image should be inlined.

type SrcsetStrategy

type SrcsetStrategy int

SrcsetStrategy represents the different strategies for handling srcset attributes.

const (
	// SrcsetSmallestImage selects the smallest image in the srcset.
	SrcsetSmallestImage SrcsetStrategy = iota

	// SrcsetLargestImage selects the largest image in the srcset.
	SrcsetLargestImage

	// SrcsetPreferredDescriptors selects an image based on the preferred descriptors.
	// Currently only looks for 2x, 1.5x, and 1x, in that order.
	SrcsetPreferredDescriptors

	// SrcsetAllImages includes all images in the srcset.
	SrcsetAllImages
)
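The four strategies amount to different ways of choosing from the parsed srcset candidates. The sketch below redeclares the constants and shows one possible selection routine. The candidate type, the pick and weight helpers, the fallback to the first candidate, and the assumption that all descriptors share one unit are all hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// SrcsetStrategy and its constants are redeclared for a self-contained example.
type SrcsetStrategy int

const (
	SrcsetSmallestImage SrcsetStrategy = iota
	SrcsetLargestImage
	SrcsetPreferredDescriptors
	SrcsetAllImages
)

// candidate is a hypothetical pairing of a srcset URL with its descriptor.
type candidate struct {
	URL        string
	Descriptor string
}

// weight returns the numeric part of a descriptor such as "2x" or "640w".
// A missing or malformed descriptor counts as 1.
func weight(desc string) float64 {
	if desc == "" {
		return 1
	}
	v, err := strconv.ParseFloat(desc[:len(desc)-1], 64)
	if err != nil {
		return 1
	}
	return v
}

// pick selects candidates according to the strategy.
func pick(cands []candidate, s SrcsetStrategy) []candidate {
	if len(cands) == 0 {
		return nil
	}
	switch s {
	case SrcsetAllImages:
		return cands
	case SrcsetPreferredDescriptors:
		// Preference order taken from the constant's comment: 2x, 1.5x, 1x.
		for _, want := range []string{"2x", "1.5x", "1x"} {
			for _, c := range cands {
				if c.Descriptor == want {
					return []candidate{c}
				}
			}
		}
		return cands[:1] // assumed fallback when nothing matches
	}
	best := cands[0]
	for _, c := range cands[1:] {
		if s == SrcsetLargestImage && weight(c.Descriptor) > weight(best.Descriptor) {
			best = c
		}
		if s == SrcsetSmallestImage && weight(c.Descriptor) < weight(best.Descriptor) {
			best = c
		}
	}
	return []candidate{best}
}

func main() {
	cands := []candidate{{"a.jpg", "1x"}, {"b.jpg", "2x"}, {"c.jpg", "3x"}}
	fmt.Println(pick(cands, SrcsetSmallestImage), pick(cands, SrcsetPreferredDescriptors))
}
```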

type UploadFunc

type UploadFunc func(*media.Media) (string, error)

UploadFunc is the function signature to use for uploading images to cloud storage.
