Documentation ¶
Overview ¶
Package gophetch is a library for fetching and extracting metadata from HTML pages.
Index ¶
- func ExtractDomain(rawURL string) (string, error)
- func ExtractSrcset(srcset string, relativeURL *url.URL) ([]string, []string)
- type Extractor
- func (e *Extractor) ApplySiteSpecificRules(site sites.Site)
- func (e *Extractor) ExtractMetadata(node *html.Node, targetURL *url.URL) (metadata.Metadata, error)
- func (e *Extractor) ExtractRule(node *html.Node, targetURL *url.URL, rule rules.Rule) (rules.ExtractResult, error)
- func (e *Extractor) ExtractRuleByKey(node *html.Node, targetURL *url.URL, key string) (rules.ExtractResult, error)
- type Gophetch
- type Headers
- type ImageFetcher
- type ImageInliner
- type ImageInlinerOptions
- type InlineStrategy
- type Parser
- type RealImageFetcher
- type Result
- type ShouldInlineFunc
- type SrcsetStrategy
- type UploadFunc
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ExtractDomain ¶
ExtractDomain extracts the domain from the given URL string.
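As an illustration of the behavior described here, the following is a minimal sketch built only on the standard library's net/url. The helper name domainOf is hypothetical, and treating "domain" as the URL's host name without its port is an assumption; the package's ExtractDomain may normalize the result differently.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// domainOf is a hypothetical helper, not the package's ExtractDomain.
// It parses rawURL and returns the host name with any port removed.
func domainOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	host := u.Hostname()
	if host == "" {
		// Inputs such as "/just/a/path" parse without error but carry no host.
		return "", errors.New("no host in URL: " + rawURL)
	}
	return host, nil
}

func main() {
	d, err := domainOf("https://example.com:8080/articles/1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d)
}
```

Returning an error for host-less input mirrors the (string, error) shape of the documented signature, so callers can tell "no domain" apart from an empty string.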
Types ¶
type Extractor ¶
Extractor is the struct that encapsulates the rules used to extract metadata from HTML.
func NewExtractor ¶
func NewExtractor() *Extractor
NewExtractor creates a new Extractor struct with the default rules.
func (*Extractor) ApplySiteSpecificRules ¶
ApplySiteSpecificRules applies the custom rules for the given site.
func (*Extractor) ExtractMetadata ¶
ExtractMetadata extracts metadata from the given HTML node. The targetURL parameter is used to resolve relative paths into absolute URLs.
func (*Extractor) ExtractRule ¶
type Gophetch ¶
type Gophetch struct {
	Parser       *Parser
	Extractor    *Extractor
	Fetchers     []fetchers.HTMLFetcher
	SiteRegistry map[string]sites.Site
}
Gophetch is the main struct that encapsulates the parser, extractor, and fetchers.
func New ¶
func New(fetchers ...fetchers.HTMLFetcher) *Gophetch
New creates a new Gophetch struct with the provided fetchers.
func (*Gophetch) FetchAndParse ¶
FetchAndParse accepts a target URL string as its parameter. It initiates an HTTP request to fetch the HTML content from the specified URL, parses the fetched HTML to extract metadata, and encapsulates the extracted metadata, along with the response data, into a Result struct which is then returned. This method is useful when the HTML content needs to be fetched from the internet before parsing.
func (*Gophetch) ReadAndParse ¶
ReadAndParse accepts two parameters: an io.Reader containing the HTML to be parsed, and a target URL string. It reads the HTML content from the provided io.Reader, parses it to extract metadata, and encapsulates the extracted metadata, along with the response data, into a Result struct which is then returned. This method is useful when the HTML content is already available and does not need to be fetched from the internet.
func (*Gophetch) RegisterSite ¶
RegisterSite registers a site with the Gophetch instance. This allows the Gophetch instance to apply site-specific rules when extracting metadata from the HTML content.
type ImageFetcher ¶
ImageFetcher is an interface for fetching images.
type ImageInliner ¶
type ImageInliner struct {
	ShouldInline ShouldInlineFunc
	// contains filtered or unexported fields
}
ImageInliner is responsible for fetching and replacing images in HTML documents.
func NewImageInliner ¶
func NewImageInliner(opts ImageInlinerOptions) *ImageInliner
NewImageInliner creates a new ImageInliner from the given options, which include the fetcher, upload function, and storage strategy.
func (*ImageInliner) InlineImages ¶
func (inliner *ImageInliner) InlineImages(readableHTML string) (string, error)
InlineImages replaces image URLs with either base64 inline versions or cloud URLs based on the set strategy.
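The base64 side of this replacement can be sketched as building a data URI and substituting it for the original image URL. The helper dataURI is hypothetical, and the plain string replacement is a simplification for illustration; nothing here reflects how InlineImages is actually implemented.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// dataURI is a hypothetical helper that encodes image bytes as a base64 data URI.
func dataURI(mimeType string, data []byte) string {
	return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(data)
}

func main() {
	const src = "https://example.com/dot.gif"
	page := `<img src="` + src + `">`

	// Only a GIF header is used as stand-in image data for this sketch.
	uri := dataURI("image/gif", []byte("GIF89a"))

	// Naive substitution; a robust version would walk the parsed HTML tree.
	fmt.Println(strings.Replace(page, src, uri, 1))
}
```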
type ImageInlinerOptions ¶
type ImageInlinerOptions struct {
	// ShouldInlineFunc is the function to use for determining whether an image should be inlined. Default is to inline
	// if image size is less than 100KB or if dimensions are smaller than 800x600 (based on the maxInlinedSize, maxWidth,
	// and maxHeight options).
	ShouldInlineFunc ShouldInlineFunc
	// Fetcher is the ImageFetcher to use for fetching images.
	Fetcher ImageFetcher
	// UploadFunc is the function to use for uploading images to cloud storage.
	UploadFunc UploadFunc
	// InlineStrategy is the storage strategy to use. Default is InlineAll.
	InlineStrategy InlineStrategy
	// SrcsetStrategy is the strategy to use for handling srcset attributes. Default is SrcsetSmallestImage.
	SrcsetStrategy SrcsetStrategy
	// MaxContentSize is the maximum size in bytes for images to be processed and uploaded. Default is 10MB.
	MaxContentSize int64
	// MaxInlinedSize is the maximum size in bytes for images to be processed in a hybrid strategy. Default is 100KB.
	MaxInlinedSize int
	// MaxWidth is the maximum width in pixels for images to be processed in a hybrid strategy. Default is 800.
	MaxWidth int
	// MaxHeight is the maximum height in pixels for images to be processed in a hybrid strategy. Default is 600.
	MaxHeight int
	// MediaProxyURL is the URL to prefix to the image URLs when using the InlineMediaProxy strategy.
	MediaProxyURL string
	// RelativeURL is the URL to use to fix relative URLs by making them absolute.
	RelativeURL *url.URL
}
ImageInlinerOptions are options for creating a new ImageInliner.
type InlineStrategy ¶
type InlineStrategy int
InlineStrategy represents the different strategies for inlining images.
const (
	// InlineAll indicates that all images should be inlined.
	InlineAll InlineStrategy = iota
	// InlineNone indicates that no images should be inlined, and all images should be uploaded to
	// cloud storage using the upload function.
	InlineNone
	// InlineHybrid indicates a hybrid approach to inlining images where images are inlined if they are smaller
	// than the maxInlinedSize, maxWidth, and maxHeight options, and uploaded to cloud storage otherwise.
	InlineHybrid
	// InlineMediaProxy indicates that all images should be inlined, but the URLs should be prefixed with the proxy URL.
	InlineMediaProxy
)
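The choice between inlining and uploading under these strategies can be sketched as follows. All names below (strategy, limits, shouldInline, decide) are hypothetical stand-ins, not the package's types. Two assumptions are made: 100KB is taken as 100 * 1024 bytes, and an image counts as small when it is under the byte limit or under both dimension limits, following the "or" wording in the ShouldInlineFunc field description. The media proxy strategy is left out of the sketch.

```go
package main

import "fmt"

// strategy is a local stand-in for the documented inline strategies.
type strategy int

const (
	inlineAll strategy = iota
	inlineNone
	inlineHybrid
)

// limits holds the documented default thresholds; the type is hypothetical.
type limits struct {
	maxInlinedSize int // bytes
	maxWidth       int // pixels
	maxHeight      int // pixels
}

var defaultLimits = limits{maxInlinedSize: 100 * 1024, maxWidth: 800, maxHeight: 600}

// shouldInline assumes an image is small enough when it is under the byte
// limit OR under both dimension limits.
func shouldInline(l limits, size, width, height int) bool {
	smallBytes := size < l.maxInlinedSize
	smallDims := width < l.maxWidth && height < l.maxHeight
	return smallBytes || smallDims
}

// decide returns "inline" or "upload" for one image under the given strategy.
func decide(s strategy, l limits, size, width, height int) string {
	switch s {
	case inlineAll:
		return "inline"
	case inlineNone:
		return "upload"
	default: // inlineHybrid
		if shouldInline(l, size, width, height) {
			return "inline"
		}
		return "upload"
	}
}

func main() {
	fmt.Println(decide(inlineHybrid, defaultLimits, 50*1024, 1920, 1080))
	fmt.Println(decide(inlineHybrid, defaultLimits, 500*1024, 1920, 1080))
}
```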
type Parser ¶
type Parser struct {
// contains filtered or unexported fields
}
Parser is the struct that encapsulates the HTML parser.
type RealImageFetcher ¶
type RealImageFetcher struct{}
RealImageFetcher is the ImageFetcher that uses the actual implementation.
func (*RealImageFetcher) NewImageFromURL ¶
NewImageFromURL fetches an image from the given URL.
type Result ¶
type Result struct {
	HTMLNode    *html.Node
	Headers     map[string][]string
	IsHTML      bool
	Metadata    metadata.Metadata
	MimeType    string
	Response    *http.Response
	StatusCode  int
	FetcherName string
}
Result is the struct that encapsulates the extracted metadata, along with the response data.
type ShouldInlineFunc ¶
ShouldInlineFunc is the function signature used for determining whether an image should be inlined.
type SrcsetStrategy ¶
type SrcsetStrategy int
SrcsetStrategy represents the different strategies for handling srcset attributes.
const (
	// SrcsetSmallestImage selects the smallest image in the srcset.
	SrcsetSmallestImage SrcsetStrategy = iota
	// SrcsetLargestImage selects the largest image in the srcset.
	SrcsetLargestImage
	// SrcsetPreferredDescriptors selects an image based on the preferred descriptors.
	// Currently only looks for 2x, 1.5x, and 1x, in that order.
	SrcsetPreferredDescriptors
	// SrcsetAllImages includes all images in the srcset.
	SrcsetAllImages
)
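Picking the smallest or largest srcset candidate can be sketched with the standard library. The names candidate, parseSrcset, and pick are hypothetical, and the sketch assumes that "smallest" and "largest" refer to the numeric part of the descriptor (for example 480w or 2x). It splits on commas, so URLs that themselves contain commas are not handled; the package's ExtractSrcset may behave differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// candidate is one srcset entry: an image URL and its descriptor value.
type candidate struct {
	url   string
	value float64 // numeric part of "480w" or "2x"; 1 when no descriptor is given
}

// parseSrcset is a simplified, hypothetical parser for srcset attribute values.
func parseSrcset(srcset string) []candidate {
	var out []candidate
	for _, part := range strings.Split(srcset, ",") {
		fields := strings.Fields(part)
		if len(fields) == 0 {
			continue
		}
		c := candidate{url: fields[0], value: 1}
		if len(fields) > 1 {
			d := fields[1]
			// Drop the trailing unit character (w or x) before parsing.
			if v, err := strconv.ParseFloat(d[:len(d)-1], 64); err == nil {
				c.value = v
			}
		}
		out = append(out, c)
	}
	return out
}

// pick returns the URL with the largest or smallest descriptor value.
// The boolean result is false when there are no candidates.
func pick(cands []candidate, largest bool) (string, bool) {
	if len(cands) == 0 {
		return "", false
	}
	best := cands[0]
	for _, c := range cands[1:] {
		if (largest && c.value > best.value) || (!largest && c.value < best.value) {
			best = c
		}
	}
	return best.url, true
}

func main() {
	cands := parseSrcset("small.jpg 480w, medium.jpg 800w, large.jpg 1600w")
	smallest, _ := pick(cands, false)
	largest, _ := pick(cands, true)
	fmt.Println(smallest, largest)
}
```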