extract

v0.1.4
Published: Dec 3, 2024 License: MIT Imports: 11 Imported by: 0

README

go-microdata-extract

A Go package for extracting structured data from HTML.

Formats supported

For currently supported formats, see Statistics

Statistics

Usage statistics of structured data formats for websites

(from https://w3techs.com/technologies/overview/structured_data, 2024-12-03)

Format        Usage   Supported
None          23.4%
OpenGraph     67.5%   yes
X Cards       52.2%   yes
JSON-LD       49.6%   yes
RDFa          39.4%   -
Microdata     24.1%   yes
Dublin Core   0.9%    -
Microformats  0.4%    -

Installation

go get github.com/aafeher/go-microdata-extract
import "github.com/aafeher/go-microdata-extract"

Usage

Create instance

To create a new instance with default settings, you can simply call the New() function.

e := extract.New()

Configuration defaults

  • syntaxes: []extract.Syntax{extract.SyntaxOpenGraph, extract.SyntaxXCards, extract.SyntaxJSONLD, extract.SyntaxMicrodata}
  • userAgent: "go-microdata-extract (+https://github.com/aafeher/go-microdata-extract/blob/main/README.md)"
  • fetchTimeout: 3 seconds

Overwrite defaults

Syntaxes

To set the syntaxes whose results you want to retrieve after processing, use the SetSyntaxes() function.

e := extract.New()
e = e.SetSyntaxes([]extract.Syntax{extract.SyntaxOpenGraph, extract.SyntaxJSONLD})

... or ...

e := extract.New().SetSyntaxes([]extract.Syntax{extract.SyntaxOpenGraph, extract.SyntaxJSONLD})

User Agent

To set the user agent, use the SetUserAgent() function.

e := extract.New()
e = e.SetUserAgent("YourUserAgent")

... or ...

e := extract.New().SetUserAgent("YourUserAgent")

Fetch timeout

To set the fetch timeout, use the SetFetchTimeout() function. The timeout is specified in seconds as a uint8 value.

e := extract.New()
e = e.SetFetchTimeout(10)

... or ...

e := extract.New().SetFetchTimeout(10)

Chaining methods

Each of these setters returns a pointer to the Extractor, so the calls can be chained in a fluent interface style:

e := extract.New().
	SetSyntaxes([]extract.Syntax{extract.SyntaxOpenGraph, extract.SyntaxJSONLD}).
	SetUserAgent("YourUserAgent").
	SetFetchTimeout(10)

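As a rough illustration of why chaining works, here is a minimal sketch of the fluent setter pattern. The `settings` type and `newSettings` function are hypothetical stand-ins, not the package's own Extractor; the only point is that each setter has a pointer receiver and returns that same pointer. When a chain is split across lines in Go, the dot has to end the line rather than start the next one, otherwise the compiler inserts a semicolon and the code does not build.

```go
package main

import "fmt"

// settings is a hypothetical stand-in for the package's Extractor.
type settings struct {
	userAgent    string
	fetchTimeout uint8 // seconds
}

// newSettings returns a value with illustrative defaults.
func newSettings() *settings {
	return &settings{userAgent: "default-agent", fetchTimeout: 3}
}

// SetUserAgent mutates the receiver and returns it so calls can be chained.
func (s *settings) SetUserAgent(ua string) *settings {
	s.userAgent = ua
	return s
}

// SetFetchTimeout mutates the receiver and returns it so calls can be chained.
func (s *settings) SetFetchTimeout(seconds uint8) *settings {
	s.fetchTimeout = seconds
	return s
}

func main() {
	// The trailing dots keep the chain a single expression.
	s := newSettings().
		SetUserAgent("YourUserAgent").
		SetFetchTimeout(10)

	fmt.Println(s.userAgent, s.fetchTimeout) // prints "YourUserAgent 10"
}
```

Because every setter returns the receiver itself, `e = e.SetUserAgent(...)` and a bare `e.SetUserAgent(...)` have the same effect in this sketch.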
Extract

Once you have properly initialized and configured your instance, you can extract structured data using the Extract() function.

The Extract() function takes in two parameters:

  • url: the URL of the webpage,
  • urlContent: an optional string pointer for the content of the URL

If you want to supply the content yourself, pass a pointer to it as the second parameter. Otherwise pass nil and Extract() fetches the content itself. Fetching and extraction run concurrently, using goroutines and the sync package.

e, err := e.Extract("https://github.com/aafeher/go-microdata-extract", nil)

In this example, structured data is extracted from "https://github.com/aafeher/go-microdata-extract". The function fetches the content itself, as we passed nil as the urlContent.
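The "provided content or fetch" behaviour described above can be sketched with the standard library alone. The `loadContent` helper below is hypothetical and is not part of this package; it only assumes that a nil pointer means "fetch the URL" and that the user agent and a timeout in whole seconds are applied to the HTTP request. Treating a non-200 status as an error is also an assumption of the sketch.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// loadContent is a hypothetical helper, not the package's implementation.
// It returns *content when one is provided; otherwise it fetches url using
// the given User-Agent and a timeout expressed in seconds.
func loadContent(url string, content *string, userAgent string, timeoutSeconds uint8) (string, error) {
	if content != nil {
		return *content, nil // caller supplied the HTML, no network access
	}

	client := &http.Client{Timeout: time.Duration(timeoutSeconds) * time.Second}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", userAgent)

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// A local test server stands in for a real website.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "<title>%s</title>", r.UserAgent())
	}))
	defer srv.Close()

	fetched, err := loadContent(srv.URL, nil, "YourUserAgent", 3)
	fmt.Println(fetched, err)

	html := "<title>provided</title>"
	provided, err := loadContent(srv.URL, &html, "YourUserAgent", 3)
	fmt.Println(provided, err)
}
```

Supplying the content yourself is useful when the page has already been downloaded, or in tests where no network access is wanted.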

Examples

Examples can be found in /examples.

Documentation

Constants

This section is empty.

Variables

SYNTAXES defines an array of metadata syntax identifiers supported for parsing.

Functions

This section is empty.

Types

type Extractor

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor is a struct used for extracting metadata from web content or a provided URL. It utilizes various processors.

func New

func New() *Extractor

New creates a new instance of Extractor with default configurations and an empty map for extracted data.

func (*Extractor) Extract

func (e *Extractor) Extract(url string, urlContent *string) (*Extractor, error)

Extract retrieves metadata from the specified URL or provided content and processes it using various parsers.

  • url: The URL to extract metadata from.
  • urlContent: Optional pointer to a string containing HTML content. If nil, the content at the URL will be fetched.

func (*Extractor) GetExtracted

func (e *Extractor) GetExtracted() map[Syntax]any

GetExtracted returns the extracted metadata from the Extractor instance as a map keyed by processor name (a Syntax value).

func (*Extractor) GetExtractedJSON

func (e *Extractor) GetExtractedJSON() json.RawMessage

GetExtractedJSON returns the extracted metadata as indented JSON in a json.RawMessage (a byte slice).
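To show what a map keyed by Syntax looks like once it is serialized, here is a minimal sketch using encoding/json. The `toJSON` helper, the two-space indent, and the sample data are assumptions for illustration; the package's actual output format may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Syntax mirrors the package's string-based identifier type.
type Syntax string

// toJSON is a hypothetical helper that renders a result map as indented JSON.
// It falls back to an empty object if the data cannot be marshalled.
func toJSON(extracted map[Syntax]any) json.RawMessage {
	b, err := json.MarshalIndent(extracted, "", "  ")
	if err != nil {
		return json.RawMessage("{}")
	}
	return b
}

func main() {
	// Sample data only; real results depend on the page being processed.
	extracted := map[Syntax]any{
		"opengraph": map[string]string{"og:title": "Example"},
	}

	// Results can be read per syntax from the map...
	for syntax, data := range extracted {
		fmt.Println(syntax, data)
	}

	// ...or serialized as a whole.
	fmt.Println(string(toJSON(extracted)))
}
```

encoding/json sorts map keys when marshalling, so the serialized form is stable even though Go map iteration order is not.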

func (*Extractor) SetFetchTimeout

func (e *Extractor) SetFetchTimeout(fetchTimeout uint8) *Extractor

SetFetchTimeout sets the HTTP client's fetch timeout value in seconds. fetchTimeout: A uint8 value representing the timeout duration in seconds. Returns the updated Extractor instance.

func (*Extractor) SetSyntaxes

func (e *Extractor) SetSyntaxes(syntaxes []Syntax) *Extractor

SetSyntaxes sets the syntaxes that the Extractor will use for parsing metadata. Filters out unsupported syntaxes. syntaxes: A slice of Syntax representing the desired syntaxes. Returns the updated Extractor instance.
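Filtering out unsupported syntaxes can be pictured as follows. This is a sketch under assumptions, not the package's code: `supported` and `filterSupported` are hypothetical names, and whether the real implementation preserves order or removes duplicates is not stated in this documentation.

```go
package main

import "fmt"

// Syntax and the constants below mirror the identifiers documented for the package.
type Syntax string

const (
	SyntaxOpenGraph Syntax = "opengraph"
	SyntaxXCards    Syntax = "xcards"
	SyntaxJSONLD    Syntax = "json-ld"
	SyntaxMicrodata Syntax = "microdata"
)

// supported is a hypothetical list of the syntaxes that can be parsed.
var supported = []Syntax{SyntaxOpenGraph, SyntaxXCards, SyntaxJSONLD, SyntaxMicrodata}

// filterSupported keeps only the requested syntaxes that appear in supported,
// preserving the order in which they were requested.
func filterSupported(requested []Syntax) []Syntax {
	allowed := make(map[Syntax]bool, len(supported))
	for _, s := range supported {
		allowed[s] = true
	}

	var kept []Syntax
	for _, s := range requested {
		if allowed[s] {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// "rdfa" is not among the supported identifiers, so it is dropped.
	fmt.Println(filterSupported([]Syntax{SyntaxJSONLD, "rdfa", SyntaxOpenGraph}))
	// prints "[json-ld opengraph]"
}
```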

func (*Extractor) SetUserAgent

func (e *Extractor) SetUserAgent(userAgent string) *Extractor

SetUserAgent sets the User-Agent header for the HTTP client used by the Extractor. userAgent: A string representing the User-Agent to set for HTTP requests. Returns the updated Extractor instance.

type Processor

type Processor struct {
	Name Syntax
	Func func() (any, []error)
}

Processor represents a data structure to hold a processor's name and function for extracting metadata.
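One way a set of such processors could be run concurrently is sketched below with goroutines, a sync.WaitGroup, and a mutex guarding the shared result map. The `runAll` function and the `errNotFound` value are hypothetical; the Syntax and Processor types are redeclared locally so the example is self-contained, and the package's internal scheduling may well differ.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Syntax and Processor are redeclared here to match the documented shapes.
type Syntax string

type Processor struct {
	Name Syntax
	Func func() (any, []error)
}

// errNotFound is a hypothetical error used only by this example.
var errNotFound = errors.New("no data found")

// runAll is a hypothetical helper: it starts one goroutine per processor,
// waits for all of them, and collects results by name plus any errors.
func runAll(processors []Processor) (map[Syntax]any, []error) {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[Syntax]any, len(processors))
		errs    []error
	)

	for _, p := range processors {
		wg.Add(1)
		go func(p Processor) {
			defer wg.Done()
			data, perrs := p.Func()

			// The map and the error slice are shared, so guard them.
			mu.Lock()
			defer mu.Unlock()
			results[p.Name] = data
			errs = append(errs, perrs...)
		}(p)
	}

	wg.Wait()
	return results, errs
}

func main() {
	processors := []Processor{
		{Name: "opengraph", Func: func() (any, []error) {
			return map[string]string{"og:title": "Example"}, nil
		}},
		{Name: "json-ld", Func: func() (any, []error) {
			return nil, []error{errNotFound}
		}},
	}

	results, errs := runAll(processors)
	fmt.Println(len(results), len(errs)) // prints "2 1"
}
```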

type Syntax

type Syntax string

const (
	// SyntaxOpenGraph is the identifier used for the Open Graph metadata syntax.
	SyntaxOpenGraph Syntax = "opengraph"

	// SyntaxXCards is the identifier used for the X Cards metadata syntax.
	SyntaxXCards Syntax = "xcards"

	// SyntaxJSONLD is the identifier used for the JSON-LD metadata syntax.
	SyntaxJSONLD Syntax = "json-ld"

	// SyntaxMicrodata is the identifier used for the W3C Microdata metadata syntax.
	SyntaxMicrodata Syntax = "microdata"
)

Directories

Path Synopsis
examples
