outlinks

package
v0.0.0-...-deda848

This package is not in the latest version of its module.
Published: Jan 14, 2025 License: Apache-2.0 Imports: 13 Imported by: 3

README

import "cloudeng.io/file/crawl/outlinks"

Functions

Func NewExtractors
func NewExtractors(errCh chan<- Errors, processor Process, extractors *content.Registry[Extractor]) crawl.Outlinks

NewExtractors creates a crawl.Outlinks from instances of the lower-level Extractor interface. For each downloaded file, the extractors that match the content's MIME type are run.

Types

Type Download
type Download struct {
	Request  download.Request
	Download download.Result
}

Download represents a single downloaded file, as opposed to download.Downloaded, which represents multiple files in the same container. It is a convenience for use by the Extractor interface.

Type ErrorDetail
type ErrorDetail struct {
	download.Result
	Error error
}

Type Errors
type Errors struct {
	Request   download.Request
	Container file.FS
	Errors    []ErrorDetail
}

Methods
func (e Errors) String() string

Type Extractor
type Extractor interface {
	// ContentType returns the mime type that this extractor is capable of handling.
	ContentType() content.Type
	// Outlinks extracts outlinks from the specified downloaded file. This
	// is generally specific to the mime type of the content being processed.
	Outlinks(ctx context.Context, depth int, download Download, contents io.Reader) ([]string, error)
	// Request creates new download requests for the specified outlinks.
	Request(depth int, download Download, outlinks []string) download.Request
}

Extractor is a lower-level interface for outlink extractors that separates three steps: extracting outlinks, filtering or rewriting them, and creating new download requests to retrieve them. This separation makes the crawl process easier to customize, for example to rewrite or otherwise manipulate link names, or to create appropriate crawl requests for different types of outlink.

Type HTML
type HTML struct {
	// contains filtered or unexported fields
}

HTML is an outlink extractor for HTML documents. It implements both crawl.Outlinks and outlinks.Extractor.

Functions
func NewHTML() *HTML

Methods
func (ho *HTML) ContentType() content.Type

func (ho *HTML) HREFs(base string, rd io.Reader) ([]string, error)

HREFs returns the hrefs found in the provided HTML document.

func (ho *HTML) IsDup(link string) bool

IsDup returns true if link has been seen before, i.e. if it has previously been passed as an argument to IsDup.
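The behavior described for IsDup, where the check itself records the link, can be sketched with a mutex-guarded set. `seenSet` is a hypothetical type written for this illustration; whether the real HTML type uses a mutex or a map internally is not stated in the documentation.

```go
package main

import (
	"fmt"
	"sync"
)

// seenSet records links that have already been checked.
type seenSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenSet() *seenSet {
	return &seenSet{seen: make(map[string]struct{})}
}

// isDup reports whether link was passed to isDup before. A link that has
// not been seen is recorded, so a second call with it returns true.
func (s *seenSet) isDup(link string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[link]; ok {
		return true
	}
	s.seen[link] = struct{}{}
	return false
}

func main() {
	s := newSeenSet()
	fmt.Println(s.isDup("https://example.com/a")) // first sighting
	fmt.Println(s.isDup("https://example.com/a")) // repeat
	fmt.Println(s.isDup("https://example.com/b")) // different link
}
```

Combining the lookup and the insert under one lock keeps the check atomic, so two goroutines that ask about the same link cannot both be told it is new.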

func (ho *HTML) Outlinks(_ context.Context, _ int, download Download, contents io.Reader) ([]string, error)

Outlinks implements Extractor.Outlinks.

func (ho *HTML) Request(depth int, download Download, outlinks []string) download.Request

Request implements Extractor.Request.

Type PassthroughProcessor
type PassthroughProcessor struct{}

PassthroughProcessor implements Process and simply returns its input.

Methods
func (pp *PassthroughProcessor) Process(outlinks []string) []string

Type Process
type Process interface {
	Process(outlink []string) []string
}

Process is an interface for processing outlinks.

Type RegexpProcessor
type RegexpProcessor struct {
	NoFollow []string // regular expressions that match links that should be ignored.
	Follow   []string // regular expressions that match links that should be followed. Follow overrides NoFollow.
	Rewrite  []string // rewrite rules, specified as textutil.RewriteRule strings, that are applied to links that are followed.
	// contains filtered or unexported fields
}

RegexpProcessor is an implementation of Process that uses regular expressions to decide whether a link is ignored (NoFollow), followed (Follow) or rewritten (Rewrite). Follow overrides NoFollow, and only links that pass both the NoFollow and Follow checks are rewritten. Each rewrite rule is applied in turn, and all of the rewritten values are returned.

Methods
func (cfg *RegexpProcessor) Compile() error

Compile compiles all of the regular expressions contained within the processor. It must be called before Process.

func (cfg *RegexpProcessor) Process(outlinks []string) []string
