robots

package
v0.0.0-...-17aa141 Latest
Warning

This package is not in the latest version of its module.

Published: Dec 17, 2024 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

Package robots implements a higher-level robots.txt interface.

The package provides a cache of parsed robots.txt structures, keyed by hostname.

Types

type Cache

type Cache struct {
	// contains filtered or unexported fields
}

Cache implements an LRU robots cache.

The cache maintains an LRU mapping from domain names to their robots.txt structures. When a new domain is seen, the cache fetches its robots.txt, parses it, and adds the result to the cache.
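
The behavior described above can be sketched with the standard `container/list` package. This is a minimal illustration, not the package's actual implementation: the `lru` type, its `get` method, and the `fetch` callback are hypothetical names, and a plain string stands in for a parsed robots.txt structure.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry pairs a hostname with a stand-in for its parsed robots.txt.
type entry struct {
	host  string
	rules string // placeholder; a real cache would hold a parsed structure
}

// lru is a hypothetical least-recently-used cache keyed by hostname.
type lru struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // hostname -> element in order
}

func newLRU(capacity int) *lru {
	return &lru{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// get returns the rules for host. On a miss it calls fetch, stores the
// result, and evicts the least recently used host if over capacity.
func (c *lru) get(host string, fetch func(string) string) string {
	if el, ok := c.items[host]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).rules
	}
	e := &entry{host: host, rules: fetch(host)}
	c.items[host] = c.order.PushFront(e)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).host)
	}
	return e.rules
}

func main() {
	fetches := 0
	fetch := func(host string) string {
		fetches++
		return "rules for " + host
	}

	c := newLRU(2)
	c.get("a.example", fetch) // miss
	c.get("b.example", fetch) // miss
	c.get("a.example", fetch) // hit, a.example becomes most recent
	c.get("c.example", fetch) // miss, evicts b.example

	fmt.Println(fetches, len(c.items)) // prints: 3 2
}
```

A real cache shared between goroutines would also need a mutex around `get`; that is left out here to keep the eviction logic visible.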

func NewCache

func NewCache(c *http.Client, capacity int) *Cache

NewCache returns a new Cache that uses the given HTTP client and cache capacity.

func (*Cache) Allowed

func (c *Cache) Allowed(ctx context.Context, req Request) (bool, error)

Allowed returns true if the request is allowed.

The method looks up the robots.txt structure for the request's domain name and checks whether the request's user agent is allowed to fetch the URL. Subsequent calls may use the cached robots.txt structures.

Note that the robots.txt lookup is simplistic: it takes the hostname and appends `/robots.txt` to it. This means that the X-Robots-Tag header and the robots meta tag are not considered.
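
As a rough illustration of that lookup, the sketch below derives a robots.txt location from a request URL using `net/url`. The helper names `robotsURL` and `robotsURLFor` are hypothetical, and keeping the request's scheme and port is an assumption; the package may build the URL differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// robotsURL returns the robots.txt location for the host of u.
// The path, query, and fragment of u are dropped.
// Assumption: the scheme and port of the original URL are preserved.
func robotsURL(u *url.URL) string {
	r := url.URL{Scheme: u.Scheme, Host: u.Host, Path: "/robots.txt"}
	return r.String()
}

// robotsURLFor parses raw and returns the robots.txt location for its host.
func robotsURLFor(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return robotsURL(u), nil
}

func main() {
	loc, err := robotsURLFor("https://example.com/a/b?q=1#frag")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(loc) // prints: https://example.com/robots.txt
}
```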

The method returns an error if the context is canceled or if a parsing error occurs.

func (*Cache) Wait

func (c *Cache) Wait(ctx context.Context, req Request) error

Wait blocks until the given request can be sent.

Some robots.txt files define a crawl delay for all or some user agents. The method blocks until the request can go through.
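
One way such blocking could work is to book a send slot per host and sleep until that slot or until the context is done. The sketch below is an assumption about the general technique, not the package's code: `delayer`, `reserve`, `waitUntil`, and `offsets` are hypothetical names, and the delay value is passed in rather than read from a robots.txt file.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// delayer tracks, per hostname, the earliest time the next request may go out.
type delayer struct {
	mu   sync.Mutex
	next map[string]time.Time
}

func newDelayer() *delayer {
	return &delayer{next: make(map[string]time.Time)}
}

// reserve returns the time at which a request to host may be sent and
// books the following slot one delay later.
func (d *delayer) reserve(host string, delay time.Duration, now time.Time) time.Time {
	d.mu.Lock()
	defer d.mu.Unlock()
	at := d.next[host]
	if at.Before(now) {
		at = now
	}
	d.next[host] = at.Add(delay)
	return at
}

// waitUntil blocks until t or until ctx is done, whichever comes first.
func waitUntil(ctx context.Context, t time.Time) error {
	d := time.Until(t)
	if d <= 0 {
		return nil
	}
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// offsets reserves n back-to-back slots for one host and reports each
// slot's offset from the start, which makes the spacing easy to inspect.
func offsets(delay time.Duration, n int) []time.Duration {
	start := time.Now()
	d := newDelayer()
	out := make([]time.Duration, n)
	for i := range out {
		out[i] = d.reserve("example.com", delay, start).Sub(start)
	}
	return out
}

func main() {
	d := newDelayer()
	ctx := context.Background()
	delay := 20 * time.Millisecond // stand-in for a crawl delay

	for i := 0; i < 3; i++ {
		at := d.reserve("example.com", delay, time.Now())
		if err := waitUntil(ctx, at); err != nil {
			fmt.Println("wait aborted:", err)
			return
		}
		fmt.Println("request", i, "may be sent")
	}
	fmt.Println(offsets(time.Second, 3)) // prints: [0s 1s 2s]
}
```

Reserving the slot under a mutex before sleeping keeps concurrent callers for the same host spaced apart, while requests to other hosts are unaffected.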

type Host

type Host struct {
	// contains filtered or unexported fields
}

Host represents a host.

The host contains the host's robots.txt structures.

type Request

type Request struct {
	UserAgent string
	URL       *url.URL
}

Request represents a request.

If the UserAgent is empty it will default to `*`.
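
The default can be pictured with a small helper. This is a sketch only: the `Request` type below mirrors the documented struct so the example compiles on its own, and `userAgent` is a hypothetical function, not part of the package.

```go
package main

import (
	"fmt"
	"net/url"
)

// Request mirrors the documented struct so this example is self-contained.
type Request struct {
	UserAgent string
	URL       *url.URL
}

// userAgent returns the agent name to match against robots.txt groups,
// falling back to the wildcard "*" when none is set.
func userAgent(r Request) string {
	if r.UserAgent == "" {
		return "*"
	}
	return r.UserAgent
}

func main() {
	u := &url.URL{Scheme: "https", Host: "example.com", Path: "/page"}
	fmt.Println(userAgent(Request{URL: u}))                     // prints: *
	fmt.Println(userAgent(Request{UserAgent: "mybot", URL: u})) // prints: mybot
}
```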
