collyresponsible

v0.0.0-...-eb4ca3b
Published: Nov 23, 2024 License: BSD-3-Clause Imports: 9 Imported by: 0

README

colly-responsible

Responsible crawling with Colly, for a better Internet.

Based on lessons learned while writing Idun and subsequently getting banned by half of the website operators...

Supported limits

  • HTTP status code 429 (Too Many Requests)
  • rel="nofollow" on links
  • robots.txt
  • actual delay between requests
  • URL tests (e.g. extension, domain)
  • Max run time
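
The "Max run time" limit can be pictured with a plain context deadline. The following is only a minimal sketch using the standard library; `crawlFor` and its `visit` callback are hypothetical names for illustration and are not part of this package, whose actual handling of `MaxRuntime` may differ.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// crawlFor is a hypothetical helper: it visits queued URLs in order until the
// queue is empty or maxRuntime has elapsed, and returns how many were visited.
func crawlFor(maxRuntime time.Duration, queue []string, visit func(string)) int {
	ctx, cancel := context.WithTimeout(context.Background(), maxRuntime)
	defer cancel()

	visited := 0
	for _, u := range queue {
		// Check the deadline before starting each new request.
		select {
		case <-ctx.Done():
			return visited
		default:
		}
		visit(u)
		visited++
	}
	return visited
}

func main() {
	queue := []string{"/a", "/b", "/c", "/d", "/e"}
	n := crawlFor(50*time.Millisecond, queue, func(string) {
		time.Sleep(20 * time.Millisecond) // stand-in for a real request
	})
	fmt.Printf("visited %d of %d URLs before the deadline\n", n, len(queue))
}
```

The deadline is checked only between requests, so a request that is already in flight is allowed to finish.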

Documentation

Constants

const (
	NoFollow        = "nofollow"
	RobotsTxt       = "robots.txt"
	UserAgentHeader = "User-Agent"
)
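
As an illustration of how a constant like `NoFollow` might be used, the sketch below checks whether a link's `rel` attribute contains the `nofollow` token. `hasNoFollow` is a hypothetical helper, not a function of this package; it relies only on the fact that `rel` is a space-separated, case-insensitive token list.

```go
package main

import (
	"fmt"
	"strings"
)

// noFollow mirrors the value of the package's NoFollow constant.
const noFollow = "nofollow"

// hasNoFollow (hypothetical) reports whether a rel attribute value
// contains the nofollow token, compared case-insensitively.
func hasNoFollow(rel string) bool {
	for _, token := range strings.Fields(rel) {
		if strings.EqualFold(token, noFollow) {
			return true
		}
	}
	return false
}

func main() {
	for _, rel := range []string{"nofollow", "noopener NoFollow", "noopener", ""} {
		fmt.Printf("rel=%q skip=%v\n", rel, hasNoFollow(rel))
	}
}
```

Matching whole tokens rather than substrings avoids false positives on unrelated values that merely contain the text "nofollow".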

Functions

func Crawl

func Crawl(profile *CrawlerProfile) (err error)

func GetRobots

func GetRobots(ctx context.Context, website, userAgent string, limiter *RequestLimiter) (*robotstxt.RobotsData, error)

func TestRobotsGroup

func TestRobotsGroup(robots *robotstxt.RobotsData, url, userAgent string) bool

Types

type CrawlerProfile

type CrawlerProfile struct {
	Ctx       context.Context
	Website   string
	UserAgent string
	// Limits
	MaxDepth   int
	MaxRuntime time.Duration
	// Colly configuration
	CollyOptions []colly.CollectorOption
	CollyLimits  *colly.LimitRule
	// Custom callbacks
	ResponseHooks []func(response *colly.Response)
	URLTests      []func(url string) bool
	URLHooks      []func(url string)
}
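
The `URLTests` field takes callbacks of type `func(url string) bool`. The sketch below shows two callbacks of that shape (a host filter and an extension filter) built with the standard library. The helper names are hypothetical, and the `allowed` function assumes that a URL is crawled only if every test returns true; the package does not document how it combines the tests.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// sameHost (hypothetical) accepts only URLs whose host equals host.
func sameHost(host string) func(rawURL string) bool {
	return func(rawURL string) bool {
		u, err := url.Parse(rawURL)
		if err != nil {
			return false
		}
		return strings.EqualFold(u.Hostname(), host)
	}
}

// skipExtensions (hypothetical) rejects URLs whose path ends in one of exts.
func skipExtensions(exts ...string) func(rawURL string) bool {
	blocked := make(map[string]bool, len(exts))
	for _, e := range exts {
		blocked[strings.ToLower(e)] = true
	}
	return func(rawURL string) bool {
		u, err := url.Parse(rawURL)
		if err != nil {
			return false
		}
		return !blocked[strings.ToLower(path.Ext(u.Path))]
	}
}

// allowed assumes all tests must pass; this combination rule is an assumption.
func allowed(tests []func(string) bool, rawURL string) bool {
	for _, test := range tests {
		if !test(rawURL) {
			return false
		}
	}
	return true
}

func main() {
	tests := []func(string) bool{sameHost("example.com"), skipExtensions(".pdf", ".zip")}
	for _, u := range []string{
		"https://example.com/about",
		"https://example.com/files/report.PDF",
		"https://other.org/about",
	} {
		fmt.Println(u, allowed(tests, u))
	}
}
```

Because both helpers return a `func(string) bool`, they could be placed directly in a `URLTests` slice.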

type RequestLimiter

type RequestLimiter struct {
	SleepDelay int
	// contains filtered or unexported fields
}

func NewLimiter

func NewLimiter(sleepDelay int) *RequestLimiter

func (*RequestLimiter) Decrease

func (r *RequestLimiter) Decrease()

func (*RequestLimiter) Increase

func (r *RequestLimiter) Increase()

func (*RequestLimiter) Sleep

func (r *RequestLimiter) Sleep()
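
The implementation of `RequestLimiter` is not shown here, so the following is only a sketch of one way an adaptive delay with `Increase`, `Decrease` and `Sleep` semantics could work: back off after an HTTP 429 and recover after successful responses. The `adaptiveDelay` type, its doubling and halving policy, and the use of `time.Duration` (the real `SleepDelay` is an `int` with undocumented units) are all assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// adaptiveDelay is a hypothetical stand-in, not the package's RequestLimiter.
type adaptiveDelay struct {
	mu      sync.Mutex
	base    time.Duration // lower bound; assumed to be positive
	max     time.Duration // upper bound
	current time.Duration
}

func newAdaptiveDelay(base, max time.Duration) *adaptiveDelay {
	return &adaptiveDelay{base: base, max: max, current: base}
}

// increase doubles the delay, capped at max (e.g. after a 429 response).
func (a *adaptiveDelay) increase() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.current *= 2
	if a.current > a.max {
		a.current = a.max
	}
}

// decrease halves the delay, never going below base (e.g. after a success).
func (a *adaptiveDelay) decrease() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.current /= 2
	if a.current < a.base {
		a.current = a.base
	}
}

// delay returns the current delay.
func (a *adaptiveDelay) delay() time.Duration {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.current
}

// sleep blocks for the current delay.
func (a *adaptiveDelay) sleep() { time.Sleep(a.delay()) }

func main() {
	d := newAdaptiveDelay(10*time.Millisecond, 80*time.Millisecond)
	for _, status := range []int{200, 429, 429, 200} {
		d.sleep() // wait before each simulated request
		if status == 429 {
			d.increase()
		} else {
			d.decrease()
		}
		fmt.Println("status", status, "next delay", d.delay())
	}
}
```

The mutex matters because Colly can run requests in parallel, so several callbacks may adjust the delay at the same time.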

type VisitMap

type VisitMap struct {
	// contains filtered or unexported fields
}

func NewVisitMap

func NewVisitMap() *VisitMap

func (*VisitMap) Add

func (v *VisitMap) Add(url string)

func (*VisitMap) IsVisited

func (v *VisitMap) IsVisited(url string) bool
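
The fields of `VisitMap` are unexported, so its internals are not visible here. A type with this `Add` / `IsVisited` shape is commonly a mutex-guarded set of strings; the sketch below shows that pattern under the hypothetical name `visitedSet` and should not be read as the package's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// visitedSet is a hypothetical concurrency-safe set of visited URLs.
type visitedSet struct {
	mu   sync.RWMutex
	seen map[string]struct{}
}

func newVisitedSet() *visitedSet {
	return &visitedSet{seen: make(map[string]struct{})}
}

// add records url as visited.
func (s *visitedSet) add(url string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.seen[url] = struct{}{}
}

// isVisited reports whether url was recorded before.
func (s *visitedSet) isVisited(url string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.seen[url]
	return ok
}

func main() {
	visited := newVisitedSet()
	for _, u := range []string{"https://example.com/", "https://example.com/a", "https://example.com/"} {
		if visited.isVisited(u) {
			fmt.Println("skip ", u)
			continue
		}
		visited.add(u)
		fmt.Println("visit", u)
	}
}
```

Note that the separate check and add in `main` are not atomic together; a concurrent crawler would want a single check-and-add operation to avoid visiting the same URL twice.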
