robots

Published: Aug 6, 2015 License: MIT Imports: 5 Imported by: 3

README

robots

This package provides a robots.txt parser for the Robots Exclusion Protocol in the Go programming language.

The implementation follows Google's Robots.txt Specification.

The code is simple and straightforward. The structs exposed by this package consist of basic data types only, making them easy to encode and decode with one of Go's encoding packages. Although performance wasn't a design goal, this package is unlikely to become your program's bottleneck.

Installation

Run

go get github.com/slyrz/robots
Example
r := robots.New(file, "your-user-agent")
if r.Allow("/some/path") {
	// Crawl it!
	// ...
}
License

robots is released under the MIT license. You can find a copy of the MIT License in the LICENSE file.

Documentation

Overview

Package robots implements a robots.txt parser for the robots exclusion protocol.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Group

type Group struct {
	UserAgents []string
	Allow      []string
	Disallow   []string
	CrawlDelay string
}

Group stores a group found in a robots.txt file. This type is only used during parsing.

func (*Group) HasMembers

func (g *Group) HasMembers() bool

HasMembers returns true if the group has one or more of the following members: allow, disallow and crawl-delay.

func (*Group) HasUserAgents

func (g *Group) HasUserAgents() bool

HasUserAgents returns true if the group has one or more user-agents.

func (*Group) Matches

func (g *Group) Matches(name string) (bool, int)

Matches returns true if one of the group's user-agents matches the given name. The second return value is the length of the match. The length is zero for no match and wildcard matches.

type Groups

type Groups []*Group

Groups contains all groups found in a robots.txt file. This type is only used during parsing.

func NewGroups

func NewGroups(r io.Reader) (groups Groups)

NewGroups parses a robots.txt file and returns all groups found in it.

func (Groups) Find

func (g Groups) Find(name string) (result *Group)

Find returns the group that belongs to user-agent name. If no matching group is found, nil is returned.

type Robots

type Robots struct {
	CrawlDelay time.Duration
	Rules      []*Rule
}

Robots stores all relevant rules for a predefined user-agent.

func New

func New(r io.Reader, useragent string) *Robots

New parses the robots.txt file in r and returns all Rules relevant for the given user-agent.

func (*Robots) Allow

func (r *Robots) Allow(path string) bool

Allow returns true if the parsed robots.txt file allows the given path to be crawled.

type Rule

type Rule struct {
	Type    RuleType
	Length  int
	Equals  string   // Matches the whole path.
	Prefix  string   // Matches the start of a path.
	Suffix  string   // Matches the end of a path.
	Needles []string // Matches anything inside a path.
}

Rule stores a parsed allow/disallow record found in the robots.txt file.

func NewRule

func NewRule(ruleType RuleType, value string) *Rule

NewRule creates a new rule from the value element of an allow/disallow record found in the robots.txt file.

func (*Rule) Match

func (r *Rule) Match(path string) bool

Match returns true if the rule matches the given path value.

type RuleType

type RuleType uint

RuleType defines the type of a rule.

const (
	TypeDisallow RuleType = iota
	TypeAllow
)

TypeAllow means the rule allows a path to be crawled. TypeDisallow means the rule disallows crawling.
