robotstxt

Version: v1.0.0 (latest). Published: Jan 4, 2024. License: MIT.

Repository

github.com/airplayx/robotstxt

README ¶

What
====

This is an implementation of the robots.txt exclusion protocol for the Go language (golang).


Build
=====

To build the package and run its tests, run `go test` in the source directory.


Contribute
==========

Warm welcome.

* If desired, add your name to README.rst, section Who.
* Run `script/test && script/clean && echo ok`
* You can ignore linter warnings, but everything else must pass.
* Send your change as a pull request, or as a regular patch to the current maintainer (see section Who).

Thank you.


Usage
=====

As usual, no special installation is required, just

    import "github.com/airplayx/robotstxt"

run `go get` and you're ready.

1. Parse
^^^^^^^^

First, parse the robots.txt data. Use
`FromBytes(body []byte) (*RobotsData, error)` or the equivalent `FromString` for a `string`::

    robots, err := robotstxt.FromBytes([]byte("User-agent: *\nDisallow:"))
    // or, equivalently, from a string:
    robots, err = robotstxt.FromString("User-agent: *\nDisallow:")

As of 2012-10-03, `FromBytes` is the most efficient method; everything else
is a wrapper around this core function.

There are a few convenience constructors for various purposes:

* `FromResponse(*http.Response) (*RobotsData, error)` initializes robots data
from an HTTP response. It *does not* call `response.Body.Close()`::

    robots, err := robotstxt.FromResponse(resp)
    resp.Body.Close()
    if err != nil {
        log.Println("Error parsing robots.txt:", err.Error())
    }

* `FromStatusAndBytes(statusCode int, body []byte) (*RobotsData, error)` or
`FromStatusAndString`, if you prefer to read the bytes (or string) yourself.
Passing the status code applies the following logic, in line with Google's interpretation
of robots.txt files:

    * status 2xx  -> parse body with `FromBytes` and apply rules listed there.
    * status 4xx  -> allow all (even 401/403, as recommended by Google).
    * other (5xx) -> disallow all, consider this a temporary unavailability.
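
The status handling above can be sketched with the standard library alone. This is only an illustration of the three documented buckets, not the package's source: the `policy` type and the `policyForStatus` helper are hypothetical names. Codes outside the 2xx and 4xx ranges all fall into the last bucket here, following the wording of the list; the package itself may treat codes such as 3xx differently.

```go
package main

import "fmt"

// policy is a hypothetical label for the three outcomes listed above.
// It is not a type exported by the robotstxt package.
type policy string

const (
	parseBody   policy = "parse body"
	allowAll    policy = "allow all"
	disallowAll policy = "disallow all"
)

// policyForStatus mirrors the documented buckets: 2xx parses the body,
// 4xx allows everything, and any other code disallows everything.
func policyForStatus(statusCode int) policy {
	switch {
	case statusCode >= 200 && statusCode < 300:
		return parseBody
	case statusCode >= 400 && statusCode < 500:
		return allowAll
	default:
		return disallowAll
	}
}

func main() {
	for _, code := range []int{200, 403, 404, 503} {
		fmt.Printf("%d -> %s\n", code, policyForStatus(code))
	}
}
```

In real code you would not reimplement this; you would pass the status code and body to `FromStatusAndBytes` and let it decide.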

2. Query
^^^^^^^^

Parsing robots.txt content builds a kind of logic database, which you can
query with `(r *RobotsData) TestAgent(path, agent string) bool`.

Passing the agent explicitly is useful if you want to query for different agents. If you
only query for a single agent, there is a more efficient option: `RobotsData.FindGroup(userAgent string)`
returns a structure with a `.Test(path string)` method and a `.CrawlDelay time.Duration` field.

Simple query with explicit user agent. Each call will scan all rules.

::

    allow := robots.TestAgent("/", "FooBot")

Or, for better performance, look up the group once and query several paths against the same user agent.

::

    group := robots.FindGroup("BarBot")
    group.Test("/")
    group.Test("/download.mp3")
    group.Test("/news/article-2012-1")


Who
===

Honorable contributors (in no particular order):

    * Ilya Grigorik (igrigorik)
    * Martin Angers (PuerkitoBio)
    * Micha Gorelick (mynameisfiber)

Initial commit and other: Sergey Shepelev temotor@gmail.com


Flair
=====

.. image:: https://travis-ci.org/temoto/robotstxt.svg?branch=master
    :target: https://travis-ci.org/temoto/robotstxt

.. image:: https://codecov.io/gh/temoto/robotstxt/branch/master/graph/badge.svg
    :target: https://codecov.io/gh/temoto/robotstxt

.. image:: https://goreportcard.com/badge/github.com/airplayx/robotstxt
    :target: https://goreportcard.com/report/github.com/airplayx/robotstxt

Documentation ¶

Overview ¶

Package robotstxt implements the robots.txt Exclusion Protocol as specified in http://www.robotstxt.org/wc/robots.html with various extensions.

Index ¶

Variables
type Group
- func (g *Group) Test(path string) bool
type ParseError
- func (e ParseError) Error() string
type RobotsData
- func (r *RobotsData) FindGroup(agent string) (ret *Group)
- func (r *RobotsData) TestAgent(path, agent string) bool
type Rule

Variables ¶

var WhitespaceChars = []rune{' ', '\t', '\v'}

Types ¶

type Group ¶

type Group struct {
	Rules      []*Rule
	Agent      string
	CrawlDelay time.Duration
}

func (*Group) Test ¶

func (g *Group) Test(path string) bool

type ParseError ¶

type ParseError struct {
	Errs []error
}

func (ParseError) Error ¶

func (e ParseError) Error() string

type RobotsData ¶

type RobotsData struct {
	// public
	Groups      map[string]*Group
	AllowAll    bool
	DisallowAll bool
	Host        string
	Sitemaps    []string
}

func FromBytes ¶

func FromBytes(body []byte) (r *RobotsData, err error)

func FromResponse ¶

func FromResponse(res *http.Response) (*RobotsData, error)

func FromStatusAndBytes ¶

func FromStatusAndBytes(statusCode int, body []byte) (*RobotsData, error)

func FromStatusAndString ¶

func FromStatusAndString(statusCode int, body string) (*RobotsData, error)

func FromString ¶

func FromString(body string) (r *RobotsData, err error)

func (*RobotsData) FindGroup ¶

func (r *RobotsData) FindGroup(agent string) (ret *Group)

FindGroup searches for the block of declarations that applies to the specified user-agent. From Google's spec: Only one group of group-member records is valid for a particular crawler. The crawler must determine the correct group of records by finding the group with the most specific user-agent that still matches. All other Groups of records are ignored by the crawler. The user-agent is non-case-sensitive. The order of the Groups within the robots.txt file is irrelevant.
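
The selection rule quoted above (most specific matching user-agent, case-insensitive, order-independent) might look roughly like the following. This is a minimal sketch under stated assumptions, not FindGroup's actual code: `pickGroup` is a hypothetical helper, "most specific" is approximated as the longest group token contained in the crawler's user-agent string, and `*` is used only as a fallback.

```go
package main

import (
	"fmt"
	"strings"
)

// pickGroup returns the group token that is the longest one contained in ua,
// compared case-insensitively. It falls back to "*" when that group exists
// and returns "" when nothing matches. Hypothetical helper, not the
// package's implementation.
func pickGroup(groupAgents []string, ua string) string {
	ua = strings.ToLower(ua)
	best, hasWildcard := "", false
	for _, agent := range groupAgents {
		if agent == "*" {
			hasWildcard = true
			continue
		}
		// Longer tokens are treated as more specific.
		if strings.Contains(ua, strings.ToLower(agent)) && len(agent) > len(best) {
			best = agent
		}
	}
	if best == "" && hasWildcard {
		return "*"
	}
	return best
}

func main() {
	agents := []string{"*", "googlebot", "googlebot-news"}
	fmt.Println(pickGroup(agents, "Googlebot-News")) // googlebot-news
	fmt.Println(pickGroup(agents, "GoogleBot/2.1"))  // googlebot
	fmt.Println(pickGroup(agents, "FooBot"))         // *
}
```

Because the loop only keeps the longest match, the result does not depend on the order in which the groups are listed, which matches the last sentence of the quoted spec.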

func (*RobotsData) TestAgent ¶

func (r *RobotsData) TestAgent(path, agent string) bool

type Rule ¶

type Rule struct {
	Path    string
	Allow   bool
	Pattern *regexp.Regexp
}
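
To show how a list of such rules could decide a path, here is a hedged sketch using longest-match precedence, the approach described in Google's robots.txt documentation. Whether `Group.Test` resolves matches and ties exactly this way is an assumption. The lowercase `rule` type, `sampleRules`, and `allowed` are hypothetical stand-ins, and the regular expression is hand-written for the example rather than produced by the package's parser.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rule is a local stand-in shaped like the Rule struct above.
type rule struct {
	path    string
	allow   bool
	pattern *regexp.Regexp // nil for plain prefix rules
}

// sampleRules is illustrative data, not output of the package's parser.
var sampleRules = []rule{
	{path: "/", allow: true},
	{path: "/private", allow: false},
	{path: "/private/press", allow: true},
	{path: "/*.mp3", allow: false, pattern: regexp.MustCompile(`^/.*\.mp3`)},
}

// allowed applies longest-match precedence: among the rules that match path,
// the one with the longest rule path decides. No matching rule means allowed.
func allowed(rules []rule, path string) bool {
	bestLen, verdict := -1, true
	for _, r := range rules {
		var matched bool
		if r.pattern != nil {
			matched = r.pattern.MatchString(path)
		} else {
			matched = strings.HasPrefix(path, r.path)
		}
		if matched && len(r.path) > bestLen {
			bestLen, verdict = len(r.path), r.allow
		}
	}
	return verdict
}

func main() {
	fmt.Println(allowed(sampleRules, "/news"))            // true
	fmt.Println(allowed(sampleRules, "/private/data"))    // false
	fmt.Println(allowed(sampleRules, "/private/press/1")) // true
	fmt.Println(allowed(sampleRules, "/download.mp3"))    // false
}
```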

Directories ¶

Path	Synopsis
robots.txt-check
