agents

package module
v0.0.0-...-ed7e749 Latest
Warning

This package is not in the latest version of its module.

Published: Dec 21, 2024 License: MIT Imports: 11 Imported by: 2

README

crawler-user-agents

This repository contains a list of HTTP user-agents used by robots, crawlers, and spiders, as a single JSON file.

Each pattern is a regular expression. It should work out of the box with your favorite regex library.
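As a minimal sketch of what "works with your regex library" can look like in Go, the snippet below compiles one pattern with the standard regexp package and tests it against two User-Agent strings. It assumes the pattern is written in the RE2 syntax that Go accepts and that patterns are unanchored, so a match anywhere in the string counts. The helper matchesPattern is hypothetical and is not part of this package; "rogerbot" is the pattern from the Contributing example, and the second User-Agent string is made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesPattern is a hypothetical helper: it compiles pattern as a regular
// expression and reports whether it matches anywhere in userAgent.
func matchesPattern(pattern, userAgent string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchString(userAgent), nil
}

func main() {
	// Illustrative User-Agent strings; only the first one names a crawler.
	userAgents := []string{
		"rogerbot/2.3 example UA",
		"Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",
	}
	for _, ua := range userAgents {
		ok, err := matchesPattern("rogerbot", ua)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q matches: %v\n", ua, ok)
	}
}
```

When many User-Agent strings are checked, compiling each pattern once and reusing the compiled value avoids repeated compilation.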

If you use this project in a commercial product, please sponsor it.

Install

Direct download

Download the crawler-user-agents.json file from this repository directly.

Javascript

crawler-user-agents is deployed on npmjs.com: https://www.npmjs.com/package/crawler-user-agents

To install it with npm or yarn:

npm install --save crawler-user-agents
# OR
yarn add crawler-user-agents

In Node.js, you can require the package to get an array of crawler user agents.

const crawlers = require('crawler-user-agents');
console.log(crawlers);

Python

Install with pip install crawler-user-agents

Then:

import crawleruseragents
if crawleruseragents.is_crawler("Googlebot/"):
    # do something, for example:
    print("crawler detected")

or:

import crawleruseragents
indices = crawleruseragents.matching_crawlers("bingbot/2.0")
print("crawlers' indices:", indices)
print(
    "crawler's URL:",
    crawleruseragents.CRAWLER_USER_AGENTS_DATA[indices[0]]["url"]
)

Note that matching_crawlers is much slower than is_crawler when the given User-Agent does match a crawler.

Go

Use this package. It provides the global variable Crawlers (kept in sync with crawler-user-agents.json) and the functions IsCrawler and MatchingCrawlers.

Example Go program:

package main

import (
	"fmt"

	"github.com/monperrus/crawler-user-agents"
)

func main() {
	userAgent := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"

	isCrawler := agents.IsCrawler(userAgent)
	fmt.Println("isCrawler:", isCrawler)

	indices := agents.MatchingCrawlers(userAgent)
	fmt.Println("crawlers' indices:", indices)
	fmt.Println("crawler's URL:", agents.Crawlers[indices[0]].URL)
}

Output:

isCrawler: true
crawlers' indices: [237]
crawler's URL: https://discordapp.com

Contributing

I do welcome additions contributed as pull requests.

The pull requests should:

  • contain a single addition
  • specify a distinctive, relevant syntactic fragment (for example "totobot" and not "Mozilla/5 totobot v20131212.alpha1")
  • contain the pattern (generic regular expression), the discovery date (year/month/day) and the official URL of the robot
  • result in a valid JSON file (don't forget the comma between items)

Example:

{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances" : ["rogerbot/2.3 example UA"]
}
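Before opening a pull request, an entry like the one above can be checked locally. The following is a sketch only: validateEntry and the entry type are hypothetical and are not part of this repository's tooling. It checks that the JSON parses, that the pattern compiles with Go's regexp package, that the date follows the year/month/day form, and that every listed instance matches the pattern. The last check is an assumption about what a useful entry looks like, not a documented rule.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"time"
)

// entry mirrors the fields shown in the example entry (hypothetical type).
type entry struct {
	Pattern      string   `json:"pattern"`
	AdditionDate string   `json:"addition_date"`
	URL          string   `json:"url"`
	Instances    []string `json:"instances"`
}

// validateEntry is a hypothetical helper that applies the checks
// described in the lead-in to one raw JSON entry.
func validateEntry(raw []byte) error {
	var e entry
	if err := json.Unmarshal(raw, &e); err != nil {
		return fmt.Errorf("invalid JSON: %w", err)
	}
	if e.Pattern == "" || e.URL == "" {
		return fmt.Errorf("pattern and url are required")
	}
	re, err := regexp.Compile(e.Pattern)
	if err != nil {
		return fmt.Errorf("invalid pattern: %w", err)
	}
	// "2006/01/02" is Go's reference layout for year/month/day.
	if _, err := time.Parse("2006/01/02", e.AdditionDate); err != nil {
		return fmt.Errorf("invalid addition_date: %w", err)
	}
	for _, ua := range e.Instances {
		if !re.MatchString(ua) {
			return fmt.Errorf("instance %q does not match pattern %q", ua, e.Pattern)
		}
	}
	return nil
}

func main() {
	raw := []byte(`{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances": ["rogerbot/2.3 example UA"]
}`)
	if err := validateEntry(raw); err != nil {
		fmt.Println("invalid entry:", err)
		return
	}
	fmt.Println("entry looks valid")
}
```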

License

The list is under an MIT License. The versions prior to Nov 7, 2016 were under a CC-SA license.

There are a few wrapper libraries that use this data to detect bots:

Other systems for spotting robots, crawlers, and spiders that you may want to consider are:

Documentation

Constants

This section is empty.

Variables

var Crawlers = func() []Crawler {
	var crawlers []Crawler
	if err := json.Unmarshal(crawlersJson, &crawlers); err != nil {
		panic(err)
	}
	return crawlers
}()

Crawlers is the list of crawlers, built from the contents of crawler-user-agents.json.

Functions

func IsCrawler

func IsCrawler(userAgent string) bool

IsCrawler reports whether the User-Agent string matches any of the crawler patterns.

func MatchingCrawlers

func MatchingCrawlers(userAgent string) []int

Finds all crawlers matching the User Agent and returns the list of their indices in Crawlers.
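The relationship between the two functions can be illustrated with a self-contained sketch. This is not the package's actual implementation: the lower-case isCrawler and matchingCrawlers, and the three-element patterns list, are illustrative stand-ins. The sketch assumes that a User-Agent with no match produces an empty result, which is why the caller checks the length before indexing.

```go
package main

import (
	"fmt"
	"regexp"
)

// patterns is a tiny illustrative stand-in for the real crawler list.
var patterns = []*regexp.Regexp{
	regexp.MustCompile(`Googlebot`),
	regexp.MustCompile(`bingbot`),
	regexp.MustCompile(`Discordbot`),
}

// isCrawler returns as soon as any pattern matches.
func isCrawler(userAgent string) bool {
	for _, re := range patterns {
		if re.MatchString(userAgent) {
			return true
		}
	}
	return false
}

// matchingCrawlers tests every pattern and collects the indices of all matches.
func matchingCrawlers(userAgent string) []int {
	var indices []int
	for i, re := range patterns {
		if re.MatchString(userAgent) {
			indices = append(indices, i)
		}
	}
	return indices
}

func main() {
	ua := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"
	fmt.Println("isCrawler:", isCrawler(ua))

	indices := matchingCrawlers(ua)
	if len(indices) == 0 {
		// Nothing matched, so there is no first index to look up.
		fmt.Println("no matching crawler")
		return
	}
	fmt.Println("first matching pattern:", patterns[indices[0]].String())
}
```

With the real package, the same length check applies before reading agents.Crawlers[indices[0]] for a User-Agent that may not be a crawler.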

Types

type Crawler

type Crawler struct {
	// Regexp of User Agent of the crawler.
	Pattern string `json:"pattern"`

	// Discovery date.
	AdditionDate time.Time `json:"addition_date"`

	// Official url of the robot.
	URL string `json:"url"`

	// Examples of full User Agent strings.
	Instances []string `json:"instances"`
}

Crawler contains information about one crawler.

func (Crawler) MarshalJSON

func (c Crawler) MarshalJSON() ([]byte, error)

func (*Crawler) UnmarshalJSON

func (c *Crawler) UnmarshalJSON(b []byte) error
