CommonCrawler

Published: Jun 6, 2019 License: MIT

Common Crawler

🕸 A simple and easy way to extract data from Common Crawl with little or no hassle.

Notice in regards to development

I currently do not have the capacity to hire full time; however, I intend to hire someone to help build infrastructure related to CommonCrawl. All Gitcoin bounties are on hold. When I have time to invest further in this project, I will discuss bringing on a full-time DevOps developer to work on it. All payment will be made in DAI, and resource allocation will be approximately 5k/mo.

As a GUI

An Electron-based interface that works with a Go server will be available.

As a library

Install as a dependency:

go get github.com/ChrisCates/CommonCrawler

Access the library functions by importing it:

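package main

import (
  cc "github.com/ChrisCates/CommonCrawler"
)

func main() {
  // Function names are shown as in the original README. Go only exports
  // capitalized identifiers, so check the package source for the exact names.
  cc.scan()
  cc.download()
  cc.extract()
  // And so forth
}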

As a command line tool

Install from source:

go install github.com/ChrisCates/CommonCrawler

Or you can download the binary from GitHub with curl (-L follows the redirect to the raw file):

curl -L https://github.com/ChrisCates/CommonCrawler/raw/master/dist/commoncrawler -o commoncrawler
chmod +x commoncrawler

Then run as a binary:

# Output help
commoncrawler --help

# Specify configuration
commoncrawler --base-uri https://commoncrawl.s3.amazonaws.com/
commoncrawler --wet-paths wet.paths
commoncrawler --data-folder output/crawl-data
commoncrawler --start 0
commoncrawler --stop 5 # -1 will loop through all wet files from wet.paths

# Start crawling the web
commoncrawler start --stop -1
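
The flags above could be parsed with Go's standard flag package roughly as follows. This is a minimal sketch, not the tool's actual source: the options struct and the parseArgs function are hypothetical names, and only the flag names and example values come from this README.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options is a hypothetical holder for the CLI flags listed above.
type options struct {
	baseURI    string
	wetPaths   string
	dataFolder string
	start      int
	stop       int
}

// parseArgs parses arguments such as ["start", "--stop", "-1"].
// An optional leading "start" subcommand is skipped first, because the
// flag package stops parsing at the first non-flag argument.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("commoncrawler", flag.ContinueOnError)
	fs.StringVar(&o.baseURI, "base-uri", "https://commoncrawl.s3.amazonaws.com/", "base URI for downloads")
	fs.StringVar(&o.wetPaths, "wet-paths", "wet.paths", "file listing WET paths")
	fs.StringVar(&o.dataFolder, "data-folder", "output/crawl-data", "folder for downloaded data")
	fs.IntVar(&o.start, "start", 0, "index of the first WET file")
	fs.IntVar(&o.stop, "stop", 5, "index to stop at; -1 means all files")
	if len(args) > 0 && args[0] == "start" {
		args = args[1:]
	}
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", o)
}
```

The flag package accepts both -stop and --stop, and a non-boolean flag consumes the next argument as its value, so --stop -1 is read as the integer -1.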

Compilation and Configuration

Installing dependencies

go get github.com/logrusorgru/aurora

Downloading data with the application

First configure the type of data you want to extract.

// Config is the preset variables for your extractor
type Config struct {
    baseURI     string
    wetPaths    string
    dataFolder  string
    matchFolder string
    start       int
    stop        int
}

// Defaults
Config{
    start:       0,
    stop:        5,
    baseURI:     "https://commoncrawl.s3.amazonaws.com/",
    wetPaths:    path.Join(cwd, "wet.paths"),
    dataFolder:  path.Join(cwd, "/output/crawl-data"),
    matchFolder: path.Join(cwd, "/output/match-data"),
}
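
To illustrate how start, stop, and baseURI might work together, here is a minimal sketch. It assumes that wet.paths holds one relative path per line, that stop is an exclusive upper bound, and that -1 means "through the last entry"; the README only confirms the -1 behavior. The helpers readPaths, selectPaths, and wetURL are hypothetical, and the sample paths are placeholders, not real Common Crawl files.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readPaths reads one path per line and skips blank lines.
func readPaths(r io.Reader) ([]string, error) {
	var paths []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if line := strings.TrimSpace(sc.Text()); line != "" {
			paths = append(paths, line)
		}
	}
	return paths, sc.Err()
}

// selectPaths returns paths[start:stop] with both bounds clamped.
// A negative stop (for example -1) selects everything from start onward.
func selectPaths(paths []string, start, stop int) []string {
	if stop < 0 || stop > len(paths) {
		stop = len(paths)
	}
	if start < 0 {
		start = 0
	}
	if start > stop {
		start = stop
	}
	return paths[start:stop]
}

// wetURL joins the base URI and a relative path with exactly one slash.
func wetURL(baseURI, p string) string {
	return strings.TrimRight(baseURI, "/") + "/" + strings.TrimLeft(p, "/")
}

func main() {
	// Placeholder content standing in for a wet.paths file.
	sample := "crawl-data/example/a.warc.wet.gz\ncrawl-data/example/b.warc.wet.gz\n"
	paths, err := readPaths(strings.NewReader(sample))
	if err != nil {
		panic(err)
	}
	for _, p := range selectPaths(paths, 0, -1) {
		fmt.Println(wetURL("https://commoncrawl.s3.amazonaws.com/", p))
	}
}
```

Clamping the bounds means the default stop of 5 does not fail on a paths file with fewer than five entries.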

With Docker

docker build -t commoncrawler .
docker run commoncrawler

Without Docker

# The -i flag was deprecated in Go 1.16 and removed in Go 1.20; omit it on newer toolchains.
go build -i -o ./dist/commoncrawler ./src/*.go
./dist/commoncrawler

Or simply run it directly:

go run src/*.go

Resources

  • MIT Licensed

  • If people are interested or need it, I can create a documentation and tutorial page at https://commoncrawl.chriscates.ca

  • You can post issues; if they are valid, I may fund them based on priority.
