Common Crawler
🕸 A simple and easy way to extract data from Common Crawl with little or no hassle.
Notice regarding development
I currently do not have the capacity to hire full time; however, I do intend to hire someone to help build infrastructure related to Common Crawl. All Gitcoin bounties are currently on hold. When I have time to invest further in this project, I will look for a full-time DevOps developer to work on it. All payment will be made in DAI, and the budget will be approximately 5k/mo.
As a GUI
An Electron-based interface that works with a Go server will be available.
As a library
Install as a dependency:
go get github.com/ChrisCates/CommonCrawler
Access the library functions by importing it:
package main

import (
	cc "github.com/ChrisCates/CommonCrawler"
)

func main() {
	// Only capitalized identifiers are exported in Go; the exact
	// function names depend on the package's exported API.
	cc.Scan()
	cc.Download()
	cc.Extract()
	// And so forth
}
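If the package is in your module cache or GOPATH, you can check which functions are actually exported with go doc github.com/ChrisCates/CommonCrawler; the call names in the snippet above are illustrative.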
As a command line tool
Install from source:
go install github.com/ChrisCates/CommonCrawler
Or you can curl the prebuilt binary from GitHub:
curl -L https://github.com/ChrisCates/CommonCrawler/raw/master/dist/commoncrawler -o commoncrawler
Then run as a binary:
# Output help
commoncrawler --help
# Specify configuration
commoncrawler --base-uri https://commoncrawl.s3.amazonaws.com/
commoncrawler --wet-paths wet.paths
commoncrawler --data-folder output/crawl-data
commoncrawler --start 0
commoncrawler --stop 5 # -1 will loop through all wet files from wet.paths
# Start crawling the web
commoncrawler start --stop -1
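The file passed to --wet-paths is the standard Common Crawl path listing: each line is a path, relative to --base-uri, pointing at one WET archive. Common Crawl publishes a wet.paths.gz for every crawl, so you can fetch and unpack one roughly like this (the crawl ID is only an example; substitute the crawl you want):
curl -L https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-35/wet.paths.gz | gunzip > wet.paths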
Compilation and Configuration
Installing dependencies
go get github.com/logrusorgru/aurora
Downloading data with the application
First configure the type of data you want to extract.
// Config is the preset variables for your extractor
type Config struct {
	baseURI     string
	wetPaths    string
	dataFolder  string
	matchFolder string
	start       int
	stop        int
}

// Defaults
Config{
	start:       0,
	stop:        5,
	baseURI:     "https://commoncrawl.s3.amazonaws.com/",
	wetPaths:    path.Join(cwd, "wet.paths"),
	dataFolder:  path.Join(cwd, "/output/crawl-data"),
	matchFolder: path.Join(cwd, "/output/match-data"),
}
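For reference, a customized configuration might look like the sketch below. The field names come from the Config struct above; because they are unexported, a literal like this has to live inside the package itself, and the concrete values (crawl range, folder names) are purely illustrative:

// exampleConfig builds an illustrative Config that processes the
// first 20 WET files and keeps matches in a separate folder.
// Requires the "log", "os", and "path" imports.
func exampleConfig() Config {
	cwd, err := os.Getwd()
	if err != nil {
		log.Fatal(err)
	}
	return Config{
		start:       0,
		stop:        20, // -1 would loop through every file in wet.paths
		baseURI:     "https://commoncrawl.s3.amazonaws.com/",
		wetPaths:    path.Join(cwd, "wet.paths"),
		dataFolder:  path.Join(cwd, "output", "crawl-data"),
		matchFolder: path.Join(cwd, "output", "match-data"),
	}
}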
With Docker
docker build -t commoncrawler .
docker run commoncrawler
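Output written inside the container is lost when it exits, so if you want the crawl data on the host, bind-mount the output directory. The container-side path below is an assumption; check the repository's Dockerfile for the actual working directory:
docker run -v "$(pwd)/output:/app/output" commoncrawler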
Without Docker
go build -i -o ./dist/commoncrawler ./src/*.go
./dist/commoncrawler
Or you can simply run it:
go run src/*.go
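Note that the -i flag was deprecated in Go 1.16 and removed in later releases, so on recent toolchains a plain go build -o ./dist/commoncrawler ./src/*.go does the same thing.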
Resources
- MIT Licensed
- If people are interested or need it, I can create a documentation and tutorial page at https://commoncrawl.chriscates.ca
- You can post issues if they are valid, and I could potentially fund them based on priority.