scrapy

command module
v0.0.1
Published: Aug 30, 2018 License: MIT Imports: 13 Imported by: 0

README

A simple web scraper

Install

  go get -u github.com/dave/scrapy

Usage

  scrapy [url]

The scrapy command gets the page at url, parses it for links, and gets all pages that are on the same domain.

Stats are printed while the job runs, and a list of URLs is printed when it finishes. You can end the job early with Ctrl+C.

Flags

Several command line flags are available:

  -length int
    	Length of the queue (default 1000)
  -timeout int
    	Request timeout in ms (default 10000)
  -url string
    	The start page (default "https://monzo.com")
  -workers int
    	Number of concurrent workers (default 5)
Library

This scraper can also be used as a library. See the scraper package.

Notes

See here for design notes and brainstorming.

Example output
Summary
-------
Queued        46
In progress   5   https://monzo.com/blog/2018/08/30/manage-your-bills
Success       22
Errors        0   

Latency
-------
   0 - 100  ***
 100 - 200 
 200 - 300 
 300 - 400  **************************
 400 - 500  ******************************
 500 - 600  ***************
 600 - 700  ***
 700 - 800  ***
 800 - 900 
 900 - 1000
1000 - 1100
1100 - 1200
1200 - 1300
1300 - 1400
1400 - 1500
1500 - 1600
1600 - 1700
1700 - 1800
1800 - 1900
1900 - 2000
2000+ 

URLs
----
https://monzo.com
https://monzo.com/-play-store-redirect
https://monzo.com/about
https://monzo.com/blog
https://monzo.com/blog/2018/07/02/publishing-our-2018-annual-report
https://monzo.com/blog/2018/07/10/making-quarterly-goals-public
https://monzo.com/blog/2018/07/25/monzo-reliability-report
https://monzo.com/blog/how-money-works
https://monzo.com/blog/latest

...

Documentation

Overview

Package main is a simple command line interface for the scraper library.

Directories

Path Synopsis
Package scraper implements a web scraper as a library
getter
Package getter defines an interface that is used to request results by URL
getter/mockgetter
Package mockgetter defines a getter.Interface that returns mock results for use in tests
getter/webgetter
Package webgetter defines a getter.Interface that gets real results by HTTP
logger
Package logger defines an interface that is used to log events and metrics during execution
logger/consolelogger
Package consolelogger defines a logger.Interface that emits logs to a writer (usually the console)
logger/mocklogger
Package mocklogger defines a logger.Interface that stores a string representation of each logged event for testing
parser
Package parser defines an interface used to parse HTML and extract links
parser/htmlparser
Package htmlparser defines a parser.Interface that parses HTML and returns the urls from anchor href attributes
parser/mockparser
Package mockparser defines a parser.Interface that returns dummy urls for a given input, and is used in tests
queuer
Package queuer defines an interface used to queue and execute an action on items
queuer/concurrentqueuer
Package concurrentqueuer defines a queuer.Interface that runs several workers concurrently on a queue