Documentation ¶
Overview ¶
Package octopus implements a concurrent web crawler. The octopus uses a pipeline of channels to crawl without blocking, and provides user-configurable options to customize the crawler's behaviour.
Features ¶
Current features of the crawler include:
- User-specifiable depth-limited crawling
- User-specified valid protocols
- User-buildable adapters that the crawler feeds its output to
- Duplicate URL filtering
- Filtering of URLs that fail a HEAD request
- User-specifiable maximum timeout between two successive URL requests
- User-specifiable maximum number of links to be crawled
Pipeline Overview ¶
The overview of the pipeline is given below:

1. Ingest
2. Link Absolution
3. Protocol Filter
4. Duplicate Filter
5. Invalid URL Filter (URLs whose HEAD request fails)
5x. (Optional) Crawl Rate Limiter
6. Make GET Request
7a. Send to Output Adapter
7b. Check for Timeout (gap between two successive outputs on this channel)
8. Max Links Crawled Limit Filter
9. Depth Limit Filter
10. Parse Page for more URLs

Note: The output from 7b is fed to 8.

1 -> 2 -> 3 -> 4 -> 5 -> (5x) -> 6 -> 7b -> 8 -> 9 -> 10 -> 1
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func New ¶
func New(opt *CrawlOptions) *octopus
New creates an instance of the octopus crawler with the given CrawlOptions.
func NewWithDefaultOptions ¶
func NewWithDefaultOptions() *octopus
NewWithDefaultOptions creates an instance of the octopus crawler with the default CrawlOptions.
Types ¶
type CrawlOptions ¶
type CrawlOptions struct {
	MaxCrawlDepth         int64
	MaxCrawledUrls        int64
	StayWithinBaseHost    bool
	CrawlRatePerSec       int64
	CrawlBurstLimitPerSec int64
	RespectRobots         bool
	IncludeBody           bool
	OpAdapter             OutputAdapter
	ValidProtocols        []string
	TimeToQuit            int64
}
CrawlOptions is used to house options for crawling.

You can specify the depth of exploration for each link, and whether the crawler should ignore host names other than the base host.

MaxCrawlDepth - the maximum depth that will be crawled for each new link.

MaxCrawledUrls - the maximum number of unique links that will be crawled. When combined with MaxCrawlDepth, both limits apply. Use -1 for an unbounded number of links (bounded only by depth of traversal).

StayWithinBaseHost - (unimplemented) ensures the crawler stays within the hostname of the level-1 link.

CrawlRatePerSec - the rate at which requests will be made (per second). If this is negative, rate limiting is disabled. Default is negative.

CrawlBurstLimitPerSec - the maximum burst capacity with which requests can be made. This must be greater than or equal to CrawlRatePerSec.

RespectRobots - (unimplemented) whether to respect robots.txt.

IncludeBody - (unimplemented) include the response body in the crawled NodeInfo (for further processing).

OpAdapter - a user-specified concrete implementation of an OutputAdapter. The crawler will pump output onto the channel returned by the implementation's Consume method.

ValidProtocols - the list of URL protocols that should be crawled.

TimeToQuit - the total time, in seconds, to wait between two new nodes being generated before the crawler quits.
func GetDefaultCrawlOptions ¶
func GetDefaultCrawlOptions() *CrawlOptions
GetDefaultCrawlOptions returns an instance of CrawlOptions with the values set to sensible defaults.
type Node ¶
type Node struct { *NodeInfo Body io.ReadCloser }
Node encloses a NodeInfo and its Body (HTML) Content.
type NodeChSet ¶
type NodeChSet struct { NodeCh chan<- *Node *StdChannels }
NodeChSet is the standard set of channels used to build the concurrency pipelines in the crawler.
func MakeDefaultNodeChSet ¶
Utility to create a NodeChSet with full access to the quit and node channels.
func MakeNodeChSet ¶
Utility function to create a NodeChSet from the given node and quit channels.
type OutputAdapter ¶
type OutputAdapter interface {
Consume() *NodeChSet
}
OutputAdapter is the interface that has to be implemented in order to handle outputs from the octopus crawler.
The octopus will call the OutputAdapter.Consume() method and deliver all relevant output and quit signals on the channels included in the received NodeChSet.
This implies that it is the responsibility of the user who implements OutputAdapter to handle processing the output of the crawler that is delivered on the NodeChSet.NodeCh.
Implementers of the interface should listen to the included channels in the output of Consume() for output from the crawler.
type StdChannels ¶
type StdChannels struct {
QuitCh chan<- int
}
StdChannels are used to hold the standard set of channels that are used for special operations. Will include channels for Logging, Statistics, etc. in the future.
Source Files ¶
- core.go
- doc.go
- modelfactory.go
- models.go
- pipe_augment_linkabsolution.go
- pipe_ctrl_limitcrawl.go
- pipe_ctrl_ratelimit.go
- pipe_filter_crawldepth.go
- pipe_filter_duplicates.go
- pipe_filter_protocol.go
- pipe_filter_urlvalidation.go
- pipe_process_htmlparsing.go
- pipe_process_requisition.go
- pipe_spl_distributor.go
- pipe_spl_ingest.go
- pipe_spl_maxdelay.go
- setup.go
- stdpipefunc.go