Documentation ¶
Overview ¶
Package octopus implements a concurrent web crawler. The octopus uses a pipeline of channels to crawl without blocking, and provides user-configurable options to customize the crawler's behaviour.
Features ¶
Current features of the crawler include:
- User-specifiable depth-limited crawling
- User-specified valid protocols
- User-buildable adapters that the crawler feeds its output to
- Duplicate URL filtering
- Filtering of URLs that fail a HEAD request
- User-specifiable maximum timeout between two successive URL requests
- User-specifiable maximum number of links to be crawled
Pipeline Overview ¶
The overview of the pipeline is given below:

1. Ingest
2. Link Absolution
3. Protocol Filter
4. Duplicate Filter
5. Invalid URL Filter (URLs whose HEAD request fails)
5x. (Optional) Crawl Rate Limiter
6. Make GET Request
7a. Send to Output Adapter
7b. Check for Timeout (gap between two successive outputs on this channel)
8. Max Links Crawled Limit Filter
9. Depth Limit Filter
10. Parse Page for more URLs

Note: The output from 7b is fed to 8.

1 -> 2 -> 3 -> 4 -> 5 -> (5x) -> 6 -> 7b -> 8 -> 9 -> 10 -> 1
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func New ¶
func New(opt *CrawlOptions) *octopus
New creates an instance of the octopus crawler with the given CrawlOptions.
func NewWithDefaultOptions ¶
func NewWithDefaultOptions() *octopus
NewWithDefaultOptions creates an instance of the octopus crawler with the default CrawlOptions.
Types ¶
type CrawlOptions ¶
type CrawlOptions struct {
	MaxCrawlDepth         int64
	MaxCrawledUrls        int64
	StayWithinBaseHost    bool
	CrawlRatePerSec       int64
	CrawlBurstLimitPerSec int64
	RespectRobots         bool
	IncludeBody           bool
	OpAdapter             OutputAdapter
	ValidProtocols        []string
	TimeToQuit            int64
}
CrawlOptions is used to house options for crawling.

You can specify the depth of exploration for each link, and whether the crawler should ignore host names other than the base host.

MaxCrawlDepth - the maximum depth that will be crawled for each new link.

MaxCrawledUrls - the maximum number of unique links that will be crawled. When combined with MaxCrawlDepth, both limits apply. Use -1 for an unbounded number of links (bounded only by depth of traversal).

StayWithinBaseHost - (unimplemented) ensures the crawler stays within the hostname of the level-1 link.

CrawlRatePerSec - the rate at which requests will be made (per second). If this is negative, rate limiting is disabled. Default is negative.

CrawlBurstLimitPerSec - the maximum burst capacity with which requests can be made. This must be greater than or equal to CrawlRatePerSec.

RespectRobots - (unimplemented) whether to respect robots.txt.

IncludeBody - (unimplemented) include the response body in the crawled NodeInfo (for further processing).

OpAdapter - a user-specified concrete implementation of an OutputAdapter. The crawler will pump output onto the channel returned by the implementation's Consume method.

ValidProtocols - the list of URL protocols that should be crawled.

TimeToQuit - the total time, in seconds, to wait between two new nodes being generated before the crawler quits.
func GetDefaultCrawlOptions ¶
func GetDefaultCrawlOptions() *CrawlOptions
GetDefaultCrawlOptions returns an instance of CrawlOptions with the values set to sensible defaults.
type Node ¶
type Node struct { *NodeInfo Body io.ReadCloser }
Node encloses a NodeInfo and its Body (HTML) Content.
type NodeChSet ¶
type NodeChSet struct { NodeCh chan<- *Node *StdChannels }
NodeChSet is the standard set of channels used to build the concurrency pipelines in the crawler.
func MakeDefaultNodeChSet ¶
Utility to create a NodeChSet with full access to the quit and node channels.
func MakeNodeChSet ¶
Utility function to create a NodeChSet from the given node and quit channels.
type OutputAdapter ¶
type OutputAdapter interface {
Consume() *NodeChSet
}
OutputAdapter is the interface that has to be implemented in order to handle outputs from the octopus crawler.
The octopus will call the OutputAdapter.Consume() method and deliver all relevant output and quit signals on the channels included in the received NodeChSet.
This implies that it is the responsibility of the user who implements OutputAdapter to handle processing the output of the crawler that is delivered on the NodeChSet.NodeCh.
Implementers of the interface should listen to the included channels in the output of Consume() for output from the crawler.
type StdChannels ¶
type StdChannels struct {
QuitCh chan<- int
}
StdChannels are used to hold the standard set of channels that are used for special operations. Will include channels for Logging, Statistics, etc. in the future.
Source Files ¶
- core.go
- doc.go
- modelfactory.go
- models.go
- pipe_augment_linkabsolution.go
- pipe_ctrl_limitcrawl.go
- pipe_ctrl_ratelimit.go
- pipe_filter_crawldepth.go
- pipe_filter_duplicates.go
- pipe_filter_protocol.go
- pipe_filter_urlvalidation.go
- pipe_process_htmlparsing.go
- pipe_process_requisition.go
- pipe_spl_distributor.go
- pipe_spl_ingest.go
- pipe_spl_maxdelay.go
- setup.go
- stdpipefunc.go