httpsyet/

directory
v0.1.9 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 26, 2018 License: MIT

README

A Crawler

Use Go Channels to Build a Crawler

@jorinvo aims to provide "an implementation that abstracts the coordination using channels" in his original post (also here),

Having looked at it while ago, having read the explanations and justifications, I thought it might be a nice usecase for pipe/s generic functions.

tl;dr

Using functions generated from pipe/s (see genny.go below), the actual concurrent network to process the site traffic becomes:

	sites, seen := siteForkSeenAttr(c.sites, site.Attr)
	for _, inp := range siteStrew(sites, size) {
		siteDoneFunc(inp, c.crawl) // sites leave inside crawler's crawl
	}
	siteDoneLeave(seen, c) // seen leave without further processing

Simple, is it not? ;-)

"No locks. No condition variables. No callbacks." - as Rob Pike loves to reiterate in his slides.

Please note: no sync-WaitGroup is needed around the parallel processes (as did the original), Also Done...'s results may be safely discarded. (The traffic congestion is monitored another way.)

So, how to get there?

  • v0 - The Original
  • v1 - The separation of concerns: new struct crawling
  • v2 - A new home for the processing network: new struct traffic
  • v3 - Improve the traffic network
  • v4 - Teach traffic to be more gentle and easy to use
  • v5 - Just for the fun of it: use pipe/m to generate methods on traffic (instead of package functions)
  • And now?

Overview

In order to abstract (latin: abstrahere = "to take away") "the coordination using channels" it is taken away into a new sourcefile, and a new struct crawling gives home to the relevant data.

This leaves crawler.go focused on the actual crawl functionality.

"The coordination using channels" is redone on a slightly higher level of abstraction: using a pipe network. Yes, channels connect the parts - but they are not the main focus anymore,

And no wait channel (and it's management) is needed to keep track of the traffic. A properly used *sync.WaitGroup is all we need.

And the code for the arrangement of parts as shown above - the body of a func (c *crawling) :

  • what arrives in c.sites is separated (by the site.Attr method)
  • new (previously unseen) sites get strewed (scattered) onto a slice of channels - size determines the parallelism
    • for each ranged channel all what needs to be done is to apply the crawl function (a method of crawling)
  • what had been seen before leaves without further processing (last line)

Please note the four vital functions are taken from the pipe/s collection!

v1 has more details - please have a look.

And please notice: we've accomplished to relive crawler.go and may leave it untoched during what follows.

v2 - A new home for the processing network: new struct traffic

As the processing network is mainly about the site channel and it's observing WaitGroup, Thus, it's natural to factor it out into a new struct traffic and into a new sourcefile. Anonymous embedding allows seamless use in crawling - and it's sourcefile gets a little more compact.

Last, but not least there is also this string called result passing through a channel. In order to improve clarity we give it a named type - go is a type safe language.

v2 has more details - please have a look.

And please notice: no need here to touch the previosly refactured crawler.go except for the one line where the result is sent and we simply need to cast it to the new type.


v3 - Improve the traffic network

It's not easy to see at first, but there is still an interesting issue hiding in the original implementation.

It may become more obvious when we imagine to crawl huge numbers - e.g. finding a thousand new urls on new each page, and going down a couple of levels.

Look: Each site's feedback spawns a new goroutine in order to queueURLs the (e.g. 1000) urls found.

And most likely these will block!

Almost each and every of these many goroutines will block most of the time, as there are still urls discovered earlier to be fetched and crawled.

And each queueURLs carries the full slice of urls - no matter how many have been sent to the processing yet. (The implementation uses a straightforward range and does not attempt to shrink the slice. Idiomatic go.)

We can do differently. No need to wast many huge slices across plenty of blocked goroutines.

A battery called �djust provides a flexibly buffered pipe. so we use SitePipeAdjust in our network and do not need to have Feed spawn the queueURLs function any more. We may feed synchonously now!

But now we also do not need to bother Feed with the need of registering new traffic (using t.Add(len(urls))) up front. Instead we use SitePipeEnter (a companion of SiteDoneLeave) at the entrance of our network processor.

Thus, the network becomes more flexible and more self-contained and gives less burden to it's surroundings.

Pushing the types site and it's related sites traffic and result into separate sub-packages is just a little more tidying - respecting the original Crawler and it's living space.

v3 has more details - please have a look.

And please notice: no need here to touch the previosly refactured crawler.go.


v4 - Teach traffic to be more gentle and easy to use

Proper use of the new struct traffic is still awkward for crawling.

Thus, time to teach traffic to behave better and more robust. Sepcifically:

  • new constructor New(): don't bother the client with initialisation - and thus now no need anymore to import "sync"
  • have Processor return a signal channel to broadcast "traffic has subsided and nothing is left to be processed"
  • lazy initialisation of this mechanism upon first Feed, and Do() only synce.Once
  • new method Done() - just a convenince to receive the broadcast channel another way
  • wrap the crawl function passed to Processor and have it register the site having left - thus: no need anymore for crawling to do so in it's crawl method.

So, crawling now is 20% shorter more more focused on it's own subject, is it not?

Please also note: "launch the results closer" now happens happily before the first "feed initial urls" - no need anymore to worry for something like "goWaitAndClose is to be used after initial traffic has been added".

The client (crawling) is free to use the channel returned from Processor (as it does now) or may even use <-crawling.Done() at any time he likes or seems fit (even before(!)) the Processor is build.

And: Done() is a method familar e.g. from the "context" package - thus easy to use and understand. Easier as is <-sites.SiteDoneWait(c.Travel, c), is it not?

v4 has more details - please have a look.

And please notice: no need here to touch the previosly refactured crawler.go.


v5 - Just for the fun of it

As You see in genny.go now pipe/m is used - and m generate methods on traffic (instead of package functions).

Further, using anyThing=Site (watch the change to the public type!) generated stuff becomes public (where intended) and can be seen in godoc.

Last, but not least, the idea to have Results in a separate package has been abandoned, as this adds more complications than benefit.


And now?

Well, it's enough, is it not? Let's call it a day.

Even so ... if we look back into the smallified crawler.go it may be seen, that there is still room to use a channel and to apply a little concurrency.

You may even spot some anti-concurrency pattern - as may come from 'classical' education and experience.

Where? Look at these links and urls - collections - passed around only after having finished building / appending them.

Any such anchor might as well be sent on a (new) channel right away from where it was found, get parsed (and filtered) and might become a direct feed for the sites processor. Just one (more) single go routine would be more than happy to do so ... concurrently.

On the other hand - using concurrency too much and or in the wrong place may lead to very poor results easily. See e.g. Go code refactoring : the 23x performance hunt for some very interesting observations and refactorings around concurrent .


And now? Well - Let's keep this for another day ;-)


Directories

Path Synopsis
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.
v3
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.
v4
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.
v5
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.
Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL