httpsyet

package
v0.1.9
Published: Jun 26, 2018 License: MIT Imports: 10 Imported by: 0

README

v2 - A new home for the processing network

The new struct traffic

Overview

The processing network is mainly about the site channel and the WaitGroup observing it, so it is natural to factor it out into a new struct traffic and into a new source file. Anonymous embedding allows seamless use in crawling, and its source file gets a little more compact.
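A minimal sketch of what anonymous embedding buys here, assuming simplified stand-ins for traffic and crawling. The string payload, the newTraffic constructor, the name field and the single-argument Feed are illustrative assumptions, not the package's actual definitions:

```go
package main

import (
	"fmt"
	"sync"
)

// traffic is a simplified stand-in: a channel plus an embedded WaitGroup.
type traffic struct {
	Travel          chan string
	*sync.WaitGroup // Add, Done and Wait are promoted to traffic
}

// newTraffic is a hypothetical constructor used only in this sketch.
func newTraffic() traffic {
	return traffic{Travel: make(chan string, 1), WaitGroup: new(sync.WaitGroup)}
}

// Feed registers one entry with the WaitGroup and sends it on Travel.
func (t *traffic) Feed(s string) {
	t.Add(1) // promoted from the embedded *sync.WaitGroup
	t.Travel <- s
}

// crawling embeds traffic anonymously, so Feed, Travel, Done and Wait
// can all be used directly on a crawling value.
type crawling struct {
	traffic
	name string // hypothetical extra field
}

func main() {
	c := crawling{traffic: newTraffic(), name: "demo"}
	c.Feed("http://example.com") // promoted through the embedding
	fmt.Println(<-c.Travel)
	c.Done() // promoted two levels deep: crawling -> traffic -> WaitGroup
	c.Wait()
}
```

Because the methods are promoted, code in crawling does not need to spell out the intermediate field, which is what keeps the source file compact.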

Last but not least, there is also this string called result passing through a channel. To improve clarity we give it a named type; after all, Go is a type-safe language.


Some remarks regarding changes to source files compared with the previous version:

traffic.go

New home for guarded traffic. Move Processor and Feed (formerly add) into here.

site.go

Moved func queueURLs from crawler.go into here.

I will regret this later.

crawling.go

Use type aliases:

type site = Site
type traffic = Traffic
  • Make result an explicit type now.
  • Move methods processor and add into new traffic.go.
  • Make use of new traffic.
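The effect of the type aliases can be sketched as follows, assuming a simplified Site with a string URL and a hypothetical Describe method (neither is the package's actual definition). An alias declared with = names the identical type, so existing code written against the lowercase name keeps working without conversions:

```go
package main

import "fmt"

// Site is a simplified stand-in for the exported type.
type Site struct {
	URL   string
	Depth int
}

// Describe is a hypothetical method used only in this sketch.
func (s Site) Describe() string {
	return fmt.Sprintf("%s (depth %d)", s.URL, s.Depth)
}

// site is an alias: it denotes the very same type as Site,
// including its method set.
type site = Site

// visit is written against the unexported name.
func visit(s site) string { return s.Describe() }

func main() {
	exported := Site{URL: "http://example.com", Depth: 1}
	fmt.Println(visit(exported)) // no conversion needed
}
```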

crawler_test.go

Just the import path.

genny.go

Adjust to having result as explicit type now.

Changes to crawler.go

No need to touch the previously refactored crawler.go, except for the one line where the result is sent: we simply need to convert it to the new explicit type now.


Documentation

Overview

Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	Sites    []string                             // At least one URL.
	Out      io.Writer                            // Required. Writes one detected site per line.
	Log      *log.Logger                          // Required. Errors are reported here.
	Depth    int                                  // Optional. Limit depth. Set to >= 1.
	Parallel int                                  // Optional. Set how many sites to crawl in parallel.
	Delay    time.Duration                        // Optional. Set delay between crawls.
	Get      func(string) (*http.Response, error) // Optional. Defaults to http.Get.
	Verbose  bool                                 // Optional. If set, status updates are written to logger.
}

Crawler is used as configuration for Run. It is validated in Run().

func (Crawler) Run

func (c Crawler) Run() error

Run the crawler. Can return validation errors. All crawling errors are reported via logger. Output is written to writer. Crawls sites recursively and reports all external links that can be changed to HTTPS. Also reports broken links via error logger.

type Site

type Site struct {
	URL    *url.URL
	Parent *url.URL
	Depth  int
}

Site represents what travels: a URL which may have a Parent URL, and a Depth.

func (Site) Attr

func (s Site) Attr() interface{}

Attr implements the attribute relevant for ForkSiteSeenAttr, the "I've seen this site before" discriminator.

func (Site) Print

func (s Site) Print() Site

Print may be used for tracing, e.g. via PipeSiteFunc(sites, site.Print).

type Traffic

type Traffic struct {
	Travel          chan site // to be processed
	*sync.WaitGroup           // monitor SiteEnter & SiteLeave
}

Traffic as it goes around inside a circular site pipe network, e.g. a crawling Crawler. Composed of Travel, a channel for those who travel in the traffic, and an embedded *sync.WaitGroup to keep track of congestion.

func (*Traffic) Feed

func (t *Traffic) Feed(urls []*url.URL, parent *url.URL, depth int)

Feed registers new entries and launches their dispatcher (which we intentionally left untouched).

func (*Traffic) Processor

func (t *Traffic) Processor(crawl func(s site), parallel int)

Processor builds the site traffic processing network; it is circular if crawl uses Feed to provide feedback.
