webcrawler

package module
v0.8.10
Published: Oct 24, 2024 License: MIT Imports: 17 Imported by: 0

README

webcrawlerGo

Crawls a website and saves marked URLs' contents to the DB

Summary:

The crawler crawls the provided base URL and fetches all valid hrefs on the page. Unseen hrefs are added to a unique queue so that their pages are fetched in turn. The crawler saves the contents of paths that are monitored (from models) or marked (via a cmd arg), and respects the robots.txt of the website being crawled.

Uses PostgreSQL when a DSN is provided; otherwise opens a local sqlite3 database.

Usage:
webcrawler -baseurl <url> [OPTIONS]

-baseurl string
    Absolute base URL to crawl (required).
    E.g. <http/https>://<domain-name>
-date string
    Cut-off date up to which the latest crawled pages will be saved to disk.
    Format: YYYY-MM-DD. Applicable only with 'db2disk' flag.
    (default "<todays-date>")
-days int
    Days past which monitored URLs should be updated (default 1)
-db-dsn string
    DSN string to database.
    Supported DSN: PostgreSQL DSN (optional).
    When empty crawler will use sqlite3 driver.
-db2disk
    Use this flag to write the latest crawled content to disk.
    Customise using arguments 'path' and 'date'.
    Crawler will exit after saving to disk.
-idle-time string
    Idle time after which crawler quits when queue is empty.
    Min: 1s (default "10s")
-ignore string
    Comma ',' separated string of URL patterns to ignore.
-murls string
    Comma ',' separated string of marked URL paths to save/update.
    If the marked path is unmonitored in the database, the crawler
    will mark the URL as monitored.
    When empty, crawler will update monitored URLs from the model.
-n int
    Number of crawlers to invoke (default 10)
-path string
    Output path to save the content of crawled web pages.
    Applicable only with 'db2disk' flag. (default "./OUT/<timestamp>")
-req-delay string
    Delay between subsequent requests.
    Min: 1ms (default "50ms")
-retry int
    Number of times to retry failed GET requests.
    With retry=2, crawlers will retry the failed GET URLs
    twice after the initial failure. (default 2)
-ua string
    User-Agent string to use while crawling
    (default "webcrawlerGo/v<version> - Web crawler in Go")
-update-hrefs
    Use this flag to update embedded HREFs in all saved and alive URLs
    belonging to the baseurl.
-v  Display app version
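
For example, an illustrative invocation (all values below are placeholders, not defaults of the tool) that crawls a site with 5 crawlers, marks two paths, skips PDF and ZIP links, and stores results in PostgreSQL could look like:

    webcrawler -baseurl https://example.com \
        -murls /blog,/docs \
        -ignore .pdf,.zip \
        -db-dsn "postgres://user:pass@localhost:5432/crawler?sslmode=disable" \
        -n 5 -req-delay 100ms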

Note:

  • Crawler will ignore hrefs that begin with "file:", "javascript:", "mailto:", "tel:", "#", "data:"
  • Marking URLs with the -murls option sets is_monitored=true in the models.
  • Use the -ignore option to ignore any pattern in a URL path, e.g. to ignore paths with PDF files add '.pdf' to the ignore list.
  • The crawler will not follow URLs outside the baseurl.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	Name string // Name of crawler for easy identification
	*CrawlerConfig
}

Crawler crawls the URL fetched from Queue and saves the contents to Models.

Crawler will quit after IdleTimeout when the queue is empty

func NNewCrawlers

func NNewCrawlers(n int, namePrefix string, cfg *CrawlerConfig) ([]*Crawler, error)

NNewCrawlers returns N new Crawlers configured with cfg. Crawlers will be named with namePrefix.

func NewCrawler

func NewCrawler(name string, cfg *CrawlerConfig) (*Crawler, error)

NewCrawler returns a pointer to a new Crawler

func (*Crawler) Crawl

func (c *Crawler) Crawl(client *http.Client)

Crawl begins crawling using the given http.Client

func (*Crawler) Log added in v0.8.5

func (c *Crawler) Log(msg string)

Log writes the msg to [Crawler.Logger] and [Crawler.PrettyLogger] when present

type CrawlerConfig

type CrawlerConfig struct {
	Queue            *queue.UniqueQueue // global queue
	Models           *models.Models     // models to use
	BaseURL          *url.URL           // base URL to crawl
	UserAgent        string             // user-agent to use while crawling
	MarkedURLs       []string           // marked URL to save to model
	IgnorePatterns   []string           // URL pattern to ignore
	RequestDelay     time.Duration      // delay between subsequent requests
	IdleTimeout      time.Duration      // timeout after which crawler quits when queue is empty
	Logger           *log.Logger        // will log to [os.Stdout] when nil and when no PrettyLogger; ONLY log to file if also using PrettyLogger
	RetryTimes       int                // no. of times to retry failed request
	FailedRequests   map[string]int     // map to store failed requests stats
	KnownInvalidURLs *InvalidURLCache   // known map of invalid URLs
	Ctx              context.Context    // context to quit on SIGINT/SIGTERM

	PrettyLogger PrettyLogger // optional logger to write to screen
	// contains filtered or unexported fields
}

CrawlerConfig to configure a crawler
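
A minimal sketch of programmatic use is shown below. Only identifiers documented on this page are used; the import path, the base URL, the config values, and the decision to run each Crawler in its own goroutine are assumptions for illustration, and the construction of Queue, Models and KnownInvalidURLs is not covered here.

package main

import (
	"context"
	"log"
	"net/http"
	"net/url"
	"os"
	"sync"
	"time"

	webcrawler "example.com/webcrawlerGo" // placeholder import path, not the real module path
)

func main() {
	base, err := url.Parse("https://example.com") // placeholder base URL
	if err != nil {
		log.Fatal(err)
	}

	cfg := &webcrawler.CrawlerConfig{
		// Queue, Models and KnownInvalidURLs are left unset in this sketch;
		// a real program must supply them.
		BaseURL:        base,
		UserAgent:      "webcrawlerGo/v0.8.10 - Web crawler in Go",
		MarkedURLs:     []string{"/blog", "/docs"},
		IgnorePatterns: []string{".pdf"},
		RequestDelay:   50 * time.Millisecond,
		IdleTimeout:    10 * time.Second,
		Logger:         log.New(os.Stdout, "", log.LstdFlags),
		RetryTimes:     2,
		FailedRequests: map[string]int{},
		Ctx:            context.Background(),
	}

	crawlers, err := webcrawler.NNewCrawlers(4, "crawler", cfg)
	if err != nil {
		log.Fatal(err)
	}

	client := &http.Client{Timeout: 30 * time.Second}

	var wg sync.WaitGroup
	for _, c := range crawlers {
		wg.Add(1)
		go func(c *webcrawler.Crawler) {
			defer wg.Done()
			c.Crawl(client) // each crawler quits once the queue stays empty for IdleTimeout
		}(c)
	}
	wg.Wait()
}

In the actual tool, the cmd directory presumably wires these pieces together; the sketch above only illustrates how the exported API fits.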

type InvalidURLCache

type InvalidURLCache struct {
	// contains filtered or unexported fields
}

InvalidURLCache is the cache for invalid URLs

type PrettyLogger added in v0.8.7

type PrettyLogger interface {
	// Log will send message to PrettyLogger instance
	// to be written on terminal
	Log(string)

	// Quit should initiate the call to quit the PrettyLogger when
	// the last crawler has exited
	Quit()
}

PrettyLogger interface is used to write scrolling logs to terminal
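
Any type with matching Log and Quit methods satisfies this interface. The sketch below is illustrative and not part of the package; it writes messages to stderr and uses a channel so the caller can wait for shutdown.

package main

import (
	"fmt"
	"os"
)

// terminalLogger is an illustrative PrettyLogger implementation: Log writes
// each message to stderr, and Quit closes a channel to signal shutdown.
type terminalLogger struct {
	done chan struct{}
}

func (t *terminalLogger) Log(msg string) { fmt.Fprintln(os.Stderr, msg) }

func (t *terminalLogger) Quit() { close(t.done) }

func main() {
	l := &terminalLogger{done: make(chan struct{})}
	l.Log("crawler started") // in real use, the crawlers call Log
	go l.Quit()
	<-l.done // wait until Quit signals that the last crawler has exited
}

A value such as &terminalLogger{done: make(chan struct{})} could then be assigned to CrawlerConfig.PrettyLogger.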

Directories

Path Synopsis
cmd
