wcrawler

package module

Version: v0.0.0-...-f9fa47e | Published: Oct 23, 2023 | License: MIT | Imports: 14 | Imported by: 0

Repository

github.com/gustavooferreira/wcrawler

README ¶

WCrawler

WCrawler is a simple web crawler CLI tool.

NOTE: This tool was created mainly for practice purposes and therefore doesn't rely on any library that facilitates crawling.

https://user-images.githubusercontent.com/17534422/109546768-85aec680-7ac2-11eb-8c72-2dbf7c7223a8.mp4

Usage

Exploring the Web:

❯ wcrawler explore --help
Explore the web by following links up to a pre-determined depth.
A depth of zero means no limit.

Usage:
  wcrawler explore URL [flags]


Flags:
  -d, --depth uint        depth of recursion (default 5)
  -h, --help              help for explore
  -s, --nostats           don't show live stats
  -o, --output string     file to save results (default "./web_graph.json")
  -r, --retry uint        retry requests when they timeout (default 2)
  -z, --stayinsubdomain   follow links only in the same subdomain
  -t, --timeout uint      HTTP requests timeout in seconds (default 10)
  -m, --treemode          doesn't add links which would point back to known nodes
  -w, --workers uint      number of workers making concurrent requests (default 100)

Visualizing the graph in the browser:

❯ wcrawler view --help
View web links relationships in the browser

Usage:
  wcrawler view [flags]

Flags:
  -h, --help            help for view
  -i, --input string    file containing the data (default "./web_graph.json")
  -n, --noautoopen      don't open browser automatically
  -o, --output string   HTML output file (default "./web_graph.html")

This will generate a webpage and load it on your default browser.

Spheres are coloured based on the URL subdomain. You can pan, tilt and rotate the scene, drag the spheres around, hover over a sphere to see the URL it represents, and click on it to go straight to that URL.

NOTE: If you want to see a nice graph, make sure to run wcrawler explore with the -m flag. Tree mode doesn't create links back to already known URLs, which makes for much nicer visualizations. Its utility? None, but the graphs are undeniably more beautiful.

Naturally, if you want a proper graph of the links visited and where they point to, just disregard the -m option. Don't try to visualize that, however, cos it's going to look ugly, if not freeze your browser entirely. Consider yourself warned :)
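
The difference between the two modes can be sketched as follows. This is a minimal illustration of the behaviour described by the -m flag help text ("doesn't add links which would point back to known nodes"), not the project's actual code. The graph type and the addLink helper are hypothetical names made up for this example.

```go
package main

import "fmt"

// graph is a hypothetical adjacency list keyed by URL.
type graph struct {
	edges    map[string][]string
	treeMode bool
}

// newGraph creates a graph that already knows the root URL.
func newGraph(root string, treeMode bool) *graph {
	return &graph{edges: map[string][]string{root: nil}, treeMode: treeMode}
}

// addLink records parent -> child and reports whether the edge was added.
// In tree mode, edges that point to an already known node are skipped.
func (g *graph) addLink(parent, child string) bool {
	_, known := g.edges[child]
	if known && g.treeMode {
		return false
	}
	if !known {
		g.edges[child] = nil
	}
	g.edges[parent] = append(g.edges[parent], child)
	return true
}

func main() {
	g := newGraph("/", true)
	fmt.Println(g.addLink("/", "/a")) // true: "/a" is a new node
	fmt.Println(g.addLink("/a", "/")) // false: "/" is already known
}
```

With treeMode set to false, the second call would add the back edge, which is what produces the full (and much denser) link graph.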

Example

The following command will crawl the web starting at the example.com website up to a max of 8 depth levels, using 5 workers with a 6 second timeout per request and saving the collected data to /tmp/result.json.

wcrawler explore https://example.com -d 8 -w 5 -t 6 -o /tmp/result.json

The following command will then generate an HTML file with a graph view of the data collected and load it onto the default web browser. Only try to visualize the graph if you have specified the -m option! It's going to be the wrong graph, but it's going to look nice!

wcrawler view -i /tmp/result.json

Considerations

Here I'm going to discuss the design decisions and a few caveats, but only when I'm actually done with the project.

I still have a few more things to do, such as:

- Add logic to fetch each website's robots.txt file and adhere to whatever is in there. At the moment we are just crawling everything (feeling like an outlaw here at the minute)
- Show the last 10 errors in the CLI while crawling
- Make the output more colorful
- Docs, docs and more docs
- Write more unit tests
- Increase coverage and run some benchmarks (I'm pretty sure I can speed up some parts and reduce allocations, even though this program is I/O bound more than anything else and so won't benefit much from these optimizations, but practice is practice)
- Add golangci-lint to travis-ci (cos it's quite nice)
- Organize the code in a way that makes for a useful library (mostly done)

Third party libraries being used (directly):

Could have written the whole thing without using any library, but reusability is not a bad idea at all!

The only rule I had was to not use any library that facilitates crawling.

- github.com/gosuri/uilive     [updating terminal output in realtime]
- github.com/spf13/cobra       [CLI args and flags parsing]
- github.com/stretchr/testify  [writing unit tests]
- golang.org/x/net             [HTML parsing]
- github.com/oleiade/lane      [Provides a Queue data structure implementation]

Staying up to date

To update wcrawler to the latest version, use go get -u github.com/gustavooferreira/wcrawler.

Build

To build this project run:

make build

The wcrawler binary will be placed inside the bin/ folder.

Tests

To run tests:

make test

To get coverage:

make coverage

Free tip

If you run make without any target, it will display all the options available in the Makefile, each followed by a short description.

Contributing

I'd normally be more than happy to accept pull requests, but given that I've created this project with the sole intent of practicing, it doesn't make sense for me to accept other people's work.

However, feel free to fork the project and add whatever new features you feel like.

I'd still be glad if you notice a bug and report it by opening an issue.

License

This project is licensed under the terms of the MIT license.

Documentation ¶

Index ¶

Constants
type AppState
- func (as *AppState) Parse(state string) error
- func (as AppState) String() string
type Connector
type Crawler
- func NewCrawler(connector Connector, initialURL string, retry int, linksWriter io.Writer, ...) (*Crawler, error)
- func (c *Crawler) Merger(wg *sync.WaitGroup)
- func (c *Crawler) Run()
- func (c *Crawler) StatsWriter(wg *sync.WaitGroup)
- func (c *Crawler) WorkerRun(wg *sync.WaitGroup)
type EdgesSet
- func NewEdgesSet() EdgesSet
- func (es EdgesSet) Add(elems ...int)
- func (es EdgesSet) Count() int
- func (es EdgesSet) Dump() []int
- func (es EdgesSet) MarshalJSON() ([]byte, error)
- func (es EdgesSet) Remove(elem int)
- func (es *EdgesSet) UnmarshalJSON(b []byte) error
type RMEntry
type Record
type RecordManager
- func NewRecordManager() *RecordManager
- func (rm *RecordManager) AddEdge(fromURL string, toURL string) error
- func (rm *RecordManager) AddRecord(entry RMEntry)
- func (rm *RecordManager) Count() int
- func (rm *RecordManager) Dump() map[string]Record
- func (rm *RecordManager) Exists(rawURL string) bool
- func (rm *RecordManager) Get(rawURL string) (Record, bool)
- func (rm *RecordManager) LoadFromReader(r io.Reader) error
- func (rm *RecordManager) SaveToWriter(w io.Writer, indent bool) error
- func (rm *RecordManager) Update(rawURL string, statusCode int, err error) error
type Result
type StatsCLIOutWriter
- func NewStatsCLIOutWriter(writer io.Writer, showErrors bool, totalWorkersCount int, depth int) *StatsCLIOutWriter
- func (sm *StatsCLIOutWriter) AddErrorEntry(value string)
- func (sm *StatsCLIOutWriter) AddLatencySample(value time.Duration)
- func (sm *StatsCLIOutWriter) IncDecDepth(value int)
- func (sm *StatsCLIOutWriter) IncDecErrorsCount(value int)
- func (sm *StatsCLIOutWriter) IncDecLinksCount(value int)
- func (sm *StatsCLIOutWriter) IncDecLinksInQueue(value int)
- func (sm *StatsCLIOutWriter) IncDecTotalRequestsCount(value int)
- func (sm *StatsCLIOutWriter) IncDecWorkersRunning(value int)
- func (sm *StatsCLIOutWriter) RunOutputFlusher()
- func (sm *StatsCLIOutWriter) SetAppState(state AppState)
- func (sm *StatsCLIOutWriter) SetDepth(value int)
- func (sm *StatsCLIOutWriter) SetErrorsCount(value int)
- func (sm *StatsCLIOutWriter) SetLinksCount(value int)
- func (sm *StatsCLIOutWriter) SetLinksInQueue(value int)
- func (sm *StatsCLIOutWriter) SetTotalRequestsCount(value int)
- func (sm *StatsCLIOutWriter) SetWorkersRunning(value int)
type StatsManager
type Task
type URLEntity
- func ExtractURL(rawURL string) (urlEntity URLEntity, err error)
- func JoinURLs(baseURL string, rawURL string) (URLEntity, error)
type WebClient
- func NewWebClient(client *http.Client) *WebClient
- func (c *WebClient) GetLinks(rawURL string) (statusCode int, links []URLEntity, latency time.Duration, err error)

Constants ¶

const (
	// AppState_Unknown represents the 'unknown' state.
	AppState_Unknown = iota
	// AppState_IDLE represents the 'idle' state.
	AppState_IDLE
	// AppState_Running represents the 'run' state.
	AppState_Running
	// AppState_Finished represents the 'finish' state.
	AppState_Finished
)

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type AppState ¶

type AppState int

AppState represents the current state of the App.

func (*AppState) Parse ¶

func (as *AppState) Parse(state string) error

Parse parses a string into AppState returning an error if string passed cannot be parsed into a valid state.

func (AppState) String ¶

func (as AppState) String() string

String returns the string representation of AppState.
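
As an illustration of the Parse/String pair, here is a minimal sketch of how an iota-based state type is commonly written in Go. The State type, its constants and the method bodies are assumptions made for this example. The string values are taken from the constant comments above ('unknown', 'idle', 'run', 'finish'), but the package's real implementation may differ.

```go
package main

import "fmt"

// State is a hypothetical stand-in for wcrawler's AppState.
type State int

const (
	StateUnknown State = iota
	StateIdle
	StateRunning
	StateFinished
)

// names maps each state to an assumed string form.
var names = map[State]string{
	StateUnknown:  "unknown",
	StateIdle:     "idle",
	StateRunning:  "run",
	StateFinished: "finish",
}

// String returns the string form, falling back to "unknown".
func (s State) String() string {
	if n, ok := names[s]; ok {
		return n
	}
	return "unknown"
}

// Parse sets s from a string, or returns an error if no state matches.
func (s *State) Parse(str string) error {
	for k, v := range names {
		if v == str {
			*s = k
			return nil
		}
	}
	return fmt.Errorf("invalid state: %q", str)
}

func main() {
	var s State
	if err := s.Parse("run"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s) // prints "run"
}
```

Parse takes a pointer receiver because it mutates the value, while String uses a value receiver so that fmt can call it on plain values.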

type Connector ¶

type Connector interface {
	GetLinks(rawURL string) (statusCode int, links []URLEntity, latency time.Duration, err error)
}

Connector describes the connector interface.
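
Because Connector is an interface, the crawler can presumably be driven by something other than a real HTTP client, for example in tests. The sketch below shows one way to do that. URLEntity and Connector are local copies of the shapes documented on this page so that the example compiles on its own, and fakeConnector is a hypothetical type that is not part of the package.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// URLEntity and Connector mirror the documented shapes.
type URLEntity struct {
	NetLoc string
	Raw    string
}

type Connector interface {
	GetLinks(rawURL string) (statusCode int, links []URLEntity, latency time.Duration, err error)
}

// fakeConnector is a hypothetical in-memory Connector: it serves links
// from a fixed map instead of making HTTP requests.
type fakeConnector struct {
	pages map[string][]URLEntity
}

func (f fakeConnector) GetLinks(rawURL string) (int, []URLEntity, time.Duration, error) {
	links, ok := f.pages[rawURL]
	if !ok {
		return 404, nil, 0, errors.New("page not found")
	}
	return 200, links, time.Millisecond, nil
}

func main() {
	var c Connector = fakeConnector{pages: map[string][]URLEntity{
		"https://example.com": {{NetLoc: "example.com", Raw: "https://example.com/about"}},
	}}
	status, links, _, err := c.GetLinks("https://example.com")
	fmt.Println(status, len(links), err) // prints "200 1 <nil>"
}
```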

type Crawler ¶

type Crawler struct {

	// Read-only vars
	InitialURL string

	Stats           bool
	ShowErrors      bool
	WorkersCount    int
	Depth           int
	StayInSubdomain bool
	TreeMode        bool
	SubDomain       string
	Retry           int
	// contains filtered or unexported fields
}

Crawler brings everything together and is responsible for starting goroutines and managing them.

func NewCrawler ¶

func NewCrawler(connector Connector, initialURL string, retry int, linksWriter io.Writer, stats bool, showErrors bool, stayinsubdomain bool, treemode bool, workersCount int, depth int) (*Crawler, error)

NewCrawler returns a new Crawler.

func (*Crawler) Merger ¶

func (c *Crawler) Merger(wg *sync.WaitGroup)

Merger gets the results (links) from the workers and keeps all the relevant information, feeding the new links back to the workers via another channel.

func (*Crawler) Run ¶

func (c *Crawler) Run()

Run starts crawling.

func (*Crawler) StatsWriter ¶

func (c *Crawler) StatsWriter(wg *sync.WaitGroup)

StatsWriter writes stats to an io.Writer (e.g. os.Stdout).

func (*Crawler) WorkerRun ¶

func (c *Crawler) WorkerRun(wg *sync.WaitGroup)

WorkerRun represents a worker crawling links in a goroutine. It receives tasks on one channel and returns results on another. When the tasks channel is closed, the worker returns.
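
The worker pattern described here (tasks in on one channel, results out on another, return when the tasks channel is closed) can be sketched in a few lines. This is a generic illustration, not the package's code. The task and result types are simplified, hypothetical versions of Task and Result, and runPool is a helper invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// task and result are simplified stand-ins for Task and Result.
type task struct {
	URL   string
	Depth int
}

type result struct {
	ParentURL string
	Depth     int
}

// worker reads tasks until the channel is closed, then returns.
func worker(wg *sync.WaitGroup, tasks <-chan task, results chan<- result) {
	defer wg.Done()
	for t := range tasks {
		// A real worker would fetch t.URL here.
		results <- result{ParentURL: t.URL, Depth: t.Depth}
	}
}

// runPool starts n workers (n must be at least 1), feeds them the given
// tasks and collects one result per task.
func runPool(n int, in []task) []result {
	tasks := make(chan task)
	results := make(chan result, len(in)) // buffered so workers never block
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go worker(&wg, tasks, results)
	}
	for _, t := range in {
		tasks <- t
	}
	close(tasks) // signals the workers to return
	wg.Wait()
	close(results)

	var out []result
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	out := runPool(3, []task{{"https://example.com", 0}, {"https://example.com/a", 1}})
	fmt.Println(len(out)) // prints "2"
}
```

In the crawler itself the set of tasks is not known up front, which is presumably why a separate Merger goroutine sits between the results and tasks channels.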

type EdgesSet ¶

type EdgesSet map[int]struct{}

func NewEdgesSet ¶

func NewEdgesSet() EdgesSet

func (EdgesSet) Add ¶

func (es EdgesSet) Add(elems ...int)

func (EdgesSet) Count ¶

func (es EdgesSet) Count() int

func (EdgesSet) Dump ¶

func (es EdgesSet) Dump() []int

func (EdgesSet) MarshalJSON ¶

func (es EdgesSet) MarshalJSON() ([]byte, error)

func (EdgesSet) Remove ¶

func (es EdgesSet) Remove(elem int)

func (*EdgesSet) UnmarshalJSON ¶

func (es *EdgesSet) UnmarshalJSON(b []byte) error
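
The EdgesSet methods are undocumented, so here is a minimal sketch of how a map[int]struct{} set with this method set is usually implemented. The intSet type is a hypothetical stand-in. Encoding the set as a JSON array of ints and sorting the output of Dump are assumptions made for this example, not confirmed behaviour of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// intSet is a hypothetical stand-in for EdgesSet.
type intSet map[int]struct{}

// Add inserts the elements; duplicates are ignored.
func (s intSet) Add(elems ...int) {
	for _, e := range elems {
		s[e] = struct{}{}
	}
}

// Remove deletes elem if present.
func (s intSet) Remove(elem int) { delete(s, elem) }

// Count returns the number of elements.
func (s intSet) Count() int { return len(s) }

// Dump returns the elements as a slice, sorted here so that the
// output is deterministic (an assumption of this sketch).
func (s intSet) Dump() []int {
	out := make([]int, 0, len(s))
	for e := range s {
		out = append(out, e)
	}
	sort.Ints(out)
	return out
}

// MarshalJSON encodes the set as a JSON array of ints.
func (s intSet) MarshalJSON() ([]byte, error) {
	return json.Marshal(s.Dump())
}

// UnmarshalJSON decodes a JSON array of ints into the set.
func (s *intSet) UnmarshalJSON(b []byte) error {
	var elems []int
	if err := json.Unmarshal(b, &elems); err != nil {
		return err
	}
	*s = intSet{}
	s.Add(elems...)
	return nil
}

func main() {
	s := intSet{}
	s.Add(3, 1, 3)
	b, _ := json.Marshal(s)
	fmt.Println(s.Count(), string(b)) // prints "2 [1,3]"
}
```

UnmarshalJSON needs a pointer receiver because it replaces the map itself, which matches the receiver types in the index above.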

type RMEntry ¶

type RMEntry struct {
	ParentURL  string
	URL        URLEntity
	Depth      int
	StatusCode int
	ErrString  string
}

RMEntry represents an entry in the RecordManager (external interface).

type Record ¶

type Record struct {
	// Index allows easy referencing of records (used in the edges)
	Index int `json:"index"`
	// This indicates whether this is the start of the graph
	// i.e., URL provided.
	InitPoint bool   `json:"initPoint"`
	URL       string `json:"url"`
	Host      string `json:"host"`
	Depth     int    `json:"depth"`
	// Edges      []uint `json:"edges"`
	// This is supposed to be mimicking a hashset
	// We use a struct as a value as it's a bit more space efficient
	Edges      EdgesSet `json:"edges"`
	StatusCode int      `json:"statusCode"`
	ErrString  string   `json:"errString,omitempty"`
}

Record represents an entry in the RecordManager (internal state).

type RecordManager ¶

type RecordManager struct {
	// Keeps a table of Records. Key is the URL (scheme,authority,path,query)
	Records    map[string]Record
	IndexCount int
}

RecordManager keeps track of links visited and some metadata like depth level and its children.

func NewRecordManager ¶

func NewRecordManager() *RecordManager

NewRecordManager returns a new Record Manager.

func (*RecordManager) AddEdge ¶

func (rm *RecordManager) AddEdge(fromURL string, toURL string) error

AddEdge adds a new edge to a record if not already present.

func (*RecordManager) AddRecord ¶

func (rm *RecordManager) AddRecord(entry RMEntry)

AddRecord adds a record to the RecordManager.

func (*RecordManager) Count ¶

func (rm *RecordManager) Count() int

Count counts the number of records.

func (*RecordManager) Dump ¶

func (rm *RecordManager) Dump() map[string]Record

Dump returns all records in the RecordManager.

func (*RecordManager) Exists ¶

func (rm *RecordManager) Exists(rawURL string) bool

Exists checks whether this URL exists in the table.

func (*RecordManager) Get ¶

func (rm *RecordManager) Get(rawURL string) (Record, bool)

Get returns a record from the Record Manager.

func (*RecordManager) LoadFromReader ¶

func (rm *RecordManager) LoadFromReader(r io.Reader) error

LoadFromReader reads the records from a Reader in JSON format. You can pass an *os.File to read from a file.

func (*RecordManager) SaveToWriter ¶

func (rm *RecordManager) SaveToWriter(w io.Writer, indent bool) error

SaveToWriter dumps the records map into a Writer in JSON format. You can pass an *os.File to write to a file.

func (*RecordManager) Update ¶

func (rm *RecordManager) Update(rawURL string, statusCode int, err error) error

Update updates an entry in the table.

type Result ¶

type Result struct {
	ParentURL  string
	StatusCode int
	Links      []URLEntity
	// Depth of the ParentURL
	Depth int
	Err   error
}

Result is what workers return in a channel.

type StatsCLIOutWriter ¶

type StatsCLIOutWriter struct {
	// contains filtered or unexported fields
}

StatsCLIOutWriter keeps track of stats and writes up-to-date stats to a writer.

func NewStatsCLIOutWriter ¶

func NewStatsCLIOutWriter(writer io.Writer, showErrors bool, totalWorkersCount int, depth int) *StatsCLIOutWriter

NewStatsCLIOutWriter returns a new StatsCLIOutWriter.

func (*StatsCLIOutWriter) AddErrorEntry ¶

func (sm *StatsCLIOutWriter) AddErrorEntry(value string)

func (*StatsCLIOutWriter) AddLatencySample ¶

func (sm *StatsCLIOutWriter) AddLatencySample(value time.Duration)

func (*StatsCLIOutWriter) IncDecDepth ¶

func (sm *StatsCLIOutWriter) IncDecDepth(value int)

func (*StatsCLIOutWriter) IncDecErrorsCount ¶

func (sm *StatsCLIOutWriter) IncDecErrorsCount(value int)

func (*StatsCLIOutWriter) IncDecLinksCount ¶

func (sm *StatsCLIOutWriter) IncDecLinksCount(value int)

func (*StatsCLIOutWriter) IncDecLinksInQueue ¶

func (sm *StatsCLIOutWriter) IncDecLinksInQueue(value int)

func (*StatsCLIOutWriter) IncDecTotalRequestsCount ¶

func (sm *StatsCLIOutWriter) IncDecTotalRequestsCount(value int)

func (*StatsCLIOutWriter) IncDecWorkersRunning ¶

func (sm *StatsCLIOutWriter) IncDecWorkersRunning(value int)

func (*StatsCLIOutWriter) RunOutputFlusher ¶

func (sm *StatsCLIOutWriter) RunOutputFlusher()

RunOutputFlusher writes the updated stats to an io.Writer. Run it in a goroutine.

func (*StatsCLIOutWriter) SetAppState ¶

func (sm *StatsCLIOutWriter) SetAppState(state AppState)

func (*StatsCLIOutWriter) SetDepth ¶

func (sm *StatsCLIOutWriter) SetDepth(value int)

func (*StatsCLIOutWriter) SetErrorsCount ¶

func (sm *StatsCLIOutWriter) SetErrorsCount(value int)

func (*StatsCLIOutWriter) SetLinksCount ¶

func (sm *StatsCLIOutWriter) SetLinksCount(value int)

func (*StatsCLIOutWriter) SetLinksInQueue ¶

func (sm *StatsCLIOutWriter) SetLinksInQueue(value int)

func (*StatsCLIOutWriter) SetTotalRequestsCount ¶

func (sm *StatsCLIOutWriter) SetTotalRequestsCount(value int)

func (*StatsCLIOutWriter) SetWorkersRunning ¶

func (sm *StatsCLIOutWriter) SetWorkersRunning(value int)

type StatsManager ¶

type StatsManager interface {
	SetAppState(state AppState)
	SetLinksInQueue(value int)
	IncDecLinksInQueue(value int)
	SetLinksCount(value int)
	IncDecLinksCount(value int)
	SetErrorsCount(value int)
	IncDecErrorsCount(value int)
	SetWorkersRunning(value int)
	IncDecWorkersRunning(value int)
	SetTotalRequestsCount(value int)
	IncDecTotalRequestsCount(value int)
	SetDepth(value int)
	IncDecDepth(value int)
	AddLatencySample(value time.Duration)
	RunOutputFlusher()
}

StatsManager represents a tracker of statistics related to the crawler. This interface is unfortunately quite big as it needs to support several operations on the statistics it keeps track of.
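
Most of the interface is the same Set/IncDec pair repeated for each statistic. Since workers update these values concurrently, an implementation has to be safe for concurrent use. The sketch below shows one plausible building block, a mutex-guarded counter. The counter type and its Get method are hypothetical and are not taken from the package.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical mutex-guarded value offering the Set / IncDec
// pair that the interface repeats for every statistic.
type counter struct {
	mu    sync.Mutex
	value int
}

// Set overwrites the value.
func (c *counter) Set(v int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value = v
}

// IncDec adds delta, which may be negative.
func (c *counter) IncDec(delta int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value += delta
}

// Get returns the current value.
func (c *counter) Get() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.value
}

func main() {
	var linksInQueue counter
	linksInQueue.Set(10)

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			linksInQueue.IncDec(1)  // a link was queued
			linksInQueue.IncDec(-1) // a link was picked up
		}()
	}
	wg.Wait()
	fmt.Println(linksInQueue.Get()) // prints "10"
}
```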

type Task ¶

type Task struct {
	URL   string
	Depth int
}

Task is what gets sent to the channel for workers to pull data from the web.

type URLEntity ¶

type URLEntity struct {
	// NetLoc represents the NetLoc portion of the URL
	NetLoc string
	// Raw represents the entire URL
	Raw string
}

URLEntity represents a URL.

func ExtractURL ¶

func ExtractURL(rawURL string) (urlEntity URLEntity, err error)

ExtractURL takes any URL and returns a URLEntity whose URL contains only the scheme, authority and path, ready to be used as a parent URL.

func JoinURLs ¶

func JoinURLs(baseURL string, rawURL string) (URLEntity, error)

JoinURLs behaves the same way as parent URL, except that it also includes query params. If the URL provided is relative, it is joined with the base URL. It returns an error if the URL is of an unwanted type, such as 'mailto'.
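
Joining a relative link with its base URL and rejecting schemes such as mailto can be done with the standard net/url package. The sketch below is an assumption about the general approach, not the package's implementation. The join helper is hypothetical, and dropping the fragment is a choice made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// join resolves ref against base and rejects non-HTTP schemes.
func join(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	u := b.ResolveReference(r)
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", errors.New("unwanted scheme: " + u.Scheme)
	}
	u.Fragment = "" // drop the fragment; keep the query params
	return u.String(), nil
}

func main() {
	s, err := join("https://example.com/docs/", "../about?lang=en#team")
	fmt.Println(s, err) // prints "https://example.com/about?lang=en <nil>"
}
```

An absolute link passed as ref simply replaces the base, because ResolveReference returns the reference unchanged when it already has a scheme.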

type WebClient ¶

type WebClient struct {
	// contains filtered or unexported fields
}

WebClient is responsible for connecting to the links and managing connections to websites. It implements the Connector interface.

func NewWebClient ¶

func NewWebClient(client *http.Client) *WebClient

NewWebClient returns a new WebClient.

func (*WebClient) GetLinks ¶

func (c *WebClient) GetLinks(rawURL string) (statusCode int, links []URLEntity, latency time.Duration, err error)

GetLinks returns all the links found in the webpage.

Directories ¶

Path	Synopsis
cmd
wcrawler
wcrawler/cli
internal
graph
ring	Package ring provides an implementation of a ring buffer containing strings.
