crawler

module
v0.0.0-...-3f03cec Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 23, 2024 License: MIT

README

Crawler

Crawler is a simple web crawler expressed as a CLI tool written in Go.

It will attempt to crawl a given domain and extract all the links found on that domain. It will then follow those links and extract the information from them as well, until it has retrieved them all.

Crawler will only parse and follow links associated with the given domain.

For example, if the domain is https://example.com, then it would only parse links that are part of the example.com website. It will ignore external links, such as https://google.com or even prefixed domains like https://app.example.com.

Requirements

In order to use Crawler, a valid domain must be provided. This can be any valid website, such as https://google.com or https://www.wikipedia.org/.

Crawler was built using Go 1.22, so in order to run or build the application, users must have the Go programming language installed.

Instructions on how to install Go can be found here.

Using the CLI

Once you've selected a domain to crawl and have installed Go, Crawler can be used by just running the program from the root of the repo.

Crawler takes two arguments:

  • --domain - this is mandatory and represents the domain you wish Crawler to crawl. Format: https://example.com
  • --concurrency - this represents the amount of concurrent workers you wish Crawler to spawn. The more workers, the faster it runs, but it will consume more resources. This argument is a simple integer. Default is 5.
go run cmd/crawler/main.go --domain "<CHOSEN_DOMAIN>"

Example with concurrency

go run cmd/crawler/main.go --domain "<CHOSEN_DOMAIN>" --concurrency 10

Output:

Page https://go.dev/ has links: [https://go.dev/solutions/case-studies https://go.dev/solutions/use-cases https://go.dev/security/ https://go.dev/learn/ https://go.dev/doc/effective_go https://go.dev/doc https://go.dev/doc/devel/release https://go.dev/talks/ https://go.dev/wiki/Conferences https://go.dev/blog https://go.dev/help https://go.dev/dl https://go.dev/dl/ https://go.dev/solutions/ https://go.dev/solutions/google/ https://go.dev/solutions/paypal https://go.dev/solutions/americanexpress https://go.dev/solutions/mercadolibre https://go.dev/tour/ https://go.dev/solutions/cloud/ https://go.dev/solutions/clis/ https://go.dev/pkg/net/http/ https://go.dev/pkg/html/template/ https://go.dev/pkg/database/sql/ https://go.dev/solutions/webdev/ https://go.dev/solutions/devops/ https://go.dev/doc/install/ https://go.dev/learn https://go.dev/play https://go.dev/help/ https://go.dev/pkg/ https://go.dev/project https://go.dev/blog/ https://go.dev/brand https://go.dev/conduct https://go.dev/copyright https://go.dev/tos https://go.dev/s/website-issue]

Page https://go.dev/tos has links: [https://go.dev/solutions/case-studies https://go.dev/solutions/use-cases https://go.dev/security/ https://go.dev/learn/ https://go.dev/doc/effective_go https://go.dev/doc https://go.dev/doc/devel/release https://go.dev/talks/ https://go.dev/wiki/Conferences https://go.dev/blog https://go.dev/help https://go.dev/solutions/ https://go.dev/play https://go.dev/tour/ https://go.dev/help/ https://go.dev/pkg/ https://go.dev/project https://go.dev/dl/ https://go.dev/blog/ https://go.dev/brand https://go.dev/conduct https://go.dev/copyright https://go.dev/tos https://go.dev/s/website-issue]

Directories

Path Synopsis
cmd
internal

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL