Web Crawler System Design
Design Plan
The system will involve 7 parts overall:
- Seed URL (starting URL for crawling)
  - For this project I will only be allowing a single URL input
  - The client will be a long-running web server
- URL Frontier
  - Data structure that stores URLs awaiting download
  - Enforces priority/politeness so we do not DDoS a website
  - A queue router puts URLs into queues; a queue selector picks URLs from those queues
  - A managed Redis queue holds the data in FIFO order
  - Each Redis key, one per primary host, has a queue associated with it
  - Workers spin up to ingest data from the FIFO queue for each key
- HTML Downloader (including DNS resolution)
  - Gets IP addresses from the DNS resolver and downloads the HTML content (see the sketch after this list)
- Content Parser
  - Parses the HTML to ensure the raw text is not malformed
- Content Seen?
  - A data store of MD5 hashes of HTML content. If the store already contains the hash produced by the parser, the page is thrown away and work continues; if it does not, the hash is stored.
- Link Extractor
  - Extracts links from the HTML page (covered in the same sketch below)
- URL Filter
  - Gets passed the extracted links and stores the URLs that should be crawled
  - Those URLs are then stored in the URL Frontier and the whole process continues
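A minimal Go sketch of the download and link-extraction steps above, assuming the standard net/http client (which performs DNS resolution through the system resolver) and the golang.org/x/net/html parser; the function names are illustrative, not part of the design.

```go
package crawler

import (
	"bytes"
	"fmt"
	"io"
	"net/http"

	"golang.org/x/net/html"
)

// downloadHTML fetches the raw HTML for a URL. net/http resolves the host's
// IP addresses via the system DNS resolver before connecting.
func downloadHTML(rawURL string) ([]byte, error) {
	resp, err := http.Get(rawURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d for %s", resp.StatusCode, rawURL)
	}
	return io.ReadAll(resp.Body)
}

// extractLinks walks the parsed HTML tree and collects the href attribute of
// every <a> element; these links are what gets handed to the URL filter.
func extractLinks(body []byte) ([]string, error) {
	doc, err := html.Parse(bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	var links []string
	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" {
					links = append(links, attr.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links, nil
}
```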
Diagram
If either the DNS resolver or the
parser fails, log the error and restart
┌─────────────────────────┐
│ │
│ ┌─────────┐ │
│ │DNS │ │
│ ┌─────┤Resolver │ │
│ │ └───▲─────┘ │
│ │ │ │
│ │ │ │
┌─────────┐ ┌───▼───▼─┐ ┌───┴─────┐ ┌─┴───────┐
│ │ │ │ │ │ │ │
│Client ├───►Frontier ├───►Html ├───►Html │
│ │ │ │ │Download │ │Parser │
└─────────┘ └──▲───▲──┘ └─────────┘ └────┬────┘
│ │ │
│ │ ┌───────┐ ┌────▼────┐ ┌─────────┐
│ │ ├───────┘ │ │ │ │
│ │ │ Data ◄────┤Content ├──►Link │
│ │ │ Store │ │seen? │ │extract │
│ │ └───────┘ └┬────────┘ └────┬────┘
│ │ │ │
│ └─────────────────────┘ ┌────▼────┐
│ If MD5 hash exists │ │
│ restart to beginning │URL │
│ │Filter │
│ ┌───────┐ └───┬─────┘
│ ├───────┘ │
└───────────┤Redis ◄─────────────────────┘
│MQ │ Url's are pushed to
└───────┘ redis MQ for processing
Models
The data models for this will be very simple. The queue data model takes the form of a
Redis queue per host.
{ "wikipedia": ["https://wikipedia.com", "https://wikipedia.com/test"] }
{ "go": ["https://pkg.go.com/net/http", "go.com"] }
Using this model makes it possible to group crawler workers by host, so each worker only crawls within its assigned host.
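As a rough sketch of how the queue router could enqueue against this model, assuming the go-redis v9 client and a `frontier:<host>` key scheme (the key prefix is an assumption; only the one-queue-per-host shape comes from the design):

```go
package crawler

import (
	"context"
	"net/url"

	"github.com/redis/go-redis/v9"
)

// enqueueURL pushes a URL onto the Redis list for its host, creating the
// per-host queue on first use.
func enqueueURL(ctx context.Context, rdb *redis.Client, rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	key := "frontier:" + u.Hostname()
	// RPUSH here plus LPOP in the worker gives FIFO ordering per host.
	return rdb.RPush(ctx, key, rawURL).Err()
}
```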
The seen-content
data store will simply be a SQLite DB containing the MD5 hashes of all seen sites.
interface SeenContentModel {
    id   PK int unique
    hash string
}
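A minimal sketch of the Content Seen? check against that model, assuming a `seen_content` table matching the interface above and the mattn/go-sqlite3 driver (both the table name and the driver choice are assumptions):

```go
package crawler

import (
	"crypto/md5"
	"database/sql"
	"encoding/hex"

	_ "github.com/mattn/go-sqlite3" // SQLite driver (assumed; any driver works)
)

// Assumed schema, mirroring SeenContentModel:
//   CREATE TABLE seen_content (id INTEGER PRIMARY KEY, hash TEXT UNIQUE);

// seenBefore hashes the parsed HTML content and checks the SQLite store.
// If the MD5 hash is already present the page is discarded; otherwise the
// hash is recorded and the caller continues on to link extraction.
func seenBefore(db *sql.DB, content []byte) (bool, error) {
	sum := md5.Sum(content)
	hash := hex.EncodeToString(sum[:])

	var count int
	err := db.QueryRow(
		"SELECT COUNT(1) FROM seen_content WHERE hash = ?", hash,
	).Scan(&count)
	if err != nil {
		return false, err
	}
	if count > 0 {
		return true, nil // already seen: throw the page away
	}
	_, err = db.Exec("INSERT INTO seen_content (hash) VALUES (?)", hash)
	return false, err
}
```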
Frontier
The queue router and queue selector will be contained within a single module. The queue router receives links from crawlers
and enqueues them into the Redis queue for the matching host. The queue selector works on a pub/sub mechanism:
each queue has its own crawler, the crawl depth is set by an environment variable when the program runs, and when a new
queue is added a subscriber in the queue selector module spins up a new worker/crawler for it.
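A rough sketch of that queue selector, again assuming go-redis v9; the pub/sub channel name `frontier:new-queues` and the `crawl` callback are illustrative, while the crawl depth comes from an environment variable as described above:

```go
package crawler

import (
	"context"
	"os"
	"strconv"

	"github.com/redis/go-redis/v9"
)

// runQueueSelector subscribes to a channel on which the queue router announces
// newly created per-host queues, and spins up one worker/crawler per queue.
func runQueueSelector(ctx context.Context, rdb *redis.Client,
	crawl func(ctx context.Context, queueKey string, depth int)) error {

	// Crawl depth is configured through an environment variable at startup.
	depth, err := strconv.Atoi(os.Getenv("CRAWL_DEPTH"))
	if err != nil {
		depth = 1 // fall back to a shallow crawl if unset or malformed
	}

	sub := rdb.Subscribe(ctx, "frontier:new-queues")
	defer sub.Close()

	for msg := range sub.Channel() {
		// Each payload is the Redis key of a newly added host queue;
		// every queue gets its own dedicated worker/crawler.
		go crawl(ctx, msg.Payload, depth)
	}
	return nil
}
```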