Web Crawler System Design

Design Plan

The system will involve 7 parts overall:

  • Seed URL (starting URL for crawling)
    • For this project I will only allow a single URL input
    • The client will be a long-running web server
  • URL Frontier
    • Data structure that stores URLs waiting to be downloaded
    • Ensures prioritization and politeness so the crawler does not DDoS a website
    • A queue router puts data into queues; a queue selector pulls data from the given queues
    • Managed Redis queues hold the data in FIFO order
      • Each primary host has its own Redis key with an associated queue
    • Workers spin up to ingest data from the FIFO queue for each key
  • HTML Downloader (including DNS resolution)
    • Gets IP addresses from the DNS resolver and downloads the HTML content (a rough sketch of the download/extract steps follows this list)
  • Content Parser
    • Parses the HTML to ensure the raw text is not malformed
  • Content Seen?
    • Data store of MD5 hashes of HTML content. If the store already holds the hash produced by the parser, the data is thrown away and work continues; if it does not, the hash is stored.
  • Link Extractor
    • Extracts links from the HTML page
  • URL Filter
    • Receives the extracted links and stores the URLs
    • The URLs are then stored in the URL Frontier and the whole process continues
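
A minimal sketch of the HTML Downloader and Link Extractor steps, assuming the golang.org/x/net/html parser; the function name and URL below are illustrative, not this module's actual code.

package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// downloadAndExtract fetches a page and returns every href found in <a> tags.
func downloadAndExtract(rawURL string) ([]string, error) {
	resp, err := http.Get(rawURL) // net/http resolves DNS and downloads the body
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body) // tolerant parse into a node tree
	if err != nil {
		return nil, err
	}

	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" {
					links = append(links, attr.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links, nil
}

func main() {
	links, err := downloadAndExtract("https://example.com")
	if err != nil {
		fmt.Println("download failed:", err)
		return
	}
	fmt.Println(links)
}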
Diagram
                If either the DNS resolver or the
                parser fails, log the error and restart
                  ┌─────────────────────────┐
                  │                         │
                  │         ┌─────────┐     │
                  │         │DNS      │     │
                  │   ┌─────┤Resolver │     │
                  │   │     └───▲─────┘     │
                  │   │         │           │
                  │   │         │           │
┌─────────┐   ┌───▼───▼─┐   ┌───┴─────┐   ┌─┴───────┐
│         │   │         │   │         │   │         │
│Client   ├───►Frontier ├───►Html     ├───►Html     │
│         │   │         │   │Download │   │Parser   │
└─────────┘   └──▲───▲──┘   └─────────┘   └────┬────┘
                 │   │                         │
                 │   │       ┌───────┐    ┌────▼────┐  ┌─────────┐
                 │   │       ├───────┘    │         │  │         │
                 │   │       │ Data  ◄────┤Content  ├──►Link     │
                 │   │       │ Store │    │seen?    │  │extract  │
                 │   │       └───────┘    └┬────────┘  └────┬────┘
                 │   │                     │                │
                 │   └─────────────────────┘           ┌────▼────┐
                 │     If MD5 hash exists              │         │
                 │     restart to beginning            │URL      │
                 │                                     │Filter   │
                 │           ┌───────┐                 └───┬─────┘
                 │           ├───────┘                     │
                 └───────────┤Redis  ◄─────────────────────┘
                              │MQ     │   URLs are pushed to
                             └───────┘  redis MQ for processing
Models

The data models for this are simple. The queue data model takes the form of one Redis queue per host.

{ "wikipedia": ["https://wikipedia.com", "https://wikipedia.com/test"] }
{ "go": ["https://pkg.go.com/net/http", "go.com"] }

Using this model makes it possible to group crawler workers so that each group works only within its desired host.
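
As a rough sketch of how the queue router might push URLs onto per-host FIFO queues, assuming the github.com/redis/go-redis/v9 client; the "frontier:" key prefix is an assumption, not this module's actual naming.

package main

import (
	"context"
	"fmt"
	"net/url"

	"github.com/redis/go-redis/v9"
)

// enqueueByHost pushes a URL onto the FIFO list keyed by its host, so that
// workers for one host never touch another host's queue.
func enqueueByHost(ctx context.Context, rdb *redis.Client, raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	key := "frontier:" + u.Hostname()     // assumed key scheme: one list per host
	return rdb.RPush(ctx, key, raw).Err() // RPush keeps FIFO order per host
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	for _, u := range []string{"https://wikipedia.com", "https://wikipedia.com/test"} {
		if err := enqueueByHost(ctx, rdb, u); err != nil {
			fmt.Println("enqueue failed:", err)
		}
	}
}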

The seen-content data store will simply be a SQLite DB containing the MD5 hashes of all seen sites.

type SeenContentModel struct {
	ID   int    // primary key, unique
	Hash string // MD5 hash of the parsed HTML content
}
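
A minimal sketch of the content-seen check, assuming the github.com/mattn/go-sqlite3 driver and a table named "seen"; both are assumptions, not this module's actual schema.

package main

import (
	"crypto/md5"
	"database/sql"
	"encoding/hex"
	"fmt"

	_ "github.com/mattn/go-sqlite3" // assumed SQLite driver
)

// seenBefore reports whether the MD5 hash of the parsed HTML is already
// stored; if it is not, the hash is recorded so future duplicates are skipped.
func seenBefore(db *sql.DB, content string) (bool, error) {
	sum := md5.Sum([]byte(content))
	hash := hex.EncodeToString(sum[:])

	var one int
	err := db.QueryRow("SELECT 1 FROM seen WHERE hash = ?", hash).Scan(&one)
	if err == nil {
		return true, nil // hash already present: throw the page away
	}
	if err != sql.ErrNoRows {
		return false, err
	}
	_, err = db.Exec("INSERT INTO seen(hash) VALUES (?)", hash)
	return false, err
}

func main() {
	db, err := sql.Open("sqlite3", "seen.db")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if _, err := db.Exec("CREATE TABLE IF NOT EXISTS seen (id INTEGER PRIMARY KEY, hash TEXT UNIQUE)"); err != nil {
		panic(err)
	}
	dup, _ := seenBefore(db, "<html>example page</html>")
	fmt.Println("duplicate:", dup)
}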
Frontier

The queue router and queue selector will be contained within a single module. The queue router will receive links from the crawlers and enqueue them into the Redis queue for the matching host. The queue selector will work on a pub/sub mechanism.

Each queue has its own crawler. Crawler depth will be set by an environment variable when the program runs. When a new queue is added, a subscriber in the queue selector module spins up a new worker/crawler for it, as sketched below.
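
A rough sketch of that subscriber, assuming the github.com/redis/go-redis/v9 client; the "new-queues" channel name is an assumption, not this module's actual naming.

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// crawl drains one host's FIFO queue; BLPop blocks until a URL is available.
func crawl(ctx context.Context, rdb *redis.Client, key string) {
	for {
		vals, err := rdb.BLPop(ctx, 5*time.Second, key).Result()
		if err != nil {
			return // timeout or shutdown: let the worker exit
		}
		fmt.Println("crawling", vals[1]) // download/parse/extract would run here
	}
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// The queue router publishes a queue's key on this channel whenever it
	// creates a queue for a previously unseen host.
	sub := rdb.Subscribe(ctx, "new-queues")
	defer sub.Close()

	for msg := range sub.Channel() {
		go crawl(ctx, rdb, msg.Payload) // one worker/crawler per new queue
	}
}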
