offliner

Offliner is a tool that makes a website viewable offline. It is a concurrent web crawler that crawls a website and saves all of its pages and static files into a directory.
It can use either multi-processing or multi-threading as its concurrency model.
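
To make the multi-threaded model concrete, here is a minimal sketch of mine, not offliner's actual code: a bounded number of goroutines fetches pages concurrently, with a buffered channel acting as a counting semaphore, much like the limit the -p flag described under Usage imposes.

package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
)

// fetchAll downloads each URL with at most limit goroutines in flight.
func fetchAll(urls []string, limit int) {
    sem := make(chan struct{}, limit) // counting semaphore
    var wg sync.WaitGroup
    for _, u := range urls {
        wg.Add(1)
        sem <- struct{}{} // block while limit fetches are already running
        go func(u string) {
            defer wg.Done()
            defer func() { <-sem }()
            resp, err := http.Get(u)
            if err != nil {
                fmt.Println("fetch failed:", err)
                return
            }
            defer resp.Body.Close()
            body, _ := io.ReadAll(resp.Body)
            fmt.Printf("fetched %s (%d bytes)\n", u, len(body))
            // a real crawler would save body and enqueue its links here
        }(u)
    }
    wg.Wait()
}

func main() {
    fetchAll([]string{"https://example.com"}, 50)
}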

Features

  • Serial scraping.
  • Multi-threaded scraping.
  • Multi-process scraping.
  • Save static files (CSS, JS, images).
  • Rewrite the links on saved pages to reference the local files (see the sketch after this list).
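
Link rewriting can be pictured with golang.org/x/net/html. The snippet below is a rough sketch, not offliner's implementation; localPath is a hypothetical helper that maps a page URL to the path of its saved copy. The sketch walks the parsed HTML tree and redirects href and src attributes to local files:

package main

import (
    "os"
    "strings"

    "golang.org/x/net/html"
)

// localPath is a hypothetical mapping from an absolute URL to the
// path of its saved local copy.
func localPath(u string) (string, bool) {
    if strings.HasPrefix(u, "https://example.com/") {
        return "./" + strings.TrimPrefix(u, "https://example.com/"), true
    }
    return "", false
}

// rewrite walks the node tree and points href/src attributes at
// their local copies when one exists.
func rewrite(n *html.Node) {
    if n.Type == html.ElementNode {
        for i, a := range n.Attr {
            if a.Key == "href" || a.Key == "src" {
                if p, ok := localPath(a.Val); ok {
                    n.Attr[i].Val = p
                }
            }
        }
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        rewrite(c)
    }
}

func main() {
    doc, err := html.Parse(os.Stdin)
    if err != nil {
        panic(err)
    }
    rewrite(doc)
    html.Render(os.Stdout, doc)
}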

Usage

You need to provide a full URL to start the scraping. Use the flags below to control the features. If you intend to use the multi-process mode, the "process" binary must be in the same directory as the "offliner" binary.

-h     show help.
-url   full URL of the start page.
-f     save static files too, and rewrite the pages so their links reference the local files.
-a     use multi-processing instead of multi-threading as the concurrency model.
-n     maximum number of pages to save (default 100).
-p     maximum number of execution units (goroutines or processes) to run at the same time (default 50).
-s     run the scraper in a non-concurrent (serial) fashion.

Examples

Multi-threaded scraping. Save max 100 pages using max 90 goroutines. Save static files too.

./offliner -url=https://urmia.ac.ir -n=100 -p=90 -f

Multi-process scraping. Save max 100 pages using max 50 processes.

./offliner -url=https://urmia.ac.ir -n=100 -p=50 -a

Serial scraping. Save max 100 pages. Save static files too.

./offliner -url=https://urmia.ac.ir -n=100 -s -f

Todo

  • Improve multi-processing design.
  • Add a logger.
  • Make the scraper a separate package (library).

License

GNU General Public License v3.0

Directories

Path          Synopsis
pkg
  progress    Package progress implements a thread-safe progress mechanism used by the crawler to keep track of its progress.
  queue       Package queue implements a queue data structure for strings.
  set         Package set implements a set data structure for strings.
  stack       Package stack implements a stack data structure for strings.
  workerpool  Package workerpool implements a worker pool with a fixed number of workers that apply a single task to a queue of URLs passed over a string channel.
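
As a rough illustration of the pattern these synopses suggest (not the packages' actual APIs), here is a fixed-size pool of workers applying one task to URLs received on a string channel, with a mutex-guarded counter standing in for the progress mechanism:

package main

import (
    "fmt"
    "sync"
)

// progress stands in for pkg/progress: a counter that several
// goroutines can update safely.
type progress struct {
    mu   sync.Mutex
    done int
}

func (p *progress) inc() {
    p.mu.Lock()
    p.done++
    p.mu.Unlock()
}

// run starts n workers that apply task to every URL received on urls,
// mirroring the pattern pkg/workerpool's synopsis describes.
func run(n int, urls <-chan string, task func(string)) {
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for u := range urls {
                task(u)
            }
        }()
    }
    wg.Wait()
}

func main() {
    p := &progress{}
    urls := make(chan string, 2)
    urls <- "https://example.com/a"
    urls <- "https://example.com/b"
    close(urls)
    run(4, urls, func(u string) {
        p.inc() // a real task would fetch and save the page here
    })
    fmt.Println("pages processed:", p.done)
}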
