offliner

Offliner is a tool that makes a website viewable offline. It is a concurrent web crawler that crawls a website and saves all of its pages and static files into a directory.
It can use either multi-processing or multi-threading as its concurrency model.
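
To make the multi-threaded model concrete, here is a minimal sketch of mine, not offliner's actual code: a bounded number of goroutines fetches pages concurrently, with a buffered channel acting as a counting semaphore, much like the limit the -p flag described under Usage imposes.

package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
)

// fetchAll downloads each URL with at most limit goroutines in flight.
func fetchAll(urls []string, limit int) {
    sem := make(chan struct{}, limit) // counting semaphore
    var wg sync.WaitGroup
    for _, u := range urls {
        wg.Add(1)
        sem <- struct{}{} // block while limit fetches are already running
        go func(u string) {
            defer wg.Done()
            defer func() { <-sem }()
            resp, err := http.Get(u)
            if err != nil {
                fmt.Println("fetch failed:", err)
                return
            }
            defer resp.Body.Close()
            body, _ := io.ReadAll(resp.Body)
            fmt.Printf("fetched %s (%d bytes)\n", u, len(body))
            // a real crawler would save body and enqueue its links here
        }(u)
    }
    wg.Wait()
}

func main() {
    fetchAll([]string{"https://example.com"}, 50)
}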

Features

  • Serial scraping.
  • Multi-threaded scraping.
  • Multi-process scraping.
  • Save static files (CSS, JS, images).
  • Rewrite the links on saved pages to reference the local files (see the sketch after this list).
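
Link rewriting can be pictured with golang.org/x/net/html. The snippet below is a rough sketch, not offliner's implementation; localPath is a hypothetical helper that maps a page URL to the path of its saved copy. The sketch walks the parsed HTML tree and redirects href and src attributes to local files:

package main

import (
    "os"
    "strings"

    "golang.org/x/net/html"
)

// localPath is a hypothetical mapping from an absolute URL to the
// path of its saved local copy.
func localPath(u string) (string, bool) {
    if strings.HasPrefix(u, "https://example.com/") {
        return "./" + strings.TrimPrefix(u, "https://example.com/"), true
    }
    return "", false
}

// rewrite walks the node tree and points href/src attributes at
// their local copies when one exists.
func rewrite(n *html.Node) {
    if n.Type == html.ElementNode {
        for i, a := range n.Attr {
            if a.Key == "href" || a.Key == "src" {
                if p, ok := localPath(a.Val); ok {
                    n.Attr[i].Val = p
                }
            }
        }
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        rewrite(c)
    }
}

func main() {
    doc, err := html.Parse(os.Stdin)
    if err != nil {
        panic(err)
    }
    rewrite(doc)
    html.Render(os.Stdout, doc)
}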

Usage

You need to provide a full URL to start the scraping. Use the flags below to control the features. If you intend to use the multi-process mode, the "process" binary must be in the same directory as the "offliner" binary.

-h     show help.
-url   full URL of the start page.
-f     save static files too, and rewrite the pages so their links reference the local files.
-a     use multi-processing instead of multi-threading as the concurrency model.
-n     maximum number of pages to save (default 100).
-p     maximum number of execution units (goroutines or processes) to run at the same time (default 50).
-s     run the scraper in a non-concurrent (serial) fashion.

Examples

Multi-threaded scraping. Save max 100 pages using max 90 goroutines. Save static files too.

./offliner -url=https://urmia.ac.ir -n=100 -p=90 -f

Multi-process scraping. Save max 100 pages using max 50 processes.

./offliner -url=https://urmia.ac.ir -n=100 -p=50 -a

Serial scraping. Save max 100 pages. Save static files too.

./offliner -url=https://urmia.ac.ir -n=100 -s -f

Todo

  • Improve multi-processing design.
  • Add a logger.
  • Make the scraper a separate package (library).

License

GNU General Public License v3.0

Directories

Path          Synopsis
pkg
  progress    Package progress implements a thread-safe progress mechanism used by the crawler to keep track of its progress.
  queue       Package queue implements a queue data structure for strings.
  set         Package set implements a set data structure for strings.
  stack       Package stack implements a stack data structure for strings.
  workerpool  Package workerpool implements a worker pool with a fixed number of workers that apply a single task to a queue of URLs passed over a string channel.
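
As a rough illustration of the pattern these synopses suggest (not the packages' actual APIs), here is a fixed-size pool of workers applying one task to URLs received on a string channel, with a mutex-guarded counter standing in for the progress mechanism:

package main

import (
    "fmt"
    "sync"
)

// progress stands in for pkg/progress: a counter that several
// goroutines can update safely.
type progress struct {
    mu   sync.Mutex
    done int
}

func (p *progress) inc() {
    p.mu.Lock()
    p.done++
    p.mu.Unlock()
}

// run starts n workers that apply task to every URL received on urls,
// mirroring the pattern pkg/workerpool's synopsis describes.
func run(n int, urls <-chan string, task func(string)) {
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for u := range urls {
                task(u)
            }
        }()
    }
    wg.Wait()
}

func main() {
    p := &progress{}
    urls := make(chan string, 2)
    urls <- "https://example.com/a"
    urls <- "https://example.com/b"
    close(urls)
    run(4, urls, func(u string) {
        p.inc() // a real task would fetch and save the page here
    })
    fmt.Println("pages processed:", p.done)
}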
