goscrape - create offline browsable copies of websites
A web scraper built with Go. It downloads the content of a website and allows it to be archived and
read offline.
Features
Features and advantages over existing tools like wget, httrack, Teleport Pro:
- Free and open source
- Available for all platforms that Go supports
- Files are downloaded concurrently as required
- Downloaded asset files are skipped in a new scraper run if unchanged
- Redirected URLs don't duplicate downloads
- JPEG and PNG images can be converted down in quality to save disk space
- Excluded URLs will not be fetched (unlike wget)
- No incomplete temporary files are left on disk
- Assets from external domains are downloaded automatically
- Sane default values
- Built-in webserver provides easy local access to the downloaded files
- Webserver replays redirections just like the origin server
- Supports logging and logfile rotation - can run as a long-lived service
Limitations
- No GUI version, console only
Installation
There are two options to install goscrape2:
- Download and unpack a binary release from Releases, or
- Compile the latest release from source:
go install github.com/rickb777/goscrape2@latest
Compiling the tool from source requires a recent version of Go to be installed (v1.23 or later).
Usage
Scrape a website by running
goscrape2 http://website.com/interesting/stuff
To serve the downloaded website directory in a locally-run webserver, use
goscrape2 --serve website.com
Options
Options can use single or double dash (e.g. -v or --v).
Usage:
./goscrape2 [options] [<url> ...]
-H value
"name:value" HTTP header to use for scraping (can be repeated)
-concurrency int
the number of concurrent downloads (default 1)
-connect duration
time limit (with units, e.g. 1s) for each HTTP request to connect (default 30s)
-cookies string
file containing the cookie content
-depth int
download depth limit (default unlimited)
-dir directory
directory to write files to and to serve files from
-i regular expression
only include URLs that match a regular expression (can be repeated)
-imagequality int
image quality reduction, minimum 1 to maximum 99 (re-encoding disabled by default)
-laxage duration
adds to the 'expires' timestamp specified by the origin server, or creates one if absent.
If the origin is too conservative, this helps when doing successive runs; a negative value causes
revalidation instead.
-log string
output log file; use "-" for stdout (default "-")
-loopdelay duration
delay (with units, e.g. 1s) used between any two downloads
-port int
port to use for the webserver (default 8080)
-savecookiefile string
file to save the cookie content
-serve
serve the website using a webserver.
Scraping will happen only on demand using the first URL you provide.
-timeout duration
overall time limit (with units, e.g. 31s) for each HTTP request to connect and read the response
This is dependent on -connect and will always be greater than that timeout. (default 1m0s)
-tries int
the number of tries to download each file if the server gives a 5xx error (default 1)
-user string
user[:password] to use for HTTP authentication
-useragent string
user agent to use for scraping
-v verbose output
-x regular expression
exclude URLs that match a regular expression (can be repeated)
-z debug output
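For example, a run that limits the download depth, uses three concurrent downloads and excludes one section of the site might look like this (the URL and values are illustrative):
goscrape2 -depth 2 -concurrency 3 -x '/calendar/' http://website.com/interesting/stuff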
Environment
These environment variables may be set:
- GOSCRAPE_URLS - Adds URLs to the list to process (space separated)
- GOSCRAPE_INCLUDE - Adds regular expressions to the -i include list (space separated)
- GOSCRAPE_EXCLUDE - Adds regular expressions to the -x exclude list (space separated)
- HTTP_PROXY, HTTPS_PROXY - Controls the proxy used for outbound connections: either a complete URL or a "host[:port]",
in which case the "http" scheme is assumed. Authentication can be included with a complete URL.
- NO_PROXY - A comma-separated list of values specifying hosts that should be excluded from proxying. Each value is
represented by an IP address prefix (1.2.3.4), an IP address prefix in CIDR notation (1.2.3.4/8), a domain name, or a
special DNS label (*). An IP address prefix and domain name can also include a literal port number (1.2.3.4:80). A
domain name matches that name and all subdomains. A domain name with a leading "." matches subdomains only. For
example "foo.com" matches "foo.com" and "bar.foo.com"; ".y.com" matches "x.y.com" but not "y.com". A single
asterisk (*) indicates that no proxying should be done.
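For example, extra exclusions and an outbound proxy can be supplied via the environment (the values are illustrative):
GOSCRAPE_EXCLUDE='/calendar/ /login' HTTPS_PROXY=proxy.example.com:3128 goscrape2 http://website.com/interesting/stuff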
Cookies
Cookies can be passed in a file using the --cookies parameter (listed in the options above); the file should contain
cookies in the following format:
[{"name":"user","value":"123"},{"name":"sessioe","value":"sid"}]
HTTP uses ETags to tag the version of each resource. Each ETag is an opaque hash constructed by the server. Each file
also usually has a last-modified date. goscrape2 uses both of these items, when the server provides them, to reduce
the amount of work needed when multiple download sessions are run on the same start URL. Any file that has not been
modified doesn't need to be downloaded more than once. ETags and other metadata are stored in the state cache.
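ETags and last-modified dates are used in standard HTTP conditional requests; a manual check with curl illustrates the mechanism (the URL and ETag value are illustrative):
curl -i -H 'If-None-Match: "abc123"' http://website.com/interesting/stuff
# the server replies 304 Not Modified if the resource is unchanged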
State Cache
goscrape2 keeps its state database in ~/.local/state/goscrape-cache.txt, so its location depends on the user running
goscrape2. The file is only read in when goscrape2 starts; any external edits will be overwritten whilst goscrape2 is
running.
Provided goscrape2 has been stopped first, the cached files (see -dir) and the state database can be safely moved or
copied between servers, e.g. using rsync so that the files retain their timestamps.
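For example, with the default state file location and a download directory called website.com, the copy might look like this (adjust paths and hostname to suit):
rsync -a ~/.local/state/goscrape-cache.txt otherhost:.local/state/
rsync -a website.com/ otherhost:website.com/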
The state database is a text file that can be concatenated with another copy; any duplicates are resolved by selecting
whichever comes last. It can also be deleted, in which case future revalidation requests to the origin server will be
much less network-efficient. In either case, it will be rebuilt when goscrape2 is restarted, provided the origin
server is still reachable.
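For example, a state file copied from another machine (the name other-cache.txt is illustrative) can be merged in while goscrape2 is stopped:
cat other-cache.txt >> ~/.local/state/goscrape-cache.txt
Because its entries come last, they take precedence over any duplicates already present.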
The state database is automatically purged if the output directory doesn't exist when goscrape2 is started.
Logfile Rotation
For a long-running service, the logfile should be periodically rotated to avoid filling up the disk. goscrape2 is
designed to work well with Linux logrotate, for example using -log /var/log/goscrape.log and this configuration in
/etc/logrotate.d/goscrape2:
/var/log/goscrape.log {
daily
notifempty
minsize 1M
missingok
rotate 28
postrotate
pkill -hup goscrape2
endscript
compress
delaycompress
nocreate
}
Each day, logrotate checks whether the logfile has grown too big and, if so, moves it and then signals goscrape2 with SIGHUP.
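The rotation configuration can be checked without changing anything by running logrotate in debug mode:
logrotate -d /etc/logrotate.d/goscrape2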
SystemD Service
Example SystemD service and configuration files are in the systemd/ folder, to be deployed as:
- /usr/sbin/goscrape2 - binary
- /var/lib/goscrape - directory tree
- /var/log/goscrape.log - logfile
- /etc/default/goscrape.conf - default configuration
- /etc/logrotate.d/goscrape - log rotation
- /etc/systemd/system/goscrape.service - service definition
You will need to understand SystemD to use these template files.
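Once the files are in place, a typical sequence to start the service and watch its log might be (assuming the unit name goscrape.service as above):
systemctl daemon-reload
systemctl enable --now goscrape.service
tail -f /var/log/goscrape.log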
Thanks
This tool was derived from github.com/cornelk/goscrape with thanks to the developers.