Go package and CLI tool for saving a web page as a single HTML file
Obelisk is a Go package and CLI tool for saving a web page as a single HTML file with all of its assets embedded. It is inspired by the great Monolith and intended as an improvement on my old WARC package.
Features
- Embeds all resources (e.g. CSS, images, JavaScript), producing a single HTML5 document that is easy to store and share.
- If the submitted URL is not HTML (for example, a PDF), Obelisk still saves it as is.
- Assets are downloaded concurrently, which makes archiving a web page quite fast.
- Accepts cookies, which is useful for pages that require login or articles behind a paywall.
As Go package
Run the following command inside your Go project:
go get -u -v github.com/go-shiori/obelisk
Next, import Obelisk in your application:
import "github.com/go-shiori/obelisk"
Now you can use Obelisk's archival feature in your application. For basic usage, check the example.
As CLI application
You can download the latest version of Obelisk from the release page. To build from source, make sure you use Go >= 1.13, then run the following command:
go get -u -v github.com/go-shiori/obelisk/cmd/obelisk
Now you can use it from your terminal:
$ obelisk -h
CLI tool for saving web page as single HTML file
Usage:
obelisk [url1] [url2] ... [urlN] [flags]
Flags:
-z, --gzip gzip archival result
-h, --help help for obelisk
-i, --input string path to file which contains URLs
--insecure skip X.509 (TLS) certificate verification
-c, --load-cookies string path to Netscape cookie file
--max-concurrent-download int max concurrent download at a time (default 10)
--no-css disable CSS styling
--no-embeds remove embedded elements (e.g iframe)
--no-js disable JavaScript
--no-medias remove media elements (e.g img, audio)
-o, --output string path to save archival result
-q, --quiet disable logging
--skip-resource-url-error skip process resource url error
-t, --timeout int maximum time (in second) before request timeout (default 60)
-u, --user-agent string set custom user agent
--verbose more verbose logging
Some CLI behaviors need more explanation:
- The --input flag accepts a text file containing a list of URLs, one per line:
http://www.domain1.com/some/path
http://www.domain2.com/some/path
http://www.domain3.com/some/path
- The --load-cookies flag accepts a Netscape cookie file, which usually looks like this:
# Netscape HTTP Cookie File
# https://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file! Do not edit.
#HttpOnly_.google.com TRUE / FALSE 1631153524 KEY VALUE
#HttpOnly_.google.com TRUE /ads TRUE 1621062000 KEY VALUE
.developers.google.com TRUE / FALSE 1642167486 KEY VALUE
- If the --output flag is not specified, Obelisk generates a file name for the archive and saves it in the current working directory.
- If the --output flag is set to - and there is only one URL to process (from either the input file or the CLI arguments), the output is written to stdout.
- If the --output flag is specified but there is more than one URL to process, Obelisk generates a file name for each archive but keeps using the directory from the specified output path.
- If the --output flag is specified but points to an existing directory, Obelisk also generates a file name for the archive.
F.A.Q
Why is it named Obelisk?
It's inspired by Monolith, therefore it's Obelisk.
How does it compare to WARC?
My WARC package uses a bolt database to store the archival result, which makes it hard to share and view. I also think my WARC code is not easy to understand, so I often get confused when I try to add a feature or refactor it.
How does it compare to Monolith?
- Both embed all resources into the HTML file, mostly using base64 data URLs. The difference is that Obelisk uses inline <script> and <style> for external JavaScript and CSS files. This is done because on many pages the browser struggles to load JavaScript that is encoded into a data URL. Inlining scripts and styles also makes the archival result smaller, since they are not base64-encoded.
- In Obelisk, all requests to external URLs are disabled by default using Content Security Policy, while in Monolith this has to be specified manually. This is done because, in my opinion, an archive shouldn't need, and shouldn't be able, to send requests to external resources.
- In Obelisk, assets are downloaded concurrently. Thanks to this, Obelisk will (most of the time) be faster than Monolith when archiving a web page.
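To illustrate the size difference between the two embedding styles in the first point, here is a minimal sketch. The helper names are hypothetical and the code does not reproduce either tool's output; it only compares a script referenced through a base64 data URL with the same script inlined. A real inliner would also have to escape any `</script>` sequence inside the script body, which this sketch ignores.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// dataURL encodes content as a base64 data URL, the form typically used for
// binary assets such as images.
func dataURL(mimeType string, content []byte) string {
	return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(content)
}

// scriptAsDataURL references the script through a data URL in the src attribute.
func scriptAsDataURL(js []byte) string {
	return `<script src="` + dataURL("text/javascript", js) + `"></script>`
}

// scriptInline places the script text directly inside the element.
func scriptInline(js []byte) string {
	return "<script>" + string(js) + "</script>"
}

func main() {
	js := []byte("console.log('archived');")
	fmt.Println("data URL form:", len(scriptAsDataURL(js)), "bytes")
	fmt.Println("inline form:  ", len(scriptInline(js)), "bytes")
}
```

Base64 turns every 3 bytes of input into 4 characters, so the encoded payload is roughly a third larger than the original text before the data URL prefix is even counted.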
Why not just contribute to Monolith?
- I don't have any knowledge about Rust. I do want to learn it though.
- I plan to update Shiori, so I need a Go package for archiving web pages.
Attributions
The original logo was created by Freepik in their egypt and desert pack, which can be downloaded from www.flaticon.com.
License
Obelisk is distributed under the MIT license, which means you can use and modify it however you want. However, if you make an enhancement, please send a pull request if possible. If you like this project, please consider donating to me via PayPal or Ko-Fi.