Greedy, Regex-Aware Binary Downloader
Why
This project helps you automate scraping data and downloading assets from the internet. It is built on Go's regular expression engine and HCL for ease of use, performance, and flexibility.
Installation
Download and install the latest release.
Usage
Run the following command to generate a new configuration file in the current directory.
grab config generate
Note
Grab's configuration file uses HashiCorp's HCL.
You can always refer to their specification for topics not covered by the documentation in this repo.
Once you're happy with your configuration, you can check if everything is ok by running:
grab config check
To scrape and download assets, pass one or more URLs to the get subcommand:
# single URL
grab get https://url.to/scrape/files?from
# list of URLs
grab get urls.ini
# at least one of each
grab get https://my.url/and urls.ini list.ini
Note
The list of URLs can contain comments, as in the ini format: all lines starting with # or ; are ignored.
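The comment rule above can be sketched in Go as follows. This is only an illustration, not Grab's actual parser: the function name parseURLList is hypothetical, and trimming whitespace and skipping blank lines are assumptions that the note does not state.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseURLList returns the entries of an ini-style URL list.
// Lines starting with '#' or ';' are treated as comments.
// Assumption: surrounding whitespace is trimmed and blank lines are skipped.
func parseURLList(content string) []string {
	var urls []string
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue
		}
		urls = append(urls, line)
	}
	return urls
}

func main() {
	list := "# wallpapers\nhttps://example.com/gallery/1\n; disabled for now\n\nhttps://example.com/gallery/2\n"
	for _, u := range parseURLList(list) {
		fmt.Println(u)
	}
}
```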
Quickstart
The default configuration, generated with grab config generate, already works out of the box.
global {
  location = "/home/yourusername/Downloads/grab"
}

site "unsplash" {
  test = "unsplash"

  asset "image" {
    pattern = "contentUrl\":\"([^\"]+)\""
    capture = 1

    transform filename {
      pattern = "(?:.+)photos\\/(.*)"
      replace = "$${1}.jpg"
    }
  }

  info "title" {
    pattern = "meta[^>]+property=\"og:title\"[^>]+content=\"(?P<title>[^\"]+)\""
    capture = "title"
  }

  subdirectory {
    pattern = "\\(@(?P<username>\\w+)\\)"
    capture = "username"
    from = body
  }
}
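To make the asset and transform filename settings more concrete, here is a minimal Go sketch of how those two patterns behave with Go's regexp package. It assumes that the HCL strings unescape as usual (\" to ", \\ to \, and $${1} to the literal ${1}) and that the filename transform is applied to the captured asset URL. The function names and the sample page body are hypothetical, not taken from Grab's source.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns from the default configuration, written as they would look
// after HCL unescaping (an assumption about how the strings are decoded).
var (
	assetPattern    = regexp.MustCompile(`contentUrl":"([^"]+)"`)
	filenamePattern = regexp.MustCompile(`(?:.+)photos\/(.*)`)
)

// extractAssets returns capture group 1 (capture = 1) of every match in body.
func extractAssets(body string) []string {
	var urls []string
	for _, m := range assetPattern.FindAllStringSubmatch(body, -1) {
		urls = append(urls, m[1])
	}
	return urls
}

// transformFilename rewrites an asset URL using the replace template "${1}.jpg".
func transformFilename(assetURL string) string {
	return filenamePattern.ReplaceAllString(assetURL, "${1}.jpg")
}

func main() {
	// Hypothetical page fragment; a real body would come from an HTTP response.
	body := `{"contentUrl":"https://images.example.com/photos/abc123"}`
	for _, u := range extractAssets(body) {
		fmt.Println(u, "->", transformFilename(u))
	}
}
```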
For demonstration purposes, we can already download pictures from unsplash by using the following command:
grab get https://unsplash.com/photos/uOi3lg8fGl4
Warning
Please use this tool responsibly. Don't use it for denial-of-service attacks, and don't violate copyright or intellectual property rights!
Internally, the program checks each URL passed to get against the test pattern of every site block. If a URL matches, the program parses the page and finds all matches for the assets and data defined in the asset and info blocks.
Once all the asset URLs are gathered, the download starts.
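The flow described above can be approximated with the sketch below. It is a simplified model under stated assumptions, not Grab's implementation: the site struct, demoSites, matchSite, and gatherAssets are hypothetical names, only one asset pattern per site is modelled, and the page body is a literal string so that the example stays offline.

```go
package main

import (
	"fmt"
	"regexp"
)

// site is a hypothetical, simplified stand-in for a site block.
type site struct {
	name  string
	test  *regexp.Regexp // matched against the URL passed to get
	asset *regexp.Regexp // matched against the page body; group 1 is the asset URL
}

// demoSites mirrors the "unsplash" site from the default configuration.
func demoSites() []site {
	return []site{{
		name:  "unsplash",
		test:  regexp.MustCompile(`unsplash`),
		asset: regexp.MustCompile(`contentUrl":"([^"]+)"`),
	}}
}

// matchSite returns the first site whose test pattern matches pageURL.
func matchSite(sites []site, pageURL string) (site, bool) {
	for _, s := range sites {
		if s.test.MatchString(pageURL) {
			return s, true
		}
	}
	return site{}, false
}

// gatherAssets collects capture group 1 of every asset match in body.
func gatherAssets(s site, body string) []string {
	var out []string
	for _, m := range s.asset.FindAllStringSubmatch(body, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	pageURL := "https://unsplash.com/photos/uOi3lg8fGl4"
	body := `{"contentUrl":"https://images.example.com/photos/abc"}`
	if s, ok := matchSite(demoSites(), pageURL); ok {
		// Downloads would start only after all asset URLs are gathered.
		fmt.Println(s.name, gatherAssets(s, body))
	}
}
```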
After running the above command, you should have a new grab directory in your ~/Downloads folder, containing a subdirectory for each site defined in the configuration. Inside each site directory you will find all the assets extracted from the provided URLs.
The configuration syntax is based on a few fundamental blocks:
- global block defines the main download directory and global network options.
- site <name> blocks group other blocks based on the site URL.
- asset <name> blocks define what to look for from each site and how to download it.
- info <name> blocks define what strings to extract from the page body.
Additional configuration settings can be specified:
- network blocks to pass headers and other network options when making requests.
- transform url blocks to replace the asset URL before downloading.
- transform filename blocks to replace the asset's destination path.
- subdirectory blocks to organize downloads into subdirectories named by strings present in the page body or URL.
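As an illustration of the subdirectory block, the following Go sketch extracts a named capture group from a page body and uses it to build a destination path. The helper names namedCapture and destination, the sample body, and the location/site/subdirectory/filename layout are assumptions made for this example.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// subdirPattern is the subdirectory pattern from the default configuration,
// written as it would look after HCL unescaping.
var subdirPattern = regexp.MustCompile(`\(@(?P<username>\w+)\)`)

// namedCapture returns the text of the named group in the first match of re
// within source, or "" when there is no match or no such group.
func namedCapture(re *regexp.Regexp, source, group string) string {
	m := re.FindStringSubmatch(source)
	idx := re.SubexpIndex(group) // available since Go 1.15
	if m == nil || idx < 0 {
		return ""
	}
	return m[idx]
}

// destination joins the download location, site name, subdirectory, and
// file name. The layout is an assumption; empty elements are dropped by Join.
func destination(location, site, subdir, filename string) string {
	return filepath.Join(location, site, subdir, filename)
}

func main() {
	body := "Photo by Jane Doe (@janedoe) on Unsplash" // hypothetical page text
	subdir := namedCapture(subdirPattern, body, "username")
	fmt.Println(destination("/home/yourusername/Downloads/grab", "unsplash", subdir, "abc123.jpg"))
}
```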
For a more in-depth look into Grab's configuration options, check out the guide.
Command Options
To get help about any command, use the help subcommand or the --help flag:
# to list all available commands:
grab help
# to show instructions for a specific subcommand:
grab help <subcommand>
get
Arguments
Accepts URLs, paths to files containing lists of URLs, or both at the same time.
# grab get <url|file> [url|file...] [options]
grab get https://example.com/gallery/1 \
https://example.com/gallery/2 \
path/to/list.ini \
other/file.ini -n
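One way to tell URLs apart from list files is sketched below. The heuristic (an argument counts as a URL only if it parses with an http or https scheme and a host) is an assumption for illustration; the README does not say how Grab makes this distinction, and isRemoteURL and splitArgs are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
)

// isRemoteURL reports whether arg parses as an absolute http or https URL.
func isRemoteURL(arg string) bool {
	u, err := url.Parse(arg)
	return err == nil && (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

// splitArgs separates positional arguments into URLs and list-file paths.
// Anything that is not an http(s) URL is assumed to be a file path.
func splitArgs(args []string) (urls, files []string) {
	for _, a := range args {
		if isRemoteURL(a) {
			urls = append(urls, a)
		} else {
			files = append(files, a)
		}
	}
	return urls, files
}

func main() {
	urls, files := splitArgs([]string{
		"https://example.com/gallery/1",
		"https://example.com/gallery/2",
		"path/to/list.ini",
		"other/file.ini",
	})
	fmt.Println("urls:", urls)
	fmt.Println("files:", files)
}
```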
Options
Long | Short | Default | Description |
force | f | false | To overwrite already existing files |
config | c | nil | To specify the path to a configuration file |
strict | s | false | To stop the program at the first encountered error |
dry-run | n | false | To send requests without writing to the disk |
progress | p | false | To show a progress bar |
quiet | q | false | To suppress all output to stdout (errors will still be printed to stderr). This option takes precedence over verbose |
verbose | v | 1 | To set the verbosity level: -v is 1, -vv is 2 and so on... quiet overrides this option. |
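The interaction between some of these options can be summarised in a short Go sketch. The options struct and both functions are hypothetical. Treating level 0 as "no output" and letting dry-run win over force are assumptions; only the precedence of quiet over verbose comes from the table.

```go
package main

import "fmt"

// options mirrors a subset of the flags in the table above.
type options struct {
	force   bool
	dryRun  bool
	quiet   bool
	verbose int
}

// shouldWrite reports whether a file should be written to disk.
// Assumption: dry-run never writes, and existing files are kept unless force is set.
func shouldWrite(o options, exists bool) bool {
	if o.dryRun {
		return false
	}
	return !exists || o.force
}

// verbosity returns the effective verbosity level; quiet takes precedence.
// Assumption: level 0 means nothing is printed to stdout.
func verbosity(o options) int {
	if o.quiet {
		return 0
	}
	return o.verbose
}

func main() {
	o := options{force: true, quiet: true, verbose: 2}
	fmt.Println(shouldWrite(o, true), verbosity(o))
}
```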
Next steps
- Retries & Timeout
- Network options with inheritance
- URL manipulation
- Destination manipulation
- Improve logging
- Check for updates
- Display a progress bar
- Add HCL eval context functions
- Distribute via various package managers:
  - Homebrew
  - Apt
  - Chocolatey
  - Scoop
- Scripting language integration
- Plugin system
- Sequential jobs (like GitHub workflows)
Credits
License
Distributed under the MIT License.