grab

v2.0.0+incompatible (not the latest version of this module)
Published: Nov 8, 2018 License: BSD-3-Clause Imports: 16 Imported by: 1

README

Downloading the internet, one goroutine at a time!

$ go get github.com/cavaliercoder/grab

Grab is a Go package for downloading files from the internet with the following rad features:

  • Monitor download progress concurrently
  • Auto-resume incomplete downloads
  • Guess filename from content header or URL path
  • Safely cancel downloads using context.Context
  • Validate downloads using checksums
  • Download batches of files concurrently
  • Apply rate limiters

Requires Go v1.7+

Example

The following example downloads a PDF copy of the free eBook, "An Introduction to Programming in Go" into the current working directory.

resp, err := grab.Get(".", "http://www.golang-book.com/public/pdf/gobook.pdf")
if err != nil {
	log.Fatal(err)
}

fmt.Println("Download saved to", resp.Filename)

The following, more complete example allows for more granular control and periodically prints the download progress until it is complete.

The second time you run the example, it will auto-resume the previous download and exit sooner.

package main

import (
	"fmt"
	"os"
	"time"

	"github.com/cavaliercoder/grab"
)

func main() {
	// create client
	client := grab.NewClient()
	req, err := grab.NewRequest(".", "http://www.golang-book.com/public/pdf/gobook.pdf")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Invalid request: %v\n", err)
		os.Exit(1)
	}

	// start download
	fmt.Printf("Downloading %v...\n", req.URL())
	resp := client.Do(req)
	fmt.Printf("  %v\n", resp.HTTPResponse.Status)

	// start UI loop
	t := time.NewTicker(500 * time.Millisecond)
	defer t.Stop()

Loop:
	for {
		select {
		case <-t.C:
			fmt.Printf("  transferred %v / %v bytes (%.2f%%)\n",
				resp.BytesComplete(),
				resp.Size,
				100*resp.Progress())

		case <-resp.Done:
			// download is complete
			break Loop
		}
	}

	// check for errors
	if err := resp.Err(); err != nil {
		fmt.Fprintf(os.Stderr, "Download failed: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("Download saved to ./%v \n", resp.Filename)

	// Output:
	// Downloading http://www.golang-book.com/public/pdf/gobook.pdf...
	//   200 OK
	//   transferred 42970 / 2893557 bytes (1.49%)
	//   transferred 1207474 / 2893557 bytes (41.73%)
	//   transferred 2758210 / 2893557 bytes (95.32%)
	// Download saved to ./gobook.pdf
}

Design trade-offs

The primary use case for Grab is to concurrently download thousands of large files from remote file repositories where the remote files are immutable. Examples include operating system package repositories or ISO libraries.

Grab aims to provide robust, sane defaults. These are usually determined using the HTTP specifications, or by mimicking the behavior of common web clients like cURL, wget and common web browsers.

Grab aims to be stateless. The only state that exists is the remote files you wish to download and the local copy, which may be completed, partially completed or not yet created. The advantage of this is that the local file system is not cluttered unnecessarily with additional state files (like a .crdownload file). The disadvantage of this approach is that grab must make assumptions about the local and remote state; specifically, that they have not been modified by another program.

If the local or remote file is modified outside of grab, and you download the file again with resuming enabled, the local file will likely become corrupted. In this case, you might consider making remote files immutable, or disabling resume.

Grab aims to enable best-in-class functionality for more complex features through extensible interfaces, rather than reimplementation. For example, you can provide your own Hash algorithm to compute file checksums, or your own rate limiter implementation (with all the associated trade-offs) to rate limit downloads.

Documentation

Overview

Package grab provides an HTTP download manager implementation.

Get is the simplest way to download a file:

resp, err := grab.Get("/tmp", "http://example.com/example.zip")
// ...

Get will download the given URL and save it to the given destination directory. The destination filename will be determined automatically by grab using Content-Disposition headers returned by the remote server, or by inspecting the requested URL path.
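
The filename resolution described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not grab's actual implementation: guessFilename and baseName are hypothetical helpers, and the precedence (Content-Disposition first, URL path second) simply follows the order given in the paragraph above.

```go
package main

import (
	"errors"
	"fmt"
	"mime"
	"net/url"
	"path"
)

// baseName returns the last element of p, or "" if that element is not a
// usable filename. Stripping directories guards against names like "../x".
func baseName(p string) string {
	name := path.Base(p)
	if name == "." || name == ".." || name == "/" {
		return ""
	}
	return name
}

// guessFilename is a hypothetical helper (not part of grab's API). It prefers
// the filename parameter of a Content-Disposition header and falls back to
// the last element of the URL path.
func guessFilename(contentDisposition, rawURL string) (string, error) {
	if contentDisposition != "" {
		if _, params, err := mime.ParseMediaType(contentDisposition); err == nil {
			if name := baseName(params["filename"]); name != "" {
				return name, nil
			}
		}
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	if name := baseName(u.Path); name != "" {
		return name, nil
	}
	return "", errors.New("no filename could be determined")
}

func main() {
	fmt.Println(guessFilename(`attachment; filename="report.pdf"`, "http://example.com/download?id=1"))
	fmt.Println(guessFilename("", "http://example.com/files/example.zip"))
	fmt.Println(guessFilename("", "http://example.com/"))
}
```

The last call has neither a header nor a usable path element, which corresponds to the situation ErrNoFilename is documented to report.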

An empty destination string or "." means the transfer will be stored in the current working directory.

If a destination file already exists, grab will assume it is a complete or partially complete download of the requested file. If the remote server supports resuming interrupted downloads, grab will resume downloading from the end of the partial file. If the server does not support resumed downloads, the file will be retransferred in its entirety. If the file is already complete, grab will return successfully.
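
At the HTTP level, resuming is normally done with a Range request. The sketch below shows that mechanism using only the standard library and a local test server; it is not grab's code. The names resume and newServer are hypothetical, and the sketch does not handle the case where the local copy is already complete.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// newServer starts a local test server that serves content and honors Range
// requests (http.ServeContent implements them).
func newServer(content string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.ServeContent(w, r, "file.txt", time.Time{}, strings.NewReader(content))
	}))
}

// resume is a hypothetical helper: given the bytes already stored locally, it
// asks the server for the remainder and returns the full content.
func resume(url string, partial []byte) (data []byte, resumed bool, err error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, false, err
	}
	if len(partial) > 0 {
		// Request everything from the end of the partial copy onwards.
		req.Header.Set("Range", fmt.Sprintf("bytes=%d-", len(partial)))
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, false, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, false, err
	}
	switch resp.StatusCode {
	case http.StatusPartialContent:
		// The server honored the range: append to the existing bytes.
		return append(append([]byte{}, partial...), body...), true, nil
	case http.StatusOK:
		// The server ignored the range: the body is the entire file.
		return body, false, nil
	default:
		return nil, false, fmt.Errorf("unexpected status: %s", resp.Status)
	}
}

func main() {
	const content = "hello, resumable world"
	srv := newServer(content)
	defer srv.Close()

	// Pretend the first 7 bytes were saved by an earlier, interrupted run.
	data, resumed, err := resume(srv.URL, []byte(content[:7]))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q resumed=%v\n", data, resumed)
}
```

Note that the sketch trusts the partial bytes blindly, which is the same assumption the design trade-offs section warns about: if either copy changed in between, the result is corrupted.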

For control over the HTTP client, destination path, auto-resume, checksum validation and other settings, create a Client:

client := grab.NewClient()
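// Transport is an http.RoundTripper, so assert the concrete type first.
if t, ok := client.HTTPClient.Transport.(*http.Transport); ok {
	t.DisableCompression = true
}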

req, err := grab.NewRequest("/tmp", "http://example.com/example.zip")
// ...
req.NoResume = true
req.HTTPRequest.Header.Set("Authorization", "Basic YWxhZGRpbjpvcGVuc2VzYW1l")

resp := client.Do(req)
// ...

You can monitor the progress of downloads while they are transferring:

client := grab.NewClient()
req, err := grab.NewRequest("", "http://example.com/example.zip")
// ...
resp := client.Do(req)

t := time.NewTicker(time.Second)
defer t.Stop()

for {
	select {
	case <-t.C:
		fmt.Printf("%.02f%% complete\n", 100*resp.Progress())

	case <-resp.Done:
		if err := resp.Err(); err != nil {
			// ...
		}

		// ...
		return
	}
}

Constants

This section is empty.

Variables

var (
	// ErrBadLength indicates that the server response or an existing file does
	// not match the expected content length.
	ErrBadLength = errors.New("bad content length")

	// ErrBadChecksum indicates that a downloaded file failed to pass checksum
	// validation.
	ErrBadChecksum = errors.New("checksum mismatch")

	// ErrNoFilename indicates that a reasonable filename could not be
	// automatically determined using the URL or response headers from a server.
	ErrNoFilename = errors.New("no filename could be determined")

	// ErrNoTimestamp indicates that a timestamp could not be automatically
	// determined using the response headers from the remote server.
	ErrNoTimestamp = errors.New("no timestamp could be determined for the remote file")

	// ErrFileExists indicates that the destination path already exists.
	ErrFileExists = errors.New("file exists")
)
var DefaultClient = NewClient()

DefaultClient is the default client and is used by all Get convenience functions.

Functions

func GetBatch

func GetBatch(workers int, dst string, urlStrs ...string) (<-chan *Response, error)

GetBatch sends multiple HTTP requests and downloads the content of the requested URLs to the given destination directory using the given number of concurrent worker goroutines.

The Response for each requested URL is sent through the returned Response channel, as soon as a worker receives a response from the remote server. The Response can then be used to track the progress of the download while it is in progress.

The returned Response channel will be closed by Grab, only once all downloads have completed or failed.

If an error occurs during any download, it will be available via a call to the associated Response.Err.

For control over HTTP client headers, redirect policy, and other settings, create a Client instead.

func IsStatusCodeError

func IsStatusCodeError(err error) bool

IsStatusCodeError returns true if the given error is of type StatusCodeError.

Types

type Client

type Client struct {
	// HTTPClient specifies the http.Client which will be used for communicating
	// with the remote server during the file transfer.
	HTTPClient *http.Client

	// UserAgent specifies the User-Agent string which will be set in the
	// headers of all requests made by this client.
	//
	// The user agent string may be overridden in the headers of each request.
	UserAgent string

	// BufferSize specifies the size in bytes of the buffer that is used for
	// transferring all requested files. Larger buffers may result in faster
	// throughput but will use more memory and result in less frequent updates
	// to the transfer progress statistics. The BufferSize of each request can
	// be overridden on each Request object. Default: 32KB.
	BufferSize int
}

A Client is a file download client.

Clients are safe for concurrent use by multiple goroutines.

func NewClient

func NewClient() *Client

NewClient returns a new file download Client, using default configuration.

func (*Client) Do

func (c *Client) Do(req *Request) *Response

Do sends a file transfer request and returns a file transfer response, following policy (e.g. redirects, cookies, auth) as configured on the client's HTTPClient.

Like http.Get, Do blocks while the transfer is initiated, but returns as soon as the transfer has started in a background goroutine, or if it failed early.

An error is returned via Response.Err if caused by client policy (such as CheckRedirect), or if there was an HTTP protocol or IO error. Response.Err will block the caller until the transfer is completed, successfully or otherwise.

Example
client := NewClient()
req, err := NewRequest("/tmp", "http://example.com/example.zip")
if err != nil {
	panic(err)
}

resp := client.Do(req)
if err := resp.Err(); err != nil {
	panic(err)
}

fmt.Println("Download saved to", resp.Filename)

func (*Client) DoBatch

func (c *Client) DoBatch(workers int, requests ...*Request) <-chan *Response

DoBatch executes all the given requests using the given number of concurrent workers. Control is passed back to the caller as soon as the workers are initiated.

If the requested number of workers is less than one, a worker will be created for every request. I.e. all requests will be executed concurrently.

If an error occurs during any of the file transfers, it will be accessible via a call to the associated Response.Err.

The returned Response channel is closed only after all of the given Requests have completed, successfully or otherwise.

Example
// create multiple download requests
reqs := make([]*Request, 0)
for i := 0; i < 10; i++ {
	url := fmt.Sprintf("http://example.com/example%d.zip", i+1)
	req, err := NewRequest("/tmp", url)
	if err != nil {
		panic(err)
	}
	reqs = append(reqs, req)
}

// start downloads with 4 workers
client := NewClient()
respch := client.DoBatch(4, reqs...)

// check each response
for resp := range respch {
	if err := resp.Err(); err != nil {
		panic(err)
	}

	fmt.Printf("Downloaded %s to %s\n", resp.Request.URL(), resp.Filename)
}
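
The worker semantics described for DoBatch (a fixed number of workers, one worker per request when the count is less than one, and a result channel closed only after everything has finished) follow a common Go pattern. The sketch below models that pattern with plain strings instead of requests. doBatch here is a hypothetical stand-in written for this illustration, not grab's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// doBatch runs work for every job using the given number of workers and
// returns a channel of results that is closed once all jobs are done.
func doBatch(workers int, jobs []string, work func(string) string) <-chan string {
	if workers < 1 {
		// Mirror the documented rule: one worker per job.
		workers = len(jobs)
	}
	jobch := make(chan string)
	// Buffer the results so a slow receiver cannot stall the workers.
	out := make(chan string, len(jobs))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobch {
				out <- work(j)
			}
		}()
	}

	// Feed the workers in the background so the caller regains control
	// immediately, then close the result channel when all workers exit.
	go func() {
		for _, j := range jobs {
			jobch <- j
		}
		close(jobch)
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	urls := []string{"a.zip", "b.zip", "c.zip"}
	results := doBatch(2, urls, func(u string) string { return "downloaded " + u })
	for r := range results {
		fmt.Println(r)
	}
}
```

Results arrive in completion order, not submission order, so callers that need to match a result to its input should carry an identifier with each job.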

func (*Client) DoChannel

func (c *Client) DoChannel(reqch <-chan *Request, respch chan<- *Response)

DoChannel executes all requests sent through the given Request channel, one at a time, until it is closed by another goroutine. The caller is blocked until the Request channel is closed and all transfers have completed. All responses are sent through the given Response channel as soon as they are received from the remote servers and can be used to track the progress of each download.

Slow Response receivers will cause a worker to block and therefore delay the start of the transfer for an already initiated connection - potentially causing a server timeout. It is the caller's responsibility to ensure a sufficient buffer size is used for the Response channel to prevent this.

If an error occurs during any of the file transfers it will be accessible via the associated Response.Err function.

Example

This example uses DoChannel to create a Producer/Consumer model for downloading multiple files concurrently. This is similar to how DoBatch uses DoChannel under the hood except that it allows the caller to continually send new requests until they wish to close the request channel.

// create a request and a buffered response channel
reqch := make(chan *Request)
respch := make(chan *Response, 10)

// start 4 workers
client := NewClient()
wg := sync.WaitGroup{}
for i := 0; i < 4; i++ {
	wg.Add(1)
	go func() {
		client.DoChannel(reqch, respch)
		wg.Done()
	}()
}

go func() {
	// send requests
	for i := 0; i < 10; i++ {
		url := fmt.Sprintf("http://example.com/example%d.zip", i+1)
		req, err := NewRequest("/tmp", url)
		if err != nil {
			panic(err)
		}
		reqch <- req
	}
	close(reqch)

	// wait for workers to finish
	wg.Wait()
	close(respch)
}()

// check each response
for resp := range respch {
	// block until complete
	if err := resp.Err(); err != nil {
		panic(err)
	}

	fmt.Printf("Downloaded %s to %s\n", resp.Request.URL(), resp.Filename)
}

type Hook

type Hook func(*Response) error

A Hook is a user-provided callback function that grab can call at various stages of a request's lifecycle. If a hook returns an error, the associated request is canceled and the same error is returned on the Response object.

Hook functions are called synchronously and should never block unnecessarily. Calling Response methods that block until a download is complete, such as Response.Err, Response.Cancel or Response.Wait, from within a hook will deadlock. To cancel a download from a callback, simply return a non-nil error.

type RateLimiter

type RateLimiter interface {
	WaitN(ctx context.Context, n int) (err error)
}

RateLimiter is an interface that must be satisfied by any third-party rate limiters that may be used to limit download transfer speeds.

A recommended token bucket implementation can be found at https://godoc.org/golang.org/x/time/rate#Limiter.

Example
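req, err := NewRequest("", "http://www.golang-book.com/public/pdf/gobook.pdf")
if err != nil {
	log.Fatal(err)
}

// Attach a rate limiter of 1048576 bytes (1 MiB) per second. NewLimiter is
// not defined on this page; substitute any RateLimiter implementation, such
// as the token bucket from golang.org/x/time/rate.
req.RateLimiter = NewLimiter(1048576)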

resp := DefaultClient.Do(req)
if err := resp.Err(); err != nil {
	log.Fatal(err)
}

type Request

type Request struct {
	// Label is an arbitrary string which may be used to label a Request with a
	// user friendly name.
	Label string

	// Tag is an arbitrary interface which may be used to relate a Request to
	// other data.
	Tag interface{}

	// HTTPRequest specifies the http.Request to be sent to the remote server to
	// initiate a file transfer. It includes request configuration such as URL,
	// protocol version, HTTP method, request headers and authentication.
	HTTPRequest *http.Request

	// Filename specifies the path where the file transfer will be stored in
	// local storage. If Filename is empty or a directory, the true Filename will
	// be resolved using Content-Disposition headers or the request URL.
	//
	// An empty string means the transfer will be stored in the current working
	// directory.
	Filename string

	// SkipExisting specifies that ErrFileExists should be returned if the
	// destination path already exists. The existing file will not be checked for
	// completeness.
	SkipExisting bool

	// NoResume specifies that a partially completed download will be restarted
	// without attempting to resume any existing file. If the download is already
	// completed in full, it will not be restarted.
	NoResume bool

	// NoCreateDirectories specifies that any missing directories in the given
	// Filename path should not be created automatically, if they do not already
	// exist.
	NoCreateDirectories bool

	// IgnoreBadStatusCodes specifies that grab should accept any status code in
	// the response from the remote server. Otherwise, grab expects the response
	// status code to be within the 2XX range (after following redirects).
	IgnoreBadStatusCodes bool

	// IgnoreRemoteTime specifies that grab should not attempt to set the
	// timestamp of the local file to match the remote file.
	IgnoreRemoteTime bool

	// Size specifies the expected size of the file transfer if known. If the
	// server response size does not match, the transfer is cancelled and
	// ErrBadLength returned.
	Size int64

	// BufferSize specifies the size in bytes of the buffer that is used for
	// transferring the requested file. Larger buffers may result in faster
	// throughput but will use more memory and result in less frequent updates
	// to the transfer progress statistics. If a RateLimiter is configured,
	// BufferSize should be much lower than the rate limit. Default: 32KB.
	BufferSize int

	// RateLimiter allows the transfer rate of a download to be limited. The given
	// Request.BufferSize determines how frequently the RateLimiter will be
	// polled.
	RateLimiter RateLimiter

	// BeforeCopy is a user provided callback that is called immediately before
	// a request starts downloading. If BeforeCopy returns an error, the request
	// is cancelled and the same error is returned on the Response object.
	BeforeCopy Hook

	// AfterCopy is a user provided callback that is called immediately after a
	// request has finished downloading, before checksum validation and closure.
	// This hook is only called if the transfer was successful. If AfterCopy
	// returns an error, the request is canceled and the same error is returned on
	// the Response object.
	AfterCopy Hook
	// contains filtered or unexported fields
}

A Request represents an HTTP file transfer request to be sent by a Client.

func NewRequest

func NewRequest(dst, urlStr string) (*Request, error)

NewRequest returns a new file transfer Request suitable for use with Client.Do.

func (*Request) Context

func (r *Request) Context() context.Context

Context returns the request's context. To change the context, use WithContext.

The returned context is always non-nil; it defaults to the background context.

The context controls cancelation.

func (*Request) SetChecksum

func (r *Request) SetChecksum(h hash.Hash, sum []byte, deleteOnError bool)

SetChecksum sets the desired hashing algorithm and checksum value to validate a downloaded file. Once the download is complete, the given hashing algorithm will be used to compute the actual checksum of the downloaded file. If the checksums do not match, an error will be returned by the associated Response.Err method.

If deleteOnError is true, the downloaded file will be deleted automatically if it fails checksum validation.

To prevent corruption of the computed checksum, the given hash must not be used by any other request or goroutines.

To disable checksum validation, call SetChecksum with a nil hash.

Example
// create download request
req, err := NewRequest("", "http://example.com/example.zip")
if err != nil {
	panic(err)
}

// set request checksum
sum, err := hex.DecodeString("33daf4c03f86120fdfdc66bddf6bfff4661c7ca11c5da473e537f4d69b470e57")
if err != nil {
	panic(err)
}
req.SetChecksum(sha256.New(), sum, true)

// download and validate file
resp := DefaultClient.Do(req)
if err := resp.Err(); err != nil {
	panic(err)
}

func (*Request) URL

func (r *Request) URL() *url.URL

URL returns the URL to be downloaded.

func (*Request) WithContext

func (r *Request) WithContext(ctx context.Context) *Request

WithContext returns a shallow copy of r with its context changed to ctx. The provided ctx must be non-nil.

Example
// create context with a 100ms timeout
ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
defer cancel()

// create download request with context
req, err := NewRequest("", "http://example.com/example.zip")
if err != nil {
	panic(err)
}
req = req.WithContext(ctx)

// send download request
resp := DefaultClient.Do(req)
if err := resp.Err(); err != nil {
	fmt.Println("error: request cancelled")
}
Output:

error: request cancelled

type Response

type Response struct {
	// The Request that was submitted to obtain this Response.
	Request *Request

	// HTTPResponse represents the HTTP response received from an HTTP request.
	//
	// The response Body should not be used as it will be consumed and closed by
	// grab.
	HTTPResponse *http.Response

	// Filename specifies the path where the file transfer is stored in local
	// storage.
	Filename string

	// Size specifies the total expected size of the file transfer.
	Size int64

	// Start specifies the time at which the file transfer started.
	Start time.Time

	// End specifies the time at which the file transfer completed.
	//
	// This will return zero until the transfer has completed.
	End time.Time

	// CanResume specifies that the remote server advertised that it can resume
	// previous downloads, as the 'Accept-Ranges: bytes' header is set.
	CanResume bool

	// DidResume specifies that the file transfer resumed a previously incomplete
	// transfer.
	DidResume bool

	// Done is closed once the transfer is finalized, either successfully or with
	// errors. Errors are available via Response.Err
	Done chan struct{}
	// contains filtered or unexported fields
}

Response represents the response to a completed or in-progress download request.

A response may be returned as soon as an HTTP response is received from a remote server, but before the body content has started transferring.

All Response method calls are thread-safe.

func Get

func Get(dst, urlStr string) (*Response, error)

Get sends an HTTP request and downloads the content of the requested URL to the given destination file path. The caller is blocked until the download is completed, successfully or otherwise.

An error is returned if caused by client policy (such as CheckRedirect), or if there was an HTTP protocol or IO error.

For non-blocking calls or control over HTTP client headers, redirect policy, and other settings, create a Client instead.

Example
// download a file to /tmp
resp, err := Get("/tmp", "http://example.com/example.zip")
if err != nil {
	log.Fatal(err)
}

fmt.Println("Download saved to", resp.Filename)

func (*Response) BytesComplete

func (c *Response) BytesComplete() int64

BytesComplete returns the total number of bytes which have been copied to the destination, including any bytes that were resumed from a previous download.

func (*Response) BytesPerSecond

func (c *Response) BytesPerSecond() float64

BytesPerSecond returns the number of bytes transferred in the last second. If the download is already complete, the average bytes/sec for the life of the download is returned.

func (*Response) Cancel

func (c *Response) Cancel() error

Cancel cancels the file transfer by canceling the underlying Context for this Response. Cancel blocks until the transfer is closed and returns any error - typically context.Canceled.

func (*Response) Duration

func (c *Response) Duration() time.Duration

Duration returns the duration of a file transfer. If the transfer is in progress, the duration will be between now and the start of the transfer. If the transfer is complete, the duration will be between the start and end of the completed transfer process.

func (*Response) ETA

func (c *Response) ETA() time.Time

ETA returns the estimated time at which the download will complete, given the current BytesPerSecond. If the transfer has already completed, the actual end time will be returned.

func (*Response) Err

func (c *Response) Err() error

Err blocks the calling goroutine until the underlying file transfer is completed and returns any error that may have occurred. If the download is already completed, Err returns immediately.

func (*Response) IsComplete

func (c *Response) IsComplete() bool

IsComplete returns true if the download has completed. If an error occurred during the download, it can be returned via Err.

func (*Response) Progress

func (c *Response) Progress() float64

Progress returns the ratio of total bytes that have been downloaded. Multiply the returned value by 100 to return the percentage completed.

func (*Response) Wait

func (c *Response) Wait()

Wait blocks until the download is completed.

type StatusCodeError

type StatusCodeError int

StatusCodeError indicates that the server response had a status code that was not in the 200-299 range (after following any redirects).

func (StatusCodeError) Error

func (err StatusCodeError) Error() string

Directories

Path Synopsis
cmd
