Documentation ¶
Overview ¶
Package anydata provides a toolkit to transparently fetch data files, cache them locally, and automatically decompress and/or extract records from them. It does so through the Fetcher and Wrapper interfaces. The "formats" and "filters" sub-packages provide a variety of techniques for parsing and extracting records and fields, and are designed to interoperate.
Current support includes opening files from local paths and the following URL schemes:
http:// https:// ftp:// file://
Transparent decompression is enabled for files (including remote URLs) ending in:
.gz .bz2 .bzip2 .zip
Extracting files from .tar and .zip archives is also supported through the use of URL fragments (#) specifying the archive extraction path. This is supported for the following extensions:
.tar .tar.gz .tgz .tar.bz2 .tbz2 .tar.bzip2
Archives referenced multiple times are only downloaded once and re-used as necessary. For example, the following 4 resource strings will result in only 2 FTP downloads:
ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz#names.dmp
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz#nodes.dmp
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz#citations.dmp
To add support for new URL schemes, implement the Fetcher interface and call RegisterFetcher before any calls to GetFetcher. You will likely also want to use PutCachedFile and GetCachedFile to reduce network round trips. To add support for new archive or compression formats, implement the Wrapper interface and call RegisterWrapper.
Example (Usage) ¶
List matching lines from a species taxonomy inside a remote tarball.
package main

import (
	"bufio"
	"fmt"
	"strings"

	"github.com/pbnjay/anydata"
)

func main() {
	// get a Fetcher for names.dmp in the NCBI Taxonomy tarball
	taxNames := "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz#names.dmp"
	ftch, err := anydata.GetFetcher(taxNames)
	if err != nil {
		panic(err)
	}

	// download the tarball (if necessary)
	err = ftch.Fetch(taxNames)
	if err != nil {
		panic(err)
	}

	// get an io.Reader to read from names.dmp
	rdr, err := ftch.GetReader()
	if err != nil {
		panic(err)
	}

	// print every line containing "scientific name"
	scanner := bufio.NewScanner(rdr)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "scientific name") {
			fmt.Println(line)
		}
	}
	// report any read error encountered while scanning
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
Output:
1 | root | | scientific name |
2 | Bacteria | Bacteria <prokaryote> | scientific name |
6 | Azorhizobium | | scientific name |
7 | Azorhizobium caulinodans | | scientific name |
9 | Buchnera aphidicola | | scientific name |
10 | Cellvibrio | | scientific name |
11 | [Cellvibrio] gilvus | | scientific name |
13 | Dictyoglomus | | scientific name |
14 | Dictyoglomus thermophilum | | scientific name |
16 | Methylophilus | | scientific name |
17 | Methylophilus methylotrophus | | scientific name |
...
Functions ¶
func GetCachedFile ¶
GetCachedFile returns the contents of a file (identified by resource) from the cache. It returns nil if the resource is too old or does not exist.
func InitCache ¶
InitCache initializes the cache by loading prior cached dates and filenames from <cpath>/cacheinfo.json if it exists, and setting the desired data age (in days). If the cpath folder does not exist, it is created. If cacheinfo.json cannot be loaded, then an empty cache is created.
func PutCachedFile ¶
PutCachedFile saves the contents of a file (identified by resource) to the cache.
func RegisterFetcher ¶
func RegisterFetcher(f Fetcher)
RegisterFetcher adds f to the list of known Fetchers for use by GetFetcher.
func RegisterWrapper ¶
func RegisterWrapper(w Wrapper)
RegisterWrapper adds w to the list of known Wrappers for use by GetFetcher.
Types ¶
type Fetcher ¶
type Fetcher interface {
	// Fetch attempts to connect and/or fetch the resource (possibly asynchronously).
	// For non-file-based Fetchers, this is where API authentication, etc. should be verified.
	Fetch(resource string) error

	// GetReader returns the io.Reader for the resource.
	GetReader() (io.Reader, error)

	// Detect returns true if the resource string specified can be fetched by this instance.
	Detect(resource string) bool
}
Fetcher describes an instance that can be used to retrieve a data set (specified by a resource string) from a local/remote data source.
func GetFetcher ¶
GetFetcher returns a Fetcher (optionally wrapped by a matching Wrapper) that will work on the specified resource string. When more than one registered Fetcher (or Wrapper) matches, the last match in registration order is used, so later registrations take precedence over earlier ones.
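The "last matching in registration order" rule can be pictured as a reverse scan over the registered list. The sketch below is an illustration under that assumption and not the package's actual implementation; detector, prefixDetector and lastMatch are hypothetical names, and detector is a cut-down stand-in for Fetcher that keeps only Detect.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// detector is a hypothetical, cut-down stand-in for Fetcher.
type detector interface {
	Detect(resource string) bool
}

// prefixDetector matches resources that start with a given prefix.
type prefixDetector struct{ name, prefix string }

func (p prefixDetector) Detect(resource string) bool {
	return strings.HasPrefix(resource, p.prefix)
}

// lastMatch scans from the most recently registered entry backwards and
// returns the first one that accepts the resource.
func lastMatch(registered []detector, resource string) (detector, error) {
	for i := len(registered) - 1; i >= 0; i-- {
		if registered[i].Detect(resource) {
			return registered[i], nil
		}
	}
	return nil, errors.New("no registered detector matches " + resource)
}

func main() {
	registered := []detector{
		prefixDetector{"builtin-http", "http://"},
		prefixDetector{"custom-http", "http://"}, // registered later
	}
	d, err := lastMatch(registered, "http://example.com/data.gz")
	if err != nil {
		panic(err)
	}
	fmt.Println(d.(prefixDetector).name) // prints custom-http
}
```

Under this reading, registering a custom Fetcher for a scheme that already has a handler would shadow the earlier one.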
type Wrapper ¶
type Wrapper interface {
	// DetectWrap returns true if the pathname (and optional partname) specified suits this Wrapper.
	DetectWrap(pathname, partname string) bool

	// Wrap returns a wrapped Fetcher that decompresses and/or reads the optional partname from f.
	Wrap(f Fetcher, partname string) (Fetcher, error)
}
Wrapper describes an instance that can wrap an existing Fetcher with additional functionality (such as transparent decompression).
Directories ¶
Path | Synopsis |
---|---|
filters | Package filters provides a data-record filtering mechanism and basic implementations for typical use cases. |
formats | Package formats provides record-based data format specification and parsing methods which are suitable for automation. |