Documentation ¶
Index ¶
- Constants
- Variables
- func FindRepositoriesByString(s string) (urls []string, err error)
- func GetBaseDir() string
- func MoveCompressFile(src, dst string) (err error)
- func MustGlob(pattern string) []string
- func PrependSchema(s string) string
- func RandomEndpoint() string
- func Render(opts *RenderOpts) error
- func UserHomeDir() string
- type About
- type Client
- type CopyHook
- type Description
- type DirLaster
- type Doer
- type GetRecord
- type HTTPError
- type Harvest
- type Header
- type Identify
- type Interval
- type Laster
- type ListIdentifiers
- type ListMetadataFormats
- type ListRecords
- type ListSets
- type Metadata
- type MetadataFormat
- type MultiError
- type OAIError
- type Record
- type RenderOpts
- type Repository
- type Request
- type RequestNode
- type Response
- type ResumptionToken
- type Set
- type Values
Constants ¶
const (
	// DefaultTimeout on requests.
	DefaultTimeout = 10 * time.Minute
	// DefaultMaxRetries is the default number of retries on a single request.
	DefaultMaxRetries = 8
)
const Day = 24 * time.Hour
Day has 24 hours.
const Version = "0.3.6"
Version of tools.
Variables ¶
var (
	// StdClient is the standard lib http client.
	StdClient = &Client{Doer: http.DefaultClient}
	// DefaultClient is the more resilient client, that will retry and timeout.
	DefaultClient = &Client{Doer: CreateDoer(DefaultTimeout, DefaultMaxRetries)}
	// DefaultUserAgent to identify crawler, some endpoints do not like the Go
	// default (https://golang.org/src/net/http/request.go#L462), e.g.
	// https://calhoun.nps.edu/oai/request.
	DefaultUserAgent = fmt.Sprintf("metha/%s", Version)
	// ControlCharReplacer helps to deal with broken XML: http://eprints.vu.edu.au/perl/oai2. Add more
	// weird things to be cleaned before XML parsing here. Another faulty:
	// http://digitalcommons.gardner-webb.edu/do/oai/?from=2016-02-29&metadataPrefix=oai_dc&until=2016-03-31&verb=ListRecords.
	// Replace control chars outside XML char range.
	ControlCharReplacer = strings.NewReplacer(
		"\u0000", "", "\u0001", "", "\u0002", "", "\u0003", "",
		"\u0004", "", "\u0005", "", "\u0006", "", "\u0007", "",
		"\u0008", "", "\u0009", "", "\u000B", "", "\u000C", "",
		"\u000E", "", "\u000F", "", "\u0010", "", "\u0011", "",
		"\u0012", "", "\u0013", "", "\u0014", "", "\u0015", "",
		"\u0016", "", "\u0017", "", "\u0018", "", "\u0019", "",
		"\u001A", "", "\u001B", "", "\u001C", "", "\u001D", "",
		"\u001E", "", "\u001F", "", "\uFFFD", "", "\uFFFE", "",
	)
)
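The cleaning step described for ControlCharReplacer can be sketched as follows. This is a minimal illustration, not the package's own code: `replacer` is a shortened stand-in covering only three characters, and `record` and `cleanAndDecode` are hypothetical names introduced here.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// replacer is a shortened, hypothetical stand-in for ControlCharReplacer.
// It removes only three of the control characters listed above.
var replacer = strings.NewReplacer("\u0000", "", "\u0008", "", "\u001F", "")

// record is a hypothetical, minimal decoding target.
type record struct {
	Title string `xml:"title"`
}

// cleanAndDecode strips the control characters first and parses the
// remaining text as XML afterwards.
func cleanAndDecode(raw string) (record, error) {
	var r record
	err := xml.Unmarshal([]byte(replacer.Replace(raw)), &r)
	return r, err
}

func main() {
	raw := "<record><title>Hello\u0008 World</title></record>"
	r, err := cleanAndDecode(raw)
	fmt.Println(r.Title, err)
}
```

Without the replacement step, encoding/xml rejects such input because the character lies outside the XML character range.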
var (
	// BaseDir is where all data is stored.
	BaseDir = filepath.Join(UserHomeDir(), ".cache", "metha")
	// ErrAlreadySynced signals completion.
	ErrAlreadySynced = errors.New("already synced")
	// ErrInvalidEarliestDate for unparsable earliest date.
	ErrInvalidEarliestDate = errors.New("invalid earliest date")
)
var (
	ErrInvalidVerb      = errors.New("invalid OAI verb")
	ErrMissingVerb      = errors.New("missing verb")
	ErrCannotGenerateID = errors.New("cannot generate ID")
	ErrMissingURL       = errors.New("missing URL")
	ErrParameterMissing = errors.New("missing required parameter")
)
var EndpointList string
var Endpoints = splitNonEmpty(EndpointList, "\n")
Endpoints from https://git.io/fxvs0.
Functions ¶
func FindRepositoriesByString ¶ added in v0.1.29
func FindRepositoriesByString(s string) (urls []string, err error)
FindRepositoriesByString returns a list of already harvested base URLs given a fragment of the base URL.
func GetBaseDir ¶ added in v0.1.43
func GetBaseDir() string
GetBaseDir returns the base directory for the cache.
func MoveCompressFile ¶ added in v0.1.25
func MoveCompressFile(src, dst string) (err error)
MoveCompressFile will atomically move and compress a source file to a destination file.
func PrependSchema ¶
func PrependSchema(s string) string
PrependSchema prepends http, if it is missing.
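A rough sketch of the described behaviour, assuming that both http:// and https:// count as an existing scheme. `prependScheme` is a hypothetical re-implementation for illustration and may not match the package's exact rules.

```go
package main

import (
	"fmt"
	"strings"
)

// prependScheme returns s unchanged if it already starts with http:// or
// https://, otherwise it puts http:// in front. (Hypothetical helper.)
func prependScheme(s string) string {
	if strings.HasPrefix(s, "http://") || strings.HasPrefix(s, "https://") {
		return s
	}
	return "http://" + s
}

func main() {
	fmt.Println(prependScheme("example.org/oai"))
	fmt.Println(prependScheme("https://example.org/oai"))
}
```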
func RandomEndpoint ¶ added in v0.1.27
func RandomEndpoint() string
RandomEndpoint returns a random endpoint url.
func Render ¶ added in v0.2.16
func Render(opts *RenderOpts) error
Render renders a harvest to JSON or XML.
Types ¶
type About ¶
type About struct {
Body []byte `xml:",innerxml" json:"body,omitempty"`
}
About has additional record information.
type Client ¶
type Client struct {
Doer Doer
}
Client can execute requests.
func CreateClient ¶
CreateClient creates a client with timeout and retry properties.
type CopyHook ¶ added in v0.1.38
CopyHook is a Logrus hook that copies messages to a writer.
func NewCopyHook ¶ added in v0.1.38
NewCopyHook initializes a copy hook. By default, it copies Warn, Error, Fatal and Panic level messages. Override these by passing in other logrus.Level values.
type Description ¶
type Description struct {
Body []byte `xml:",innerxml"`
}
Description holds information about a set.
func (Description) GoString ¶
func (desc Description) GoString() string
GoString is a formatter for Description content.
type DirLaster ¶
DirLaster extracts the maximum value from the files of a directory. The values are extracted per file via TransformFunc, which gets a filename and returns a token. The tokens are sorted and the lexicographically largest element is returned.
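The transform, sort and take-the-largest idea can be sketched on a plain list of filenames. `lastToken` and `dateFromName` are hypothetical names, and the filename pattern (a date prefix such as 2016-03-31) is an assumption made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// lastToken applies transform to each filename, sorts the resulting
// tokens and returns the lexicographically largest one. Empty tokens are
// skipped; an empty string is returned if no token was found.
func lastToken(filenames []string, transform func(string) string) string {
	var tokens []string
	for _, fn := range filenames {
		if t := transform(fn); t != "" {
			tokens = append(tokens, t)
		}
	}
	if len(tokens) == 0 {
		return ""
	}
	sort.Strings(tokens)
	return tokens[len(tokens)-1]
}

// dateFromName assumes names that start with a YYYY-MM-DD date and
// returns that prefix, or an empty string for shorter names.
func dateFromName(name string) string {
	if len(name) < 10 {
		return ""
	}
	return name[:10]
}

func main() {
	files := []string{
		"2016-01-31-00000000.xml.gz",
		"2016-03-31-00000002.xml.gz",
		"2016-02-29-00000001.xml.gz",
	}
	fmt.Println(lastToken(files, dateFromName))
}
```

Lexicographic order coincides with chronological order here only because the assumed tokens are zero-padded ISO dates.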
type GetRecord ¶
type GetRecord struct {
Record Record `xml:"record,omitempty" json:"record,omitempty"`
}
GetRecord returns a single record.
type Harvest ¶
type Harvest struct {
	BaseURL string
	Format  string
	Set     string
	From    string
	Until   string
	Client  *Client

	// XXX: Factor these out into options.
	MaxRequests                int
	DisableSelectiveHarvesting bool
	CleanBeforeDecode          bool
	IgnoreHTTPErrors           bool
	MaxEmptyResponses          int
	SuppressFormatParameter    bool
	HourlyInterval             bool
	DailyInterval              bool
	ExtraHeaders               http.Header
	KeepTemporaryFiles         bool
	Delay                      int

	// XXX: Lazy via sync.Once?
	Identify *Identify
	Started  time.Time

	// Protects the rare case, where we are in the process of renaming
	// harvested files and get a termination signal at the same time.
	sync.Mutex
}
Harvest contains parameters for mass-download. MaxRequests and CleanBeforeDecode are switches to handle broken token implementations and funny chars in responses. Some repos do not support selective harvesting (e.g. zvdd.org/oai2). Set "DisableSelectiveHarvesting" to try to grab metadata from these repositories. From and Until must always be given with 2006-01-02 layout. TODO(miku): make zero type work (lazily run identify).
func NewHarvest ¶
NewHarvest creates a new harvest. A network connection will be used for an initial Identify request.
func (*Harvest) DateLayout ¶
DateLayout converts the repository endpoint's advertised granularity to Go date format strings.
type Header ¶
type Header struct {
	Status     string   `xml:"status,attr" json:"status,omitempty"`
	Identifier string   `xml:"identifier,omitempty" json:"identifier,omitempty"`
	DateStamp  string   `xml:"datestamp,omitempty" json:"datestamp,omitempty"`
	SetSpec    []string `xml:"setSpec,omitempty" json:"setSpec,omitempty"`
}
A Header is part of other requests.
type Identify ¶
type Identify struct {
	RepositoryName    string        `xml:"repositoryName,omitempty" json:"repositoryName,omitempty"`
	BaseURL           string        `xml:"baseURL,omitempty" json:"baseURL,omitempty"`
	ProtocolVersion   string        `xml:"protocolVersion,omitempty" json:"protocolVersion,omitempty"`
	AdminEmail        []string      `xml:"adminEmail,omitempty" json:"adminEmail,omitempty"`
	EarliestDatestamp string        `xml:"earliestDatestamp,omitempty" json:"earliestDatestamp,omitempty"`
	DeletedRecord     string        `xml:"deletedRecord,omitempty" json:"deletedRecord,omitempty"`
	Granularity       string        `xml:"granularity,omitempty" json:"granularity,omitempty"`
	Description       []Description `xml:"description,omitempty" json:"description,omitempty"`
}
Identify reports information about a repository.
type Interval ¶
Interval represents a span of time.
func (Interval) DailyIntervals ¶ added in v0.1.14
DailyIntervals segments a given interval into daily intervals.
func (Interval) HourlyIntervals ¶ added in v0.2.5
HourlyIntervals segments a given interval into hourly intervals.
func (Interval) MonthlyIntervals ¶
MonthlyIntervals segments a given interval into monthly intervals.
type ListIdentifiers ¶
type ListIdentifiers struct {
	Headers         []Header        `xml:"header,omitempty" json:"header,omitempty"`
	ResumptionToken ResumptionToken `xml:"resumptionToken,omitempty" json:"resumptionToken,omitempty"`
}
ListIdentifiers lists headers only.
type ListMetadataFormats ¶
type ListMetadataFormats struct {
MetadataFormat []MetadataFormat `xml:"metadataFormat,omitempty" json:"metadataFormat,omitempty"`
}
ListMetadataFormats lists supported metadata formats.
type ListRecords ¶
type ListRecords struct {
	Records         []Record        `xml:"record" json:"record"`
	ResumptionToken ResumptionToken `xml:"resumptionToken,omitempty" json:"resumptionToken,omitempty"`
}
ListRecords lists records.
type ListSets ¶
type ListSets struct {
	Set             []Set           `xml:"set,omitempty" json:"set,omitempty"`
	ResumptionToken ResumptionToken `xml:"resumptionToken,omitempty" json:"resumptionToken,omitempty"`
}
ListSets lists available sets.
type Metadata ¶
type Metadata struct {
Body []byte `xml:",innerxml"`
}
Metadata contains the actual metadata, conforming to varying schemas.
func (Metadata) MarshalJSON ¶
MarshalJSON marshals the metadata body.
type MetadataFormat ¶
type MetadataFormat struct {
	MetadataPrefix    string `xml:"metadataPrefix,omitempty" json:"metadataPrefix,omitempty"`
	Schema            string `xml:"schema,omitempty" json:"schema,omitempty"`
	MetadataNamespace string `xml:"metadataNamespace,omitempty" json:"metadataNamespace,omitempty"`
}
MetadataFormat holds information about a format.
type MultiError ¶
type MultiError struct {
Errors []error
}
MultiError collects a number of errors.
func (*MultiError) Error ¶
func (e *MultiError) Error() string
Error formats all error strings into a single string.
type OAIError ¶
type OAIError struct {
	Code    string `xml:"code,attr" json:"code,omitempty"`
	Message string `xml:",chardata" json:"message,omitempty"`
}
OAIError is an OAI protocol error.
type Record ¶
type Record struct {
	XMLName  xml.Name
	Header   Header   `xml:"header,omitempty" json:"header,omitempty"`
	Metadata Metadata `xml:"metadata,omitempty" json:"metadata,omitempty"`
	About    About    `xml:"about,omitempty" json:"about,omitempty"`
}
Record represents a single record.
type RenderOpts ¶ added in v0.2.16
type RenderOpts struct {
	Writer  io.Writer
	Harvest Harvest
	Root    string
	From    string
	Until   string
	UseJson bool
}
RenderOpts controls output by the metha-cat command.
type Repository ¶
type Repository struct {
BaseURL string
}
Repository represents an OAI endpoint.
func (Repository) Formats ¶
func (r Repository) Formats() ([]MetadataFormat, error)
Formats returns a list of metadata formats.
type Request ¶
type Request struct {
	BaseURL                 string
	Verb                    string
	Identifier              string
	MetadataPrefix          string
	From                    string
	Until                   string
	Set                     string
	ResumptionToken         string
	CleanBeforeDecode       bool
	SuppressFormatParameter bool
	ExtraHeaders            http.Header
}
A Request can express any OAI request. Not all combinations of values will yield valid requests.
type RequestNode ¶
type RequestNode struct {
	Verb           string `xml:"verb,attr" json:"verb,omitempty"`
	Set            string `xml:"set,attr" json:"set,omitempty"`
	MetadataPrefix string `xml:"metadataPrefix,attr" json:"metadataPrefix,omitempty"`
}
RequestNode carries the request information into the response.
type Response ¶
type Response struct {
	ResponseDate        string              `xml:"responseDate,omitempty" json:"responseDate,omitempty"`
	Request             RequestNode         `xml:"request,omitempty" json:"request,omitempty"`
	Error               OAIError            `xml:"error,omitempty" json:"error,omitempty"`
	GetRecord           GetRecord           `xml:"GetRecord,omitempty" json:"GetRecord,omitempty"`
	Identify            Identify            `xml:"Identify,omitempty" json:"Identify,omitempty"`
	ListIdentifiers     ListIdentifiers     `xml:"ListIdentifiers,omitempty" json:"ListIdentifiers,omitempty"`
	ListMetadataFormats ListMetadataFormats `xml:"ListMetadataFormats,omitempty" json:"ListMetadataFormats,omitempty"`
	ListRecords         ListRecords         `xml:"ListRecords,omitempty" json:"ListRecords,omitempty"`
	ListSets            ListSets            `xml:"ListSets,omitempty" json:"ListSets,omitempty"`
}
Response is the envelope. It can hold any OAI response kind.
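How one envelope type can hold different response kinds can be sketched with encoding/xml and a trimmed-down copy of the structs. The lowercase types, the `decode` helper and the sample document (including the example.org URL) are hypothetical and only mirror a few of the fields shown above.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// oaiError, identify and response are trimmed, hypothetical copies of
// the documented types, reduced to a few fields.
type oaiError struct {
	Code    string `xml:"code,attr"`
	Message string `xml:",chardata"`
}

type identify struct {
	RepositoryName string `xml:"repositoryName"`
	Granularity    string `xml:"granularity"`
}

type response struct {
	ResponseDate string   `xml:"responseDate"`
	Error        oaiError `xml:"error"`
	Identify     identify `xml:"Identify"`
}

// sample is a made-up Identify response.
const sample = `<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2016-03-31T12:00:00Z</responseDate>
  <request verb="Identify">http://example.org/oai</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>`

// decode parses an OAI-PMH document into the envelope. Parts that are
// absent in the document keep their zero values.
func decode(s string) (response, error) {
	var r response
	err := xml.Unmarshal([]byte(s), &r)
	return r, err
}

func main() {
	r, err := decode(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Identify.RepositoryName, r.Identify.Granularity)
}
```

A caller can then check which part is populated, for example whether the error code is empty.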
func (*Response) CompleteListSize ¶ added in v0.1.38
CompleteListSize returns the value of completeListSize, if it exists.
func (*Response) Cursor ¶ added in v0.1.38
Cursor returns the value of cursor, if it exists.
func (*Response) GetResumptionToken ¶
GetResumptionToken returns the resumption token or an empty string if it does not have a token. In addition, return an empty string, if cursor and complete list size are defined and are equal (doaj, refs #14865).
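The rule described for GetResumptionToken can be sketched as follows. `resumptionToken` and `effectiveToken` are hypothetical names, and comparing the two attributes as strings rather than as numbers is an assumption.

```go
package main

import "fmt"

// resumptionToken is a hypothetical, trimmed copy of the token type.
type resumptionToken struct {
	Text             string
	CompleteListSize string
	Cursor           string
}

// effectiveToken returns the token text, or an empty string if cursor
// and complete list size are both set and equal, which is treated as
// "nothing left to fetch".
func effectiveToken(t resumptionToken) string {
	if t.Cursor != "" && t.CompleteListSize != "" && t.Cursor == t.CompleteListSize {
		return ""
	}
	return t.Text
}

func main() {
	more := resumptionToken{Text: "abc", CompleteListSize: "300", Cursor: "100"}
	done := resumptionToken{Text: "abc", CompleteListSize: "300", Cursor: "300"}
	fmt.Printf("%q %q\n", effectiveToken(more), effectiveToken(done))
}
```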
func (*Response) HasResumptionToken ¶
HasResumptionToken determines if the response has a ResumptionToken.
type ResumptionToken ¶ added in v0.1.38
type ResumptionToken struct {
	Text             string `xml:",chardata"` // eyJhIjogWyIyMDE5LTAyLTIxV...
	CompleteListSize string `xml:"completeListSize,attr"`
	Cursor           string `xml:"cursor,attr"`
	ExpirationDate   string `xml:"expirationDate,attr"`
}
ResumptionToken with optional extra information.
type Set ¶
type Set struct {
	SetSpec        string      `xml:"setSpec,omitempty" json:"setSpec,omitempty"`
	SetName        string      `xml:"setName,omitempty" json:"setName,omitempty"`
	SetDescription Description `xml:"setDescription,omitempty" json:"setDescription,omitempty"`
}
A Set has a spec, name and description.
type Values ¶
Values enhances the builtin url.Values.
func (Values) EncodeVerbatim ¶
EncodeVerbatim is like Encode(), but does not escape the keys and values.
Directories ¶
Path | Synopsis |
---|---|
cmd | |
cmd/metha-snapshot | Download metadata from all known endpoints (or some supplied list), generate a single JSON file. |
extra | |
extra/_largecrawl | genjson extracts info from a stream of OAI DC XML records, e.g. |
extra/pkpindex | Small util to get journal info from https://index.pkp.sfu.ca currently including 1264043 records indexed from 4960 publications. |
xflag | Package xflag adds an additional flag type Array for repeated string flags. |