Documentation ¶
Overview ¶
Package filter implements flexible ISIL attachments with expression trees[1], serialized as JSON. The top-level key is the label that is to be given to a record. Here, this label is an ISIL. Each ISIL can specify a tree of filters. Intermediate nodes can be "or", "and" or "not" filters; leaf nodes contain filters that are matched against records (like "collection", "source" or "issn").
The only method a filter needs to implement is Apply. If the filter takes configuration options, it needs to implement UnmarshalJSON as well. Each filter can define arbitrary options, for example a HoldingsFilter can load KBART data from a single file or a list of URLs.
[1] https://en.wikipedia.org/wiki/Binary_expression_tree#Boolean_expressions
The simplest filter is one that says *yes* to all records:
{"DE-X": {"any": {}}}
On the command line:
$ span-tag -c '{"DE-X": {"any": {}}}' < input.ldj > output.ldj
Another, slightly more complex example: here, the ISIL "DE-14" is attached to a record if one of two alternatives holds, each consisting of a conjunction. The first says: IF "the record is from source id 55" AND "the record can be validated against one of the holding files given by their URLs", THEN "attach DE-14". The second says: IF "the record is from source id 49" AND "it validates against any one of the holding files given by their URLs" AND "the record belongs to any one of the given collections", THEN "attach DE-14".
{
    "DE-14": {
        "or": [
            {
                "and": [
                    {
                        "source": [
                            "55"
                        ]
                    },
                    {
                        "holdings": {
                            "urls": [
                                "http://www.jstor.org/kbart/collections/asii",
                                "http://www.jstor.org/kbart/collections/as"
                            ]
                        }
                    }
                ]
            },
            {
                "and": [
                    {
                        "source": [
                            "49"
                        ]
                    },
                    {
                        "holdings": {
                            "urls": [
                                "https://example.com/KBART_DE14",
                                "https://example.com/KBART_FREEJOURNALS"
                            ]
                        }
                    },
                    {
                        "collection": [
                            "Turkish Family Physicans Association (CrossRef)",
                            "Helminthological Society (CrossRef)",
                            "International Association of Physical Chemists (IAPC) (CrossRef)",
                            "The Society for Antibacterial and Antifungal Agents, Japan (CrossRef)",
                            "Fundacao CECIERJ (CrossRef)"
                        ]
                    }
                ]
            }
        ]
    }
}
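As a rough illustration of how such a tree evaluates, the sketch below builds a reduced version of the DE-14 rule directly in Go. Record is a hypothetical stand-in for finc.IntermediateSchema, the holdings check is left out, and the value receivers and the exampleTree helper are illustrative choices rather than the package's actual code.

```go
package main

import "fmt"

// Record is a hypothetical stand-in for finc.IntermediateSchema.
type Record struct {
	SourceID    string
	Collections []string
}

// Filter mirrors the one-method interface described in this package.
type Filter interface {
	Apply(r Record) bool
}

// SourceFilter matches if the record's source id is one of Values.
type SourceFilter struct{ Values []string }

func (f SourceFilter) Apply(r Record) bool {
	for _, v := range f.Values {
		if v == r.SourceID {
			return true
		}
	}
	return false
}

// CollectionFilter matches if the record belongs to any listed collection.
type CollectionFilter struct{ Values []string }

func (f CollectionFilter) Apply(r Record) bool {
	for _, want := range f.Values {
		for _, have := range r.Collections {
			if want == have {
				return true
			}
		}
	}
	return false
}

// AndFilter stops at the first sub-filter that rejects the record.
type AndFilter struct{ Filters []Filter }

func (f AndFilter) Apply(r Record) bool {
	for _, sub := range f.Filters {
		if !sub.Apply(r) {
			return false
		}
	}
	return true
}

// OrFilter stops at the first sub-filter that accepts the record.
type OrFilter struct{ Filters []Filter }

func (f OrFilter) Apply(r Record) bool {
	for _, sub := range f.Filters {
		if sub.Apply(r) {
			return true
		}
	}
	return false
}

// exampleTree is a reduced DE-14 rule: source 55, OR (source 49 AND collection).
func exampleTree() Filter {
	return OrFilter{Filters: []Filter{
		SourceFilter{Values: []string{"55"}},
		AndFilter{Filters: []Filter{
			SourceFilter{Values: []string{"49"}},
			CollectionFilter{Values: []string{"Fundacao CECIERJ (CrossRef)"}},
		}},
	}}
}

func main() {
	tree := exampleTree()
	fmt.Println(tree.Apply(Record{SourceID: "55"})) // true
	fmt.Println(tree.Apply(Record{SourceID: "49"})) // false
	fmt.Println(tree.Apply(Record{
		SourceID:    "49",
		Collections: []string{"Fundacao CECIERJ (CrossRef)"},
	})) // true
}
```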
It is relatively easy to add a new filter. Imagine we want to build a filter that only allows records that have the word "awesome" in their title.
We first define a new type:
type AwesomeFilter struct{}
We then implement the Apply method:
func (f *AwesomeFilter) Apply(is finc.IntermediateSchema) bool {
	return strings.Contains(strings.ToLower(is.ArticleTitle), "awesome")
}
That is all for the filter itself. We still need to register it, so we can use it in the configuration file. The "unmarshalFilter" function (filter.go) acts as a dispatcher:
func unmarshalFilter(name string, raw json.RawMessage) (Filter, error) {
	switch name {
	// Add more filters here.
	case "any":
		return &AnyFilter{}, nil
	case "doi":
		...
	// Register awesome filter. No configuration options, so no need to unmarshal.
	case "awesome":
		return &AwesomeFilter{}, nil
	...
We can then use the filter in the JSON configuration:
{"DE-X": {"awesome": {}}}
Further reading: http://theory.stanford.edu/~sergei/papers/sigmod10-index.pdf
XXX: Generalize. Only require fields that we need.
Taggable document should expose (maybe via interfaces):
SerialNumbers() []string
PublicationTitle() string
Date() string
Volume() string
Issue() string
DatabaseName() string

Tagger configuration, e.g. preferred method, failure tolerance.

tagger.Tag(v interface{}) []string { ... }
Variables ¶
var Cache = make(HoldingsCache)
Cache caches holdings information.
Types ¶
type AndFilter ¶ added in v0.1.75
type AndFilter struct {
Filters []Filter
}
AndFilter returns true only if all filters return true.
func (*AndFilter) Apply ¶ added in v0.1.75
func (f *AndFilter) Apply(is finc.IntermediateSchema) bool
Apply returns false if any of the filters returns false. Short circuited.
func (*AndFilter) UnmarshalJSON ¶ added in v0.1.75
UnmarshalJSON turns a config fragment into an and filter.
type AnyFilter ¶ added in v0.1.75
type AnyFilter struct {
Any struct{} `json:"any"`
}
AnyFilter validates any record.
type CacheValue ¶ added in v0.1.130
type CacheValue struct {
	SerialNumberMap map[string][]licensing.Entry `json:"s"` // key: ISSN
	WisoDatabaseMap map[string][]licensing.Entry `json:"w"` // key: WISO DB name
	TitleMap        map[string][]licensing.Entry `json:"t"` // key: publication title
}
CacheValue groups holdings and cache for fast lookups.
type CollectionFilter ¶
CollectionFilter returns true if the record belongs to any one of the collections.
func (*CollectionFilter) Apply ¶
func (f *CollectionFilter) Apply(is finc.IntermediateSchema) bool
Apply filter.
func (*CollectionFilter) UnmarshalJSON ¶ added in v0.1.75
func (f *CollectionFilter) UnmarshalJSON(p []byte) error
UnmarshalJSON turns a config fragment into a collection filter.
type DOIFilter ¶
type DOIFilter struct {
Values []string
}
DOIFilter allows records with a given DOI. Can be used in conjunction with "not" to create blacklists.
func (*DOIFilter) Apply ¶
func (f *DOIFilter) Apply(is finc.IntermediateSchema) bool
Apply applies the filter.
func (*DOIFilter) UnmarshalJSON ¶ added in v0.1.75
UnmarshalJSON turns a config fragment into a filter.
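The blacklist idea can be sketched by wrapping a DOI filter in a not filter. In this sketch Record is a hypothetical stand-in for finc.IntermediateSchema, exact string comparison of DOIs is an assumption, and newBlacklist and the DOI value are made up for illustration.

```go
package main

import "fmt"

// Record is a hypothetical stand-in for finc.IntermediateSchema.
type Record struct {
	DOI string
}

// Filter mirrors the one-method interface of this package.
type Filter interface {
	Apply(r Record) bool
}

// DOIFilter accepts records whose DOI is listed in Values.
// Exact string comparison is an assumption of this sketch.
type DOIFilter struct {
	Values []string
}

func (f *DOIFilter) Apply(r Record) bool {
	for _, v := range f.Values {
		if v == r.DOI {
			return true
		}
	}
	return false
}

// NotFilter inverts the wrapped filter.
type NotFilter struct {
	Filter Filter
}

func (f *NotFilter) Apply(r Record) bool {
	return !f.Filter.Apply(r)
}

// newBlacklist is a hypothetical helper: it rejects the given DOIs
// and accepts everything else.
func newBlacklist(dois ...string) Filter {
	return &NotFilter{Filter: &DOIFilter{Values: dois}}
}

func main() {
	bl := newBlacklist("10.1234/blocked")
	fmt.Println(bl.Apply(Record{DOI: "10.1234/blocked"})) // false
	fmt.Println(bl.Apply(Record{DOI: "10.1234/allowed"})) // true
}
```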
type Filter ¶
type Filter interface {
Apply(finc.IntermediateSchema) bool
}
Filter returns go or no-go for a given record.
type HoldingsCache ¶ added in v0.1.130
type HoldingsCache map[string]CacheValue
HoldingsCache caches items keyed by filename or URL. A configuration might refer to the same holding file hundreds or thousands of times, but we only want to store the content once. This map serves as a private singleton that holds licensing entries and precomputed shortcuts to find relevant entries (rows from KBART) by ISSN, WISO database name or title.
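The load-once behavior described above could be sketched as follows. Entry is a hypothetical stand-in for licensing.Entry, only the ISSN index is modeled, and getOrLoad with its loader callback is an invented helper, not part of this package. The sketch is not safe for concurrent use.

```go
package main

import "fmt"

// Entry is a hypothetical stand-in for licensing.Entry.
type Entry struct {
	Title string
	ISSN  string
}

// CacheValue keeps a precomputed index for one holdings document.
// Only the ISSN index is modeled in this sketch.
type CacheValue struct {
	SerialNumberMap map[string][]Entry // key: ISSN
}

// HoldingsCache is keyed by filename or URL.
type HoldingsCache map[string]CacheValue

// getOrLoad is a hypothetical helper: it calls load only the first
// time a name is seen and serves later requests from the map.
func (c HoldingsCache) getOrLoad(name string, load func(string) []Entry) CacheValue {
	if v, ok := c[name]; ok {
		return v
	}
	v := CacheValue{SerialNumberMap: make(map[string][]Entry)}
	for _, e := range load(name) {
		v.SerialNumberMap[e.ISSN] = append(v.SerialNumberMap[e.ISSN], e)
	}
	c[name] = v
	return v
}

func main() {
	calls := 0
	load := func(name string) []Entry {
		calls++ // count how often the holdings document is actually read
		return []Entry{{Title: "Journal A", ISSN: "1234-5678"}}
	}
	cache := make(HoldingsCache)
	cache.getOrLoad("holdings.tsv", load)
	v := cache.getOrLoad("holdings.tsv", load)
	fmt.Println(calls)                               // 1
	fmt.Println(len(v.SerialNumberMap["1234-5678"])) // 1
}
```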
type HoldingsFilter ¶ added in v0.1.75
type HoldingsFilter struct {
	// Keep cache keys only (filename or URL of holdings document).
	Names   []string `json:"-"`
	Verbose bool     `json:"verbose,omitempty"`
	// Beside ISSN, also try to compare by title, this is fuzzy, so disabled by default.
	CompareByTitle bool `json:"compare-by-title,omitempty"`
	// Allow direct access to entries, might replace Names.
	CachedValues map[string]*CacheValue `json:"cache,omitempty"`
}
HoldingsFilter compares a record to a KBART file. Since this filter lives in memory and the configuration for a single run (which this filter value is part of) might contain many other holdings filters, we only want to store the content once. This is done via a private cache. The holdings filter only needs to remember the keys (filename or URL) to access entries at runtime.
func (*HoldingsFilter) Apply ¶ added in v0.1.75
func (f *HoldingsFilter) Apply(is finc.IntermediateSchema) bool
Apply returns true, if there is a valid holding for a given record. This will take multiple attributes like date, volume, issue and embargo into account. This function is very specific: it works only with intermediate format and it uses specific information from that format to decide on attachment.
func (*HoldingsFilter) UnmarshalJSON ¶ added in v0.1.75
func (f *HoldingsFilter) UnmarshalJSON(p []byte) error
UnmarshalJSON deserializes this filter.
type ISSNFilter ¶ added in v0.1.75
ISSNFilter allows records with a certain ISSN.
func (*ISSNFilter) Apply ¶ added in v0.1.75
func (f *ISSNFilter) Apply(is finc.IntermediateSchema) bool
Apply applies ISSN filter on intermediate schema, no distinction between ISSN and EISSN.
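A minimal sketch of what "no distinction between ISSN and EISSN" could mean in practice: both lists of a record are checked against the same set. Record is a hypothetical stand-in for finc.IntermediateSchema, and the set-based Values field and newISSNFilter constructor are assumptions made for this example.

```go
package main

import "fmt"

// Record is a hypothetical stand-in for finc.IntermediateSchema.
type Record struct {
	ISSN  []string
	EISSN []string
}

// ISSNFilter holds the allowed serial numbers as a set.
// The set representation is an assumption of this sketch.
type ISSNFilter struct {
	Values map[string]struct{}
}

// newISSNFilter is a hypothetical constructor for the example.
func newISSNFilter(issns ...string) *ISSNFilter {
	f := &ISSNFilter{Values: make(map[string]struct{})}
	for _, v := range issns {
		f.Values[v] = struct{}{}
	}
	return f
}

// Apply checks print and electronic ISSNs against the same set.
func (f *ISSNFilter) Apply(r Record) bool {
	for _, list := range [][]string{r.ISSN, r.EISSN} {
		for _, v := range list {
			if _, ok := f.Values[v]; ok {
				return true
			}
		}
	}
	return false
}

func main() {
	f := newISSNFilter("1234-5678")
	fmt.Println(f.Apply(Record{ISSN: []string{"1234-5678"}}))  // true
	fmt.Println(f.Apply(Record{EISSN: []string{"1234-5678"}})) // true
	fmt.Println(f.Apply(Record{ISSN: []string{"0000-0000"}}))  // false
}
```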
func (*ISSNFilter) UnmarshalJSON ¶ added in v0.1.75
func (f *ISSNFilter) UnmarshalJSON(p []byte) error
UnmarshalJSON turns a config fragment into a filter.
type NotFilter ¶ added in v0.1.75
type NotFilter struct {
Filter Filter
}
NotFilter inverts another filter.
func (*NotFilter) Apply ¶ added in v0.1.75
func (f *NotFilter) Apply(is finc.IntermediateSchema) bool
Apply inverts another filter.
func (*NotFilter) UnmarshalJSON ¶ added in v0.1.75
UnmarshalJSON turns a config fragment into a not filter.
type OrFilter ¶ added in v0.1.75
type OrFilter struct {
Filters []Filter
}
OrFilter returns true if at least one filter matches.
func (*OrFilter) Apply ¶ added in v0.1.75
func (f *OrFilter) Apply(is finc.IntermediateSchema) bool
Apply returns true, if any of the filters returns true. Short circuited.
func (*OrFilter) UnmarshalJSON ¶ added in v0.1.75
UnmarshalJSON turns a config fragment into an or filter.
type PackageFilter ¶ added in v0.1.59
PackageFilter allows all records that belong to one of the given package names.
func (*PackageFilter) Apply ¶ added in v0.1.59
func (f *PackageFilter) Apply(is finc.IntermediateSchema) bool
Apply filters packages.
func (*PackageFilter) UnmarshalJSON ¶ added in v0.1.75
func (f *PackageFilter) UnmarshalJSON(p []byte) error
UnmarshalJSON turns a config fragment into a filter.
type SourceFilter ¶
type SourceFilter struct {
Values []string
}
SourceFilter allows all records with the given source id or ids.
func (*SourceFilter) Apply ¶
func (f *SourceFilter) Apply(is finc.IntermediateSchema) bool
Apply filter.
func (*SourceFilter) UnmarshalJSON ¶ added in v0.1.75
func (f *SourceFilter) UnmarshalJSON(p []byte) error
UnmarshalJSON turns a config fragment into a filter.
type SubjectFilter ¶ added in v0.1.130
SubjectFilter returns true if the record has an exact string match to one of the given subjects.
func (*SubjectFilter) Apply ¶ added in v0.1.130
func (f *SubjectFilter) Apply(is finc.IntermediateSchema) bool
Apply filter.
func (*SubjectFilter) UnmarshalJSON ¶ added in v0.1.130
func (f *SubjectFilter) UnmarshalJSON(p []byte) error
UnmarshalJSON turns a config fragment into a subject filter.
type Tagger ¶ added in v0.1.75
Tagger takes a list of tags (ISILs) and annotates an intermediate schema according to a number of filters, defined per label. The tagger is loaded directly from JSON.
func (*Tagger) Tag ¶ added in v0.1.75
func (t *Tagger) Tag(is finc.IntermediateSchema) finc.IntermediateSchema
Tag takes an intermediate schema record and returns a labeled version of that record.
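The tagging step might be sketched as below: every label whose filter accepts the record is attached to a copy of that record. Record and its Labels field are hypothetical stand-ins for finc.IntermediateSchema, and the FilterMap field name is an assumption, since the Tagger struct is not shown here. Labels are sorted only to make the output stable, because Go map iteration order is not defined.

```go
package main

import (
	"fmt"
	"sort"
)

// Record is a hypothetical stand-in for finc.IntermediateSchema.
type Record struct {
	SourceID string
	Labels   []string
}

// Filter mirrors the one-method interface of this package.
type Filter interface {
	Apply(r Record) bool
}

// AnyFilter accepts every record.
type AnyFilter struct{}

func (AnyFilter) Apply(Record) bool { return true }

// SourceFilter accepts records from one of the given source ids.
type SourceFilter struct{ Values []string }

func (f SourceFilter) Apply(r Record) bool {
	for _, v := range f.Values {
		if v == r.SourceID {
			return true
		}
	}
	return false
}

// Tagger maps a label (an ISIL) to its filter. The field name
// FilterMap is an assumption of this sketch.
type Tagger struct {
	FilterMap map[string]Filter
}

// Tag returns a copy of the record carrying every matching label.
func (t *Tagger) Tag(r Record) Record {
	var labels []string
	for label, f := range t.FilterMap {
		if f.Apply(r) {
			labels = append(labels, label)
		}
	}
	sort.Strings(labels) // stable output despite random map order
	r.Labels = labels
	return r
}

func main() {
	tagger := &Tagger{FilterMap: map[string]Filter{
		"DE-X":  AnyFilter{},
		"DE-14": SourceFilter{Values: []string{"55"}},
	}}
	fmt.Println(tagger.Tag(Record{SourceID: "55"}).Labels) // [DE-14 DE-X]
	fmt.Println(tagger.Tag(Record{SourceID: "49"}).Labels) // [DE-X]
}
```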
func (*Tagger) UnmarshalJSON ¶ added in v0.1.75
UnmarshalJSON unmarshals a complete filter config from serialized JSON.
type Tree ¶ added in v0.1.130
type Tree struct {
Root Filter
}
Tree allows polymorphic filters.
func (*Tree) Apply ¶ added in v0.1.130
func (t *Tree) Apply(is finc.IntermediateSchema) bool
Apply applies the root filter.
func (*Tree) UnmarshalJSON ¶ added in v0.1.130
UnmarshalJSON gathers the top level filter name and unmarshals the associated filter.
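One way to picture this: decode the fragment into a map with a single key, treat the key as the filter name, and hand the raw value to a dispatcher. The following is a sketch under assumptions, not the package's implementation. Record stands in for finc.IntermediateSchema, only three filters are dispatched, the shape of the "not" fragment is assumed, and parseTree is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Record is a hypothetical stand-in for finc.IntermediateSchema.
type Record struct{ SourceID string }

// Filter mirrors the one-method interface of this package.
type Filter interface {
	Apply(r Record) bool
}

type AnyFilter struct{}

func (f *AnyFilter) Apply(Record) bool { return true }

type SourceFilter struct{ Values []string }

func (f *SourceFilter) Apply(r Record) bool {
	for _, v := range f.Values {
		if v == r.SourceID {
			return true
		}
	}
	return false
}

type NotFilter struct{ Filter Filter }

func (f *NotFilter) Apply(r Record) bool { return !f.Filter.Apply(r) }

// Tree wraps whichever filter the configuration names at the top level.
type Tree struct{ Root Filter }

func (t *Tree) Apply(r Record) bool { return t.Root.Apply(r) }

// UnmarshalJSON reads the single top-level key as the filter name and
// passes the undecoded value on to the dispatcher.
func (t *Tree) UnmarshalJSON(p []byte) error {
	var m map[string]json.RawMessage
	if err := json.Unmarshal(p, &m); err != nil {
		return err
	}
	if len(m) != 1 {
		return errors.New("expected exactly one top-level filter name")
	}
	for name, raw := range m {
		f, err := unmarshalFilter(name, raw)
		if err != nil {
			return err
		}
		t.Root = f
	}
	return nil
}

// unmarshalFilter is a reduced dispatcher covering three filters.
func unmarshalFilter(name string, raw json.RawMessage) (Filter, error) {
	switch name {
	case "any":
		return &AnyFilter{}, nil
	case "source":
		var values []string
		if err := json.Unmarshal(raw, &values); err != nil {
			return nil, err
		}
		return &SourceFilter{Values: values}, nil
	case "not":
		// Assumed shape: the value is itself a one-key filter object.
		var sub Tree
		if err := json.Unmarshal(raw, &sub); err != nil {
			return nil, err
		}
		return &NotFilter{Filter: sub.Root}, nil
	default:
		return nil, fmt.Errorf("unknown filter: %s", name)
	}
}

// parseTree is a hypothetical convenience wrapper for the example.
func parseTree(s string) (*Tree, error) {
	var t Tree
	if err := json.Unmarshal([]byte(s), &t); err != nil {
		return nil, err
	}
	return &t, nil
}

func main() {
	t, err := parseTree(`{"not": {"source": ["55"]}}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(t.Apply(Record{SourceID: "55"})) // false
	fmt.Println(t.Apply(Record{SourceID: "49"})) // true
}
```

Because the nested fragment is decoded into another Tree, the same dispatch applies at every level, which is what lets one filter contain filters of any other kind.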