filter

package
v0.1.372
Published: Sep 26, 2024 License: GPL-3.0 Imports: 13 Imported by: 0

Documentation

Overview

Package filter implements flexible ISIL attachments with expression trees[1], serialized as JSON. The top-level key is the label to be given to a record; here, this label is an ISIL. Each ISIL can specify a tree of filters. Intermediate nodes can be "or", "and" or "not" filters; leaf nodes contain filters that are matched against records (like "collection", "source" or "issn").

A filter needs to implement Apply. If the filter takes configuration options, it needs to implement UnmarshalJSON as well. Each filter can define arbitrary options; for example, a HoldingsFilter can load KBART data from a single file or a list of URLs.

[1] https://en.wikipedia.org/wiki/Binary_expression_tree#Boolean_expressions

The simplest filter is one that says *yes* to all records:

{"DE-X": {"any": {}}}

On the command line:

$ span-tag -c '{"DE-X": {"any": {}}}' < input.ldj > output.ldj

Another, slightly more complex example: here, the ISIL "DE-14" is attached to a record if either of two alternatives holds, each consisting of a conjunction. The first says: IF "the record is from source id 55" AND "the record can be validated against one of the holding files given by their URLs", THEN "attach DE-14". The second says: IF "the record is from source id 49" AND "it validates against any one of the holding files given by their URLs" AND "the record belongs to any one of the given collections", THEN "attach DE-14".

{
  "DE-14": {
    "or": [
      {
        "and": [
          {
            "source": [
              "55"
            ]
          },
          {
            "holdings": {
              "urls": [
                "http://www.jstor.org/kbart/collections/asii",
                "http://www.jstor.org/kbart/collections/as"
              ]
            }
          }
        ]
      },
      {
        "and": [
          {
            "source": [
              "49"
            ]
          },
          {
            "holdings": {
              "urls": [
                "https://example.com/KBART_DE14",
                "https://example.com/KBART_FREEJOURNALS"
              ]
            }
          },
          {
            "collection": [
              "Turkish Family Physicans Association (CrossRef)",
              "Helminthological Society (CrossRef)",
              "International Association of Physical Chemists (IAPC) (CrossRef)",
              "The Society for Antibacterial and Antifungal Agents, Japan (CrossRef)",
              "Fundacao CECIERJ (CrossRef)"
            ]
          }
        ]
      }
    ]
  }
}
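
The evaluation of such a tree can be sketched in a few lines. The program below builds the shape of the "DE-14" configuration directly in Go instead of parsing JSON. All names (Record, FilterFunc, And, Or, de14) are hypothetical, and the Licensed field merely stands in for the outcome of a real holdings lookup.

```go
package main

import "fmt"

// Record is a hypothetical stand-in for finc.IntermediateSchema. Licensed
// stands in for the outcome of a holdings lookup.
type Record struct {
	SourceID   string
	Collection string
	Licensed   bool
}

// Filter mirrors the shape of the package's Filter interface.
type Filter interface {
	Apply(Record) bool
}

// FilterFunc adapts a plain function to the Filter interface.
type FilterFunc func(Record) bool

func (f FilterFunc) Apply(r Record) bool { return f(r) }

// And is true only if all children are true; it stops at the first false.
type And []Filter

func (a And) Apply(r Record) bool {
	for _, f := range a {
		if !f.Apply(r) {
			return false
		}
	}
	return true
}

// Or is true if any child is true; it stops at the first true.
type Or []Filter

func (o Or) Apply(r Record) bool {
	for _, f := range o {
		if f.Apply(r) {
			return true
		}
	}
	return false
}

// source matches records with the given source id.
func source(id string) Filter {
	return FilterFunc(func(r Record) bool { return r.SourceID == id })
}

// collection matches records that belong to any of the given collections.
func collection(names ...string) Filter {
	return FilterFunc(func(r Record) bool {
		for _, name := range names {
			if name == r.Collection {
				return true
			}
		}
		return false
	})
}

// holdings is a placeholder leaf; a real check would consult KBART data.
var holdings = FilterFunc(func(r Record) bool { return r.Licensed })

// de14 mirrors the two alternatives of the "DE-14" configuration.
var de14 = Or{
	And{source("55"), holdings},
	And{source("49"), holdings, collection("Helminthological Society (CrossRef)")},
}

func main() {
	fmt.Println(de14.Apply(Record{SourceID: "55", Licensed: true}))
	fmt.Println(de14.Apply(Record{SourceID: "49", Licensed: true, Collection: "Other"}))
}
```

The first record satisfies the first alternative; the second fails the collection condition of the second alternative and is not tagged.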

It is relatively easy to add a new filter. Imagine we want to build a filter that only allows records that have the word "awesome" in their title.

We first define a new type:

type AwesomeFilter struct{}

We then implement the Apply method:

func (f *AwesomeFilter) Apply(is finc.IntermediateSchema) bool {
    return strings.Contains(strings.ToLower(is.ArticleTitle), "awesome")
}

That is all the filter itself needs. To use it in the configuration file, we still need to register it. The "unmarshalFilter" function (filter.go) acts as a dispatcher:

func unmarshalFilter(name string, raw json.RawMessage) (Filter, error) {
    switch name {
    // Add more filters here.
    case "any":
        return &AnyFilter{}, nil
    case "doi":
        ...

    // Register awesome filter. No configuration options, so no need to unmarshal.
    case "awesome":
        return &AwesomeFilter{}, nil

    ...

We can then use the filter in the JSON configuration:

{"DE-X": {"awesome": {}}}
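
To see the pieces work together without the rest of the package, here is a self-contained sketch. Record is a hypothetical stand-in for finc.IntermediateSchema, and dispatch and parseConfig are cut-down, illustrative helpers, not the package's actual unmarshalFilter or Tagger code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Record is a hypothetical stand-in for finc.IntermediateSchema.
type Record struct {
	ArticleTitle string
}

// Filter mirrors the shape of the package's Filter interface.
type Filter interface {
	Apply(Record) bool
}

// AnyFilter accepts every record.
type AnyFilter struct{}

func (f *AnyFilter) Apply(Record) bool { return true }

// AwesomeFilter accepts records with "awesome" in the title, ignoring case.
type AwesomeFilter struct{}

func (f *AwesomeFilter) Apply(r Record) bool {
	return strings.Contains(strings.ToLower(r.ArticleTitle), "awesome")
}

// dispatch is a cut-down, illustrative name-based dispatcher. Neither filter
// takes options here, so raw is not inspected.
func dispatch(name string, raw json.RawMessage) (Filter, error) {
	switch name {
	case "any":
		return &AnyFilter{}, nil
	case "awesome":
		return &AwesomeFilter{}, nil
	}
	return nil, fmt.Errorf("unknown filter: %s", name)
}

// parseConfig maps each label to the single filter named below it.
func parseConfig(p []byte) (map[string]Filter, error) {
	var labels map[string]map[string]json.RawMessage
	if err := json.Unmarshal(p, &labels); err != nil {
		return nil, err
	}
	result := make(map[string]Filter)
	for label, node := range labels {
		if len(node) != 1 {
			return nil, fmt.Errorf("%s: expected exactly one top-level filter", label)
		}
		for name, raw := range node {
			f, err := dispatch(name, raw)
			if err != nil {
				return nil, err
			}
			result[label] = f
		}
	}
	return result, nil
}

func main() {
	filters, err := parseConfig([]byte(`{"DE-X": {"awesome": {}}}`))
	if err != nil {
		panic(err)
	}
	f := filters["DE-X"]
	fmt.Println(f.Apply(Record{ArticleTitle: "An Awesome Study"}))
	fmt.Println(f.Apply(Record{ArticleTitle: "A Dull Study"}))
}
```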

Further reading: http://theory.stanford.edu/~sergei/papers/sigmod10-index.pdf

XXX: Generalize. Only require fields that we need.

Taggable document should expose (maybe via interfaces):

SerialNumbers() []string
PublicationTitle() string
Date() string
Volume() string
Issue() string
DatabaseName() string

Tagger configuration, e.g. preferred method, failure tolerance.

tagger.Tag(v interface{}) []string { ... }

Index

Constants

This section is empty.

Variables

Cache caches holdings information.

Functions

This section is empty.

Types

type AndFilter added in v0.1.75

type AndFilter struct {
	Filters []Filter
}

AndFilter returns true only if all filters return true.

func (*AndFilter) Apply added in v0.1.75

func (f *AndFilter) Apply(is finc.IntermediateSchema) bool

Apply returns false if any of the filters returns false. Short-circuited.

func (*AndFilter) UnmarshalJSON added in v0.1.75

func (f *AndFilter) UnmarshalJSON(p []byte) (err error)

UnmarshalJSON turns a config fragment into an and filter.

type AnyFilter added in v0.1.75

type AnyFilter struct {
	Any struct{} `json:"any"`
}

AnyFilter validates any record.

func (*AnyFilter) Apply added in v0.1.75

Apply will just return true.

type CacheValue added in v0.1.130

type CacheValue struct {
	SerialNumberMap map[string][]licensing.Entry `json:"s"` // key: ISSN
	WisoDatabaseMap map[string][]licensing.Entry `json:"w"` // key: WISO DB name
	TitleMap        map[string][]licensing.Entry `json:"t"` // key: publication title
}

CacheValue groups holdings and cache for fast lookups.

type CollectionFilter

type CollectionFilter struct {
	Values *container.StringSet
}

CollectionFilter returns true if the record belongs to any one of the collections.

func (*CollectionFilter) Apply

Apply filter.

func (*CollectionFilter) UnmarshalJSON added in v0.1.75

func (f *CollectionFilter) UnmarshalJSON(p []byte) error

UnmarshalJSON turns a config fragment into a collection filter.

type DOIFilter

type DOIFilter struct {
	Values []string
}

DOIFilter allows records with a given DOI. Can be used in conjunction with "not" to create blacklists.

func (*DOIFilter) Apply

func (f *DOIFilter) Apply(is finc.IntermediateSchema) bool

Apply applies the filter.

func (*DOIFilter) UnmarshalJSON added in v0.1.75

func (f *DOIFilter) UnmarshalJSON(p []byte) error

UnmarshalJSON turns a config fragment into a filter.
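
The blacklist idea mentioned for DOIFilter can be sketched as a "not" wrapped around a DOI filter. The types below are simplified stand-ins with the same field names as the package's types. The exact-match comparison and the newBlacklist helper are assumptions made for illustration.

```go
package main

import "fmt"

// Record is a hypothetical stand-in for finc.IntermediateSchema.
type Record struct {
	DOI string
}

// Filter mirrors the shape of the package's Filter interface.
type Filter interface {
	Apply(Record) bool
}

// DOIFilter matches records whose DOI equals one of the values (assumed
// exact string comparison).
type DOIFilter struct {
	Values []string
}

func (f *DOIFilter) Apply(r Record) bool {
	for _, v := range f.Values {
		if v == r.DOI {
			return true
		}
	}
	return false
}

// NotFilter inverts the wrapped filter.
type NotFilter struct {
	Filter Filter
}

func (f *NotFilter) Apply(r Record) bool { return !f.Filter.Apply(r) }

// newBlacklist is a hypothetical helper: it rejects exactly the given DOIs.
func newBlacklist(dois ...string) Filter {
	return &NotFilter{Filter: &DOIFilter{Values: dois}}
}

func main() {
	blacklist := newBlacklist("10.1000/blocked")
	fmt.Println(blacklist.Apply(Record{DOI: "10.1000/blocked"}))
	fmt.Println(blacklist.Apply(Record{DOI: "10.1000/fine"}))
}
```

In a configuration, such a blacklist would typically sit inside an "and" next to the filters that select records in the first place.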

type Filter

type Filter interface {
	Apply(finc.IntermediateSchema) bool
}

Filter returns go or no for a given record.

type HoldingsCache added in v0.1.130

type HoldingsCache map[string]CacheValue

HoldingsCache caches items keyed by filename or URL. A configuration might refer to the same holding file hundreds or thousands of times, but we only want to store the content once. This map serves as a private singleton that holds licensing entries and precomputed shortcuts to find relevant entries (rows from KBART) by ISSN, WISO database name or title.
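
The load-once idea behind such a cache can be sketched as follows. This is not the package's implementation: Entry is a hypothetical stand-in for licensing.Entry, and onceCache, Loader and the mutex are assumptions chosen to keep the example self-contained and safe for concurrent use.

```go
package main

import (
	"fmt"
	"sync"
)

// Entry is a hypothetical stand-in for licensing.Entry.
type Entry struct {
	ISSN  string
	Title string
}

// CacheValue holds a lookup table, loosely modeled on the ISSN map above.
type CacheValue struct {
	SerialNumberMap map[string][]Entry
}

// Loader fetches the entries for a file name or URL (assumed signature).
type Loader func(name string) ([]Entry, error)

// onceCache stores one CacheValue per name and loads each name only once.
type onceCache struct {
	mu     sync.Mutex
	values map[string]CacheValue
}

// Get returns the cached value for name, loading and indexing it on first use.
func (c *onceCache) Get(name string, load Loader) (CacheValue, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.values[name]; ok {
		return v, nil
	}
	entries, err := load(name)
	if err != nil {
		return CacheValue{}, err
	}
	// Precompute the ISSN shortcut so later lookups avoid scanning all rows.
	v := CacheValue{SerialNumberMap: make(map[string][]Entry)}
	for _, e := range entries {
		v.SerialNumberMap[e.ISSN] = append(v.SerialNumberMap[e.ISSN], e)
	}
	if c.values == nil {
		c.values = make(map[string]CacheValue)
	}
	c.values[name] = v
	return v, nil
}

func main() {
	loads := 0
	load := func(name string) ([]Entry, error) {
		loads++
		return []Entry{{ISSN: "1234-5678", Title: "Journal A"}}, nil
	}
	var cache onceCache
	for i := 0; i < 3; i++ {
		if _, err := cache.Get("holdings.tsv", load); err != nil {
			panic(err)
		}
	}
	fmt.Println(loads) // the loader ran once for three lookups
}
```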

type HoldingsFilter added in v0.1.75

type HoldingsFilter struct {
	// Keep cache keys only (filename or URL of holdings document).
	Names   []string `json:"-"`
	Verbose bool     `json:"verbose,omitempty"`
	// Beside ISSN, also try to compare by title, this is fuzzy, so disabled by default.
	CompareByTitle bool `json:"compare-by-title,omitempty"`
	// Allow direct access to entries, might replace Names.
	CachedValues map[string]*CacheValue `json:"cache,omitempty"`
}

HoldingsFilter compares a record to a KBART file. Since this filter lives in memory and the configuration for a single run (which this filter value is part of) might contain many other holdings filters, we only want to store the content once. This is done via a private cache. The holdings filter only needs to remember the keys (filename or URL) to access entries at runtime.

func (*HoldingsFilter) Apply added in v0.1.75

Apply returns true if there is a valid holding for a given record. This will take multiple attributes like date, volume, issue and embargo into account. This function is very specific: it works only with the intermediate format and it uses specific information from that format to decide on attachment.

func (*HoldingsFilter) UnmarshalJSON added in v0.1.75

func (f *HoldingsFilter) UnmarshalJSON(p []byte) error

UnmarshalJSON deserializes this filter.

type ISBNFilter added in v0.1.367

type ISBNFilter struct {
	Values *container.StringSet
}

ISBNFilter allows records with a certain ISBN.

func (*ISBNFilter) Apply added in v0.1.367

func (f *ISBNFilter) Apply(is finc.IntermediateSchema) bool

Apply applies ISBN filter on intermediate schema, no distinction between print or electronic ISBN.

func (*ISBNFilter) UnmarshalJSON added in v0.1.367

func (f *ISBNFilter) UnmarshalJSON(p []byte) error

UnmarshalJSON turns a config fragment into a filter.

type ISSNFilter added in v0.1.75

type ISSNFilter struct {
	Values *container.StringSet
}

ISSNFilter allows records with a certain ISSN.

func (*ISSNFilter) Apply added in v0.1.75

func (f *ISSNFilter) Apply(is finc.IntermediateSchema) bool

Apply applies ISSN filter on intermediate schema, no distinction between ISSN and EISSN.

func (*ISSNFilter) UnmarshalJSON added in v0.1.75

func (f *ISSNFilter) UnmarshalJSON(p []byte) error

UnmarshalJSON turns a config fragment into a filter.

type NotFilter added in v0.1.75

type NotFilter struct {
	Filter Filter
}

NotFilter inverts another filter.

func (*NotFilter) Apply added in v0.1.75

func (f *NotFilter) Apply(is finc.IntermediateSchema) bool

Apply inverts another filter.

func (*NotFilter) UnmarshalJSON added in v0.1.75

func (f *NotFilter) UnmarshalJSON(p []byte) (err error)

UnmarshalJSON turns a config fragment into a not filter.

type OrFilter added in v0.1.75

type OrFilter struct {
	Filters []Filter
}

OrFilter returns true if at least one filter matches.

func (*OrFilter) Apply added in v0.1.75

func (f *OrFilter) Apply(is finc.IntermediateSchema) bool

Apply returns true if any of the filters returns true. Short-circuited.

func (*OrFilter) UnmarshalJSON added in v0.1.75

func (f *OrFilter) UnmarshalJSON(p []byte) (err error)

UnmarshalJSON turns a config fragment into an or filter.

type PackageFilter added in v0.1.59

type PackageFilter struct {
	Values *container.StringSet
}

PackageFilter allows all records that belong to one of the given package names.

func (*PackageFilter) Apply added in v0.1.59

Apply filters packages.

func (*PackageFilter) UnmarshalJSON added in v0.1.75

func (f *PackageFilter) UnmarshalJSON(p []byte) error

UnmarshalJSON turns a config fragment into a filter.

type SourceFilter

type SourceFilter struct {
	Values []string
}

SourceFilter allows all records with the given source id or ids.

func (*SourceFilter) Apply

Apply filter.

func (*SourceFilter) UnmarshalJSON added in v0.1.75

func (f *SourceFilter) UnmarshalJSON(p []byte) error

UnmarshalJSON turns a config fragment into a filter.

type SubjectFilter added in v0.1.130

type SubjectFilter struct {
	Values *container.StringSet
}

SubjectFilter returns true if the record has an exact string match to one of the given subjects.

func (*SubjectFilter) Apply added in v0.1.130

Apply filter.

func (*SubjectFilter) UnmarshalJSON added in v0.1.130

func (f *SubjectFilter) UnmarshalJSON(p []byte) error

UnmarshalJSON turns a config fragment into a subject filter.

type Tagger added in v0.1.75

type Tagger struct {
	FilterMap map[string]Tree
}

Tagger takes a list of tags (ISILs) and annotates an intermediate schema according to a number of filters, defined per label. The tagger is loaded directly from JSON.

func (*Tagger) Tag added in v0.1.75

Tag takes an intermediate schema record and returns a labeled version of that record.

func (*Tagger) UnmarshalJSON added in v0.1.75

func (t *Tagger) UnmarshalJSON(p []byte) error

UnmarshalJSON unmarshals a complete filter config from serialized JSON.
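
The tagging step itself can be pictured as one loop over the label map. The sketch below uses simplified stand-ins: Record replaces finc.IntermediateSchema, its Labels field is a hypothetical place to store the result, and the sorted output order is a choice made here for reproducibility, not something the package documents.

```go
package main

import (
	"fmt"
	"sort"
)

// Record is a hypothetical stand-in for finc.IntermediateSchema.
type Record struct {
	SourceID string
	Labels   []string
}

// Filter mirrors the shape of the package's Filter interface.
type Filter interface {
	Apply(Record) bool
}

// AnyFilter accepts every record.
type AnyFilter struct{}

func (AnyFilter) Apply(Record) bool { return true }

// SourceFilter accepts records with one of the given source ids.
type SourceFilter struct {
	Values []string
}

func (f SourceFilter) Apply(r Record) bool {
	for _, v := range f.Values {
		if v == r.SourceID {
			return true
		}
	}
	return false
}

// Tree wraps the root filter of one label.
type Tree struct {
	Root Filter
}

func (t Tree) Apply(r Record) bool { return t.Root.Apply(r) }

// Tagger maps labels (ISILs) to filter trees.
type Tagger struct {
	FilterMap map[string]Tree
}

// Tag returns a copy of the record carrying every label whose tree accepts it.
func (t *Tagger) Tag(r Record) Record {
	var labels []string
	for label, tree := range t.FilterMap {
		if tree.Apply(r) {
			labels = append(labels, label)
		}
	}
	sort.Strings(labels) // map iteration order is random, so sort for stable output
	r.Labels = labels
	return r
}

func main() {
	tagger := &Tagger{FilterMap: map[string]Tree{
		"DE-X":  {Root: AnyFilter{}},
		"DE-14": {Root: SourceFilter{Values: []string{"55"}}},
	}}
	fmt.Println(tagger.Tag(Record{SourceID: "55"}).Labels)
	fmt.Println(tagger.Tag(Record{SourceID: "49"}).Labels)
}
```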

type Tree added in v0.1.130

type Tree struct {
	Root Filter
}

Tree allows polymorphic filters.

func (*Tree) Apply added in v0.1.130

func (t *Tree) Apply(is finc.IntermediateSchema) bool

Apply applies the root filter.

func (*Tree) UnmarshalJSON added in v0.1.130

func (t *Tree) UnmarshalJSON(p []byte) error

UnmarshalJSON gathers the top level filter name and unmarshals the associated filter.
