storage

package
v0.8.6
Warning

This package is not in the latest version of its module.

Published: Jul 25, 2024 License: GPL-3.0 Imports: 12 Imported by: 0

README

About keys and database lookups of url artifacts

Key Goals

storage.Key(url) uint64 generates a numeric key for storing and retrieving URLs from a data store. This key is intended to meet the following criteria:

  1. Idempotency: always generate the same key for a particular URL
  2. Performance
  3. Compatibility with almost any relational database.

Repeated key generation will always return the same key for a URL. scrape stores only one instance of content per (canonical) URL (more on that below) and isn't intended for storing versioned contents of URLs. Any updated content replaces the old content for a particular URL.

Performance here largely boils down to using a numeric key, as this is the most economical option for storage, indexing, and sorting. Additionally, the upper bits of the key provide same-domain grouping that can be used for partitioning if needed.

Compatibility here primarily boils down to the key's internal representation being an int63: not all databases support uint64 natively, so the highest bit is always 0.

Key Structure

Keys are constructed in the following format:

  • [bits 0-55]: A 56-bit numeric hash of the URL. Currently generated using a 64-bit FNV-1a hash reduced to 56 bits. (This implementation may change in a future iteration.)
  • [bits 56-62]: A 7-bit checksum of the URL's domain. This provides some degree of natural grouping by domain. It could support partitioning or sharding as well, but presently the goal/assumption of the system is that the database is time-constrained in size and should not require partitioning.
  • [bit 63]: Always 0
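
The layout above can be sketched in a few lines of Go. This is an illustration under stated assumptions rather than the package's implementation: sketchKey and domainChecksum are hypothetical names, the checksum (a byte sum kept to 7 bits) is invented for the example, and the 64-bit hash is reduced by masking off the top 8 bits, which may not match what Key actually does.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	mask56       uint64 = 0xffffffffffffff // low 56 bits
	checksumMask uint64 = 127 << 56        // bits 56-62
)

// domainChecksum is a hypothetical 7-bit checksum: it sums the bytes of the
// hostname and keeps the low 7 bits. The package's real checksum may differ.
func domainChecksum(hostname string) uint64 {
	var sum uint64
	for i := 0; i < len(hostname); i++ {
		sum += uint64(hostname[i])
	}
	return sum & 0x7f
}

// sketchKey builds a 63-bit key: 56 bits of FNV-1a hash in the low bits and
// the 7-bit domain checksum in bits 56-62. Bit 63 is always left as 0.
func sketchKey(rawURL, hostname string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(rawURL)) // Write on a hash.Hash never returns an error
	return (h.Sum64() & mask56) | (domainChecksum(hostname) << 56)
}

func main() {
	a := sketchKey("https://example.com/a", "example.com")
	b := sketchKey("https://example.com/b", "example.com")
	fmt.Printf("%#x\n%#x\n", a, b)
	// URLs on the same domain share bits 56-62.
	fmt.Println("same domain bits:", a&checksumMask == b&checksumMask)
	// With bit 63 clear, the key is non-negative when stored as an int64.
	fmt.Println("fits in int64:", int64(a) >= 0 && int64(b) >= 0)
}
```

Because the checksum occupies the bits just below the sign bit, sorting keys numerically also groups them by domain checksum, which is what makes range-based partitioning possible.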

Intended Usage

Internal use only

Keys are only intended for optimizing database lookups and are not included in shared metadata/API responses. The key format is intended to be a direct representation of a URL for managing internal processes. The system provides a guarantee that it will fetch and return content for any url. It provides no such contract for IDs, nor does it provide any contract that the ID algorithm should not change.

Usage in tables

Usage inside tables is at the discretion of the database implementation and may be implemented differently across storage engines.

System assumptions

The following isn't strictly germane to keys, but describes how they are used in the context of the broader system (which did/does inform their construction).

The resource.WebPage struct (which is passed into URLDataStore.Save()) has 3 fields that contain URL data.

  1. OriginalURL: This is the literal URL that was requested in the API. This value is not stored at all, but is returned to the client so that the client can cross-reference a request.
  2. RequestedURL: This is the URL that was actually requested from the target server. It is the output of resource.CleanURL(originalURL).
  3. URL: This is the URL of the page as reported by the content parser, and is considered the canonical URL of the page. It is reliably the content of og:url when present.

urls table

The urls table uses the stored URL (canonical, derived from the content whenever possible) along with the paired key as its id.

id_map table

The id_map table stores mappings between canonical_url and requested_url.

When handling an inbound request, the id_map table is consulted first to see if there's a mapping for the RequestedURL. If there is, the metadata for that entry is returned to the client.
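
The two-table lookup can be modeled with a pair of maps. This is a minimal in-memory sketch, not the package's storage code: memStore, page, save, and fetch are hypothetical names, and strings stand in for the numeric keys to keep the example short.

```go
package main

import "fmt"

// page is a hypothetical stand-in for the stored metadata of a URL.
type page struct {
	URL   string // canonical URL
	Title string
}

// memStore models the two tables with maps.
type memStore struct {
	idMap map[string]string // requested URL -> canonical URL
	urls  map[string]page   // canonical URL -> stored metadata
}

// save stores the page under its canonical URL and records the mapping from
// the requested URL, even when the two are identical.
func (s *memStore) save(requestedURL string, p page) {
	s.urls[p.URL] = p
	s.idMap[requestedURL] = p.URL
}

// fetch consults idMap first, then returns the canonical entry if present.
func (s *memStore) fetch(requestedURL string) (page, bool) {
	canonical, ok := s.idMap[requestedURL]
	if !ok {
		return page{}, false
	}
	p, ok := s.urls[canonical]
	return p, ok
}

func main() {
	s := &memStore{idMap: map[string]string{}, urls: map[string]page{}}
	s.save("https://example.com/p/123", page{URL: "https://example.com/posts/hello", Title: "Hello"})

	p, ok := s.fetch("https://example.com/p/123")
	fmt.Println(p.URL, ok) // https://example.com/posts/hello true

	_, ok = s.fetch("https://example.com/unknown")
	fmt.Println(ok) // false
}
```

Saving the mapping even when the requested and canonical URLs match keeps the lookup path uniform: every fetch goes through idMap first, with no special case for already-canonical URLs.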

Documentation

Overview

Key generation and related methods relevant to any storage backend.

Index

Constants

const (
	MASK_56       uint64 = 0xffffffffffffff
	CHECKSUM_MASK uint64 = 127 << 56
)
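
As an illustration of how these two masks relate to the key layout, the sketch below splits a key back into its parts. The constants are copied from above; splitKey is a hypothetical helper that the package does not export, and the sample key is an arbitrary value chosen for readability.

```go
package main

import "fmt"

const (
	MASK_56       uint64 = 0xffffffffffffff
	CHECKSUM_MASK uint64 = 127 << 56
)

// splitKey separates a key into its 7-bit domain checksum (bits 56-62)
// and its 56-bit URL hash (bits 0-55).
func splitKey(key uint64) (checksum, hash uint64) {
	return (key & CHECKSUM_MASK) >> 56, key & MASK_56
}

func main() {
	// An arbitrary illustrative key: checksum 0x2a, hash 0xabcdef.
	key := uint64(0x2a)<<56 | 0xabcdef
	checksum, hash := splitKey(key)
	fmt.Printf("checksum=%#x hash=%#x\n", checksum, hash) // checksum=0x2a hash=0xabcdef
}
```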

Variables

This section is empty.

Functions

func Key

func Key(url URLWithHostname) uint64

Produces a 63-bit unsigned integer contained in a uint64 (SQLite cannot accept a uint64 with the high bit set as a primary key).

  • [Bit 63]: Always 0
  • [Bits 62-56]: A 7-bit checksum based on the domain name
  • [Bits 55-0]: A 56-bit hash of the URL (reduced from a 64-bit FNV-1a hash)

Types

type URLDataStore added in v0.8.5

type URLDataStore struct {
	// contains filtered or unexported fields
}

TODO: Stop embedding the database handle this way; the URLDataStore is currently opening and closing the database connection (via this interface), which prevents other entities from using the same connection.

func NewURLDataStore added in v0.8.5

func NewURLDataStore(dbh *database.DBHandle) *URLDataStore

func (*URLDataStore) Clear added in v0.8.5

func (s *URLDataStore) Clear() error

Clear will delete all url content from the database

func (*URLDataStore) Database added in v0.8.5

func (s *URLDataStore) Database() *database.DBHandle

func (*URLDataStore) Delete added in v0.8.5

func (s *URLDataStore) Delete(url *nurl.URL) (bool, error)

Delete will only delete a URL that matches the canonical URL.

TODO: Evaluate desired behavior here.
TODO: Not accounting for lookup keys.
NB: TTL management is handled by maintenance routines.

func (URLDataStore) Fetch added in v0.8.5

func (s URLDataStore) Fetch(url *nurl.URL) (*resource.WebPage, error)

Fetch will return the stored data for the requested URL, or nil if not found.

The returned result may come from a different URL than the requested URL, if we've seen the passed URL before AND the page reported its canonical URL as being different from the requested URL.

In that case, the canonical version of the content will be returned, if we have it.

func (*URLDataStore) Save added in v0.8.5

func (s *URLDataStore) Save(uptr *resource.WebPage) (uint64, error)

Save stores the data for a URL, overwriting any existing data for the same URL. Save() uses the canonical URL of the passed resource both for the key and for the url field in the stored data. It also stores an id map entry from the requested URL back to the canonical URL; this mapping is stored even when the two URLs are the same. Returns the key for the stored URL (which currently cannot be used for anything, so this interface may change).

type URLString

type URLString string

Type that provides the

func (URLString) Hostname

func (u URLString) Hostname() string

func (URLString) String

func (u URLString) String() string

type URLWithHostname

type URLWithHostname interface {
	fmt.Stringer
	Hostname() string
}

*url.URL (from net/url) and URLString both implement this interface, which is needed to generate a key for the URL.
