openaccess

package module
v0.0.5 Latest Latest
Published: Nov 22, 2020 License: BSD-3-Clause Imports: 11 Imported by: 0

README

go-smithsonian-openaccess

Go package for working with the Smithsonian Open Access release

Important

This is work in progress. Proper documentation to follow.

Data sources

This package and the tools it exports support two types of data sources for the Smithsonian Open Access release: a local file system and an AWS S3 bucket. Under the hood the code uses the GoCloud blob abstraction layer, so other storage services could be supported, but currently they are not.

Access to the data on a local file system is presumed to be from a checkout of the OpenAccess GitHub repository. That repo has grown sufficiently large that it can be difficult to successfully download a copy of the data.

The data itself also lives in a Smithsonian-operated AWS S3 bucket, so this code has been updated to retrieve data from there if asked to. For a number of reasons specific to the Smithsonian, retrieving data from their S3 bucket does not fit neatly into the GoCloud abstraction layer, but efforts have been made to hide those details from users of this code.

Most of the examples below assume a local Git checkout. For example:

$> ./bin/emit -bucket-uri file:///usr/local/OpenAccess metadata/objects/NMAH

In order to retrieve data from the Smithsonian-operated S3 bucket you would change the -bucket-uri flag to:

$> ./bin/emit -bucket-uri 's3://smithsonian-open-access?region=us-west-2' metadata/objects/NMAH

Or the following, which is included as a convenience method:

$> ./bin/emit -bucket-uri 'si://' metadata/objects/NMAH

A by-product of this work is that the code is also able to retrieve data from any other S3 bucket. For example:

$> ./bin/emit -bucket-uri 's3://YOUR-OPENACCESS-BUCKET?region=us-east-1' metadata/objects/NMAH

As of this writing the code to retrieve data from S3 buckets (other than the Smithsonian's) assumes that those buckets allow public access and have public directory listings enabled.
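
The three URI schemes described above can be summarized in a few lines of Go. This is only a sketch using the standard library: ResolveBucketURI is a hypothetical helper, not a function exported by this package, and treating si:// as a plain alias for the S3 URI quoted above is an assumption (the package's own handling of the Smithsonian bucket may do more than rewrite the URI).

```go
package main

import (
	"fmt"
	"net/url"
)

// smithsonianS3URI is the S3 URI quoted in the text above.
const smithsonianS3URI = "s3://smithsonian-open-access?region=us-west-2"

// ResolveBucketURI is a hypothetical helper (not part of this package).
// It accepts the file://, s3:// and si:// schemes and expands the si://
// shorthand to the full S3 URI. Any other scheme is rejected.
func ResolveBucketURI(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("invalid bucket URI %q: %w", raw, err)
	}
	switch u.Scheme {
	case "si":
		return smithsonianS3URI, nil
	case "file", "s3":
		return raw, nil
	default:
		return "", fmt.Errorf("unsupported scheme %q in %q", u.Scheme, raw)
	}
}

func main() {
	for _, raw := range []string{
		"si://",
		"file:///usr/local/OpenAccess",
		"s3://YOUR-OPENACCESS-BUCKET?region=us-east-1",
		"ftp://example.com/OpenAccess",
	} {
		resolved, err := ResolveBucketURI(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "=>", resolved)
	}
}
```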

Tools

To build binary versions of these tools run the cli Makefile target. For example:

> make cli
go build -mod vendor -o bin/clone cmd/clone/main.go
go build -mod vendor -o bin/emit cmd/emit/main.go
go build -mod vendor -o bin/findingaid cmd/findingaid/main.go
go build -mod vendor -o bin/location cmd/location/main.go
go build -mod vendor -o bin/placename cmd/placename/main.go
clone

A command-line tool to clone OpenAccess data to a target destination.

This tool was written principally to clone OpenAccess data from the Smithsonian's smithsonian-open-access S3 bucket to a local filesystem but it can be used to clone data to and from any supported GoCloud.blob source.

> ./bin/clone -h
Usage of ./bin/clone:
  -compress
    	Compress files in the target bucket using bzip2 encoding. Files will be appended with a '.bz2' suffix.
  -force
    	Clone files even if they are present in target bucket and MD5 hashes between source and target buckets match.
  -source-bucket-uri string
    	A valid GoCloud bucket URI. Valid schemes are: file://, s3:// and si:// which is signals that data should be retrieved from the Smithsonian's 'smithsonian-open-access' S3 bucket. (default "si://")
  -target-bucket-uri string
    	A valid GoCloud bucket URI. Valid schemes are: file://, s3://.
  -workers int
    	The maximum number of concurrent workers. This is used to prevent filehandle exhaustion. (default 10)

For example:

$> ./bin/clone \
	-source-bucket-uri si:// \
	-target-bucket-uri file:///tmp \
	metadata/chndm
	
...time passes

$> ls -al /tmp/metadata/edan/chndm/*.txt 
-rw-------  1 user  wheel   571870 Nov 21 11:31 /tmp/metadata/edan/chndm/00.txt
-rw-------  1 user  wheel   569577 Nov 21 11:31 /tmp/metadata/edan/chndm/01.txt
-rw-------  1 user  wheel   492463 Nov 21 11:31 /tmp/metadata/edan/chndm/02.txt
-rw-------  1 user  wheel   480150 Nov 21 11:31 /tmp/metadata/edan/chndm/03.txt
-rw-------  1 user  wheel   647755 Nov 21 11:31 /tmp/metadata/edan/chndm/04.txt
...
-rw-------  1 user  wheel   491210 Nov 21 11:31 /tmp/metadata/edan/chndm/fd.txt
-rw-------  1 user  wheel   622405 Nov 21 11:31 /tmp/metadata/edan/chndm/fe.txt
-rw-------  1 user  wheel   510866 Nov 21 11:31 /tmp/metadata/edan/chndm/ff.txt

Or:

$> ./bin/clone \
	-source-bucket-uri file:///tmp \
	-target-bucket-uri file:///tmp/debug \
	metadata/edan/chndm
	
...less time passes

$> ls -al /tmp/debug/metadata/edan/chndm/*.txt 
-rw-------  1 user  wheel   571870 Nov 21 11:31 /tmp/debug/metadata/edan/chndm/00.txt
-rw-------  1 user  wheel   569577 Nov 21 11:31 /tmp/debug/metadata/edan/chndm/01.txt
-rw-------  1 user  wheel   492463 Nov 21 11:31 /tmp/debug/metadata/edan/chndm/02.txt
-rw-------  1 user  wheel   480150 Nov 21 11:31 /tmp/debug/metadata/edan/chndm/03.txt
-rw-------  1 user  wheel   647755 Nov 21 11:31 /tmp/debug/metadata/edan/chndm/04.txt
...
-rw-------  1 user  wheel   491210 Nov 21 11:31 /tmp/debug/metadata/edan/chndm/fd.txt
-rw-------  1 user  wheel   622405 Nov 21 11:31 /tmp/debug/metadata/edan/chndm/fe.txt
-rw-------  1 user  wheel   510866 Nov 21 11:31 /tmp/debug/metadata/edan/chndm/ff.txt

And then later on:

$> ./bin/emit \
	-json \
	-format-json \
	-bucket-uri file:///tmp/metadata/edan \
	chndm

[{
  "id": "edanmdm-chndm_1931-45-37",
  "version": "",
  "unitCode": "CHNDM",
  "linkedId": "0",
  "type": "edanmdm",
  "content": {
    "descriptiveNonRepeating": {
      "record_ID": "chndm_1931-45-37",
      "online_media": {
        "mediaCount": 1,
  ...and so on
}]
Notes
  • If no extra URI or URIs (for example metadata/chndm) are specified then the code will attempt to clone everything in the "source" bucket recursively.

  • Under the hood this code is using the GoCloud blob abstraction layer. The default behaviour for the abstraction is to assume restrictive permissions when creating new files. Unfortunately, as of this writing, there is no common way for assigning permissions using the GoCloud blob abstraction so this is something you'll need to account for separately from this tool.

emit

A command-line tool for parsing and emitting individual records from a directory containing compressed and line-delimited Smithsonian OpenAccess JSON files.

$> go run -mod vendor cmd/emit/main.go -h
  -bucket-uri string
    	A valid GoCloud bucket URI. Valid schemes are: file://, s3:// and si:// which is signals that data should be retrieved from the Smithsonian's 'smithsonian-open-access' S3 bucket.
  -format-json
    	Format JSON output for each record.
  -json
    	Emit a JSON list.
  -null
    	Emit to /dev/null
  -oembed
    	Emit results as OEmbed records
  -query value
    	One or more {PATH}={REGEXP} parameters for filtering records.
  -query-mode string
    	Specify how query filtering should be evaluated. Valid modes are: ALL, ANY (default "ALL")
  -stats
    	Display timings and statistics.
  -stdout
    	Emit to STDOUT (default true)
  -validate-edan
    	Ensure each record is a valid EDAN document.
  -validate-json
    	Ensure each record is valid JSON.
  -workers int
    	The maximum number of concurrent workers. This is used to prevent filehandle exhaustion. (default 10)

For example, processing every record in the OpenAccess dataset ensuring it is valid JSON and emitting it to /dev/null:

> go run -mod vendor cmd/emit/main.go -bucket-uri file:///usr/local/OpenAccess \
  -stdout=false \
  -validate-json \
  -null \
  -stats \
  -workers 20 \
  metadata/objects

2020/06/26 10:19:17 Processed 11620642 records in 12m1.141284159s

Or processing everything in the Air and Space collection as JSON, passing the result to the jq tool and searching for things with "space" in the title:

$> go run -mod vendor cmd/emit/main.go -bucket-uri file:///usr/local/OpenAccess \
   -json \
   -validate-json \
   metadata/objects/NASM/ \
   | jq '.[]["title"]' \
   | grep -i 'space' \
   | sort

"Medal, NASA Space Flight, Sally Ride"
"Medal, STS-7, Smithsonian National Air and Space Museum, Sally Ride"
"Mirror, Primary Backup, Hubble Space Telescope"
"Model, 1:5, Hubble Space Telescope"
"Model, Space Shuttle, Delta-Wing High Cross-Range Orbiter Concept"
"Model, Space Shuttle, Final Orbiter Concept"
"Model, Space Shuttle, North American Rockwell Final Design, 1:15"
"Model, Space Shuttle, Straight-Wing Low Cross-Range Orbiter Concept"
"Model, Wind Tunnel, Convair Space Shuttle, 0.006 scale"
"Orbiter, Space Shuttle, OV-103, Discovery"
"Space Food, Beef and Vegetables, Mercury, Friendship 7"
"Spacecraft, Mariner 10, Flight Spare"
"Spacecraft, New Horizons, Mock-up, model"
"Suit, SpaceShipOne, Mike Melvill"

Or doing the same, but for things about kittens in the Cooper Hewitt collection:

$> go run -mod vendor cmd/emit/main.go -bucket-uri file:///usr/local/OpenAccess \
   -json \
   -validate-json \
   -stats \
   metadata/objects/CHNDM/ \
   | jq '.[]["title"]' \
   | grep -i 'kitten' \
   | sort

2020/06/26 09:45:15 Processed 43695 records in 4.175884858s
"Cat and kitten"
"Tabby's Kittens"

Or something similar by not emitting a JSON list but formatting each record (as JSON) and filtering for the words "title" and "kitten":

$> go run -mod vendor cmd/emit/main.go -bucket-uri file:///usr/local/OpenAccess \
   -format-json \
   -validate-json=false \
   -stats \
   metadata/objects/CHNDM \
   | grep '"title"' \
   | grep -i 'kitten' \
   | sort
   
2020/06/26 10:02:59 Processed 43695 records in 5.045081835s
  "title": "Cat and kitten"
  "title": "Tabby\u0027s Kittens"
Inline queries

You can also specify inline queries by passing a -query parameter which is a string in the format of:

{PATH}={REGULAR EXPRESSION}

Paths follow the dot notation syntax used by the tidwall/gjson package and regular expressions are any valid Go language regular expression. Successful path lookups will be treated as a list of candidates and each candidate's string value will be tested against the regular expression's MatchString method.

For example:

$> go run -mod vendor cmd/emit/main.go -bucket-uri file:///usr/local/OpenAccess \
   -json \
   -query 'title=cats?\s+' \
   metadata/objects/CHNDM \
   | jq '.[]["title"]'
   
"View of Moat Mountain from Wildcat Brook, Jackson, New Hampshire, Looking Southwest"
"Near Falls of Wildcat Brook, Jackson, New Hampshire"

You can pass multiple -query parameters:

$> go run -mod vendor cmd/emit/main.go -bucket-uri file:///usr/local/OpenAccess \
   -json \
   -query 'title=cats?\s+' \
   -query 'title=(?i)^view' \
   metadata/objects/CHNDM \
   | jq '.[]["title"]'
   
"View of Moat Mountain from Wildcat Brook, Jackson, New Hampshire, Looking Southwest"

The default query mode requires every query to match, but you can require only that at least one query matches by passing the -query-mode ANY flag:

$> go run -mod vendor cmd/emit/main.go -bucket-uri file:///usr/local/OpenAccess \
   -json \
   -query 'title=cats?\s+' \
   -query 'title=(?i)^view' \
   -query-mode ANY \
   metadata/objects/CHNDM \
   | jq '.[]["title"]'
   
"View of Santi Giovanni e Paolo a Celio, Rome"
"View of a Morning Room Interior"
"View of the Louvre from the River"
"View of the Acropolis, Athens"
"View of Santi Giovanni e Paolo a Celio, Rome"
"Views Representing the Most Considerable Transactions in the Siege of a Place, from Twelve of the Most REmarkable Sieges and Battles in Europe"
"View of Shiba Coast (Shibaura no fukei) From the Series One Hundred Famous views of Edo"
"View of Florence, Plate from \"Scelta di XXIV Vedute delle principali contrade, piazze, chiese, e palazzi della Città di Firenze\""
"View of Venice, Italy"
"View Across a River"
"View of the Canadian Falls and Goat Island"

...and so on
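
The {PATH}={REGEXP} format and the ALL and ANY modes can be sketched in plain Go as follows. This is an illustration under stated assumptions rather than the tool's code: ParseQuery, Matches and candidates are hypothetical names, and a small map-walking lookup stands in for tidwall/gjson, so only simple object keys are supported in the path.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// Query pairs a dot-notation path with a compiled regular expression.
type Query struct {
	Path string
	Re   *regexp.Regexp
}

// ParseQuery splits a {PATH}={REGEXP} string on its first "=".
func ParseQuery(raw string) (*Query, error) {
	parts := strings.SplitN(raw, "=", 2)
	if len(parts) != 2 || parts[0] == "" {
		return nil, fmt.Errorf("invalid query %q", raw)
	}
	re, err := regexp.Compile(parts[1])
	if err != nil {
		return nil, err
	}
	return &Query{Path: parts[0], Re: re}, nil
}

// candidates resolves path against a decoded JSON document. Only object keys
// are followed, which is far less than gjson supports.
func candidates(doc interface{}, path string) []string {
	current := doc
	for _, key := range strings.Split(path, ".") {
		obj, ok := current.(map[string]interface{})
		if !ok {
			return nil
		}
		if current, ok = obj[key]; !ok {
			return nil
		}
	}
	if list, ok := current.([]interface{}); ok {
		out := make([]string, 0, len(list))
		for _, item := range list {
			out = append(out, fmt.Sprint(item))
		}
		return out
	}
	return []string{fmt.Sprint(current)}
}

// Matches reports whether body satisfies the queries. In "ANY" mode one
// matching query is enough; any other mode is treated as "ALL".
func Matches(body []byte, queries []*Query, mode string) (bool, error) {
	var doc interface{}
	if err := json.Unmarshal(body, &doc); err != nil {
		return false, err
	}
	hits := 0
	for _, q := range queries {
		for _, c := range candidates(doc, q.Path) {
			if q.Re.MatchString(c) {
				hits++
				break
			}
		}
	}
	if mode == "ANY" && len(queries) > 0 {
		return hits > 0, nil
	}
	return hits == len(queries), nil
}

func main() {
	var queries []*Query
	for _, raw := range []string{`title=cats?\s+`, `title=(?i)^view`} {
		q, err := ParseQuery(raw)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		queries = append(queries, q)
	}
	for _, rec := range []string{
		`{"title": "View of Moat Mountain from Wildcat Brook, Jackson, New Hampshire, Looking Southwest"}`,
		`{"title": "View of Venice, Italy"}`,
	} {
		all, _ := Matches([]byte(rec), queries, "ALL")
		any, _ := Matches([]byte(rec), queries, "ANY")
		fmt.Println("ALL:", all, "ANY:", any, rec)
	}
}
```

The first title satisfies both queries, so it passes in either mode; the second only starts with "View", so it passes in ANY mode alone.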

Did you know that there are 62 (out of 11 million) objects in the Smithsonian collection with "kitten" in their title?

$> go run -mod vendor cmd/emit/main.go -bucket-uri file:///usr/local/OpenAccess \
   -json \
   -query 'title=(?i)kitten' \
   -stats \
   -workers 50 \
   metadata/objects \
   | jq '.[]["title"]'
   
2020/06/26 18:22:04 Processed 62 records in 5m9.567738657s
"Cat and kitten"
"Tabby's Kittens"
"Ye Kitten (Number 17 May 1944)"
"The Kitten (No. 15 March 1944)"
"Three kittens on a stool"
"I'll Never Go Back On My Word; Leave My Kitten Alone"
"Untitled (Two Kittens)"
"Kitten Number Nine"
"Kitten No. Six"
"Bashful Baby Blues; Kitten On the Keys"
"Let's Make Believe We're Sweethearts; Three Naughty Kittens"
"Kittens playing"
"Take Me Back Again; Listen Kitten"
"Kittens"
"Diga Diga Do; Kitten WIth the Big Green Eyes, The"
"I Ain't Nothin' But a Tomcat's Kitten; I'm On My Way"
"Figurine, Kitten Small plastic"
"All the Time; Leave My Kitten Alone"
"Kitten On the Keys; That Place Down the Road Apiece"
"One Dime Blues; Three Little Kittens Rag"
"Kitten No. Eleven"
"I Ain't Nothin' But a Tomcat's Kitten; I'm On My Way"
"Weaker Kitten No. 2/64/41"
"Reward of Merit with Boy and Girl Playing with Cat and Kittens"
"Two Dollar Rag; Kitten on the Keys"
"The Kitten (No. 13 January 1944)"
"Kittens Playing with Camera"
"Reward of Merit with Two White Kittens in Basket"
"Doug and Toad - Kitten on Stump, 1942"
"I'll Never Go Back On My Word; Leave My Kitten Alone"
"The Kitten (No. 40 Sept. 1953)"
"Diga Diga Do; Kitten With the Big Green Eyes, The"
"The Kitten's Breakfast"
"Bunch Of Keys, A; Kitten On the Keys"
"The Kitten (No. 14 February 1944)"
"Figurine, Siamese Kitten"
"Tom Kitten"
"Young boy and his kitten"
"The Young Kittens"
"Little Kittens Learning Abc"
"Jump Jump of Holiday House. Three Little Kittens."
"The Kitten (Number 25 November 1946)"
"Kitten in Shoe"
"The Color Kittens"
"All the Time; Leave My Kitten Alone"
"Kitten on a Stool"
"Super Kitten; We'd Better Stop"
"Kitten mitten roller derby button"
"Little Girl holding Kitten"
"The Kitten (No. 39 May 27, 1953)"
"The Kitten (Number 18 June 1944)"
"My Love Is a Kitten; Strange Little Melody, The"
"Live and Let Live; Tom Cat's Kitten"
"Live and Let Live; Tom Cat's Kitten"
"One Dime Blues; Three Little Kittens Rag"
"The Kitten (No. 47 Dec. 1954)"
"Kittens Playing with Camera"
"\"Okimono\" Figure Of A Cat And Three Kittens"
"Mummy Of \"Kitten\""
"Plicate Kitten's Paw"
"Atlantic Kitten's Paw"
"Drosophila arawakana kittensis"
OEmbed

It is also possible to emit OpenAccess records as OEmbed documents of type "photo". An OEmbed record will be created for each media object of type "Screen Image" or "Images" associated with an OpenAccess record. OpenAccess records that do not have any suitable media objects will be excluded.

For example:

$> go run -mod vendor cmd/emit/main.go -bucket-uri file:///usr/local/OpenAccess \
   -json \
   -oembed \
   metadata/objects/NASM \
   | jq
   
[
{
  "version": "1.0",
  "type": "photo",
  "width": -1,
  "height": -1,
  "title": "Clerget 9 A Diesel, Radial 9 Engine (Gift of the Musee de L' Air)",
  "url": "https://ids.si.edu/ids/download?id=NASM-A19721334000-NASM2016-04025_screen",
  "author_name": "Clerget, Blin and Cie",
  "author_url": "https://airandspace.si.edu/collection/id/nasm_A19721334000",
  "provider_name": "National Air and Space Museum",
  "provider_url": "https://airandspace.si.edu",
  "object_uri": "si://nasm/o/A19721334000"
},
{
  "version": "1.0",
  "type": "photo",
  "width": -1,
  "height": -1,
  "title": "Swagger Stick, Royal Flying Corps (Gift of Eloise and John Charlton)",
  "url": "https://ids.si.edu/ids/download?id=NASM-A19830196000_PS01_screen",
  "author_name": "Lt. Wes D. Archer",
  "author_url": "https://airandspace.si.edu/collection/id/nasm_A19830196000",
  "provider_name": "National Air and Space Museum",
  "provider_url": "https://airandspace.si.edu",
  "object_uri": "si://nasm/o/A19830196000"
}
  ... and so on
  • The OEmbed record title property will be constructed in the form of "{OBJECT TITLE} ({OBJECT CREDIT LINE})".

  • The OEmbed record author_name property will be constructed using the OpenAccess record's content.freetext.name or content.freetext.manufacturer properties, in that order. If neither are present the author_name property will be constructed in the form of "Collection of {SMITHSONIAN UNIT NAME}".

  • The OEmbed record author_url property will be that object's URL on the web.

  • The OEmbed record width and height properties are both set to "-1" to indicate that image dimensions are not available at this time.

  • The OEmbed record will contain a non-standard object_uri string that is compatible with RFC6570 URI Templates. It is constructed in the form of si://{SMITHSONIAN_UNIT}/o/{NORMALIZED_EDAN_OBJECT_ID}. The object_uri property should still be considered experimental. It may change or be removed in future releases.

  • {NORMALIZED_EDAN_OBJECT_ID} strings are derived from the OpenAccess id property. The normalization rules are: Remove the leading edanmdm-{SMITHSONIAN_UNIT}_ prefix and replace all instances of the . character with a _ character. For example the string edanmdm-nmaahc_2017.30.9 will be normalized as 2017_30_9.
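
The construction rules in the list above can be sketched as a handful of small functions. All four function names are hypothetical and are not exported by this package. Two simplifications are assumptions: the name and manufacturer values are passed in as plain strings rather than read from content.freetext, and the unit code is lowercased because the sample identifiers and object_uri values shown above are lowercase.

```go
package main

import (
	"fmt"
	"strings"
)

// OEmbedTitle builds "{OBJECT TITLE} ({OBJECT CREDIT LINE})".
func OEmbedTitle(title, creditLine string) string {
	return fmt.Sprintf("%s (%s)", title, creditLine)
}

// OEmbedAuthorName prefers name, then manufacturer, and otherwise falls back
// to "Collection of {SMITHSONIAN UNIT NAME}".
func OEmbedAuthorName(name, manufacturer, unitName string) string {
	switch {
	case name != "":
		return name
	case manufacturer != "":
		return manufacturer
	default:
		return "Collection of " + unitName
	}
}

// NormalizeObjectID strips the leading "edanmdm-{unit}_" prefix and replaces
// every "." with "_". Lowercasing the unit is an assumption.
func NormalizeObjectID(id, unit string) string {
	prefix := "edanmdm-" + strings.ToLower(unit) + "_"
	return strings.ReplaceAll(strings.TrimPrefix(id, prefix), ".", "_")
}

// ObjectURI builds the experimental si://{unit}/o/{normalized id} string.
func ObjectURI(id, unit string) string {
	return fmt.Sprintf("si://%s/o/%s", strings.ToLower(unit), NormalizeObjectID(id, unit))
}

func main() {
	fmt.Println(OEmbedTitle("Swagger Stick, Royal Flying Corps", "Gift of Eloise and John Charlton"))
	fmt.Println(OEmbedAuthorName("", "", "National Air and Space Museum"))
	fmt.Println(NormalizeObjectID("edanmdm-nmaahc_2017.30.9", "NMAAHC"))
	fmt.Println(ObjectURI("edanmdm-nmaahc_2017.30.9", "NMAAHC"))
}
```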

findingaid

A command-line tool for emitting a CSV document mapping individual record identifiers to their corresponding OpenAccess JSON file and line number, produced from a directory containing compressed and line-delimited Smithsonian OpenAccess JSON files.

> go run -mod vendor cmd/findingaid/main.go -h
  -bucket-uri string
    	A valid GoCloud bucket URI. Valid schemes are: file://, s3:// and si:// which is signals that data should be retrieved from the Smithsonian's 'smithsonian-open-access' S3 bucket.
  -csv-header
    	Include a CSV header row in the output (default true)
  -include-all
    	Include all OpenAccess identifiers
  -include-guid content.descriptiveNonRepeating.guid
    	Include the OpenAccess content.descriptiveNonRepeating.guid identifier
  -include-openaccess-id id
    	Include the OpenAccess id identifier
  -include-record-id content.descriptiveNonRepeating.record_ID
    	Include the OpenAccess content.descriptiveNonRepeating.record_ID identifier (default true)
  -include-record-link content.descriptiveNonRepeating.record_link
    	Include the OpenAccess content.descriptiveNonRepeating.record_link identifier
  -null
    	Emit to /dev/null
  -query value
    	One or more {PATH}={REGEXP} parameters for filtering records.
  -query-mode string
    	Specify how query filtering should be evaluated. Valid modes are: ALL, ANY (default "ALL")
  -stats
    	Display timings and statistics.
  -stdout
    	Emit to STDOUT (default true)
  -workers int
    	The maximum number of concurrent workers. This is used to prevent filehandle exhaustion. (default 10)

For example:

$> go run -mod vendor cmd/findingaid/main.go -bucket-uri file:///usr/local/OpenAccess \
	metadata/objects/SAAM 

id,path,line_number
saam_1971.439.94,metadata/objects/SAAM/00.txt.bz2,1
saam_1971.439.92,metadata/objects/SAAM/08.txt.bz2,1
saam_1915.5.1,metadata/objects/SAAM/00.txt.bz2,2
saam_1971.439.78,metadata/objects/SAAM/03.txt.bz2,1
saam_XX32,metadata/objects/SAAM/12.txt.bz2,1
saam_1983.90.173,metadata/objects/SAAM/00.txt.bz2,3
saam_1970.335.1,metadata/objects/SAAM/03.txt.bz2,2
saam_1971.439.97,metadata/objects/SAAM/0d.txt.bz2,1
saam_1968.155.158,metadata/objects/SAAM/12.txt.bz2,2
saam_1967.14.149,metadata/objects/SAAM/08.txt.bz2,2
saam_1979.98.188,metadata/objects/SAAM/02.txt.bz2,1
saam_1985.66.295_540,metadata/objects/SAAM/00.txt.bz2,4
saam_1970.334,metadata/objects/SAAM/03.txt.bz2,3
saam_1968.19.12,metadata/objects/SAAM/0d.txt.bz2,2
... and so on

By default only the OpenAccess content.descriptiveNonRepeating.record_ID identifier is included in the finding aid. You can include other identifiers with their corresponding command-line flag or include all identifiers by passing the -include-all flag. For example:

$> go run -mod vendor cmd/findingaid/main.go -bucket-uri file:///usr/local/OpenAccess \
   -include-all \
   metadata/objects/NMAAHC
   
id,path,line_number
http://n2t.net/ark:/65665/fd53f870fc2-73af-4c50-b1c5-a3fd2829ad1f,metadata/objects/NMAAHC/ff.txt.bz2,1
nmaahc_2014.72.2,metadata/objects/NMAAHC/ff.txt.bz2,1
https://nmaahc.si.edu/object/nmaahc_2014.72.2,metadata/objects/NMAAHC/ff.txt.bz2,1
edanmdm-nmaahc_2014.72.2,metadata/objects/NMAAHC/ff.txt.bz2,1
http://n2t.net/ark:/65665/fd5343a21ed-73d9-4014-a34c-b175b84168c8,metadata/objects/NMAAHC/21.txt.bz2,1
nmaahc_2014.75.130,metadata/objects/NMAAHC/21.txt.bz2,1
https://nmaahc.si.edu/object/nmaahc_2014.75.130,metadata/objects/NMAAHC/21.txt.bz2,1
edanmdm-nmaahc_2014.75.130,metadata/objects/NMAAHC/21.txt.bz2,1
http://n2t.net/ark:/65665/fd59212a6e2-b745-4eb9-84ad-4368ffea8223,metadata/objects/NMAAHC/17.txt.bz2,1
nmaahc_2016.140.1.3,metadata/objects/NMAAHC/17.txt.bz2,1
https://nmaahc.si.edu/object/nmaahc_2016.140.1.3,metadata/objects/NMAAHC/17.txt.bz2,1
edanmdm-nmaahc_2016.140.1.3,metadata/objects/NMAAHC/17.txt.bz2,1
http://n2t.net/ark:/65665/fd599a84051-37d5-49d4-98d3-9052e5cbcea9,metadata/objects/NMAAHC/22.txt.bz2,1
nmaahc_2012.30.3,metadata/objects/NMAAHC/22.txt.bz2,1
https://nmaahc.si.edu/object/nmaahc_2012.30.3,metadata/objects/NMAAHC/22.txt.bz2,1
edanmdm-nmaahc_2012.30.3,metadata/objects/NMAAHC/22.txt.bz2,1
http://n2t.net/ark:/65665/fd53a114ad8-2cc2-4ce2-bbd0-6dd09cc715df,metadata/objects/NMAAHC/0c.txt.bz2,1
nmaahc_2013.133.1.4,metadata/objects/NMAAHC/0c.txt.bz2,1
https://nmaahc.si.edu/object/nmaahc_2013.133.1.4,metadata/objects/NMAAHC/0c.txt.bz2,1
edanmdm-nmaahc_2013.133.1.4,metadata/objects/NMAAHC/0c.txt.bz2,1
http://n2t.net/ark:/65665/fd5d302d893-ae7c-4b4d-93bb-59f87237d23a,metadata/objects/NMAAHC/1c.txt.bz2,1
nmaahc_2014.222.2,metadata/objects/NMAAHC/1c.txt.bz2,1
https://nmaahc.si.edu/object/nmaahc_2014.222.2,metadata/objects/NMAAHC/1c.txt.bz2,1
edanmdm-nmaahc_2014.222.2,metadata/objects/NMAAHC/1c.txt.bz2,1
http://n2t.net/ark:/65665/fd5ab09d12b-42bc-40f2-9557-b924d182723e,metadata/objects/NMAAHC/ff.txt.bz2,2
nmaahc_2016.166.17,metadata/objects/NMAAHC/ff.txt.bz2,2
https://nmaahc.si.edu/object/nmaahc_2016.166.17,metadata/objects/NMAAHC/ff.txt.bz2,2
edanmdm-nmaahc_2016.166.17,metadata/objects/NMAAHC/ff.txt.bz2,2
http://n2t.net/ark:/65665/fd588400c0f-66c3-4259-999d-57f112e05479,metadata/objects/NMAAHC/2b.txt.bz2,1
nmaahc_2014.263.5,metadata/objects/NMAAHC/2b.txt.bz2,1
https://nmaahc.si.edu/object/nmaahc_2014.263.5,metadata/objects/NMAAHC/2b.txt.bz2,1
... and so on

The findingaid tool also supports inline queries (described above). For example, the following produces a 4044-line CSV file (including the header row) of records with the word "panda" in their title:

go run -mod vendor cmd/findingaid/main.go -bucket-uri file:///usr/local/OpenAccess \
   -query 'title=(?i)pandas?' \
   -workers 50 \
   metadata/objects/ \
   > pandas.csv

time passes...

$> wc -l pandas.csv
    4044 pandas.csv

$> less pandas.csv
id,path,line_number
nmah_1333041,metadata/objects/NMAH/17.txt.bz2,75
nmah_1195220,metadata/objects/NMAH/1f.txt.bz2,520
nmah_1065733,metadata/objects/NMAH/32.txt.bz2,393
nmah_414524,metadata/objects/NMAH/43.txt.bz2,3302
nmah_1298355,metadata/objects/NMAH/2a.txt.bz2,4794
nmah_1333042,metadata/objects/NMAH/69.txt.bz2,4331
nmah_903687,metadata/objects/NMAH/71.txt.bz2,3133
nmah_1465552,metadata/objects/NMAH/d1.txt.bz2,137
nmah_1449233,metadata/objects/NMAH/aa.txt.bz2,4518
nmah_334375,metadata/objects/NMAH/bd.txt.bz2,2143
nmah_414787,metadata/objects/NMAH/cf.txt.bz2,2140
nmnhanthropology_8357155,metadata/objects/NMNHANTHRO/27.txt.bz2,785
nmnhanthropology_8394769,metadata/objects/NMNHANTHRO/03.txt.bz2,1232
nmnhanthropology_8426012,metadata/objects/NMNHANTHRO/04.txt.bz2,1441
nmnhanthropology_8413868,metadata/objects/NMNHANTHRO/0a.txt.bz2,1447
... and so on
location

A command-line tool for parsing line-delimited Smithsonian OpenAccess JSON files and emitting place data as a stream of CSV records.

> ./bin/location -h
Usage of ./bin/location:
  -null
    	Emit to /dev/null
  -stdout
    	Emit to STDOUT (default true)

For example:

$> ./bin/emit \
	-bucket-uri file:///usr/local/OpenAccess metadata/objects/NMAH \
   | ./bin/location

edanmdm-nmah_715051,content.freetext.place,place made,"United States: New York, New York City"
edanmdm-nmah_580165,content.freetext.place,place made,United States
edanmdm-nmah_598790,content.freetext.place,place made,"United Kingdom: England, Longport"
edanmdm-nmah_580114,content.freetext.place,place made,United States: New Jersey
edanmdm-nmah_670543,content.freetext.place,place made,United States
edanmdm-nmah_570097,content.freetext.place,place made,United Kingdom: England
edanmdm-nmah_415366,content.freetext.place,place made,Germany
...and so on
edanmdm-nmah_383309,content.freetext.place,associated place,United States
edanmdm-nmah_1957071,content.freetext.place,place made,Russia
edanmdm-nmah_1957077,content.freetext.place,place made,Russia
edanmdm-nmah_1957190,content.freetext.place,place made,Russia
edanmdm-nmah_1408250,content.freetext.place,place made,"United States: District of Columbia, Washington"
edanmdm-nmah_1321602,content.freetext.place,place made,"France: Île-de-France, Paris"

The columns in the CSV output are:

Index | Value                                           | Example
0     | OpenAccess record ID                            | edanmdm-nmah_715051
1     | Path to the property used to look up place data | content.freetext.place
2     | Label associated with place data                | place made
3     | Place name                                      | "United States: New York, New York City"
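
Because place names can contain commas, the output is best read with a real CSV parser rather than by splitting on commas. Here is a minimal sketch using the standard library; Place and ParsePlaces are hypothetical names, and the sample rows are copied from the output above.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Place mirrors the four columns listed above.
type Place struct {
	RecordID string // column 0
	Path     string // column 1
	Label    string // column 2
	Name     string // column 3
}

// ParsePlaces is a hypothetical helper that decodes CSV text in the format
// emitted by the location tool.
func ParsePlaces(data string) ([]Place, error) {
	reader := csv.NewReader(strings.NewReader(data))
	reader.FieldsPerRecord = 4 // every row must have exactly four columns
	rows, err := reader.ReadAll()
	if err != nil {
		return nil, err
	}
	places := make([]Place, 0, len(rows))
	for _, row := range rows {
		places = append(places, Place{
			RecordID: row[0],
			Path:     row[1],
			Label:    row[2],
			Name:     row[3],
		})
	}
	return places, nil
}

func main() {
	data := `edanmdm-nmah_715051,content.freetext.place,place made,"United States: New York, New York City"
edanmdm-nmah_580165,content.freetext.place,place made,United States
`
	places, err := ParsePlaces(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range places {
		fmt.Printf("%s [%s] %s\n", p.RecordID, p.Label, p.Name)
	}
}
```
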
placename

A command-line tool for extracting only placename data from a CSV stream produced by the location tool.

> ./bin/placename -h
Usage of ./bin/placename:
  -unique
    	Only unique emit placename strings once. (default true)

For example:

$> ./bin/emit -bucket-uri file:///usr/local/OpenAccess metadata/objects/NMAH \
   | ./bin/location \
   | ./bin/placename \
   | wc -l
   
12164

Or:

$> ./bin/emit -bucket-uri file:///usr/local/OpenAccess metadata/objects/ \
   | ./bin/location \
   | ./bin/placename

United States
Washington (D.C.)
Florida
Miami (Fla.)
France
USA
Italy
Europe
Italy or Spain
France, Europe
probably Venice, Italy
Milan, Italy
France or Italy
Florence, Italy

...time passes

From top of pass to Hoja Verde., Tamaulipas, Mexico, North America
Perto Dom Pedro II, Paraná, Brazil, South America
Zealand: peat-bog at Søgärd., Denmark, Europe
Tarumã Alta, 14 km NW of Manaus., Manaus, Brazil, South America - Neotropics
Woods near Taxodium swamp, 2 miles south of Eagletown, McCurtain Co., Oklahoma, United States, North America
Yarmouth County. Deep water of St. John (Wilson's) Lake., Nova Scotia, Canada, North America
½ mi. S. Olivet., Osage, Kansas, United States, North America
Range of low hills ca. 20 km west of Redenção, near Córrego São João and Troncamento Santa Teresa, Conceição do Araguaia, Brazil, South America - Neotropics
Tatama. Santa Cecilia. Cordillera Occidental. Vertiente Occidental, Caldas, Colombia, South America - Neotropics
San Rafael Ranch - on banky rivers. Cameron Co, Texas, United States, North America
Sultanabad, Khorassan., Khorasan [obsolete], Iran, Asia-Temperate
Pointe Du Lac, comte du St-Maurice: sur les sables du lac St-Pierre., Quebec, Canada, North America

...2.5M records later

Limburg
Japão
Lado Enclave (Congo Free State)
Mauritanie
Igboho (Nigeria)
Accra Plains
Hollywood (Fla.)
Broward County (Fla.)
GrÃ-Bretanha
América latina
Cousin
Peace River Watershed (B.C. and Alta.)
Peace River Watershed

See also

Documentation

Index

Constants

const IS_SMITHSONIAN_S3 string = "github.com/aaronland/go-smithsonian-openaccess#is_smithsonian_s3"
const SCREEN_IMAGE string = "Screen Image"

Variables

var AWS_S3_BUCKET string
var AWS_S3_REGION string
var AWS_S3_URI string
var SMITHSONIAN_DATA_FILES []string
var SMITHSONIAN_UNITS []string

Functions

func OpenBucket added in v0.0.3

func OpenBucket(ctx context.Context, uri string) (context.Context, *blob.Bucket, error)

Types

type OpenAccessRecord

type OpenAccessRecord struct {
	Id              string               `json:"id"`
	Title           string               `json:"title"`
	UnitCode        string               `json:"unitCode"`
	LinkedId        string               `json:"linkedId"`
	Type            string               `json:"type"`
	URL             string               `json:"url"`
	Content         edan.IIMObjectRecord `json:"content"`
	Hash            string               `json:"hash"`
	DocSignature    string               `json:"docSignature"`
	Timestamp       int64                `json:"timestamp"`
	LastTimeUpdated int64                `json:"lastTimeUpdated"`
	Status          int                  `json:"status"`
	Version         string               `json:"version"`
	PublicSearch    bool                 `json:"publicSearch"`
	Extensions      interface{}          `json:"extensions"`
}

func (*OpenAccessRecord) CreditLine

func (rec *OpenAccessRecord) CreditLine() string

func (*OpenAccessRecord) ImageURLsWithLabel

func (rec *OpenAccessRecord) ImageURLsWithLabel(label string) ([]string, error)

func (*OpenAccessRecord) OnlineMedia

func (rec *OpenAccessRecord) OnlineMedia() (edan.IIMOnlineMedia, error)
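
To show how the struct tags above map onto the JSON emitted by the tools, here is a sketch that decodes a record with the standard library. Record is a trimmed, local stand-in for OpenAccessRecord (the Content field and the edan types are deliberately left out so the example has no external dependencies), DecodeRecord is a hypothetical helper, and the sample values are taken from the emit output shown earlier.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Record is a trimmed stand-in for OpenAccessRecord, keeping only a few of
// its top-level string fields.
type Record struct {
	Id       string `json:"id"`
	Version  string `json:"version"`
	UnitCode string `json:"unitCode"`
	LinkedId string `json:"linkedId"`
	Type     string `json:"type"`
}

// DecodeRecord is a hypothetical helper that parses one line of
// line-delimited JSON into a Record. Unknown properties are ignored.
func DecodeRecord(body []byte) (*Record, error) {
	var rec Record
	if err := json.Unmarshal(body, &rec); err != nil {
		return nil, err
	}
	return &rec, nil
}

func main() {
	body := []byte(`{"id":"edanmdm-chndm_1931-45-37","version":"","unitCode":"CHNDM","linkedId":"0","type":"edanmdm"}`)
	rec, err := DecodeRecord(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s belongs to unit %s (type %s)\n", rec.Id, rec.UnitCode, rec.Type)
}
```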

Directories

Path Synopsis
cmd
