README
DataDJ
Data DJ is a value-adding service for collections and archives, initially conceived at ETH Library Lab and currently in development at ETH Library. It provides more convenient and efficient access to batches of digitised records and files. The service works in conjunction with collections' existing websites and search portals: the collection's website forwards the user's request for a list of files to the Data DJ, which then gathers and compresses the files and notifies the user via email with a download link.
![data dj process overview](https://github.com/eth-library/dataset-dj/raw/1f26eb92b6f6/assets/DataDJ-simple-overview.gif)
The sample DataDJ application can be accessed at https://dj-api-ucooq6lz5a-oa.a.run.app/. The requests presented throughout this README are written for the Visual Studio Code REST Client; they can easily be adapted for other API clients or curl.
If you are planning to work on this project, contact us to ask for the detailed internal documentation.
Quickstart Guide
1. Request an archive from a list of files
Edit the request below to include your email and the list of files that you want to download (note the included filepath). Additionally, meta information can be included using the meta field. The endpoint can also be called using curl. Once the files have been gathered and compressed, you should receive an email with the download link. This endpoint is intended to be called by a data collection, which forwards the files requested by a user and specifies the user's email address. Please note that the archiveID remains empty in the current iteration of the service.
Example:
POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key
{
"email": "email@address.com",
"archiveID": "",
"content": [
{
"sourceID": "0ff529e3",
"files": ["/test/dir/file1", "/test/dir/file2"]
},
{
"sourceID": "eba48cdb",
"files": ["/test/dir/file3", "/test/dir/file4"]
}],
"meta": "{meta: information}"
}
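For collections that call the DataDJ from Go rather than from a REST client, the request above could be built roughly as follows. This is a minimal sketch, not part of the DataDJ code base: the type and function names (`SourceFiles`, `ArchiveRequest`, `newArchiveRequest`) are hypothetical, and the field names are taken from the example body above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// SourceFiles groups the requested file paths by the source they belong to.
// (Hypothetical type; field names follow the example request.)
type SourceFiles struct {
	SourceID string   `json:"sourceID"`
	Files    []string `json:"files"`
}

// ArchiveRequest mirrors the JSON body of the example request.
type ArchiveRequest struct {
	Email     string        `json:"email"`
	ArchiveID string        `json:"archiveID"`
	Content   []SourceFiles `json:"content"`
	Meta      string        `json:"meta"`
}

// newArchiveRequest builds, but does not send, the POST request to /archive.
func newArchiveRequest(baseURL, serviceKey string, body ArchiveRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/archive", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+serviceKey)
	return req, nil
}

func main() {
	req, err := newArchiveRequest("https://dj-api-ucooq6lz5a-oa.a.run.app", "service_key", ArchiveRequest{
		Email: "email@address.com",
		Content: []SourceFiles{
			{SourceID: "0ff529e3", Files: []string{"/test/dir/file1", "/test/dir/file2"}},
			{SourceID: "eba48cdb", Files: []string{"/test/dir/file3", "/test/dir/file4"}},
		},
		Meta: "{meta: information}",
	})
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
	// To actually submit the order: resp, err := http.DefaultClient.Do(req)
}
```

The sketch only constructs the request; sending it requires a valid service key in place of the `service_key` placeholder.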
API Endpoints
Check if DataDJ service is live (Public)
GET https://dj-api-ucooq6lz5a-oa.a.run.app/ping
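A liveness check from Go might look like the sketch below. The helper name `isLive` is hypothetical, and treating any 2xx status as "live" is an assumption, since this README does not document the response of /ping.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// isLive reports whether GET <baseURL>/ping answers with a 2xx status.
// Assumption: any 2xx status means the service is live.
func isLive(baseURL string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL + "/ping")
	if err != nil {
		// Network errors and timeouts are treated as "not live".
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300
}

func main() {
	fmt.Println("live:", isLive("https://dj-api-ucooq6lz5a-oa.a.run.app"))
}
```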
Register Services, Taskhandler and Sources
1. Register new Service (Admin)
An admin can task the DJ to generate a new service token/key and to send an email with a redeem link to the specified email address. The service key is required by collections to interact with the DJ for anything related to creating and altering archives.
POST https://dj-api-ucooq6lz5a-oa.a.run.app/admin/createKeyLink
Content-Type: application/json
Authorization: Bearer admin_key
{
"email": "email@address.com"
}
2. Register new Taskhandler (Admin)
A taskhandler is the part of the DataDJ responsible for gathering and compressing the requested files, as well as sending an email containing a download link to the user who requested the files. In order to interact with the API part of the DataDJ, the taskhandler requires a handler token/key similar to a service key. This key can be generated by an admin via the following request and, for now, has to be handed manually to the operator of the taskhandler in question.
POST https://dj-api-ucooq6lz5a-oa.a.run.app/admin/registerHandler
Content-Type: application/json
Authorization: Bearer admin_key
3. Register new Source (Service)
A source is a representation of a collection holding files to be downloaded. It serves to identify which files have to be gathered from where, and to keep track of the origin of every file, so that an overview of every source's contribution to the final archive can be provided. The registration request returns a source-id, which subsequently has to be used to uniquely identify the source when interacting with the DataDJ.
POST https://dj-api-ucooq6lz5a-oa.a.run.app/source
Content-Type: application/json
Authorization: Bearer service_key
{
"name": "Test-Source-One",
"Organisation": "ETHZ"
}
Creating, modifying or downloading archives (Service)
https://dj-api-ucooq6lz5a-oa.a.run.app/archive
This endpoint expects a request that contains four fields:
{
"email":"",
"archiveID":"",
"files":[],
"meta": ""
}
email, archiveID and meta are strings, whereas files is a list of strings containing the names of the files. Note that the examples in this README pass the files in a content field, grouped by sourceID.
Depending on which fields are left empty, the API triggers different operations. For now, only option 4 is used in tests; the other options are kept for future use.
1. Create an archive from a list of files
Both email and archiveID are left empty, whereas files contains the names of the files the archive should be initialised with.
Example:
POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key
{
"email": "",
"archiveID": "",
"content": [
{
"sourceID": "0ff529e3",
"files": ["/test/dir/file1", "/test/dir/file2"]
},
{
"sourceID": "eba48cdb",
"files": ["/test/dir/file3", "/test/dir/file4"]
}],
"meta": "{meta: information}"
}
2. Add a list of files to an archive
email is left empty. archiveID contains the identifier of a previously created archive and files contains the list of files you want to add to the archive.
Example:
POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key
{
"email": "",
"archiveID": "e01fd941",
"content": [
{
"sourceID": "0ff529e3",
"files": ["/test/dir/file1", "/test/dir/file2"]
},
{
"sourceID": "eba48cdb",
"files": ["/test/dir/file3", "/test/dir/file4"]
}],
"meta": "{meta: information}"
}
3. Download an archive
email contains the email address the download link is sent to, archiveID specifies the archive you want to download and files is left empty. The DataDJ will send you a download link that allows you to download the archive as a .zip file.
Example:
POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key
{
"email": "email@address.com",
"archiveID": "e01fd941",
"content": [],
"meta": ""
}
4. Directly download a list of files as archive
email contains the email address the download link is sent to, archiveID is left empty and files contains the names of the files you want to download.
The DJ creates an archive of the files in the request and returns its identifier in the response, in case the archive needs to be accessed or modified later on. It is not necessary to separately trigger the notification containing the download link, as this happens automatically.
Example:
POST https://dj-api-ucooq6lz5a-oa.a.run.app/archive
Content-Type: application/json
Authorization: Bearer service_key
{
"email": "email@address.com",
"archiveID": "",
"content": [
{
"sourceID": "0ff529e3",
"files": ["/test/dir/file1", "/test/dir/file2"]
},
{
"sourceID": "eba48cdb",
"files": ["/test/dir/file3", "/test/dir/file4"]
}],
"meta": "{meta: information}"
}
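The four options above differ only in which fields are left empty. The following sketch makes that mapping explicit. It is an illustration of the rules as described in this README, not the DataDJ's actual dispatch code; the names `requestShape` and `operation` are hypothetical, and combinations the README does not describe are reported as errors.

```go
package main

import "fmt"

// requestShape holds only the information that decides which operation is
// triggered. HasFiles stands for "the list of files is non-empty".
type requestShape struct {
	Email     string
	ArchiveID string
	HasFiles  bool
}

// operation returns a description of the option (1 to 4) a request falls
// under, or an error for combinations the README does not describe.
func operation(r requestShape) (string, error) {
	hasEmail, hasID := r.Email != "", r.ArchiveID != ""
	switch {
	case !hasEmail && !hasID && r.HasFiles:
		return "1: create archive", nil
	case !hasEmail && hasID && r.HasFiles:
		return "2: add files to archive", nil
	case hasEmail && hasID && !r.HasFiles:
		return "3: download archive", nil
	case hasEmail && !hasID && r.HasFiles:
		return "4: directly download files as archive", nil
	default:
		return "", fmt.Errorf("combination not described: email=%t archiveID=%t files=%t",
			hasEmail, hasID, r.HasFiles)
	}
}

func main() {
	shapes := []requestShape{
		{HasFiles: true},
		{ArchiveID: "e01fd941", HasFiles: true},
		{Email: "email@address.com", ArchiveID: "e01fd941"},
		{Email: "email@address.com", HasFiles: true},
	}
	for _, s := range shapes {
		op, err := operation(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(op)
	}
}
```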
Currently, the /archive endpoint returns an object describing the order that was created for the archive in question. Orders are objects that tell the taskhandlers which archives should be downloaded.
{
"orderID": "a5777ffb",
"archiveID": "4afc3f67",
"email": "email@address.com",
"date": "2022-12-14 16:27:28.967665178 +0000 UTC m=+67114.216955617",
"status": "opened",
"sources": [
"0ff529e3"
]
}
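A client written in Go could decode this response into a struct such as the one sketched below. The type `Order` and the helper `parseOrder` are hypothetical names chosen for this example; the JSON field names come from the response above. The date is kept as a plain string, because the example value is in the debugging format produced by Go's `time.Time.String()` (including the monotonic clock reading `m=+...`), which is not intended to be parsed back.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Order mirrors the JSON object returned by /archive in the example above.
type Order struct {
	OrderID   string   `json:"orderID"`
	ArchiveID string   `json:"archiveID"`
	Email     string   `json:"email"`
	Date      string   `json:"date"` // kept as text, see the note above
	Status    string   `json:"status"`
	Sources   []string `json:"sources"`
}

// parseOrder decodes a response body into an Order.
func parseOrder(data []byte) (Order, error) {
	var o Order
	err := json.Unmarshal(data, &o)
	return o, err
}

func main() {
	body := []byte(`{
		"orderID": "a5777ffb",
		"archiveID": "4afc3f67",
		"email": "email@address.com",
		"date": "2022-12-14 16:27:28.967665178 +0000 UTC m=+67114.216955617",
		"status": "opened",
		"sources": ["0ff529e3"]
	}`)
	order, err := parseOrder(body)
	if err != nil {
		fmt.Println("decoding failed:", err)
		return
	}
	fmt.Println(order.OrderID, order.ArchiveID, order.Status)
}
```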
Inspecting an archive (Service)
https://data-dj-2021.oa.r.appspot.com/archive/id
This endpoint lets you inspect the contents of the archive identified by id, either in the browser or via an API client. The response is a JSON object representing the archive.
Example:
GET https://dj-api-ucooq6lz5a-oa.a.run.app/archive/a2e11165
Content-Type: application/json
Authorization: Bearer service_key
Example Response:
{
"id": "a2e11165",
"content": [
{
"sourceID": "0ff529e3",
"files": [
"/test/dir/file1",
"/test/dir/file2"
]
},
{
"sourceID": "eba48cdb",
"files": [
"/test/dir/file3",
"/test/dir/file4"
]
}
],
"meta": "{meta: information}",
"timeCreated": "2022-12-09 13:31:43.320372 +0100 CET m=+305.508934168",
"timeUpdated": "",
"status": "opened",
"sources": [
"0ff529e3",
"eba48cdb"
]
}
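As an illustration of how such a response can be used, the sketch below decodes it and counts the files contributed by each source. The types `Archive` and `SourceContent` and the helper `filesPerSource` are hypothetical names for this example; only the JSON field names are taken from the response above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SourceContent lists the files one source contributes to an archive.
type SourceContent struct {
	SourceID string   `json:"sourceID"`
	Files    []string `json:"files"`
}

// Archive mirrors the example response of GET /archive/id.
type Archive struct {
	ID          string          `json:"id"`
	Content     []SourceContent `json:"content"`
	Meta        string          `json:"meta"`
	TimeCreated string          `json:"timeCreated"`
	TimeUpdated string          `json:"timeUpdated"`
	Status      string          `json:"status"`
	Sources     []string        `json:"sources"`
}

// filesPerSource returns the number of files each source contributes.
func filesPerSource(a Archive) map[string]int {
	counts := make(map[string]int)
	for _, c := range a.Content {
		counts[c.SourceID] += len(c.Files)
	}
	return counts
}

func main() {
	body := []byte(`{
		"id": "a2e11165",
		"content": [
			{"sourceID": "0ff529e3", "files": ["/test/dir/file1", "/test/dir/file2"]},
			{"sourceID": "eba48cdb", "files": ["/test/dir/file3", "/test/dir/file4"]}
		],
		"meta": "{meta: information}",
		"status": "opened",
		"sources": ["0ff529e3", "eba48cdb"]
	}`)
	var archive Archive
	if err := json.Unmarshal(body, &archive); err != nil {
		fmt.Println("decoding failed:", err)
		return
	}
	fmt.Println(archive.ID, filesPerSource(archive))
}
```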
Local Development (Outdated)
- make a copy of .env.example and save it as .env.local
- replace the example directory paths, bucket names and other settings as needed.
option a: run with go
download and run the redis image with docker
docker pull redis
docker run --name dj-redis -p 6379:6379 -d redis
start the task handler
open a terminal in project root.
export all of the variables in the .env.local file
run the task handler
source .env.local
export $(cut -d= -f1 .env.local)
go run ./taskHandler/*.go
open a separate terminal in project root.
export all of the variables in the .env.local file
run the api
source .env.local && export $(cut -d= -f1 .env.local)
go run ./api/*.go
note that for any changes in the environment file to take effect, you must export the variables again and restart that part of the application.
option b: (to be completed)
to run the publisher and subscriber applications using docker, include the path to the .env.local file in the docker run command.
docker run --env-file=./.env.local -p 8080:8080 data-dj-image
Docker commands
docker build --platform=linux/amd64 -f Dockerfile.api -t dj-api-amd64:0.0.1 .
docker tag dj-api-amd64:0.0.1 europe-west6-docker.pkg.dev/data-dj-2021/dj-docker-repo/dj-api:0.0.1
docker push europe-west6-docker.pkg.dev/data-dj-2021/dj-docker-repo/dj-api:0.0.1
Steps for Google Cloud Run
- Follow instructions: https://zahadum.notion.site/Google-Cloud-4c32dcbe1cfb4b479e8680e852ef0d84
curl -X POST "0.0.0.0:8765/admin/createKeyLink" \
-H "Authorization: Bearer $ADMIN_KEY" \
-H "Content-Type: application/json" \
-d '{"email":"barry.sunderland@outlook.com"}'
Authentication
generates a token
saves hashed token in mongo
middleware function validates token during requests
set mongo collection to delete a document after the given number of seconds. Does not apply if the index field is not in the document, e.g. if a doc does not have expiryRequestedDate it will not be deleted.
db.apiKeys.createIndex( { "expiryRequestedDate": 1 }, { expireAfterSeconds: 3600 } )
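The token flow listed above (generate a token, store only its hash, validate it in a middleware) can be sketched in Go as follows. This is a simplified illustration, not the DataDJ implementation: an in-memory map stands in for the mongo collection, SHA-256 is an assumed hash function (the README does not name one), expiry is left out, and all identifiers (`keyStore`, `issue`, `requireKey`, `statusFor`) are hypothetical.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// keyStore is an in-memory stand-in for the collection of hashed keys.
type keyStore struct {
	mu     sync.RWMutex
	hashes map[string]bool
}

func newKeyStore() *keyStore { return &keyStore{hashes: map[string]bool{}} }

// hashToken returns the hex-encoded SHA-256 digest of a token (assumed hash).
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// issue generates a random token, stores only its hash and returns the token.
func (s *keyStore) issue() (string, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	token := hex.EncodeToString(raw)
	s.mu.Lock()
	s.hashes[hashToken(token)] = true
	s.mu.Unlock()
	return token, nil
}

// valid reports whether the hash of the presented token is known.
func (s *keyStore) valid(token string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.hashes[hashToken(token)]
}

// requireKey wraps a handler and rejects requests without a valid bearer token.
func (s *keyStore) requireKey(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		auth := r.Header.Get("Authorization")
		if !strings.HasPrefix(auth, "Bearer ") || !s.valid(strings.TrimPrefix(auth, "Bearer ")) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one in-process request with the given Authorization header
// through the protected handler and returns the resulting status code.
func statusFor(s *keyStore, auth string) int {
	protected := s.requireKey(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	req := httptest.NewRequest(http.MethodGet, "/archive/a2e11165", nil)
	req.Header.Set("Authorization", auth)
	rec := httptest.NewRecorder()
	protected.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	store := newKeyStore()
	token, err := store.issue()
	if err != nil {
		fmt.Println("token generation failed:", err)
		return
	}
	fmt.Println(statusFor(store, "Bearer "+token)) // valid key
	fmt.Println(statusFor(store, "Bearer wrong"))  // unknown key
}
```

Storing only the hash means that a leaked copy of the key collection does not reveal usable keys; the plain token is returned once, when it is issued.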
Useful Reference Material for Go
- Learning Go by Jon Bodner: general reference for programming in Go (types, syntax, imports etc.); see Ch13 for writing tests
Material for MongoDB
http://www.inanzzz.com/index.php/post/g7e8/running-mongodb-migration-script-at-the-docker-startup
Directories
Path | Synopsis |
---|---|
 | this package subscribes to the redis channel and asynchronously handles requests to zip files |
 | this package subscribes to the redis channel and asynchronously handles requests to zip files |