This repo is a collection of scripts to download the dataset needed to train the jibjib-model.
Repo layout
The complete list of JibJib repos is:
- jibjib: Our Android app. Records sounds and looks fantastic.
- deploy: Instructions to deploy the JibJib stack.
- jibjib-model: Code for training the machine learning model for bird classification.
- jibjib-api: Main API to receive database requests & audio files.
- jibjib-data: A MongoDB instance holding information about detectable birds.
- jibjib-query: A thin Python Flask API that handles communication with the TensorFlow Serving instance.
- gopeana: An API client for Europeana, written in Go.
- voice-grabber: A collection of scripts to construct the dataset required for model training.
Scripts
In the top level of this repo, there are several helper scripts to create/change JSON and CSV files, as well as converter.py to convert audio files from mp3 to wav.
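A conversion of this kind can be sketched as follows. This is a minimal sketch and not the actual converter.py: it assumes the ffmpeg command-line tool is installed and on the PATH, and the function names `wav_path_for`, `ffmpeg_command` and `convert_directory` are hypothetical.

```python
import subprocess
from pathlib import Path


def wav_path_for(mp3_path: Path) -> Path:
    """Return the path of the .wav file that sits next to the given .mp3 file."""
    return mp3_path.with_suffix(".wav")


def ffmpeg_command(mp3_path: Path) -> list[str]:
    """Build the ffmpeg call; -y overwrites an existing output file."""
    return ["ffmpeg", "-y", "-i", str(mp3_path), str(wav_path_for(mp3_path))]


def convert_directory(directory: Path) -> None:
    """Convert every .mp3 file below `directory` (assumes ffmpeg is on the PATH)."""
    for mp3_path in sorted(directory.rglob("*.mp3")):
        # check=True raises CalledProcessError if ffmpeg fails on a file.
        subprocess.run(ffmpeg_command(mp3_path), check=True)
```

Calling `convert_directory(Path("files"))` would write a wav file next to each mp3 file; the directory name here is only an illustration.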
This Go script uses gopeana to populate a JSON file and a CSV file with information about the bird voices from the Tierstimmenarchiv (an open dataset of the Museum für Naturkunde Berlin) that are published on Europeana.
This Go script uses the output of data_grabber/ to follow the links provided on Europeana and download the audio files.
This Python script takes input from a CSV file and uses the Wikipedia API to extract summaries about birds, then saves them in a separate CSV file.
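The Wikipedia step could look roughly like the sketch below. It is not the script from this repo: it assumes the MediaWiki `action=query` endpoint with `prop=extracts`, and the column name `name`, the `lang` default and the User-Agent string are hypothetical placeholders.

```python
import csv
import json
import urllib.parse
import urllib.request


def summary_url(title: str, lang: str = "en") -> str:
    """Build a MediaWiki query URL asking for the plain-text intro of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # only the section before the first heading
        "explaintext": 1,   # plain text instead of HTML
        "redirects": 1,
        "format": "json",
        "titles": title,
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)


def extract_summary(response: dict) -> str:
    """Pull the extract out of a parsed API response; empty string if there is none."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"].strip()
    return ""


def fetch_summary(title: str, lang: str = "en") -> str:
    """Request the summary for one title over the network."""
    request = urllib.request.Request(
        summary_url(title, lang),
        headers={"User-Agent": "voice-grabber-example/0.1"},  # hypothetical value
    )
    with urllib.request.urlopen(request) as reply:
        return extract_summary(json.load(reply))


def write_summaries(in_csv: str, out_csv: str, name_column: str = "name", lang: str = "en") -> None:
    """Read bird names from one CSV file and write name/summary pairs to another."""
    with open(in_csv, newline="", encoding="utf-8") as src, \
            open(out_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow([name_column, "summary"])
        for row in csv.DictReader(src):
            writer.writerow([row[name_column], fetch_summary(row[name_column], lang)])
```

Keeping the URL building and the response parsing in separate functions makes both easy to check without network access.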
This is a collection of scripts to:
- clean the files directory (in our case, only birds with a German Wikipedia entry were kept, to bring down the total number of classes)
- nicely crawl Xeno Canto for audio files of birds
- download the audio files from Xeno Canto
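Crawling "nicely" usually means pausing between requests so the remote server is not overloaded. The following is a minimal sketch of that idea, not the code from this repo: the function names, the one-second default delay, the index-based file names and the `.mp3` extension are all assumptions, and no particular Xeno Canto endpoint is implied.

```python
import time
import urllib.request
from pathlib import Path
from typing import Callable, Iterable


def http_get(url: str) -> bytes:
    """Download one URL and return the raw bytes."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "voice-grabber-example/0.1"}  # hypothetical value
    )
    with urllib.request.urlopen(request) as reply:
        return reply.read()


def polite_download(
    urls: Iterable[str],
    target_dir: Path,
    delay_seconds: float = 1.0,
    fetch: Callable[[str], bytes] = http_get,
    sleep: Callable[[float], None] = time.sleep,
) -> list[Path]:
    """Download the URLs one at a time, pausing between consecutive requests."""
    target_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        target = target_dir / f"{index:05d}.mp3"  # assumes the files are mp3
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without network access or real waiting.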