BeeZIM Mirror Tool
BeeZIM is a project made by Racin Nygaard and Rodrigo Q. Saramago for the We Are Millions hackathon in March 2022.
The project's goal is to give people an easy-to-use Dapp for uploading a copy of a website to a decentralized storage platform such as Swarm. Websites copied onto Swarm are resistant to censorship and are meant to remain available even if the original source disappears. In this sense, our Dapp has ambitions similar to those of the Internet Archive.
In the first release of BeeZIM, we are able to create a perfect mirror of the original source, including all (front-end) scripts, stylesheets, redirects, and images. In most cases, an end user will not be able to tell the difference between the BeeZIM mirror and the original source. In addition, BeeZIM embeds a powerful search engine, which helps users find the content that is most relevant to them. Often this is an enhancement over the original source, as many websites provide poor or nonexistent search. Everything works locally in the user's browser without communicating with any other server.
Thanks to the openZIM project, which is currently maintained by a not-for-profit entity named Kiwix, websites can be highly compressed into a single file that is easy to share and can be viewed on devices with minimal computational resources. Kiwix distributes archives for Wikipedia, Project Gutenberg, Stack Exchange, and many more.
The compressed files follow the ZIM file format. According to Kiwix, the entire Wikipedia can be compressed into an 80 GB ZIM file containing more than 6 million articles, with images!
This repository provides tools to publish copies of entire websites on Swarm.
Demo
Click here to see a demo video of the main functionalities of BeeZIM.
How it works
Beezim is a command-line tool that uses the Bee API to upload content to Swarm. Files in the ZIM format can be downloaded from a web2 mirror website or provided by the user if they are already stored locally. The ZIMs are parsed, and a tar archive is generated from each of them.
The parser can optionally embed metadata, a text search engine, and a search DApp in the archives. Each tar archive is then uploaded to Swarm, and its reference (i.e., the manifest address) is returned as output. Please store this reference so you can access your page later. In the future, we plan to provide a key-value store, also hosted on Swarm, to manage the metadata and references ;)
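For readers curious about what the upload step looks like at the HTTP level, the sketch below posts a tar archive to a Bee node as a website collection, records the returned reference, and builds the URL under which that reference can be viewed. It is an illustration under assumptions, not Beezim's actual code: the function names `bee_upload_tar`, `save_ref`, and `bzz_url`, the `DRY_RUN`, `BEE_API_URL`, and `REF_FILE` variables, and the `references.tsv` file are made up for this example, and the headers follow the Bee `/bzz` endpoint, which may differ between Bee versions.

```shell
# Illustrative sketch only; Beezim's own upload code may differ.

# Upload a tar archive as a website collection through the Bee API.
# Arguments: <tar file> <postage batch id>
# Set DRY_RUN=1 to print the curl command instead of running it.
bee_upload_tar() {
  ${DRY_RUN:+echo} curl -sS -X POST "${BEE_API_URL:-http://localhost:1633}/bzz" \
    -H "Content-Type: application/x-tar" \
    -H "Swarm-Collection: true" \
    -H "Swarm-Index-Document: index.html" \
    -H "Swarm-Error-Document: error.html" \
    -H "Swarm-Postage-Batch-Id: $2" \
    --data-binary "@$1"
}

# Append "<name><TAB><reference>" to a local file so the reference is not lost.
save_ref() {
  printf '%s\t%s\n' "$1" "$2" >> "${REF_FILE:-references.tsv}"
}

# Print the URL under which a reference can be opened through a Bee node.
bzz_url() {
  printf '%s/bzz/%s/\n' "${BEE_API_URL:-http://localhost:1633}" "$1"
}
```

On a real run, Bee answers the upload with a small JSON body whose `reference` field holds the manifest address. Passing that value to `save_ref` and later to `bzz_url` is one simple way to keep track of mirrored sites until a Swarm-hosted key-value store exists.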
The search engine can be enabled during parsing with the --enable-search option.
It allows users to query the text or titles of the uploaded articles.
When the search tool is enabled, Beezim also embeds a navigation bar and webpages that display information about the uploaded files, list search results, and fetch random articles.
The ZIM and/or tar files can be automatically deleted from the host machine after upload with the --clean option.
By default, Beezim mirrors ZIMs to Swarm without adding metadata or the search tool to them.
However, if you would like to search the uploaded content in a fashion similar to that provided by Kiwix, but without relying on server-side services or a database, you can try out our search tool!
Our search tool is hosted in another repository and is called Zim Xapian Searchindex, or ZXS for short.
It is a WebAssembly library and JavaScript search tool that can read the Xapian-format indexes extracted from ZIM files under X/fulltext/xapian and X/title/xapian.
ZXS lets you search the indexed data in your browser, using the Xapian database already embedded in the ZIM files, without interacting with a server. It is based on the Xapian search engine library, compiled to WebAssembly with the Emscripten compiler. With the Beezim search tool, no server can monitor what you are searching for! Everything happens in your browser. ZXS could also allow users to search content without an internet connection by embedding the JavaScript search tool and the WebAssembly engine directly in the ZIM files, although this is not done yet!
Preview
[Screenshots: Articles, List uploaded files, Files information, Search bar, Search results]
How to run
The current command-line tool has the following available commands:
    Swarm zim mirror command-line tool

    Usage:
      beezim [command]

    Available Commands:
      clean       Clean files in datadir
      download    Download zim file
      help        Help about any command
      list        Shows the list of compressed websites currently maintained by Kiwix
      mirror      Mirror zim files to swarm
      parse       Parse zim file [optionally embeding a search engine and reader/searcher DApp]
      upload      Upload tar file to swarm

    Flags:
          --batch-amount int           bee postage batch amount (default 100000000)
          --batch-depth uint           bee postage batch depth (default 30)
          --batch-id string            bee postage batch ID
          --bee-api-url string         bee api url (default "http://localhost:1633")
          --bee-debug-api-url string   bee debug api url (default "http://localhost:1635")
          --clean                      delete all downloaded zim and generated tar files
          --datadir string             path to datadir directory (default "./datadir")
          --enable-search              enable search index
          --gas-price string           gas price for postage stamps purchase
          --gateway                    connect to the swarm public gateway (default "https://gateway-proxy-bee-0-0.gateway.ethswarm.org")
      -h, --help                       help for beezim
          --kiwix string               name of the compressed website hosted by Kiwix. Run "list" to see all available options (default "wikipedia")
          --pin                        whether the uploaded data should be locally pinned on a node
          --tag uint32                 bee tag UID to the attached to the uploaded data

    Use "beezim [command] --help" for more information about a command.
Configure the Bee environment
Beezim uploads files to Swarm by connecting to a Bee node.
Alternatively, you can use the --gateway option to upload directly to the public Swarm gateway.
However, the public gateway has a maximum upload limit of 10 MB per file.
Example using the gateway:
beezim mirror --zim=wikipedia_cr_all_maxi_2022-02.zim --gateway
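Because of the 10 MB limit, it can be handy to check an archive's size before choosing the gateway. The helper below is a small sketch, not part of Beezim: `fits_gateway` is a hypothetical name, and it assumes that "10 MB" means 10 * 1024 * 1024 bytes, which the gateway may count differently.

```shell
# Illustrative helper (not part of Beezim): succeed only if the file is no
# larger than the assumed gateway limit of 10 * 1024 * 1024 bytes.
# Returns 2 if the file cannot be read.
fits_gateway() {
  size=$(wc -c < "$1") || return 2
  [ "$((size))" -le $((10 * 1024 * 1024)) ]
}
```

A wrapper script could pass --gateway when the check succeeds and fall back to a local node with --batch-id otherwise.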
For the best experience, we recommend running your own Bee node before trying Beezim with bigger files. See .env-example for an example of the necessary configuration parameters, and create a file named .env with the configuration parameters for your system.
TL;DR
Skip to "Using Docker to Build BeeZIM", use our Docker images, and have fun!
CLI Commands
Download ZIM files
You can download zim files from the Kiwix mirror:
beezim download \
--kiwix=wikipedia \
--zim=wikipedia_es_climate_change_mini_2022-02.zim
Or by providing a URL:
beezim download --url=https://download.kiwix.org/zim/wikipedia/wikipedia_es_climate_change_mini_2022-02.zim
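Comparing the two commands above suggests that, at least for this example, the --kiwix and --zim values map onto a path under https://download.kiwix.org/zim/. The helper below spells out that mapping. `kiwix_url` is a hypothetical name, and the URL layout is inferred from the example rather than taken from Beezim's source, so it may not hold for every --kiwix value.

```shell
# Illustrative only: build a download URL from a Kiwix project name and a
# ZIM file name, following the pattern seen in the example above.
kiwix_url() {
  printf 'https://download.kiwix.org/zim/%s/%s\n' "$1" "$2"
}

kiwix_url wikipedia wikipedia_es_climate_change_mini_2022-02.zim
# prints https://download.kiwix.org/zim/wikipedia/wikipedia_es_climate_change_mini_2022-02.zim
```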
Parse ZIM files
Without embedded search engine and DApp
This converts the ZIM files to tar archives and embeds in them the minimal files (JS, CSS, HTML) required to serve a webpage on Swarm (i.e., index.html and error.html).
The index page automatically redirects to the main page of the ZIM, if one exists.
beezim parse --zim=wikipedia_es_climate_change_mini_2022-02.zim
Embedding the search engine and BeeZIM DApp
This performs the same operations as before, but also adds a search engine that uses the Xapian index from the ZIM files and a DApp for searching and navigating through the uploaded content.
beezim parse \
--zim=wikipedia_es_climate_change_mini_2022-02.zim \
--enable-search
Upload the TAR to Swarm
You can upload ZIMs that have already been parsed by using the upload command, as below.
Uploading to the public Swarm gateway
Be aware of the size limit!
beezim upload --tar=wikipedia_cr_all_maxi_2022-02.tar --gateway
Uploading one or multiple files to local node
Please check the .env-example for default ip:port configurations.
beezim upload \
--tar=wikipedia_es_climate_change_mini_2022-02.tar \
--batch-id=8e747b4aefe21a9c902337058f7aad71aa3170a9f399ece6f0bdb9f1ec432685
beezim upload all \
--batch-id=8e747b4aefe21a9c902337058f7aad71aa3170a9f399ece6f0bdb9f1ec432685
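The --batch-id values in the commands above identify a postage batch owned by your Bee node. If you do not have one yet, a batch can be bought through the node's API. The sketch below assumes the `POST /stamps/{amount}/{depth}` endpoint on the debug API port used elsewhere in this README, and borrows the defaults of --batch-amount and --batch-depth. `buy_batch`, `DRY_RUN`, and `BEE_DEBUG_API_URL` are hypothetical names, and newer Bee releases may expose this endpoint on the main API port instead.

```shell
# Illustrative only: buy a postage batch from a Bee node.
# Arguments: [amount] [depth], defaulting to Beezim's flag defaults.
# Set DRY_RUN=1 to print the curl command instead of running it.
buy_batch() {
  ${DRY_RUN:+echo} curl -sS -X POST \
    "${BEE_DEBUG_API_URL:-http://localhost:1635}/stamps/${1:-100000000}/${2:-30}"
}
```

Buying a batch spends funds from the node's wallet, so it is worth trying `DRY_RUN=1 buy_batch` first. On a real run, the JSON response should contain a `batchID` field whose value can be passed to --batch-id.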
Filtering tars to be uploaded by keywords
beezim upload --kiwix="gutenberg" all \
--batch-id=8e747b4aefe21a9c902337058f7aad71aa3170a9f399ece6f0bdb9f1ec432685
Mirror
This is the default operation of BeeZIM.
It performs the download -> parse -> upload tasks for one or many ZIMs.
The command flags are similar to those of the other commands.
Run beezim mirror --help to see the currently available options.
beezim mirror \
--url=https://download.kiwix.org/zim/wikipedia/wikipedia_en_100_mini_2022-03.zim \
--batch-id=8e747b4aefe21a9c902337058f7aad71aa3170a9f399ece6f0bdb9f1ec432685 \
--enable-search
beezim mirror --kiwix=gutenberg \
--zim=gutenberg_af_all_2022-03.zim \
--batch-id=8e747b4aefe21a9c902337058f7aad71aa3170a9f399ece6f0bdb9f1ec432685 \
--enable-search
beezim mirror --kiwix=others \
--zim=alpinelinux_en_all_nopic_2021-03.zim \
--bee-api-url=http://localhost:1733 \
--bee-debug-api-url=http://localhost:1735 \
--batch-id=388b9a93fc084d350b2320bedacb3a88779867d956b20a2716512138bc88eac0
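To make the download -> parse -> upload sequence concrete, the sketch below chains the three individual commands using only flags shown earlier in this README. It is an approximation of what mirror does, not its implementation. `mirror_steps` and `DRY_RUN` are hypothetical names, and the tar file name is assumed to be the ZIM name with a .tar extension, as in the upload examples.

```shell
# Illustrative only: run (or, with DRY_RUN=1, print) the three steps that
# `beezim mirror` combines. Arguments: <kiwix name> <zim file> <batch id>.
mirror_steps() {
  ${DRY_RUN:+echo} beezim download --kiwix="$1" --zim="$2" &&
  ${DRY_RUN:+echo} beezim parse --zim="$2" &&
  ${DRY_RUN:+echo} beezim upload --tar="${2%.zim}.tar" --batch-id="$3"
}
```

Running the steps separately like this can be useful when you want to inspect the generated tar archive before uploading it.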
Using Docker to Build BeeZIM
Without search engine
Before you start, make sure you have Docker installed on your system.
If you don't plan to use the search engine and would like to mirror ZIMs as they are, you can either install BeeZIM on your machine and use it without the --enable-search option, or build the BeeZIM Docker image (not the docker-compose setup).
docker build -t beezim -f Dockerfile .
With the Search Engine and Search DApp Tool
Before you start, make sure you have Docker and docker-compose installed on your system.
Our search DApp depends on Zim Xapian Searchindex, a WebAssembly library and JavaScript search tool that can read the search indexes extracted from ZIM files.
We also provide a docker-compose.yml to download the ZXS image and build Beezim on your local machine.
You can use it by running the commands below:
docker-compose -f docker-compose.yml up --build --remove-orphans && docker-compose rm -fsv
docker-compose run --rm \
--user $(id -u):$(id -g) \
beezim ./bin/beezim-cli mirror \
--zim=wikipedia_es_climate_change_mini_2022-02.zim \
--bee-api-url=http://localhost:1633 \
--bee-debug-api-url=http://localhost:1635 \
--batch-id=388b9a93fc084d350b2320bedacb3a88779867d956b20a2716512138bc88eac0 \
--enable-search
There is also a script that simplifies the above command a bit when running BeeZIM with Docker:
./dc-beezim-cli.sh mirror \
--zim=wikipedia_es_climate_change_mini_2022-02.zim \
--batch-id=388b9a93fc084d350b2320bedacb3a88779867d956b20a2716512138bc88eac0 \
--enable-search
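The script itself is not reproduced in this README. A wrapper of this kind could look roughly like the sketch below, which forwards its arguments to the docker-compose invocation shown earlier. `dc_beezim` and `DRY_RUN` are hypothetical names, and the real dc-beezim-cli.sh may set additional options, such as the Bee API URLs.

```shell
# Illustrative only: forward all arguments to beezim-cli inside the
# docker-compose service. Set DRY_RUN=1 to print the command instead.
dc_beezim() {
  ${DRY_RUN:+echo} docker-compose run --rm \
    --user "$(id -u):$(id -g)" \
    beezim ./bin/beezim-cli "$@"
}
```

Running the container with your own user and group ID keeps the files written to the data directory owned by you rather than by root.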