imageid: Similar images indexing service
![GoDoc](https://godoc.org/bitbucket.org/taringa/imageid?status.svg)
This tool indexes a large number (millions) of images and partitions them into
disjoint groups of similar images. Each group is identified by a canonical URL,
which is the URL of one of the images in the group.
| Image | URL | Canonical URL |
|-------|-----|---------------|
| ☂ | http://abc.def/umbrella1.png | http://abc.def/umbrella1.png |
| ♞ | http://asd.fgh/horse.png | http://asd.fgh/horse.png |
| ☂ | http://xyz.tld/u.jpg | http://abc.def/umbrella1.png |
Images are hashed using a combination of dhash (difference hash) and phash
(perceptual hash), resulting in a 128-bit hash.
Similarity detection is not flawless; for example, a cropped version of an image
will be detected as a completely different image.
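As a rough illustration of the hashing step, the following Go sketch computes a 64-bit difference hash from an image that has already been reduced to a 9x8 grayscale grid, and packs it with a second 64-bit value into a 128-bit hash. The `dhash64` function, the `Hash128` type, the field order, and the zero placeholder used for the perceptual half are assumptions made for this example; the actual imageid code may resize, compare, and combine the hashes differently.

```go
package main

import "fmt"

// dhash64 computes a 64-bit difference hash from a grayscale grid of
// 8 rows by 9 columns. Each bit records whether a pixel is brighter
// than its right-hand neighbour.
func dhash64(gray [8][9]uint8) uint64 {
	var h uint64
	for y := 0; y < 8; y++ {
		for x := 0; x < 8; x++ {
			h <<= 1
			if gray[y][x] > gray[y][x+1] {
				h |= 1
			}
		}
	}
	return h
}

// Hash128 is a hypothetical container for the combined hash:
// D holds the difference hash, P the perceptual hash.
type Hash128 struct {
	D, P uint64
}

// String renders the hash as 32 hexadecimal digits.
func (h Hash128) String() string {
	return fmt.Sprintf("%016x%016x", h.D, h.P)
}

func main() {
	// Synthetic image: brightness decreases from left to right,
	// so every pixel is brighter than its right-hand neighbour.
	var img [8][9]uint8
	for y := range img {
		for x := range img[y] {
			img[y][x] = uint8(200 - 20*x)
		}
	}
	// P is left at zero here; a real phash needs a DCT, which is
	// outside the scope of this sketch.
	h := Hash128{D: dhash64(img), P: 0}
	fmt.Println(h)
}
```

Because each bit only encodes a local brightness comparison, small changes such as recompression tend to flip few bits, while cropping shifts the whole grid and changes many of them.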
Each image is processed using the following pipeline:
- If the image URL is already indexed, do nothing.
- Download the image.
- Calculate the MD5 of the image. If the MD5 is already indexed, assign the new
  URL to the existing similarity group.
- Calculate the hash and search the database for similar hashes
  (`distance <= 8`).
  - If no similar hash is found, store the hash and MD5, associated with
    the URL, in a new similarity group.
  - If a similar hash is found, assign the new URL and MD5 to its
    similarity group. The new hash is not stored.
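The steps above can be sketched in Go as follows. This is a minimal in-memory illustration, not the service's code: the `Index` type, the `Process` method, and the injected `download` and `hashImage` functions are hypothetical names, the maps are keyed by the URL itself rather than by its MD5, and the similarity search is a linear scan instead of a metric tree.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"math/bits"
)

// Hash128 is a hypothetical 128-bit image hash.
type Hash128 struct{ Hi, Lo uint64 }

// distance returns the Hamming distance between two hashes.
func distance(a, b Hash128) int {
	return bits.OnesCount64(a.Hi^b.Hi) + bits.OnesCount64(a.Lo^b.Lo)
}

const maxDistance = 8 // similarity threshold from the pipeline description

// Index is an in-memory stand-in for the image index.
type Index struct {
	urls   map[string]string  // image URL -> canonical URL
	md5s   map[string]string  // MD5 of image bytes -> canonical URL
	hashes map[Hash128]string // image hash -> canonical URL
}

func NewIndex() *Index {
	return &Index{
		urls:   map[string]string{},
		md5s:   map[string]string{},
		hashes: map[Hash128]string{},
	}
}

// Process runs one URL through the pipeline and returns its canonical URL.
func (ix *Index) Process(
	url string,
	download func(string) ([]byte, error),
	hashImage func([]byte) (Hash128, error),
) (string, error) {
	// 1. Already indexed: nothing to do.
	if c, ok := ix.urls[url]; ok {
		return c, nil
	}
	// 2. Download the image.
	data, err := download(url)
	if err != nil {
		return "", err
	}
	// 3. Exact duplicate by MD5: join the existing group.
	sum := md5.Sum(data)
	key := hex.EncodeToString(sum[:])
	if c, ok := ix.md5s[key]; ok {
		ix.urls[url] = c
		return c, nil
	}
	// 4. Search for a similar hash (linear scan in this sketch).
	h, err := hashImage(data)
	if err != nil {
		return "", err
	}
	for stored, c := range ix.hashes {
		if distance(stored, h) <= maxDistance {
			// Similar image found: the new hash is not stored.
			ix.urls[url] = c
			ix.md5s[key] = c
			return c, nil
		}
	}
	// 5. No match: start a new group with this URL as canonical.
	ix.urls[url] = url
	ix.md5s[key] = url
	ix.hashes[h] = url
	return url, nil
}

func main() {
	ix := NewIndex()
	// Fake downloader and hasher, for illustration only: the "image"
	// is the URL text and the hash depends only on its length.
	download := func(u string) ([]byte, error) { return []byte(u), nil }
	hashImage := func(b []byte) (Hash128, error) {
		return Hash128{Lo: uint64(len(b))}, nil
	}
	for _, u := range []string{"http://abc.def/a.png", "http://xyz.tld/b.png"} {
		c, err := ix.Process(u, download, hashImage)
		fmt.Println(u, "->", c, err)
	}
}
```

Passing the downloader and hasher as functions keeps the sketch independent of the network and of image decoding, so the grouping logic can be exercised on its own.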
The service uses threads (goroutines) to leverage bandwidth and CPU:
- 10 threads for downloading images (configurable with the
  `IMAGEID_DOWNLOAD_WORKERS` environment variable).
- `runtime.NumCPU()` threads for calculating hashes, searching and indexing
  (configurable with `IMAGEID_PROCESS_WORKERS`).
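A two-stage worker pool of this kind might look like the Go sketch below. Only the two environment variable names and the defaults come from the description above; `workersFromEnv`, `stage`, and the string-based placeholder work are hypothetical and chosen for brevity.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"sync"
)

// workersFromEnv reads a positive integer from the named environment
// variable and falls back to def when it is unset or invalid.
func workersFromEnv(name string, def int) int {
	if v, err := strconv.Atoi(os.Getenv(name)); err == nil && v > 0 {
		return v
	}
	return def
}

// stage starts n goroutines that apply fn to every item read from in.
// The returned channel is closed once all workers have finished.
func stage(n int, in <-chan string, fn func(string) string) <-chan string {
	out := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range in {
				out <- fn(item)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	downloaders := workersFromEnv("IMAGEID_DOWNLOAD_WORKERS", 10)
	processors := workersFromEnv("IMAGEID_PROCESS_WORKERS", runtime.NumCPU())

	urls := make(chan string)
	go func() {
		defer close(urls)
		for i := 0; i < 5; i++ {
			urls <- fmt.Sprintf("http://example.com/%d.png", i)
		}
	}()

	// Placeholder work: a real service would download and hash here.
	downloaded := stage(downloaders, urls, func(u string) string {
		return "downloaded " + u
	})
	processed := stage(processors, downloaded, func(s string) string {
		return "processed " + s
	})
	for line := range processed {
		fmt.Println(line)
	}
}
```

Keeping downloads and hashing in separate pools lets the network-bound stage run with more workers than there are CPUs, while the CPU-bound stage stays sized to the machine.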
Database
The database backend is MySQL, unless the `IMAGEID_NULL_STORE` environment
variable is defined, in which case the image index is stored in main memory.
The database uses 3 tables that mimic a key-value store:
| Table | Key (`k`) | Value (`v`) |
|-------|-----------|-------------|
| urls | `<md5(url)>` | `<canonical-url>` |
| md5s | `<md5(img)>` | `<canonical-url>` |
| hashes | `<hash>` | `<canonical-url>` |
- `<url>`: URL of an image.
- `<canonical-url>`: URL of the first similar image that was indexed.
- `<md5>`: MD5 of the image or of the URL.
- `<hash>`: `dhash+phash` of the image.
There is also an additional table, `similar_log`, which stores a log of the
similar images found.
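To make the key columns concrete, here is a small Go sketch that derives MD5-based keys for the `urls` and `md5s` tables. The helper name `md5Key` and the hexadecimal encoding are assumptions; the service may store the digest in another form, such as raw bytes.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// md5Key returns the MD5 digest of data as a 32-character hex string.
// Hex encoding is an assumption made for this example.
func md5Key(data []byte) string {
	sum := md5.Sum(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	url := "http://abc.def/umbrella1.png"
	imageBytes := []byte("stand-in for downloaded image bytes")

	fmt.Println("urls key:", md5Key([]byte(url))) // <md5(url)>
	fmt.Println("md5s key:", md5Key(imageBytes))  // <md5(img)>
}
```

Hashing the URL gives every key in the `urls` table the same fixed length, regardless of how long the original URL is.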
Algorithm
The similar hash search is performed using a metric tree.
Hamming distance is used to compute the distance between hashes.
At startup, the complete set of hash keys is read from the database,
and the metric tree is constructed in main memory.
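The text does not say which kind of metric tree is used. As one possible illustration, the Go sketch below implements a BK-tree, a common metric tree for discrete distances, over 128-bit hashes with Hamming distance. The `Hash128`, `BKTree`, `Insert`, and `Search` names are hypothetical; the hash values are taken from the log output in the usage example.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Hash128 is a hypothetical 128-bit hash split into two 64-bit halves.
type Hash128 struct{ Hi, Lo uint64 }

// hamming counts the bits that differ between two hashes.
func hamming(a, b Hash128) int {
	return bits.OnesCount64(a.Hi^b.Hi) + bits.OnesCount64(a.Lo^b.Lo)
}

type node struct {
	hash     Hash128
	children map[int]*node // keyed by distance to this node's hash
}

// BKTree indexes hashes so that range queries can skip whole subtrees.
type BKTree struct{ root *node }

// Insert adds a hash to the tree; exact duplicates are ignored.
func (t *BKTree) Insert(h Hash128) {
	if t.root == nil {
		t.root = &node{hash: h}
		return
	}
	n := t.root
	for {
		d := hamming(n.hash, h)
		if d == 0 {
			return
		}
		child, ok := n.children[d]
		if !ok {
			if n.children == nil {
				n.children = map[int]*node{}
			}
			n.children[d] = &node{hash: h}
			return
		}
		n = child
	}
}

// Search returns every stored hash within radius of q.
func (t *BKTree) Search(q Hash128, radius int) []Hash128 {
	var found []Hash128
	var walk func(n *node)
	walk = func(n *node) {
		if n == nil {
			return
		}
		d := hamming(n.hash, q)
		if d <= radius {
			found = append(found, n.hash)
		}
		// By the triangle inequality, matches can only live in
		// children whose edge distance is within radius of d.
		for k, c := range n.children {
			if k >= d-radius && k <= d+radius {
				walk(c)
			}
		}
	}
	walk(t.root)
	return found
}

func main() {
	var tree BKTree
	tree.Insert(Hash128{0x8216715a5295080c, 0x877da82f450307db})
	tree.Insert(Hash128{0x80a486d2cadcd480, 0x3c531129c8f4af4f})

	query := Hash128{0x808486d2cadcd480, 0xbc531129c8f0bf0f}
	for _, h := range tree.Search(query, 8) {
		fmt.Printf("%016x%016x distance %d\n", h.Hi, h.Lo, hamming(h, query))
	}
}
```

Since only the hash keys are needed to build such a tree, loading them all at startup keeps lookups in memory while the database remains the durable store.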
Installing / Executing
A `Dockerfile` is provided, which you can use to run the server as-is,
extend, or simply read as installation instructions.
When the Docker container is started, the code is compiled and installed,
and then `imageid-server` is run.
Two scripts are provided to build and run the container:
`scripts/run-dev-mysql.sh` and `scripts/run-dev-standalone.sh`. The `mysql`
variant runs the `mariadb` Docker image and uses it as the database backend,
for persistence.
`imageid-server` is the main executable. See the `imageid/server` package
documentation for a description of the available HTTP endpoints.
The log is sent to `stdout`/`stderr`.
Usage example
Let's start the server in standalone mode (no database):
$ scripts/run-dev-standalone.sh
++ docker build -t imageid .
...
Successfully built e0e4d021c520
2015/07/01 19:52:02 [INFO] HTTP server listening at port :8080
2015/07/01 19:52:02 [INFO] Initializing DB...
Once the server is running, we can feed some images using a POST request (in another window):
$ curl -X POST 'http://localhost:8080/process?url=https://www.google.com.ar/images/srpr/logo11w.png'
"Added https://www.google.com.ar/images/srpr/logo11w.png"
In the server log window you will see:
2015/07/01 19:59:57 [DEBUG] Processing https://www.google.com.ar/images/srpr/logo11w.png
2015/07/01 19:59:57 [DEBUG] Calculated hash https://www.google.com.ar/images/srpr/logo11w.png: 8216715a5295080c877da82f450307db
2015/07/01 19:59:57 [DEBUG] New hash node: https://www.google.com.ar/images/srpr/logo11w.png
Let's feed two similar images:
$ curl -X POST 'http://localhost:8080/process?url=http://i.imgur.com/JeYm857.png'
"Added http://i.imgur.com/JeYm857.png"
$ curl -X POST 'http://localhost:8080/process?url=http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png'
"Added http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png"
Here is the log output:
2015/07/01 20:28:44 [DEBUG] Processing http://i.imgur.com/JeYm857.png
2015/07/01 20:28:44 [DEBUG] Calculated hash http://i.imgur.com/JeYm857.png: 80a486d2cadcd4803c531129c8f4af4f
2015/07/01 20:28:44 [DEBUG] New hash node: http://i.imgur.com/JeYm857.png
2015/07/01 20:29:23 [DEBUG] Processing http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png
2015/07/01 20:29:23 [DEBUG] Calculated hash http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png: 808486d2cadcd480bc531129c8f0bf0f
2015/07/01 20:29:23 [DEBUG] hash distance 5: http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png http://i.imgur.com/JeYm857.png
Let's query the canonical URLs:
$ curl 'http://localhost:8080/canonical?url=https://www.google.com.ar/images/srpr/logo11w.png'
{"Canonical":"https://www.google.com.ar/images/srpr/logo11w.png"}
$ curl 'http://localhost:8080/canonical?url=http://i.imgur.com/JeYm857.png'
{"Canonical":"http://i.imgur.com/JeYm857.png"}
$ curl 'http://localhost:8080/canonical?url=http://elplanc.net/wp-content/uploads/2013/11/logocajaazul.png'
{"Canonical":"http://i.imgur.com/JeYm857.png"}
$ curl 'http://localhost:8080/canonical?url=http://unknown.url'
{"Canonical":""}
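The same canonical URL query can be issued from Go. The sketch below assumes only what the curl session above shows: a `/canonical` endpoint that takes a `url` query parameter and returns a JSON object with a `Canonical` field. The helper names `endpoint`, `parseCanonical`, and `canonical` are hypothetical, and the server is assumed to accept a percent-encoded `url` parameter.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// endpoint builds a request URL with the image URL percent-encoded
// in the "url" query parameter.
func endpoint(base, path, imageURL string) string {
	q := url.Values{"url": {imageURL}}
	return base + path + "?" + q.Encode()
}

// parseCanonical extracts the "Canonical" field from a response body.
func parseCanonical(body []byte) (string, error) {
	var r struct{ Canonical string }
	if err := json.Unmarshal(body, &r); err != nil {
		return "", err
	}
	return r.Canonical, nil
}

// canonical asks the server for the canonical URL of imageURL.
// An empty result means the URL is not indexed.
func canonical(base, imageURL string) (string, error) {
	resp, err := http.Get(endpoint(base, "/canonical", imageURL))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return parseCanonical(body)
}

func main() {
	c, err := canonical("http://localhost:8080", "http://i.imgur.com/JeYm857.png")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("canonical:", c)
}
```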