# Shortest URL

URL shortener API example using Go and Chi.

You enter a URL such as https://github.com/darioblanco and it returns a short URL such as http://urlshortener/64fc5e.
## Dependencies

The following dependencies are required:

- `go`
- `git`

The following dependency is optional (but recommended to test in a production-like scenario):

- `docker`

The command `make init-deps` checks whether any of them is missing. This command also runs as part of every other Makefile target.
## Configuration

A default configuration is set in `cmd/config.default.yaml`, but it can be overridden using environment variables. This gives flexibility for any kind of deployment.
| Yaml config | Environment variable | Description | Default |
|---|---|---|---|
| `environment` | `SHORTESTURL_ENVIRONMENT` | The environment used to define different logging and cache strategies. It can be one of `prod`, `stage`, `test` and `dev`. If `dev` is provided, Redis is not needed as a dependency. The `test` environment is used in docker-compose. | `dev` |
| `httpHost` | `SHORTESTURL_HTTP_HOST` | The http host for the server, swagger and encoded urls. | `localhost` |
| `httpPort` | `SHORTESTURL_HTTP_PORT` | The http port for the server, swagger and encoded urls. | `3000` |
| `httpScheme` | `SHORTESTURL_HTTP_SCHEME` | The http scheme to use in the server, swagger and encoded urls. | `http` |
| `redisHost` | `SHORTESTURL_REDIS_HOST` | The redis host. Ignored for the `dev` environment. | `localhost` |
| `redisPort` | `SHORTESTURL_REDIS_PORT` | The redis port. Ignored for the `dev` environment. | `6379` |
| `urlLength` | `SHORTESTURL_URL_LENGTH` | The length of the shortened url; a bigger number will reduce possible collisions (solving collisions requires extra computational effort). | `6` |
| `urlExpirationInHours` | `SHORTESTURL_URL_EXPIRATION_IN_HOURS` | The maximum time in hours that a shortened url will live in the system. A value of `0` means urls are kept indefinitely. | `0` |
| `version` | `SHORTESTURL_VERSION` | The version of the released application. Useful for CI/CD pipelines. | `unknown` |
## How to Run

### Local

To run the application locally, type `make run`.

To run with hot reload, so that any changes saved in your files are reflected in the running server, type `make run-hmr`.
### Docker

Alternatively, to run the application in docker:

- Make sure you run `make init` first (`docker` is needed as a dependency)
- Type `docker-compose up --build`; both the `redis` and `server` containers will run in the foreground

See the docker-compose CLI for more ways to run the dockerized application.
## Swagger

You can browse the swagger documentation at http://localhost:3000/docs/index.html.
## Deploy

This application can be deployed with `docker`, building the `Dockerfile` and running the image somewhere, or directly with the binary generated in `tmp/shortesturl`.

A reverse proxy that performs SSL termination in front of this application is strongly recommended. This application does not handle HTTPS and expects a proxy to take care of that part.
## Commit style

The commit messages follow the Conventional Commits style. Thanks to this, automated pipelines can create release notes based on developer commits and the type of change that is set.

I have an open source tool called release-wizard which does exactly this task.
## Project Structure

The code is structured in three folders:

- `app`: holds all application code
- `cmd`: defines different entrypoints for the application (in case we would like to run it as a server, or as a script, etc.)
- `docs`: has autogenerated Swagger code and documentation

In addition, there are a few relevant files:

- `.air.toml`: configures HMR
- `.goverreport.yml`: defines test coverage settings
- `docker-compose.yaml`: sets up the `server` and `redis` docker containers and their network
- `Dockerfile`: contains commands to assemble the `server` docker image
- `Makefile`: sets useful rules for running the application, testing, etc.
- `shortesturl.http`: allows testing the endpoints with VisualStudioCode/humao.rest-client. The application should be running first.
### `app` folder

The `app` folder defines the `app` package, which exposes an abstraction that can be configured in different entrypoints.

In addition, this folder defines a series of internal packages (not browsable outside the `app` package scope):

- `cache`: fast store abstraction with basic `Get` and `SetIfNotExists` commands. It implements `redis` under the hood. If the environment is `dev`, `miniredis` will be loaded instead (eliminating the need to have `redis` as a dependency of the project)
- `config`: the configuration auto loader. It implements `viper` under the hood.
- `http`: http abstraction that conforms to Go's `http.Handler`. It implements `chi` under the hood.
- `logging`: logging abstraction that implements `zap` under the hood.
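To make the `cache` abstraction concrete, here is a minimal sketch of the shape such a package could have. The interface methods match the commands named above, but the exact signatures are assumptions for illustration, and the in-memory implementation is a toy stand-in for redis/miniredis.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Cache mirrors the abstraction described above: Get plus SetIfNotExists.
// Signatures are guessed for this sketch; the real package backs them
// with redis (or miniredis in dev).
type Cache interface {
	Get(key string) (string, error)
	SetIfNotExists(key, value string) (bool, error)
}

var ErrNotFound = errors.New("key not found")

// memoryCache is a toy in-memory implementation of the interface.
type memoryCache struct {
	mu   sync.Mutex
	data map[string]string
}

func newMemoryCache() *memoryCache {
	return &memoryCache{data: map[string]string{}}
}

func (c *memoryCache) Get(key string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.data[key]
	if !ok {
		return "", ErrNotFound
	}
	return v, nil
}

func (c *memoryCache) SetIfNotExists(key, value string) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.data[key]; ok {
		return false, nil // key already present, nothing written
	}
	c.data[key] = value
	return true, nil
}

func main() {
	var c Cache = newMemoryCache()
	ok, _ := c.SetIfNotExists("64fc5e", "https://github.com/darioblanco")
	fmt.Println(ok) // true: the key was not there yet
	ok, _ = c.SetIfNotExists("64fc5e", "https://example.com")
	fmt.Println(ok) // false: the key already exists
}
```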
### `cmd` folder

The `cmd` folder holds the default configuration stored as a YAML file. This configuration is always used when environment variables do not override it. See configuration for more details.

In addition, every subfolder defines a different entrypoint to the application, with a `main.go` file inside. At this moment, we only need a `server` entrypoint.
## Tests

To run the tests (with their respective coverage), type `make test`.

This project aims for 98% test coverage of all files, with the exception of `cmd/server/main.go` and `app/app.go`. Those files are considered entry points of the application, thus they are tested with `make run`.
In addition, coverage for `app/internal/cache/cache.go` is not properly shown due to Go limitations: technically the `app/internal/http/api_test.go` file is using the optimistic lock and could eventually trigger its retry mechanism. Therefore, a threshold of 98% test coverage has been set instead of 100%. It would be possible to test the retry and error branch in the `SetIfNotExists` function, but that would require creating a mock over the cache's underlying client instead of using miniredis. I decided not to spend extra time on that mock wrapper, as the benefit for this task is small and I consider the code production ready. However, if the project were to grow, I would strongly consider reaching 100% test coverage there with unit tests.
All files ending in `_test.go` perform unit tests, except `api_test.go`, which follows an integration test approach for the `/encode` and `/decode` endpoints.

Files ending in `_mock.go` are meant to expose mocks and stubs to different packages that might need them. Therefore, these mocks are defined within the package of the real implementation, even if that package does not use them. Mockery is an interesting tool to automatically generate such mocks, though this project does not implement it as it is not complex enough. The mock files are automatically ignored in the coverage report.
### Manual tests

You can manually check the endpoints using the `shortesturl.http` file and VisualStudioCode/humao.rest-client.

I consider strong automated testing the way to go in order to have a maintainable codebase (especially when developing collaboratively), but manual tests are also helpful during development and for demo purposes.
## Benchmarks

In addition, I have created a few basic benchmarks that help analyze the speed of the encode and decode functions. These are set in `app/internal/http/benchmarks_test.go`.

These are not tests! Therefore, no asserts are done in this file. However, they are very helpful to analyze the impact of possible improvements in the algorithms.
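The real benchmarks live in `app/internal/http/benchmarks_test.go` and run via `go test -bench`. As a self-contained illustration of the same technique, the stdlib `testing.Benchmark` function can run a benchmark body outside the test harness; the `encode` function below is a stand-in, not the project's implementation.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"testing"
)

// encode is a stand-in for the project's encode function, used only to
// give the benchmark something representative to measure.
func encode(url string) string {
	sum := md5.Sum([]byte(url))
	return hex.EncodeToString(sum[:])[:6]
}

func main() {
	// testing.Benchmark runs a Benchmark-style function and reports
	// iterations and ns/op; the numbers vary per machine.
	result := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			encode("https://github.com/darioblanco")
		}
	})
	fmt.Println(result)
}
```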
## Algorithm

The encode/decode algorithm has no restrictions about how it should work, beyond the following requirements picked from the README:

- The encoded url MUST be decodable into its original form
- The short urls MUST be kept in memory. This one actually tries to reduce the overhead of the challenge, thus a persistence solution is not required.

In order to match the requirements, we need a bijective function, which for a pairing between `L` (a set of long urls `l`) and `S` (a set of short urls `s`) holds the following properties:

- Each element of `L` must be paired with one (and only one) element of `S`: `f(l1) = s1` and `f(l1) != f(l2)` if `l1 != l2`
- Each element of `S` must be paired with one (and only one) element of `L`: `g(s1) = l1` and `g(s1) != g(s2)` if `s1 != s2`
## Persistence

I have selected Redis for production for several reasons:

- The data structure required is very simple; a key/value store is enough
- It is easy to set keys to expire (e.g. if expiring urls is required in the future)
- I am very familiar with this store
- It has high availability options and can be configured to persist to disk if necessary, while retaining in-memory speed
- It gives close to O(1) complexity for search, insert and delete at any scale*

*We could think that an O(n) worst case would be terrible if Redis had millions of entries, but Redis uses separate chaining hash tables, splitting the n items into k buckets that grow with the number of entries, giving O(1 + n/k). In practice Redis keeps n/k low, because it performs the rehashing in the background. Therefore, a solution like Redis could be used for our url shortener at scale.
However, the task explicitly states: "You do not need to persist short URLs to a database. Keep them in memory."

I thought about creating a very simple hash table by myself, which, in order to work at scale, would have to be split into smaller hash tables partitioned by keyspace. In addition, I strongly considered this other requirement: "Please organize, design, test and document your code as if it were going into production".

I would not recommend going to production with a custom hash table, because there are a lot of implications, like race conditions, that would have to be solved. We would basically need to protect the data structure from other goroutines whenever we check if the value is there and then insert it. Redis does this very easily with the `SetNX` operation, for instance.
In order to keep this task loyal to what has been asked, I have determined the following approach:

- If the environment is `dev`, miniredis is used in the background, which is just a simple, cheap, in-memory redis replacement. Therefore, the urls are kept in memory, and this greatly simplifies the amount of code I need to write.
- If the environment is not `dev`, the application reads the redis connection parameters and uses an external instance. This can also be tested locally, but it requires `docker`.
- I have developed a `cache` package that transparently uses `miniredis` or a real `redis` based on the given environment variable.

With this decision I hope to satisfy the requirement of keeping the urls in memory (without a persistence dependency in development mode), while defining a codebase that can be easily deployed to production with a more resilient setup.
## Thought process

### First approach: auto incremented counter

My first approach towards a URL shortener was basically to store the urls as they come and keep a counter that is incremented for each url that is shortened. Technically, that would already be a URL shortener, because we could use the database id like http://localhost:3000/200, but it makes it obvious that somebody could iterate through those urls and consume `/decode` to figure out which long url was behind each one, which is not what we want.

Transforming the database id with `base62` (for instance) would make the problem less apparent, but url iteration would still be possible if somebody decided to encode the ids with that format and use `/decode`.

Basically, any solution relying only on the auto increment value set up in the database is a no-go for me, because it would be quite simple to browse urls shortened by others. I want to give a solution that makes such url iteration reasonably difficult, so nobody would bother trying when there is so little to gain from it (in the end they would not know which user a url belongs to).
### The MD5 strategy

Therefore, I decided to use an MD5 hash because it is fast (for a hash) and seems like a good fit for this use case, as it should be enough to discourage url iteration. Also, I am not worried about MD5 collisions: if we use the rule of thumb that we will see our first collision after hashing 2^(n/2) items, and MD5 creates strings of 128 bits, we could technically have 2^64 "collision-free" urls, and we will never store so many entries in our database, as I would set up an expiration before then in order to reduce that load.
However, we want to create a url shortener, not a url longener. If I output the MD5 directly, a website like https://darioblanco.com would become http://localhost:3000/72b746fa3f83d4d932516cb27313c0eb, which is terrible. Therefore, I am picking the first 6 digits of such MD5, creating something like http://localhost:3000/72b746. If we deploy this to production, for instance to the shy.to domain, then we will have https://shy.to/72b746, which saves 2 characters from a root domain url that is usually not one we would shorten anyway.
Picking the first 6 digits of an MD5 greatly increases the probability of collisions. As MD5 yields hexadecimal digits (`[0-9a-f]`), each hex character is a nibble (4 bits) and the strings have 128 bits, so 128/4 gives 32 characters. If we use only 6 characters, then we have 24 bits, and 2^12 = 4096 "collision-free" urls, so collisions might happen often as soon as our service gets loaded with data!
In order to solve the collision problem, I use a simple in-memory key/value store (Redis), abstracted in the `cache` package. If the first 6 digits of the MD5 are already in the store, the encoder shifts along the MD5 and keeps trying until it finds a slug that can be stored. I am not worried about the complexity of looking up the MD5, because Redis claims close to O(1) complexity for that operation, and collisions can be greatly reduced by increasing the characters of the url or by aggressively expiring the urls stored in the system.

In addition, the advantage of the "short MD5" approach is that we won't have duplicated urls in our database, which is a problem that arises with some auto increment strategies. Thus I am fine committing to the trade-off of looking up an MD5 at least once for each url encode in an in-memory database, in order to protect from automatic url iteration, while having no duplicates in the store as a side effect.
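The shift-and-retry idea can be sketched as follows. A plain map stands in for Redis, and the exact shifting scheme (one character at a time over the 32-hex-character MD5) is an assumption for illustration; the real code may shift differently.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// encode stores url under the first free 6-character window of its MD5.
// A map stands in for Redis here; the project's shifting scheme may differ.
func encode(store map[string]string, url string) (string, bool) {
	sum := md5.Sum([]byte(url))
	full := hex.EncodeToString(sum[:]) // 32 hex characters
	for i := 0; i+6 <= len(full); i++ {
		slug := full[i : i+6]
		existing, taken := store[slug]
		if !taken {
			store[slug] = url
			return slug, true
		}
		if existing == url {
			return slug, true // idempotent: same url keeps its slug
		}
		// Collision with a different url: shift one character and retry.
	}
	return "", false // every window taken by other urls: extremely unlikely
}

func main() {
	store := map[string]string{}
	slug, _ := encode(store, "https://darioblanco.com")
	fmt.Println(slug)

	// Simulate a collision: pre-fill the slug with a different url.
	collided := map[string]string{slug: "https://example.com"}
	shifted, _ := encode(collided, "https://darioblanco.com")
	fmt.Println(shifted != slug) // true: the collision forced a shift
}
```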
The url decode is then trivial: the encoded url is stored as a key, and its value is the original url. Therefore the `/decode` endpoint has close to O(1) complexity.
### The race condition

When encoding URLs, there could be a race condition if two operations happen over the same URL, because there is a check to verify whether the shortened md5 for that url was already generated. That one is simple to solve, because in the worst case the set could be idempotent.

However, what about a race condition between two completely different URLs that resolve to the same 6-digit shortened slug? This is where things can get problematic, because I would not achieve the bijective requirement. When the system already has a slug for a URL, and a second URL resolves to the same slug, by design I shift that slug (what I call a "collision"). As there is a `GET` operation to verify whether the candidate slug holds the same url (or a different one), the `SET` operation that follows should only be performed if that condition has not changed.
Initially, I decided to put a mutex around the `GET` and `SET` operations, which fixed the problem, but it could impact performance under heavy loads and it would not fix the problem in a multiple replica scenario.

Finally, I decided to implement a Redis transaction in conjunction with the `WATCH` command, following the optimistic locking approach: conflicts will be rare, it does not lock the system (as a pessimistic lock like a mutex would put a burden on performance), and we expect a single retry to be enough to solve the conflict. The transaction is aborted if the watched key changed (as the assumption we had when doing `GET` no longer holds). With optimistic locking we just retry the redis operation, which on the second attempt would retrieve the shortened url in `GET`, and the race condition would never happen again for that url. The maximum number of attempts is set to 5 (4 retries).
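The shape of that retry loop can be sketched without a Redis client. Here a hypothetical `txn` callback stands in for the `WATCH`-guarded transaction, and `errTxFailed` stands in for the "watched key changed" error a Redis client returns when the transaction is aborted; all names are made up for the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// errTxFailed stands in for the error a Redis client returns when a
// WATCH-guarded transaction is aborted because the watched key changed.
var errTxFailed = errors.New("transaction aborted: watched key changed")

const maxAttempts = 5 // as in the text: 5 attempts, i.e. 4 retries

// withOptimisticLock retries txn until it succeeds, fails with a
// non-conflict error, or exhausts the attempts. Hypothetical helper
// mirroring the loop shape described above.
func withOptimisticLock(txn func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = txn()
		if !errors.Is(err, errTxFailed) {
			return err // success (nil) or a non-conflict error
		}
		// Conflict: another writer touched the watched key; retry.
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// Simulate a transaction that conflicts once and then succeeds,
	// the expected common case under optimistic locking.
	conflicts := 1
	err := withOptimisticLock(func() error {
		if conflicts > 0 {
			conflicts--
			return errTxFailed
		}
		return nil
	})
	fmt.Println(err) // <nil>
}
```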
## Improvements

Possible improvements to this project:

- A very useful endpoint would be `GET /{shortUrlSlug}`, which would automatically perform a redirect to the long url. Therefore, when somebody types https://{shortesturl-hostname}/64fc5e, it would automatically redirect to https://github.com/darioblanco. I have not developed it because it was not in the scope of the task.
- For a production deploy, it is recommended to have a reverse proxy in front that does SSL termination. To handle the load, I would use Kubernetes with horizontal pod autoscaling, which would automatically create replicas of this application based on CPU and memory usage. As we use Redis and a transaction to solve race conditions from its tooling, this code can already scale with different replicas.
- Instead of using Redis as final persistence, it could be used as a cache, with an expiration for the keys that can be refreshed based on cache hits, or an LRU approach. The final persistence could be done in Cassandra or MongoDB, using the `shortUrlSlug` as a partition key (or sharding key) to distribute the storage and increase search performance. These databases could give us different tools to handle the above mentioned race condition, even with multiple replicas. However, Redis can scale very well without introducing more complex persistence layers; it can be HA and it can also persist data to disk, thus I would be very careful before adopting a different database.