Gitaly
What
Gitaly is a daemon that handles all the git calls made by GitLab.
To see where it fits in, please look at GitLab's architecture.
References
Name
Gitaly is a tribute to git and the town of Aly. Just as the town of
Aly has zero inhabitants most of the year, we would like to reduce the number of
disk operations to zero for most actions. It doesn't hurt that it sounds like
Italy, the capital of which is the destination of all roads. All git actions in
GitLab end up in Gitaly.
Reason
On GitLab.com, git access is slow.
When looking at Rugged::Repository.new
performance data, we can see that our P99 spikes up to 30 wall seconds while the CPU time stays around 15 milliseconds, which points at filesystem access as the culprit.
Our P99 access time just to create a Rugged::Repository object, which involves loading and processing the git objects from disk, spikes over 30 seconds, making it basically unusable. We also saw that just walking through the branches of gitlab-ce requires 2.4 wall seconds.
We considered moving to bare metal to fix our problems with higher-performing hardware. But our users are using GitLab in the cloud, so it should work great there. And this way the increased performance will benefit every GitLab user.
Gitaly will make our situation better in a few steps:
- One central place to monitor operations
- Performance improvements by doing less and caching more
- Move the git operations from the app to the file/git server with git rpc (routing git access over JSON HTTP calls)
- Use Git Ketch to allow active-active (push to a local server) and distributed read operations (read from a secondary).
Decisions
All design decisions should be added here.
- Why are we considering Git Ketch? It is open source, uses the git protocol itself, is made by experts in distributed systems (Google), and is as simple as we can think of. We have to accept that we'll have to run the JVM on the Git servers.
- We'll keep using the existing sharding functionality in GitLab to be able to add new servers. Currently we can use it to have multiple file/git servers. Later we will need multiple Git Ketch clusters.
- We need to get rid of NFS mounting at some point because one broken NFS server causes all the application servers to fail to the point where you can't even ssh in.
- We want to move the git executable as close to the disk as possible to reduce latency, hence the need for git rpc to talk between the app server and git.
- Cached metadata is stored in Redis LRU
- Cached payloads are stored in files since Redis can't store large objects
- Why not use GitLab Git? So workhorse and ssh access can use the same system. We need this to manage cache invalidation.
- Why not make this a library for most users instead of a daemon/server? If it is a library it is hard to do in memory caching, and we will still need to keep the NFS shares mounted in the application hosts.
- Can we focus on instrumenting first before building Gitaly? Prometheus doesn't work with Unicorn.
- How do we ship this quickly without affecting users? Behind a feature flag like we did with workhorse. We can update it independently in production.
- How much memory will this use? Our guess is 50MB. We expect to save memory in the Rails app and more in Sidekiq (possibly GBs, but we are not sure), but initially usage will be higher because more libraries are still loaded everywhere.
- What will we use for git rpc? JSON over HTTP initially to keep it simple. If measurements show it isn't fast enough, we can switch to a binary protocol. But binary protocols slow down iteration and debugging.
- What packaging tool do we use? Govendor because we like it more.
- How will the networking work? A unix socket for git operations and TCP for monitoring. This prevents having to build out authentication at this early stage. https://gitlab.com/gitlab-org/gitaly/issues/16
- We'll include the /vendor directory in source control https://gitlab.com/gitlab-org/gitaly/issues/18
- Use gitaly-client or HTTP/websocket clients? gitlab-shell copies the SSH stream, both ways, to Gitaly over a websocket, and workhorse just forwards the request to Gitaly, so let's use HTTP. https://gitlab.com/gitlab-org/gitaly/issues/5#note_20294280
- We will use E3 from BitBucket to measure performance closely in isolation.
- Use environment variables for setting configurations (see #20).
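The networking and configuration decisions above could be combined roughly as follows: settings come from environment variables, and git operations are served over HTTP on a unix socket. The variable names (`GITALY_SOCKET_PATH`, `GITALY_MONITORING_ADDR`), the defaults, and the `/ping` endpoint are assumptions for this sketch. The TCP monitoring listener is only represented in the config and is not started.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"path/filepath"
)

// Config holds the two listeners described above. All names are hypothetical.
type Config struct {
	SocketPath     string // unix socket for git operations
	MonitoringAddr string // TCP address for monitoring
}

// configFromEnv reads settings through getenv (os.Getenv in production),
// falling back to placeholder defaults when a variable is unset.
func configFromEnv(getenv func(string) string) Config {
	cfg := Config{SocketPath: "/tmp/gitaly.sock", MonitoringAddr: "localhost:9236"}
	if v := getenv("GITALY_SOCKET_PATH"); v != "" {
		cfg.SocketPath = v
	}
	if v := getenv("GITALY_MONITORING_ADDR"); v != "" {
		cfg.MonitoringAddr = v
	}
	return cfg
}

func main() {
	// Use a throwaway directory so the sketch does not leave a socket behind.
	dir, err := os.MkdirTemp("", "gitaly-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	os.Setenv("GITALY_SOCKET_PATH", filepath.Join(dir, "gitaly.sock"))
	cfg := configFromEnv(os.Getenv)

	// Only processes with filesystem access to the socket can connect, which
	// is why no authentication is built at this stage.
	ln, err := net.Listen("unix", cfg.SocketPath)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	mux := http.NewServeMux()
	mux.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "pong")
	})
	go http.Serve(ln, mux)

	// The client dials the unix socket; the host in the URL is ignored.
	client := &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", cfg.SocketPath)
		},
	}}
	resp, err := client.Get("http://gitaly/ping")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```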
Iterate
Instead of moving everything to Gitaly and only then optimizing performance, we'll iterate so we quickly have results.
The iteration process is as follows:
- Move a specific set of functions without modification
- Measure their performance
- Try to improve the performance by reducing reads and/or caching
- Measure the effect
Some examples of a specific set of functions:
Plan
We use our issues board for keeping our work in progress up to date in a single place. Please refer to it to see the current status of the project.
- Absorb gitlab_git
- Milestone 0.1
- Move more functions in accordance with the iterate process, starting with the ones that have the highest impact.
- Change the connection on the workers from a unix socket to an actual TCP socket to reach Gitaly
- Build Gitaly fleet that will have the NFS mount points and will run Gitaly
- Move to GitRPC model where GitLab is not accessing git directly but through Gitaly
- Remove the git NFS mount points from the worker fleet
- Remove gitlab_git from GitLab Rails
- Move to active-active with Git Ketch. With this we can read from any node, greatly reducing the number of IOPS on the leader.
- Move to the most performant and cost-effective cloud