README
What is this?
In short
Be the GoDoc.org of k8s configuration files.
More explicitly
Support k8s document indexing from open-source configurations in order to make it easy for people to learn to use a new feature, explore k8s configs in a central hub, and see some metrics about kustomize use.
We want to support three main classes of queries:
- Structured document queries: how should I use the following fields?
  - Grace periods: spec:template:spec:terminationGracePeriod
  - Kustomize inline patch: patches:patch
- Key-value queries: how should I use this more specific use case of a structured configuration?
  - HorizontalPodAutoScalers: kind=HorizontalPodAutoScaler
  - Patches on StatefulSets: patches:target:kind=StatefulSet
- Full text search: search the comments and the document text from any type of k8s config file.
Road map
There is a lot that can be added to improve the state of this application. More details, along with general thoughts and comments, can be found in the Roadmap.md file in this directory. This README covers only the parts of the project that are mostly complete and ready to be iterated on.
Running this project
Everything is configured using kubernetes, so it should be easy for people to spin this up on any k8s cluster. Everything should just work (TM).
The config files live in the config directory.
config
├── base
│ └── kustomization.yaml
├── crawler
│ ├── base
│ │ ├── github_api_secret.txt
│ │ └── kustomization.yaml
│ ├── cronjob
│ │ ├── cronjob.yaml
│ │ └── kustomization.yaml
│ └── job
│ ├── job.yaml
│ └── kustomization.yaml
├── elastic
│ └── ...
├── redis
│ ├── document_keystore
│ │ ├── kustomization.yaml
│ │ ├── redis.yaml
│ │ └── service.yaml
│ └── http_cache
│ ├── kustomization.yaml
│ ├── redis.yaml
│ └── service.yaml
├── webapp
│ ├── backend
│ │ ├── deployment.yaml
│ │ ├── kustomization.yaml
│ │ └── service.yaml
│ └── frontend
│ ├── deployment.yaml
│ ├── kustomization.yaml
│ └── service.yaml
└── schema_files
└── kustomization_index
├── es_index_mappings.json
└── es_index_settings.json
To get everything up and running you have to:
- Get some instance of Elasticsearch working... and configure the configmapGenerator in config/base to point to the right endpoint(s). The configurations that need this value to be populated are the following:
  - config/crawler/cronjob to run periodic crawls.
  - config/crawler/job to run crawls on demand.
  - config/webapp/backend to run the search server.
- Configure the Elasticsearch indices:
  kustomize build config/schema_files/kustomization_index | kubectl apply -f -
  This will run a curl command that reads JSON data from a ConfigMap and sets up the schema. If you want to make more complex modifications to the schema, refer to the Elastic docs to figure out whether the mapping can be added to the current index, or whether you will need to copy the existing index into a different one with the appropriate mappings. Modifications can be made by writing a simple program with the Elasticsearch Go library, or with any HTTP request to the appropriate server endpoint from within the cluster (see the sketch after this list). Unfortunately I did not have the time to write helper tools for this. Feel free to contact me if you need help modifying Elasticsearch configs; I'm by no means an expert, but I can try to help.
- (Optional) Run the Redis HTTP cache for the crawler:
  kubectl apply -k config/redis/http_cache
  This will create a deployment for the cache and a service. The crawler should be configured to connect to the http_cache if it exists, but you can always check the logs to make sure it connects and that the identifiers match between the crawler configuration and the service endpoint. Please be aware that the cache does not have a persistent volume.
- Configure the main redis instance:
kubectl apply -k config/redis/document_keystore
This will create a StatefulSet with a volume of 4GiB for a redis instance.
- Get an access token from GitHub.
To be able to kindly ask GitHub for its data on k8s config files, you'll need to create an access token. From my understanding, this is the only way to do these code search queries (without first specifying a repository).
To generate a token, go to your GitHub account's Settings > Developer Settings > Personal access tokens.
From here, generate a new token with only the scopes this application needs.
If you have uses for any other data from this token (org data, or something else), you can pick and choose scopes, but be careful, since it can grant this application access to your notifications, etc. However, any such extension is explicitly a non-goal and would not be maintained by this project.
- Launch the crawler:
kustomize build config/crawler/cronjob | kubectl apply -f -
This will run the crawler every day, according to the cron schedule in the cronjob.yaml file.
Instead, to get the crawler running right away, you can run:
kustomize build config/crawler/job | kubectl apply -f -
which will launch a non-periodic version of the crawler. It will take a few minutes for the crawler to shard the search space, but config files should start to get populated within 20 minutes. The first crawl may take a while, since it has to fetch rate-limited endpoints for each new file it finds; subsequent updates should be significantly faster.
- Launch the search backend
kustomize build config/webapp/backend | kubectl apply -f -
- Launch the search frontend
kustomize build config/webapp/frontend | kubectl apply -f -
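For reference, a schema tweak over HTTP boils down to a single request from within the cluster. The sketch below is only an illustration, not a tool from this repo: the in-cluster URL, the index name kustomization, and the helmCharts field are assumptions, and whether a field can be added in place depends on your existing mappings, as noted in the index-configuration step above.

// put_mapping.go: add a new field mapping to an existing index over HTTP.
// A minimal sketch: the service URL, index name, and field are assumptions.
package main

import (
    "fmt"
    "log"
    "net/http"
    "strings"
)

func main() {
    // Adjust to whatever endpoint your Elasticsearch service exposes in-cluster.
    url := "http://elasticsearch:9200/kustomization/_mapping"
    body := `{"properties": {"helmCharts": {"type": "keyword"}}}`

    req, err := http.NewRequest(http.MethodPut, url, strings.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}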
Notes about the components
Elasticsearch
I will add a basic working setup soon; for now I just did the lazy thing and used an already packaged solution. Most clouds provide their own managed Elastic environments; however, Elastic is also working on their own Kubernetes operator, which might be worth checking out. Please note that it comes with its own license agreement.
Redis
There are two Redis instances that are used in this application.
One of them is configured with on-disk persistence, so make sure your kubernetes cluster can provide a persistent volume for it. Also note that it runs as a single master node (i.e. it does not automatically shard keys across multiple head nodes as part of a highly available cluster). Since it's storing a sparse graph, I can't imagine this being much of an issue, but it's worth mentioning.
The other Redis instance runs as an HTTP (RFC 7234) cache for ETags from GitHub (or any other document store we could crawl/index). This one does not require persistent storage on disk. The caching strategy is LRU, which is probably a good starting point. It might be worth investigating other cache policies, but I think LRU will work well: documents may or may not expire on their own anyway, and the amount of memory allocated for keys is fairly large, so eviction of frequently used documents seems unlikely.
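To illustrate what the HTTP cache buys us, here is a minimal sketch of ETag-based revalidation against GitHub. It uses an in-memory map in place of the Redis http_cache, and the function names are illustrative rather than the crawler's actual API.

// etag_cache.go: sketch of ETag revalidation; an in-memory map stands in
// for the Redis http_cache.
package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

type cacheEntry struct {
    etag string
    body []byte
}

var cache = map[string]cacheEntry{} // keyed by URL

// fetch returns the body for url, revalidating a cached copy when possible.
func fetch(url string) ([]byte, error) {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    if entry, ok := cache[url]; ok {
        // Ask the server to answer 304 Not Modified if our copy is still fresh.
        req.Header.Set("If-None-Match", entry.etag)
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    if resp.StatusCode == http.StatusNotModified {
        return cache[url].body, nil // reuse the cached body
    }

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    if etag := resp.Header.Get("ETag"); etag != "" {
        cache[url] = cacheEntry{etag: etag, body: body}
    }
    return body, nil
}

func main() {
    body, err := fetch("https://api.github.com/rate_limit")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("fetched %d bytes\n", len(body))
}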
Nginx + Angular
There is a Dockerfile included for generating the container image with Nginx (using the default package) and adding all of the compiled Angular files. Any modifications to the code base should be compatible with this setup, so all that's needed is to rebuild the container image and possibly update the image tags in the k8s file.
Supporting Go binaries
There are a few Go binaries, namely the crawler and the search service, each with its own Dockerfile for building the container it runs in on k8s. Their configurations are not optimal (read: need to be cleaned up), but they are functional.
Technical details
Overall design and implementations
There are a few components that are all running together in order to get the overall application to work smoothly. This section will provide a brief overview of each component with the following sections going into more details.
The overall structure is outlined in the following figure:
Crawler
The leftmost component is a crawler, backed by an HTTP cache of GitHub queries, that does two things. First, it looks at the list of documents already in Elasticsearch and tries to update them; in doing so, it maintains a set of newly updated files so they can be excluded from other parts of the crawl.
Second, to find newly added documents, the crawler crawls any new dependencies introduced in the document-updating step and also queries GitHub for the most recently indexed kustomization.* files. Each new file is processed for efficient text queries and put into the document index, and any new dependency incurs further crawl operations. Finally, a graph representation of the documents and their dependencies is built in Redis to be used for graph algorithms such as PageRank and component analysis.
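The crawler package (see Directories below) defines an interface for crawlers that forward retrieved files to a channel for indexing. The real interface lives in that package; the sketch below only guesses at its general shape, and the Document fields and method names are assumptions for illustration.

// A sketch of what a source-repository crawler interface might look like.
// Names and signatures here are assumptions, not the crawler package's API.
package main

import (
    "context"
    "fmt"
)

// Document is a single retrieved config file plus the metadata the index needs.
type Document struct {
    RepositoryURL string
    FilePath      string
    Contents      string
}

// Crawler retrieves documents from some source and forwards them to a channel.
type Crawler interface {
    // Crawl sends every discovered document to out until the source is
    // exhausted or ctx is cancelled.
    Crawl(ctx context.Context, out chan<- Document) error
}

// index drains the channel; a real implementation would hand each document to Elasticsearch.
func index(in <-chan Document) {
    for doc := range in {
        fmt.Printf("indexing %s/%s (%d bytes)\n",
            doc.RepositoryURL, doc.FilePath, len(doc.Contents))
    }
}

func main() {
    out := make(chan Document, 1)
    // A concrete implementation (e.g. the GitHub crawler) would be launched
    // here with Crawl(ctx, out); we just feed one fake document.
    out <- Document{
        RepositoryURL: "https://github.com/example/repo",
        FilePath:      "kustomization.yaml",
        Contents:      "resources: []",
    }
    close(out)
    index(out)
}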
Data library
There are a few helper libraries for dealing with Elasticsearch, Redis, and documents. These are not a persistent or centralized data layer; they act as small components that package common pieces of code. Eventually it may make sense to merge them together into a proper persistent model with an external API for document insertion/deletion, but that is definitely out of scope for getting this to run. The current model does have limitations in terms of minimizing the API surface exposed to the different components of the application. For now this is mostly mitigated by having the query server connect only to a data node of the Elasticsearch cluster, but knowing what is accessible and what isn't is left to the programmer instead of being clearly and explicitly enforced by the API.
Server
The server uses the data library to communicate with the data store and answer queries. It processes user-entered text queries into somewhat optimized Elasticsearch queries, and provides a few endpoints to get different metrics and, eventually, to allow registration of remote repositories.
The server is exposed through a service so that users of the application can submit queries and receive results.
Nginx + Angular
Communicates directly with the backend server to forward user queries and their results. Presents the results on an interface. It's still pretty simple looking but it seems usable (to me).
Crawling GitHub
With the use of API keys, GitHub allows account owners to search for files using their API.
The search endpoints allow for fairly useful/powerful metadata search. For instance, they provide a filename: keyword that lets us look for kustomization.yaml, kustomization.yml, etc. This enables fetching a list of kustomization documents, whose actual contents we can then get from another endpoint (raw.githubusercontent.com).
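As an illustration of what such a query looks like at the HTTP level (not the crawler's actual code; the query string, headers, and result handling are assumptions), a code-search call can be made with nothing but the standard library and the access token from the setup steps:

// search_sketch.go: a minimal GitHub code-search request for kustomization files.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "net/url"
    "os"
)

func main() {
    q := url.QueryEscape("filename:kustomization.yaml")
    req, err := http.NewRequest(http.MethodGet,
        "https://api.github.com/search/code?per_page=100&q="+q, nil)
    if err != nil {
        log.Fatal(err)
    }
    // Code search requires authentication; see the access-token step above.
    req.Header.Set("Authorization", "token "+os.Getenv("GITHUB_TOKEN"))
    req.Header.Set("Accept", "application/vnd.github+json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var result struct {
        TotalCount int `json:"total_count"`
        Items      []struct {
            Path       string `json:"path"`
            Repository struct {
                FullName string `json:"full_name"`
            } `json:"repository"`
        } `json:"items"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        log.Fatal(err)
    }
    fmt.Println("total matches:", result.TotalCount)
    for _, item := range result.Items {
        // The raw file contents can then be fetched from raw.githubusercontent.com.
        fmt.Println(item.Repository.FullName, item.Path)
    }
}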
However, the search API is fairly limited. There is a restriction on the number of documents that can be retrieved with this method. One possible way to mitigate this would be to periodically query GitHub for results sorted by last indexed time, which would let you collect most documents from that point forward. The downside is that it may require a large number of requests to their API, since you cannot know when new files will be added. Furthermore, depending on the velocity of growth, you might still not be able to get all of the files.
The approach taken to mitigate this is to use the filesize: keyword and to shard the search space into contiguous buckets of appropriate size in order to get all of the documents. This is fairly efficient, since you can find a good enough sharding of the documents in about lg(max file size) * number of documents / 1000 API queries. Moreover, since queries are paginated with at most 100 results per query, this solution is competitive with computing the optimal (non-contiguous) sharding of result sets.
Furthermore, filesize queries can be cached to minimize the total number of queries made to the API while sharding the search space. This is done by querying file size intervals that always start at 0 (i.e. 0..X) and binary searching over the filesize: space. This allows a lot of queries to be reused when looking for the next range, since it is upper and lower bounded within a range that has already been queried, leaving only a small number of new queries. I think this only works out because file sizes are power-law distributed, so searches typically require fewer queries as they progress from left to right.
However, this method in no way depends on intervals of the form 0..X, as the number of documents in the many intervals of the range search could be added together to make this work as well. The 0..X approach just seemed simpler to implement, maintain, and debug, so it was preferred.
To get an idea of how efficient this method is: sharding a search space of 7000 documents takes only ~90 API range queries (roughly 7 shards of 1000 documents times lg of the max file size, per the estimate above), which should only take a few minutes, while actually fetching the documents and their relevant metadata (creation time, etc.) takes several hours. Furthermore, this could be made more efficient if a prior distribution were approximated. This prior could be scaled to the number of documents that need to be fetched, and finding a shard with an adequate number of results would then only take a few queries per shard. It could probably be supported in a constant number of size queries if the size of each shard is halved, which shouldn't have a terrible performance impact on retrieval. However, there were more pressing things to implement. I might revisit this later.
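Below is a hedged sketch of the sharding idea: binary searching over 0..X prefix intervals so that each contiguous bucket stays under the result cap. The countUpTo function fakes the API; a real implementation would issue the filesize:0..X search query and cache the returned total_count, which is what makes the query reuse described above possible.

// shard_sketch.go: shard the filesize: search space into contiguous buckets
// of at most resultCap documents each, using binary search over 0..X prefixes.
package main

import "fmt"

const (
    resultCap   = 1000    // code search returns at most ~1000 results per query
    maxFileSize = 1 << 20 // assume no kustomization file is larger than 1 MiB
)

// countUpTo fakes the number of documents with filesize <= x. A real crawler
// would query the API here (filesize:0..x) and cache the result by x.
func countUpTo(x int) int {
    n := x / 3 // purely illustrative distribution
    if n > 21000 {
        n = 21000
    }
    return n
}

// shards returns contiguous [lo, hi] filesize ranges, each matching at most
// resultCap documents (a single overflowing size becomes its own bucket).
func shards() [][2]int {
    var ranges [][2]int
    prevCount := 0 // number of documents with filesize < lo
    lo := 0
    for countUpTo(maxFileSize) > prevCount {
        // Binary search for the largest hi with at most resultCap documents
        // in [lo, hi]. All probes are 0..X prefixes, so they can be cached
        // and reused across shards.
        left, right := lo, maxFileSize
        for left < right {
            mid := (left + right + 1) / 2
            if countUpTo(mid)-prevCount <= resultCap {
                left = mid
            } else {
                right = mid - 1
            }
        }
        hi := left
        ranges = append(ranges, [2]int{lo, hi})
        prevCount = countUpTo(hi)
        lo = hi + 1
    }
    return ranges
}

func main() {
    for _, r := range shards() {
        fmt.Printf("filesize:%d..%d\n", r[0], r[1])
    }
}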
Document Indexing and Processing
In order to support simple text queries the structured documents must be processed in some way that makes searching them easy. The current method is to recursively traverse the map of configurations to generate each sub-path and each key-value pair for the leaf nodes of the recursion tree.
However, note that this means a document has to be valid YAML/JSON for this indexing to happen. The rest of the document is treated mostly as text and uses the default text settings from Elasticsearch.
What this means is that for the following yaml document:
resources:
- service.yaml
- deployment.yaml
configmapGenerator:
- name: app-configuration
files:
- config.yaml
patchesJson6902:
- target:
version: v1
kind: StatefulSet
name: ss-name
path: ss-patch.yaml
- target:
version: v1
kind: Deployment
name: dep-name
path: dep-patch.yaml
the following flattened structure would look like:
{
"identifiers": [
"resources",
"configmapGenerator",
"configmapGenerator:name",
"configmapGenerator:files",
"patchesJson6902",
"patchesJson6902:target",
"patchesJson6902:target:version",
"patchesJson6902:target:kind",
"patchesJson6902:target:name",
"patchesJson6902:path",
],
"values": [
"resources=service.yaml",
"resources=deployment.yaml",
"configmapGenerator:name=app-configuration",
"configmapGenerator:files=config.yaml",
"patchesJson6902:target:version=v1",
"patchesJson6902:target:kind=StatefulSet",
"patchesJson6902:target:name=ss-name",
"patchesJson6902:path=ss-patch.yaml",
"patchesJson6902:target:kind=Deployment",
"patchesJson6902:target:name=dep-name",
"patchesJson6902:path=dep-patch.yaml",
],
...
}
Note that unique paths and values are deduplicated.
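A rough sketch of this traversal is shown below, operating on a document that has already been unmarshalled into a map (in practice by a YAML/JSON parser). The function names and the use of plain sets are illustrative, not the indexer's actual API.

// flatten_sketch.go: sketch of the recursive traversal that turns a parsed
// config document into "identifiers" (sub-paths) and "values" (leaf
// key=value pairs), deduplicated via sets.
package main

import (
    "fmt"
    "sort"
)

// flatten walks the document rooted at node, recording every sub-path in ids
// and every leaf as "path=value" in vals. List items share their parent path,
// which is why the example output has no list indices.
func flatten(prefix string, node interface{}, ids, vals map[string]bool) {
    switch v := node.(type) {
    case map[string]interface{}:
        for key, child := range v {
            path := key
            if prefix != "" {
                path = prefix + ":" + key
            }
            ids[path] = true
            flatten(path, child, ids, vals)
        }
    case []interface{}:
        for _, child := range v {
            flatten(prefix, child, ids, vals)
        }
    default: // leaf node (string, number, bool, ...)
        vals[fmt.Sprintf("%s=%v", prefix, v)] = true
    }
}

func main() {
    // A fragment of the kustomization example above, already parsed into a map.
    doc := map[string]interface{}{
        "resources": []interface{}{"service.yaml", "deployment.yaml"},
        "patchesJson6902": []interface{}{
            map[string]interface{}{
                "target": map[string]interface{}{"kind": "StatefulSet", "name": "ss-name"},
                "path":   "ss-patch.yaml",
            },
        },
    }

    ids, vals := map[string]bool{}, map[string]bool{}
    flatten("", doc, ids, vals)

    for _, set := range []map[string]bool{ids, vals} {
        keys := make([]string, 0, len(set))
        for k := range set {
            keys = append(keys, k)
        }
        sort.Strings(keys)
        fmt.Println(keys)
    }
}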
On the search side, exact matches are prioritized, but the document paths and key=value pairs are also analyzed with 3-grams to provide some amount of fuzzy search. The reason Levenshtein distance was not used instead is that we search multiple fields at the same time, a use case for which Elasticsearch does not support proper fuzzy matching.
Document Search
Given a text query, each token is considered separately. Each token is fed through a handful of analyzers on the Elasticsearch side and compared against the inverted index of each document field, which then determines the best matching documents. Text ordering is largely insignificant. This makes sense for the structured search, but may leave room for improvement for the full-text search within a document.
Each token must be matched, so each whitespace character acts as a conjunction of individual queries. There are also ways of telling Elasticsearch that some tokens should (rather than must) match, but I think for now it makes more sense to leave it as is.
I think this behavior is sufficient to make the search feel intuitive while still supporting fairly complex use cases.
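To make the conjunction concrete, the sketch below builds the general shape of such a query: one must clause per whitespace-separated token, each matched against the identifiers and values fields from the flattened documents. The exact fields, analyzers, and boosts the server uses may differ; treat this as an assumption-laden illustration.

// query_sketch.go: build an Elasticsearch bool query with one "must" (AND)
// clause per whitespace-separated token of the user's query.
package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// buildQuery turns a user query into an Elasticsearch request body.
func buildQuery(userQuery string) map[string]interface{} {
    var must []interface{}
    for _, token := range strings.Fields(userQuery) {
        // Each token has to match at least one of the analyzed fields.
        must = append(must, map[string]interface{}{
            "multi_match": map[string]interface{}{
                "query":  token,
                "fields": []string{"identifiers", "values"},
            },
        })
    }
    return map[string]interface{}{
        "query": map[string]interface{}{
            "bool": map[string]interface{}{"must": must},
        },
    }
}

func main() {
    body, _ := json.MarshalIndent(buildQuery("patches kind=StatefulSet"), "", "  ")
    fmt.Println(string(body))
}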
Metrics Computation
From each kustomization document that is indexed, we can find its publicly available resources, which include other kustomizations. From this, we can build a directed graph of dependencies and reverse dependencies.
This opens up the possibility to add a plethora of graph metrics that can give the project maintainers feedback and insight into how people are using their tools.
Some of these metrics are useful, such as getting an idea of how large dependency graphs actually grow in practice, and they can be used to find popular kustomizations within the corpus. This lends itself to implementing PageRank to help bubble popular results up as good search results. I unfortunately did not have the time to implement the algorithm, but I do plan to revisit this soon to add a few good, efficient implementations of useful graph algorithms. See Roadmap.md for a more complete list of features that could be added and how I think they could be implemented.
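As a starting point for such metrics, here is a small sketch of PageRank by power iteration over an in-memory dependency graph. The adjacency map stands in for data that would come out of the Redis keystore, and the damping factor and iteration count are conventional defaults rather than project decisions.

// pagerank_sketch.go: plain power-iteration PageRank over a dependency graph.
package main

import "fmt"

// pageRank returns a score per node; deps[a] lists the documents a depends on,
// so score flows from dependents to their dependencies. deps must contain an
// entry for every node, including leaves with no dependencies.
func pageRank(deps map[string][]string, damping float64, iters int) map[string]float64 {
    n := float64(len(deps))
    rank := make(map[string]float64, len(deps))
    for node := range deps {
        rank[node] = 1 / n
    }
    for i := 0; i < iters; i++ {
        next := make(map[string]float64, len(deps))
        for node := range deps {
            next[node] = (1 - damping) / n
        }
        for node, targets := range deps {
            if len(targets) == 0 {
                continue // dangling node: this simple sketch just drops its share
            }
            share := damping * rank[node] / float64(len(targets))
            for _, t := range targets {
                next[t] += share
            }
        }
        rank = next
    }
    return rank
}

func main() {
    // Tiny illustrative graph: three kustomizations depending on a shared base.
    deps := map[string][]string{
        "app-a": {"base"},
        "app-b": {"base"},
        "app-c": {"base", "app-a"},
        "base":  {},
    }
    for node, score := range pageRank(deps, 0.85, 20) {
        fmt.Printf("%-6s %.3f\n", node, score)
    }
}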
Directories
Path | Synopsis
---|---
cmd |
crawler | Package crawler provides helper methods and defines an interface for launching source repository crawlers that retrieve files from a source and forward them to a channel for indexing and retrieval.
github | Package github implements the crawler.Crawler interface, getting data from the Github search API.