crawl

What is this?

In short

Be the GoDoc.org of k8s configuration files.

More explicitly

Support k8s document indexing from open-source configurations in order to make it easy for people to learn to use a new feature, explore k8s configs in a central hub, and see some metrics about kustomize use.

We want to support three main classes of queries:

  1. Structured document queries: how should I use the following fields

    • Grace periods: spec:template:spec:terminationGracePeriod?
    • Kustomize inline patch: patches:patch?
  2. Key-value queries: how should I use a more specific case of a structured configuration?

    • HorizontalPodAutoScalers: kind=HorizontalPodAutoScaler?
    • Patches on StatefulSets: patches:target:kind=StatefulSet?
  3. Full text search: search the comments and the document text from any type of k8s config file.

Road map

There is a lot that can be added to improve the state of this application. More details, along with general thoughts and comments, can be found in the Roadmap.md file in this directory. This README covers only the parts of the project that are mostly complete and ready to iterate on.

Running this project

Everything is configured with Kubernetes, so it should be easy to spin this up on any k8s cluster. Everything should just work (TM).

The config files live in the config directory.

config
├── base
│   └── kustomization.yaml
├── crawler
│   ├── base
│   │   ├── github_api_secret.txt
│   │   └── kustomization.yaml
│   ├── cronjob
│   │   ├── cronjob.yaml
│   │   └── kustomization.yaml
│   └── job
│       ├── job.yaml
│       └── kustomization.yaml
├── elastic
│   └── ...
├── redis
│   ├── document_keystore
│   │   ├── kustomization.yaml
│   │   ├── redis.yaml
│   │   └── service.yaml
│   └── http_cache
│       ├── kustomization.yaml
│       ├── redis.yaml
│       └── service.yaml
├── webapp
│   ├── backend
│   │   ├── deployment.yaml
│   │   ├── kustomization.yaml
│   │   └── service.yaml
│   └── frontend
│       ├── deployment.yaml
│       ├── kustomization.yaml
│       └── service.yaml
└── schema_files
    └── kustomization_index
        ├── es_index_mappings.json
        └── es_index_settings.json

To get everything up and running you have to:

  1. Get some instance of elasticsearch working... and configure the configmapGenerator in config/base to point to the right endpoint(s). The configurations that need this value to be populated are the following:

    • config/crawler/cronjob to run periodic crawls.
    • config/crawler/job to run crawls on demand.
    • config/webapp/backend to run the search server.
  2. Configure the elasticsearch indices:

kustomize build config/schema_files/kustomization_index | kubectl apply -f -

This will run a curl command that reads JSON data from a ConfigMap and sets up the schema. If you want to make more complex modifications to the schema, refer to the Elasticsearch docs to figure out whether the mapping can be added to the current index, or whether you will need to copy the existing index into a different one with the appropriate mappings. Modifications can be made by writing a simple program with the Elasticsearch Go library, or with any HTTP request to the appropriate server endpoint from within the cluster. Unfortunately I did not have the time to write helper tools for this. Feel free to contact me if you need help with modifying Elasticsearch configs; I'm by no means an expert, but I can try to help.
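
If you go the small-program route, a minimal sketch could look like the following. It is not part of this repo: the endpoint, the index name, and the assumption that es_index_mappings.json holds a bare mapping body are placeholders; adapt them to the values in your config/base configmapGenerator.

// Minimal sketch: push an updated mapping to an existing index over plain HTTP.
// Endpoint and index name are assumptions, not values from this repo.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	mapping, err := os.ReadFile("config/schema_files/kustomization_index/es_index_mappings.json")
	if err != nil {
		log.Fatal(err)
	}

	// PUT /<index>/_mapping adds fields to an existing Elasticsearch index.
	req, err := http.NewRequest(http.MethodPut,
		"http://elasticsearch:9200/kustomization/_mapping", bytes.NewReader(mapping))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}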

  3. (Optional) Run the Redis HTTP cache for the crawler:
kubectl apply -k config/redis/http_cache

This will create a deployment for the cache, and a service. The crawler should be configured to connect to the http_cache if it exists, but you can always check the logs to make sure it connects and that the identifiers in the crawler configuration match the service endpoint.

Please be aware that the cache does not have a persistent volume.

  4. Configure the main redis instance:
kubectl apply -k config/redis/document_keystore

This will create a StatefulSet with a volume of 4GiB for a redis instance.

  5. Get an access token from GitHub.

To be able to kindly ask GitHub for its data on k8s config files, you'll need to create an access token. From my understanding, this is the only way to do these code search queries (without first specifying a repository).

To generate a token, go to your GitHub account's Settings > Developer settings > Personal access tokens. It should look like this.

[Image: GitHub personal access tokens settings page]

From here you want to generate a new token and have the following configuration:

[Image: configuration for the new token]

If you have uses for any other data from this token (org data, or something else), you can pick and choose, but be careful, since it can grant this application access to your notifications, etc. However, any such extension is explicitly a non-goal and would not be maintained by this project.

  6. Launch the crawler:
kustomize build config/crawler/cronjob | kubectl apply -f -

This will run the crawler every day, according to the cron schedule in the cronjob.yaml file.

Instead, to get the crawler running now, you can run:

kustomize build config/crawler/job | kubectl apply -f -

which will launch a non-periodic version of the crawler. It will take a few minutes for the crawler to split the search, but config files should then start to be populated within 20 minutes. The first crawl may take a while, since it has to fetch rate-limited endpoints for each new file it finds; subsequent updates should be significantly faster.

  7. Launch the search backend:
kustomize build config/webapp/backend | kubectl apply -f -
  8. Launch the search frontend:
kustomize build config/webapp/frontend | kubectl apply -f -

Notes about the components

Elasticsearch

I will add a basic working setup soon; for now I just did the lazy thing and used an already packaged solution. Most clouds provide their own Elasticsearch environments; however, Elastic is also working on its own Kubernetes operator, which might be worth checking out. Please note that it comes with its own license agreement.

Redis

There are two Redis instances that are used in this application.

One of them is configured for on-disk persistence, so make sure that is set up in your Kubernetes cluster. Also note that it runs on a single master node (i.e., it does not automatically shard keys across multiple head nodes as part of a highly available cluster). Since it's storing a sparse graph, I can't imagine this being much of an issue, but it's probably worth mentioning.

The other Redis instance runs as an HTTP (RFC 7234) cache for ETags from GitHub (or any other document store we might crawl/index). This one does not require persistent storage on disk. The caching strategy is LRU, which is probably a good starting point. It might be worth investigating other cache policies, but I think LRU will work well: documents may or may not expire anyway, and the amount of memory allocated for keys is fairly large, so eviction of frequently used documents seems unlikely.

Nginx + Angular

There is a Dockerfile included for generating the container image with Nginx (using the default package) and adding all of the supporting compiled Angular files. Any modifications to the codebase should be compatible with this setup, so all that's needed is to rebuild the container image and, possibly, modify the image tags in the k8s file.

Supporting Go binaries

There are a few Go binaries, namely the crawler and the search service, each with its own Dockerfile for building the container it runs in on k8s. Their configurations are not optimal (read: need to be cleaned up), but they are functional.

Technical details

Overall design and implementations

There are a few components that run together to make the overall application work smoothly. This section provides a brief overview of each component, with the following sections going into more detail.

The overall structure is outlined in the following figure: [overview diagram]

Crawler

The leftmost component is a crawler backed by an HTTP cache of GitHub queries. It does two things: first, it looks at the list of documents already in Elasticsearch and tries to update them. In doing so, it maintains a set of newly updated files so they can be excluded from other parts of the crawl.

Second, to find newly added documents, the crawler follows any new dependencies introduced in the document-updating step and also queries GitHub for the most recently indexed kustomization.* files. Each new file is processed for efficient text queries and put into the document index, and any new dependency incurs further crawl operations. Finally, a graph representation of the documents and their dependencies is built in Redis, to be used for graph algorithms such as PageRank and component analysis.

Data library

There are a few helper libraries for dealing with Elasticsearch, Redis, and documents. This layer is neither persistent nor centralized; the libraries are small components that package common pieces of code. Eventually it may make sense to merge them into a proper persistent model with an external API for document insertion/deletion, but that is definitely out of scope for getting this to run. There are limitations with the current model in terms of minimizing the API surface exposed to the different components of the application. For now this is mostly mitigated by having the query server connect only to a data node of the Elasticsearch cluster, but knowing what is accessible and what isn't is left to the programmer instead of being clearly and explicitly supported by the API.

Server

The server uses the data library to communicate with the data store and answer queries. It processes user-entered text queries into somewhat optimized Elasticsearch queries, and provides a few endpoints to get different metrics and, eventually, to allow registration of remote repositories.

The backend is exposed through a service so that users of the application can submit queries and retrieve the results.
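
As a rough illustration only, a query endpoint could have the following shape; the /search route, the q parameter, and the response fields are assumptions, not the backend's actual API.

// Illustrative sketch of a search endpoint, not the actual server code.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type searchResult struct {
	FilePath    string   `json:"filePath"`
	Identifiers []string `json:"identifiers"`
}

func main() {
	http.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query().Get("q")
		if q == "" {
			http.Error(w, "missing query parameter q", http.StatusBadRequest)
			return
		}
		// This is where the data library would turn the text query into an
		// Elasticsearch query and collect the matching documents.
		results := []searchResult{}
		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(results); err != nil {
			log.Println("encode:", err)
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}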

Nginx + Angular

The frontend communicates directly with the backend server to forward user queries and their results, and presents the results in a web interface. It's still pretty simple looking, but it seems usable (to me).

Crawling GitHub

With the use of API keys, GitHub allows account owners to search for files using its API.

The search endpoints allow for metadata filters that are fairly useful/powerful. For instance, the filename: keyword lets us look for kustomization.yaml, kustomization.yml, etc. This yields a list of kustomization documents, whose actual content can then be fetched from another endpoint (raw.githubusercontent.com).
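
A minimal sketch of this kind of request is shown below. It is simplified (no pagination or rate-limit handling), and the GITHUB_API_TOKEN variable name is an assumption rather than the crawler's actual configuration.

// Sketch of a GitHub code-search request using the filename: keyword.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
)

func main() {
	q := url.QueryEscape("filename:kustomization.yaml")
	req, err := http.NewRequest(http.MethodGet,
		"https://api.github.com/search/code?q="+q, nil)
	if err != nil {
		log.Fatal(err)
	}
	// Code search only works for authenticated requests, hence the token.
	req.Header.Set("Authorization", "token "+os.Getenv("GITHUB_API_TOKEN"))
	req.Header.Set("Accept", "application/vnd.github.v3+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Status)
	fmt.Println(string(body)) // JSON with total_count and a paginated item list
}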

However, the search API is fairly limited: there is a restriction on the number of documents that can be retrieved this way. One possible mitigation would be to periodically query GitHub for results sorted by last indexed time, which would let you collect most documents from that point forward. The downside is that it may require a large number of requests to the API, since you cannot know when new files will be added. Furthermore, depending on the velocity of growth, you might not be able to get all of the files either.

The approach taken to mitigate this is to use the filesize: keyword and shard the search space into contiguous buckets of appropriate size, in order to get all of the documents. This is fairly efficient, since you can find a good enough sharding of the documents in about lg(max file size) * number of documents / 1000 API queries. Moreover, since queries are paginated with at most 100 results per query, this solution is competitive with finding the optimal (non-contiguous) sharding of result sets. Furthermore, filesize queries can be cached to minimize the total number of queries needed to shard the search space. This is done by querying for file-size intervals that always start with 0..X and binary searching over the filesize: space, which lets you reuse a lot of queries when looking for the next range, since it is upper and lower bounded within a range that has already been queried. I think this only works so well because file sizes are power-law distributed, so searches typically require fewer queries as they progress from left to right.
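
A sketch of this sharding idea is shown below; it is not the crawler's actual implementation, and the constants and the countFiles stub are assumptions noted in the comments.

// Sketch of filesize sharding via binary search over prefix counts 0..X.
// countFiles stands in for a GitHub code search such as
// "filename:kustomization.yaml filesize:0..X", reading the reported total_count.
package main

import "fmt"

const (
	maxFileSize = 1 << 20 // assumed upper bound on file size, in bytes
	bucketLimit = 1000    // GitHub search exposes at most ~1000 results per query
)

// countFiles returns how many matching files have a size in [0, upper].
// In the real crawler this is a (cacheable) API call; here it is a stub.
func countFiles(upper int) int { return 0 }

// shardBySize returns upper bounds X1 < X2 < ... such that each interval
// (prevX, X] holds at most bucketLimit files.
func shardBySize() []int {
	var bounds []int
	covered := 0 // files already accounted for by earlier buckets
	lower := 0   // lower end of the current bucket

	for lower < maxFileSize {
		// Binary search for the largest X such that (lower, X] still fits.
		lo, hi := lower, maxFileSize
		for lo < hi {
			mid := (lo + hi + 1) / 2
			if countFiles(mid)-covered <= bucketLimit {
				lo = mid
			} else {
				hi = mid - 1
			}
		}
		if lo == lower {
			lo = lower + 1 // a single size holds >1000 files; accept an oversized bucket
		}
		bounds = append(bounds, lo)
		covered = countFiles(lo)
		lower = lo
	}
	return bounds
}

func main() { fmt.Println(shardBySize()) }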

However, this method in no way depends on intervals of the form 0..X; the document counts of the many intervals in the range search could also be added together to make this work. The 0..X approach just seemed simpler to implement, maintain, and debug, so it was preferred.

To get an idea of how efficient this method is: sharding a search space of 7000 documents takes only ~90 API range queries, which should take a few minutes, whereas actually fetching the documents and their relevant metadata (creation time, etc.) takes several hours. This could be made more efficient if a prior distribution were approximated. The prior could be scaled to the number of documents that need to be fetched, and finding a shard with an adequate number of results would then only take a few queries per shard. It could probably be done in a constant number of size queries if the size of each shard is halved, which shouldn't have a terrible performance impact on retrieval. However, there were more pressing things to implement; I might revisit this later.

Document Indexing and Processing

In order to support simple text queries, the structured documents must be processed in a way that makes searching them easy. The current method is to recursively traverse the map of configurations, generating each sub-path as well as each key-value pair at the leaf nodes of the recursion tree.

However, note that this means a document has to be valid YAML/JSON for indexing to happen. The rest of the document is treated mostly as text and uses default text settings from Elasticsearch.

What this means is that for the following yaml document:

resources:
- service.yaml
- deployment.yaml

configmapGenerator:
- name: app-configuration
  files:
  - config.yaml

patchesJson6902:
- target:
    version: v1
    kind: StatefulSet
    name: ss-name
  path: ss-patch.yaml
- target:
    version: v1
    kind: Deployment
    name: dep-name
  path: dep-patch.yaml

the flattened structure would look like this:

{
  "identifiers": [
    "resources",
    "configmapGenerator",
    "configmapGenerator:name",
    "configmapGenerator:files",
    "patchesJson6902",
    "patchesJson6902:target",
    "patchesJson6902:target:version",
    "patchesJson6902:target:kind",
    "patchesJson6902:target:name",
    "patchesJson6902:path",
  ],
  "values": [
    "resources=service.yaml",
    "resources=deployment.yaml",
    "configmapGenerator:name=app-configuration",
    "configmapGenerator:files=config.yaml",
    "patchesJson6902:target:version=v1",
    "patchesJson6902:target:kind=StatefulSet",
    "patchesJson6902:target:name=ss-name",
    "patchesJson6902:path=ss-patch.yaml",
    "patchesJson6902:target:kind=Deployment",
    "patchesJson6902:target:name=dep-name",
    "patchesJson6902:path=dep-patch.yaml",
  ],
  ...
}

Note that unique paths and values are deduplicated.
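
A sketch of the traversal that could produce such a structure is shown below; it is not the indexer's actual code, and it uses maps for the deduplication mentioned above.

// Sketch of recursive flattening of an already-parsed YAML/JSON document into
// "identifiers" (every sub-path) and "values" (key=value pairs at the leaves).
package main

import "fmt"

func flatten(prefix string, node interface{}, ids, vals map[string]bool) {
	switch n := node.(type) {
	case map[string]interface{}:
		for k, v := range n {
			path := k
			if prefix != "" {
				path = prefix + ":" + k
			}
			ids[path] = true
			flatten(path, v, ids, vals)
		}
	case []interface{}:
		for _, v := range n {
			flatten(prefix, v, ids, vals) // lists don't add a path segment
		}
	default:
		vals[fmt.Sprintf("%s=%v", prefix, n)] = true // leaf: record key=value
	}
}

func main() {
	doc := map[string]interface{}{
		"resources": []interface{}{"service.yaml", "deployment.yaml"},
		"configmapGenerator": []interface{}{
			map[string]interface{}{
				"name":  "app-configuration",
				"files": []interface{}{"config.yaml"},
			},
		},
	}
	ids, vals := map[string]bool{}, map[string]bool{}
	flatten("", doc, ids, vals)
	fmt.Println(len(ids), "identifiers,", len(vals), "values")
}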

On the search side, exact matches are prioritized, but the document paths and key=value pairs are also analyzed with 3-grams to provide some amount of fuzzy search. Levenshtein-distance matching was not used instead because queries search multiple fields at the same time, a use case for which Elasticsearch does not support proper fuzzy searching.

Given a text query, each token is considered separately. Each token is fed through a handful of analyzers on the Elasticsearch side and compared against the inverted index of each document field; the best matching documents are then determined. Text ordering is largely insignificant. This makes sense for the structured search, but may leave room for improvement for full-text search within the document.

Each token must be matched, so each whitespace character acts as a conjunction of individual queries. There are also ways of telling Elasticsearch that some clauses should match only optionally, but I think for now it makes more sense to leave it as is.
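
As a rough illustration of such a conjunctive query (the extra document field and the omission of boosts and custom analyzers are simplifications, not the server's exact query):

// Build a bool query where every whitespace token must match across the
// flattened fields; print it as JSON for inspection.
package main

import (
	"encoding/json"
	"os"
	"strings"
)

func buildQuery(userQuery string) map[string]interface{} {
	var must []interface{}
	for _, token := range strings.Fields(userQuery) {
		must = append(must, map[string]interface{}{
			"multi_match": map[string]interface{}{
				"query":  token,
				"fields": []string{"identifiers", "values", "document"},
			},
		})
	}
	return map[string]interface{}{
		"query": map[string]interface{}{
			"bool": map[string]interface{}{"must": must},
		},
	}
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	enc.Encode(buildQuery("patches:target:kind=StatefulSet terminationGracePeriod"))
}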

I think this behavior is sufficient to make the search feel fairly intuitive while providing support for fairly complex use cases.

Metrics Computation

From each kustomization document that is indexed, we can find its publicly available resources, including other kustomizations. From this, we can build a directed graph of dependencies and reverse dependencies.
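
One possible way to store such a graph in the document keystore is sketched below; the deps:/rdeps: key names, the service address, and the example document IDs are assumptions, not the project's actual schema.

// Sketch: store forward and reverse dependency edges as Redis sets.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/go-redis/redis/v8"
)

func addEdge(ctx context.Context, rdb *redis.Client, from, to string) error {
	// "deps:" holds documents that `from` depends on, "rdeps:" the reverse.
	if err := rdb.SAdd(ctx, "deps:"+from, to).Err(); err != nil {
		return err
	}
	return rdb.SAdd(ctx, "rdeps:"+to, from).Err()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "document-keystore:6379"})

	err := addEdge(ctx, rdb,
		"github.com/example/app//kustomization.yaml",
		"github.com/example/app//deployment.yaml")
	if err != nil {
		log.Fatal(err)
	}

	revDeps, err := rdb.SMembers(ctx, "rdeps:github.com/example/app//deployment.yaml").Result()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("reverse dependencies:", revDeps)
}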

This opens up the possibility to add a plethora of graph metrics that can give the project maintainers feedback and insight into how people are using their tools.

Some of these are useful, such as getting an idea of how large dependency graphs actually grow in practice, or finding popular kustomizations within the corpus. This lends itself to implementing PageRank to help bubble popular results up as good search results. I unfortunately did not have the time to implement the algorithm, but I do plan to revisit this soon and add a few good, efficient implementations of graph algorithms that would be useful to have. See Roadmap.md for a more complete list of features that could be added and how I think they could be implemented.

Directories

Path Synopsis
cmd
Package crawler provides helper methods and defines an interface for launching source repository crawlers that retrieve files from a source and forward them to a channel for indexing and retrieval.
github
Package github implements the crawler.Crawler interface, getting data from the GitHub search API.
