hercules

package module
v8.0.1+incompatible Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 20, 2019 License: Apache-2.0 Imports: 12 Imported by: 0

README

Hercules

Fast, insightful and highly customizable Git history analysis.

GoDoc Travis build Status AppVeyor build status Docker build status Code coverage Go Report Card Apache 2.0 license

OverviewHow To UseInstallationContributionsLicense


Overview

Hercules is an amazingly fast and highly customizable Git repository analysis engine written in Go. Batteries are included. It is powered by go-git and Babelfish.

There are two command-line tools: hercules and labours.py. The first is the program written in Go which takes a Git repository and runs a Directed Acyclic Graph (DAG) of analysis tasks over the full commit history. The second is the Python script which draws some predefined plots. These two tools are normally used together through a pipe. It is possible to write custom analyses using the plugin system. It is also possible to merge several analysis results together. The commit history includes branches, merges, etc.

Blog posts: 1, 2. Presentation.

Hercules DAG of Burndown analysis

The DAG of burndown and couples analyses with UAST diff refining. Generated with hercules --burndown --burndown-people --couples --feature=uast --dry-run --dump-dag doc/dag.dot https://github.com/src-d/hercules

git/git image

torvalds/linux line burndown (granularity 30, sampling 30, resampled by year). Generated with hercules --burndown --first-parent --pb https://github.com/torvalds/linux | python3 labours.py -f pb -m burndown-project

Installation

Grab hercules binary from the Releases page. labours.py requires the Python packages listed in requirements.txt:

pip3 install -r requirements.txt

pip3 is the Python package manager.

Numpy and Scipy can be installed on Windows using http://www.lfd.uci.edu/~gohlke/pythonlibs/

Build from source

You are going to need Go (>= v1.10), protoc, and dep.

go get -d gopkg.in/src-d/hercules.v7/cmd/hercules
cd $GOPATH/src/gopkg.in/src-d/hercules.v7
make

Replace $GOPATH with %GOPATH% on Windows.

Contributions

...are welcome! See CONTRIBUTING and code of conduct.

License

Apache 2.0

Usage

The most useful and reliably up-to-date command line reference:

hercules --help

Some examples:

# Use "memory" go-git backend and display the burndown plot. "memory" is the fastest but the repository's git data must fit into RAM.
hercules --burndown https://github.com/src-d/go-git | python3 labours.py -m burndown-project --resample month
# Use "file system" go-git backend and print some basic information about the repository.
hercules /path/to/cloned/go-git
# Use "file system" go-git backend, cache the cloned repository to /tmp/repo-cache, use Protocol Buffers and display the burndown plot without resampling.
hercules --burndown --pb https://github.com/git/git /tmp/repo-cache | python3 labours.py -m burndown-project -f pb --resample raw

# Now something fun
# Get the linear history from git rev-list, reverse it
# Pipe to hercules, produce burndown snapshots for every 30 days grouped by 30 days
# Save the raw data to cache.yaml, so that later is possible to python3 labours.py -i cache.yaml
# Pipe the raw data to labours.py, set text font size to 16pt, use Agg matplotlib backend and save the plot to output.png
git rev-list HEAD | tac | hercules --commits - --burndown https://github.com/git/git | tee cache.yaml | python3 labours.py -m burndown-project --font-size 16 --backend Agg --output git.png

labours.py -i /path/to/yaml allows to read the output from hercules which was saved on disk.

Caching

It is possible to store the cloned repository on disk. The subsequent analysis can run on the corresponding directory instead of cloning from scratch:

# First time - cache
hercules https://github.com/git/git /tmp/repo-cache

# Second time - use the cache
hercules --some-analysis /tmp/repo-cache
Docker image
docker run --rm srcd/hercules hercules --burndown --pb https://github.com/git/git | docker run --rm -i -v $(pwd):/io srcd/hercules labours.py -f pb -m burndown-project -o /io/git_git.png

Built-in analyses

Project burndown
hercules --burndown
python3 labours.py -m burndown-project

Line burndown statistics for the whole repository. Exactly the same what git-of-theseus does but much faster. Blaming is performed efficiently and incrementally using a custom RB tree tracking algorithm, and only the last modification date is recorded while running the analysis.

All burndown analyses depend on the values of granularity and sampling. Granularity is the number of days each band in the stack consists of. Sampling is the frequency with which the burnout state is snapshotted. The smaller the value, the more smooth is the plot but the more work is done.

There is an option to resample the bands inside labours.py, so that you can define a very precise distribution and visualize it different ways. Besides, resampling aligns the bands across periodic boundaries, e.g. months or years. Unresampled bands are apparently not aligned and start from the project's birth date.

Files
hercules --burndown --burndown-files
python3 labours.py -m burndown-file

Burndown statistics for every file in the repository which is alive in the latest revision.

Note: it will generate separate graph for every file. You might don't want to run it on repository with many files.

People
hercules --burndown --burndown-people [-people-dict=/path/to/identities]
python3 labours.py -m burndown-person

Burndown statistics for the repository's contributors. If -people-dict is not specified, the identities are discovered by the following algorithm:

  1. We start from the root commit towards the HEAD. Emails and names are converted to lower case.
  2. If we process an unknown email and name, record them as a new developer.
  3. If we process a known email but unknown name, match to the developer with the matching email, and add the unknown name to the list of that developer's names.
  4. If we process an unknown email but known name, match to the developer with the matching name, and add the unknown email to the list of that developer's emails.

If -people-dict is specified, it should point to a text file with the custom identities. The format is: every line is a single developer, it contains all the matching emails and names separated by |. The case is ignored.

Churn matrix

Wireshark top 20 churn matrix

Wireshark top 20 devs - churn matrix

hercules --burndown --burndown-people [-people-dict=/path/to/identities]
python3 labours.py -m churn-matrix

Beside the burndown information, --burndown-people collects the added and deleted line statistics per developer. Thus it can be visualized how many lines written by developer A are removed by developer B. This indicates collaboration between people and defines expertise teams.

The format is the matrix with N rows and (N+2) columns, where N is the number of developers.

  1. First column is the number of lines the developer wrote.
  2. Second column is how many lines were written by the developer and deleted by unidentified developers (if -people-dict is not specified, it is always 0).
  3. The rest of the columns show how many lines were written by the developer and deleted by identified developers.

The sequence of developers is stored in people_sequence YAML node.

Code ownership

Ember.js top 20 code ownership

Ember.js top 20 devs - code ownership

hercules --burndown --burndown-people [-people-dict=/path/to/identities]
python3 labours.py -m ownership

--burndown-people also allows to draw the code share through time stacked area plot. That is, how many lines are alive at the sampled moments in time for each identified developer.

Couples

Linux kernel file couples

torvalds/linux files' coupling in Tensorflow Projector

hercules --couples [-people-dict=/path/to/identities]
python3 labours.py -m couples -o <name> [--couples-tmp-dir=/tmp]

Important: it requires Tensorflow to be installed, please follow official instructions.

The files are coupled if they are changed in the same commit. The developers are coupled if they change the same file. hercules records the number of couples throughout the whole commit history and outputs the two corresponding co-occurrence matrices. labours.py then trains Swivel embeddings - dense vectors which reflect the co-occurrence probability through the Euclidean distance. The training requires a working Tensorflow installation. The intermediate files are stored in the system temporary directory or --couples-tmp-dir if it is specified. The trained embeddings are written to the current working directory with the name depending on -o. The output format is TSV and matches Tensorflow Projector so that the files and people can be visualized with t-SNE implemented in TF Projector.

Structural hotness
      46  jinja2/compiler.py:visit_Template [FunctionDef]
      42  jinja2/compiler.py:visit_For [FunctionDef]
      34  jinja2/compiler.py:visit_Output [FunctionDef]
      29  jinja2/environment.py:compile [FunctionDef]
      27  jinja2/compiler.py:visit_Include [FunctionDef]
      22  jinja2/compiler.py:visit_Macro [FunctionDef]
      22  jinja2/compiler.py:visit_FromImport [FunctionDef]
      21  jinja2/compiler.py:visit_Filter [FunctionDef]
      21  jinja2/runtime.py:__call__ [FunctionDef]
      20  jinja2/compiler.py:visit_Block [FunctionDef]

Thanks to Babelfish, hercules is able to measure how many times each structural unit has been modified. By default, it looks at functions; refer to Semantic UAST XPath manual to switch to something else.

hercules --shotness [--shotness-xpath-*]
python3 labours.py -m shotness

Couples analysis automatically loads "shotness" data if available.

Jinja2 functions grouped by structural hotness

hercules --shotness --pb https://github.com/pallets/jinja | python3 labours.py -m couples -f pb

Aligned commit series

tensorflow/tensorflow

tensorflow/tensorflow aligned commit series of top 50 developers by commit number.

hercules --devs [-people-dict=/path/to/identities]
python3 labours.py -m devs -o <name>

We record how many commits made, as well as lines added, removed and changed per day for each developer. We plot the resulting commit time series using a few tricks to show the temporal grouping. In other words, two adjacent commit series should look similar after normalization.

  1. We compute the distance matrix of the commit series. Our distance metric is Dynamic Time Warping. We use FastDTW algorithm which has linear complexity proportional to the length of time series. Thus the overall complexity of computing the matrix is quadratic.
  2. We compile the linear list of commit series with Seriation technique. Particularly, we solve the Travelling Salesman Problem which is NP-complete. However, given the typical number of developers which is less than 1,000, there is a good chance that the solution does not take much time. We use Google or-tools solver.
  3. We find 1-dimensional clusters in the resulting path with HDBSCAN algorithm and assign colors accordingly.
  4. Time series are smoothed by convolving with the Slepian window.

This plot allows to discover how the development team evolved through time. It also shows "commit flashmobs" such as Hacktoberfest. For example, here are the revealed insights from the tensorflow/tensorflow plot above:

  1. "Tensorflow Gardener" is classified as the only outlier.
  2. The "blue" group of developers covers the global maintainers and a few people who left (at the top).
  3. The "red" group shows how core developers join the project or become less active.
Added vs changed lines through time

tensorflow/tensorflow

tensorflow/tensorflow added and changed lines through time.

hercules --devs [-people-dict=/path/to/identities]
python3 labours.py -m old-vs-new -o <name>

--devs from the previous section allows to plot how many lines were added and how many existing changed (deleted or replaced) through time. This plot is smoothed.

Sentiment (positive and negative code)

Django sentiment

hercules --sentiment --pb https://github.com/django/django | python3 labours.py -m sentiment -f pb

We extract new or changed comments from source code on every commit, apply BiDiSentiment general purpose sentiment recurrent neural network and plot the results. Requires libtensorflow. E.g. sadly, we need to hide the rect from the documentation finder for now is negative and Theano has a built-in optimization for logsumexp (...) so we can just write the expression directly is positive. Don't expect too much though - as was written, the sentiment model is general purpose and the code comments have different nature, so there is no magic (for now).

Hercules must be built with "tensorflow" tag - it is not by default:

make TAGS=tensorflow

Such a build requires libtensorflow.

Everything in a single pass
hercules --burndown --burndown-files --burndown-people --couples --shotness --devs [-people-dict=/path/to/identities]
python3 labours.py -m all

Plugins

Hercules has a plugin system and allows to run custom analyses. See PLUGINS.md.

Merging

hercules combine is the command which joins several analysis results in Protocol Buffers format together.

hercules --burndown --pb https://github.com/src-d/go-git > go-git.pb
hercules --burndown --pb https://github.com/src-d/hercules > hercules.pb
hercules combine go-git.pb hercules.pb | python3 labours.py -f pb -m burndown-project --resample M

Bad unicode errors

YAML does not support the whole range of Unicode characters and the parser on labours.py side may raise exceptions. Filter the output from hercules through fix_yaml_unicode.py to discard such offending characters.

hercules --burndown --burndown-people https://github.com/... | python3 fix_yaml_unicode.py | python3 labours.py -m people

Plotting

These options affects all plots:

python3 labours.py [--style=white|black] [--backend=] [--size=Y,X]

--style sets the general style of the plot (see labours.py --help). --background changes the plot background to be either white or black. --backend chooses the Matplotlib backend. --size sets the size of the figure in inches. The default is 12,9.

(required in macOS) you can pin the default Matplotlib backend with

echo "backend: TkAgg" > ~/.matplotlib/matplotlibrc

These options are effective in burndown charts only:

python3 labours.py [--text-size] [--relative]

--text-size changes the font size, --relative activate the stretched burndown layout.

Custom plotting backend

It is possible to output all the information needed to draw the plots in JSON format. Simply append .json to the output (-o) and you are done. The data format is not fully specified and depends on the Python code which generates it. Each JSON file should contain "type" which reflects the plot kind.

Caveats

  1. Processing all the commits may fail in some rare cases. If you get an error similar to https://github.com/src-d/hercules/issues/106 please report there and specify --first-parent as a workaround.
  2. Burndown collection may fail with an Out-Of-Memory error. See the next session for the workarounds.
  3. Parsing YAML in Python is slow when the number of internal objects is big. hercules' output for the Linux kernel in "couples" mode is 1.5 GB and takes more than an hour / 180GB RAM to be parsed. However, most of the repositories are parsed within a minute. Try using Protocol Buffers instead (hercules --pb and labours.py -f pb).
  4. To speed up yaml parsing
    # Debian, Ubuntu
    apt install libyaml-dev
    # macOS
    brew install yaml-cpp libyaml
    
    # you might need to re-install pyyaml for changes to make effect
    pip uninstall pyyaml
    pip --no-cache-dir install pyyaml
    

Burndown Out-Of-Memory

If the analyzed repository is big and extensively uses branching, the burndown stats collection may fail with an OOM. You should try the following:

  1. Read the repo from disk instead of cloning into memory.
  2. Use --skip-blacklist to avoid analyzing the unwanted files. It is also possible to constrain the --language.
  3. Use the hibernation feature: --hibernation-distance 10 --burndown-hibernation-threshold=1000. Play with those two numbers to start hibernating right before the OOM.
  4. Hibernate on disk: --burndown-hibernation-disk --burndown-hibernation-dir /path.
  5. --first-parent, you win.

Documentation

Overview

Package hercules contains the functions which are needed to gather various statistics from a Git repository.

The analysis is expressed in a form of the tree: there are nodes - "pipeline items" - which require some other nodes to be executed prior to selves and in turn provide the data for dependent nodes. There are several service items which do not produce any useful statistics but rather provide the requirements for other items. The top-level items include:

- BurndownAnalysis - line burndown statistics for project, files and developers.

- CouplesAnalysis - coupling statistics for files and developers.

- ShotnessAnalysis - structural hotness and couples, by any Babelfish UAST XPath (functions by default).

The typical API usage is to initialize the Pipeline class:

import "gopkg.in/src-d/go-git.v4"

var repository *git.Repository
// ...initialize repository...
pipeline := hercules.NewPipeline(repository)

Then add the required analysis:

ba := pipeline.DeployItem(&hercules.BurndownAnalysis{}).(hercules.LeafPipelineItem)

This call will add all the needed intermediate pipeline items. Then link and execute the analysis tree:

pipeline.Initialize(nil)
result, err := pipeline.Run(pipeline.Commits(false))

Finally extract the result:

result := result[ba].(hercules.BurndownResult)

The actual usage example is cmd/hercules/root.go - the command line tool's code.

Hercules depends heavily on https://github.com/src-d/go-git and leverages the diff algorithm through https://github.com/sergi/go-diff.

Besides, BurndownAnalysis involves File and RBTree. These are low level data structures which enable incremental blaming. File carries an instance of RBTree and the current line burndown state. RBTree implements the red-black balanced binary tree and is based on https://github.com/yasushi-saito/rbtree.

Coupling stats are supposed to be further processed rather than observed directly. labours.py uses Swivel embeddings and visualises them in Tensorflow Projector.

Shotness analysis as well as other UAST-featured items relies on [Babelfish](https://doc.bblf.sh) and requires the server to be running.

Index

Constants

View Source
const (
	// BoolConfigurationOption reflects the boolean value type.
	BoolConfigurationOption = core.BoolConfigurationOption
	// IntConfigurationOption reflects the integer value type.
	IntConfigurationOption = core.IntConfigurationOption
	// StringConfigurationOption reflects the string value type.
	StringConfigurationOption = core.StringConfigurationOption
	// FloatConfigurationOption reflects a floating point value type.
	FloatConfigurationOption = core.FloatConfigurationOption
	// StringsConfigurationOption reflects the array of strings value type.
	StringsConfigurationOption = core.StringsConfigurationOption
	// MessageFinalize is the status text reported before calling LeafPipelineItem.Finalize()-s.
	MessageFinalize = core.MessageFinalize
)
View Source
const (
	// ConfigPipelineDAGPath is the name of the Pipeline configuration option (Pipeline.Initialize())
	// which enables saving the items DAG to the specified file.
	ConfigPipelineDAGPath = core.ConfigPipelineDAGPath
	// ConfigPipelineDumpPlan is the name of the Pipeline configuration option (Pipeline.Initialize())
	// which outputs the execution plan to stderr.
	ConfigPipelineDumpPlan = core.ConfigPipelineDumpPlan
	// ConfigPipelineDryRun is the name of the Pipeline configuration option (Pipeline.Initialize())
	// which disables Configure() and Initialize() invocation on each PipelineItem during the
	// Pipeline initialization.
	// Subsequent Run() calls are going to fail. Useful with ConfigPipelineDAGPath=true.
	ConfigPipelineDryRun = core.ConfigPipelineDryRun
	// ConfigPipelineCommits is the name of the Pipeline configuration option (Pipeline.Initialize())
	// which allows to specify the custom commit sequence. By default, Pipeline.Commits() is used.
	ConfigPipelineCommits = core.ConfigPipelineCommits
)
View Source
const (
	// DependencyCommit is the name of one of the three items in `deps` supplied to PipelineItem.Consume()
	// which always exists. It corresponds to the currently analyzed commit.
	DependencyCommit = core.DependencyCommit
	// DependencyIndex is the name of one of the three items in `deps` supplied to PipelineItem.Consume()
	// which always exists. It corresponds to the currently analyzed commit's index.
	DependencyIndex = core.DependencyIndex
	// DependencyIsMerge is the name of one of the three items in `deps` supplied to PipelineItem.Consume()
	// which always exists. It indicates whether the analyzed commit is a merge commit.
	// Checking the number of parents is not correct - we remove the back edges during the DAG simplification.
	DependencyIsMerge = core.DependencyIsMerge
	// DependencyAuthor is the name of the dependency provided by identity.Detector.
	DependencyAuthor = identity.DependencyAuthor
	// DependencyBlobCache identifies the dependency provided by BlobCache.
	DependencyBlobCache = plumbing.DependencyBlobCache
	// DependencyDay is the name of the dependency which DaysSinceStart provides - the number
	// of days since the first commit in the analysed sequence.
	DependencyDay = plumbing.DependencyDay
	// DependencyFileDiff is the name of the dependency provided by FileDiff.
	DependencyFileDiff = plumbing.DependencyFileDiff
	// DependencyTreeChanges is the name of the dependency provided by TreeDiff.
	DependencyTreeChanges = plumbing.DependencyTreeChanges
	// DependencyUastChanges is the name of the dependency provided by Changes.
	DependencyUastChanges = uast.DependencyUastChanges
	// DependencyUasts is the name of the dependency provided by Extractor.
	DependencyUasts = uast.DependencyUasts
	// FactCommitsByDay contains the mapping between day indices and the corresponding commits.
	FactCommitsByDay = plumbing.FactCommitsByDay
	// FactIdentityDetectorPeopleCount is the name of the fact which is inserted in
	// identity.Detector.Configure(). It is equal to the overall number of unique authors
	// (the length of ReversedPeopleDict).
	FactIdentityDetectorPeopleCount = identity.FactIdentityDetectorPeopleCount
	// FactIdentityDetectorPeopleDict is the name of the fact which is inserted in
	// identity.Detector.Configure(). It corresponds to identity.Detector.PeopleDict - the mapping
	// from the signatures to the author indices.
	FactIdentityDetectorPeopleDict = identity.FactIdentityDetectorPeopleDict
	// FactIdentityDetectorReversedPeopleDict is the name of the fact which is inserted in
	// identity.Detector.Configure(). It corresponds to identity.Detector.ReversedPeopleDict -
	// the mapping from the author indices to the main signature.
	FactIdentityDetectorReversedPeopleDict = identity.FactIdentityDetectorReversedPeopleDict
)

Variables

View Source
var BinaryGitHash = "<unknown>"

BinaryGitHash is the Git hash of the Hercules binary file which is executing.

View Source
var BinaryVersion = 0

BinaryVersion is Hercules' API version. It matches the package name.

View Source
var Registry = core.Registry

Registry contains all known pipeline item types.

Functions

func EnablePathFlagTypeMasquerade

func EnablePathFlagTypeMasquerade()

EnablePathFlagTypeMasquerade changes the type of all "path" command line arguments from "string" to "path". This operation cannot be canceled and is intended to be used for better --help output.

func LoadCommitsFromFile

func LoadCommitsFromFile(path string, repository *git.Repository) ([]*object.Commit, error)

LoadCommitsFromFile reads the file by the specified FS path and generates the sequence of commits by interpreting each line as a Git commit hash.

func PathifyFlagValue

func PathifyFlagValue(flag *pflag.Flag)

PathifyFlagValue changes the type of a string command line argument to "path".

func SafeYamlString

func SafeYamlString(str string) string

SafeYamlString escapes the string so that it can be reliably used in YAML.

Types

type CachedBlob

type CachedBlob = plumbing.CachedBlob

CachedBlob allows to explicitly cache the binary data associated with the Blob object. Such structs are returned by DependencyBlobCache.

type CommonAnalysisResult

type CommonAnalysisResult = core.CommonAnalysisResult

CommonAnalysisResult holds the information which is always extracted at Pipeline.Run().

func MetadataToCommonAnalysisResult

func MetadataToCommonAnalysisResult(meta *core.Metadata) *CommonAnalysisResult

MetadataToCommonAnalysisResult copies the data from a Protobuf message.

type ConfigurationOption

type ConfigurationOption = core.ConfigurationOption

ConfigurationOption allows for the unified, retrospective way to setup PipelineItem-s.

type ConfigurationOptionType

type ConfigurationOptionType = core.ConfigurationOptionType

ConfigurationOptionType represents the possible types of a ConfigurationOption's value.

type FeaturedPipelineItem

type FeaturedPipelineItem = core.FeaturedPipelineItem

FeaturedPipelineItem enables switching the automatic insertion of pipeline items on or off.

type FileDiffData

type FileDiffData = plumbing.FileDiffData

FileDiffData is the type of the dependency provided by plumbing.FileDiff.

type LeafPipelineItem

type LeafPipelineItem = core.LeafPipelineItem

LeafPipelineItem corresponds to the top level pipeline items which produce the end results.

type NoopMerger

type NoopMerger = core.NoopMerger

NoopMerger provides an empty Merge() method suitable for PipelineItem.

type OneShotMergeProcessor

type OneShotMergeProcessor = core.OneShotMergeProcessor

OneShotMergeProcessor provides the convenience method to consume merges only once.

type Pipeline

type Pipeline = core.Pipeline

Pipeline is the core Hercules entity which carries several PipelineItems and executes them. See the extended example of how a Pipeline works in doc.go

func NewPipeline

func NewPipeline(repository *git.Repository) *Pipeline

NewPipeline initializes a new instance of Pipeline struct.

type PipelineItem

type PipelineItem = core.PipelineItem

PipelineItem is the interface for all the units in the Git commits analysis pipeline.

func ForkCopyPipelineItem

func ForkCopyPipelineItem(origin PipelineItem, n int) []PipelineItem

ForkCopyPipelineItem clones items by copying them by value from the origin.

func ForkSamePipelineItem

func ForkSamePipelineItem(origin PipelineItem, n int) []PipelineItem

ForkSamePipelineItem clones items by referencing the same origin.

type PipelineItemRegistry

type PipelineItemRegistry = core.PipelineItemRegistry

PipelineItemRegistry contains all the known PipelineItem-s.

type ResultMergeablePipelineItem

type ResultMergeablePipelineItem = core.ResultMergeablePipelineItem

ResultMergeablePipelineItem specifies the methods to combine several analysis results together.

Directories

Path Synopsis
cmd
contrib
pb
Package pb is a generated protocol buffer package.
Package pb is a generated protocol buffer package.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL