hercules

package module
v2.0.0-...-7ef6ec8
Published: Sep 22, 2017 License: Apache-2.0 Imports: 18 Imported by: 0

README

Hercules

This project calculates and plots the lines burndown and other fun stats in Git repositories. It does exactly what git-of-theseus does, but using go-git. Why? source{d} is building its own data pipeline to process every Git repository in the world, and the calculation of the annual burnout ratio will be embedded into it. hercules contains an open source implementation of a specific git blame flavour on top of go-git. Blaming is performed incrementally using a custom RB tree tracking algorithm; only the last modification date is recorded.

There are two tools: hercules and labours.py. The first is a program written in Go which collects the burndown and other stats from a Git repository. The second is a Python script which draws the stacked area plots and optionally resamples the time series. These two tools are normally used together through a pipe. hercules prints results in plain text. The first line is four numbers: the UNIX timestamp which corresponds to the time the repository was created, the UNIX timestamp of the last commit, the granularity and the sampling. Granularity is the number of days each band in the stack consists of. Sampling is the frequency with which the burndown state is snapshotted. The smaller the value, the smoother the plot, but the more work is done.
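For example, the header might look like this (all four values are hypothetical):

1410545645 1506093065 30 30

i.e. the repository was created in September 2014, the last commit is from September 2017, and both granularity and sampling are 30 days.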


torvalds/linux burndown (granularity 30, sampling 30, resampled by year)

There is an option to resample the bands inside labours.py, so that you can define a very precise distribution and visualize it in different ways. Besides, resampling aligns the bands across periodic boundaries, e.g. months or years. Unresampled bands are naturally not aligned and start from the project's birth date.

There is a presentation available.

Installation

You are going to need Go (>= v1.8) and Python 2 or 3.

go get gopkg.in/src-d/hercules.v2/cmd/hercules
pip install -r requirements.txt
wget https://github.com/src-d/hercules/raw/master/labours.py
Windows

NumPy and SciPy are required. Install the correct versions by downloading the wheels from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy.

Usage

# Use "memory" go-git backend and display the plot. This is the fastest but the repository data must fit into RAM.
hercules https://github.com/src-d/go-git | python3 labours.py --resample month
# Use "file system" go-git backend and print the raw data.
hercules /path/to/cloned/go-git
# Use "file system" go-git backend, cache the cloned repository to /tmp/repo-cache, use Protocol Buffers and display the unresampled plot.
hercules -pb https://github.com/git/git /tmp/repo-cache | python3 labours.py -f pb --resample raw

# Now something fun
# Get the linear history from git rev-list, reverse it
# Pipe to hercules, produce the snapshots for every 30 days grouped by 30 days
# Save the raw data to cache.yaml, so that later it is possible to run python3 labours.py -i cache.yaml
# Pipe the raw data to labours.py, set text font size to 16pt, use Agg matplotlib backend and save the plot to output.png
git rev-list HEAD | tac | hercules -commits - https://github.com/git/git | tee cache.yaml | python3 labours.py --font-size 16 --backend Agg --output git.png

labours.py -i /path/to/yaml reads the output from hercules which was saved on disk.

Caching

It is possible to store the cloned repository on disk. The subsequent analysis can run on the corresponding directory instead of cloning from scratch:

# First time - cache
hercules https://github.com/git/git /tmp/repo-cache

# Second time - use the cache
hercules /tmp/repo-cache
Docker image
docker run --rm srcd/hercules hercules -pb https://github.com/git/git | docker run --rm -i -v $(pwd):/io srcd/hercules labours.py -f pb -o /io/git_git.png

Extensions

Files
hercules -files
python3 labours.py -m files

Burndown statistics for every file in the repository which is alive in the latest revision.

People
hercules -people [-people-dict=/path/to/identities]
python3 labours.py -m person

Burndown statistics for developers. If -people-dict is not specified, the identities are discovered by the following algorithm:

  1. We start from the root commit and proceed towards HEAD. Emails and names are converted to lower case.
  2. If we process an unknown email and name, record them as a new developer.
  3. If we process a known email but unknown name, match to the developer with the matching email, and add the unknown name to the list of that developer's names.
  4. If we process an unknown email but known name, match to the developer with the matching name, and add the unknown email to the list of that developer's emails.

If -people-dict is specified, it should point to a text file with the custom identities. The format is: every line is a single developer; it contains all the matching emails and names separated by |. The case is ignored.
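For illustration, a hypothetical identities file with two developers (all names and emails are invented):

Jane Doe|jane.doe@example.com|jane@example.org
John Smith|john.smith@example.com|jsmith@example.org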

Churn matrix


Wireshark top 20 devs - churn matrix

hercules -people [-people-dict=/path/to/identities]
python3 labours.py -m churn_matrix

Besides the burndown information, -people collects the added and deleted line statistics per developer. This shows how many lines written by developer A were removed by developer B. The format is a matrix with N rows and (N+2) columns, where N is the number of developers.

  1. The first column is the number of lines the developer wrote.
  2. The second column is how many lines were written by the developer and deleted by unidentified developers (if -people-dict is not specified, it is always 0).
  3. The rest of the columns show how many lines were written by the developer and deleted by each identified developer.

The sequence of developers is stored in people_sequence YAML node.
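For illustration, a hypothetical matrix for two developers (all numbers invented; rows and the last two columns follow people_sequence):

dev0: 1000  0  50  30
dev1:  500  0  20  10

Here dev0 wrote 1000 lines; none of them were deleted by unidentified developers; 50 were deleted by dev0 and 30 by dev1.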

Code ownership


Ember.js top 20 devs - code ownership

hercules -people [-people-dict=/path/to/identities]
python3 labours.py -m ownership

-people also allows drawing the code share through time as a stacked area plot. That is, how many lines are alive at each sampled moment in time for each identified developer.

Couples


torvalds/linux files' coupling in Tensorflow Projector

hercules -couples [-people-dict=/path/to/identities]
python3 labours.py -m couples -o <name> [--couples-tmp-dir=/tmp]

Important: it requires Tensorflow to be installed, please follow the official instructions.

The files are coupled if they are changed in the same commit. The developers are coupled if they change the same file. hercules records the number of couples throughout the whole commit history and outputs the two corresponding co-occurrence matrices. labours.py then trains Swivel embeddings - dense vectors which reflect the co-occurrence probability through the Euclidean distance. The training requires a working Tensorflow installation. The intermediate files are stored in the system temporary directory or --couples-tmp-dir if it is specified. The trained embeddings are written to the current working directory with the name depending on -o. The output format is TSV and matches Tensorflow Projector so that the files and people can be visualized with t-SNE implemented in TF Projector.

Everything in a single pass
hercules -files -people -couples [-people-dict=/path/to/identities]
python3 labours.py -m all

Bad unicode errors

YAML does not support the whole range of Unicode characters and the parser on the labours.py side may raise exceptions. Filter the output from hercules through fix_yaml_unicode.py to discard the offending characters.

hercules -people https://github.com/... | python3 fix_yaml_unicode.py | python3 labours.py -m people

Plotting

These options affect all plots:

python3 labours.py [--style=white|black] [--backend=] [--size=Y,X]

--style changes the background to be either white ("black" foreground) or black ("white" foreground). --backend chooses the Matplotlib backend. --size sets the size of the figure in inches. The default is 12,9.

(required on macOS) You can pin the default Matplotlib backend with

echo "backend: TkAgg" > ~/.matplotlib/matplotlibrc

These options are effective in burndown charts only:

python3 labours.py [--text-size] [--relative]

--text-size changes the font size, --relative activates the stretched burndown layout.

Custom plotting backend

It is possible to output all the information needed to draw the plots in JSON format. Simply append .json to the output file name (-o) and you are done. The data format is not fully specified and depends on the Python code which generates it. Each JSON file contains "type" which reflects the plot kind.
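For example (a sketch which reuses cache.yaml from the usage section above):

python3 labours.py -i cache.yaml -o burndown.json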

Caveats

  1. Currently, go-git's file system storage backend is considerably slower than the in-memory one, so you should clone repos instead of reading them from disk whenever possible. Please note that the in-memory storage may require a lot of RAM; for example, the Linux kernel takes over 200GB as of 2017.
  2. Parsing YAML in Python is slow when the number of internal objects is big. hercules' output for the Linux kernel in "couples" mode is 1.5 GB and takes more than an hour and 180GB of RAM to be parsed. However, most repositories are parsed within a minute. Try using Protocol Buffers instead (hercules -pb and labours.py -f pb).
  3. To speed up YAML parsing, install libyaml:
    # Debian, Ubuntu
    apt install libyaml-dev
    # macOS
    brew install yaml-cpp libyaml
    
    # you might need to re-install pyyaml for the changes to take effect
    pip uninstall pyyaml
    pip --no-cache-dir install pyyaml
    

License

Apache 2.0.

Documentation

Overview

Package hercules contains the functions which are needed to gather various statistics from a Git repository.

The analysis is expressed in the form of a tree: there are nodes - "pipeline items" - which require some other nodes to be executed before them and in turn provide the data for the dependent nodes. There are several service items which do not produce any useful statistics but rather provide the requirements for other items. The top-level items are:

- BurndownAnalysis - line burndown statistics for the project, files and developers.
- Couples - coupling statistics for files and developers.

The typical API usage is to initialize the Pipeline class:

  import "gopkg.in/src-d/go-git.v4"

	var repository *git.Repository
	// ...initialize repository...
	pipeline := hercules.NewPipeline(repository)

Then add the required analysis tree nodes:

  pipeline.AddItem(&hercules.BlobCache{})
  pipeline.AddItem(&hercules.DaysSinceStart{})
  pipeline.AddItem(&hercules.TreeDiff{})
  pipeline.AddItem(&hercules.FileDiff{})
  pipeline.AddItem(&hercules.RenameAnalysis{SimilarityThreshold: 80})
  pipeline.AddItem(&hercules.IdentityDetector{})

Then initialize BurndownAnalysis:

  burndowner := &hercules.BurndownAnalysis{
    Granularity: 30,
    Sampling:    30,
  }
  pipeline.AddItem(burndowner)

Then execute the analysis tree:

  pipeline.Initialize()
  result, err := pipeline.Run(commits)

Finally extract the result:

  burndownResults := result[burndowner].(hercules.BurndownResult)

The actual usage example is cmd/hercules/main.go - the command line tool's code.
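Putting the pieces together, a minimal end-to-end sketch (the clone step uses go-git's in-memory storage, see Caveats; error handling is reduced to panics for brevity):

  package main

  import (
    "fmt"

    git "gopkg.in/src-d/go-git.v4"
    "gopkg.in/src-d/go-git.v4/storage/memory"
    hercules "gopkg.in/src-d/hercules.v2"
  )

  func main() {
    // Clone into memory - the fastest backend.
    repository, err := git.Clone(memory.NewStorage(), nil, &git.CloneOptions{
      URL: "https://github.com/src-d/go-git",
    })
    if err != nil {
      panic(err)
    }
    pipeline := hercules.NewPipeline(repository)
    pipeline.AddItem(&hercules.BlobCache{})
    pipeline.AddItem(&hercules.DaysSinceStart{})
    pipeline.AddItem(&hercules.TreeDiff{})
    pipeline.AddItem(&hercules.FileDiff{})
    pipeline.AddItem(&hercules.RenameAnalysis{SimilarityThreshold: 80})
    pipeline.AddItem(&hercules.IdentityDetector{})
    burndowner := &hercules.BurndownAnalysis{Granularity: 30, Sampling: 30}
    pipeline.AddItem(burndowner)
    pipeline.Initialize()
    // Commits() follows the first-parent chain; it is assumed suitable for
    // Run(), as in cmd/hercules/main.go.
    result, err := pipeline.Run(pipeline.Commits())
    if err != nil {
      panic(err)
    }
    burndown := result[burndowner].(hercules.BurndownResult)
    fmt.Println(len(burndown.GlobalHistory), "burndown samples")
  }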

Hercules depends heavily on https://github.com/src-d/go-git and leverages the diff algorithm through https://github.com/sergi/go-diff.

Besides, hercules defines File and RBTree. These are low-level data structures required by BurndownAnalysis. File carries an instance of RBTree and the current line burndown state. RBTree implements a red-black balanced binary tree and is based on https://github.com/yasushi-saito/rbtree.

Coupling stats are supposed to be further processed rather than observed directly. labours.py uses Swivel embeddings and visualises them in Tensorflow Projector.

Index

Constants

const MISSING_AUTHOR = (1 << 18) - 1
const SELF_AUTHOR = (1 << 18) - 2
const TreeEnd int = -1

TreeEnd denotes the value of the last leaf in the tree.

Variables

This section is empty.

Functions

func LoadCommitsFromFile

func LoadCommitsFromFile(path string, repository *git.Repository) ([]*object.Commit, error)

func ParseMailmap

func ParseMailmap(contents string) map[string]object.Signature

ParseMailmap parses the contents of .mailmap and returns the mapping between signature parts. It does *not* follow the full signature matching convention, that is, developers are identified by email and by name independently.
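A small usage sketch (the mailmap line is hypothetical; object.Signature is go-git's type with Name and Email fields):

  mapping := hercules.ParseMailmap("Jane Doe <jane@example.com> <jane@old.example.org>")
  for part, signature := range mapping {
    fmt.Println(part, "->", signature.Name, signature.Email)
  }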

Types

type BlobCache

type BlobCache struct {
	IgnoreMissingSubmodules bool
	// contains filtered or unexported fields
}

func (*BlobCache) Consume

func (self *BlobCache) Consume(deps map[string]interface{}) (map[string]interface{}, error)

func (*BlobCache) Finalize

func (cache *BlobCache) Finalize() interface{}

func (*BlobCache) Initialize

func (cache *BlobCache) Initialize(repository *git.Repository)

func (*BlobCache) Name

func (cache *BlobCache) Name() string

func (*BlobCache) Provides

func (cache *BlobCache) Provides() []string

func (*BlobCache) Requires

func (cache *BlobCache) Requires() []string

type BurndownAnalysis

type BurndownAnalysis struct {
	// Granularity sets the size of each band - the number of days it spans.
	// Smaller values provide better resolution but require more work and eat more
	// memory. 30 days is usually enough.
	Granularity int
	// Sampling sets how detailed is the statistic - the size of the interval in
	// days between consecutive measurements. It is usually a good idea to set it
	// <= Granularity. Try 15 or 30.
	Sampling int

	// TrackFiles enables or disables the fine-grained per-file burndown analysis.
	// It does not change the top level burndown results.
	TrackFiles bool

	// The number of developers for which to collect the burndown stats. 0 disables it.
	PeopleNumber int

	// Debug activates the debugging mode. The analysis runs slower in this mode
	// but it accurately checks all the intermediate states for invariant
	// violations.
	Debug bool
	// contains filtered or unexported fields
}

BurndownAnalysis allows gathering the line burndown statistics for a Git repository.

func (*BurndownAnalysis) Consume

func (analyser *BurndownAnalysis) Consume(deps map[string]interface{}) (map[string]interface{}, error)

func (*BurndownAnalysis) Finalize

func (analyser *BurndownAnalysis) Finalize() interface{}

Finalize() returns the list of snapshots of the cumulative line edit times and the similar lists for every file which is alive in HEAD. The number of snapshots (the first dimension of [][]int64) depends on BurndownAnalysis.Sampling - the larger the Sampling, the fewer the snapshots; the length of each snapshot depends on BurndownAnalysis.Granularity - the larger the Granularity, the shorter the snapshot. For example, a repository which is 900 days old yields roughly 30 snapshots of about 30 bands each with Sampling 30 and Granularity 30.

func (*BurndownAnalysis) Initialize

func (analyser *BurndownAnalysis) Initialize(repository *git.Repository)

func (*BurndownAnalysis) Name

func (analyser *BurndownAnalysis) Name() string

func (*BurndownAnalysis) Provides

func (analyser *BurndownAnalysis) Provides() []string

func (*BurndownAnalysis) Requires

func (analyser *BurndownAnalysis) Requires() []string

type BurndownResult

type BurndownResult struct {
	GlobalHistory   [][]int64
	FileHistories   map[string][][]int64
	PeopleHistories [][][]int64
	PeopleMatrix    [][]int64
}

type Couples

type Couples struct {
	// The number of developers for which to build the matrix. 0 disables this analysis.
	PeopleNumber int
	// contains filtered or unexported fields
}

func (*Couples) Consume

func (couples *Couples) Consume(deps map[string]interface{}) (map[string]interface{}, error)

func (*Couples) Finalize

func (couples *Couples) Finalize() interface{}

func (*Couples) Initialize

func (couples *Couples) Initialize(repository *git.Repository)

func (*Couples) Name

func (couples *Couples) Name() string

func (*Couples) Provides

func (couples *Couples) Provides() []string

func (*Couples) Requires

func (couples *Couples) Requires() []string

type CouplesResult

type CouplesResult struct {
	PeopleMatrix []map[int]int64
	PeopleFiles  [][]int
	FilesMatrix  []map[int]int64
	Files        []string
}

type DaysSinceStart

type DaysSinceStart struct {
	// contains filtered or unexported fields
}

func (*DaysSinceStart) Consume

func (days *DaysSinceStart) Consume(deps map[string]interface{}) (map[string]interface{}, error)

func (*DaysSinceStart) Finalize

func (days *DaysSinceStart) Finalize() interface{}

func (*DaysSinceStart) Initialize

func (days *DaysSinceStart) Initialize(repository *git.Repository)

func (*DaysSinceStart) Name

func (days *DaysSinceStart) Name() string

func (*DaysSinceStart) Provides

func (days *DaysSinceStart) Provides() []string

func (*DaysSinceStart) Requires

func (days *DaysSinceStart) Requires() []string

type File

type File struct {
	// contains filtered or unexported fields
}

File encapsulates a balanced binary tree to store line intervals and a cumulative mapping of values to the corresponding length counters. Users are not supposed to create File-s directly; instead, they should call NewFile(). NewFileFromTree() is the special constructor which is useful in tests.

Len() returns the number of lines in File.

Update() mutates File by introducing tree structural changes and updating the length mapping.

Dump() writes the tree to a string and Validate() checks the tree integrity.

func NewFile

func NewFile(time int, length int, statuses ...Status) *File

NewFile initializes a new instance of File struct.

time is the starting value of the first node;

length is the starting length of the tree (the key of the second and the last node);

statuses are the attached interval length mappings.

func NewFileFromTree

func NewFileFromTree(keys []int, vals []int, statuses ...Status) *File

NewFileFromTree is an alternative constructor for File which is used in tests. The resulting tree is validated with Validate() to ensure the initial integrity.

keys is a slice with the starting tree keys.

vals is a slice with the starting tree values. Must match the size of keys.

statuses are the attached interval length mappings.

func (*File) Dump

func (file *File) Dump() string

Dump formats the underlying line interval tree into a string. Useful for error messages, panic()-s and debugging.

func (*File) Len

func (file *File) Len() int

Len returns the File's size - that is, the maximum key in the tree of line intervals.

func (*File) Status

func (file *File) Status(index int) interface{}

func (*File) Update

func (file *File) Update(time int, pos int, ins_length int, del_length int)

Update modifies the underlying tree to adapt to the specified line changes.

time is the time when the requested changes are made. Sets the values of the inserted nodes.

pos is the index of the line at which the changes are introduced.

ins_length is the number of inserted lines after pos.

del_length is the number of removed lines after pos. Deletions come before the insertions.

The code inside this function is probably the most important one throughout the project. It is extensively covered with tests. If you find a bug, please add the corresponding case in file_test.go.
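A minimal sketch of the calls described above (the numbers are invented):

  file := hercules.NewFile(0, 100) // 100 lines, all attributed to time 0
  file.Update(1, 20, 10, 5)        // at time 1, line 20: remove 5 lines, insert 10
  // file.Len() is now 105; Dump() and Validate() help with debugging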

func (*File) Validate

func (file *File) Validate()

Validate checks the underlying line interval tree integrity. The checks are as follows:

1. The minimum key must be 0 because the first line index is always 0.

2. The last node must carry TreeEnd value. This is the maintained invariant which marks the ending of the last line interval.

3. Node keys must monotonically increase and never duplicate.

type FileDiff

type FileDiff struct {
}

FileDiff calculates the difference of files which were modified.

func (*FileDiff) Consume

func (diff *FileDiff) Consume(deps map[string]interface{}) (map[string]interface{}, error)

func (*FileDiff) Finalize

func (diff *FileDiff) Finalize() interface{}

func (*FileDiff) Initialize

func (diff *FileDiff) Initialize(repository *git.Repository)

func (*FileDiff) Name

func (diff *FileDiff) Name() string

func (*FileDiff) Provides

func (diff *FileDiff) Provides() []string

func (*FileDiff) Requires

func (diff *FileDiff) Requires() []string

type FileDiffData

type FileDiffData struct {
	OldLinesOfCode int
	NewLinesOfCode int
	Diffs          []diffmatchpatch.Diff
}

type FileGetter

type FileGetter func(path string) (*object.File, error)

type IdentityDetector

type IdentityDetector struct {
	// Maps email || name  -> developer id.
	PeopleDict map[string]int
	// Maps developer id -> description
	ReversePeopleDict []string
}

func (*IdentityDetector) Consume

func (self *IdentityDetector) Consume(deps map[string]interface{}) (map[string]interface{}, error)

func (*IdentityDetector) Finalize

func (id *IdentityDetector) Finalize() interface{}

func (*IdentityDetector) GeneratePeopleDict

func (id *IdentityDetector) GeneratePeopleDict(commits []*object.Commit)

func (*IdentityDetector) Initialize

func (id *IdentityDetector) Initialize(repository *git.Repository)

func (*IdentityDetector) LoadPeopleDict

func (id *IdentityDetector) LoadPeopleDict(path string) error

func (*IdentityDetector) Name

func (id *IdentityDetector) Name() string

func (*IdentityDetector) Provides

func (id *IdentityDetector) Provides() []string

func (*IdentityDetector) Requires

func (id *IdentityDetector) Requires() []string

type Pipeline

type Pipeline struct {
	// OnProgress is the callback which is invoked in Run() to report its
	// progress. The first argument is the number of processed commits and the
	// second is the total number of commits.
	OnProgress func(int, int)
	// contains filtered or unexported fields
}

func NewPipeline

func NewPipeline(repository *git.Repository) *Pipeline

func (*Pipeline) AddItem

func (pipeline *Pipeline) AddItem(item PipelineItem)

func (*Pipeline) Commits

func (pipeline *Pipeline) Commits() []*object.Commit

Commits returns the critical path in the repository's history. It starts from HEAD and traces commits backwards till the root. When it encounters a merge (more than one parent), it always chooses the first parent.

func (*Pipeline) Initialize

func (pipeline *Pipeline) Initialize()

func (*Pipeline) RemoveItem

func (pipeline *Pipeline) RemoveItem(item PipelineItem)

func (*Pipeline) Run

func (pipeline *Pipeline) Run(commits []*object.Commit) (map[PipelineItem]interface{}, error)

Run executes the pipeline.

commits is a slice with the sequential commit history. It shall start from the root (ascending order).
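When the commit order comes from a file (as with the git rev-list | tac pipeline in the usage section), LoadCommitsFromFile can build the slice. A sketch ("commits.txt" is a hypothetical file, assumed to contain one hash per line, oldest first):

  commits, err := hercules.LoadCommitsFromFile("commits.txt", repository)
  if err != nil {
    panic(err)
  }
  result, err := pipeline.Run(commits)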

type PipelineItem

type PipelineItem interface {
	// Name returns the name of the analysis.
	Name() string
	// Provides returns the list of keys of reusable calculated entities.
	// Other items may depend on them.
	Provides() []string
	// Requires returns the list of keys of needed entities which must be supplied in Consume().
	Requires() []string
	// Initialize prepares and resets the item. Consume() requires Initialize()
	// to be called at least once beforehand.
	Initialize(*git.Repository)
	// Consume processes the next commit.
	// deps contains the required entities which match Requires(). Besides, it always includes
	// "commit" and "index".
	// Returns the calculated entities which match Provides().
	Consume(deps map[string]interface{}) (map[string]interface{}, error)
	// Finalize returns the result of the analysis.
	Finalize() interface{}
}
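For illustration, a minimal item satisfying the interface (CommitCounter is a hypothetical example which requires and provides nothing and simply counts the consumed commits):

  type CommitCounter struct {
    commits int
  }

  func (cc *CommitCounter) Name() string       { return "CommitCounter" }
  func (cc *CommitCounter) Provides() []string { return []string{} }
  func (cc *CommitCounter) Requires() []string { return []string{} }
  func (cc *CommitCounter) Initialize(repository *git.Repository) {
    cc.commits = 0
  }
  func (cc *CommitCounter) Consume(deps map[string]interface{}) (map[string]interface{}, error) {
    // deps always contains "commit" and "index"; this item ignores them.
    cc.commits++
    return map[string]interface{}{}, nil
  }
  func (cc *CommitCounter) Finalize() interface{} { return cc.commits }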

type RenameAnalysis

type RenameAnalysis struct {
	// SimilarityThreshold adjusts the heuristic to determine file renames.
	// It has the same units as cgit's -X rename-threshold or -M. Better to
	// set it to the default value of 90 (90%).
	SimilarityThreshold int
	// contains filtered or unexported fields
}

func (*RenameAnalysis) Consume

func (ra *RenameAnalysis) Consume(deps map[string]interface{}) (map[string]interface{}, error)

func (*RenameAnalysis) Finalize

func (ra *RenameAnalysis) Finalize() interface{}

func (*RenameAnalysis) Initialize

func (ra *RenameAnalysis) Initialize(repository *git.Repository)

func (*RenameAnalysis) Name

func (ra *RenameAnalysis) Name() string

func (*RenameAnalysis) Provides

func (ra *RenameAnalysis) Provides() []string

func (*RenameAnalysis) Requires

func (ra *RenameAnalysis) Requires() []string

type Status

type Status struct {
	// contains filtered or unexported fields
}

Status is something we would like to update during File.Update().

func NewStatus

func NewStatus(data interface{}, update func(interface{}, int, int, int)) Status

type TreeDiff

type TreeDiff struct {
	// contains filtered or unexported fields
}

func (*TreeDiff) Consume

func (treediff *TreeDiff) Consume(deps map[string]interface{}) (map[string]interface{}, error)

func (*TreeDiff) Finalize

func (treediff *TreeDiff) Finalize() interface{}

func (*TreeDiff) Initialize

func (treediff *TreeDiff) Initialize(repository *git.Repository)

func (*TreeDiff) Name

func (treediff *TreeDiff) Name() string

func (*TreeDiff) Provides

func (treediff *TreeDiff) Provides() []string

func (*TreeDiff) Requires

func (treediff *TreeDiff) Requires() []string

Directories

Path Synopsis
cmd/hercules - Package main provides the command line tool to gather the line burndown statistics from Git repositories.
pb - Package pb is a generated protocol buffer package.
