Documentation ¶
Overview ¶
Package git implements derivation of a file graph from git log and optionally from the file structure.
Change-log-based distance ¶
This distance is derived from the probability that two files appeared in the same commit. The core idea is that relevant files tend to be modified together.
Distance(x, y) = -log(P(x is relevant to y)) P(x is relevant to y) := sum(1/(len(c.files)-1) for c in y.commits if x in c.files) / len(y.commits)
or in English,
- the distance from x to y is -log of the probability that x is relevant to y
- x is relevant to y if x is likely to appear in a commit that touches y, where the commit is chosen randomly and independently.
There are more nuances to this formula, e.g. file removals are not counted towards len(commit.files), and commits with len(files) = 1 or len(files) > limit are ignored. File renames are also taken care of.
Note that this formula penalizes large commits. The more files are modified in a commit, the weaker is the strength of its signal.
This graph defines distance only between files, and not directories.
File-structure-based distance ¶
This distance is derived from the file structure. It is the number of edges between two files in the *file tree*, i.e. the number of hops one has to make to navigate from one file to another in the file tree. For example, given the following file stucture:
// ├── foo/ │ ├── bar.cc │ └── baz.cc └── qux.cc
The distance between //foo/bar.cc and //foo/baz.cc is 2 because they have a common parent, and the distance between //foo/bar.cc and and //qux.cc is 3 because the path goes through the root.
The file-structured-based distance can compensate for the weakness of the change-log-based distance.
This distance formula is disabled by default, and can be enabled in EdgeReader.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type EdgeReader ¶
type EdgeReader struct { // ChangeLogDistanceFactor is the multiplier for distances derived from the // git log. If both ChangeLogDistanceFactor and FileStructureDistanceFactor // are zero, then ChangeLogDistanceFactor defaults to 1. If zero then // change-log-based edges are ignored. ChangeLogDistanceFactor float64 // FileStructureDistanceFactor is the multiplier for distances derived from // the file structure. If zero, then such edges are not reported. FileStructureDistanceFactor float64 }
EdgeReader implements filegraph.EdgeReader. It works only with nodes returned by Graph.Node().
type Graph ¶
type Graph struct { // Commit is the git commit that the graph state corresponds to. Commit string // contains filtered or unexported fields }
Graph is a file graph based on the git history.
The graph represents aggregated history of all file changes in the repo, rather than the state of the repo at a single point of time. In particular, the graph may include nodes for files that no longer exist. It is generally not possible to tell if a node is a file or a directory, because it might have been a file at one point of time, and a directory at another.
TODO(nodir): introduce a decay function to remove old nodes/edges.
func Load ¶
Load returns a file graph for a git repository. Caches the graph under the .git directory. May take minutes and log progress if the cache is cold.
If the cache exists, but no longer matches the current ref commit, then applies new changes to the loaded graph and updates the cache.
func (*Graph) Node ¶
Node returns a node by its name. Returns nil if the node is not found or if the name is empty. See also Node.Name().
Idempotent: calling many times with the same name returns the same Node object.
func (*Graph) Update ¶
Update updates the graph based on changes in a git repository. This is the only way to mutate the Graph. Applies all changes reachable from rev, but not from g.Commit, and updates g.Commit.
If returns an error which wasn't returned by the callback, then it is possible that the graph is corrupted.
func (*Graph) Write ¶
Write writes the graph to w. It is the opposite of (*Graph).Read().
Spec:
graph = header version git-commit-hash root total-number-of-edges root-edges header = 54 version = 0 root = node node = prob-sum-denominator number-of-children children-sorted-by-base-name children-sorted-by-base-name = child* child = base-name node root-edges = node-edges node-edges = number-of-edges edge* edge = index-of-the-adjacent-node-as-found-in-the-file prob-sum edges-of-children-sorted-by-base-name edges-of-children-sorted-by-base-name = edge* where all integer types are encoded as varint all strings are encoded as length-prefixed utf8 `*` means "0 or more"
type LoadOptions ¶
type LoadOptions struct { UpdateOptions // Ref is the git ref to load the graph for. // Defaults to refs/heads/main. // // If it is refs/heads/main, but it does not exist, then falls back to // refs/heads/master. Ref string }
LoadOptions are options for Load() function.
type SelectionStrategy ¶
type SelectionStrategy struct { Graph *Graph EdgeReader *EdgeReader // MaxDistance decides whether a test is to be selected: if it is closer or // equal than MaxDistance, then it is selected. Otherwise, skipped. // // Ignored by SelectEval. MaxDistance float64 // OnTestNotFound is called when a test file is not found in the filegraph and // is not among changed files. If nil, then the file name is logged. // // Ignored by Select. OnTestNotFound func(ctx context.Context, tv *evalpb.TestVariant) }
SelectionStrategy implements a selection strategy based on a git graph.
func (*SelectionStrategy) Select ¶
func (s *SelectionStrategy) Select(changedFiles []string, skipFile func(name string) (keepGoing bool))
Select calls skipTestFile for each test file that should be skipped. Does not skip files that it does not know about.
func (*SelectionStrategy) SelectEval ¶
SelectEval implements eval.Strategy. It can be used to evaluate data quality of the graph. It is a version of Select specifically for evaluation.
This function has minimal input validation. It is not designed to be called by the evaluation framework directly. Instead it should be wrapped with another strategy function that does the validation. In particular, this function does not check in.ChangedFiles[i].Repo and does not check for file patterns that must be exempted from RTS.
type UpdateOptions ¶
type UpdateOptions struct { // Callback, if not nil, is called each time after each commit is processed // and Graph.Commit is updated. Callback func() error // MaxCommitSize is the maximum number of files touched by a commit. // Commits that exceed this limit are ignored. // The rationale is that large commits provide a weak signal of file // relatedness and are expensive to process, O(N^2). MaxCommitSize int }
UpdateOptions are options for Graph.Update().