README ¶
Removal of package cycles
This package contains functions, algorithms, data structures and tools required to remove all package cycles from a conjure definition. This is required because with conjure-go v6, each conjure package generates a Go package and the Go compiler does not admit package cycles.
This document outlines the goals and constraints of this process, the algorithm for removing cycles and the proof for correctness and optimality for the algorithm within the constraints.
Glossary
- A type is either a conjure object (whether it's an alias for a primitive type, a struct or an union),
error or service. It is denoted as a tuple
(name, package)
. - A conjure package is a package according to the original conjure definition.
- A Go package is a package according to the intermediary conjure definition (that is, it's ready to be compiled to Go).
- A graph is a set of nodes and edges
that connect these nodes.
The number of nodes and edges is normally represented by the variable names
n
andm
, respectively. - A node or vertex is a unit of a graph.
We'll normally represent nodes with the variable names
u
andv
. - An edge connects two nodes on a graph, that is, between two nodes, there is either an edge or there isn't.
- A partition of a graph is a grouping of nodes such as each node of the graph is part of exactly one group. A partition is deemed minimal according to some condition if it minimizes the number of groups according to the condition.
- A directed graph is a graph where all edges have a direction. When an edge has a direction, the edge defines exactly one source node and one destination node. In this case we say there is an edge from the souce node to the destination node.
- The type graph of a conjure definition is a graph where each type represents a node in the graph.
If the type of node
u
is an object or union that references a type of nodev
, then there is a directed edge fromu
tov
. - A path on a directed graph is an ordered list of nodes such that you can go from one to the next traversing edges whose source is the current node and the destination is the next node.
- A cycle on a directed graph is a path with at least 2 nodes that starts and ends in the same node.
- A directed acyclic graph (a.k.a. DAG) is a directed graph with no cycles.
- A strongly connected component (a.k.a. SCC) is a maximal subset of nodes in a directed graph such that two nodes belong to the same SCC if and only if there is a path from one to another. In other words, they are part of a cycle. They are maximal in a sense that an SCC cannot contain another SCC. SCCs define a partition of a graph, that is, a node can only be part of a single SCC and there is only a single possible SCC partition.
- The strongly connected component graph (a.k.a. SCC graph) is a graph of strongly connected components.
If there is an edge from node
u
to nodev
in the original graph, then eitheru
andv
are part of the same SCC or there is an edge from the SCC ofu
to the SCC ofv
in the SCC graph. The SCC graph is a DAG because if there were any cycles in the SCC graph, the SCCs would not be maximal. - A package set is a set of conjure packages. Two package sets are said to be equal if and only if their sets are equal, that is, if one set contains a package, the other contains that package as well. The package set of an SCC is the set of all packages of the types that make up the SCC in the type graph.
- A topological order of a DAG is an ordered list of all nodes
such that if there is a path from
u
tov
, thenu
appears beforev
in the order. There can be more than one topological orders.
Goals and constraints
In conjure, types aren't just standalone concepts - they are grouped by what is called "conjure packages". Each type is part of exactly one conjure package. This is used to create a logical grouping of concepts that help organize the generated code. When compiling a conjure definition to a specific language, conjure packages are often translated into that language's own concept of package.
The problem is that there are constraints that a specific language applies to its package structure that are not valid per conjure's package definition. One such constraint is that the Go compiler does not admit import cycles, that is, a file in a package cannot import another package that also directly or indirectly imports the current package. There is no such restriction in the conjure language that conjure packages cannot form cycles, which results in some conjure definitions generating Go code that does not compile.
The ultimate goal of the algorithm proposed here is to create an intermediary conjure definition during the generation of Go code in order to remove any conjure package cycles.
One trivial solution would be to compile all conjure packages into a single Go package. However, this would represent a major compile break for all conjure definitions that currently do not have conjure package cycles, which we believe are the majority of conjure definitions used by Palantir. We'd like to both (1) not break definitions that do not have conjure package cycles and (2) do a reasonably minimal change to the conjure definitions that do.
We may consider doing a major revamp of conjure-go that includes major compile breaks, but this is out of scope for the short term and would require a major version bump for conjure-go v6, which at the time of writing this document is not in the plans.
Given that, the algorithm proposed below to remove conjure package cycles is also designed to respect the following constraints:
- If there are no package cycles in the conjure definition, then the definition remains unchanged.
- Types cannot be moved between conjure packages of the original definition. In other words, the Go package a type ends up in is either the original conjure package the type came from or a brand new one that is generated by the algorithm.
- The number of new packages in the intermediary conjure definition is minimal respecting the constraints above. The minimum number of new packages may not be optimal.
- The algorithm is stable, that is, if the conjure definition changes and no new conjure package cycles are introduced, the Go package structure stays the same. In addition, if a new package cycle is introduced, a reasonably minimal change is done.
Note that we'll not guarantee nor make a best effort to minimize the number of types that are moved out of their original packages. The algorithm that minimizes the number of additional Go package gives little room for optimization in that direction. We expect this to be fine since we don't expect there to be complex package cycles that admit multiple solutions. In addition, the problem of minimizing the number of types moved out of their package is believed to be NP-hard, though we don't have formal proof.
The algorithm
Summary
There are two types of cycles that can be present in a conjure definition:
- Non-type cycles are cycles where there is no cycle in the conjure definition's type graph, but cycles would be created if all nodes of the same conjure package would be merged.
- Type cycles are cycles in the type graph such that there is more than one conjure package observed over the types that make up the cycle.
The general idea of the algorithm is to break non-type cycles by splitting conjure packages and break type cycles by moving all types that are part of the cycle into a brand-new package. In addition, we'll do a best effort to merge the resulting packages while respecting the constraints above and not re-creating the cycles we just broke.
The algorithm is then divided into five steps:
- Build the type graph of the conjure definition.
- Calculate the strongly connected components of the type graph.
- Calculate a minimal partition of the SCC graph such that two SCCs can be part of the same group if (1) they have the same package set and (2) there is no path in the SCC graph from one to the other that crosses an SCC with a different package set.
- For each group in the partition, assign a unique Go package name that is derived from the package set of the SCCs of the group.
- For each type in the conjure definition, find its SCC from step 2, then find the group it landed in step 3 and modify the package to the unique package name assigned to the group in step 4. Deduplicate type names within the same Go package as required. This yields the intermediary conjure definition.
The overall runtime complexity is O(n^2/64 + ng + m)
where n
is the number of types, g
is the minimum number of
Go packages and m
is the number of edges in the type graph.
Build the type graph of the conjure definition
This is pretty straightfoward. The conjure definition is read and for each type that is ever declared or referenced, a new node is added to the type graph.
The edges of the graph are built according to the following rules:
- If the type represented by node
u
is an alias to another type represented by nodev
, an edge is added fromu
tov
. - If the type represented by node
u
is an object that references another type represented by nodev
in one of its fields, an edge is added fromu
tov
. - If the type represented by node
u
is an union that can be another type represented by nodev
, an edge is added fromu
tov
. - If the type represented by node
u
is an alias to a primitive type or an enum, it contains no outgoing edges. - If the type represented by node
u
is an alias to a primitive data structure that references another type represented by nodev
(for example,list<v>
,optional<v>
orset<v>
), an edge is added fromu
tov
. - If the type represented by node
u
is an external reference or falls back to a type represented by nodev
, an edge is added fromu
tov
. - If the type represented by node
u
is an error that admits an object of a type represented by nodev
as a safe or unsafe param, an edge is added fromu
tov
. - If the type represented by node
u
is a service that admits an object of a type represented by nodev
as a parameter, return value or marker, an edge is added fromu
tov
.
The type graph can be built in O(n+m)
runtime and memory.
Calculate strongly connected components
Why do we use strongly connected components? From the constraints of the Go compiler, it is imperative that all types within a type cycle belong to the same Go package. Because of constraint number 2, unless all types within a type cycle are already part of the same conjure package, they all need to be moved into a brand-new package. If a type belongs to more than one cycle, the types of all such cycles must be moved into the same brand-new package.
In conclusion, for each type u
, the maximal set of types that can reach u
and can be reached by u
in the type
graph must belong to the same Go package as u
. That is exactly the definition of strongly connected components,
so the problem is essentially to look for strongly connected components in order to form new packages. Thinking about
the problem in terms of strongly connected components also introduces a lot of algorithms and theorems known from
graph theory literature.
Each SCC is also assigned a package set that is the union of all packages of the types that make up the SCC.
In order to calculate the strongly connected components we use
Kosaraju's algorithm, which runs in O(n+m)
runtime and memory.
Partition strongly connected components
At this point, if we assign a Go package to each strongly connected component, we'd already have a valid solution for the ultimate goal. However, it would not comply with constraints 1 and 3 because if two SCCs originate from the same package set and there are no paths from one SCC to another that crosses an SCC with a different package set, they can be assigned with same Go package without creating a Go import cycle.
In other words, we have to calculate a minimal partition (i.e. minimum number of groups) of the SCC graph such that two SCCs can be part of the same group if (1) they have the same package set and (2) there is no path in the SCC graph from one to the other that crosses an SCC with a different package set.
Calculate disallowed dependencies
First, we need to be able to verify if two SCCs can be part of the same group efficiently.
Let su
be an SCC. Let dependencies(su)
be the set of all SCCs that are reachable by su
in the SCC graph.
Let disallowed(su)
be the subset of dependencies(su)
of the SCCs that cannot be grouped with su
,
that is, sv
belongs to disallowed(su)
if and only if there is a path from su
to sv
that crosses an SCC with
a package set different from su
's.
Checking if two SCCs su
and sv
can be grouped together simply means checking if the package set of su
and sv
are equal, su
is not in disallowed(sv)
and sv
is not in disallowed(su)
.
dependencies(su)
can be calculated by the following recursive algorithm: Let sv
be a direct dependent of su
,
that is, there is an edge from su
to sv
in the SCC graph. sv
is added to dependencies(su)
.
Because dependencies are transitive, dependencies(sv)
merges in to dependencies(su)
.
disallowed(su)
can be calculated by the following recursive algorithm: Let sv
be a direct dependent of su
,
that is, there is an edge from su
to sv
in the SCC graph.
- If
su
andsv
have different package sets, all dependencies ofsv
have a path fromsu
to them that crosses an SCC of a different package set thansu
's. Therefore,dependencies(sv)
merges in todisallowed(su)
andsv
is added todisallowed(su)
. - If
su
andsv
have the same package set andsw
is an SCC indisallowed(sv)
, then there is a path fromsv
tosw
that crosses an SCC of a different package set thansv
, which means that there is a path fromsu
tosw
that crosses an SCC of a different package set thansu
, sosw
is part ofdisallowed(su)
and thereforedisallowed(sv)
merges in todisallowed(su)
.
Since the merge set operation is O(n)
, this recursive algorithm takes O(n^2 + m)
runtime and O(n^2)
memory with
the merging sets being the bottleneck. We can, however, use the bitset
data structure to represent dependencies
and disallowed
and conduct the O(n)
merges and store the sets
with a multiplicative factor of 1/64
.
From this point on, we conduct the following greedy algorithm:
- Create "buckets" for each package set. Buckets are a set of SCCs and cannot be empty.
- Process SCCs in reverse topological order in the SCC graph.
- This is done because it helps group SCCs according to their depth in the SCC graph.
- For an SCC
su
, look at all the existing buckets for its package set.- For each bucket, pick the first SCC in it (say
sv
) and check ifsu
can be merged withsv
. If it can, putsu
in it and stop iterating over buckets. - If there are none, create a new bucket for
su
's package set and putsu
in it.
- For each bucket, pick the first SCC in it (say
- When two SCCs are merged, we must update the
dependencies
anddisallowed
sets of all SCCs that can reach the merged nodes. This is done by traversing the reverse graph of the SCC graph and updating the sets using the same recursive algorithm as above.
The algorithm above runs in worst case O(n^3 + m)
as a dfs on the reverse SCC graph is required for each SCC to
recompute the dependencies
and disallowed
sets of all SCCs that can reach the merging SCC.
The bitset optimization also applies here, which enables the algorithm to run with a multiplicative factor of 1/64
.
An important note is that when nodes are merged, new edges are introduced in the SCC graph, which means that m
can
increase up to the order of n
. In practice, however, it is expected that most merges will result in small graph
searches as we can do pruning on the search.
The partition of the SCC graph is simply the grouping by buckets.
Constraint 1 is satisfied because if the graph has no package cycles, all SCCs of a package are able to be merged
as disallowed(su)
will never contain an SCC of the same package as su
.
Constraint 3 is satisfied in a best effort manner. That is, the algorithm is not optimal, but a greedy approach is
used to merge as much SCCs as possible.
In the image above, we see the conversion from the SCC graph to a simplified disallowed graph.
The edges are defined by: su
has an edge to sv
if and only if they have the same package set
and sv
is in disallowed(su)
. The disallowed DAG is a DAG because there is an edge from su
to sv
only if sv
is part of disallowed(su)
, which is a subset of dependencies(su)
and the dependencies
graph is a DAG as a corollary of the SCC graph being a DAG.
Node 1 is disallowed to merge with node 3 because there is a path from 1 to 3 that crosses node 5, which has a different package set. Node 3 is disallowed to merge with node 4 because there is a path from 3 to 4 that crosses node 6, which has a different package set. Node 5 is disallowed to merge with node 6 because there is a path from 5 to 6 that crosses node 3, which has a different package set. Node 2 is disallowed to merge with node 4 because there is a path from 2 to 4 that crosses node 7, which has a different package set.
Assign Go package names
Once we have the SCC partition, we can merge all the SCCs in the same group to form a single Go package. By the conditions of the partition, two SCCs are selected to be grouped only if their package set is equal. Because of this, the name of the final Go package only needs to be a function of the package set.
Conjure packages are named with multiple terms separated by periods. Each term translates to a directory in the resulting Go code with the final word being the Go package name.
The algorithm for calculating a reasonable merged package name is defined as follows:
- Calculate the maximum word prefix for all merging conjure packages in the package set. That is, the maximum prefix common to all packages that ends with a period.
- For each merging conjure package name, remove the prefix and any other period.
- Merge the merging conjure package names into a single string in lexicographical order. Separate by underscore
_
. - Prepend the common prefix.
For example,
[
"com.palantir.foo",
"com.palantir.bar"
]
becomes
"com.palantir.bar_foo"
and
[
"com.palantir.api",
"com.palantir.root.foo",
"com.palantir.root.bar"
]
becomes
"com.palantir.api_rootbar_rootfoo"
In addition, we'll need to de-duplicate any types with conflicting names in the merging conjure packages. We'll do it by prepending the original conjure package name to the type name if a type is present more than one time in the Go package. If we've already seen a type of the same name, we append number starting from 1 and increasing each time.
For example,
[
"com.palantir.a:Foo",
"com.palantir.b:Foo"
]
becomes
[
"com.palantir.a_b:AFoo",
"com.palantir.a_b:BFoo"
]
Create the intermediary definition
Finally, we can build a map from the old tuple (package,name)
to a new tuple (package,name)
using the modifications described in the step above. If more than one Go package shares the same package name and path,
we'll iterate over Go packages in the increasing order of bucket depth and, if we see a package
we've seen before, append a number starting from 1 and increasing each time.
The reason we chose the bucket order is to ensure stability of the algorithm: the assigned package name will only
change if the disallowed DAG changes, and that only happens when a new conjure package cycle is introduced.
For example,
[
"com.palantir.foo",
"com.palantir.foo"
]
becomes
[
"com.palantir.foo",
"com.palantir.foo1"
]
Then, we just need to iterate over the entire definition and apply the map on every reference to a type we see. The definition can then be handed off to the rest of the generation algorithm, which doesn't have to care about any of this.
Test cases
In order to better illustrate and to build maintainable unit tests, 4 test cases were included in the testdata directory with an illustration, expected input and output conjure definitions and the generated code. Let's go over each one of them to explain what is going on.
No cycles
This test case contains a conjure definition with neither type or non-type cycles.
Each node in the type graph yields its own SCC. Each SCC is made up of a unitary package set.
The SCC graph is equal to the type graph.
disallowed
only contains SCCs of different packages, so all SCCs of the same package can be merged.
Because no conjure packages need to be merged, no types need to be renamed.
The resulting definition is exactly the same as the original one.
Cycle within package
This test case contains a conjure definition with a package com.palantir.foo
that contains a cycle of types.
All nodes in the type graph for package com.palantir.foo
are brough to the same SCC.
Every other node gets its own SCC. Each SCC is made up of a unitary package set.
There is exactly one node for each package set in the SCC graph.
disallowed
only contains SCCs of different packages, so all SCCs of the same package can be merged.
Because no conjure packages need to be merged, no types need to be renamed.
The resulting definition is exactly the same as the original one.
Non-type cycle
This test case contains a conjure definition with a non-type cycle, e.g. a conjure package cycle without type cycles.
Each node in the type graph yields its own SCC. Each SCC is made up of a unitary package set.
The SCC graph is equal to the type graph.
There is a path from the SCC of com.palantir.foo:Type3
to the SCC of com.palantir.foo:Type4
that contains
com.palantir.bar:Type1
, which is of a different package.
Because of this, package com.palantir.foo
is broken with one package com.palantir.foo1
containing Type1/3
and com.palantir.foo
containing Type2/4
.
The reason for the naming is that we process the SCCs in reverse topological order.
Alternatively, package com.palantir.bar
could have been the broken one with types
com.palanti.bar:Type1
and com.palantir.bar:Type3
ending up in different packages.
The criteria for breaking up is based on the reverse topological sort: the SCC of com.palantir.bar:Type1
and
of com.palantir.bar:Type3
are merged first, which prevents the SCC of com.palantir.foo:Type3
and
of com.palantir.foo:Type4
from merging.
The resulting definition has one more package.
Non-type cycle on disconnected graph
This test case contains a conjure definition with a non-type cycle, e.g. a conjure package cycle without type cycles. In addition, the dependencies that cause the package cycle are part of distinct connected components.
Each node in the type graph yields its own SCC. Each SCC is made up of a unitary package set. The SCC graph is equal to the type graph.
Initially, disallowed(su)
contains no other SCCs of the same package as su
.
However, when merging the SCCs of com.palantir.bar:Type2
and com.palantir.bar:Type3
, a non-type cycle is formed.
The new dependencies are propagated and disallowed
is updated for the SCCs of com.palantir.foo:Type1
and com.palantir.foo:Type3
, which now contain the SCCs of com.palantir.foo:Type2
and of com.palantir.foo:Type4
.
Because of this, package com.palantir.foo
is broken with one package com.palantir.foo1
containing Type1/3
and com.palantir.foo
containing Type2/4
.
The resulting definition has one more package.
Type cycles
This test case contains a conjure definition with a type cycle.
Types com.palantir.foo:Type2/3/4
and com.palantir.bar:Type1/2
are grouped to the same SCC with package set
{com.palantir.foo,com.palantir.bar}
. Every other node gets their own SCC.
There is exactly one node for each package set in the SCC graph. disallowed
only contain SCCs of different package
sets for all SCCs, so all SCCs of the same package set are able to be merged.
The SCC that contains package set {com.palantir.foo,com.palantir.bar}
receives the Go package com.palantir.bar_foo
.
Because it contains two types named Type3
, com.palantir.foo:type3
is renamed to com.palantir.barfoo:FooType3
and com.palantir.bar:type3
is renamed to com.palantir.barfoo:BarType3
.
The resulting definition has one more package.
Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func RemovePackageCycles ¶
func RemovePackageCycles(def spec.ConjureDefinition) (spec.ConjureDefinition, error)
RemovePackageCycles modifies the conjure definition in order to remove package cycles in the compiled Go code. Please check the README.md file of this package for information on how this is done.
Types ¶
This section is empty.