nftables

package
v1.31.0-rc.1 Latest
Warning

This package is not in the latest version of its module.

Published: Aug 6, 2024 License: Apache-2.0 Imports: 32 Imported by: 1

README

NFTables kube-proxy

This is an implementation of service proxying via the nftables API of the kernel netfilter subsystem.

General theory of netfilter

Packet flow through netfilter looks something like:

             +================+      +=====================+
             | hostNetwork IP |      | hostNetwork process |
             +================+      +=====================+
                         ^                |
  -  -  -  -  -  -  -  - | -  -  -  -  - [*] -  -  -  -  -  -  -  -  -
                         |                v
                     +-------+        +--------+
                     | input |        | output |
                     +-------+        +--------+
                         ^                |
      +------------+     |   +---------+  v      +-------------+
      | prerouting |-[*]-+-->| forward |--+-[*]->| postrouting |
      +------------+         +---------+         +-------------+
            ^                                           |
 -  -  -  - | -  -  -  -  -  -  -  -  -  -  -  -  -  -  |  -  -  -  -
            |                                           v
       +---------+                                  +--------+
   --->| ingress |                                  | egress |--->
       +---------+                                  +--------+

where [*] represents a routing decision, and all of the boxes except those in the top row represent netfilter hooks. More detailed versions of this diagram can be seen at https://en.wikipedia.org/wiki/Netfilter#/media/File:Netfilter-packet-flow.svg and https://wiki.nftables.org/wiki-nftables/index.php/Netfilter_hooks. Note, however, that in the standard version of this diagram, the top two boxes are squished together into "local process", which (a) fails to make a few important distinctions, and (b) makes it look like a single packet can go input -> "local process" -> output, which it cannot. Note also that the ingress and egress hooks are special and mostly not available to us; kube-proxy lives in the middle section of the diagram, with the five main netfilter hooks.

There are three paths through the diagram, called the "input", "forward", and "output" paths, depending on which of those hooks a packet passes through. Packets coming from host network namespace processes always take the output path. Packets coming in from outside the host network namespace (whether from an external host or from a pod network namespace) arrive via ingress and take the input or forward path, depending on the routing decision made after prerouting: packets destined for an IP that is assigned to a network interface in the host network namespace get routed along the input path, and anything else (including, in particular, packets destined for a pod IP) gets routed along the forward path.
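The path selection described above can be sketched as a small Go function. This is an illustration only: the origin type, its constants, and pathFor are hypothetical names invented for this sketch and are not part of kube-proxy.

```go
package main

import "fmt"

// origin says where a packet enters the diagram. The type and its
// constants are hypothetical labels for this sketch.
type origin int

const (
	fromHostProcess origin = iota // sent by a host-network process
	fromOutside                   // arrived from another host or a pod namespace
)

// pathFor returns which of the three paths a packet takes.
// dstIsHostIP reports whether the destination IP seen by the routing
// decision is assigned to an interface in the host network namespace.
func pathFor(o origin, dstIsHostIP bool) string {
	if o == fromHostProcess {
		// Locally generated traffic always takes the output path.
		return "output"
	}
	if dstIsHostIP {
		return "input"
	}
	// Everything else, including traffic to pod IPs, is forwarded.
	return "forward"
}

func main() {
	fmt.Println(pathFor(fromHostProcess, false))
	fmt.Println(pathFor(fromOutside, true))
	fmt.Println(pathFor(fromOutside, false))
}
```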

kube-proxy's use of nftables hooks

Kube-proxy uses nftables for seven things:

  • Using DNAT to rewrite the destination of traffic from service IPs (cluster IPs, external IPs, load balancer IPs, and NodePorts on node IPs) to the corresponding endpoint IPs.

  • Using SNAT to masquerade traffic as needed to ensure that replies to it will come back to this node/namespace (so that they can be un-DNAT-ed).

  • Dropping packets that are filtered out by the LoadBalancerSourceRanges feature.

  • Dropping packets for services with Local traffic policy but no local endpoints.

  • Rejecting packets for services with no local or remote endpoints.

  • Dropping packets to ClusterIPs which are not yet allocated.

  • Rejecting packets to undefined ports of ClusterIPs.
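As a rough model of the outcomes in this list, the following Go sketch maps a summary of a packet and its target service to a verdict. The packetInfo struct, its fields, and the order of the checks are assumptions made for readability. The sketch models outcomes only, not the hooks or priorities where kube-proxy installs its rules, and it leaves out masquerading, which is applied in addition to DNAT.

```go
package main

import "fmt"

type verdict string

const (
	dnat   verdict = "dnat"
	drop   verdict = "drop"
	reject verdict = "reject"
)

// packetInfo is a hypothetical summary of what the rules match on.
type packetInfo struct {
	toServiceCIDR      bool // destination falls inside a ServiceCIDR
	clusterIPAllocated bool // destination is an allocated ClusterIP
	portDefined        bool // destination port belongs to a service
	sourceAllowed      bool // passes LoadBalancerSourceRanges (true if unset)
	localPolicy        bool // service uses Local traffic policy
	localEndpoints     int
	remoteEndpoints    int
}

// decide returns the outcome for one packet, following the list above.
func decide(p packetInfo) verdict {
	switch {
	case p.toServiceCIDR && !p.clusterIPAllocated:
		return drop // ClusterIP not yet allocated
	case p.clusterIPAllocated && !p.portDefined:
		return reject // undefined port of a ClusterIP
	case !p.sourceAllowed:
		return drop // filtered by LoadBalancerSourceRanges
	case p.localEndpoints+p.remoteEndpoints == 0:
		return reject // no endpoints at all
	case p.localPolicy && p.localEndpoints == 0:
		return drop // Local traffic policy, no local endpoints
	default:
		return dnat
	}
}

func main() {
	healthy := packetInfo{
		toServiceCIDR: true, clusterIPAllocated: true, portDefined: true,
		sourceAllowed: true, localEndpoints: 1,
	}
	fmt.Println(decide(healthy))

	noEndpoints := healthy
	noEndpoints.localEndpoints = 0
	fmt.Println(decide(noEndpoints))
}
```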

This is implemented as follows:

  • We do the DNAT for inbound traffic in prerouting: this covers traffic coming from off-node to all types of service IPs, and traffic coming from pods to all types of service IPs. (We must do this in prerouting, because the choice of endpoint IP may affect whether the packet then gets routed along the input path or the forward path.)

  • We do the DNAT for outbound traffic in output: this covers traffic coming from host-network processes to all types of service IPs. Regardless of the final destination, the traffic will take the "output path". (In the case where a host-network process connects to a service IP that DNATs it to a host-network endpoint IP, the traffic will still initially take the "output path", but then reappear on the "input path".)

  • LoadBalancerSourceRanges firewalling has to happen before service DNAT, so we do that on prerouting and output as well, with a lower (i.e. more urgent) priority than the DNAT chains.

  • The drop and reject rules for services with no endpoints don't need to happen explicitly before or after any other rules (since they match packets that wouldn't be matched by any other rules). But with kernels before 5.9, reject is not allowed in prerouting, so we can't just do them in the same place as the source ranges firewall. So we do these checks from input, forward, and output for @no-endpoint-services and from input for @no-endpoint-nodeports to cover all the possible paths.

  • Masquerading has to happen in the postrouting hook, because "masquerade" means "SNAT to the IP of the interface the packet is going out on", so it has to happen after the final routing decision. (We don't need to masquerade packets that are going to a host network IP, because masquerading is about ensuring that the packet eventually gets routed back to the host network namespace on this node, so if it's never getting routed away from there, there's nothing to do.)

  • We install a reject rule for ClusterIPs matching the @cluster-ips set and a drop rule for ClusterIPs belonging to any of the ServiceCIDRs in the forward and output hooks, with a higher (i.e. less urgent) priority than the DNAT chains, to make sure that all valid traffic directed to ClusterIPs has already been DNATed. The drop rule is only installed if the MultiCIDRServiceAllocator feature is enabled.
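The ordering constraints above come down to base chains on the same hook being evaluated in ascending priority order. The sketch below lists one possible set of base chains and prints the evaluation order per hook. The values -100, 0, and 100 are the standard nftables values for the named priorities dstnat, filter, and srcnat. The chain names, the +/-10 offsets, and the choice of filter priority for the no-endpoints checks are assumptions for illustration, not necessarily what kube-proxy uses.

```go
package main

import (
	"fmt"
	"sort"
)

// Standard nftables values for the named priorities.
const (
	prioDstNAT = -100
	prioFilter = 0
	prioSrcNAT = 100
)

// baseChain describes a chain attached to a netfilter hook.
type baseChain struct {
	name     string // hypothetical chain name
	hook     string
	priority int
}

// chains is an illustrative layout, not kube-proxy's actual rule set.
var chains = []baseChain{
	{"nat-prerouting", "prerouting", prioDstNAT},
	{"nat-output", "output", prioDstNAT},
	{"firewall-prerouting", "prerouting", prioDstNAT - 10}, // before DNAT
	{"firewall-output", "output", prioDstNAT - 10},         // before DNAT
	{"cluster-ips-forward", "forward", prioDstNAT + 10},    // after DNAT
	{"cluster-ips-output", "output", prioDstNAT + 10},      // after DNAT
	{"endpoints-check-input", "input", prioFilter},
	{"endpoints-check-forward", "forward", prioFilter},
	{"endpoints-check-output", "output", prioFilter},
	{"nat-postrouting", "postrouting", prioSrcNAT},
}

// evaluationOrder returns the chains on one hook, lowest priority first.
func evaluationOrder(hook string) []string {
	var matched []baseChain
	for _, c := range chains {
		if c.hook == hook {
			matched = append(matched, c)
		}
	}
	sort.SliceStable(matched, func(i, j int) bool {
		return matched[i].priority < matched[j].priority
	})
	names := make([]string, 0, len(matched))
	for _, c := range matched {
		names = append(names, c.name)
	}
	return names
}

func main() {
	for _, hook := range []string{"prerouting", "input", "forward", "output", "postrouting"} {
		fmt.Println(hook, evaluationOrder(hook))
	}
}
```

On the output hook, for example, this layout runs the source-ranges firewall first, then DNAT, then the ClusterIP reject and drop rules, which matches the constraints listed above.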

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CleanupLeftovers

func CleanupLeftovers(ctx context.Context) bool

CleanupLeftovers removes all nftables rules and chains created by the Proxier. It returns true if an error was encountered. Errors are logged.
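Because the return value is true on failure rather than on success, callers may want to name the result accordingly. A minimal sketch, using a stand-in function type with the same shape as CleanupLeftovers so that it runs without the Kubernetes module; cleanupFunc, run, and the two stubs are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
)

// cleanupFunc has the same shape as CleanupLeftovers: it returns true
// if an error was encountered.
type cleanupFunc func(ctx context.Context) bool

// run is a hypothetical helper that maps the result to an exit code.
func run(cleanup cleanupFunc) int {
	if encounteredError := cleanup(context.Background()); encounteredError {
		return 1
	}
	return 0
}

// Stubs standing in for a clean and a failed cleanup.
func stubClean(context.Context) bool  { return false }
func stubFailed(context.Context) bool { return true }

func main() {
	fmt.Println("clean cleanup exit code:", run(stubClean))
	fmt.Println("failed cleanup exit code:", run(stubFailed))
}
```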

func NewDualStackProxier

func NewDualStackProxier(
	ctx context.Context,
	syncPeriod time.Duration,
	minSyncPeriod time.Duration,
	masqueradeAll bool,
	masqueradeBit int,
	localDetectors map[v1.IPFamily]proxyutil.LocalTrafficDetector,
	hostname string,
	nodeIPs map[v1.IPFamily]net.IP,
	recorder events.EventRecorder,
	healthzServer *healthcheck.ProxierHealthServer,
	nodePortAddresses []string,
	initOnly bool,
) (proxy.Provider, error)

NewDualStackProxier creates a MetaProxier instance, with IPv4 and IPv6 proxies.

Types

type Proxier

type Proxier struct {
	// contains filtered or unexported fields
}

Proxier is an nftables-based proxy.

func NewProxier

func NewProxier(ctx context.Context,
	ipFamily v1.IPFamily,
	syncPeriod time.Duration,
	minSyncPeriod time.Duration,
	masqueradeAll bool,
	masqueradeBit int,
	localDetector proxyutil.LocalTrafficDetector,
	hostname string,
	nodeIP net.IP,
	recorder events.EventRecorder,
	healthzServer *healthcheck.ProxierHealthServer,
	nodePortAddressStrings []string,
	initOnly bool,
) (*Proxier, error)

NewProxier returns a new nftables Proxier. Once a proxier is created, it will keep nftables up to date in the background and will not terminate if a particular nftables call fails.
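The masqueradeBit parameter is not described on this page. Assuming it selects a single bit of the 32-bit packet mark, as in other kube-proxy backends, the mark value could be derived as in the sketch below; masqueradeMark is a hypothetical helper, not a function of this package.

```go
package main

import "fmt"

// masqueradeMark converts a bit index into a single-bit packet mark.
// Assumption: masqueradeBit selects one bit of the 32-bit packet mark.
func masqueradeMark(bit int) (uint32, error) {
	if bit < 0 || bit > 31 {
		return 0, fmt.Errorf("masquerade bit %d out of range [0, 31]", bit)
	}
	return uint32(1) << uint(bit), nil
}

func main() {
	mark, err := masqueradeMark(14)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("mark for bit 14: %#x\n", mark)
}
```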

func (*Proxier) OnEndpointSliceAdd

func (proxier *Proxier) OnEndpointSliceAdd(endpointSlice *discovery.EndpointSlice)

OnEndpointSliceAdd is called whenever creation of a new endpoint slice object is observed.

func (*Proxier) OnEndpointSliceDelete

func (proxier *Proxier) OnEndpointSliceDelete(endpointSlice *discovery.EndpointSlice)

OnEndpointSliceDelete is called whenever deletion of an existing endpoint slice object is observed.

func (*Proxier) OnEndpointSliceUpdate

func (proxier *Proxier) OnEndpointSliceUpdate(_, endpointSlice *discovery.EndpointSlice)

OnEndpointSliceUpdate is called whenever modification of an existing endpoint slice object is observed.

func (*Proxier) OnEndpointSlicesSynced

func (proxier *Proxier) OnEndpointSlicesSynced()

OnEndpointSlicesSynced is called once all the initial event handlers have been called and the state is fully propagated to the local cache.

func (*Proxier) OnNodeAdd

func (proxier *Proxier) OnNodeAdd(node *v1.Node)

OnNodeAdd is called whenever creation of a new node object is observed.

func (*Proxier) OnNodeDelete

func (proxier *Proxier) OnNodeDelete(node *v1.Node)

OnNodeDelete is called whenever deletion of an existing node object is observed.

func (*Proxier) OnNodeSynced

func (proxier *Proxier) OnNodeSynced()

OnNodeSynced is called once all the initial event handlers have been called and the state is fully propagated to the local cache.

func (*Proxier) OnNodeUpdate

func (proxier *Proxier) OnNodeUpdate(oldNode, node *v1.Node)

OnNodeUpdate is called whenever modification of an existing node object is observed.

func (*Proxier) OnServiceAdd

func (proxier *Proxier) OnServiceAdd(service *v1.Service)

OnServiceAdd is called whenever creation of a new service object is observed.

func (*Proxier) OnServiceCIDRsChanged added in v1.30.0

func (proxier *Proxier) OnServiceCIDRsChanged(cidrs []string)

OnServiceCIDRsChanged is called whenever a change is observed in any of the ServiceCIDRs, and provides the complete list of service CIDRs.

func (*Proxier) OnServiceDelete

func (proxier *Proxier) OnServiceDelete(service *v1.Service)

OnServiceDelete is called whenever deletion of an existing service object is observed.

func (*Proxier) OnServiceSynced

func (proxier *Proxier) OnServiceSynced()

OnServiceSynced is called once all the initial event handlers have been called and the state is fully propagated to the local cache.

func (*Proxier) OnServiceUpdate

func (proxier *Proxier) OnServiceUpdate(oldService, service *v1.Service)

OnServiceUpdate is called whenever modification of an existing service object is observed.

func (*Proxier) Sync

func (proxier *Proxier) Sync()

Sync is called to synchronize the proxier state to nftables as soon as possible.

func (*Proxier) SyncLoop

func (proxier *Proxier) SyncLoop()

SyncLoop runs periodic work. This is expected to run as a goroutine or as the main loop of the app. It does not return.
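The calling pattern for Sync and SyncLoop can be sketched with a stand-in type that has the same two methods. The fakeProxier type and its channel-based coalescing are assumptions made for this sketch and do not describe how Proxier is implemented; only the method names and the "run SyncLoop as a goroutine" usage come from the documentation above.

```go
package main

import "fmt"

// syncer is the subset of the Proxier method set used in this sketch.
type syncer interface {
	Sync()
	SyncLoop()
}

// fakeProxier is a hypothetical stand-in for *Proxier.
type fakeProxier struct {
	requests chan struct{} // at most one pending sync request
	synced   chan int      // reports the number of completed syncs
}

var _ syncer = (*fakeProxier)(nil)

func newFakeProxier() *fakeProxier {
	return &fakeProxier{
		requests: make(chan struct{}, 1),
		synced:   make(chan int),
	}
}

// Sync requests a sync without blocking; extra requests coalesce.
func (p *fakeProxier) Sync() {
	select {
	case p.requests <- struct{}{}:
	default: // a sync is already pending
	}
}

// SyncLoop does not return; it performs one sync per pending request.
func (p *fakeProxier) SyncLoop() {
	count := 0
	for range p.requests {
		count++
		p.synced <- count
	}
}

func main() {
	p := newFakeProxier()
	go p.SyncLoop() // run the loop in the background
	p.Sync()        // ask for a sync as soon as possible
	fmt.Println("completed syncs:", <-p.synced)
}
```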
