tailsamplingprocessor

package module
v0.94.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 8, 2024 License: Apache-2.0 Imports: 23 Imported by: 12

README

Tail Sampling Processor

Status
Stability beta: traces
Distributions contrib, aws, grafana, observiq, splunk, sumo
Issues Open issues Closed issues
Code Owners @jpkrohling

The tail sampling processor samples traces based on a set of defined policies. All spans for a given trace MUST be received by the same collector instance for effective sampling decisions. Before performing sampling, spans will be grouped by trace_id. Therefore, the tail sampling processor can be used directly without the need for the groupbytraceprocessor.

This processor must be placed in pipelines after any processors that rely on context, e.g. k8sattributes. It reassembles spans into new batches, causing them to lose their original context.

Please refer to config.go for the config spec.

The following configuration options are required:

  • policies (no default): Policies used to make a sampling decision

Multiple policies exist today and it is straight forward to add more. These include:

  • always_sample: Sample all traces
  • latency: Sample based on the duration of the trace. The duration is determined by looking at the earliest start time and latest end time, without taking into consideration what happened in between. Supplying no upper bound will result in a policy sampling anything greater than threshold_ms.
  • numeric_attribute: Sample based on number attributes (resource and record)
  • probabilistic: Sample a percentage of traces. Read a comparison with the Probabilistic Sampling Processor.
  • status_code: Sample based upon the status code (OK, ERROR or UNSET)
  • string_attribute: Sample based on string attributes (resource and record) value matches, both exact and regex value matches are supported
  • trace_state: Sample based on TraceState value matches
  • rate_limiting: Sample based on rate
  • span_count: Sample based on the minimum and/or maximum number of spans, inclusive. If the sum of all spans in the trace is outside the range threshold, the trace will not be sampled.
  • boolean_attribute: Sample based on boolean attribute (resource and record).
  • ottl_condition: Sample based on given boolean OTTL condition (span and span event).
  • and: Sample based on multiple policies, creates an AND policy
  • composite: Sample based on a combination of above samplers, with ordering and rate allocation per sampler. Rate allocation allocates certain percentages of spans per policy order. For example if we have set max_total_spans_per_second as 100 then we can set rate_allocation as follows
    1. test-composite-policy-1 = 50 % of max_total_spans_per_second = 50 spans_per_second
    2. test-composite-policy-2 = 25 % of max_total_spans_per_second = 25 spans_per_second
    3. To ensure remaining capacity is filled use always_sample as one of the policies

The following configuration options can also be modified:

  • decision_wait (default = 30s): Wait time since the first span of a trace before making a sampling decision
  • num_traces (default = 50000): Number of traces kept in memory.
  • expected_new_traces_per_sec (default = 0): Expected number of new traces (helps in allocating data structures)

Each policy will result in a decision, and the processor will evaluate them to make a final decision:

  • When there's an "inverted not sample" decision, the trace is not sampled;
  • When there's a "sample" decision, the trace is sampled;
  • When there's a "inverted sample" decision and no "not sample" decisions, the trace is sampled;
  • In all other cases, the trace is NOT sampled

An "inverted" decision is the one made based on the "invert_match" attribute, such as the one from the string tag policy.

Examples:

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      [
          {
            name: test-policy-1,
            type: always_sample
          },
          {
            name: test-policy-2,
            type: latency,
            latency: {threshold_ms: 5000, upper_threshold_ms: 10000}
          },
          {
            name: test-policy-3,
            type: numeric_attribute,
            numeric_attribute: {key: key1, min_value: 50, max_value: 100}
          },
          {
            name: test-policy-4,
            type: probabilistic,
            probabilistic: {sampling_percentage: 10}
          },
          {
            name: test-policy-5,
            type: status_code,
            status_code: {status_codes: [ERROR, UNSET]}
          },
          {
            name: test-policy-6,
            type: string_attribute,
            string_attribute: {key: key2, values: [value1, value2]}
          },
          {
            name: test-policy-7,
            type: string_attribute,
            string_attribute: {key: key2, values: [value1, val*], enabled_regex_matching: true, cache_max_size: 10}
          },
          {
            name: test-policy-8,
            type: rate_limiting,
            rate_limiting: {spans_per_second: 35}
         },
         {
            name: test-policy-9,
            type: string_attribute,
            string_attribute: {key: http.url, values: [\/health, \/metrics], enabled_regex_matching: true, invert_match: true}
         },
         {
            name: test-policy-10,
            type: span_count,
            span_count: {min_spans: 2, max_spans: 20}
         },
         {
             name: test-policy-11,
             type: trace_state,
             trace_state: { key: key3, values: [value1, value2] }
         },
         {
              name: test-policy-12,
              type: boolean_attribute,
              boolean_attribute: {key: key4, value: true}
         },
         {
              name: test-policy-13,
              type: ottl_condition,
              ottl_condition: {
                   error_mode: ignore,
                   span: [
                        "attributes[\"test_attr_key_1\"] == \"test_attr_val_1\"",
                        "attributes[\"test_attr_key_2\"] != \"test_attr_val_1\"",
                   ],
                   spanevent: [
                        "name != \"test_span_event_name\"",
                        "attributes[\"test_event_attr_key_2\"] != \"test_event_attr_val_1\"",
                   ]
              }
         },
         {
            name: and-policy-1,
            type: and,
            and: {
              and_sub_policy: 
              [
                {
                  name: test-and-policy-1,
                  type: numeric_attribute,
                  numeric_attribute: { key: key1, min_value: 50, max_value: 100 }
                },
                {
                    name: test-and-policy-2,
                    type: string_attribute,
                    string_attribute: { key: key2, values: [ value1, value2 ] }
                },
              ]
            }
         },
         {
            name: composite-policy-1,
            type: composite,
            composite:
              {
                max_total_spans_per_second: 1000,
                policy_order: [test-composite-policy-1, test-composite-policy-2, test-composite-policy-3],
                composite_sub_policy:
                  [
                    {
                      name: test-composite-policy-1,
                      type: numeric_attribute,
                      numeric_attribute: {key: key1, min_value: 50, max_value: 100}
                    },
                    {
                      name: test-composite-policy-2,
                      type: string_attribute,
                      string_attribute: {key: key2, values: [value1, value2]}
                    },
                    {
                      name: test-composite-policy-3,
                      type: always_sample
                    }
                  ],
                rate_allocation:
                  [
                    {
                      policy: test-composite-policy-1,
                      percent: 50
                    },
                    {
                      policy: test-composite-policy-2,
                      percent: 25
                    }
                  ]
              }
          },
        ]

Refer to tail_sampling_config.yaml for detailed examples on using the processor.

A Practical Example

Imagine that you wish to configure the processor to implement the following rules:

  1. Rule 1: Not all teams are ready to move to tail sampling. Therefore, sample all traces that are not from the team team_a.

  2. Rule 2: Sample only 0.1 percent of Readiness/liveness probes

  3. Rule 3: service-1 has a noisy endpoint /v1/name/{id}. Sample only 1 percent of such traces.

  4. Rule 4: Other traces from service-1 should be sampled at 100 percent.

  5. Rule 5: Sample all traces if there is an error in any span in the trace.

  6. Rule 6: Add an escape hatch. If there is an attribute called app.force_sample in the span, then sample the trace at 100 percent.

Here is what the configuration would look like:

tail_sampling:
  decision_wait: 10s
  num_traces: 100
  expected_new_traces_per_sec: 10
  policies: [
      {
        # Rule 1: use always_sample policy for services that don't belong to team_a and are not ready to use tail sampling
        name: backwards-compatibility-policy,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: services-using-tail_sampling-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: service.name,
                      values:
                        [
                          list,
                          of,
                          services,
                          using,
                          tail_sampling,
                        ],
                      invert_match: true,
                    },
                },
                { name: sample-all-policy, type: always_sample },
              ],
          },
      },
      # BEGIN: policies for team_a
      {
        # Rule 2: low sampling for readiness/liveness probes
        name: team_a-probe,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  # filter by service name
                  name: service-name-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: service.name,
                      values: [service-1, service-2, service-3],
                    },
                },
                {
                  # filter by route
                  name: route-live-ready-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: http.route,
                      values: [/live, /ready],
                      enabled_regex_matching: true,
                    },
                },
                {
                  # apply probabilistic sampling
                  name: probabilistic-policy,
                  type: probabilistic,
                  probabilistic: { sampling_percentage: 0.1 },
                },
              ],
          },
      },
      {
        # Rule 3: low sampling for a noisy endpoint
        name: team_a-noisy-endpoint-1,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: service-name-policy,
                  type: string_attribute,
                  string_attribute:
                    { key: service.name, values: [service-1] },
                },
                {
                  # filter by route
                  name: route-name-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: http.route,
                      values: [/v1/name/.+],
                      enabled_regex_matching: true,
                    },
                },
                {
                  # apply probabilistic sampling
                  name: probabilistic-policy,
                  type: probabilistic,
                  probabilistic: { sampling_percentage: 1 },
                },
              ],
          },
      },
      {
        # Rule 4: high sampling for other endpoints
        name: team_a-service-1,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: service-name-policy,
                  type: string_attribute,
                  string_attribute:
                    { key: service.name, values: [service-1] },
                },
                {
                  # invert match - apply to all routes except the ones specified
                  name: route-name-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: http.route,
                      values: [/v1/name/.+],
                      enabled_regex_matching: true,
                      invert_match: true,
                    },
                },
                {
                  # apply probabilistic sampling
                  name: probabilistic-policy,
                  type: probabilistic,
                  probabilistic: { sampling_percentage: 100 },
                },
              ],
          },
      },
      {
        # Rule 5: always sample if there is an error
        name: team_a-status-policy,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: service-name-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: service.name,
                      values:
                        [
                          list,
                          of,
                          services,
                          using,
                          tail_sampling,
                        ],
                    },
                },
                {
                  name: trace-status-policy,
                  type: status_code,
                  status_code: { status_codes: [ERROR] },
                },
              ],
          },
      },
      {
        # Rule 6:
        # always sample if the force_sample attribute is set to true
        name: team_a-force-sample,
        type: boolean_attribute,
        boolean_attribute: { key: app.force_sample, value: true },
      },
      # END: policies for team_a
    ]
Scaling collectors with the tail sampling processor

This processor requires all spans for a given trace to be sent to the same collector instance for the correct sampling decision to be derived. When scaling the collector, you'll then need to ensure that all spans for the same trace are reaching the same collector. You can achieve this by having two layers of collectors in your infrastructure: one with the load balancing exporter, and one with the tail sampling processor.

While it's technically possible to have one layer of collectors with two pipelines on each instance, we recommend separating the layers in order to have better failure isolation.

Probabilistic Sampling Processor compared to the Tail Sampling Processor with the Probabilistic policy

The probabilistic sampling processor and the probabilistic tail sampling processor policy work very similar: based upon a configurable sampling percentage they will sample a fixed ratio of received traces. But depending on the overall processing pipeline you should prefer using one over the other.

As a rule of thumb, if you want to add probabilistic sampling and...

...you are not using the tail sampling processor already: use the probabilistic sampling processor. Running the probabilistic sampling processor is more efficient than the tail sampling processor. The probabilistic sampling policy makes decision based upon the trace ID, so waiting until more spans have arrived will not influence its decision.

...you are already using the tail sampling processor: add the probabilistic sampling policy. You are already incurring the cost of running the tail sampling processor, adding the probabilistic policy will be negligible. Additionally, using the policy within the tail sampling processor will ensure traces that are sampled by other policies will not be dropped.

FAQ

Q. Why am I seeing high values for the error metric sampling_trace_dropped_too_early?

A. This is likely a load issue. If the collector is processing more traces in-memory than the num_traces configuration option allows, some will have to be dropped before they can be sampled. Increasing the value of num_traces can help resolve this error, at the expense of increased memory usage.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewFactory

func NewFactory() processor.Factory

NewFactory returns a new factory for the Tail Sampling processor.

Types

type AndCfg added in v0.44.0

type AndCfg struct {
	SubPolicyCfg []AndSubPolicyCfg `mapstructure:"and_sub_policy"`
}

AndCfg holds the common configuration to all and policies.

type AndSubPolicyCfg added in v0.44.0

type AndSubPolicyCfg struct {
	// contains filtered or unexported fields
}

AndSubPolicyCfg holds the common configuration to all policies under and policy.

type BooleanAttributeCfg added in v0.73.0

type BooleanAttributeCfg struct {
	// Tag that the filter is going to be matching against.
	Key string `mapstructure:"key"`
	// Value indicate the bool value, either true or false to use when matching against attribute values.
	// BooleanAttribute Policy will apply exact value match on Value
	Value bool `mapstructure:"value"`
}

BooleanAttributeCfg holds the configurable settings to create a boolean attribute filter sampling policy evaluator.

type CompositeCfg added in v0.38.0

type CompositeCfg struct {
	MaxTotalSpansPerSecond int64                   `mapstructure:"max_total_spans_per_second"`
	PolicyOrder            []string                `mapstructure:"policy_order"`
	SubPolicyCfg           []CompositeSubPolicyCfg `mapstructure:"composite_sub_policy"`
	RateAllocation         []RateAllocationCfg     `mapstructure:"rate_allocation"`
}

CompositeCfg holds the configurable settings to create a composite sampling policy evaluator.

type CompositeSubPolicyCfg added in v0.60.0

type CompositeSubPolicyCfg struct {

	// Configs for and policy evaluator.
	AndCfg AndCfg `mapstructure:"and"`
	// contains filtered or unexported fields
}

CompositeSubPolicyCfg holds the common configuration to all policies under composite policy.

type Config

type Config struct {
	// DecisionWait is the desired wait time from the arrival of the first span of
	// trace until the decision about sampling it or not is evaluated.
	DecisionWait time.Duration `mapstructure:"decision_wait"`
	// NumTraces is the number of traces kept on memory. Typically most of the data
	// of a trace is released after a sampling decision is taken.
	NumTraces uint64 `mapstructure:"num_traces"`
	// ExpectedNewTracesPerSec sets the expected number of new traces sending to the tail sampling processor
	// per second. This helps with allocating data structures with closer to actual usage size.
	ExpectedNewTracesPerSec uint64 `mapstructure:"expected_new_traces_per_sec"`
	// PolicyCfgs sets the tail-based sampling policy which makes a sampling decision
	// for a given trace when requested.
	PolicyCfgs []PolicyCfg `mapstructure:"policies"`
}

Config holds the configuration for tail-based sampling.

type LatencyCfg added in v0.29.0

type LatencyCfg struct {
	// Lower bound in milliseconds. Retaining original name for compatibility
	ThresholdMs int64 `mapstructure:"threshold_ms"`
	// Upper bound in milliseconds.
	UpperThresholdmsMs int64 `mapstructure:"upper_threshold_ms"`
}

LatencyCfg holds the configurable settings to create a latency filter sampling policy evaluator

type NumericAttributeCfg

type NumericAttributeCfg struct {
	// Tag that the filter is going to be matching against.
	Key string `mapstructure:"key"`
	// MinValue is the minimum value of the attribute to be considered a match.
	MinValue int64 `mapstructure:"min_value"`
	// MaxValue is the maximum value of the attribute to be considered a match.
	MaxValue int64 `mapstructure:"max_value"`
	// InvertMatch indicates that values must not match against attribute values.
	// If InvertMatch is true and Values is equal to '123', all other values will be sampled except '123'.
	// Also, if the specified Key does not match any resource or span attributes, data will be sampled.
	InvertMatch bool `mapstructure:"invert_match"`
}

NumericAttributeCfg holds the configurable settings to create a numeric attribute filter sampling policy evaluator.

type OTTLConditionCfg added in v0.78.0

type OTTLConditionCfg struct {
	ErrorMode           ottl.ErrorMode `mapstructure:"error_mode"`
	SpanConditions      []string       `mapstructure:"span"`
	SpanEventConditions []string       `mapstructure:"spanevent"`
}

OTTLConditionCfg holds the configurable setting to create a OTTL condition filter sampling policy evaluator.

type PolicyCfg

type PolicyCfg struct {

	// Configs for defining composite policy
	CompositeCfg CompositeCfg `mapstructure:"composite"`
	// Configs for defining and policy
	AndCfg AndCfg `mapstructure:"and"`
	// contains filtered or unexported fields
}

PolicyCfg holds the common configuration to all policies.

type PolicyType

type PolicyType string

PolicyType indicates the type of sampling policy.

const (
	// AlwaysSample samples all traces, typically used for debugging.
	AlwaysSample PolicyType = "always_sample"
	// Latency sample traces that are longer than a given threshold.
	Latency PolicyType = "latency"
	// NumericAttribute sample traces that have a given numeric attribute in a specified
	// range, e.g.: attribute "http.status_code" >= 399 and <= 999.
	NumericAttribute PolicyType = "numeric_attribute"
	// Probabilistic samples a given percentage of traces.
	Probabilistic PolicyType = "probabilistic"
	// StatusCode sample traces that have a given status code.
	StatusCode PolicyType = "status_code"
	// StringAttribute sample traces that an attribute, of type string, matching
	// one of the listed values.
	StringAttribute PolicyType = "string_attribute"
	// RateLimiting allows all traces until the specified limits are satisfied.
	RateLimiting PolicyType = "rate_limiting"
	// Composite allows defining a composite policy, combining the other policies in one
	Composite PolicyType = "composite"
	// And allows defining a And policy, combining the other policies in one
	And PolicyType = "and"
	// SpanCount sample traces that are have more spans per Trace than a given threshold.
	SpanCount PolicyType = "span_count"
	// TraceState sample traces with specified values by the given key
	TraceState PolicyType = "trace_state"
	// BooleanAttribute sample traces having an attribute, of type bool, that matches
	// the specified boolean value [true|false].
	BooleanAttribute PolicyType = "boolean_attribute"
	// OTTLCondition sample traces which match user provided OpenTelemetry Transformation Language
	// conditions.
	OTTLCondition PolicyType = "ottl_condition"
)

type ProbabilisticCfg added in v0.34.0

type ProbabilisticCfg struct {
	// HashSalt allows one to configure the hashing salts. This is important in scenarios where multiple layers of collectors
	// have different sampling rates: if they use the same salt all passing one layer may pass the other even if they have
	// different sampling rates, configuring different salts avoids that.
	HashSalt string `mapstructure:"hash_salt"`
	// SamplingPercentage is the percentage rate at which traces are going to be sampled. Defaults to zero, i.e.: no sample.
	// Values greater or equal 100 are treated as "sample all traces".
	SamplingPercentage float64 `mapstructure:"sampling_percentage"`
}

ProbabilisticCfg holds the configurable settings to create a probabilistic sampling policy evaluator.

type RateAllocationCfg added in v0.38.0

type RateAllocationCfg struct {
	Policy  string `mapstructure:"policy"`
	Percent int64  `mapstructure:"percent"`
}

RateAllocationCfg used within composite policy

type RateLimitingCfg

type RateLimitingCfg struct {
	// SpansPerSecond sets the limit on the maximum nuber of spans that can be processed each second.
	SpansPerSecond int64 `mapstructure:"spans_per_second"`
}

RateLimitingCfg holds the configurable settings to create a rate limiting sampling policy evaluator.

type SpanCountCfg added in v0.54.0

type SpanCountCfg struct {
	// Minimum number of spans in a Trace
	MinSpans int32 `mapstructure:"min_spans"`
	MaxSpans int32 `mapstructure:"max_spans"`
}

SpanCountCfg holds the configurable settings to create a Span Count filter sampling policy evaluator

type StatusCodeCfg added in v0.30.0

type StatusCodeCfg struct {
	StatusCodes []string `mapstructure:"status_codes"`
}

StatusCodeCfg holds the configurable settings to create a status code filter sampling policy evaluator.

type StringAttributeCfg

type StringAttributeCfg struct {
	// Tag that the filter is going to be matching against.
	Key string `mapstructure:"key"`
	// Values indicate the set of values or regular expressions to use when matching against attribute values.
	// StringAttribute Policy will apply exact value match on Values unless EnabledRegexMatching is true.
	Values []string `mapstructure:"values"`
	// EnabledRegexMatching determines whether match attribute values by regexp string.
	EnabledRegexMatching bool `mapstructure:"enabled_regex_matching"`
	// CacheMaxSize is the maximum number of attribute entries of LRU Cache that stores the matched result
	// from the regular expressions defined in Values.
	// CacheMaxSize will not be used if EnabledRegexMatching is set to false.
	CacheMaxSize int `mapstructure:"cache_max_size"`
	// InvertMatch indicates that values or regular expressions must not match against attribute values.
	// If InvertMatch is true and Values is equal to 'acme', all other values will be sampled except 'acme'.
	// Also, if the specified Key does not match on any resource or span attributes, data will be sampled.
	InvertMatch bool `mapstructure:"invert_match"`
}

StringAttributeCfg holds the configurable settings to create a string attribute filter sampling policy evaluator.

type TraceStateCfg added in v0.54.0

type TraceStateCfg struct {
	// Tag that the filter is going to be matching against.
	Key string `mapstructure:"key"`
	// Values indicate the set of values to use when matching against trace_state values.
	Values []string `mapstructure:"values"`
}

TraceStateCfg holds the common configuration for trace states.

Directories

Path Synopsis
internal
idbatcher
Package idbatcher defines a pipeline of fixed size in which the elements are batches of ids.
Package idbatcher defines a pipeline of fixed size in which the elements are batches of ids.
sampling
Package sampling contains the interfaces and data types used to implement the various sampling policies.
Package sampling contains the interfaces and data types used to implement the various sampling policies.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL