ogórek

Version: v1.3.0
Published: Feb 20, 2025 License: MIT Imports: 15 Imported by: 80

README

ogórek

ogórek is a Go library for encoding and decoding pickles.

Fuzz Testing

Fuzz testing is implemented for both the decoder and the encoder. To run the fuzz tests:

go get github.com/dvyukov/go-fuzz/go-fuzz
go get github.com/dvyukov/go-fuzz/go-fuzz-build
go-fuzz-build github.com/kisielk/og-rek
go-fuzz -bin=./ogórek-fuzz.zip -workdir=./fuzz

Documentation

Overview

Package ogórek(*) is a library for decoding/encoding Python's pickle format.

Use Decoder to decode a pickle from an input stream, for example:

d := ogórek.NewDecoder(r)
obj, err := d.Decode() // obj is an any value representing the decoded Python object

Use Encoder to encode an object as a pickle into an output stream, for example:

e := ogórek.NewEncoder(w)
err := e.Encode(obj)

The following table summarizes the mapping of basic types between Python and Go:

Python     Go
------     --

None    ↔  ogórek.None
bool    ↔  bool
int     ↔  int64
int     ←  int, intX, uintX
long    ↔  *big.Int
float   ↔  float64
float   ←  floatX
list    ↔  []any
tuple   ↔  ogórek.Tuple

Dicts have two modes. In the first, default mode, Python dicts are decoded into a standard Go map. This mode uses a builtin Go type, but cannot fully mirror Python behaviour because, for example, int(1), big.Int(1) and float64(1.0) are all treated as different keys by Go, while Python treats them as equal. It also does not support decoding dicts that use tuples as keys:

dict    ↔  map[any]any                       PyDict=n mode, default
        ←  ogórek.Dict
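The limitation of the default mode follows from how Go's builtin map compares keys stored as interface values. The sketch below uses only the standard library to illustrate it; distinctKeys and sliceKeyPanics are hypothetical names. The second function relies on the fact that Tuple is declared as []any, and a slice used as a map key panics at run time, which is presumably why tuple keys cannot be decoded in this mode:

```go
package main

import (
	"fmt"
	"math/big"
)

// distinctKeys stores the number one under three Go representations and
// reports how many separate entries the builtin map ends up with.
func distinctKeys() int {
	m := map[any]any{}
	m[int64(1)] = "int"
	m[float64(1.0)] = "float"
	m[big.NewInt(1)] = "long" // *big.Int keys compare by pointer identity
	return len(m)
}

// sliceKeyPanics reports whether using a slice as a map[any]any key panics.
func sliceKeyPanics() (panicked bool) {
	defer func() { panicked = recover() != nil }()
	m := map[any]any{}
	m[[]any{int64(1), "a"}] = "tuple-like key" // slices are not hashable
	return false
}

func main() {
	fmt.Println(distinctKeys())   // 3, where the equivalent Python dict has 1 entry
	fmt.Println(sliceKeyPanics()) // true
}
```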

In PyDict=y mode, however, Python dicts are decoded as ogórek.Dict, which mirrors the behaviour of Python dict with respect to key equality and to which types may be used as keys.

dict    ↔  ogórek.Dict                       PyDict=y mode
        ←  map[any]any

Strings also have two modes. In the first, default mode, both py2/py3 str and py2 unicode are decoded into string, with py2 str assumed to be UTF-8 encoded. Correspondingly, for protocol ≤ 2 a Go string is encoded as a UTF-8 encoded py2 str, and for protocol ≥ 3 as py3 str / py2 unicode. ogórek.ByteString can be used to produce bytestring objects after encoding even for protocol ≥ 3. This mode tries to match Go string with the str type of the target Python depending on protocol version, but loses information after a decoding/encoding cycle:

py2/py3 str  ↔  string                       StrictUnicode=n mode, default
py2 unicode  →  string
py2 str      ←  ogórek.ByteString

In StrictUnicode=y mode, however, there is a 1-1 mapping between py2 unicode / py3 str and Go string, and between py2 str and ogórek.ByteString. In this mode decoding/encoding and encoding/decoding round trips are always the identity with respect to strings:

py2 unicode / py3 str  ↔  string             StrictUnicode=y mode
py2 str                ↔  ogórek.ByteString

For bytes, regardless of string mode, there is a direct 1-1 mapping between Python and Go types:

bytes        ↔  ogórek.Bytes   (~)
bytearray    ↔  []byte

Python classes and instances are mapped to Class and Call, for example:

Python                          Go
------                          --

decimal.Decimal            ↔    ogórek.Class{"decimal", "Decimal"}
decimal.Decimal("3.14")    ↔    ogórek.Call{
                                    ogórek.Class{"decimal", "Decimal"},
                                    ogórek.Tuple{"3.14"},
                                }

In particular, since Python classes and calls are decoded into plain data and nothing is executed, it is by default safe on the Go side to decode pickles from untrusted sources(^).

Pickle protocol versions

The pickle stream format has evolved over time. The original protocol version 0 is human-readable, with versions 1 and 2 extending the protocol in a backward-compatible way with binary encodings for efficiency. Protocol version 2 is the highest protocol version understood by the standard pickle module of Python2. Protocol version 3 added ways to represent Python bytes objects from Python3(~). Protocol version 4 further enhances version 3 and switches completely to binary-only encoding. Protocol version 5 added support for out-of-band data(%). Please see https://docs.python.org/3/library/pickle.html#data-stream-format for details.

On decoding, ogórek detects which protocol is being used and automatically handles all necessary details.
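To illustrate what the stream itself reveals: pickles written with protocol 2 or later begin with the PROTO opcode (byte 0x80) followed by a one-byte version number, while protocol 0 and 1 streams carry no such header. The sketch below peeks at that header using only the standard library. declaredProtocol is a hypothetical helper for illustration; it is not part of ogórek and makes no claim about how the decoder is implemented:

```go
package main

import "fmt"

// opProto is the PROTO opcode that opens pickles of protocol 2 and later.
const opProto = 0x80

// declaredProtocol returns the protocol version announced in the pickle
// header, or ok=false when the stream does not start with a PROTO opcode
// (which is the case for protocols 0 and 1).
func declaredProtocol(data []byte) (version int, ok bool) {
	if len(data) >= 2 && data[0] == opProto {
		return int(data[1]), true
	}
	return 0, false
}

func main() {
	p2 := []byte("\x80\x02N.") // None pickled with protocol 2
	p0 := []byte("N.")         // None pickled with protocol 0

	v, ok := declaredProtocol(p2)
	fmt.Println(v, ok) // 2 true

	v, ok = declaredProtocol(p0)
	fmt.Println(v, ok) // 0 false
}
```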

On encoding, for compatibility with Python2, ogórek by default produces pickles with protocol 2. Bytes will thus, by default, be unpickled as str on Python2 and as bytes on Python3. If an earlier protocol is desired or, on the other hand, Bytes needs to be encoded efficiently (the protocol 2 encoding for bytes is far from optimal) and compatibility with pure Python2 is not an issue, the protocol to use for encoding can be specified explicitly, for example:

e := ogórek.NewEncoderWithConfig(w, &ogórek.EncoderConfig{
	Protocol: 3,
})
err := e.Encode(obj)

See EncoderConfig.Protocol for details.

Persistent references

Pickle was originally created for serialization in the ZODB (http://zodb.org) object database, where on-disk objects can reference each other similarly to how one in-RAM object can hold a reference to another in-RAM object.

When a pickle with such a persistent reference is decoded, ogórek represents the reference with a Ref placeholder, similarly to Class and Call. It is, however, possible to hook into decoding and process such references in an application-specific way, for example by loading the referenced object from the database:

d := ogórek.NewDecoderWithConfig(r, &ogórek.DecoderConfig{
	PersistentLoad: ...
})
obj, err := d.Decode()

Similarly, for encoding, an application can hook into the serialization process and turn pointers to some in-RAM objects into persistent references.

Please see DecoderConfig.PersistentLoad and EncoderConfig.PersistentRef for details.

Handling unpickled values

In Python, two objects of different types can represent essentially the same entity. For example, 1 (int) and 1L (long) represent the integer one via two different types, and are decoded by ogórek into the Go types int64 and *big.Int respectively. However, on the Python side those two representations are often used interchangeably, and programs are usually expected to handle both with the same effect. To help handle decoded values with such differences, ogórek provides utilities that bring objects to a common type regardless of which type variant was used in the pickle stream. For example, AsInt64 tries to represent an unpickled value as int64 if possible and returns an error if not.
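The kind of normalization described here can be pictured with a short standard-library sketch. asInt64Sketch and pyLong are hypothetical names; this is not the library's implementation of AsInt64, and the real function may accept more input types than the two handled below:

```go
package main

import (
	"fmt"
	"math/big"
)

// asInt64Sketch brings the two integer representations described above
// (int64 for Python int, *big.Int for Python long) to a plain int64.
func asInt64Sketch(x any) (int64, error) {
	switch v := x.(type) {
	case int64:
		return v, nil
	case *big.Int:
		if v == nil || !v.IsInt64() {
			return 0, fmt.Errorf("%v does not fit into int64", v)
		}
		return v.Int64(), nil
	default:
		return 0, fmt.Errorf("expected an integer, got %T", x)
	}
}

// pyLong builds the *big.Int that a Python long with the given decimal
// digits would be represented as on the Go side.
func pyLong(digits string) *big.Int {
	n, ok := new(big.Int).SetString(digits, 10)
	if !ok {
		panic("invalid decimal literal: " + digits)
	}
	return n
}

func main() {
	for _, x := range []any{int64(1), pyLong("1"), pyLong("100000000000000000000"), "1"} {
		n, err := asInt64Sketch(x)
		fmt.Println(n, err)
	}
}
```

Application code that only cares about ordinary integers can then work with int64 everywhere and treat the error case as invalid input.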

For strings the situation is similar, with one complication. In Python3, strings are unicode strings and binary data is represented by the bytes type. In Python2, however, strings are bytestrings and can contain both text and binary data. In the default mode py2 strings, like py2 unicode, are decoded into Go strings. In StrictUnicode mode py2 strings are instead decoded into ByteString, the type dedicated to representing them on the Go side. Two utilities help programs handle all this bytes/string data in the pickle stream in a uniform way:

  • the program should use AsString if it expects text data - either unicode string, or byte string.
  • the program should use AsBytes if it expects binary data - either bytes, or byte string.

Using these helpers fits the Python3 strings/bytes model, while still allowing programs to handle data generated under Python2.

Similarly, Dict considers ByteString to be equal to both string and Bytes with the same underlying content. This allows programs to access Dict via string/bytes keys following the Python3 model, while still being able to handle dictionaries generated under Python2.

--------

(*) ogórek is Polish for "pickle".

(~) bytes can be produced only by Python3 or zodbpickle (https://pypi.org/project/zodbpickle), not by standard Python2. Correspondingly, for protocol ≤ 2, what ogórek produces is unpickled as bytes by Python3 or zodbpickle, and as str by Python2.

(^) in contrast to the Python implementation, where a malicious pickle can cause the decoder to run arbitrary code, including e.g. os.system("rm -rf /").

(%) ogórek currently does not support out-of-band data.

Variables

var ErrInvalidPickleVersion = errors.New("invalid pickle version")

Functions

func AsInt64 added in v1.3.0

func AsInt64(x any) (int64, error)

AsInt64 tries to represent an unpickled value as int64.

Python int is decoded as int64, while Python long is decoded as *big.Int. Go code should use AsInt64 to accept normal-range integers independently of their Python representation.

func AsString added in v1.3.0

func AsString(x any) (string, error)

AsString tries to represent unpickled value as string.

It succeeds only if the value is either string, or ByteString. It does not succeed if the value is Bytes or any other type.

ByteString is treated as related to string because ByteString represents the str type from py2, which can contain both string and binary data.

Types

type ByteString added in v1.3.0

type ByteString string

ByteString represents str from Python2 in StrictUnicode mode.

See StrictUnicode mode documentation in top-level package overview for details.

func (ByteString) GoString added in v1.3.0

func (v ByteString) GoString() string

type Bytes added in v1.1.0

type Bytes string

Bytes represents Python's bytes.

func AsBytes added in v1.3.0

func AsBytes(x any) (Bytes, error)

AsBytes tries to represent unpickled value as Bytes.

It succeeds only if the value is either Bytes, or ByteString. It does not succeed if the value is string or any other type.

ByteString is treated as related to Bytes because ByteString represents the str type from py2, which can contain both string and binary data.

func (Bytes) GoString added in v1.3.0

func (v Bytes) GoString() string

GoString makes Bytes, ByteString and unicode be represented by %#v distinctly from string (without GoString, %#v emits just "..." for all of string, Bytes and unicode).
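The effect of a GoString method on %#v output can be seen with a small standard-library sketch. The types plainBytes and taggedBytes are hypothetical, and the output format chosen for taggedBytes is an assumption for illustration, not necessarily the format ogórek uses:

```go
package main

import "fmt"

// plainBytes is a named string type without a GoString method.
type plainBytes string

// taggedBytes is a named string type with a GoString method.
type taggedBytes string

// GoString makes %#v show the type name together with the content.
func (v taggedBytes) GoString() string {
	return fmt.Sprintf("taggedBytes(%q)", string(v))
}

// render formats v with the %#v verb.
func render(v any) string {
	return fmt.Sprintf("%#v", v)
}

func main() {
	fmt.Println(render("abc"))              // "abc"
	fmt.Println(render(plainBytes("abc")))  // "abc", indistinguishable from string
	fmt.Println(render(taggedBytes("abc"))) // taggedBytes("abc")
}
```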

type Call

type Call struct {
	Callable Class
	Args     Tuple
}

Call represents Python's call.

type Class

type Class struct {
	Module, Name string
}

Class represents a Python class.

type Decoder

type Decoder struct {
	// contains filtered or unexported fields
}

Decoder is a decoder for pickle streams.

func NewDecoder

func NewDecoder(r io.Reader) *Decoder

NewDecoder returns a new Decoder with the default configuration.

The decoder will decode the pickle stream in r.

func NewDecoderWithConfig added in v1.1.0

func NewDecoderWithConfig(r io.Reader, config *DecoderConfig) *Decoder

NewDecoderWithConfig is similar to NewDecoder, but returns a decoder with the specified configuration.

config must not be nil.

func (*Decoder) Decode

func (d *Decoder) Decode() (any, error)

Decode decodes the pickle stream and returns the result or an error.

type DecoderConfig added in v1.1.0

type DecoderConfig struct {
	// PersistentLoad, if !nil, will be used by decoder to handle persistent references.
	//
	// Whenever the decoder finds an object reference in the pickle stream
	// it will call PersistentLoad. If PersistentLoad returns !nil object
	// without error, the decoder will use that object instead of Ref in
	// the resulting Go object.
	//
	// An example use-case for PersistentLoad is to transform persistent
	// references in a ZODB database of form (type, oid) tuple, into
	// equivalent-to-type Go ghost object, e.g. equivalent to zodb.BTree.
	//
	// See Ref documentation for more details.
	PersistentLoad func(ref Ref) (any, error)

	// StrictUnicode, when true, requests to decode to Go string only
	// Python unicode objects. Python2 bytestrings (py2 str type) are
	// decoded into ByteString in this mode. See StrictUnicode mode
	// documentation in top-level package overview for details.
	StrictUnicode bool

	// PyDict, when true, requests to decode Python dicts as ogórek.Dict
	// instead of builtin map. See PyDict mode documentation in top-level
	// package overview for details.
	PyDict bool
}

DecoderConfig allows tuning of Decoder.

type Dict added in v1.3.0

type Dict struct {
	// contains filtered or unexported fields
}

Dict represents dict from Python in PyDict mode.

It mirrors Python with respect to which types are allowed to be used as keys, and with respect to key equality. For example, Tuple is allowed as a key, and int(1), float64(1.0) and big.Int(1) are all considered equal.

For strings, similarly to Python3, Bytes and string are considered not equal, even if their underlying content is the same. However, a ByteString, because it represents the str type from Python2, is treated as equal to both Bytes and string with the same underlying content.

See PyDict mode documentation in top-level package overview for details.

Note: like the builtin map, Dict is a pointer-like type: its zero value represents a nil dictionary that is empty and on which Set must not be used.

func NewDict added in v1.3.0

func NewDict() Dict

NewDict returns new empty dictionary.

func NewDictWithData added in v1.3.0

func NewDictWithData(kv ...any) Dict

NewDictWithData returns new dictionary with preset data.

kv should be key₁, value₁, key₂, value₂, ...

func NewDictWithSizeHint added in v1.3.0

func NewDictWithSizeHint(size int) Dict

NewDictWithSizeHint returns new empty dictionary with preallocated space for size items.

func (Dict) Del added in v1.3.0

func (d Dict) Del(key any)

Del removes equal keys from the dictionary.

All entries with key equal to the query are looked up and removed.

Del panics if key's type is not allowed to be used as Dict key.

func (Dict) Get added in v1.3.0

func (d Dict) Get(key any) any

Get returns value associated with equal key.

An entry with key equal to the query is looked up and corresponding value is returned.

nil is returned if no matching key is present in the dictionary.

Get panics if key's type is not allowed to be used as Dict key.

func (Dict) Get_ added in v1.3.0

func (d Dict) Get_(key any) (value any, ok bool)

Get_ is comma-ok version of Get.

func (Dict) GoString added in v1.3.0

func (d Dict) GoString() string

GoString returns detailed human-readable representation of the dictionary.

func (Dict) Iter added in v1.3.0

func (d Dict) Iter() func(yield func(any, any) bool)

Iter returns iterator over all elements in the dictionary.

The order to visit entries is arbitrary.
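The returned function has the shape of a push iterator: it calls yield once per entry and stops as soon as yield returns false. With Go 1.23 or later a function of this shape can be used directly in a for ... range statement; on older toolchains it can be called with an explicit callback. The standard-library sketch below demonstrates the protocol with pairs, a hypothetical stand-in that iterates over a builtin map instead of a Dict:

```go
package main

import "fmt"

// pairs is a hypothetical stand-in that returns an iterator with the same
// signature as the one returned by Iter, backed by a builtin map.
func pairs(m map[any]any) func(yield func(any, any) bool) {
	return func(yield func(any, any) bool) {
		for k, v := range m {
			if !yield(k, v) {
				return // the consumer asked to stop
			}
		}
	}
}

// countUpTo visits entries until limit of them have been seen (limit ≥ 1)
// or the iterator is exhausted, and returns the number of visited entries.
func countUpTo(it func(yield func(any, any) bool), limit int) int {
	n := 0
	it(func(k, v any) bool {
		n++
		return n < limit // returning false stops the iteration early
	})
	return n
}

func main() {
	it := pairs(map[any]any{"a": 1, "b": 2, "c": 3})
	fmt.Println(countUpTo(it, 2))  // 2
	fmt.Println(countUpTo(it, 10)) // 3
}
```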

func (Dict) Len added in v1.3.0

func (d Dict) Len() int

Len returns the number of items in the dictionary.

func (Dict) Set added in v1.3.0

func (d Dict) Set(key, value any)

Set sets key to be associated with value.

Any previous keys, equal to the new key, are removed from the dictionary before the assignment.

Set panics if key's type is not allowed to be used as Dict key.

func (Dict) String added in v1.3.0

func (d Dict) String() string

String returns human-readable representation of the dictionary.

type Encoder

type Encoder struct {
	// contains filtered or unexported fields
}

An Encoder encodes Go data structures into a pickle byte stream.

func NewEncoder

func NewEncoder(w io.Writer) *Encoder

NewEncoder returns a new Encoder with the default configuration.

The encoder will emit pickle stream into w.

func NewEncoderWithConfig added in v1.1.0

func NewEncoderWithConfig(w io.Writer, config *EncoderConfig) *Encoder

NewEncoderWithConfig is similar to NewEncoder, but returns the encoder with the specified configuration.

config must not be nil.

func (*Encoder) Encode

func (e *Encoder) Encode(v any) error

Encode writes the pickle encoding of v to w, the encoder's writer.

type EncoderConfig added in v1.1.0

type EncoderConfig struct {
	// Protocol specifies which pickle protocol version should be used.
	Protocol int

	// PersistentRef, if !nil, will be used by encoder to encode objects as persistent references.
	//
	// Whenever the encoder sees a pointer to a Go struct object, it will call
	// PersistentRef to find out how to encode that object. If PersistentRef
	// returns nil, the object is encoded regularly. If !nil - the object
	// will be encoded as an object reference.
	//
	// See Ref documentation for more details.
	PersistentRef func(obj any) *Ref

	// StrictUnicode, when true, requests to always encode Go string
	// objects as Python unicode independently of used pickle protocol.
	// See StrictUnicode mode documentation in top-level package overview
	// for details.
	StrictUnicode bool
}

EncoderConfig allows tuning of Encoder.

type None

type None struct{}

None is a representation of Python's None.

type OpcodeError

type OpcodeError struct {
	Key byte
	Pos int
}

OpcodeError is the error that Decode returns when it sees an unknown pickle opcode.

func (OpcodeError) Error

func (e OpcodeError) Error() string

type Ref

type Ref struct {
	// persistent ID of referenced object.
	//
	// used to be string for protocol 0, but "upgraded" to be arbitrary
	// object for later protocols.
	Pid any
}

Ref is the default representation for a Python persistent reference.

Such references are used when one pickle somehow references another pickle in e.g. a database.

See https://docs.python.org/3/library/pickle.html#pickle-persistent for details.

See DecoderConfig.PersistentLoad and EncoderConfig.PersistentRef for ways to tune Decoder and Encoder to handle persistent references with user-specified application logic.

type Tuple

type Tuple []any

Tuple is a representation of Python's tuple.

type TypeError

type TypeError struct {
	// contains filtered or unexported fields
}

func (*TypeError) Error

func (te *TypeError) Error() string
