sdhash

v0.0.0-...-a7b5530
Published: Jan 17, 2021 | License: Apache-2.0 | Imports: 19 | Imported by: 2

README

sdhash

sdhash is a tool that processes binary data and produces similarity digests using bloom filters. Two binary files that share common parts produce similar digests. sdhash can compare two similarity digests to produce a score: a score close to 0 means that the two files are very different, and a score of 100 means that the two files are equal.
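To make the idea of a bloom-filter digest concrete, here is a minimal, standard-library-only sketch. It is not the sdhash algorithm: the real tool selects which features to hash, while this toy hashes every non-overlapping 64-byte window with SHA-1 and sets five bits per window (both choices are assumptions of the sketch). All names (`toyFilter`, `digest`, `score`, `pseudoRandom`) are hypothetical and not part of the package.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	filterBytes = 256 // same value as the documented BfSize default
	windowSize  = 64  // same value as the documented PopWinSize default
)

// toyFilter is a fixed-size bit array standing in for one bloom filter.
type toyFilter [filterBytes]byte

// add hashes one window with SHA-1 and sets five bits taken from the hash.
func (f *toyFilter) add(window []byte) {
	sum := sha1.Sum(window)
	for i := 0; i < 5; i++ {
		pos := binary.BigEndian.Uint32(sum[i*4:]) % (filterBytes * 8)
		f[pos/8] |= 1 << (pos % 8)
	}
}

// popCount returns the number of bits set in the filter.
func (f *toyFilter) popCount() int {
	n := 0
	for _, b := range f {
		n += bits.OnesCount8(b)
	}
	return n
}

// digest inserts every non-overlapping window of data into a new filter.
func digest(data []byte) *toyFilter {
	f := new(toyFilter)
	for off := 0; off+windowSize <= len(data); off += windowSize {
		f.add(data[off : off+windowSize])
	}
	return f
}

// score returns the share of set bits in the sparser filter that are also
// set in the other filter, scaled to the range 0..100.
func score(a, b *toyFilter) int {
	common := 0
	for i := range a {
		common += bits.OnesCount8(a[i] & b[i])
	}
	smaller := a.popCount()
	if n := b.popCount(); n < smaller {
		smaller = n
	}
	if smaller == 0 {
		return 0
	}
	return 100 * common / smaller
}

// pseudoRandom fills n bytes from a small xorshift generator so that the
// example is deterministic. The seed must be non-zero.
func pseudoRandom(seed uint32, n int) []byte {
	out := make([]byte, n)
	x := seed
	for i := range out {
		x ^= x << 13
		x ^= x >> 17
		x ^= x << 5
		out[i] = byte(x >> 8)
	}
	return out
}

func main() {
	shared := pseudoRandom(1, 256)
	a := append(append([]byte{}, shared...), pseudoRandom(2, 256)...)
	b := append(append([]byte{}, shared...), pseudoRandom(3, 256)...)
	c := pseudoRandom(4, 512)

	fmt.Println("a vs a:", score(digest(a), digest(a)))
	fmt.Println("a vs b (half shared):", score(digest(a), digest(b)))
	fmt.Println("a vs c (unrelated):", score(digest(a), digest(c)))
}
```

Inputs that share content set many of the same bits, so their score is higher than that of unrelated inputs. Because the toy uses fixed window offsets, it would miss shared content that is shifted by a few bytes.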

Features

  • calculate similarity digests of many files in a short time
  • compare a large number of digests using precalculated indexes
  • optionally perform the comparison during the digesting process
  • same results as the original sdhash with similar performance, but entirely rewritten in Go

Getting started

The sdhash package is available as binaries and as a library.

Binaries

The binaries for all platforms are available on the Releases page.

Library
  1. Install the sdhash package with the command below:
$ go get -u github.com/eciavatta/sdhash
  2. Import it in your code and start playing around:
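package main

import (
	"fmt"
	"log"

	"github.com/eciavatta/sdhash"
)

func main() {
	// Create a factory for each input file, then compute its digest.
	factoryA, err := sdhash.CreateSdbfFromFilename("a.bin")
	if err != nil {
		log.Fatal(err)
	}
	sdbfA := factoryA.Compute()

	factoryB, err := sdhash.CreateSdbfFromFilename("b.bin")
	if err != nil {
		log.Fatal(err)
	}
	sdbfB := factoryB.Compute()

	// Print both digests, then their similarity score (0 to 100).
	fmt.Println(sdbfA.String())
	fmt.Println(sdbfB.String())
	fmt.Println(sdbfA.Compare(sdbfB))
}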

Documentation

The library documentation is published at pkg.go.dev/github.com/eciavatta/sdhash. The sdhash algorithm is described in a paper, and a tutorial is available for the original version of sdhash.

License

sdhash was originally created by Vassil Roussev and Candice Quates and is licensed under the Apache-2.0 License. The Go implementation was written by Emiliano Ciavatta and is also licensed under the Apache-2.0 License.

Documentation

Constants

const (
	MinFileSize = 512 // Minimum file size for a Sdbf file.

)

Variables

var (
	BfSize         uint32 = 256    // BfSize is the size of each bloom filter.
	PopWinSize     uint32 = 64     // PopWinSize is the size of the sliding window used to hash input.
	MaxElem        uint32 = 160    // MaxElem is the maximum number of elements in each bloom filter in stream mode.
	MaxElemDd      uint32 = 192    // MaxElemDd is the maximum number of elements in each bloom filter in block mode.
	Threshold      uint32 = 16     // Threshold is the minimum score above which chunks are considered.
	BlockSize             = 4 * kB // BlockSize is the block size used to generate chunk ranks.
	EntropyWinSize        = 64     // EntropyWinSize is the entropy window size used to generate chunk ranks.
)
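The comments above mention an entropy window without saying how entropy is measured. The sketch below assumes the usual Shannon entropy over byte frequencies, computed for each 64-byte window; whether the package computes or scales it this way is an assumption, and `windowEntropy` and `slidingEntropy` are hypothetical helper names, not part of the package.

```go
package main

import (
	"fmt"
	"math"
)

// entropyWindow uses the same value as the documented EntropyWinSize default.
const entropyWindow = 64

// windowEntropy returns the Shannon entropy, in bits per byte, of one window.
func windowEntropy(window []byte) float64 {
	if len(window) == 0 {
		return 0
	}
	var counts [256]int
	for _, b := range window {
		counts[b]++
	}
	n := float64(len(window))
	h := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// slidingEntropy returns one entropy value per window position, moving the
// window one byte at a time. It returns nil if data is shorter than a window.
func slidingEntropy(data []byte) []float64 {
	if len(data) < entropyWindow {
		return nil
	}
	out := make([]float64, 0, len(data)-entropyWindow+1)
	for off := 0; off+entropyWindow <= len(data); off++ {
		out = append(out, windowEntropy(data[off:off+entropyWindow]))
	}
	return out
}

func main() {
	constant := make([]byte, entropyWindow) // all zero bytes
	distinct := make([]byte, entropyWindow) // 64 different byte values
	for i := range distinct {
		distinct[i] = byte(i)
	}
	fmt.Println("constant window:", windowEntropy(constant))
	fmt.Println("distinct window:", windowEntropy(distinct))
	fmt.Println("positions in 100 bytes:", len(slidingEntropy(make([]byte, 100))))
}
```

A window of identical bytes has zero entropy, while a 64-byte window of all-different values reaches the maximum of 6 bits per byte for that window size.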

Functions

This section is empty.

Types

type BloomFilter

type BloomFilter interface {

	// ElemCount returns the number of elements in the BloomFilter.
	ElemCount() uint64

	// MaxElem returns the maximum number of elements that can be present in the BloomFilter.
	MaxElem() uint64

	// BitsPerElem returns the number of bits for each element of the BloomFilter.
	BitsPerElem() float64

	// WriteToFile serializes the current BloomFilter to a file specified by filename.
	WriteToFile(filename string) error

	// String returns the serialized representation of the BloomFilter.
	String() string
	// contains filtered or unexported methods
}

BloomFilter represents a bloom filter and is used to calculate similarity digests.
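As a rough guide to how the element limits relate to filter size, the sketch below computes bits per element and the textbook bloom filter false-positive estimate (1 - e^(-k*n/m))^k for the documented defaults. The number of hash functions (5) is an assumption, as is the reading of BitsPerElem as filter bits divided by element count; `bitsPerElem` and `falsePositiveRate` are hypothetical helpers, not package functions.

```go
package main

import (
	"fmt"
	"math"
)

// bitsPerElem returns how many filter bits are available per element.
func bitsPerElem(filterBytes, elems int) float64 {
	if elems == 0 {
		return 0
	}
	return float64(filterBytes*8) / float64(elems)
}

// falsePositiveRate is the standard estimate for a bloom filter with
// m = filterBytes*8 bits, n = elems elements and k = hashes hash functions.
func falsePositiveRate(filterBytes, elems, hashes int) float64 {
	m := float64(filterBytes * 8)
	n := float64(elems)
	k := float64(hashes)
	return math.Pow(1-math.Exp(-k*n/m), k)
}

func main() {
	const (
		filterBytes = 256 // documented BfSize default
		hashes      = 5   // assumed number of hash functions
	)
	// 160 and 192 are the documented MaxElem and MaxElemDd defaults.
	for _, elems := range []int{160, 192} {
		fmt.Printf("elems=%d bits/elem=%.2f false positives=%.4f\n",
			elems, bitsPerElem(filterBytes, elems),
			falsePositiveRate(filterBytes, elems, hashes))
	}
}
```

Under these assumptions, allowing more elements per filter lowers the bits available per element and raises the estimated false-positive rate.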

func NewBloomFilter

func NewBloomFilter() BloomFilter

NewBloomFilter returns a new BloomFilter with the default initial values.

func NewBloomFilterFromIndexFile

func NewBloomFilterFromIndexFile(indexFileName string) (BloomFilter, error)

NewBloomFilterFromIndexFile reads a BloomFilter serialized into a file.

func NewBloomFilterFromString

func NewBloomFilterFromString(filter string) (BloomFilter, error)

NewBloomFilterFromString creates a new BloomFilter from a serialized string.

type Sdbf

type Sdbf interface {

	// Name of the file or data this Sdbf represents.
	Name() string

	// Size of the hash data for this Sdbf.
	Size() uint64

	// InputSize of the data that the hash was generated from.
	InputSize() uint64

	// FilterCount returns the number of bloom filters.
	FilterCount() uint32

	// Compare compares two Sdbf and provides a similarity score between 0 and 100.
	// A score of 0 means that the two files are very different, a score of 100 means that the two files are equal.
	Compare(other Sdbf) int

	// CompareSample compares two Sdbf with sampling and provides a similarity score between 0 and 100.
	// A score of 0 means that the two files are very different, a score of 100 means that the two files are equal.
	CompareSample(other Sdbf, sample uint32) int

	// String returns the encoded Sdbf as a string.
	String() string

	// GetIndex returns the BloomFilter index used during the digesting process.
	GetIndex() BloomFilter

	// GetSearchIndexesResults returns search indexes results.
	// The return value is an array of size == len(searchIndexes), and each element is another array of length bfCount.
	GetSearchIndexesResults() [][]uint32

	// Fast modifies the bloom filter buffer for faster comparison.
	// Warning: the operation overwrites the original buffer.
	Fast()
}

Sdbf represents the similarity digest of a file and can be compared for similarity to other Sdbf.

func ParseSdbfFromString

func ParseSdbfFromString(digest string) (Sdbf, error)

ParseSdbfFromString decodes a Sdbf from a digest string.

type SdbfFactory

type SdbfFactory interface {

	// WithBlockSize sets the block size for the block mode.
	// The default value of 0 results in a Sdbf generated in stream mode.
	WithBlockSize(blockSize uint32) SdbfFactory

	// WithInitialIndex sets the initial BloomFilter index.
	// Without setting an initial index the factory creates a new empty BloomFilter.
	WithInitialIndex(initialIndex BloomFilter) SdbfFactory

	// WithSearchIndexes sets a list of BloomFilter which are checked for similarity during digesting process.
	// Without setting a value the searching operation during the digesting process is disabled.
	WithSearchIndexes(searchIndexes []BloomFilter) SdbfFactory

	// WithName sets the name of the Sdbf in the output.
	WithName(name string) SdbfFactory

	// Compute starts the digesting process and provides a Sdbf with the result.
	Compute() Sdbf
}

SdbfFactory can be used to create a Sdbf from a binary source.

func CreateSdbfFromBytes

func CreateSdbfFromBytes(buffer []uint8) (SdbfFactory, error)

CreateSdbfFromBytes returns a factory which can produce a Sdbf from a byte buffer.

func CreateSdbfFromFilename

func CreateSdbfFromFilename(filename string) (SdbfFactory, error)

CreateSdbfFromFilename returns a factory which can produce a Sdbf of a file.

func CreateSdbfFromReader

func CreateSdbfFromReader(r io.Reader) (SdbfFactory, error)

CreateSdbfFromReader returns a factory which can produce a Sdbf from an io.Reader.
