steno-similar

command
v0.0.0-...-5c30f4a
Published: Aug 20, 2022 License: AGPL-3.0 Imports: 8 Imported by: 0

README




Usage: steno-similar [OPTIONS] DBFILE1 DBFILE2
Options:
  -m float
    	match threshold (0=no matching, 1=all ngrams matched) (default 0.4)
  -n int
    	ngram size (default 3)
  -s int
    	ignore articles shorter than this number of words (default 100)
  -v	verbose output
  -l string
    	language (en/ru/es)
eg:

  $ steno-similar -m 0.5 -v poop1.db poop2.db


Indexing algorithm:

1) initialise the ngram table (to hold a list of doc IDs for each possible ngram)

2) For each document in the source:

    a) lowercase text, remove punctuation, remove stopwords, apply stemming
      eg:
      doc 1: "National Socialism is a bit silly."
      doc 2: "Nihilists! Fuck me. I mean, say what you want about the tenets of National Socialism, Dude, at least it's an ethos."

      becomes:
      doc 1: "nation social silly"
      doc 2: "nihili fuck mean tenet nation social dud least ethos"

    b) split into ngrams
      eg (with ngramSize=2)
      doc 1: "nation social", "social silly"
      doc 2: "nihili fuck", "fuck mean", "mean tenet", "tenet nation", "nation social", "social dud", "dud least", "least ethos"

    c) go through each ngram and append the doc id to its slot in the ngram table.
       So, after indexing the two example docs above, the ngram table looks like this:

        "nation social"  => [1,2]
        "social silly" => [1]
        "nihili fuck"  => [2]
        "fuck mean"  => [2]
        "mean tenet"  => [2]
        "tenet nation"  => [2]
        "social dud"  => [2]
        "dud least"  => [2]
        "least ethos"  => [2]

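Steps (b) and (c) can be sketched in Go roughly as follows. This is a minimal illustration, not the code steno-similar actually uses: the names `ngrams`, `index` and `add` are hypothetical, the input is assumed to be already normalised as in step (a), and recording each doc ID only once per ngram is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams splits already-normalised text into overlapping word ngrams of size n.
// Text with fewer than n words yields no ngrams.
func ngrams(text string, n int) []string {
	if n < 1 {
		return nil
	}
	words := strings.Fields(text)
	var out []string
	for i := 0; i+n <= len(words); i++ {
		out = append(out, strings.Join(words[i:i+n], " "))
	}
	return out
}

// index is the ngram table: each ngram maps to the IDs of the docs containing it.
type index map[string][]int

// add appends docID to the slot of every distinct ngram in text.
// Skipping repeated ngrams within one doc is an assumption of this sketch.
func (idx index) add(docID int, text string, n int) {
	seen := map[string]bool{}
	for _, g := range ngrams(text, n) {
		if seen[g] {
			continue
		}
		seen[g] = true
		idx[g] = append(idx[g], docID)
	}
}

func main() {
	idx := index{}
	idx.add(1, "nation social silly", 2)
	idx.add(2, "nihili fuck mean tenet nation social dud least ethos", 2)
	fmt.Println(idx["nation social"]) // [1 2]
	fmt.Println(idx["social silly"])  // [1]
}
```

The normalised strings passed to `add` are the ones that produce the nine-entry ngram table shown above.
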

Matching algorithm:

To match a document ("the query doc") against the index:

eg doc 3: "National socialism is fucking mean."

1) split up the document text into ngrams, as above.
   => doc 3: "nation social" "social fuck" "fuck mean"

2) get a list of _all_ the potential match docs which contain any of those ngrams
   =>
        "nation social"  => [1,2]
        "social fuck" => []  (empty - wasn't in original corpus)
        "fuck mean"  => [2]
        so the potential match docs are: [1,2]

   For each one we calculate a match factor:

   a) count how many of the query doc ngrams appear in this potential match doc
   => docid 1:  1 match
   => docid 2:  2 matches

   b) divide by the number of ngrams in the query doc to get match factor
   => docid 1: 1/3 = 0.33333...
   => docid 2: 2/3 = 0.66666...

   c) discard any documents below the match threshold
   => with the default threshold of 0.4, docid 1 (0.33) is discarded and docid 2 (0.67) is kept

   d) perform a diff to compare the original text of the two documents
      (NOT the ngrams!) and display this in the HTML report
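
The match factor calculation in steps (a) to (c) can be sketched in Go as below. This is an illustration under assumptions rather than the tool's real code: `matchFactors` is a hypothetical name, the ngram table is a plain map, and each doc ID is assumed to appear at most once per ngram slot. The diff and HTML report of step (d) are not covered.

```go
package main

import "fmt"

// matchFactors returns, for every doc sharing at least one ngram with the
// query, the fraction of the query's ngrams that the doc contains.
func matchFactors(idx map[string][]int, query []string) map[int]float64 {
	counts := map[int]int{}
	for _, g := range query {
		for _, id := range idx[g] { // a missing ngram yields an empty slot
			counts[id]++
		}
	}
	factors := map[int]float64{}
	for id, c := range counts {
		factors[id] = float64(c) / float64(len(query))
	}
	return factors
}

func main() {
	// Only the table entries relevant to doc 3 are reproduced here.
	idx := map[string][]int{
		"nation social": {1, 2},
		"fuck mean":     {2},
	}
	query := []string{"nation social", "social fuck", "fuck mean"}
	threshold := 0.4 // the default value of -m

	for id, f := range matchFactors(idx, query) {
		if f >= threshold { // docs below the threshold are discarded
			fmt.Printf("doc %d matches (factor %.2f)\n", id, f)
		}
	}
	// prints: doc 2 matches (factor 0.67)
}
```
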






