text

v0.0.0-...-8bb1b95
Warning

This package is not in the latest version of its module.

Published: Sep 7, 2017 License: MIT Imports: 5 Imported by: 2

README

text

Common text handling for go-dedup

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetDoubleMetaphone

func GetDoubleMetaphone(document string, dc TextCleanserDecorator) []string

func GetWords

func GetWords(document string, dc TextCleanserDecorator) []string

func Ident

func Ident(s string) string

Ident is the identity function: it returns the input string unchanged.

Types

type Doc2Words

type Doc2Words func(document string) []string

Doc2Words defines the function type that converts a document into a slice of words.

Example

For a standalone test, change the package to `main` and the next func definition to `func main() {`.

package main

import (
	"fmt"

	"github.com/go-dedup/text"
)

var Doc2words = text.GetWordsFactory(text.Decorators(
	text.SplitCamelCase,
	text.ToLower,
	text.RemovePunctuation,
	text.Compact,
	text.Trim,
))

// for standalone test, change package to `main` and the next func def to,
// func main() {
func main() {
	for _, d := range testDoc {
		fmt.Printf("%v\n", Doc2words(string(d)))
	}

}

var testDoc = [][]byte{
	[]byte("(ebook) GNU - PYTHON Standard Library with myConstantVariable (2001)"),
	// }
	// var testDoc2 = [][]byte{
	[]byte("Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic"),
	[]byte("2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic"),
	[]byte("2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic"),
	[]byte("2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic"),
	[]byte("2013 Ford Fiesta Sedan - 22,116 kms Body is in perfect condition. No mechanical problems. Oil change and maintenance package done in March/17. Registered inspection done in April/16. $10,000 firm (sales tax is extra). Call … 22,120km | Automatic"),
	[]byte("2015 Ford Explorer Sport SUV, Crossover This vehicle is a real beauty and a pleasure to drive. It is in excellent condition and has been store inside since purchased in 2015. It has not been driven in winter other then to go for service.!… 18,600km | Automatic"),
	[]byte("Ford F-150. Lariat DO NOT BUY. Truck has been in the shop 50 days so far. It has had a vibration since day one and Ford cannot get rid of it. The have done everything possible to the underside of this truck and it is… 11,000km | Automatic"),
	[]byte("2016 Ford Mustang 2016 Ford Mustang white with black stripes, this car is in showroom shape and it only has 14,000kms. this beast has never been in an accident nor does it have one scratch on the body. i purchased 20… 14,000km | Automatic"),
}
Output:

[ebook gnu python standard library with my constant variable 2001]
[ford f 150 lariat do not buy truck has been in the shop 50 days so far it has had a vibration since day one and ford cannot get rid of it the have done everything possible to the underside of this truck and it is 11000km automatic]
[2016 ford mustang 2016 ford mustang white with black stripes this car is in showroom shape and it only has 14000kms this beast has never been in an accident nor does it have one scratch on the body i purchased 20 14000km automatic]
[2013 ford fiesta sedan 22116 kms body is in perfect condition no mechanical problems oil change and maintenance package done in march 17 registered inspection done in april 16 $10000 firm sales tax is extra call 22120km automatic]
[2015 ford explorer sport suv crossover this vehicle is a real beauty and a pleasure to drive it is in excellent condition and has been store inside since purchased in 2015 it has not been driven in winter other then to go for service 18600km automatic]
[2013 ford fiesta sedan 22116 kms body is in perfect condition no mechanical problems oil change and maintenance package done in march 17 registered inspection done in april 16 $10000 firm sales tax is extra call 22120km automatic]
[2015 ford explorer sport suv crossover this vehicle is a real beauty and a pleasure to drive it is in excellent condition and has been store inside since purchased in 2015 it has not been driven in winter other then to go for service 18600km automatic]
[ford f 150 lariat do not buy truck has been in the shop 50 days so far it has had a vibration since day one and ford cannot get rid of it the have done everything possible to the underside of this truck and it is 11000km automatic]
[2016 ford mustang 2016 ford mustang white with black stripes this car is in showroom shape and it only has 14000kms this beast has never been in an accident nor does it have one scratch on the body i purchased 20 14000km automatic]

func GetDoubleMetaphoneFactory

func GetDoubleMetaphoneFactory(dc TextCleanserDecorator) Doc2Words

func GetWordsFactory

func GetWordsFactory(dc TextCleanserDecorator) Doc2Words
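The factory functions bind a decorator once and hand back a reusable function of the Doc2Words shape. The following is a minimal sketch of that idea, not the package's implementation: the names `Cleanser`, `Decorator`, `DocToWords`, `identity`, `lower`, and `wordsFactory` are hypothetical and local to this sketch, and splitting on single spaces is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Local stand-ins that mirror the shapes of TextCleanser,
// TextCleanserDecorator, and Doc2Words. They are assumptions of this sketch.
type Cleanser func(string) string
type Decorator func(Cleanser) Cleanser
type DocToWords func(document string) []string

// identity returns its input unchanged.
func identity(s string) string { return s }

// lower is a hypothetical decorator: it lower-cases the input and then
// passes the result to the wrapped cleanser.
func lower(c Cleanser) Cleanser {
	return func(s string) string { return c(strings.ToLower(s)) }
}

// wordsFactory applies the decorator to identity once, then returns a
// function that cleanses a document and splits it on single spaces.
func wordsFactory(d Decorator) DocToWords {
	clean := d(identity)
	return func(document string) []string {
		return strings.Split(clean(document), " ")
	}
}

func main() {
	toWords := wordsFactory(lower)
	// Two consecutive spaces yield an empty word under this splitting rule.
	fmt.Printf("%q\n", toWords("Ford  F150")) // ["ford" "" "f150"]
}
```

Under this splitting rule, runs of spaces produce empty words, which is one reason a whitespace-compacting step is useful earlier in the chain.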

type TextCleanser

type TextCleanser func(string) string

TextCleanser defines the function type for text cleansing

Example

For a standalone test, change the package to `main` and the next func definition to `func main() {`.

package main

import (
	"fmt"

	"github.com/go-dedup/text"
)

// for standalone test, change package to `main` and the next func def to,
// func main() {
func main() {
	s := "Hello~~, play_ground#5!"

	var fn text.TextCleanser = text.Ident
	fmt.Println(fn(s))

	var fn2 = text.ToLower(fn)
	fmt.Println(fn2(s))

	var fn3 text.TextCleanser = text.Ident
	fn3 = text.ToAppend(" -GOLANG")(text.ToLower(text.ToPrepend("DECORATED: ")(fn3)))
	fmt.Println(fn3(s))

	// dec is now a text.TextCleanserDecorator, to use it, you still need to
	// pass it the function of type text.TextCleanser that you want to decorate.
	dec := text.Decorators(
		text.ToAppend(" -GOLANG"),
		text.SplitCamelCase,
		text.ToLower,
		text.ToPrepend("DECORATED: "),
		text.RemovePunctuation,
	)

	fn4 := dec(text.Ident)
	fmt.Println(fn4(s))
	s += "\n.\n%% Something extra: UpperCamelCase and someInitMethod.\n"
	fmt.Printf(".\n>>>>\n'%s'\n", s)
	fmt.Printf("%#v\n", text.GetWords(s, dec))

	dec = text.Decorators(
		dec,
		text.Compact,
	)
	fmt.Printf("%#v\n", text.GetWords(s, dec))

	fn5 := text.GetWordsFactory(dec)
	fmt.Printf("%#v\n", fn5(s))

	s = "Andrej cabrillo Gallegos Germany Jankelowicz"
	fmt.Printf(".\n>>>>\n'%s'\n", s)
	dec = text.Decorators(
		text.ToDoubleMetaphone,
	)
	fmt.Printf("%#v\n", text.GetWords(s, dec))
	fmt.Printf("%#v\n", text.GetDoubleMetaphone(s, text.Decorators()))

	dec = text.Decorators(
		text.SplitCamelCase,
		text.Compact,
	)
	fn5 = text.GetDoubleMetaphoneFactory(dec)
	fmt.Printf("%#v\n", fn5(s))

	s = "NãoMeFazMal ÇaNeMeFaitPasMal PòssoMangiâFàMâ"
	fmt.Printf(".\n>>>>\n'%s'\n", s)
	dec = text.Decorators(
		text.SplitCamelCaseUnicode,
	)
	fmt.Printf("%#v\n", text.GetWords(s, dec))

}

// to show the full code in GoDoc
type dummy struct {
}
Output:

Hello~~, play_ground#5!
hello~~, play_ground#5!
DECORATED: hello~~, play_ground#5! -golang
DECORATED hello   play ground 5   golang
.
>>>>
'Hello~~, play_ground#5!
.
%% Something extra: UpperCamelCase and someInitMethod.
'
[]string{"DECORATED", "hello", "", "", "play", "ground", "5", "", "", "", "", "", "", "something", "extra", "upper", "camel", "case", "and", "some", "init", "method", "", "", "", "golang"}
[]string{"DECORATED", "hello", "play", "ground", "5", "something", "extra", "upper", "camel", "case", "and", "some", "init", "method", "golang"}
[]string{"DECORATED", "hello", "play", "ground", "5", "something", "extra", "upper", "camel", "case", "and", "some", "init", "method", "golang"}
.
>>>>
'Andrej cabrillo Gallegos Germany Jankelowicz'
[]string{"antrjkprlklkskrmnjnklts", "antrkprkksjrmnanklfx"}
[]string{"antrj", "antr", "kprl", "kpr", "klks", "kks", "krmn", "jrmn", "jnklts", "anklfx"}
[]string{"antrj", "antr", "kprl", "kpr", "klks", "kks", "krmn", "jrmn", "jnklts", "anklfx"}
.
>>>>
'NãoMeFazMal ÇaNeMeFaitPasMal PòssoMangiâFàMâ'
[]string{"Não", "Me", "Faz", "Mal", "Ça", "Ne", "Me", "Fait", "Pas", "Mal", "Pòsso", "Mangiâ", "Fà", "Mâ"}

func Compact

func Compact(c TextCleanser) TextCleanser

Compact collapses all consecutive punctuations into a single space

func RemovePunctuation

func RemovePunctuation(c TextCleanser) TextCleanser

RemovePunctuation removes all punctuation from the text

func SplitCamelCase

func SplitCamelCase(c TextCleanser) TextCleanser

SplitCamelCase splits each CamelCase word in the text into individual words

func SplitCamelCaseUnicode

func SplitCamelCaseUnicode(c TextCleanser) TextCleanser

SplitCamelCaseUnicode splits each CamelCase word in the text into individual words. It is Unicode-aware.
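To illustrate what Unicode-aware splitting means, here is a simplified sketch that inserts a space wherever a lower-case rune is followed by an upper-case rune. It is not the package's implementation; `splitCamel` is a hypothetical helper, and the single rule it uses does not handle acronyms such as "HTTPServer".

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitCamel inserts a space at every lower-to-upper rune boundary.
// It iterates over runes, not bytes, so accented letters are handled.
func splitCamel(s string) string {
	var b strings.Builder
	var prev rune
	for i, r := range s {
		if i > 0 && unicode.IsLower(prev) && unicode.IsUpper(r) {
			b.WriteRune(' ')
		}
		b.WriteRune(r)
		prev = r
	}
	return b.String()
}

func main() {
	fmt.Println(splitCamel("NãoMeFazMal"))        // Não Me Faz Mal
	fmt.Println(splitCamel("myConstantVariable")) // my Constant Variable
}
```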

func ToDoubleMetaphone

func ToDoubleMetaphone(c TextCleanser) TextCleanser

ToDoubleMetaphone transforms the text into its Double Metaphone encodings

func ToLower

func ToLower(c TextCleanser) TextCleanser

ToLower converts the text to lower case

func Trim

func Trim(c TextCleanser) TextCleanser

Trim removes all leading and trailing spaces

type TextCleanserDecorator

type TextCleanserDecorator func(TextCleanser) TextCleanser

TextCleanserDecorator is the decorator type for text cleansing functions: it takes a TextCleanser and returns a new TextCleanser
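The decorator pattern can be sketched with local types as follows. This is an illustration under assumptions, not the package's code: `Cleanser`, `Decorator`, `identity`, `lower`, and `prepend` are hypothetical names. Each decorator here transforms the input first and then delegates to the wrapped cleanser, an ordering inferred from the example output above, where the prepended "DECORATED: " stays upper case even though ToLower wraps ToPrepend.

```go
package main

import (
	"fmt"
	"strings"
)

// Cleanser mirrors the shape of TextCleanser (local to this sketch).
type Cleanser func(string) string

// Decorator mirrors the shape of TextCleanserDecorator.
type Decorator func(Cleanser) Cleanser

// identity returns its input unchanged.
func identity(s string) string { return s }

// lower lower-cases the input, then hands it to the wrapped cleanser.
func lower(c Cleanser) Cleanser {
	return func(s string) string { return c(strings.ToLower(s)) }
}

// prepend returns a decorator that adds prefix before delegating.
func prepend(prefix string) Decorator {
	return func(c Cleanser) Cleanser {
		return func(s string) string { return c(prefix + s) }
	}
}

func main() {
	// lower runs first on the raw input; prepend runs afterwards,
	// so the prefix is not lower-cased.
	fn := lower(prepend("DECORATED: ")(identity))
	fmt.Println(fn("Hello")) // DECORATED: hello
}
```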

func Decorators

Decorators "merges" the passed in decorators and returns a single decorator.
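One way such a merge could work is sketched below. The function `compose` and the helper decorators are hypothetical, and the ordering (the first decorator listed transforms the text first) is an assumption inferred from the example output above rather than taken from the package source.

```go
package main

import (
	"fmt"
	"strings"
)

// Local stand-ins for TextCleanser and TextCleanserDecorator.
type Cleanser func(string) string
type Decorator func(Cleanser) Cleanser

// identity returns its input unchanged.
func identity(s string) string { return s }

// upper upper-cases the input, then delegates to the wrapped cleanser.
func upper(c Cleanser) Cleanser {
	return func(s string) string { return c(strings.ToUpper(s)) }
}

// suffix returns a decorator that appends sfx, then delegates.
func suffix(sfx string) Decorator {
	return func(c Cleanser) Cleanser {
		return func(s string) string { return c(s + sfx) }
	}
}

// compose merges decorators into one. It wraps from the last decorator
// to the first, so ds[0] ends up outermost and transforms the text first.
func compose(ds ...Decorator) Decorator {
	return func(c Cleanser) Cleanser {
		for i := len(ds) - 1; i >= 0; i-- {
			c = ds[i](c)
		}
		return c
	}
}

func main() {
	fmt.Println(compose(upper, suffix("-go"))(identity)("hi")) // HI-go
	fmt.Println(compose(suffix("-go"), upper)(identity)("hi")) // HI-GO
}
```

Because the merged result is itself a decorator, it can be passed back into another merge, as the example above does when it adds Compact to an existing chain.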

func ToAppend

func ToAppend(suffix string) TextCleanserDecorator

ToAppend manipulates the text by appending a suffix

func ToPrepend

func ToPrepend(prefix string) TextCleanserDecorator

ToPrepend manipulates the text by prepending a prefix
