cpd

Version: v1.0.0 (latest) | Published: Feb 19, 2022 | License: Apache-2.0 | Imports: 15 | Imported by: 10

Repository

github.com/softlandia/cpd

README ¶

code page detect

(c) softlandia@gmail.com

download: go get -u github.com/softlandia/cpd
install: go install

Go library for detecting the code page of text files
multibyte code pages and single-byte Russian code pages are supported:

no	ID	Name	uint16
1.	ASCII	"ASCII"	3
2.	ISOLatinCyrillic	"ISO-8859-5"	8
3.	CP866	"CP866"	2086
4.	CP1251	"Windows-1251"	2251
5.	UTF8	"UTF-8"	106
6.	UTF16LE	"UTF-16LE"	1014
7.	UTF16BE	"UTF-16BE"	1013
8.	KOI8R	"KOI8-R"	2084
9.	UTF32LE	"UTF-32LE"	1019
10.	UTF32BE	"UTF-32BE"	1018

feature

encoding is determined both by the presence of a BOM and by heuristics
if a file contains only Latin characters from the first half of the code page (the ASCII range), it is detected as UTF-8
this is not a mistake: such a file is valid UTF-8
detecting UTF-32 is unreliable when the text contains no Russian characters
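
The UTF-8 result for Latin-only files follows from ASCII being a subset of UTF-8. A minimal standard-library sketch of that fact (the helper name isASCII is hypothetical and not part of cpd):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII reports whether every byte lies in the lower half (0x00..0x7F).
// Hypothetical helper for illustration only.
func isASCII(b []byte) bool {
	for _, c := range b {
		if c >= utf8.RuneSelf { // utf8.RuneSelf == 0x80
			return false
		}
	}
	return true
}

func main() {
	data := []byte("plain latin text")
	// Any ASCII-only buffer is also valid UTF-8.
	fmt.Println(isASCII(data), utf8.Valid(data)) // true true
}
```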

ATTENTION! The library supports multithreading

dependencies

"golang.org/x/text/encoding/charmap"
"golang.org/x/text/transform"

types

IDCodePage uint16 - index of code page, implements the String() method (fmt.Stringer)

cp := cpd.UTF8
fmt.Printf("code page index: %d, name: %s\n", cp, cp)
>> code page index: 106, name: UTF-8

variables

ReadBufSize int = 1024 // default number of bytes to read from the input reader for detection

functions

CodepageDetect(r io.Reader) (IDCodePage, error)
FileCodepageDetect(fn string, stopStr ...string) (IDCodePage, error)
DecodeUTF16be(s string) string
DecodeUTF16le(s string) string
NewReader(r io.Reader, cpn ...string) (io.Reader, error)
NewReaderTo(r io.Reader, cpn string) (io.Reader, error)
CodepageAutoDetect(content []byte) (result IDCodePage)

description

func CodepageAutoDetect(content []byte) (result IDCodePage) 
  autodetects the code page from the input byte slice
  use this function instead of golang.org/x/net/html/charset.DetermineEncoding()

CodepageDetect(r io.Reader) (IDCodePage, error)
  detects the code page of ascii data from reader 'r'
  uses the 'reflect' package to check the input reader
  by default reads only the first 1024 bytes from 'r' (change the ReadBufSize variable to adjust this)
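
Reading a bounded sample from a reader can be sketched with the standard library alone. This is an illustration, not cpd's implementation; readHead and headOfString are hypothetical names. Note that reading a sample from an io.Reader consumes those bytes.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// readHead reads at most n bytes from r.
// Input shorter than n is not treated as an error.
func readHead(r io.Reader, n int) ([]byte, error) {
	buf := make([]byte, n)
	m, err := io.ReadFull(r, buf)
	if err == io.EOF || err == io.ErrUnexpectedEOF {
		err = nil // fewer than n bytes were available
	}
	return buf[:m], err
}

// headOfString is a small convenience wrapper for the demo below.
func headOfString(s string, n int) string {
	b, _ := readHead(strings.NewReader(s), n)
	return string(b)
}

func main() {
	long := strings.Repeat("a", 5000)
	fmt.Println(len(headOfString(long, 1024))) // 1024
	fmt.Println(headOfString("short", 1024))   // short
}
```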

FileCodepageDetect(fn string, stopStr ...string) (IDCodePage, error)
  detects the code page of text file "fn", reading the first 1024 bytes (change the ReadBufSize variable to adjust this)
  returns an error if there is a problem with file "fn"
  returns cpd.ASCII if the code page is not detected
  returns one of the following constants (code_pages_id.go): cpd.CP866, cpd.CP1251, cpd.KOI8R, cpd.UTF8, cpd.UTF16LE, cpd.UTF16BE
  the file must contain characters of the Russian alphabet
  the input parameter `stopStr` is not used

func StrConvertCodepage(s string, fromCP, toCP IDCodePage) (string, error)
  converts a string from one code page to another, supports Windows1251 & IBM866

func FileConvertCodepage(fileName string, fromCP, toCP IDCodePage) error
  converts the code page of the file "fileName", supports Windows1251 & IBM866
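
To make the IBM866/Windows1251 relationship concrete, here is a hand-written sketch of a byte-level mapping for ASCII and the Russian letters only. It is not how cpd converts (the library lists golang.org/x/text/encoding/charmap as a dependency); cp866ToCP1251 and convert are hypothetical names, and every other byte is replaced with '?'.

```go
package main

import "fmt"

// cp866ToCP1251 maps one CP866 byte to Windows-1251.
// Only ASCII and the Russian letters are handled; the rest becomes '?'.
func cp866ToCP1251(c byte) byte {
	switch {
	case c < 0x80: // ASCII is identical in both code pages
		return c
	case c <= 0xAF: // А..п: 0x80..0xAF -> 0xC0..0xEF
		return c + 0x40
	case c >= 0xE0 && c <= 0xEF: // р..я: 0xE0..0xEF -> 0xF0..0xFF
		return c + 0x10
	case c == 0xF0: // Ё
		return 0xA8
	case c == 0xF1: // ё
		return 0xB8
	default: // box-drawing and other symbols are out of scope here
		return '?'
	}
}

// convert applies the byte mapping to a whole buffer.
func convert(src []byte) []byte {
	dst := make([]byte, len(src))
	for i, c := range src {
		dst[i] = cp866ToCP1251(c)
	}
	return dst
}

func main() {
	src := []byte{0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2} // "Привет" in CP866
	fmt.Printf("% x\n", convert(src))                 // cf f0 e8 e2 e5 f2
}
```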

func DecodeUTF16be(s string) string
  converts the input string from UTF-16BE to UTF-8

func DecodeUTF16le(s string) string
  converts the input string from UTF-16LE to UTF-8
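
The byte-order difference between the two functions can be illustrated with the standard library. This is a sketch of UTF-16 decoding in general, not cpd's code; decodeUTF16, decodeUTF16LE and decodeUTF16BE are hypothetical names, and a trailing odd byte is silently dropped.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16 turns UTF-16 bytes in the given byte order into a UTF-8 string.
func decodeUTF16(b []byte, order binary.ByteOrder) string {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, order.Uint16(b[i:i+2]))
	}
	// utf16.Decode also resolves surrogate pairs into single runes.
	return string(utf16.Decode(units))
}

func decodeUTF16LE(b []byte) string { return decodeUTF16(b, binary.LittleEndian) }
func decodeUTF16BE(b []byte) string { return decodeUTF16(b, binary.BigEndian) }

func main() {
	le := []byte{0x1f, 0x04, 0x40, 0x04, 0x38, 0x04} // "При" as UTF-16LE
	be := []byte{0x04, 0x1f, 0x04, 0x40, 0x04, 0x38} // "При" as UTF-16BE
	fmt.Println(decodeUTF16LE(le), decodeUTF16BE(be)) // При При
}
```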

NewReader(r io.Reader, cpn ...string) (io.Reader, error)
  decodes the input reader to UTF-8
  cpn may contain the name of the encoding of the input data;
  if cpn is omitted, the encoding of the input data is detected automatically

NewReaderTo(r io.Reader, cpn string) (io.Reader, error)
  encodes the input reader (MUST BE UTF-8) to the specified encoding

tests and static analysis

coverage: 89.8%
folder "test_files" contains files for testing; do not remove, change, or add files there if you want the tests to keep working
folder "sample" contains:

tohex -- encodes the input string to the specified encoding and returns a string with the hexadecimal codes of the resulting runes
detect-all-files -- displays the encoding of all files in the current folder
cpname -- works with encoding names

file linter.md contains the report from golangci-lint

Documentation ¶

Index ¶

Variables
func CodepageAsString(codepage IDCodePage) string
func DecodeUTF16be(s string) string
func DecodeUTF16le(s string) string
func FileConvertCodepage(fileName string, fromCP, toCP IDCodePage) error
func IsSeparator(r rune) bool
func NewReader(r io.Reader, cpn ...string) (io.Reader, error)
func NewReaderTo(r io.Reader, cpn string) (io.Reader, error)
func StrConvertCodepage(s string, fromCP, toCP IDCodePage) (string, error)
func SupportedEncoder(cpn string) bool
func ValidUTF8(data []byte) bool
type CodePage
- func (o CodePage) FirstAlphabetPos(d []byte) int
- func (o CodePage) MatchingRunes() string
- func (o CodePage) String() string
type IDCodePage
- func CheckBOM(buf []byte) (id IDCodePage, res bool)
- func CodepageAutoDetect(b []byte) IDCodePage
- func CodepageDetect(r io.Reader) (IDCodePage, error)
- func FileCodepageDetect(fn string, stopStr ...string) (IDCodePage, error)
- func (i IDCodePage) BomLen() int
- func (i IDCodePage) DeleteBom(s string) (res string)
- func (i IDCodePage) DeleteBomFromReader(r io.Reader) io.Reader
- func (i IDCodePage) ReaderHasBom(r io.Reader) bool
- func (i IDCodePage) String() string
- func (i IDCodePage) StringHasBom(s string) bool
type MatchRes
- func (m MatchRes) String() string
type TCodepagesDic
- func NewCodepageDic() TCodepagesDic
- func (o TCodepagesDic) Match(data []byte) (result IDCodePage)

Variables ¶

var Boms = []struct {
	Bom []byte
	id  IDCodePage
}{
	{[]byte{0xef, 0xbb, 0xbf}, UTF8},
	{[]byte{0x00, 0x00, 0xfe, 0xff}, UTF32BE},
	{[]byte{0xff, 0xfe, 0x00, 0x00}, UTF32LE},
	{[]byte{0xfe, 0xff}, UTF16BE},
	{[]byte{0xff, 0xfe}, UTF16LE},
}

Boms - byte order marks - special bytes for
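
The order of the entries matters for prefix matching: the UTF-32LE mark starts with the same two bytes as the UTF-16LE mark, so the longer one has to be tried first. A minimal sketch with a local copy of the table (detectBOM is a hypothetical name, and names are plain strings rather than IDCodePage values):

```go
package main

import (
	"bytes"
	"fmt"
)

// boms mirrors the order of the table above.
var boms = []struct {
	bom  []byte
	name string
}{
	{[]byte{0xef, 0xbb, 0xbf}, "UTF-8"},
	{[]byte{0x00, 0x00, 0xfe, 0xff}, "UTF-32BE"},
	{[]byte{0xff, 0xfe, 0x00, 0x00}, "UTF-32LE"},
	{[]byte{0xfe, 0xff}, "UTF-16BE"},
	{[]byte{0xff, 0xfe}, "UTF-16LE"},
}

// detectBOM returns the name of the first entry whose mark prefixes buf.
func detectBOM(buf []byte) (string, bool) {
	for _, b := range boms {
		if bytes.HasPrefix(buf, b.bom) {
			return b.name, true
		}
	}
	return "", false
}

func main() {
	fmt.Println(detectBOM([]byte{0xff, 0xfe, 0x00, 0x00, 0x41})) // UTF-32LE true
	fmt.Println(detectBOM([]byte{0xff, 0xfe, 0x41, 0x00}))       // UTF-16LE true
	fmt.Println(detectBOM([]byte("plain")))                      //  false
}
```

With this ordering, a UTF-16LE text whose first character is U+0000 would be reported as UTF-32LE; that ambiguity is inherent to the byte sequences.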

var ReadBufSize int = 1024

ReadBufSize - number of bytes to read from the file, used by func FileCodepageDetect()

Functions ¶

func CodepageAsString ¶

func CodepageAsString(codepage IDCodePage) string

CodepageAsString - returns the name of the character set with id codepage; if the codepage does not exist, returns ""

func DecodeUTF16be ¶

func DecodeUTF16be(s string) string

DecodeUTF16be - decodes the input string from UTF-16BE to UTF-8

func DecodeUTF16le ¶

func DecodeUTF16le(s string) string

DecodeUTF16le - decodes the input string from UTF-16LE to UTF-8

func FileConvertCodepage ¶

func FileConvertCodepage(fileName string, fromCP, toCP IDCodePage) error

FileConvertCodepage - converts the code page of a text file from one to another; only conversion from/to Windows1251/IBM866 is supported

func IsSeparator ¶

func IsSeparator(r rune) bool

IsSeparator - return true if input rune is SPACE or PUNCT

func NewReader ¶

func NewReader(r io.Reader, cpn ...string) (io.Reader, error)

NewReader - conversion to UTF-8
returns the input reader if the input contains fewer than 4 bytes
returns the input reader if the input contains ASCII data
if cpn[0] exists, it is used as the input codepage name

func NewReaderTo ¶

func NewReaderTo(r io.Reader, cpn string) (io.Reader, error)

NewReaderTo - creates a new reader encoding from UTF-8 to the specified codepage
returns the input reader and an error if the output codepage is not found or the encoding is unsupported
if the input contains the BOM, the BOM is deleted

func StrConvertCodepage ¶

func StrConvertCodepage(s string, fromCP, toCP IDCodePage) (string, error)

StrConvertCodepage - converts a string from one code page to another; intended for future extension, at present only conversion from/to Windows1251/IBM866 is supported

func SupportedEncoder ¶

func SupportedEncoder(cpn string) bool

SupportedEncoder - check codepage name

func ValidUTF8 ¶

func ValidUTF8(data []byte) bool

ValidUTF8 - returns true if the input slice contains valid UTF-8
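
Why a UTF-8 validity check helps separate UTF-8 from single-byte Cyrillic text can be shown with the standard library's utf8.Valid. This is an illustration of the idea, not cpd's ValidUTF8; looksLikeUTF8 is a hypothetical name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// looksLikeUTF8 reports whether data is a well-formed UTF-8 byte sequence.
func looksLikeUTF8(data []byte) bool {
	return utf8.Valid(data)
}

func main() {
	utf8Text := []byte("Привет")
	// The same word in Windows-1251: one byte per letter, all >= 0x80.
	cp1251Text := []byte{0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2}
	fmt.Println(looksLikeUTF8(utf8Text), looksLikeUTF8(cp1251Text)) // true false
}
```

A sample cut at a fixed byte count may end in the middle of a multibyte sequence, in which case utf8.Valid reports false for otherwise valid text.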

Types ¶

type CodePage ¶

type CodePage struct {
	NumByte  byte //number of byte using in codepage
	MatchRes      //count of matching

	Boms []byte //default BOM for this codepage
	// contains filtered or unexported fields
}

CodePage - implements a code page

func (CodePage) FirstAlphabetPos ¶

func (o CodePage) FirstAlphabetPos(d []byte) int

FirstAlphabetPos - returns the position of the first alphabetic character of this code page found in the sorted array

func (CodePage) MatchingRunes ¶

func (o CodePage) MatchingRunes() string

MatchingRunes - return string with rune/counts

func (CodePage) String ¶

func (o CodePage) String() string

type IDCodePage ¶

type IDCodePage uint16

IDCodePage - index of code page, implements the String() method

const (
	// ASCII is the uint16 identifier with IANA name US-ASCII (MIME: US-ASCII).
	// ANSI X3.4-1986
	// Reference: RFC2046
	ASCII IDCodePage = 3

	// ISOLatinCyrillic is the MIB identifier with IANA name ISO_8859-5:1988 (MIME: ISO-8859-5).
	//
	// ISO-IR: International Register of Escape Sequences
	// Note: The current registration authority is IPSJ/ITSCJ, Japan.
	// Reference: RFC1345
	ISOLatinCyrillic IDCodePage = 8

	// UTF8 is the uint16 identifier with IANA name UTF-8.
	//
	// rfc3629
	// Reference: RFC3629
	UTF8 IDCodePage = 106

	// Unicode is the uint16 identifier with IANA name ISO-10646-UCS-2.
	//
	// the 2-octet Basic Multilingual Plane, aka Unicode
	// this needs to specify network byte order: the standard
	// does not specify (it is a 16-bit integer space)
	Unicode IDCodePage = 1000

	// UnicodeASCII is the uint16 identifier with IANA name ISO-10646-UCS-Basic.
	//
	// ASCII subset of Unicode.  Basic Latin = collection 1
	// See ISO 10646, Appendix A
	UnicodeASCII IDCodePage = 1002

	// UTF7 is the uint16 identifier with IANA name UTF-7.
	//
	// rfc2152
	// Reference: RFC2152
	UTF7 IDCodePage = 1012

	// UTF16BE is the uint16 identifier with IANA name UTF-16BE.
	//
	// rfc2781
	// Reference: RFC2781
	UTF16BE IDCodePage = 1013

	// UTF16LE is the uint16 identifier with IANA name UTF-16LE.
	//
	// rfc2781
	// Reference: RFC2781
	UTF16LE IDCodePage = 1014

	// UTF32 is the uint16 identifier with IANA name UTF-32.
	//
	// https://www.unicode.org/unicode/reports/tr19/
	UTF32 IDCodePage = 1017

	// UTF32BE is the uint16 identifier with IANA name UTF-32BE.
	//
	// https://www.unicode.org/unicode/reports/tr19/
	UTF32BE IDCodePage = 1018

	// UTF32LE is the uint16 identifier with IANA name UTF-32LE.
	//
	// https://www.unicode.org/unicode/reports/tr19/
	UTF32LE IDCodePage = 1019

	// KOI8R is the uint16 identifier with IANA name KOI8-R (MIME: KOI8-R).
	//
	// rfc1489 , based on GOST-19768-74, ISO-6937/8,
	// INIS-Cyrillic, ISO-5427.
	// Reference: RFC1489
	KOI8R IDCodePage = 2084

	// CP866 is the uint16 identifier with IANA name IBM866.
	//
	// IBM NLDG Volume 2 (SE09-8002-03) August 1994
	CP866 IDCodePage = 2086

	// CP1251 is the uint16 identifier with IANA name windows-1251.
	//
	// Microsoft http://www.iana.org/assignments/charset-reg/windows-1251
	CP1251 IDCodePage = 2251

	// Windows1252 is the uint16 identifier with IANA name windows-1252.
	//
	// Microsoft http://www.iana.org/assignments/charset-reg/windows-1252
	Windows1252 IDCodePage = 2252
)

func CheckBOM ¶

func CheckBOM(buf []byte) (id IDCodePage, res bool)

CheckBOM - check buffer for match to utf-8, utf-16le or utf-16be BOM

func CodepageAutoDetect ¶

func CodepageAutoDetect(b []byte) IDCodePage

CodepageAutoDetect - auto detect code page of input content

func CodepageDetect ¶

func CodepageDetect(r io.Reader) (IDCodePage, error)

CodepageDetect - detect code page of ascii data from reader 'r'

func FileCodepageDetect ¶

func FileCodepageDetect(fn string, stopStr ...string) (IDCodePage, error)

FileCodepageDetect - detect codepage of text file

func (IDCodePage) BomLen ¶

func (i IDCodePage) BomLen() int

BomLen - returns the length in bytes of the BOM for this codepage; for a codepage that has no BOM, returns 0

func (IDCodePage) DeleteBom ¶

func (i IDCodePage) DeleteBom(s string) (res string)

DeleteBom - return string without prefix bom bytes
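
For the UTF-8 case, removing a BOM prefix from a string amounts to trimming three bytes. A standard-library sketch of that operation (trimUTF8BOM is a hypothetical name and not the method above):

```go
package main

import (
	"fmt"
	"strings"
)

// utf8BOM is the UTF-8 byte order mark: EF BB BF.
const utf8BOM = "\xef\xbb\xbf"

// trimUTF8BOM returns s without a leading UTF-8 BOM, if one is present.
func trimUTF8BOM(s string) string {
	return strings.TrimPrefix(s, utf8BOM)
}

func main() {
	s := utf8BOM + "text"
	fmt.Println(len(s), len(trimUTF8BOM(s))) // 7 4
}
```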

func (IDCodePage) DeleteBomFromReader ¶

func (i IDCodePage) DeleteBomFromReader(r io.Reader) io.Reader

DeleteBomFromReader - return reader after removing BOM from it

func (IDCodePage) ReaderHasBom ¶

func (i IDCodePage) ReaderHasBom(r io.Reader) bool

ReaderHasBom - checks the reader for a BOM prefix

func (IDCodePage) String ¶

func (i IDCodePage) String() string

func (IDCodePage) StringHasBom ¶

func (i IDCodePage) StringHasBom(s string) bool

StringHasBom - return true if input string has BOM prefix

type MatchRes ¶

type MatchRes struct {
	// contains filtered or unexported fields
}

MatchRes - result criteria
countMatch - the number of letters found in the text
countCvPairs - the number of consonant+vowel pairs
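
The two criteria can be illustrated on already-decoded text. The sketch below counts Russian letters and consonant+vowel pairs in a UTF-8 string; it only demonstrates the idea and is an assumption about the general approach, not the library's scoring code. The names score and isRussianLetter are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// vowels lists the lowercase Russian vowels.
const vowels = "аеёиоуыэюя"

// isRussianLetter reports whether r is a lowercase Russian letter.
func isRussianLetter(r rune) bool {
	return (r >= 'а' && r <= 'я') || r == 'ё'
}

// score returns the number of Russian letters in text and the number of
// places where a consonant is immediately followed by a vowel.
func score(text string) (letters, cvPairs int) {
	prevConsonant := false
	for _, r := range text {
		r = unicode.ToLower(r)
		if !isRussianLetter(r) {
			prevConsonant = false
			continue
		}
		letters++
		vowel := strings.ContainsRune(vowels, r)
		if vowel && prevConsonant {
			cvPairs++
		}
		// ъ and ь are signs and are not counted as consonants here.
		prevConsonant = !vowel && r != 'ъ' && r != 'ь'
	}
	return letters, cvPairs
}

func main() {
	fmt.Println(score("Привет, мир")) // 9 3
}
```

Decoding the same bytes with the wrong code page tends to give fewer letters and fewer such pairs, which is what makes the counts usable as a ranking signal.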

func (MatchRes) String ¶

func (m MatchRes) String() string

type TCodepagesDic ¶

type TCodepagesDic map[IDCodePage]CodePage

TCodepagesDic - type to store all supported code page

func NewCodepageDic ¶

func NewCodepageDic() TCodepagesDic

NewCodepageDic - create a new map by copying the global

func (TCodepagesDic) Match ¶

func (o TCodepagesDic) Match(data []byte) (result IDCodePage)

Match - returns the id of the code page that the data matches best; calls the match function of each codepage

Directories ¶

Path	Synopsis
sample
cpname
detect-all-files
tohex
