isUTF8
Detect whether a file is well-formed UTF-8 or not.
isUTF8 is written in Go and uses memory-mapped files to run as quickly as possible. It depends on the golang.org/x/sys/unix package, so it will probably run only on Unix-like systems (e.g., macOS, Linux). A simpler, portable, but slower approach could use ordinary file I/O with utf8.Valid or utf8.ValidString.
On a 2016 MacBook Pro, isUTF8 checked a 1 GB file in around 1 second, about 30% faster than a nearly identical C program compiled with gcc's -O3 flag (run times will vary depending on the system and how much of the file is already in memory cache).
For information about well-formed UTF-8, see The Unicode Standard, Chapter 3 Conformance, Table 3-7 Well-Formed UTF-8 Byte Sequences.
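To make the table concrete, here is a sketch of a validator that follows the byte ranges of Table 3-7 directly. It is an illustration only, not the routine used by isUTF8; the function name wellFormed is hypothetical. The key point from the table is that only the second byte of a sequence has a restricted range that depends on the lead byte; any further trailing bytes are always 80..BF.

```go
package main

import "fmt"

// wellFormed (hypothetical name) reports whether b consists only of
// byte sequences permitted by Table 3-7 of The Unicode Standard.
func wellFormed(b []byte) bool {
	for i := 0; i < len(b); {
		c := b[i]
		n := 0                           // number of trailing bytes
		lo, hi := byte(0x80), byte(0xBF) // allowed range of the second byte
		switch {
		case c <= 0x7F: // ASCII
			i++
			continue
		case c >= 0xC2 && c <= 0xDF:
			n = 1
		case c == 0xE0: // A0..BF excludes overlong three-byte forms
			n, lo = 2, 0xA0
		case c >= 0xE1 && c <= 0xEC, c == 0xEE, c == 0xEF:
			n = 2
		case c == 0xED: // 80..9F excludes surrogates U+D800..U+DFFF
			n, hi = 2, 0x9F
		case c == 0xF0: // 90..BF excludes overlong four-byte forms
			n, lo = 3, 0x90
		case c >= 0xF1 && c <= 0xF3:
			n = 3
		case c == 0xF4: // 80..8F excludes code points above U+10FFFF
			n, hi = 3, 0x8F
		default: // C0, C1, F5..FF, or a stray continuation byte
			return false
		}
		if i+n >= len(b) { // sequence is truncated
			return false
		}
		if b[i+1] < lo || b[i+1] > hi {
			return false
		}
		for j := i + 2; j <= i+n; j++ {
			if b[j] < 0x80 || b[j] > 0xBF {
				return false
			}
		}
		i += n + 1
	}
	return true
}

func main() {
	samples := [][]byte{
		[]byte("h\xc3\xa9llo"),   // U+00E9 encoded as C3 A9
		{0xC0, 0x80},             // overlong encoding of U+0000
		{0xED, 0xA0, 0x80},       // surrogate U+D800
		{0xF4, 0x90, 0x80, 0x80}, // beyond U+10FFFF
	}
	for _, s := range samples {
		fmt.Printf("% X: %v\n", s, wellFormed(s))
	}
}
```

Only the first sample is well-formed; the other three fall outside the ranges the table allows.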
Prerequisites
Go programming language.
golang.org/x/sys/unix package. It is not part of the standard Go installation, so it must be installed separately:
go get golang.org/x/sys/unix
Building
git clone https://github.com/mfuhr/isUTF8.git
cd isUTF8
go test
go build
To install under $GOPATH/bin:
go install
To see test coverage:
go test -coverprofile=coverage.out
go tool cover -func=coverage.out
go tool cover -html=coverage.out
Examples
$ ./isUTF8 testdata/test_utf8.txt
true testdata/test_utf8.txt
$ echo $?
0
$ ./isUTF8 testdata/test_latin1.txt
false testdata/test_latin1.txt
$ echo $?
1
Status
In active development (June 2017). Behavior, especially the output, is subject to change.