kmd

module
v0.0.0-...-3fbc642 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 19, 2024 License: MIT

README

CI

Kmd

Karol Michal Daemon

HTTP/CLI interface for tesseract-ocr written in Go.

Architecture

Client/server.

  • A client kmc is a simple static binary which talks to server. Curl would be enough too.
  • Server kmd is an HTTP server listening on unix socket /run/user/1000/kmd.sock. It uses https://github.com/klippa-app/go-pdfium?tab=readme-ov-file#webassembly to render pdf as png and gotesseract for the final OCR. It has a native dependencies on a tesseract-ocr, so is expected to run inside a Docker container.

Usage

Server

# build
$ mage build

# run server via systemd socket activation
# will be handier once Docker files will be ready
$ systemd-socket-activate -l ${XDG_RUNTIME_DIR:-/run/user/`id`}/kmd.sock ./kmd

$ curl --unix-socket /run/user/1000/kmd.sock http://localhost/
version: 5.4.1
languages:
 * ces
 * eng

Client

# build
$ mage build

$ kmc info
version: 5.4.1
languages:
 * ces
 * eng

$ kmc ocr internal/testdata/hello.png
Hello, world!

TODO

  1. make unix path configurable
  • tests
  • Dockerfile + public docker imaaazzz + GHA updating the stuff
  • install a systemctl file(s)
  • HTTP/Accept for server - implement text/plain and application/json at least
  • errors reporting - maybe terrasect can't report errors other way than printing it?
Error in pixReadStream: Pdf reading is not supported
Leptonica Error in pixRead: pix not read: /tmp/orcserver1026615170

Why OCR?

I've said this before and I'll say it again: PDF sucks Rajesh Koothrappali, Ph.D.

Library from Russ Cox: https://pkg.go.dev/rsc.io/pdf parses one of the testing PDF as

{Font:Arial CE FontSize:9.168 X:146.04 Y:663.5 W:0 S:2} {Font:Arial CE
FontSize:9.168 X:146.076672 Y:663.5 W:0 S:^@} {Font:Arial CE FontSize:9.168
X:146.076672 Y:663.5 W:0 S:^C} {Font:Arial CE FontSiz e:9.168 X:146.076672
Y:663.5 W:0 S:^@} {Font:Arial CE FontSize:9.168 X:146.076672 Y:663.5 W:0 S:^W} 

And an amazing github.com/pdfcpu/pdfcpu

BT
/F1 9.24 Tf
1 0 0 1 410.86 709.66 Tm
/GS11 gs
0 g
/GS12 gs
0 G
[<00290044004E>-3<0057>-4<005800550044000300FE0011>] TJ

Commercial offerings exists, haven't tried https://unidoc.io/post/pdf-text-extraction-in-golang-with-unipdf/ https://docs.apryse.com/documentation/go/guides/features/extraction/text-extract/

Directories

Path Synopsis
cmd
kmc
kmd
A trivial server for a tesseract-ocr
A trivial server for a tesseract-ocr
internal
ocr
os

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL