pardocs

Version: v1.2.0
Published: Dec 27, 2020 License: AGPL-3.0 Imports: 2 Imported by: 0

README

go-pardocs

Tools to process published Parliament documents (PDFs only) into a more accessible form. Spiritual successor to https://github.com/leowmjw/parliamentMY-QA-blast

Assumes:

  • OSX dev environment
  • Go v1.13 and above (uses go mod)
  • Releases for Linux, Windows and OSX are available as cross-compiled binaries

Usage

Planning
$ ./go-pardocs plan -session <name> -type <L|BL> [-force] [-dir <workspace>] <sourcePDFPath>
Example:
	./go-pardocs plan -session par14sesi1 -type L ./raw/Lisan/JDR12032019.pdf
	./go-pardocs plan -session par13sesi3 -type L "./raw/Lisan/JWP DR 161018.pdf"
	./go-pardocs plan -session par12sesi1 -type L ./raw/Lisan/20140327__DR_JawabLisan_clean.pdf
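
The usage line for plan can be read as a small flag grammar. The following is a minimal sketch of how such arguments might be parsed with Go's standard flag package; it is not taken from the go-pardocs source. The planArgs type, the parsePlanArgs function, the default of "." for -dir and the help strings are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io/ioutil"
)

// planArgs holds the values of one `plan` invocation.
// Hypothetical type: the field names are not from the go-pardocs source.
type planArgs struct {
	Session string // -session, e.g. par14sesi1
	Type    string // -type, L or BL
	Force   bool   // -force (semantics are not stated in the README)
	Dir     string // -dir, workspace directory (default "." is an assumption)
	Source  string // positional <sourcePDFPath>
}

// parsePlanArgs parses everything that follows the `plan` subcommand.
func parsePlanArgs(args []string) (planArgs, error) {
	var pa planArgs
	fs := flag.NewFlagSet("plan", flag.ContinueOnError)
	fs.SetOutput(ioutil.Discard) // errors are returned instead of printed
	fs.StringVar(&pa.Session, "session", "", "parliament session label")
	fs.StringVar(&pa.Type, "type", "", "hansard type: L or BL")
	fs.BoolVar(&pa.Force, "force", false, "force flag")
	fs.StringVar(&pa.Dir, "dir", ".", "workspace directory")
	if err := fs.Parse(args); err != nil {
		return pa, err
	}
	if pa.Session == "" {
		return pa, errors.New("missing -session")
	}
	if pa.Type != "L" && pa.Type != "BL" {
		return pa, fmt.Errorf("invalid -type %q, want L or BL", pa.Type)
	}
	// The source PDF must arrive as exactly one argument, so a path
	// containing spaces has to be quoted or escaped in the shell.
	if fs.NArg() != 1 {
		return pa, fmt.Errorf("want exactly one source PDF path, got %d", fs.NArg())
	}
	pa.Source = fs.Arg(0)
	return pa, nil
}

func main() {
	pa, err := parsePlanArgs([]string{
		"-session", "par13sesi3", "-type", "L", "./raw/Lisan/JWP DR 161018.pdf",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", pa)
}
```

Because the standard flag package stops at the first non-flag argument, flags have to come before the source path, which matches the order in the usage line.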

Splitting
$ ./go-pardocs split -session par14sesi1 -type BL <file>

Output

$ ls ./splitout
...                                 par14sesi1-soalan-BukanLisan-3.pdf
README.md                           par14sesi1-soalan-BukanLisan-4.pdf
par14sesi1-soalan-BukanLisan-1.pdf  par14sesi1-soalan-BukanLisan-5.pdf
par14sesi1-soalan-BukanLisan-2.pdf  par14sesi1-soalan-BukanLisan-6.pdf
...
Splitting with optional date prefix

Example: Parliament 14, Session 2, Meeting 3 (Parlimen 14 Sesi 2 Mesyuarat 3); 4 December 2019

Run the plan

$ ./go-pardocs plan -session 20191204-par14sesi2mesy3 -type L ./raw/Lisan/JDR04122019.pdf

Split with the date prefix in the session parameter

$ ./go-pardocs split -session 20191204-par14sesi2mesy3 -type L ./raw/Lisan/JDR04122019.pdf

Output

$ ls ./splitout
20191204-par14sesi2mesy3-soalan-Lisan-1.pdf
20191204-par14sesi2mesy3-soalan-Lisan-10.pdf
20191204-par14sesi2mesy3-soalan-Lisan-11.pdf
20191204-par14sesi2mesy3-soalan-Lisan-12.pdf
20191204-par14sesi2mesy3-soalan-Lisan-13.pdf
...
20191204-par14sesi2mesy3-soalan-Lisan-6.pdf
20191204-par14sesi2mesy3-soalan-Lisan-7.pdf
20191204-par14sesi2mesy3-soalan-Lisan-8.pdf
20191204-par14sesi2mesy3-soalan-Lisan-9.pdf
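
The listings suggest that output files are named <session>-soalan-<Lisan|BukanLisan>-<n>.pdf, and that ls orders them lexically (1, 10, 11, ... 9) rather than by question number. The sketch below rebuilds that pattern and sorts names numerically. It is inferred from the listings only; outputName, questionNumber and sortByQuestion are hypothetical helpers, not functions of this package.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// outputName builds a file name in the pattern seen in the listings:
// <session>-soalan-<Lisan|BukanLisan>-<n>.pdf
func outputName(session, hansardType string, n int) (string, error) {
	labels := map[string]string{"L": "Lisan", "BL": "BukanLisan"}
	label, ok := labels[hansardType]
	if !ok {
		return "", fmt.Errorf("unknown type %q, want L or BL", hansardType)
	}
	return fmt.Sprintf("%s-soalan-%s-%d.pdf", session, label, n), nil
}

// questionNumber returns the trailing number of an output file name,
// or -1 if the name does not end in "-<number>.pdf".
func questionNumber(name string) int {
	base := strings.TrimSuffix(name, ".pdf")
	i := strings.LastIndex(base, "-")
	n, err := strconv.Atoi(base[i+1:])
	if err != nil {
		return -1
	}
	return n
}

// sortByQuestion orders names by question number instead of lexically.
func sortByQuestion(names []string) {
	sort.Slice(names, func(i, j int) bool {
		return questionNumber(names[i]) < questionNumber(names[j])
	})
}

func main() {
	var names []string
	for _, n := range []int{10, 9, 1, 11} {
		name, err := outputName("20191204-par14sesi2mesy3", "L", n)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		names = append(names, name)
	}
	sortByQuestion(names)
	for _, name := range names {
		fmt.Println(name)
	}
}
```

Sorting on the parsed number keeps the date prefix and session label untouched, so the same helper works with or without the optional date prefix.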

IMPORTANT!

The split API appears to fail on certain malformed PDFs [ most of the older ones from Parliament ;P ].

The issue is tracked at: https://github.com/hhrutter/pdfcpu/issues/87

The program tries the API first; if that fails, it falls back to running the pdfcpu command directly.

2019/06/16 21:01:09 Unexpected error split via API:  dict=pagesDict entry=Tabs: unsupported in version 1.4
This file could be PDF/A compliant but pdfcpu only supports versions <= PDF V1.7

2019/06/16 21:01:09 Falling back to split using pdfcpu CLI ..

The fallback assumes that pdfcpu version 0.1.23 or above is installed in the default bin folder of the Go installation, ~/go/bin/pdfcpu.

HOWTO Split file to smaller pieces for analysis / development

Assumes: pdfcpu has been downloaded

EXAMPLE: Split into 15-page chunks

$ ~/go/bin/pdfcpu split \
        ~/Downloads/Pertanyaan\ Jawapan\ Bukan\ Lisan\ 22019.pdf \
        raw/BukanLisan/split 15

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CommandMode

type CommandMode int

CommandMode specifies the operation being executed.

const (
	PLAN CommandMode = iota
	SPLIT
	RESET
)

The available commands.
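
Since the constants are declared with iota, PLAN is 0, SPLIT is 1 and RESET is 2. The sketch below shows one way a subcommand name could be mapped to a CommandMode. The type and constants are re-declared locally so the example compiles on its own; the String method, parseCommandMode, and the "reset" subcommand name are assumptions, as the README only shows plan and split on the command line.

```go
package main

import "fmt"

// CommandMode and its constants are a local copy of the declarations
// documented above, repeated here so the sketch is self-contained.
type CommandMode int

const (
	PLAN CommandMode = iota
	SPLIT
	RESET
)

// String returns a lower-case name for the mode.
// Hypothetical helper: not documented as part of the package.
func (m CommandMode) String() string {
	switch m {
	case PLAN:
		return "plan"
	case SPLIT:
		return "split"
	case RESET:
		return "reset"
	}
	return fmt.Sprintf("CommandMode(%d)", int(m))
}

// parseCommandMode maps a subcommand name to its CommandMode.
// Hypothetical helper for illustration.
func parseCommandMode(s string) (CommandMode, error) {
	for _, m := range []CommandMode{PLAN, SPLIT, RESET} {
		if m.String() == s {
			return m, nil
		}
	}
	return 0, fmt.Errorf("unknown command %q", s)
}

func main() {
	for _, arg := range []string{"plan", "split", "reset", "merge"} {
		m, err := parseCommandMode(arg)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %d\n", m, int(m))
	}
}
```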

type Configuration

type Configuration struct {

	// Parliament Session Label
	ParliamentSession string

	// Hansard Type
	HansardType hansard.HansardType

	// ./raw + ./data folders are assumed to be relative to this dir
	WorkingDir string

	// Source PDF can be anywhere; maybe make it a Reader to be read direct from S3?
	SourcePDFPath string

	// Command being executed.
	Cmd CommandMode
}

Configuration holds the settings used by ParliamentDocs.

type ParliamentDocs

type ParliamentDocs struct {
	Conf Configuration
}

func (*ParliamentDocs) Plan

func (pd *ParliamentDocs) Plan()

func (*ParliamentDocs) Reset

func (pd *ParliamentDocs) Reset()

func (*ParliamentDocs) Split

func (pd *ParliamentDocs) Split()

Directories

Path Synopsis
cmd
internal
