pagedata

package
v0.10.0

Note: this package is not in the latest version of its module.
Published: Jun 11, 2020 License: AGPL-3.0, AGPL-3.0 Imports: 13 Imported by: 0

README

pdfpagedata

Read and write gradex page data from PDF pages.

Why

Tracking pages in PDF documents when they are split into separate files can be done in several ways:

- append something to the filename for each page
- put info in the metadata
- use some page text, off page
- put a protocol buffer into a stream object

Unfortunately, processing thousands of pages, through different hands, needs more safety than fragile filenames can provide. Plus we will have duplicate files with non-duplicate annotations. So what do we do when two people re-upload their different files with the same name, and then either overwrite or make some modification to the filename? How do we recover from an unfortunate name choice here?

We could stash info in the metadata, but that tends to be file-level, so it is not clear how to handle duplicate custom metadata fields when multiple files from different documents are joined, then split, then joined again, and so on. Besides, I've seen editors mess with the metadata, and I don't fancy users editing it either.

Off-page page text seems fragile too, but I get some comfort from reading that people can NOT crop when they want to. A test has been included for this very purpose, and it is passing.

Wrinkles

Text written in the same place gets read back out in some sort of merged way, so pageData is written in a tiny font (like 0.00001) and randomly scattered around a location that is far off the page. Tag destruction is detected (such as from clashes), and multiple page datas on a page are supported.

A collision is possible ... we could always consider writing each hidden data twice ...

Future

Protocol buf into a stream object seems like a more robust way (and it avoids crop and collision worries) but it is probably about a half-day or a day to develop so that makes it a roadmap item for now.

Documentation

Index

Constants

const (
	IsPage    = "page"
	IsRegion  = "region"
	IsCover   = "cover"
	IsMontage = "montage"

	IsAnonymous = "anonymous"
	IsIdentity  = "identity"
)
const (
	StartTag        = "<gradex-pagedata>"
	EndTag          = "</gradex-pagedata>"
	StartTagOffset  = len(StartTag)
	EndTagOffset    = len(EndTag)
	StartHash       = "<hash>"
	EndHash         = "</hash>"
	StartHashOffset = len(StartHash)
	EndHashOffset   = len(EndHash)
)

Variables

This section is empty.

Functions

func AddPageDataToPDF added in v0.8.4

func AddPageDataToPDF(inputPath string, outputPath string, pdMap map[int]PageData) error

modified from https://github.com/unidoc/unipdf-examples/blob/master/text/pdf_insert_text.go

func GetLen

func GetLen(input map[int]PageData) int

func GetLinkMap added in v0.5.0

func GetLinkMap(pageDataMap map[int]PageData) (map[int]Link, error)

A non-nil error means there is a broken sequence on at least one page; the link map has the details.
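
One way to picture the link map is to walk each page's Follows chain. The sketch below uses trimmed copies of the package types and makes a guess about the semantics: that Follows holds the UUID of the preceding PageDetail and is empty for the first one. The helpers `linkFor` and `linkMap` are hypothetical and are not the package's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// Trimmed copies of the package types, keeping only the fields used here.
type PageDetail struct {
	UUID    string
	Follows string
}

type PageData struct {
	Current  PageDetail
	Previous []PageDetail
}

type Link struct {
	First    string
	Last     string
	Sequence []string
	IsLinked bool
}

// linkFor walks backwards from Current along Follows (assumed semantics).
func linkFor(pd PageData) Link {
	byUUID := make(map[string]PageDetail, len(pd.Previous))
	for _, p := range pd.Previous {
		byUUID[p.UUID] = p
	}
	link := Link{Last: pd.Current.UUID, IsLinked: true}
	seen := map[string]bool{}
	var rev []string // UUIDs from last to first
	cur := pd.Current
	for {
		if seen[cur.UUID] { // a cycle counts as a broken sequence
			link.IsLinked = false
			break
		}
		seen[cur.UUID] = true
		rev = append(rev, cur.UUID)
		if cur.Follows == "" {
			break
		}
		prev, ok := byUUID[cur.Follows]
		if !ok { // Follows points at a detail we do not have
			link.IsLinked = false
			break
		}
		cur = prev
	}
	if len(rev) != len(pd.Previous)+1 { // some previous details were never reached
		link.IsLinked = false
	}
	for i := len(rev) - 1; i >= 0; i-- {
		link.Sequence = append(link.Sequence, rev[i])
	}
	link.First = link.Sequence[0]
	return link
}

// linkMap mirrors the shape of GetLinkMap: details per page, plus an error
// if any page has a broken sequence.
func linkMap(pages map[int]PageData) (map[int]Link, error) {
	links := make(map[int]Link, len(pages))
	var err error
	for n, pd := range pages {
		links[n] = linkFor(pd)
		if !links[n].IsLinked {
			err = errors.New("broken sequence on at least one page")
		}
	}
	return links, err
}

func main() {
	pages := map[int]PageData{
		1: {
			Current:  PageDetail{UUID: "c", Follows: "b"},
			Previous: []PageDetail{{UUID: "a"}, {UUID: "b", Follows: "a"}},
		},
	}
	links, err := linkMap(pages)
	fmt.Printf("%+v %v\n", links[1], err)
}
```

Returning the map even when the error is non-nil matches the documented behaviour: the caller inspects IsLinked per page to find where the break is.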

func MarshalOneToCreator

func MarshalOneToCreator(c *creator.Creator, pd *PageData) error

func PrettyPrintStruct

func PrettyPrintStruct(layout interface{}) error

func TriageFile

func TriageFile(inputPath string) (map[int]Summary, error)

func UnMarshalAllFromFile

func UnMarshalAllFromFile(inputPath string) (map[int]PageData, error)

Types

type Field

type Field struct {
	Key   string `json:"k"`
	Value string `json:"v"`
}

type FileDetail

type FileDetail struct {
	Path   string `json:"path"`
	UUID   string `json:"UUID"`
	Number int    `json:"number"`
	Of     int    `json:"of"`
}

type ItemDetail

type ItemDetail struct {
	What    string `json:"what"`
	When    string `json:"when"`
	Who     string `json:"who"`
	UUID    string `json:"UUID"`
	WhoType string `json:"whoType"`
}

WhoType records the kind of identifier held in Who, e.g. EN for exam number, UUN for matriculation number, etc.

type Link

type Link struct {
	First    string
	Last     string
	Sequence []string
	IsLinked bool
}

type PageData

type PageData struct {
	Current  PageDetail   `json:"current"`
	Previous []PageDetail `json:"previous"`
	Revision int          `json:"revision"`
}

type PageDetail

type PageDetail struct {
	Is       string            `json:"is"` //page, region
	Own      FileDetail        `json:"own"`
	Original FileDetail        `json:"original"`
	Current  FileDetail        `json:"current"`
	Item     ItemDetail        `json:"item"`
	Process  ProcessDetail     `json:"process"`
	UUID     string            `json:"UUID"` //for mapping the previous page datas later
	Follows  string            `json:"follows"`
	Revision int               `json:"revision"` //if we want to rewrite history ....
	Data     []Field           `json:"data"`
	Comments []comment.Comment `json:"comments"`
}

Use custom data for group authorship if individual authorship must be tracked here; otherwise use a group id, e.g. group-<uuid>, with the individual authors recorded elsewhere along with the original submission.

type ProcessDetail

type ProcessDetail struct {
	Name     string  `json:"name"`
	UUID     string  `json:"UUID"` // process batch UUID
	UnixTime int64   `json:"unixTime"`
	For      string  `json:"for"`
	ToDo     string  `json:"toDo"`
	By       string  `json:"by"`
	Data     []Field `json:"data"`
}

type Summary

type Summary struct {
	Is   string //page, region, cover-page etc
	What string //item
	For  string //proc
	ToDo string //proc
}

Used in triaging files at ingest/staging
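
Going by the field comments on Summary, a summary could plausibly be derived from each page's current PageDetail as sketched below. The mapping, the helper `summarise`, and the sample values are assumptions for illustration, using trimmed copies of the package types; TriageFile itself reads its input from a PDF file.

```go
package main

import "fmt"

// Trimmed copies of the package types, keeping only the fields used here.
type ItemDetail struct{ What string }

type ProcessDetail struct{ For, ToDo string }

type PageDetail struct {
	Is      string
	Item    ItemDetail
	Process ProcessDetail
}

type PageData struct{ Current PageDetail }

type Summary struct {
	Is   string // page, region, cover-page etc
	What string // item
	For  string // proc
	ToDo string // proc
}

// summarise builds one Summary per page from the current PageDetail.
func summarise(pages map[int]PageData) map[int]Summary {
	out := make(map[int]Summary, len(pages))
	for n, pd := range pages {
		c := pd.Current
		out[n] = Summary{
			Is:   c.Is,
			What: c.Item.What,
			For:  c.Process.For,
			ToDo: c.Process.ToDo,
		}
	}
	return out
}

func main() {
	// Sample values are made up for the example.
	pages := map[int]PageData{
		1: {Current: PageDetail{Is: "cover", Item: ItemDetail{What: "exam-a"}, Process: ProcessDetail{For: "ingest", ToDo: "flatten"}}},
		2: {Current: PageDetail{Is: "page", Item: ItemDetail{What: "exam-a"}, Process: ProcessDetail{For: "marker-1", ToDo: "marking"}}},
	}
	s := summarise(pages)
	for n := 1; n <= len(pages); n++ {
		fmt.Printf("page %d: %+v\n", n, s[n])
	}
}
```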
