docx

v0.0.0-...-1d2080a
Published: Mar 13, 2023 License: MIT Imports: 15 Imported by: 0

README

go-docx


Replace placeholders inside docx documents with speed and confidence.

This project provides a simple and clean API for replacing user-defined placeholders, without the uncertainty that the placeholders may be ripped apart by the WordprocessingML engine used to create the document.


  • Simple: The API exposed is kept to a minimum in order to stick to the purpose.
  • Fast: go-docx is fast since it operates directly on the byte contents instead of mapping the XMLs to a custom data struct.
  • Zero dependencies: go-docx is built with the Go stdlib only, no external dependencies.


➤ Purpose

The task at hand was to replace a set of user-defined placeholders inside a docx document archive with calculated values. All current implementations in Go which solve this problem use a naive approach: they attempt to strings.Replace() the placeholders.

Due to the nature of the WordprocessingML specification, a placeholder which is defined as {the-placeholder} may be ripped apart inside the resulting XML. The placeholder may then be in two fragments, for example {the- and placeholder}, which are spaced apart inside the XML.

The naive approach therefore does not always work. The purpose of this library is to replace placeholders reliably, even if they are fragmented.

➤ Getting Started

All you need is to go get github.com/lukasjarosch/go-docx

package main

import docx "github.com/lukasjarosch/go-docx"

func main() {
	// replaceMap is a key-value map where the keys
	// represent the placeholders without the delimiters
	replaceMap := docx.PlaceholderMap{
		"key":                         "REPLACE some more",
		"key-with-dash":               "REPLACE",
		"key-with-dashes":             "REPLACE",
		"key with space":              "REPLACE",
		"key_with_underscore":         "REPLACE",
		"multiline":                   "REPLACE",
		"key.with.dots":               "REPLACE",
		"mixed-key.separator_styles#": "REPLACE",
		"yet-another_placeholder":     "REPLACE",
	}

	// read and parse the template docx
	doc, err := docx.Open("template.docx")
	if err != nil {
		panic(err)
	}

	// replace the keys with values from replaceMap
	err = doc.ReplaceAll(replaceMap)
	if err != nil {
		panic(err)
	}

	// write out a new file
	err = doc.WriteToFile("replaced.docx")
	if err != nil {
		panic(err)
	}
}

Placeholders

Placeholders are delimited with { and }; nesting of placeholders is not possible. Currently, there is no way to change the delimiters, as I do not see a reason to do so.

Styling

The way this lib works is that a placeholder is just a list of fragments. When detecting the placeholders inside the XML, it looks for the OpenDelimiter and CloseDelimiter. The first fragment found (e.g. {foo of placeholder {foo-bar}) will be replaced with the value from the PlaceholderMap.

This means that technically you can style only the OpenDelimiter inside the Word document and the whole value will be styled like that after replacing. I do not recommend doing that, though, as the WordprocessingML spec is somewhat fragile in this case. It's best to just style the whole placeholder.

But if, for whatever reason, you need to, you can do that.

➤ Terminology

To not cause too much confusion, here is a list of terms which you might come across.

  • Parser: Every file which this lib handles (document, footers and headers) has its own parser attached, since everything is relative to the underlying byte slice (aka. file).

  • Position: A Position is just a Start and End offset, relative to the byte slice of the document of a parser.

  • Run: Describes the pair <w:r> and </w:r> and thus has two Positions for the open and close tag. Since they are Positions, they have a Start and End Position which point to < and > of the tag. A run also consists of a TagPair.

  • Placeholder: A Placeholder is basically just a list of PlaceholderFragments representing a full placeholder extracted by a Parser.

  • PlaceholderFragment: A PlaceholderFragment is a parsed fragment of a placeholder, since those will most likely be ripped apart by WordprocessingML. The Placeholder {foo-bar-baz} might ultimately consist of 5 fragments ( {, foo-, bar-, baz, }). The fragment is at the heart of replacing. It knows which Run it belongs to and has methods for manipulating these byte offsets. Additionally, it has a Position which describes the offset inside the TagPair, since the fragments don't always start at the beginning of one (e.g. <w:t>some text {fragment-start</w:t>).

➤ How it works

This section will give you a short overview of what's actually going on. And honestly, it's a much-needed reference for my future self :D.

Overview

The project does rely on some invariants of the WordprocessingML spec which defines the docx structure. A good overview of the spec can be found at officeopenxml.com.

Since this project aims to work only on text within the document, it currently only focuses on the runs (<w:r> element). Text (<w:t> element) is always enclosed by a run, thus finding all runs inside the docx is the first step. Keep in mind that a run does not need to have a text element. It can also contain an image, for example. But all text literals will always be inside a run, within their respective text tags.

To illustrate that, here is how this might look inside the document.xml.

 <w:p>
    <w:r>
        <w:t>{key-with-dashes}</w:t>
    </w:r>
</w:p>

One can clearly see that replacing the {key-with-dashes} placeholder is quite simple. Just do a strings.Replace(), right? Wrong!

Although this might work 70-80% of the time, it will not work reliably. The reason is how the WordprocessingML spec is set up. It will fragment text literals based on many different circumstances.

For example, if you added half of the placeholder, saved and quit Word, and then added the second half of the placeholder, it might happen (in order to preserve session history) that the placeholder will look something like this (simplified).

 <w:p>
    <w:r>
        <w:t>{key-</w:t>
    </w:r>
    <w:r>
        <w:t>with-dashes}</w:t>
    </w:r>
</w:p>

As you can clearly see, doing a simple replace doesn't do it in this case.
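To make the failure concrete, here is a minimal stdlib-only sketch. It does not use go-docx at all; naiveReplace and the two constants are names made up for this illustration, and the XML is the simplified example from above.

```go
package main

import (
	"fmt"
	"strings"
)

// intactXML holds the placeholder in a single text element.
const intactXML = `<w:p><w:r><w:t>{key-with-dashes}</w:t></w:r></w:p>`

// fragmentedXML holds the same placeholder split across two runs.
const fragmentedXML = `<w:p><w:r><w:t>{key-</w:t></w:r><w:r><w:t>with-dashes}</w:t></w:r></w:p>`

// naiveReplace substitutes the placeholder with a plain string replacement.
func naiveReplace(xml, placeholder, value string) string {
	return strings.ReplaceAll(xml, placeholder, value)
}

func main() {
	// Works: the placeholder is one contiguous byte sequence.
	fmt.Println(naiveReplace(intactXML, "{key-with-dashes}", "REPLACE"))

	// Does nothing: the placeholder never appears contiguously in the bytes.
	out := naiveReplace(fragmentedXML, "{key-with-dashes}", "REPLACE")
	fmt.Println("unchanged:", out == fragmentedXML)
}
```

The second call returns its input untouched, and nothing signals that the placeholder was missed.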

Premises

In order to achieve the goal of reliably replacing values inside a docx archive, the following premises are considered:

  • Text literals are always inside <w:t> tags
  • <w:t> tags only occur inside <w:r> tags
  • All placeholders are delimited with predefined runes ({ and } in this case)
  • Placeholders cannot be nested (e.g. {foo {bar}})
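The first two premises can be tried out with the standard library alone. The following is only a sketch and not how go-docx parses (the library works on byte offsets): it walks the XML tokens with encoding/xml and collects the character data found inside elements whose local name is "t". textLiterals is a hypothetical helper, and matching on the local name alone is an assumption of this sketch.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// textLiterals returns the character data of every element with the
// local name "t" (i.e. <w:t>) found in doc, in document order.
func textLiterals(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var texts []string
	inText := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return texts, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "t" {
				inText = true
			}
		case xml.EndElement:
			if t.Name.Local == "t" {
				inText = false
			}
		case xml.CharData:
			if inText {
				// string(t) copies the bytes, which the decoder may reuse.
				texts = append(texts, string(t))
			}
		}
	}
}

func main() {
	doc := `<w:p><w:r><w:t>{key-</w:t></w:r><w:r><w:t>with-dashes}</w:t></w:r></w:p>`
	texts, err := textLiterals(doc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", texts)
}
```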

Order of operations

Here I will outline what happens in order to achieve the said goal.

  1. Open the *.docx file specified and extract all files in which replacement should take place. Currently, the files extracted are word/document.xml, word/footer<X>.xml and word/header<X>.xml. Replacing content which resides in other files requires a modification to the library.

  2. First XML pass. Iterate over a given file (e.g. the document.xml) and find all <w:r> and </w:r> tags inside the bytes of the file. Remember the positions given by the custom io.Reader implementation. Note: singleton tags (e.g. <w:r/>) are handled correctly.

  3. Second XML pass. Basically the same as the first pass, just this time the text tags (<w:t>) inside the found runs are extracted.

  4. Placeholder extraction. At this point all text literals are known by their offset inside the file. Using the premise that no placeholder nesting is allowed, the placeholder fragments can be extracted from the text runs. At the end a placeholder may be described by X fragments. The result of the extraction is the knowledge of which placeholders are located inside the document and at which positions the fragments start and end.

  5. Make use of the positions and replace some content. This is the stage where all the placeholders need to be replaced by their expected values given in a PlaceholderMap. The process can roughly be outlined in two steps:

    • The first fragment of the placeholder (e.g. {foo-) is replaced by the actual value. This also explains why one only has to style the first fragment inside the document. As you cannot see the fragments it is still a good idea to style the whole placeholder as needed.
    • All other fragments of the placeholders are cut out, removing the leftovers.

All the steps taken in 5. require cumbersome shifting of the offsets. This is the tricky part where the most debugging happened (gosh, so many offsets). The given explanation is definitely enough to grasp the concept, leaving out the messy bits.
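The idea of steps 2 to 5 can be sketched in a few lines of stdlib Go. This is a simplified stand-in, not the library's implementation: it finds text elements with a single regex instead of two XML passes, and it rebuilds the document in one sweep instead of shifting offsets. replaceFragmented and textRe are hypothetical names, and the value is inserted as-is without XML escaping.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// textRe matches a <w:t> element (with optional attributes) and captures its text.
var textRe = regexp.MustCompile(`<w:t(?:\s[^>]*)?>([^<]*)</w:t>`)

// replaceFragmented replaces every {key} with value, even if the placeholder
// is spread over several <w:t> elements. The fragment holding the opening
// delimiter receives the value, all other fragments are cut out.
func replaceFragmented(doc, key, value string) string {
	placeholder := "{" + key + "}"
	matches := textRe.FindAllStringSubmatchIndex(doc, -1)

	// Join all text literals so that fragmented placeholders become contiguous.
	var joined strings.Builder
	for _, m := range matches {
		joined.WriteString(doc[m[2]:m[3]])
	}
	text := joined.String()

	// Mark where placeholders start and which bytes belong to one.
	starts := make([]bool, len(text))
	inside := make([]bool, len(text))
	for from := 0; ; {
		at := strings.Index(text[from:], placeholder)
		if at < 0 {
			break
		}
		at += from
		starts[at] = true
		for i := at; i < at+len(placeholder); i++ {
			inside[i] = true
		}
		from = at + len(placeholder)
	}

	// Rebuild the document: markup is copied, text bytes are kept, replaced or cut.
	var out strings.Builder
	docPos, textPos := 0, 0
	for _, m := range matches {
		out.WriteString(doc[docPos:m[2]])
		for i := m[2]; i < m[3]; i++ {
			switch {
			case starts[textPos]:
				out.WriteString(value) // first fragment takes the value
			case !inside[textPos]:
				out.WriteByte(doc[i]) // ordinary text is kept
			}
			textPos++
		}
		docPos = m[3]
	}
	out.WriteString(doc[docPos:])
	return out.String()
}

func main() {
	doc := `<w:p><w:r><w:t>{key-</w:t></w:r><w:r><w:t>with-dashes}</w:t></w:r></w:p>`
	fmt.Println(replaceFragmented(doc, "key-with-dashes", "REPLACE"))
}
```

Note that the second run stays in the document with an empty text element, which mirrors the "cut out the leftovers" step: only text bytes are removed, the tags remain.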

➤ License

This software is licensed under the MIT license.

Documentation

Constants

const (
	// RunElementName is the local name of the XML tag for runs (<w:r>, </w:r> and <w:r/>)
	RunElementName = "r"
	// TextElementName is the local name of the XML tag for text-runs (<w:t> and </w:t>)
	TextElementName = "t"
)
const (
	// OpenDelimiter defines the opening delimiter for the placeholders used inside a docx-document.
	OpenDelimiter rune = '{'
	// CloseDelimiter defines the closing delimiter for the placeholders used inside a docx-document.
	CloseDelimiter rune = '}'
)
const (
	// DocumentXml is the relative path where the actual document content resides inside the docx-archive.
	DocumentXml = "word/document.xml"
)

Variables

var (
	// HeaderPathRegex matches all header files inside the docx-archive.
	HeaderPathRegex = regexp.MustCompile(`word/header[0-9]*.xml`)
	// FooterPathRegex matches all footer files inside the docx-archive.
	FooterPathRegex = regexp.MustCompile(`word/footer[0-9]*.xml`)
)
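
As a quick check of what these two patterns accept, the sketch below compiles copies of them and tests a few archive paths. isReplaceTarget is a hypothetical helper; that the library selects files by a plain unanchored match like this is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// Local copies of the patterns shown above.
var (
	headerPathRegex = regexp.MustCompile(`word/header[0-9]*.xml`)
	footerPathRegex = regexp.MustCompile(`word/footer[0-9]*.xml`)
)

// isReplaceTarget reports whether path is the main document or matches
// one of the header/footer patterns.
func isReplaceTarget(path string) bool {
	return path == "word/document.xml" ||
		headerPathRegex.MatchString(path) ||
		footerPathRegex.MatchString(path)
}

func main() {
	for _, p := range []string{
		"word/document.xml",
		"word/header1.xml",
		"word/footer12.xml",
		"word/styles.xml",
	} {
		fmt.Println(p, isReplaceTarget(p))
	}
}
```

Since the dot before xml is not escaped and the patterns are not anchored, they are slightly more permissive than the file names suggest. This is harmless for the fixed file layout of a docx archive.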
var (
	// RunOpenTagRegex matches all OpenTags for runs, including eventually set attributes
	RunOpenTagRegex = regexp.MustCompile(`(<w:r).*>`)
	// RunCloseTagRegex matches the close tag of runs
	RunCloseTagRegex = regexp.MustCompile(`(</w:r>)`)
	// RunSingletonTagRegex matches a singleton run tag
	RunSingletonTagRegex = regexp.MustCompile(`(<w:r/>)`)
	// TextOpenTagRegex matches all OpenTags for text-runs, including eventually set attributes
	TextOpenTagRegex = regexp.MustCompile(`(<w:t).*>`)
	// TextCloseTagRegex matches the close tag of text-runs
	TextCloseTagRegex = regexp.MustCompile(`(</w:t>)`)
	// ErrTagsInvalid is returned if the parsing failed and the result cannot be used.
	// Typically this means that one or more tag-offsets were not parsed correctly which
	// would cause the document to become corrupted as soon as replacing starts.
	ErrTagsInvalid = errors.New("one or more tags are invalid and will cause the XML to be corrupt")
)
var (
	// OpenDelimiterRegex is used to quickly match the opening delimiter and find its positions.
	OpenDelimiterRegex = regexp.MustCompile(string(OpenDelimiter))
	// CloseDelimiterRegex is used to quickly match the closing delimiter and find its positions.
	CloseDelimiterRegex = regexp.MustCompile(string(CloseDelimiter))
)
var (
	// ErrPlaceholderNotFound is returned if there is no placeholder inside the document.
	ErrPlaceholderNotFound = errors.New("placeholder not found in document")
)

Functions

func AddPlaceholderDelimiter

func AddPlaceholderDelimiter(s string) string

AddPlaceholderDelimiter will wrap the given string with OpenDelimiter and CloseDelimiter. If the given string is already a delimited placeholder, it is returned unchanged.

func IsDelimitedPlaceholder

func IsDelimitedPlaceholder(s string) bool

IsDelimitedPlaceholder returns true if the given string is a delimited placeholder. It checks whether the first and last rune in the string is the OpenDelimiter and CloseDelimiter respectively. If the string is empty, false is returned.

func NewFragmentID

func NewFragmentID() int

NewFragmentID returns the next Fragment.ID

func NewRunID

func NewRunID() int

NewRunID returns the next Run.ID

func RemovePlaceholderDelimiter

func RemovePlaceholderDelimiter(s string) string

RemovePlaceholderDelimiter removes OpenDelimiter and CloseDelimiter from the given text. If the given text is not a delimited placeholder, it is returned unchanged.
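
The documented behavior of the three delimiter helpers can be mirrored in a small standalone sketch. The lowercase functions below are re-implementations written for illustration from the descriptions above; they are not the library's source.

```go
package main

import "fmt"

const (
	openDelimiter  = '{'
	closeDelimiter = '}'
)

// isDelimited reports whether the first and last rune of s are the
// open and close delimiter. An empty string is never delimited.
func isDelimited(s string) bool {
	if s == "" {
		return false
	}
	runes := []rune(s)
	return runes[0] == openDelimiter && runes[len(runes)-1] == closeDelimiter
}

// addDelimiter wraps s in delimiters unless it is already delimited.
func addDelimiter(s string) string {
	if isDelimited(s) {
		return s
	}
	return string(openDelimiter) + s + string(closeDelimiter)
}

// removeDelimiter strips the delimiters if s is delimited, otherwise
// it returns s unchanged. Both delimiters are single-byte runes.
func removeDelimiter(s string) string {
	if !isDelimited(s) {
		return s
	}
	return s[1 : len(s)-1]
}

func main() {
	fmt.Println(addDelimiter("key"))      // {key}
	fmt.Println(addDelimiter("{key}"))    // {key}
	fmt.Println(removeDelimiter("{key}")) // key
	fmt.Println(removeDelimiter("key"))   // key
}
```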

func ResetFragmentIdCounter

func ResetFragmentIdCounter()

ResetFragmentIdCounter will reset the fragmentId counter to 0

func ResetRunIdCounter

func ResetRunIdCounter()

ResetRunIdCounter will reset the runId counter to 0

func ValidatePositions

func ValidatePositions(document []byte, runs []*Run) error

ValidatePositions will iterate over all runs and their texts (if any) and ensure that they match their respective regex. If the validation failed, the replacement will not work since offsets are wrong.

Types

type Document

type Document struct {
	// contains filtered or unexported fields
}

Document exposes the main API of the library. It represents the actual docx document which is going to be modified. Although a 'docx' document actually consists of multiple xml files, that fact is not exposed via the Document API. All actions on the Document propagate through the files of the docx-zip-archive.

func Open

func Open(path string) (*Document, error)

Open will open and parse the file pointed to by path. The file must be a valid docx file or an error is returned.

func OpenBytes

func OpenBytes(b []byte) (*Document, error)

OpenBytes creates a Document from a byte slice. It behaves just like Open().

Note: In this case, the docxFile property will be nil!

func (*Document) Close

func (d *Document) Close()

Close will close everything :)

func (*Document) GetFile

func (d *Document) GetFile(fileName string) []byte

GetFile returns the content of the given fileName if it exists.

func (*Document) Placeholders

func (d *Document) Placeholders() (placeholders []*Placeholder)

Placeholders returns all placeholders from the docx document.

func (*Document) Replace

func (d *Document) Replace(key, value string) error

Replace will attempt to replace the given key with the value in every file.

func (*Document) ReplaceAll

func (d *Document) ReplaceAll(placeholderMap PlaceholderMap) error

ReplaceAll will iterate over all files and perform the replacement according to the PlaceholderMap.

func (*Document) Runs

func (d *Document) Runs() (runs []*Run)

Runs returns all runs from all parsed files.

func (*Document) SetFile

func (d *Document) SetFile(fileName string, fileBytes []byte) error

SetFile allows setting the file contents of the given file. The fileName must be known, otherwise an error is returned.

func (*Document) Write

func (d *Document) Write(writer io.Writer) error

Write is responsible for assembling a new .docx docxFile using the modified data as well as all remaining files. Docx files are basically zip archives with many XMLs included. Files which cannot be modified through this lib will just be read from the original docx and copied into the writer.
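
The copy-or-replace idea behind Write can be sketched with archive/zip. This is an illustration under assumptions, not the library's code: rewriteArchive, buildArchive and readEntry are hypothetical helpers, and the archive is handled fully in memory to keep the example self-contained.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// rewriteArchive copies every entry of the zip archive in src into a new
// archive. Entries listed in modified are written from the map instead.
func rewriteArchive(src []byte, modified map[string][]byte) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(src), int64(len(src)))
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, f := range zr.File {
		dst, err := zw.Create(f.Name)
		if err != nil {
			return nil, err
		}
		if data, ok := modified[f.Name]; ok {
			if _, err := dst.Write(data); err != nil {
				return nil, err
			}
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		_, err = io.Copy(dst, rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// buildArchive creates an in-memory zip archive from name => content.
func buildArchive(files map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for name, content := range files {
		f, err := zw.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := io.WriteString(f, content); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readEntry returns the content of the named entry of the archive.
func readEntry(archive []byte, name string) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return "", err
	}
	for _, f := range zr.File {
		if f.Name != name {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		defer rc.Close()
		data, err := io.ReadAll(rc)
		return string(data), err
	}
	return "", fmt.Errorf("entry %q not found", name)
}

func main() {
	src, err := buildArchive(map[string]string{
		"word/document.xml": "<old/>",
		"word/styles.xml":   "<styles/>",
	})
	if err != nil {
		panic(err)
	}
	out, err := rewriteArchive(src, map[string][]byte{"word/document.xml": []byte("<new/>")})
	if err != nil {
		panic(err)
	}
	doc, _ := readEntry(out, "word/document.xml")
	styles, _ := readEntry(out, "word/styles.xml")
	fmt.Println(doc, styles)
}
```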

func (*Document) WriteToFile

func (d *Document) WriteToFile(file string) error

WriteToFile will write the document to a new file. It is important to note that the target file cannot be the same as the path of this document. If the path is not yet created, the function will attempt to MkdirAll() before creating the file.

type DocumentRuns

type DocumentRuns []*Run

DocumentRuns is a convenience type used to describe a slice of runs. It also implements Push() and Pop(), which allow it to be used as a LIFO stack.

func (*DocumentRuns) Pop

func (dr *DocumentRuns) Pop() *Run

Pop will return the last Run added to the stack and remove it.

func (*DocumentRuns) Push

func (dr *DocumentRuns) Push(run *Run)

Push will push a new Run onto the DocumentRuns stack
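
A slice-backed LIFO stack of this shape can be sketched as follows. The types run and runStack are stand-ins invented for this example, and returning nil when popping an empty stack is a choice of the sketch, not documented library behavior.

```go
package main

import "fmt"

// run is a stand-in for the library's Run type.
type run struct{ id int }

// runStack is a slice of runs that can be used as a LIFO stack.
type runStack []*run

// push appends r to the top of the stack.
func (s *runStack) push(r *run) {
	*s = append(*s, r)
}

// pop removes and returns the most recently pushed run,
// or nil if the stack is empty.
func (s *runStack) pop() *run {
	old := *s
	if len(old) == 0 {
		return nil
	}
	r := old[len(old)-1]
	*s = old[:len(old)-1]
	return r
}

func main() {
	var stack runStack
	stack.push(&run{id: 1})
	stack.push(&run{id: 2})
	fmt.Println(stack.pop().id) // 2
	fmt.Println(stack.pop().id) // 1
	fmt.Println(stack.pop())    // <nil>
}
```

Both methods need pointer receivers because they change the length of the underlying slice.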

func (DocumentRuns) WithText

func (dr DocumentRuns) WithText() DocumentRuns

WithText returns all runs with the HasText flag set

type FileMap

type FileMap map[string][]byte

FileMap is just a convenience type for the map of fileName => fileBytes

func (FileMap) Write

func (fm FileMap) Write(writer io.Writer, filename string) error

Write will try to write the bytes from the map into the given writer.

type Placeholder

type Placeholder struct {
	Fragments []*PlaceholderFragment
}

Placeholder is the internal representation of a parsed placeholder from the docx-archive. A placeholder usually consists of multiple PlaceholderFragments which specify the relative byte-offsets of the fragment inside the underlying byte-data.

func ParsePlaceholders

func ParsePlaceholders(runs DocumentRuns, docBytes []byte) (placeholders []*Placeholder, err error)

ParsePlaceholders will, given the document run positions and the bytes, parse out all placeholders including their fragments.

func (Placeholder) EndPos

func (p Placeholder) EndPos() int64

EndPos returns the absolute end position of the placeholder.

func (Placeholder) StartPos

func (p Placeholder) StartPos() int64

StartPos returns the absolute start position of the placeholder.

func (Placeholder) Text

func (p Placeholder) Text(docBytes []byte) string

Text assembles the placeholder fragments using the given docBytes and returns the full placeholder literal.

func (Placeholder) Valid

func (p Placeholder) Valid() bool

Valid determines whether the placeholder can be used. A placeholder is considered valid, if all fragments are valid.

type PlaceholderFragment

type PlaceholderFragment struct {
	ID       int      // ID is used to identify the fragments globally.
	Position Position // Position of the actual fragment within the run text. 0 == (Run.Text.OpenTag.End + 1)
	Number   int      // numbering fragments for ease of use. Numbering is scoped to placeholders.
	Run      *Run
}

PlaceholderFragment is a part of a placeholder within the document.xml. If the full placeholder is e.g. '{foo-bar}', the placeholder might be ripped apart according to the WordprocessingML spec. So it will most likely occur that the placeholders are split into multiple fragments (e.g. '{foo' and '-bar}').

func NewPlaceholderFragment

func NewPlaceholderFragment(number int, pos Position, run *Run) *PlaceholderFragment

NewPlaceholderFragment returns an initialized PlaceholderFragment with a new, auto-incremented, ID.

func (PlaceholderFragment) EndPos

func (p PlaceholderFragment) EndPos() int64

EndPos returns the absolute end position of the fragment.

func (*PlaceholderFragment) ShiftAll

func (p *PlaceholderFragment) ShiftAll(deltaLength int64)

ShiftAll will shift all fragment position markers by the given amount. The function is used if the underlying byte-data changed and the whole PlaceholderFragment needs to be shifted to a new position to be correct again.

For example, 10 bytes were added to the document and this PlaceholderFragment is positioned after that change inside the document. In that case one needs to shift the fragment by +10 bytes using ShiftAll(10).

func (*PlaceholderFragment) ShiftCut

func (p *PlaceholderFragment) ShiftCut(cutLength int64)

ShiftCut will shift the fragment position markers in such a way that the fragment can be considered empty. This is used in order to preserve the correct positions of the tags.

The function is used if the actual value (text-run value) of the fragment has been removed. For example, the fragment text was 'remove-me' (9 bytes). If that data was removed from the document, the positions (not all positions) of the fragment need to be adjusted. The text positions are set equal (start == end).

func (*PlaceholderFragment) ShiftReplace

func (p *PlaceholderFragment) ShiftReplace(deltaLength int64)

ShiftReplace is used to adjust the fragment positions after the text value has been replaced. The function is used if the text value of the fragment has been replaced with different bytes. For example, the fragment text was 'placeholder' (11 bytes), which is replaced with 'a-super-awesome-value' (21 bytes). In that case the deltaLength would be 10. In order to account for the change in bytes you'd need to call ShiftReplace(10).
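
The arithmetic from this example, together with the ShiftAll case described further up, can be replayed in a standalone sketch. The span type and both functions are hypothetical simplifications using half-open byte ranges; the real fragments track more positions than this.

```go
package main

import "fmt"

// span is a half-open byte range [start, end) inside the document.
type span struct{ start, end int }

// shift moves a span that lies behind a change by delta bytes.
func (s span) shift(delta int) span {
	return span{s.start + delta, s.end + delta}
}

// replaceSpan swaps doc[target.start:target.end] for value and returns
// the new document together with the change in length.
func replaceSpan(doc []byte, target span, value string) ([]byte, int) {
	out := make([]byte, 0, len(doc)+len(value))
	out = append(out, doc[:target.start]...)
	out = append(out, value...)
	out = append(out, doc[target.end:]...)
	return out, len(value) - (target.end - target.start)
}

func main() {
	doc := []byte("<w:t>placeholder</w:t><w:t>tail</w:t>")
	first := span{5, 16}  // "placeholder", 11 bytes
	later := span{27, 31} // "tail", located after the change

	out, delta := replaceSpan(doc, first, "a-super-awesome-value")
	fmt.Println(delta) // 10

	// Without the shift, later would point into the wrong bytes.
	later = later.shift(delta)
	fmt.Println(string(out[later.start:later.end])) // tail
}
```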

func (PlaceholderFragment) StartPos

func (p PlaceholderFragment) StartPos() int64

StartPos returns the absolute start position of the fragment.

func (PlaceholderFragment) String

func (p PlaceholderFragment) String(docBytes []byte) string

String spits out the most important bits and pieces of a fragment and can be used for debugging purposes.

func (PlaceholderFragment) Text

func (p PlaceholderFragment) Text(docBytes []byte) string

Text returns the actual text of the fragment given the source bytes. If the given byte slice is not large enough for the offsets, an empty string is returned.

func (PlaceholderFragment) TextLength

func (p PlaceholderFragment) TextLength(docBytes []byte) int64

TextLength returns the actual length of the fragment given a byte source.

func (PlaceholderFragment) Valid

func (p PlaceholderFragment) Valid() bool

Valid returns true if all positions of the fragment are valid.

type PlaceholderMap

type PlaceholderMap map[string]interface{}

PlaceholderMap is the type used to map the placeholder keys (without delimiters) to the replacement values

type Position

type Position struct {
	Start int64
	End   int64
}

Position is a generic position of a tag, represented by byte offsets

func (Position) Match

func (p Position) Match(regexp *regexp.Regexp, data []byte) bool

Match will apply a MatchString using the given regex on the given data and returns true if the position matches the regex inside the data.

func (Position) Valid

func (p Position) Valid() bool

Valid returns true if Start <= End. Only then can the position be used; otherwise there will be a 'slice out of bounds' along the way.
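
A sketch of how such a position could guard slicing and matching is shown below. The lowercase position type, its slice helper and the half-open interpretation of Start and End are assumptions made for the example. Only the run open tag pattern is taken from the variables listed above.

```go
package main

import (
	"fmt"
	"regexp"
)

// runOpenTag is a local copy of the documented RunOpenTagRegex pattern.
var runOpenTag = regexp.MustCompile(`(<w:r).*>`)

// position is a byte range, treated here as half-open: [start, end).
type position struct{ start, end int64 }

// valid reports whether start <= end.
func (p position) valid() bool { return p.start <= p.end }

// slice returns the bytes covered by p, or nil if p is invalid or
// does not fit into data.
func (p position) slice(data []byte) []byte {
	if !p.valid() || p.start < 0 || p.end > int64(len(data)) {
		return nil
	}
	return data[p.start:p.end]
}

// match reports whether the bytes covered by p match re.
func (p position) match(re *regexp.Regexp, data []byte) bool {
	b := p.slice(data)
	return b != nil && re.Match(b)
}

func main() {
	data := []byte(`<w:r><w:t>x</w:t></w:r>`)
	fmt.Println(position{0, 5}.match(runOpenTag, data))   // true: "<w:r>"
	fmt.Println(position{5, 0}.match(runOpenTag, data))   // false: start > end
	fmt.Println(position{0, 100}.match(runOpenTag, data)) // false: out of range
}
```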

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader is a very basic io.Reader implementation which is capable of returning the current position.

func NewReader

func NewReader(s string) *Reader

NewReader returns a new Reader given a string source.

func (*Reader) Len

func (r *Reader) Len() int

Len returns the current length of the stream which has been read.

func (*Reader) Pos

func (r *Reader) Pos() int64

Pos returns the current position which the reader is at.

func (*Reader) Read

func (r *Reader) Read(b []byte) (int, error)

Read implements the io.Reader interface.

func (*Reader) ReadByte

func (r *Reader) ReadByte() (byte, error)

ReadByte implements the io.ByteReader interface.

func (*Reader) Size

func (r *Reader) Size() int64

Size returns the size of the string to read.

func (*Reader) String

func (r *Reader) String() string

String implements the Stringer interface.
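
A reader that keeps track of its position needs very little code. The following posReader is a simplified re-implementation for illustration only, with invented names; it is not the library's Reader.

```go
package main

import (
	"fmt"
	"io"
)

// posReader reads from a string and remembers how far it has read.
type posReader struct {
	s   string
	pos int64
}

// Read implements io.Reader and advances the position by the bytes copied.
func (r *posReader) Read(b []byte) (int, error) {
	if r.pos >= int64(len(r.s)) {
		return 0, io.EOF
	}
	n := copy(b, r.s[r.pos:])
	r.pos += int64(n)
	return n, nil
}

// ReadByte implements io.ByteReader and advances the position by one.
func (r *posReader) ReadByte() (byte, error) {
	if r.pos >= int64(len(r.s)) {
		return 0, io.EOF
	}
	c := r.s[r.pos]
	r.pos++
	return c, nil
}

// Pos returns the offset of the next byte to be read.
func (r *posReader) Pos() int64 { return r.pos }

func main() {
	r := &posReader{s: "<w:r>"}
	buf := make([]byte, 3)
	n, _ := r.Read(buf)
	fmt.Println(n, r.Pos()) // 3 3

	c, _ := r.ReadByte()
	fmt.Println(string(c), r.Pos()) // r 4
}
```

Implementing ReadByte matters when such a reader is handed to a consumer like encoding/xml, which wraps readers without it in a buffered reader; the buffer reads ahead, so the reported position would no longer correspond to the bytes actually consumed.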

type Replacer

type Replacer struct {
	ReplaceCount int
	BytesChanged int64
	// contains filtered or unexported fields
}

Replacer is the key struct which works on the parsed DOCX document.

func NewReplacer

func NewReplacer(docBytes []byte, placeholder []*Placeholder) *Replacer

NewReplacer returns a new Replacer.

func (*Replacer) Bytes

func (r *Replacer) Bytes() []byte

Bytes returns the document bytes. If called after Replace(), the bytes will be modified.

func (*Replacer) Replace

func (r *Replacer) Replace(placeholderKey string, value string) error

Replace will replace all occurrences of the placeholderKey with the given value. The function is synced with a mutex as it is not concurrency safe.

type Run

type Run struct {
	TagPair
	ID      int
	Text    TagPair // Text is the <w:t> tag pair which is always within a run and cannot be standalone.
	HasText bool
}

Run defines a non-block region of text with a common set of properties. It is specified with the <w:r> element. In our case the run is specified by four byte positions (start and end tag).

func NewEmptyRun

func NewEmptyRun() *Run

NewEmptyRun returns a new, empty run which has only an ID set.

func (*Run) GetText

func (r *Run) GetText(documentBytes []byte) string

GetText returns the text of the run, if any. If the run does not have a text or the given byte slice is too small, an empty string is returned.

func (*Run) String

func (r *Run) String(bytes []byte) string

String returns a string representation of the run, given the source bytes. It may be helpful in debugging.

type RunParser

type RunParser struct {
	// contains filtered or unexported fields
}

RunParser can parse a list of Runs from a given byte slice.

func NewRunParser

func NewRunParser(doc []byte) *RunParser

NewRunParser returns an initialized RunParser given the source-bytes.

func (*RunParser) Execute

func (parser *RunParser) Execute() error

Execute will fire up the parser. The parser will do two passes on the given document. First, all <w:r> tags are located and marked. Then, inside those run tags, the <w:t> tags are located.

func (*RunParser) Runs

func (parser *RunParser) Runs() DocumentRuns

Runs returns all runs found by the parser.

type TagPair

type TagPair struct {
	OpenTag  Position
	CloseTag Position
}

TagPair describes an opening and closing tag position.

Directories

Path Synopsis
examples
