Documentation ¶
Overview ¶
Package xmlutils is actually more like content-analysis-utils, including the ability to process XML files, both with and without a DOCTYPE.
Index ¶
- Variables
- func DoParseRaw_xml(s string) (xtokens []CT.CToken, err error)
- func DoParse_xml(s string) (xtokens []CT.CToken, err error)
- func DoParse_xml_locationAware(s string) (xtokens []CT.LAToken, err error)
- func NewConfiguredDecoder(r io.Reader) *xml.Decoder
- type CommonCPR
- type ContentityBasics
- type ContypingInfo
- type DitaContype
- type DitaFlavor
- type DoctypeMType
- type KeyElmTriplet
- type MType
- type NS
- type NSsnapshot
- type PIDFPIfields
- type PIDSIDcatalogFileRecord
- type ParsedDoctype
- type ParsedPreamble
- type ParserResults_xml
- type SliceBounds
- type XmlCatalogFile
- type XmlContype
- type XmlDoctype
- type XmlPeek
- type XmlPublicID
- type XmlSystemID
Constants ¶
This section is empty.
Variables ¶
var DITArootElms = []string{
"topic", "concept", "reference", "task", "bookmap",
"map", "glossentry", "glossgroup"}
DITArootElms are all the XML root elements that can be classified as DITA-type. Note that LwDITA uses only "topic". 2024.04: Add "map"!
var DITAtypeFileExtensions = []string{".dita", ".ditamap", ".ditaval"}
DITAtypeFileExtensions are all the file extensions that are automatically classified as being DITA-type.
var DTDtypeFileExtensions = []string{".dtd", ".mod", ".ent"}
DTDtypeFileExtensions are all the file extensions that are automatically classified as being DTD-type.
var DTMTmap = []DoctypeMType{
{"html", "html/cnt/html5", "html", false, true},
{"//DTD LIGHTWEIGHT DITA Topic//", "xml/cnt/topic", "topic", true, true},
{"//DTD LW DITA Topic//", "xml/cnt/topic", "topic", true, true},
{"//DTD XDITA Topic//", "html/cnt/topic", "topic", true, true},
{"//DTD LIGHTWEIGHT DITA Map//", "xml/map/---", "map", true, true},
{"//DTD LW DITA Map//", "xml/map/---", "map", true, true},
{"//DTD XDITA Map//", "html/map/---", "map", true, true},
{"//DTD DITA Concept//", "xml/cnt/concept", "concept", false, false},
{"//DTD DITA Topic//", "xml/cnt/topic", "topic", false, false},
{"//DTD DITA Task//", "xml/cnt/task", "task", false, false},
{"//DTD HTML 4.", "html/cnt/html4", "html", false, false},
{"//DTD XHTML 1.0 ", "html/cnt/xhtml1.0", "html", false, false},
{"//DTD XHTML 1.1//", "html/cnt/xhtml1.1", "html", false, false},
{"//DTD MathML 2.0//", "html/cnt/mathml", "", false, false},
{"//DTD SVG 1.0//", "xml/img/svg1.0", "svg", false, false},
{"//DTD SVG 1.1", "xml/img/svg", "svg", false, false},
{"//DTD XHTML Basic 1.1//", "html/cnt/topic", "html", false, false},
{"//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//", "html/cnt/blarg", "html", false, false},
}
DTMTmap maps DOCTYPEs to MTypes, and also records whether each one is LwDITA. This list should suffice for all ordinary XML files (except of course Docbook).
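The lookup that DTMTmap supports can be sketched as follows. This is a minimal illustration, not the package's implementation: it assumes a plain first-match substring test, and the names doctypeMType, sampleMap, and lookupMType are hypothetical. The excerpt omits the bare "html" entry, because under plain substring matching it would shadow most HTML-family entries; how the package itself resolves that is not stated here.

```go
package main

import (
	"fmt"
	"strings"
)

// doctypeMType mirrors the fields of DoctypeMType for this sketch only.
type doctypeMType struct {
	ToMatch       string
	DoctypesMType string
	RootElm       string
	IsLwDITA      bool
	IsInScope     bool
}

// sampleMap is a small excerpt of DTMTmap.
var sampleMap = []doctypeMType{
	{"//DTD LIGHTWEIGHT DITA Topic//", "xml/cnt/topic", "topic", true, true},
	{"//DTD DITA Concept//", "xml/cnt/concept", "concept", false, false},
	{"//DTD SVG 1.1", "xml/img/svg", "svg", false, false},
}

// lookupMType (hypothetical) returns the first entry whose ToMatch
// occurs in the DOCTYPE string, or nil if no entry matches.
// Substring matching is an assumption made for this sketch.
func lookupMType(doctype string) *doctypeMType {
	for i := range sampleMap {
		if strings.Contains(doctype, sampleMap[i].ToMatch) {
			return &sampleMap[i]
		}
	}
	return nil
}

func main() {
	dt := `<!DOCTYPE topic PUBLIC "-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN">`
	if e := lookupMType(dt); e != nil {
		fmt.Println(e.DoctypesMType, e.RootElm, e.IsLwDITA)
	}
}
```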
var DitaContypes = []DitaContype{"Map", "Bookmap", "Topic", "Task", "Concept",
"Reference", "Dita", "Glossary", "Conrefs", "LwMap", "LwTopic"}
DitaContypes - see "type DitaContype".
var DitaFlavors = []DitaFlavor{"1.2", "1.3", "XDITA", "HDITA", "MDATA"}
DitaFlavors - see "type DitaFlavor".
var HtmlKeyContentElms = []string{"main", "content"}
HtmlKeyContentElms are elements that often surround the actual page content.
var HtmlSectioningContentElms = []string{"article", "aside", "nav", "section"}
HtmlSectioningContentElms have internal sections and subsections.
var HtmlSectioningRootElms = []string{
"blockquote", "body", "details", "dialog", "fieldset", "figure", "td"}
HtmlSectioningRootElms have their OWN outlines, separate from the outlines of their ancestors, i.e. self-contained hierarchies.
var HtmlSelfClosingTags = []string{
"area",
"base",
"br",
"col",
"command",
"embed",
"hr",
"img",
"input",
"keygen",
"link",
"meta",
"param",
"source",
"track",
"wbr",
}
var KeyElmTriplets = []*KeyElmTriplet{
{"html", "head", "body"},
{"topic", "prolog", "body"},
{"map", "topicmeta", ""},
{"reference", "", ""},
{"task", "", ""},
{"bookmap", "", ""},
{"glossentry", "", ""},
{"glossgroup", "", ""},
{"meta", "", ""},
}
var MarkdownFileExtensions = []string{".md", ".mdown", ".markdown", ".mkdn"}
MarkdownFileExtensions are all the file extensions that are automatically classified as being Markdown-type, even though we generally use a regex instead.
var MiscFileExtensions = []string{".sqlar"}
MiscFileExtensions are all the other file extensions that we want to process.
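A sketch of how the DITA, DTD, and Markdown extension lists could drive classification. The helper names hasExt and classifyByExtension are hypothetical, and lower-casing the extension before comparing is an assumption of this example rather than documented behavior.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Copies of the package's extension lists, repeated so the sketch is self-contained.
var (
	ditaExts     = []string{".dita", ".ditamap", ".ditaval"}
	dtdExts      = []string{".dtd", ".mod", ".ent"}
	markdownExts = []string{".md", ".mdown", ".markdown", ".mkdn"}
)

// hasExt reports whether ext appears in list.
func hasExt(list []string, ext string) bool {
	for _, e := range list {
		if e == ext {
			return true
		}
	}
	return false
}

// classifyByExtension (hypothetical) maps a file path to a coarse
// content family based only on its extension.
func classifyByExtension(path string) string {
	ext := strings.ToLower(filepath.Ext(path)) // case folding is an assumption
	switch {
	case hasExt(ditaExts, ext):
		return "DITA"
	case hasExt(dtdExts, ext):
		return "DTD"
	case hasExt(markdownExts, ext):
		return "Markdown"
	}
	return "unknown"
}

func main() {
	for _, p := range []string{"guide/intro.ditamap", "dtd/topic.mod", "README.md", "notes.txt"} {
		fmt.Println(p, "=>", classifyByExtension(p))
	}
}
```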
var NS_OASIS_XML_CATALOG = "urn:oasis:names:tc:entity:xmlns:xml:catalog:"
NS_OASIS_XML_CATALOG is the OASIS namespace for XML catalogs.
var NS_XML = "http://www.w3.org/XML/1998/namespace"
NS_XML is the XML namespace.
var STD_PREAMBLE CT.Raw = xml.Header
STD_PREAMBLE is "<?xml version="1.0" encoding="UTF-8"?>" + "\n"
var XML_NS_Recognized = []string{
"lang",
"space",
"base",
"id",
"Father",
}
XML_NS_Recognized lists the names that are recognized in the "xml:" namespace.
var XmlContypes = []XmlContype{"Unknown", "DTD", "DTDmod", "DTDent",
"RootTagData", "RootTagMixedContent", "MultipleRootTags", "INVALID"}
XmlContypes note: maybe DTDmod should be DTDelms.
Functions ¶
func DoParse_xml ¶
DoParse_xml takes a string, so we can assume that we can discard it after use because the caller has another copy of it. To be safe, it copies every token using `xml.CopyToken(T)`.
Types ¶
type CommonCPR ¶
type CommonCPR struct {
NodeDepths []int
FilePosns []*CT.FilePosition
CPR_raw string
// Writer is usually the GTokens Writer
Writer io.Writer
}
func NewCommonCPR ¶
func NewCommonCPR() *CommonCPR
type ContentityBasics ¶
type ContentityBasics struct {
// XmlRoot is not meaningful for non-XML
XmlRoot CT.Span
Text CT.Span
Meta CT.Span
// MetaFormat is? "YAML","XML"
MetaFormat string
// MetaProps uses dot separators if hierarchy is needed
MetaProps SU.PropSet
}
ContentityBasics has Raw,Root,Text,Meta,MetaProps and is embedded in XU.AnalysisRecord.
func (*ContentityBasics) CheckTopTags ¶
func (p *ContentityBasics) CheckTopTags() (bool, string)
CheckTopTags returns true if a root element was found, plus a message about any missing top-level constructs. It can also write warnings.
func (*ContentityBasics) HasNone ¶
func (p *ContentityBasics) HasNone() bool
func (*ContentityBasics) SetToNonXml ¶
func (p *ContentityBasics) SetToNonXml(L int)
SetToNonXml just needs the length of the content.
type ContypingInfo ¶
ContypingInfo has simple fields related to typing content (i.e. determining its type).
func (ContypingInfo) MultilineString ¶
func (p ContypingInfo) MultilineString() (s string)
func (*ContypingInfo) ParseDoctype ¶
func (pC *ContypingInfo) ParseDoctype(sRaw CT.Raw) (*ParsedDoctype, error)
ParseDoctype should probably NOT be a method on ContypingInfo!
ParseDoctype expects to receive a file extension plus a content type as determined by the HTTP stdlib. However, a DOCTYPE is always considered authoritative, so this func can ignore things like the file extension, and overwrite or set any field it wants to.
It works by first trying to match the DOCTYPE against a list. If that fails, stronger measures are called for.
Note two things about this function:
Firstly, it can handle PID, SID, or both:
<!DOCTYPE topic PUBLIC "-//OASIS//DTD LWDITA Topic//EN">
<!DOCTYPE topic PUBLIC "-//OASIS//DTD LWDITA Topic//EN" "./foo.dtd">
<!DOCTYPE topic SYSTEM "./foo.dtd">
Secondly, it can handle a less-than-complete declaration:
DOCTYPE topic PUBLIC "-//OASIS//DTD LWDITA Topic//EN" (and variations)
topic PUBLIC "-//OASIS//DTD LWDITA Topic//EN" (and variations)
PUBLIC "-//OASIS//DTD LWDITA Topic//EN" (and variations)
The last one is quite important because it is the format that appears in XML catalog files.
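The tolerant handling described above (PID, SID, or both, with or without the leading `<!DOCTYPE` and root element) can be sketched like this. It is an illustration under assumptions, not the package's parser: the function splitDoctype is hypothetical, only double-quoted identifiers are handled, and no validation of the identifiers is attempted.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// quoted matches a double-quoted string and captures its contents.
var quoted = regexp.MustCompile(`"([^"]*)"`)

// splitDoctype (hypothetical) extracts the root element, public ID and
// system ID from a complete or partial DOCTYPE declaration.
func splitDoctype(raw string) (root, pid, sid string) {
	s := strings.TrimSpace(raw)
	s = strings.TrimSuffix(strings.TrimPrefix(s, "<!"), ">")
	s = strings.TrimSpace(strings.TrimPrefix(strings.TrimSpace(s), "DOCTYPE"))

	// Collect the quoted identifiers in order of appearance.
	var ids []string
	for _, m := range quoted.FindAllStringSubmatch(s, -1) {
		ids = append(ids, m[1])
	}

	// Everything before the first quote holds the optional root element
	// and the optional PUBLIC or SYSTEM keyword.
	head := s
	if i := strings.Index(s, `"`); i >= 0 {
		head = s[:i]
	}
	fields := strings.Fields(head)
	keyword := ""
	if n := len(fields); n > 0 && (fields[n-1] == "PUBLIC" || fields[n-1] == "SYSTEM") {
		keyword = fields[n-1]
		fields = fields[:n-1]
	}
	if len(fields) > 0 {
		root = fields[len(fields)-1]
	}

	if keyword == "SYSTEM" {
		if len(ids) > 0 {
			sid = ids[0]
		}
		return root, pid, sid
	}
	if len(ids) > 0 {
		pid = ids[0]
	}
	if len(ids) > 1 {
		sid = ids[1]
	}
	return root, pid, sid
}

func main() {
	for _, d := range []string{
		`<!DOCTYPE topic PUBLIC "-//OASIS//DTD LWDITA Topic//EN" "./foo.dtd">`,
		`<!DOCTYPE topic SYSTEM "./foo.dtd">`,
		`PUBLIC "-//OASIS//DTD LWDITA Topic//EN"`,
	} {
		root, pid, sid := splitDoctype(d)
		fmt.Printf("root=%q pid=%q sid=%q\n", root, pid, sid)
	}
}
```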
func (ContypingInfo) String ¶
func (p ContypingInfo) String() (s string)
type DitaContype ¶
type DitaContype string
DitaContype is a [Lw]DITA Topic, Map, etc. See enumeration "DitaContypes".
type DitaFlavor ¶
type DitaFlavor string
DitaFlavor is a [Lw]DITA flavor. See enumeration "DitaFlavors".
type DoctypeMType ¶
type DoctypeMType struct {
ToMatch string
DoctypesMType string
RootElm string
IsLwDITA bool
// LwDITA, HTML5, and not much more (if any)
IsInScope bool
}
DoctypeMType maps a DOCTYPE string to an MType string and a bool, Is it LwDITA?
type KeyElmTriplet ¶
func GetKeyElmTriplet ¶
func GetKeyElmTriplet(localName string) *KeyElmTriplet
type MType ¶
type MType string
Possible TODO:
type XmlDoctypeFamily string
The XmlDoctypeFamilies are the broad groups of DOCTYPES.
var XmlDoctypeFamilies = []XmlDoctypeFamily{
"lwdita", "dita13", "dita", "html5", "html4", "svg", "mathml", "other",
}
type NS ¶
type NS struct {
// Prefix is the shorthand version.
Prefix string
// URI is the full version.
URI string
}
NS is e.g. { "xml", "http://www.w3.org/XML/1998/namespace" }
type NSsnapshot ¶
One of these has to be filled in for the NS declarations at the top of a content file. Also this can describe the NS state at any point in parsing or traversing a content tree.
type PIDFPIfields ¶
type PIDFPIfields struct {
// Registration is "+" or "-"
Registration string
// IsOasis but if not, then could be any of many others
IsOasis bool
// Organization is "OASIS" or maybe something else
Organization string
// PublicTextClass is typically "DTD" (filename.dtd)
// or "ELEMENTS" (filename.mod)
PublicTextClass string
// PublicTextDesc is the distinguishing string,
// e.g. PUBLIC "-//OASIS//DTD (_PublicTextDesc_)//EN".
// It can end with the root tag of the document
// (e.g. "Topic"). It can have an optional
// embedded version number, such as "DITA 1.3".
PublicTextDesc string
}
PIDFPIfields holds the parsed results of a PID (PublicID) a.k.a. Formal Public Identifier, for example "-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN"
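As a rough illustration of how such a Formal Public Identifier breaks down into these fields, the sketch below splits on the "//" separators. The type fpiFields and function parseFPI are hypothetical stand-ins, the Language field is an addition for this example, and only the four-part unregistered or registered form ("-" or "+" first) is handled.

```go
package main

import (
	"fmt"
	"strings"
)

// fpiFields is a stand-in for PIDFPIfields, used only in this sketch.
type fpiFields struct {
	Registration    string
	IsOasis         bool
	Organization    string
	PublicTextClass string
	PublicTextDesc  string
	Language        string // not part of PIDFPIfields; added for illustration
}

// parseFPI (hypothetical) splits an FPI of the form
// "-//Owner//Class Description//Language" into its parts.
func parseFPI(fpi string) (fpiFields, error) {
	parts := strings.Split(fpi, "//")
	if len(parts) != 4 {
		return fpiFields{}, fmt.Errorf("expected 4 fields separated by //, got %d", len(parts))
	}
	// The first word of the third field is the public text class;
	// the remainder is the public text description.
	class, desc, _ := strings.Cut(parts[2], " ")
	return fpiFields{
		Registration:    parts[0],
		IsOasis:         parts[1] == "OASIS",
		Organization:    parts[1],
		PublicTextClass: class,
		PublicTextDesc:  desc,
		Language:        parts[3],
	}, nil
}

func main() {
	f, err := parseFPI("-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", f)
}
```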
func (PIDFPIfields) String ¶
func (p PIDFPIfields) String() string
type PIDSIDcatalogFileRecord ¶
type PIDSIDcatalogFileRecord struct {
// XMLName probably does not ever need to be printed.
XMLName xml.Name `xml:"public"`
// XmlPublicID (PID) (FPI) is the DOCTYPE string
XmlPublicID `xml:"publicId,attr"`
PIDFPIfields // PublicID
// XmlSystemID is the path to the file. Typically a relative filepath.
XmlSystemID `xml:"uri,attr"`
// The filepath long form, as resolved.
// Note that we must use a string in order to avoid an import cycle.
AbsFilePath string // FU.AbsFilePath
HttpPath string
Err error // in case an entry barfs
}
PIDSIDcatalogFileRecord represents a line item from a parsed XML catalog file, specifically one with a simple structure, such as the catalog file for LwDITA. This same struct is also used to record the PID and/or SID of a DOCTYPE declaration.
func NewPIDSIDcatalogFileRecord ¶
func NewPIDSIDcatalogFileRecord(pid string, sid string) (*PIDSIDcatalogFileRecord, error)
NewPIDSIDcatalogFileRecord is pretty self-explanatory.
func NewSIDPIDcatalogRecordfromStartTag ¶
func NewSIDPIDcatalogRecordfromStartTag(ct CT.CToken) (pID *PIDSIDcatalogFileRecord, err error)
func (PIDSIDcatalogFileRecord) DString ¶
func (p PIDSIDcatalogFileRecord) DString() string
DString returns a comprehensive dump.
func (PIDSIDcatalogFileRecord) Echo ¶
func (p PIDSIDcatalogFileRecord) Echo() string
Echo returns the public ID _unquoted_. <!DOCTYPE topic "-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN">
func (*PIDSIDcatalogFileRecord) HasPID ¶
func (p *PIDSIDcatalogFileRecord) HasPID() bool
func (*PIDSIDcatalogFileRecord) HasSID ¶
func (p *PIDSIDcatalogFileRecord) HasSID() bool
func (PIDSIDcatalogFileRecord) String ¶
func (p PIDSIDcatalogFileRecord) String() string
String returns the juicy part. For example, <!DOCTYPE topic "-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN"> maps to "DTD LIGHTWEIGHT DITA Topic".
type ParsedDoctype ¶
type ParsedDoctype struct {
CT.Raw // Raw Doctype string
// PIDSIDcatalogFileRecord is the PID + SID.
PIDSIDcatalogFileRecord
// DTrootElm is the tag declared in the DOCTYPE, which
// should match the root tag in the text of the file.
DTrootElm string
// contains filtered or unexported fields
}
ParsedDoctype is a parse of a complete DOCTYPE declaration. For [Lw]DITA, what interests us is something like
PUBLIC "-//OASIS//DTD (PublicTextDesc)//EN"
or sometimes
PUBLIC "-//OASIS//ELEMENTS (PublicTextDesc)//EN"
and maybe followed by SYSTEM...
The structure of a DOCTYPE is like so:
- PUBLIC | SYSTEM = Availability
- - = Registration = Organization & DTD are not registered with ISO.
- OASIS = Organization
- DTD = Public Text Class (CAPACITY | CHARSET | DOCUMENT | DTD | ELEMENTS | ENTITIES | LPD | NONSGML | NOTATION | SHORTREF | SUBDOC | SYNTAX | TEXT )
- (*) = Public Text Description, incl. any version number
- EN = Public Text Language
- URL = optional, explicit
We don't include the raw DOCTYPE here because this structure can be optional but we still need to have the Doctype string in the DB as a separate column, even if it is empty (i.e. "").
type ParsedPreamble ¶
type ParsedPreamble struct {
// Do not include a trailing newline.
Preamble_raw string
// e.g. "0" means XML 1.0
MinorVersion string
// Valid values and forms are TBS.
Encoding string
// "yes" or "no"
IsStandalone bool
}
ParsedPreamble is a parse of an optional PI (processing instruction) at the start of an XML file. The most typical form is defined in the stdlib:
"<?xml version="1.0" encoding="UTF-8"?>" + "\n"
Here the major version MUST be 1. XML has a version 1.1 but nobody uses it, so also the minor version MUST be 0, because that's what the Go stdlib XML parser understands, and anything else is gonna cause crazy breakage. Fields:
<?xml version="version_number" <= required, "1.0"
encoding="encoding_declaration" <= optional, assume "UTF-8"
standalone="standalone_status" ?> <= optional, can be "yes", assume "no"
Probably any errors returned by this function should be panicked on, because any such error is pretty fundamental and also ridiculous. Note also that strictly speaking, an XML preamble is NOT a PI.
var STD_PreambleParsed ParsedPreamble
STD_PreambleParsed is our parse of variable "STD_PREAMBLE".
func ParsePreamble ¶
func ParsePreamble(sRaw CT.Raw) (*ParsedPreamble, error)
ParsePreamble parses an XML preamble, which (BTW) MUST be the first line in a file. XML version MUST be "1.0". Encoding handling is incomplete.
- Example: <?xml version="1.0" encoding='UTF-8' standalone="yes"?>
- Also OK: xml version="1.0" encoding='UTF-8' standalone="yes"
- Also OK: version="1.0" encoding='UTF-8' standalone="yes"
- Also OK: fields as documented for struct "ParsedPreamble".
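The accepted forms listed above all reduce to a run of name="value" pairs, so one way to sketch the parse is a regex over those pairs. The names preamble and parsePreamble below are hypothetical, the defaults follow the field notes for ParsedPreamble, and the package's real error handling may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// pseudoAttr matches name="value" or name='value'.
var pseudoAttr = regexp.MustCompile(`(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)')`)

// preamble is a stand-in for ParsedPreamble, used only in this sketch.
type preamble struct {
	MinorVersion string
	Encoding     string
	IsStandalone bool
}

// parsePreamble (hypothetical) accepts a full "<?xml ... ?>" preamble
// or just its name="value" pairs.
func parsePreamble(raw string) (*preamble, error) {
	p := &preamble{Encoding: "UTF-8"} // assumed default when encoding is absent
	sawVersion := false
	for _, m := range pseudoAttr.FindAllStringSubmatch(raw, -1) {
		val := m[2] + m[3] // at most one of the two quote groups matched
		switch m[1] {
		case "version":
			if val != "1.0" {
				return nil, fmt.Errorf("unsupported XML version %q", val)
			}
			p.MinorVersion = "0"
			sawVersion = true
		case "encoding":
			p.Encoding = val
		case "standalone":
			p.IsStandalone = val == "yes"
		}
	}
	if !sawVersion {
		return nil, fmt.Errorf("missing required version in %q", raw)
	}
	return p, nil
}

func main() {
	p, err := parsePreamble(`<?xml version="1.0" encoding='UTF-8' standalone="yes"?>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *p)
}
```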
type ParserResults_xml ¶
func GenerateParserResults_xml ¶
func GenerateParserResults_xml(s string) (*ParserResults_xml, error)
func (*ParserResults_xml) NodeCount ¶
func (p *ParserResults_xml) NodeCount() int
func (*ParserResults_xml) NodeDebug ¶
func (p *ParserResults_xml) NodeDebug(i int) string
func (*ParserResults_xml) NodeEcho ¶
func (p *ParserResults_xml) NodeEcho(i int) string
func (*ParserResults_xml) NodeInfo ¶
func (p *ParserResults_xml) NodeInfo(i int) string
type SliceBounds ¶
type SliceBounds struct {
BegIdx, EndIdx int
}
type XmlCatalogFile ¶
type XmlCatalogFile struct {
XMLName xml.Name `xml:"catalog"`
// "public" or "system"
Prefer string `xml:"prefer,attr"`
XmlPublicIDsubrecords []PIDSIDcatalogFileRecord `xml:"public"`
// We do this so we can peel off the directory path
AbsFilePath string
}
XmlCatalogFile represents a parsed XML catalog file, at the top level.
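The struct tags on XmlCatalogFile suggest that the catalog is decoded with the stdlib `encoding/xml` unmarshaller. The sketch below shows that technique with simplified stand-in types (catalog, publicEntry) and a made-up two-entry catalog; the file paths in the sample are invented for illustration, and findByPublicID is a hypothetical analogue of GetByPublicID.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// publicEntry is a simplified stand-in for PIDSIDcatalogFileRecord.
type publicEntry struct {
	XMLName  xml.Name `xml:"public"`
	PublicID string   `xml:"publicId,attr"`
	URI      string   `xml:"uri,attr"`
}

// catalog is a simplified stand-in for XmlCatalogFile.
type catalog struct {
	XMLName xml.Name      `xml:"catalog"`
	Prefer  string        `xml:"prefer,attr"`
	Publics []publicEntry `xml:"public"`
}

// sample is a made-up catalog; the uri values are illustrative only.
const sample = `<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog" prefer="public">
  <public publicId="-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN" uri="dtd/lw-topic.dtd"/>
  <public publicId="-//OASIS//DTD LIGHTWEIGHT DITA Map//EN" uri="dtd/lw-map.dtd"/>
</catalog>`

// parseCatalog decodes catalog XML into the stand-in struct.
func parseCatalog(data []byte) (*catalog, error) {
	var c catalog
	if err := xml.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

// findByPublicID (hypothetical) returns the entry with the given
// public ID, or nil if there is none.
func findByPublicID(c *catalog, pid string) *publicEntry {
	for i := range c.Publics {
		if c.Publics[i].PublicID == pid {
			return &c.Publics[i]
		}
	}
	return nil
}

func main() {
	c, err := parseCatalog([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if e := findByPublicID(c, "-//OASIS//DTD LIGHTWEIGHT DITA Map//EN"); e != nil {
		fmt.Println(c.Prefer, e.URI)
	}
}
```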
func NewXmlCatalogFile ¶
func NewXmlCatalogFile(fpath string) (pXC *XmlCatalogFile, err error)
NewXmlCatalogFile is a convenience function that reads in the file and then processes the file contents. It is not clear what the constraints on the path are (but a relative path should work okay).
func (*XmlCatalogFile) GetByPublicID ¶
func (p *XmlCatalogFile) GetByPublicID(s string) *PIDSIDcatalogFileRecord
func (*XmlCatalogFile) Validate ¶
func (p *XmlCatalogFile) Validate() (retval bool)
Validate validates an XML catalog. It checks that the listed files exist and that the IDs (as strings that are not parsed yet) are well-formed. It assumes that the catalog has already been loaded from an XML catalog file on-disk. The return value is false if _any_ entry fails to load, but also each entry has its own error field.
type XmlContype ¶
type XmlContype string
XmlContype categorizes the XML file. See variable "XmlContypes".
type XmlDoctype ¶
type XmlDoctype string
XmlDoctype is just a DOCTYPE string, for example: <!DOCTYPE html>
type XmlPeek ¶
type XmlPeek struct {
PreambleRaw CT.Raw // string
DoctypeRaw CT.Raw // string
HasDTDstuff bool
ContentityBasics
}
XmlPeek is used by FU.AnalyseFile(..) when preparing an FU.AnalysisRecord. ContentityBasics has chunks of Raw but not the full "Raw" string.
func Peek_xml ¶
Peek_xml takes a string and does the minimum to find XML preamble, DOCTYPE, root element, whether DTD stuff was encountered, and the locations of outer elements containing metadata and body text.
It uses the Go stdlib parser, so success in finding a root element in this function all but guarantees that the string is valid XML.
It is called by FU.AnalyzeFile.
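A minimal sketch of the peeking idea using the Go stdlib decoder: read tokens only until the first start element, noting the preamble and DOCTYPE on the way. The type peek and function peekXML are hypothetical; the real Peek_xml also tracks DTD content and the positions of metadata and body elements, which this example leaves out.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// peek holds the few items this sketch extracts.
type peek struct {
	Preamble string // contents of <?xml ...?>, without the markers
	Doctype  string // contents of <!DOCTYPE ...>, without "<!" and ">"
	RootElm  string // local name of the first start element
}

// peekXML (hypothetical) stops at the first start element. A non-nil
// error (including io.EOF) means that no root element was found.
func peekXML(s string) (peek, error) {
	var p peek
	dec := xml.NewDecoder(strings.NewReader(s))
	for {
		tok, err := dec.Token()
		if err != nil {
			return p, err
		}
		switch t := tok.(type) {
		case xml.ProcInst:
			if t.Target == "xml" {
				p.Preamble = string(t.Inst)
			}
		case xml.Directive:
			if strings.HasPrefix(string(t), "DOCTYPE") {
				p.Doctype = string(t)
			}
		case xml.StartElement:
			p.RootElm = t.Name.Local
			return p, nil
		}
	}
}

func main() {
	doc := `<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD LIGHTWEIGHT DITA Topic//EN">
<topic id="x"><title>T</title></topic>`
	p, err := peekXML(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", p)
}
```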
type XmlPublicID ¶
type XmlPublicID string
XmlPublicID = PID = Public ID = FPI = Formal Public Identifier
type XmlSystemID ¶
type XmlSystemID string
XmlSystemID = SID = System ID = URI (Uniform Resource Identifier) (can be a filepath or an HTTP address)