04-link

command
v0.0.0-...-56ad08b
Published: Feb 28, 2024 License: MIT Imports: 4 Imported by: 0

README

Problem

Solutions

  • link: The link package, which extracts links from HTML.
  • main: Uses the link package to extract links from HTML.

Lessons Learned

/x/net/html
  • Read the package example: https://godoc.org/golang.org/x/net/html
  • Token struct:
    type Token struct {
        Type     TokenType
        DataAtom atom.Atom
        Data     string
        Attr     []Attribute
    }
    
  • Type can give us information about what kind of token it is. Important ones for this exercise are:
    • StartTagToken: <a href>
    • EndTagToken: </a>
    • TextToken: The text in between. Collecting only text tokens skips any other elements nested inside the link.
  • Data holds the content of the token.
    • Anchor tags: the tag name, a.
    • Text tokens: The actual text.
  • Attribute is of type:
    type Attribute struct {
        Namespace, Key, Val string
    }
    
  • Key is the name of the attribute and Val is its value.
    • <a href="example.net">: Key = href and Val = example.net.
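
As a minimal sketch of reading an attribute, the snippet below looks up href in a slice of attributes. The Attribute type is redeclared locally with the same shape as the one above, so the sketch runs without the x/net/html dependency. hrefOf is a hypothetical helper name, not part of the package.

```go
package main

import "fmt"

// Attribute mirrors the shape of html.Attribute from golang.org/x/net/html.
// It is redeclared here only so the sketch runs with the standard library.
type Attribute struct {
	Namespace, Key, Val string
}

// hrefOf is a hypothetical helper: it returns the value of the first href
// attribute and reports whether one was found.
func hrefOf(attrs []Attribute) (string, bool) {
	for _, a := range attrs {
		if a.Key == "href" {
			return a.Val, true
		}
	}
	return "", false
}

func main() {
	attrs := []Attribute{
		{Key: "class", Val: "nav"},
		{Key: "href", Val: "example.net"},
	}
	if href, ok := hrefOf(attrs); ok {
		fmt.Println(href)
	}
}
```

With the real package, the same loop would run over Token.Attr for a StartTagToken whose Data is a.
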
Parse

Parse is a single pass over the tokens with a capturing flag.

  • Go through the tokens. When you reach a start anchor tag, set the capturing flag and store the href.
  • While capturing, add the text of every text token (trim leading and trailing white space, but add a space between fragments).
  • When you reach the end anchor tag, stop capturing and build the link from the stored href and the collected text.
  • Add the link to the links slice.
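
The steps above can be sketched roughly as follows. This is not the package's actual code. Token, TokenType, Link and extractLinks are simplified, hypothetical stand-ins so that the example needs only the standard library. In particular the Href field replaces the lookup in Token.Attr, and the token slice replaces the tokenizer loop.

```go
package main

import (
	"fmt"
	"strings"
)

// TokenType and Token are simplified stand-ins for the types in
// golang.org/x/net/html, redeclared so the sketch has no dependencies.
type TokenType int

const (
	TextToken TokenType = iota
	StartTagToken
	EndTagToken
)

type Token struct {
	Type TokenType
	Data string // tag name for tags, text for text tokens
	Href string // stand-in for finding href in Token.Attr
}

// Link is a hypothetical result type.
type Link struct {
	Href, Text string
}

// extractLinks walks the tokens once and uses a capturing flag.
func extractLinks(tokens []Token) []Link {
	var links []Link
	var parts []string
	var href string
	capturing := false
	for _, t := range tokens {
		switch {
		case t.Type == StartTagToken && t.Data == "a" && !capturing:
			// Start capturing and remember the href.
			capturing, href, parts = true, t.Href, nil
		case t.Type == TextToken && capturing:
			// Keep trimmed, non-empty text fragments.
			if s := strings.TrimSpace(t.Data); s != "" {
				parts = append(parts, s)
			}
		case t.Type == EndTagToken && t.Data == "a" && capturing:
			// Stop capturing and store the link.
			links = append(links, Link{Href: href, Text: strings.Join(parts, " ")})
			capturing = false
		}
	}
	return links
}

func main() {
	// Roughly: <a href="/dog"><span>Something in a span</span> Text not in a span</a>
	tokens := []Token{
		{Type: StartTagToken, Data: "a", Href: "/dog"},
		{Type: StartTagToken, Data: "span"},
		{Type: TextToken, Data: "Something in a span"},
		{Type: EndTagToken, Data: "span"},
		{Type: TextToken, Data: " Text not in a span "},
		{Type: EndTagToken, Data: "a"},
	}
	for _, l := range extractLinks(tokens) {
		fmt.Printf("%s: %q\n", l.Href, l.Text)
	}
}
```

Only text tokens contribute to the link text, so the span tags inside the anchor are skipped while their text is kept.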

Issues:

  • Nested links are ignored. Child links are not stored and their text is stored as part of the parent link.
    • For an example, run go run main.go -f ex5.html.
strings.Builder
var sb strings.Builder  // Create the builder.
sb.WriteString("whatever")  // Write to it. We can use fmt.Sprintf as param too.
return sb.String()  // Get the final string.
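
As a fuller sketch, the builder can collect trimmed text fragments with a single space between them, which is what the Parse steps call for. joinText is a hypothetical helper name used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// joinText is a hypothetical helper: it trims each fragment and joins the
// non-empty ones with a single space using strings.Builder.
func joinText(fragments []string) string {
	var sb strings.Builder
	for _, f := range fragments {
		f = strings.TrimSpace(f)
		if f == "" {
			continue // skip fragments that were only white space
		}
		if sb.Len() > 0 {
			sb.WriteByte(' ') // separator between fragments
		}
		sb.WriteString(f)
	}
	return sb.String()
}

func main() {
	fmt.Println(joinText([]string{"  Check me out", "\n", "on Twitter  "}))
}
```

The zero value of strings.Builder is ready to use, so no constructor is needed.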
