An HTML parser and serializer for Go. GoHTML tries to keep its semantics similar to the JS DOM API while keeping the API simple by not forcing the JS DOM model onto GoHTML. As a result, GoHTML uses a node tree model. Under the hood, the GoHTML tokenizer uses the net/html package (golang.org/x/net/html) for tokenizing. It is therefore the user's responsibility to make sure that inputs to GoHTML are UTF-8 encoded. GoHTML allows direct access to the node tree.
Run the following command in the project directory to install GoHTML.
go get github.com/udan-jayanith/GoHTML
Then GoHTML can be imported like this.
import (
	GoHtml "github.com/udan-jayanith/GoHTML"
)
- Parsing
- Serialization
- Node tree traversal
- Querying
Here's an example of fetching a website, parsing it, and then using the querying methods.
res, err := http.Get("https://www.metalsucks.net/")
if err != nil {
	t.Fatal(err)
}
defer res.Body.Close()

// Parse the given HTML reader; Decode returns the root node and an error.
node, err := GoHtml.Decode(res.Body)
if err != nil {
	t.Fatal(err)
}

// Print the inner text of every element with the class "post-title".
nodeList := node.GetElementsByClassName("post-title")
iter := nodeList.IterNodeList()
for node := range iter {
	print(node.GetInnerText())
}
Changes, bug fixes and new features in this version.
- add: Tokenizer
- add: NodeTreeBuilder
- renamed: QuerySelector to Query
- renamed: QuerySelectorAll to QueryAll
Full documentation is available on pkg.go.dev.
Contributions are welcome. Pull requests and issues will be reviewed by a maintainer.