Convert the unselectable navigation interface of the Google Cloud documentation site into a Markdown table of contents (ToC) for further processing.
https://gitlab.com/brlin/gcloud-docs-nav2md
#google-cloud #data-conversion #html-query
You need to have the following software installed in order to follow this tutorial:
- gzip
For decompressing the product's gzip-compressed release archive.
- html-query
For parsing the navigation interface HTML markup into JSON data for further processing.
- jinjanator
For converting the JSON navigation data into Markdown markup using Jinja2 templates.
- jq
For beautifying the JSON navigation data so that the parse results are easier to inspect.
- Mozilla Firefox
For selecting the HTML markup of the navigation interface that we need to parse data from.
- Your preferred plaintext editor application
For examining the HTML markup.
- Your preferred tar archive manipulation application
For extracting the product's release archive.
- Your preferred text terminal emulator application
For running the command-line interface commands required by this tutorial.
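Before starting, it may help to confirm that the command-line tools are reachable from your shell. The following is a minimal sketch, not part of the product: `missing_commands` is a hypothetical helper name, `hq` and `jinjanate` are the executable names used by the commands later in this tutorial, and `tar` is only an assumption about which archive tool you prefer.

```shell
# Print each command from the argument list that is not found on PATH.
# (missing_commands is a hypothetical helper, not shipped with the product.)
missing_commands() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
        fi
    done
}

# Assumed executable names: hq (html-query), jinjanate (jinjanator).
# tar stands in for "your preferred tar archive manipulation application".
missing="$(missing_commands gzip tar hq jq jinjanate)"
if [ -n "$missing" ]; then
    printf 'Missing commands:\n%s\n' "$missing" >&2
fi
```

If nothing is printed, all of the listed commands were found. Mozilla Firefox and the text editor are used interactively, so they are left out of the check.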
Execute the following instructions to do the conversion:
-
Download the release archive from the Releases page.
-
Extract the release archive.
-
Launch Mozilla Firefox.
-
Browse to the main page of the Google Cloud documentation website.
-
Right-click the first item in the navigation interface (it was "Google Cloud Documentation home" when this procedure was written on 2024/10/16) and select Inspect. The developer tools will open, and in the main panel of the Inspector tab you should see that an element with the devsite-nav-text class is selected.
-
Walk outward through the enclosing elements until you locate one with the devsite-nav-list class, then right-click that element and select Copy > Inner HTML from its context menu.
-
Create a dump.html file in the /path/to/the/extracted/product/gcloud-docs-nav2md-X.Y.Z folder; its content should be similar to this sample file.
-
Open the file in your preferred text editor, paste the previously copied HTML markup into it, and save the file.
-
Launch your preferred terminal emulator application.
-
Change the working directory to the /path/to/the/extracted/product/gcloud-docs-nav2md-X.Y.Z folder.
-
Run the following commands to convert the navigation interface HTML markup to JSON data:
    hq \
        '{ navitems: .devsite-nav-item | [ { title: .devsite-nav-text | @text, url: a | @(href) } ] }' \
        <dump.html \
        | jq . \
        | tee nav.json
The conversion result should be similar to this sample file.
-
Run the following commands to convert the navigation JSON data to ToC Markdown markup:
    jinjanate_opts=(
        --format json
        --output toc.md
    )
    jinjanate "${jinjanate_opts[@]}" toc.md.j2 nav.json
The conversion result should be similar to this sample file.
After doing so, you should have your ToC Markdown markup in the toc.md file (sample).
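The two conversion commands can also be wrapped in a single function so that missing input files are reported before any tool runs. This is only a sketch under stated assumptions: `convert_nav` is a hypothetical helper name that does not ship with the product, while the html-query query and the jinjanate options are copied unchanged from the steps above.

```shell
# Sketch: run both conversion commands from the tutorial in one call.
# convert_nav is a hypothetical helper; it is not part of the product.
convert_nav() {
    workdir="${1:-.}"

    # Both inputs must exist before any tool is invoked.
    for required in dump.html toc.md.j2; do
        if [ ! -f "$workdir/$required" ]; then
            printf 'error: %s not found in %s\n' "$required" "$workdir" >&2
            return 1
        fi
    done

    (
        cd "$workdir" || exit 1
        # HTML -> JSON (query copied from the tutorial), beautified by jq.
        # Note: in plain sh the pipeline's exit status is that of jq, so a
        # failing hq is only caught indirectly by the jinjanate step.
        hq '{ navitems: .devsite-nav-item | [ { title: .devsite-nav-text | @text, url: a | @(href) } ] }' \
            <dump.html \
            | jq . >nav.json \
            && jinjanate --format json --output toc.md toc.md.j2 nav.json
    )
}
```

Calling `convert_nav /path/to/the/extracted/product/gcloud-docs-nav2md-X.Y.Z` would then leave nav.json and toc.md in that folder, provided dump.html has already been created there.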
The following materials were referenced during the development of this project:
- Special query syntax | orf/html-query: jq, but for HTML
Explains the usage of the @(href) and @text query syntax of html-query.
Unless otherwise noted (in an individual file's header or in REUSE.toml), this product is licensed under version 3 of the GNU Affero General Public License, or any more recent version of it that you prefer.
This work complies with the REUSE Specification; refer to the REUSE - Make licensing easy for everyone website for information regarding the licensing of this product.