parsel - Select parts of an HTML document based on CSS selectors


NAME

parsel - Select parts of an HTML document based on CSS selectors


INVOCATION

cat page.html | parsel <SELECTOR> [<SELECTOR> [...]]


DESCRIPTION

This command takes an HTML document on STDIN and one or more CSS selectors as arguments. See the 'parsel' and 'cssselect' Python modules for the supported selectors and pseudo-selectors.

Each SELECTOR selects a part of the DOM but, unlike in CSS, does not narrow the DOM tree considered by subsequent selectors. So the two-argument sequence div p selects all <DIV> elements and then all <P> elements in the document; in other words, it is NOT equivalent to the single CSS selector div p, which selects only those <P> elements that are under some <DIV>.
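The difference can be illustrated outside of parsel(1). The sketch below is only a stand-in: it uses the limited XPath subset of Python's standard xml.etree.ElementTree instead of a CSS engine, and the sample markup is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from any real page.
doc = ET.fromstring("<body><div><p>inside</p></div><p>outside</p></body>")

# Two separate arguments, "div" then "p":
# each one is matched against the whole document.
divs = doc.findall(".//div")
all_p = [p.text for p in doc.findall(".//p")]

# One argument, "div p": only <p> elements that have a <div> ancestor.
nested_p = [p.text for p in doc.findall(".//div//p")]

print(len(divs), all_p)   # 1 ['inside', 'outside']
print(nested_p)           # ['inside']
```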

Each SELECTOR also outputs what it matched, in the following format: first an integer giving the number of distinct HTML parts selected, then the selected parts themselves, each on its own line. CR, LF, and Backslash chars are each escaped with one Backslash char. This makes programmatic consumption easy: first read a line that tells how many subsequent lines to read; each of those lines is one DOM part. Then unescape Backslash-R, Backslash-N, and double Backslashes to get the HTML content.
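A consumer of this format might look like the following minimal sketch. It assumes exactly the three escapes described above; the names unescape and read_groups and the sample input are hypothetical, chosen for the illustration only.

```python
import io
import re

# The three escape sequences described in the text.
_ESCAPES = {"\\r": "\r", "\\n": "\n", "\\\\": "\\"}

def unescape(line):
    # One left-to-right pass, so an escaped backslash followed by a
    # literal "n" is not mistaken for an escaped LF.
    return re.sub(r"\\[rn\\]", lambda m: _ESCAPES[m.group(0)], line)

def read_groups(stream):
    """Yield one list of unescaped parts per count-prefixed group."""
    while True:
        header = stream.readline()
        if not header.strip():      # end of input (or a blank line)
            return
        count = int(header)         # how many lines belong to this group
        yield [unescape(stream.readline().rstrip("\n")) for _ in range(count)]

# Hypothetical output of two selectors: a group of 2 parts, then a group of 1.
sample = "2\ndomain\nusername\n1\nDomain:\\n\\nC:\\\\temp\n"
groups = list(read_groups(io.StringIO(sample)))
print(groups)
```

Passing sys.stdin instead of the StringIO object would let the same function read from a pipe.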

It also takes these special arguments:

@SELECTOR

Prefix your selector with an @ (at sign) to suppress output. Mnemonic: command-line echo suppression in DOS batch files and in Makefiles.

text{} or ::text

Removes HTML tags and leaves only the text content before output. The text{} syntax is borrowed from pup(1). The ::text form is there for you if curly brackets are special in your shell and you don't want to bother escaping them. Note that ::text is not a valid CSS pseudo-selector at the moment.

attr{ATTRIB} or [[ATTRIB]]

Output only the value of the uppermost selected element's ATTRIB attribute. The attr{} syntax is borrowed from pup(1). Mnemonic for the [[ATTRIB]] form: in CSS you filter by tag attribute with [attr] square brackets, but since that is already a valid selector, parsel(1) takes double square brackets to actually output the attribute.

/ (forward slash)

A stand-alone / takes the current selection as the base for the rest of the selectors. Mnemonic: one directory level deeper. So the argument sequence .content / p div selects only those P and DIV elements which are inside an element of class "content". This is useful because with CSS alone you cannot group P and DIV together here. In other words, neither .content p, div nor .content > p, div gives the same result.

.. (double period)

A stand-alone .. rewinds the base DOM selection to the previous base selection before the last /. Mnemonic: parent directory.

@:root

Rewinds the base selection back to the DOM's root. Note that :root is also a valid CSS pseudo-selector, but in a subtree (entered by /) it would yield only that subtree, not the original DOM, so parsel(1) goes back to the original DOM at this point. You likely need the @ too, to suppress output of the whole document here.
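The interplay of /, .. and @:root can be pictured as a stack of base selections. The following is a toy model of the behaviour described above, not parsel(1)'s actual implementation: the run and matches functions are hypothetical, the matcher understands only bare tag names and .class, and standard-library ElementTree stands in for the real HTML parser.

```python
import xml.etree.ElementTree as ET

def matches(el, selector):
    """Toy matcher: supports only 'tag' and '.class' selectors."""
    if selector.startswith("."):
        return selector[1:] in el.get("class", "").split()
    return el.tag == selector

def run(root, args):
    """Return one list of matched elements per non-suppressed selector."""
    bases = [[root]]    # stack of base selections; bottom entry is the document
    current = [root]    # result of the most recent selector
    results = []
    for arg in args:
        if arg == "/":          # the current selection becomes the new base
            bases.append(current)
        elif arg == "..":       # back to the base in effect before the last "/"
            if len(bases) > 1:
                bases.pop()
        elif arg == "@:root":   # drop every base and start from the document again
            del bases[1:]
            current = [root]
        else:
            quiet = arg.startswith("@")     # "@" suppresses output
            selector = arg[1:] if quiet else arg
            current = [el for base in bases[-1]
                       for el in base.iter()
                       if el is not base and matches(el, selector)]
            if not quiet:
                results.append(current)
    return results

# Hypothetical sample document.
doc = ET.fromstring(
    "<body><section class='content'><p>a</p><div>b</div></section>"
    "<p>c</p><div>d</div></body>"
)

def labels(groups):
    # Show each element by its text, falling back to its tag name.
    return [[el.text or el.tag for el in group] for group in groups]

print(labels(run(doc, ["@.content", "/", "p", "div"])))      # [['a'], ['b']]
print(labels(run(doc, ["@.content", "/", "p", "..", "p"])))  # [['a'], ['a', 'c']]
```

In the second call, the p after .. is matched against the whole document again, so it also finds the paragraph outside the "content" element.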


EXAMPLE OUTPUT

  $ parsel input[type=text] < page.html
  2
  <input type="text" name="domain" />
  <input type="text" name="username" />
  $ parsel input[type=text] [[name]] < page.html
  2
  <input type="text" name="domain" />
  <input type="text" name="username" />
  2
  domain
  username
  $ parsel @input[type=text] [[name]] < page.html
  2
  domain
  username
  $ parsel @form ::text < page.html
  1
  Enter your logon details:\n\nDomain:\n\nUsername:\n\nPassword:\n\nClick here to login:\n\n


REFERENCE

https://www.w3schools.com/cssref/css_selectors.php
https://developer.mozilla.org/en-US/docs/Web/CSS/Reference#selectors
https://github.com/scrapy/cssselect
https://cssselect.readthedocs.io/en/latest/#supported-selectors


SIMILAR TOOLS

https://github.com/ericchiang/pup
https://github.com/suntong/cascadia
https://github.com/mgdm/htmlq