parsel - Select parts of an HTML document based on CSS selectors

NAME

parsel - Select parts of an HTML document based on CSS selectors

INVOCATION

parsel <SELECTOR> [<SELECTOR> [...]] < document.html

DESCRIPTION

This command takes an HTML document on STDIN and one or more CSS selectors as arguments. See the 'parsel' and 'cssselect' Python modules for the selectors and pseudo-selectors that are supported.

Each SELECTOR selects a part of the DOM but, unlike in CSS, does not narrow the DOM tree down for subsequent selectors. So the argument sequence div p (2 arguments) selects all <DIV> and then all <P> elements in the document; in other words, it is NOT equivalent to the CSS selector div p, which selects only those <P> that are under some <DIV>. To combine selectors, see the / (slash) operator below.

Each SELECTOR also outputs what it matched, in the following format: first an integer telling how many distinct HTML parts were selected, then the selected parts themselves, each on its own line. CR, LF, and Backslash chars are escaped by one Backslash char. This is useful for programmatic consumption, because you only have to read a first line which tells how many subsequent lines to read: each of those is one selected DOM sub-tree (or text, see ::text and [[ATTRIB]] below). Then just unescape Backslash-R, Backslash-N, and double Backslashes (for example with sed -e 's/\\\\/\\/g; s/\\r/\r/g; s/\\n/\n/g') to get the HTML content.
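The framing described above (a count line followed by that many escaped lines) can be consumed roughly as follows. This is a minimal Python sketch, not part of parsel(1): the names unescape and read_groups are made up for illustration, and it assumes the output arrives as text with LF line endings. It unescapes in a single left-to-right pass, so an escaped Backslash followed by a literal n or r is not mistaken for an escaped LF or CR.

```python
import io
import re

# Escape letters documented above: Backslash-R, Backslash-N, double Backslash.
_UNESCAPES = {"r": "\r", "n": "\n", "\\": "\\"}


def unescape(line: str) -> str:
    """Undo the Backslash escaping in one pass; unknown escapes are kept as-is."""
    return re.sub(r"\\(.)", lambda m: _UNESCAPES.get(m.group(1), m.group(0)), line)


def read_groups(stream):
    """Yield one list of unescaped parts per selector that produced output.

    `stream` is any text file object holding parsel(1) output (hypothetical
    helper; assumes every group starts with an integer count line).
    """
    while True:
        header = stream.readline()
        if not header:  # end of input
            break
        count = int(header)
        yield [unescape(stream.readline().rstrip("\n")) for _ in range(count)]


if __name__ == "__main__":
    # Hand-written sample in the documented format, not real tool output.
    sample = "2\ndomain\nusername\n1\nDomain:\\r\\nUsername:\n"
    for group in read_groups(io.StringIO(sample)):
        print(group)
```

In a pipeline, passing sys.stdin to read_groups instead of the sample string should give one list per non-suppressed SELECTOR, in argument order.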

It also takes these special arguments:

@SELECTOR

Prefix your selector with an @ (at sign) to suppress its output. Mnemonic: command-line echo suppression in DOS batch files and in Makefiles.

text{} or ::text

Removes HTML tags and leaves only the text content before output. The text{} syntax is borrowed from pup(1). The ::text form is there for you if curly brackets are magical in your shell and you don't want to escape them. Note: ::text is not a standard CSS pseudo-selector at the moment.

attr{ATTRIB} or [[ATTRIB]]

Output only the value of the uppermost selected element's ATTRIB attribute. attr{} syntax is borrowed from pup(1). Mnemonic for the [[ATTRIB]] form: in CSS you filter by tag attribute with [attr] square brackets, but as it's a valid selector, parsel(1) takes double square brackets to actually output the attribute.

/ (forward slash)

A stand-alone / takes the current selection as the base for the rest of the selectors. Therefore the subsequent SELECTORs work on the previously selected elements, not on the document root. Mnemonic: one directory level deeper. So the argument sequence .content / p div selects only those P and DIV elements which are inside an element of class "content". This is useful because with CSS alone you cannot group P and DIV together here. In other words, neither .content p, div nor .content > p, div gives the same result.

SEL1/SEL2/SEL3

A series of selectors delimited by / (forward slashes) in a single argument delves into the DOM tree, but shows only those elements which the last selector yields. This is in contrast to the multi-argument variant SEL1 / SEL2 / SEL3, which shows everything that SEL1, SEL2, SEL3, etc. produce. It is similar to this 5-argument sequence: @SEL1 / @SEL2 / SEL3, except that SEL1/SEL2/SEL3 rewinds the base selection to the one before SEL1, while the 5-argument sequence moves the base selection to SEL3 at the end.

You may still silence its output by prepending @, like: @SEL1/SEL2/SEL3, so not even SEL3 will be shown. This is useful when you want only its attributes or inner text (see text{} and attr{}).

Since slashes may occur normally in valid CSS selectors, please double those / slashes which are not meant to separate selectors but are part of a selector, usually a URL in a tag attribute. E.g. instead of a[href="http://example.net/page"], input a[href="http:////example.net//page"].
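When such arguments are built by a script, the doubling rule above can be applied mechanically. The following Python sketch only illustrates the rule as stated; escape_slashes and join_selectors are hypothetical helper names, not something parsel(1) provides.

```python
def escape_slashes(selector: str) -> str:
    """Double every / that belongs to the selector itself (e.g. inside a URL)."""
    return selector.replace("/", "//")


def join_selectors(selectors) -> str:
    """Build one SEL1/SEL2/... argument from plain CSS selectors.

    Each selector is escaped first, so only the separators added here
    remain as single slashes.
    """
    return "/".join(escape_slashes(s) for s in selectors)


if __name__ == "__main__":
    print(escape_slashes('a[href="http://example.net/page"]'))
    print(join_selectors(["form", 'a[href="/login"]']))
```

The first call reproduces the doubled form shown in the paragraph above; the second shows a single separating slash between two selectors, one of which contains a doubled slash of its own.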

.. (double period)

A stand-alone .. rewinds the base DOM selection to the previous base selection before the last /. Mnemonic: parent directory. Note, it does not select the parent element in the DOM tree, but the stuff previously selected in this parsel(1) run. To select the parent element(s) use parent{}.

parent{} or :parent

Select the currently selected elements' parent elements on the DOM tree. Note, :parent is not a standard CSS selector at the moment. Use the parent{} form to disambiguate it from real (standardized) CSS selectors in your code.

@:root

Rewind the base selection back to the DOM's root. Note: :root is also a valid CSS pseudo-selector, but in a subtree (entered into by /) it would yield only that subtree, not the original DOM, so parsel(1) goes back to the original DOM at this point. You likely need @ too, to suppress output of the whole document here.

OPTIONS

-1: Show only the first element found. The output is not escaped in this case.

EXAMPLE OUTPUT

  $ parsel input[type=text] < page.html
  2
  <input type="text" name="domain" />
  <input type="text" name="username" />

  $ parsel input[type=text] [[name]] < page.html
  2
  <input type="text" name="domain" />
  <input type="text" name="username" />
  2
  domain
  username

  $ parsel @input[type=text] [[name]] < page.html
  2
  domain
  username

  $ parsel @form ::text < page.html
  1
  Enter your logon details:\r\nDomain:\r\nUsername:\r\nPassword:\r\nClick here to login:\r\n

REFERENCE

https://www.w3schools.com/cssref/css_selectors.php
https://developer.mozilla.org/en-US/docs/Web/CSS/Reference#selectors
https://github.com/scrapy/cssselect
https://cssselect.readthedocs.io/en/latest/#supported-selectors

SIMILAR TOOLS

https://github.com/ericchiang/pup
https://github.com/suntong/cascadia
https://github.com/mgdm/htmlq
