parsel - Select parts of an HTML document based on CSS selectors
parsel <SELECTOR> [<SELECTOR> [...]] < document.html
This command reads an HTML document from STDIN and takes CSS selectors as arguments. See the 'parsel' and 'cssselect' Python modules for the supported selectors and pseudo-selectors.
Each SELECTOR selects a part of the DOM, but unlike CSS, it does not
narrow the DOM tree down for subsequent selectors. So the sequence
div p given as 2 separate arguments selects all <DIV> and then all <P> in
the document; in other words it is NOT equivalent to the single CSS
selector div p, which selects only those <P> which are under any <DIV>.
To combine selectors, see the / (slash) operator below.
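The difference between two independent selections and one descendant selector can be sketched in Python. This is only an illustration: it uses the standard library's xml.etree.ElementTree path syntax as a stand-in for CSS selectors, the sample markup is made up, and it says nothing about how parsel(1) is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (not from the parsel docs).
doc = ET.fromstring(
    "<body><div><p>in div</p></div><p>outside</p></body>"
)

# Two separate selections, like the two arguments: div p
all_divs = list(doc.iter("div"))             # every <div> in the document
all_ps = [p.text for p in doc.iter("p")]     # every <p> in the document

# One descendant selection, like the single CSS selector "div p"
nested_ps = [p.text for p in doc.findall(".//div//p")]

print(len(all_divs), all_ps, nested_ps)
```

The second list contains both paragraphs, while the descendant selection contains only the one inside the <div>.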
Each SELECTOR also outputs what it matched, in the following format:
first an integer telling how many distinct HTML parts were selected, then
the selected parts themselves, each on its own line.
CR, LF, and Backslash chars are each escaped with one Backslash char. This is
useful for programmatic consumption, because you only have to first read
a line which tells how many subsequent lines to read: each one is one
selected DOM sub-tree on its own (or text, see ::text and [[ATTRIB]] below).
Then just unescape Backslash-R, Backslash-N, and double Backslashes
(for example with sed -e 's/\\\\/\\/g; s/\\r/\r/g; s/\\n/\n/g')
to get the HTML content.
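A minimal Python sketch of a consumer for this output format follows. The function names unescape and read_blocks are made up for illustration, and the sample input is copied from the examples at the end of this page. The unescaping is done in a single left-to-right pass, so an escaped Backslash followed by a literal n is not mistaken for an escaped LF, which can happen when the three substitutions are applied one after another.

```python
import io
import re

# Escape sequences described above: Backslash-R, Backslash-N, double Backslash.
_UNESCAPE = {"r": "\r", "n": "\n", "\\": "\\"}

def unescape(line):
    """Undo the Backslash escaping in one left-to-right pass."""
    return re.sub(r"\\(.)", lambda m: _UNESCAPE.get(m.group(1), m.group(0)), line)

def read_blocks(stream):
    """Yield one list of unescaped parts per count-prefixed block."""
    while True:
        header = stream.readline()
        if not header:          # end of input
            break
        count = int(header)     # first line: how many lines follow
        yield [unescape(stream.readline().rstrip("\n")) for _ in range(count)]

# Sample output of two selectors, as it would arrive on a pipe.
sample = io.StringIO(
    '2\n'
    '<input type="text" name="domain" />\n'
    '<input type="text" name="username" />\n'
    '2\n'
    'domain\n'
    'username\n'
)
blocks = list(read_blocks(sample))
print(blocks)
```

Each yielded list corresponds to one SELECTOR that was not silenced.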
It also takes these special arguments:
Prefix your selector with an @ at sign to suppress output.
Mnemonic: Command line echo suppression in DOS batch and in Makefile.
Removes HTML tags and leaves only the text content before output.
text{} syntax is borrowed from pup(1).
The ::text form is there for you if curly brackets are special in your shell and you don't want to escape them.
Note, ::text is not a standard CSS pseudo selector at the moment.
Output only the value of the uppermost selected element's ATTRIB attribute.
attr{} syntax is borrowed from pup(1).
Mnemonic for the [[ATTRIB]] form: in CSS you filter by tag attribute
with [attr] square brackets, but since that is itself a valid selector,
parsel(1) takes double square brackets to actually output the attribute.
A stand-alone / takes the current selection as a base for the rest of the selectors.
Therefore the subsequent SELECTORs work on the previously selected elements,
not on the document root.
Mnemonic: one directory level deeper.
So this arg sequence: .content / p div selects only those P and DIV elements
which are inside an element of the "content" class.
This is useful because with CSS only, you cannot group P and DIV together here.
In other words neither .content p, div nor .content > p, div provides
the same result.
A series of selectors delimited by / forward slashes in a single argument
delves into the DOM tree, but shows only those elements which the last selector yields.
This is in contrast to the multi-argument variant SEL1 / SEL2 / SEL3, which shows everything
that SEL1, SEL2, SEL3, etc. produce.
It is similar to this 5-argument sequence: @SEL1 / @SEL2 / SEL3, except that SEL1/SEL2/SEL3
rewinds the base selection to the one before SEL1, while the 5-argument sequence moves the
base selection to SEL2 at the end.
You may still silence its output by prepending @, like: @SEL1/SEL2/SEL3, so
not even SEL3 will be shown.
This is useful when you want only its attributes or inner text (see text{} and attr{}).
Since slashes may occur normally in valid CSS selectors,
please double those / slashes which are not meant to separate selectors,
but are part of a selector - usually a URL in a tag attribute.
E.g. instead of a[href="http://example.net/page"], input a[href="http:////example.net//page"].
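When such a single-argument chain is built by a program, the doubling can be done mechanically. The following Python sketch assumes nothing beyond the rule above; the helper names escape_slashes and chain, and the div.links selector, are hypothetical.

```python
def escape_slashes(selector):
    """Double every / so it is read as part of the selector, not as a separator."""
    return selector.replace("/", "//")

def chain(*selectors):
    """Join selectors into one SEL1/SEL2/... argument, doubling literal slashes."""
    return "/".join(escape_slashes(s) for s in selectors)

# div.links is a made-up selector; the href selector is the one from the text above.
arg = chain("div.links", 'a[href="http://example.net/page"]')
print(arg)
```

The resulting string is meant to be passed to parsel(1) as one argument, so it still needs the usual shell quoting.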
A stand-alone .. rewinds the base DOM selection to the
previous base selection before the last /.
Mnemonic: parent directory.
Note, it does not select the parent element in the DOM tree,
but the stuff previously selected in this parsel(1) run.
To select the parent element(s) use parent{}.
Select the currently selected elements' parent elements on the DOM tree.
Note, :parent is not a standard CSS selector at the moment.
Use the parent{} form to disambiguate it from real (standardized) CSS selectors in your code.
Rewind base selection back to the DOM's root.
Note, :root is also a valid CSS pseudo selector, but in a subtree (entered into by /)
it would yield only that subtree, not the original DOM, so parsel(1) goes back to it at this point.
You likely need @ too, to suppress outputting the whole document here.
Show only the first element found. The output is not escaped in this case.
$ parsel input[type=text] < page.html
2
<input type="text" name="domain" />
<input type="text" name="username" />

$ parsel input[type=text] [[name]] < page.html
2
<input type="text" name="domain" />
<input type="text" name="username" />
2
domain
username

$ parsel @input[type=text] [[name]] < page.html
2
domain
username

$ parsel @form ::text < page.html
1
Enter your logon details:\r\nDomain:\r\nUsername:\r\nPassword:\r\nClick here to login:\r\n