parsel - Select parts of an HTML document based on CSS selectors
cat page.html | parsel <SELECTOR> [<SELECTOR> [...]]
This command reads an HTML document from STDIN and takes CSS selectors as arguments. See the documentation of the 'parsel' and 'cssselect' Python modules for the supported selectors and pseudo-selectors.
Each SELECTOR selects a part of the DOM but, unlike in CSS, does not narrow the DOM tree considered by subsequent selectors. So the argument sequence div p (2 arguments) selects all <DIV> and then all <P> in the document; in other words, it is NOT equivalent to the single CSS selector 'div p', which selects only those <P> that are under some <DIV>.
Each SELECTOR also outputs what it matched, in the following format: first an integer telling how many distinct HTML parts were selected, then the selected parts themselves, each on its own line. CR, LF, and Backslash chars are escaped by one Backslash char. This is useful for programmatic consumption, because you only have to first read a line which tells how many subsequent lines to read; each of those is a DOM part on its own. Then just unescape Backslash-R, Backslash-N, and double Backslashes to get the HTML content.
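A consumer of this output format could be sketched as follows in Python. This is only an illustration of the format described above, not code shipped with parsel(1); the function names unescape and read_groups are made up for this example.

```python
import io


def unescape(line: str) -> str:
    """Turn Backslash-R, Backslash-N and double Backslash back into CR, LF and Backslash."""
    mapping = {"r": "\r", "n": "\n", "\\": "\\"}
    out = []
    i = 0
    # Scan once, left to right, so that an escaped Backslash followed by
    # the letter "n" is not mistaken for an escaped LF.
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] in mapping:
            out.append(mapping[line[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def read_groups(stream):
    """Yield one list of unescaped HTML parts per count-prefixed group."""
    while True:
        header = stream.readline()
        if not header:
            break  # end of input
        count = int(header)
        yield [unescape(stream.readline().rstrip("\n")) for _ in range(count)]


if __name__ == "__main__":
    # Hypothetical sample input: one group with two parts.
    sample = io.StringIO("2\n<p>a\\nb</p>\n<b>c</b>\n")
    for parts in read_groups(sample):
        print(len(parts), parts)
```

In real use the stream would be the standard output of parsel(1), for example sys.stdin of a script placed after it in a pipeline.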
It also takes these special arguments:
Prefix your selector with an @ (at sign) to suppress its output. Mnemonic: command line echo suppression in DOS batch files and in Makefiles.
The text{} or ::text argument removes HTML tags and leaves only the text content before output. The text{} syntax is borrowed from pup(1). The ::text form is there for you if curly brackets are special in your shell and you don't want to bother escaping them. Note that ::text is not a valid CSS pseudo-selector at the moment.
The attr{} or [[ATTRIB]] argument outputs only the value of the ATTRIB attribute of the uppermost selected element. The attr{} syntax is borrowed from pup(1). Mnemonic for the [[ATTRIB]] form: in CSS you filter by tag attribute with [attr] square brackets, but since that is a valid selector, parsel(1) takes double square brackets to actually output the attribute.
A stand-alone / takes the current selection as a base for the rest of the selectors. Mnemonic: one directory level deeper. So the argument sequence .content / p div selects only those P and DIV elements which are inside an element of the "content" class. This is useful because with CSS alone you cannot group P and DIV together here. In other words, neither '.content p, div' nor '.content > p, div' provides the same result.
A stand-alone .. rewinds the base DOM selection to the previous base selection, the one in effect before the last /. Mnemonic: parent directory.
:root rewinds the base selection back to the DOM's root. Note that :root is also a valid CSS pseudo-selector, but in a subtree (entered into by /) it would yield only that subtree, not the original DOM, so parsel(1) goes back to the original DOM at this point. You likely need @ too, to suppress output of the whole document here.
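The interplay of @, /, .. and :root can be pictured as a stack of base selections. The Python sketch below is only a conceptual model written for this explanation, not parsel(1)'s actual implementation. It assumes that "current selection" means the result of the most recent selector argument and that :root empties the stack. It handles plain selectors only (no text{} or attr{} arguments), and the function name plan is hypothetical.

```python
def plan(args):
    """Return (selector, base, printed) for every selector argument.

    base is the list of selectors entered with "/", outermost first;
    an empty list stands for the DOM's root.
    """
    bases = []   # stack of base selections, innermost last
    last = None  # most recent selector (assumed to be the "current selection")
    steps = []
    for arg in args:
        if arg == "/":
            # One level deeper: the current selection becomes the base.
            if last is not None:
                bases.append(last)
        elif arg == "..":
            # Back to the base that was in effect before the last "/".
            if bases:
                bases.pop()
        else:
            printed = not arg.startswith("@")  # "@" suppresses output
            selector = arg if printed else arg[1:]
            if selector == ":root":
                bases.clear()  # assumption: back to the DOM's root
            steps.append((selector, list(bases), printed))
            last = selector
    return steps


if __name__ == "__main__":
    for selector, base, printed in plan([".content", "/", "p", "div", "..", "a"]):
        where = " / ".join(base) or "(DOM root)"
        print(f"{selector!r} evaluated under {where}, printed={printed}")
```

For the argument sequence .content / p div .. a this model reports that p and div are evaluated under .content, while .content itself and a are evaluated against the whole document.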
$ parsel input[type=text] < page.html
2
<input type="text" name="domain" />
<input type="text" name="username" />

$ parsel input[type=text] [[name]] < page.html
2
<input type="text" name="domain" />
<input type="text" name="username" />
2
domain
username

$ parsel @input[type=text] [[name]] < page.html
2
domain
username

$ parsel @form ::text < page.html
1
Enter your logon details:\ \ Domain:\ \ Username:\ \ Password:\ \ Click here to login:\ \