If you want to remove that, make use of `sed` or something similar, like so:

```
curl -f -L URL | grep -Eo "https?://\S+?\"" | sed 's/&.*//'
```

Lastly, this does not take into account every possible way a link can be displayed, so certain knowledge of the webpage's structure or of HTML is required. Given that you can't/don't show an example of said structure or of the webpage itself, it is difficult to produce an answer that works on it unless more HTML knowledge is involved.

P.S.: This may or may not be obvious, but this also doesn't take into account links/URLs that are generated dynamically (e.g. by PHP, JS, etc.), since curl mostly works on static links.

P.S.(2): If you want a better way to parse HTML, instead of using my solution above (which doesn't handle every corner case, given the lack of an HTML example/sample), use the better answer from Gilles Quenot below, which is a better fit for general (e.g. complete) and more optimized support of HTML syntax. I am in no way recommending regex for parsing HTML unless you know what you're doing or have very limited needs (e.g. you only want links), as in this case.

---

Parsing HTML with regex is a recurring discussion: it is a bad idea. Instead, use a proper parser: `mech-dump`.

```
mech-dump --links --absolute --agent-alias='Linux Mozilla' URL
```

This comes with the www-mechanize-perl package (on Debian-based distros). (Written by Andy Lester, the author of `ack` and more.)

Or an XPath- and network-aware tool like `xidel` or `saxon-lint`, for example (the XPath expressions here are illustrative, grabbing every link's href):

```
xidel -se '//a/@href' URL
saxon-lint --html --xpath 'string-join(//a/@href, "^M")' URL
```

`^M` is Control+v Enter.

xmlstarlet:

```
curl -Ls URL |
    xmlstarlet format -H - 2>/dev/null |    # convert broken HTML to HTML
    xmlstarlet sel -t -v '//a/@href' -      # parse the stream with an XPath expression
```

You can even use XPath in a Puppeteer JavaScript script (the URL, viewport size, and XPath expression below are placeholders; `waitForXPath` and `$x` exist in older Puppeteer releases):

```js
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('URL');                                // placeholder target
    await page.setViewport({ width: 1280, height: 800 });  // illustrative size

    const xpath_expression = '//a[@href]';                 // illustrative: every link
    await page.waitForXPath(xpath_expression);
    const links = await page.$x(xpath_expression);
    const link_urls = await page.evaluate(
        (...links) => links.map(e => e.href),              // collect absolute URLs
        ...links
    );
    console.log(link_urls);

    await browser.close();
})();
```
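To try the Puppeteer sketch above, a minimal setup would be the following (the script name `extract_links.js` is a hypothetical example):

```
npm install puppeteer
node extract_links.js
```

Unlike the curl-based approaches, this drives a real headless browser, so it also picks up links that are generated dynamically by JavaScript.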