Clean and convert HTML to XML for BaseX

January 15, 2014

I run XQuery commands using BaseX on HTML source that may be full of <script> and <style> nodes, which must be removed, and unclosed tags (<br>, <img>), which must be paired. (For example, the dirty source of this page.)

The question "Converting HTML to XML" suggests using Tidy, but it doesn't have a GUI, it doesn't seem to work correctly on my source (it outputs nothing), and I doubt whether it removes scripts and other unnecessary tags. It is also an old approach.

As I didn't find a question that addresses my needs, I asked again. Because this is close to tools for coding and querying, I asked here.

BaseX has integration for TagSoup, which converts HTML to well-formed XHTML.

Most distributions of BaseX bundle TagSoup. If you installed BaseX from a Linux repository, you might need to add it manually (for example, on Debian and Ubuntu the package is called libtagsoup-java). Further details on the different installation options are given in the documentation linked above.
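To check whether TagSoup is actually on the classpath, a quick query along these lines may help. It is only a sketch and assumes a BaseX version whose HTML Module provides html:parser(), which is documented to return the name of the parser in use, or an empty string when TagSoup was not found.

```xquery
(: Reports which HTML parser BaseX will use. An empty string means
   TagSoup was not found, and HTML input would then be treated as
   well-formed XML. :)
let $parser := html:parser()
return
  if ($parser = '') then 'TagSoup not found on the classpath'
  else 'HTML parser in use: ' || $parser
```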

Afterwards, either set the TagSoup parser as the default using the command

set parser html

or set it in the XQuery prolog using

declare option db:parser "html";

Afterwards, fetch the document you want. For example, the Amazon page you linked:

declare option db:parser "html";
doc('http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3daps&amp;field-keywords=camera')

This should work, but it doesn't. I'm asking the main developers why (it seems to be because of an HTTP redirection) and will update this answer when the issue is resolved (or when I understand why it does not work). A workaround until then is to fetch the document as text and parse it as HTML:

html:parse(fetch:text('http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3daps&amp;field-keywords=camera'))
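The question also asks for <script> and <style> nodes to be removed, which parsing alone does not do. The following is a minimal sketch of one way to do it, assuming BaseX with TagSoup available and the XQuery Update copy/modify expression. The inline $html string is a made-up sample standing in for a fetched page, and the *: wildcard is used so the paths match whether or not the parser places elements in the XHTML namespace.

```xquery
(: Made-up sample input with unclosed tags and unwanted nodes. :)
let $html := '<html><head><style>p { color: red }</style></head>' ||
             '<body><p>first line<br>second line<img src="a.png">' ||
             '<script>alert(1)</script></p></body></html>'
(: TagSoup turns the input into a well-formed document node. :)
let $doc := html:parse($html)
return
  (: Work on a copy and delete every script and style element. :)
  copy $clean := $doc
  modify delete node $clean//(*:script | *:style)
  return $clean
```

To apply the same clean-up to a real page, $html can be replaced with the result of fetch:text(...) for the URL in question.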
