Clean and convert HTML to XML for BaseX
I run XQuery commands using BaseX on HTML sources that may be full of <script> and <style> nodes, which must be removed, and unclosed tags (<br>, <img>), which must be paired. (For example, the dirty source of this page.)
The question "Converting HTML to XML" suggests using Tidy, but it doesn't have a GUI, doesn't seem to work correctly on my source (it outputs nothing), and I doubt that it removes scripts and other unnecessary tags. It also looks like a dated approach.
As I didn't find a question that addresses my needs, I asked again. Because this is close to tools for coding and querying, I asked here.
BaseX has integration for TagSoup, which converts HTML to well-formed XHTML.
Most distributions of BaseX bundle TagSoup; if you installed BaseX from a Linux repository, you might need to add it manually (for example, on Debian and Ubuntu the package is called libtagsoup-java). Further details on the different installation options are given in the documentation linked above.
Afterwards, either set the TagSoup parser as the default using the command
    SET PARSER html
or in the XQuery prologue using
    declare option db:parser "html";
Afterwards, fetch the document you want. An example for the Amazon site you linked:
    declare option db:parser "html";
    doc('http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3daps&field-keywords=camera')
This should work, but doesn't. I'm asking the main developers why it doesn't (it seems to be because of HTTP redirection) and will update this answer when the issue is resolved (or when I understand why it does not work). A workaround until then is to fetch the document as text and parse it as HTML:
    html:parse(fetch:text('http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3daps&field-keywords=camera'))
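The question also asks for <script> and <style> nodes to be removed, which parsing alone does not do. A minimal sketch of one way to handle that, assuming BaseX with TagSoup on the classpath and XQuery Update available: parse the HTML, then delete the unwanted elements from a copy. The sample string in $dirty is made up for illustration and is not taken from the page in question; for a real page it could be replaced by fetch:text(...) as in the workaround above.

```xquery
(: Sketch only: assumes BaseX with TagSoup available. :)
(: $dirty is a hypothetical sample input, not the real page source. :)
let $dirty :=
  '<html><head><script>alert(1)</script><style>p{color:red}</style></head>'
  || '<body><p>one<br>two<img src="a.png"></body></html>'
(: TagSoup closes the open <br>, <img> and <p> tags while parsing. :)
let $doc := html:parse($dirty)
return
  (: Work on a copy and delete script/style elements in any namespace. :)
  copy $clean := $doc
  modify delete node $clean//(*:script | *:style)
  return $clean
```

The *:script wildcard is used because TagSoup may put the elements into the XHTML namespace; matching on the local name works whether or not a namespace is present.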