Clean and convert HTML to XML for BaseX
I run XQuery commands using BaseX on HTML source that may be full of <script> and <style> nodes, which must be removed, and unclosed tags (<br>, <img>), which must be given a closing pair (for example, the dirty source of this page).
The question "Converting HTML to XML" suggests using Tidy, but it doesn't have a GUI, doesn't seem to work correctly on my source (it outputs nothing), and I doubt it removes scripts and other unnecessary tags. It also looks like an old approach.
As I didn't find a question that addresses my needs, I asked again. Because this is close to tools for coding and querying, I asked here.
BaseX has an integration for TagSoup, which converts HTML to well-formed XHTML.
Most distributions of BaseX already bundle TagSoup. If you installed BaseX from a Linux repository, you might need to add it manually (for example, on Debian and Ubuntu the package is called libtagsoup-java). Further details on the different installation options are given in the documentation linked above.
Afterwards, either set the TagSoup parser as the default using the command
set parser html
or set it in the XQuery prologue using
declare option db:parser "html";
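Afterwards, fetch the document you want. An example for the Amazon site you linked (note that & has to be written as &amp; inside an XQuery string literal):
declare option db:parser "html";
doc('http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3daps&amp;field-keywords=camera')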
This should work, but it doesn't. I'm asking the main developers why it doesn't (it seems to be because of the HTTP redirection) and will update the answer when the issue is resolved (or when I understand why it does not work). A workaround until then is to fetch the document as text and parse it as HTML:
html:parse(fetch:text('http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3daps&amp;field-keywords=camera'))
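The question also asks about getting rid of <script> and <style> nodes, which TagSoup does not do by itself. A minimal sketch of one way to do it, assuming the workaround above succeeds and using the standard XQuery Update copy/modify expression; the variable names $url, $page and $clean are only illustrative, and the *: wildcard is used so the query matches whether or not the parsed elements carry a namespace:

```xquery
(: Sketch: parse the page with TagSoup, then strip script and style elements. :)
let $url := 'http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3daps&amp;field-keywords=camera'
(: fetch the raw page as text and let TagSoup turn it into well-formed XML :)
let $page := html:parse(fetch:text($url))
return
  (: work on a copy, so nothing stored in a database is modified :)
  copy $clean := $page
  modify (
    delete node $clean//*:script,
    delete node $clean//*:style
  )
  return $clean
```

The result is an in-memory document that can be queried further; other unwanted elements can be removed by adding more delete expressions to the modify clause.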