xml parsing - Scraping an HTML page using readHTMLTable in R


I'd like to scrape this webpage: http://unbisnet.un.org:8080/ipac20/ipac.jsp?&menu=search&aspect=power&npp=50&ipp=20&spp=20&profile=bib&index=.tw&term=%22draft+resolution%22&index=.aw&term=netherlands

I need to create a data frame containing 5 columns (Title, Imprint, Enhanced Title, UN Document Symbol, Publication Date) for each resolution. I tried using readHTMLTable, but I can't seem to figure it out. The website contains many tables. When I run the code below I get a list of 394 objects, which seem empty to me.

    library(XML)  # provides readHTMLTable()

    u <- "http://unbisnet.un.org:8080/ipac20/ipac.jsp?&menu=search&aspect=power&npp=50&ipp=20&spp=20&profile=bib&index=.tw&term=%22draft+resolution%22&index=.aw&term=netherlands"
    tables <- readHTMLTable(u)
    names(tables)
    length(tables)
    tables[[1]]

You'll have to change your approach. That's a horrible site with many, many nested <table>s. Here's one way to deal with it, but you'll have to roll up your sleeves for the other fields you may want. Please research XPath queries or post specific follow-up "how to"s as separate questions (i.e. how to target a specific XPath element) rather than tacking them on to this post.

We'll use the hadleyverse:

    library(rvest)
    library(dplyr)

This reads in the HTML page:

    # read_html() replaces html(), which is deprecated in newer rvest releases
    doc <- read_html("http://unbisnet.un.org:8080/ipac20/ipac.jsp?&menu=search&aspect=power&npp=50&ipp=20&spp=20&profile=bib&index=.tw&term=%22draft+resolution%22&index=.aw&term=netherlands")

At some point in the nested structure there's a <table> with one <tr> and two <td> elements that hold the field label and the field data. One way to get the data value of a field is to target the label (which is in an <a> tag), go up one level (to the <td>) and take its next sibling (the value <td>). For example, for the Title: values:

    # XPath string matching is case-sensitive: the label must match the page text
    titles <- html_text(
      html_nodes(doc,
                 xpath = "//td/a[starts-with(., 'Title:')]/../following-sibling::td[1]"))
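To see what that XPath does without hitting the live site, here is a minimal sketch on an inline HTML string. The markup in `snippet` is a hypothetical miniature of the label/value layout described above, not the page's real markup, and `mini` and `demo_titles` are names made up for the illustration.

```r
library(rvest)

# Hypothetical miniature of the nested label/value layout (not the real page)
snippet <- '<html><body><table>
  <tr><td><table><tr>
    <td><a href="#">Title:</a></td>
    <td>Draft resolution A</td>
  </tr></table></td></tr>
  <tr><td><table><tr>
    <td><a href="#">Imprint:</a></td>
    <td>[New York] : UN, 2015</td>
  </tr></table></td></tr>
</table></body></html>'

# read_html() treats a string that contains markup as literal HTML
mini <- read_html(snippet)

# Find the <a> label, step up to its parent <td>, take the first following <td>
demo_titles <- html_text(
  html_nodes(mini,
             xpath = "//td/a[starts-with(., 'Title:')]/../following-sibling::td[1]"))

demo_titles
```

Only the `<td>` next to the `Title:` label is returned; the `Imprint:` row is skipped because its `<a>` text does not start with the requested label.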

You can do the same for these 3 fields:

    imprints <- html_text(html_nodes(doc, xpath = "//td/a[starts-with(., 'Imprint:')]/../following-sibling::td[1]"))
    un_doc_sym <- html_text(html_nodes(doc, xpath = "//td/a[starts-with(., 'UN Doc')]/../following-sibling::td[1]"))
    pub_date <- html_text(html_nodes(doc, xpath = "//td/a[starts-with(., 'Publication Date')]/../following-sibling::td[1]"))
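Since the queries differ only in the label, the repetition could be folded into a small helper. This is just a sketch: `field_values()` is a hypothetical name, not part of rvest, and it assumes the label contains no single quote (which would break the XPath string). The inline `mini_doc` markup is likewise invented so the example runs on its own.

```r
library(rvest)

# Hypothetical helper: text of the <td> that follows each matching label <a>.
# Assumes `label` contains no single quote.
field_values <- function(doc, label) {
  xpath <- sprintf(
    "//td/a[starts-with(., '%s')]/../following-sibling::td[1]", label)
  html_text(html_nodes(doc, xpath = xpath))
}

# Invented stand-in for the real page so the sketch is self-contained
mini_doc <- read_html('<table>
  <tr><td><a href="#">Imprint:</a></td><td>[New York] : UN, 2015</td></tr>
  <tr><td><a href="#">UN Document Symbol:</a></td><td>S/2015/562</td></tr>
</table>')

demo_imprints <- field_values(mini_doc, "Imprint:")
demo_symbols  <- field_values(mini_doc, "UN Doc")
```

With the real page, the same call would be `field_values(doc, "Imprint:")` and so on, one call per field.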

And wrap them in a data_frame / data.frame:

    unbisnet <- data_frame(titles, imprints, un_doc_sym, pub_date)

    glimpse(unbisnet)

    ## Observations: 50
    ## Variables:
    ## $ titles     (chr) "Draft resolution [on establishment of internati...
    ## $ imprints   (chr) "[New York] : UN, 29 July 2015", "[New York] : UN, ...
    ## $ un_doc_sym (chr) "S/2015/562", "A/69/L.80", "A/69/L.78", "A/69/L.76"...
    ## $ pub_date   (chr) "20150729", "20150715", "20150630", "20150625", "20...
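The publication dates come back as character strings in what looks like `YYYYMMDD` form. If real dates are needed, one possible follow-up step (an assumption about what you want, not something the site requires) is to parse them with base R. The sample values below are copied from the output above; `pub_date_chr` and `pub_date_parsed` are illustrative names.

```r
# Sample values taken from the glimpse() output above
pub_date_chr <- c("20150729", "20150715", "20150630", "20150625")

# Parse YYYYMMDD strings into Date objects; unparseable values become NA
pub_date_parsed <- as.Date(pub_date_chr, format = "%Y%m%d")

pub_date_parsed
```

On the full data frame this would look like `mutate(unbisnet, pub_date = as.Date(pub_date, format = "%Y%m%d"))`.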

Targeting the others won't be as easy, but it's doable.

