xml parsing - Scraping HTML-page using readHTMLTables in R

- February 15, 2013

I'd like to scrape this webpage: http://unbisnet.un.org:8080/ipac20/ipac.jsp?&menu=search&aspect=power&npp=50&ipp=20&spp=20&profile=bib&index=.tw&term=%22draft+resolution%22&index=.aw&term=netherlands

I need to create a data frame containing 5 columns (title, imprint, enhanced title, UN document symbol, publication date) for each resolution. I tried using readHTMLTable, but I can't seem to figure it out. The website contains many tables. When I run the code below I get a list of 394 objects, which seem empty to me.

library(XML)
u <- "http://unbisnet.un.org:8080/ipac20/ipac.jsp?&menu=search&aspect=power&npp=50&ipp=20&spp=20&profile=bib&index=.tw&term=%22draft+resolution%22&index=.aw&term=netherlands"
tables = readHTMLTable(u)
names(tables)
length(tables)
tables[[1]]

You have to change your approach. That's a horrible site with many, many nested <table>s. Here's one way to deal with it, but you'll have to roll up your sleeves for the other fields you may want. Please research XPath queries, or post specific follow-up "how to"s as separate questions (e.g. how to target a specific XPath element) rather than tacking them on to this post.

We'll use the Hadleyverse:

library(rvest)
library(dplyr)

This reads in the HTML page:

# read_html() is used here because html() was deprecated in its favor
doc <- read_html("http://unbisnet.un.org:8080/ipac20/ipac.jsp?&menu=search&aspect=power&npp=50&ipp=20&spp=20&profile=bib&index=.tw&term=%22draft+resolution%22&index=.aw&term=netherlands")

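At some point in the nested structure there's a <table> with one <tr> and two <td> elements that hold the field label and the field data. One way to get at the data value is to target the label (which is in an <a> tag), go up one level (to its <td>), and take its next sibling (the value <td>). For example, here are the Title: values:

# XPath string matching is case-sensitive, so the label must match the page's capitalization
titles <- html_text(
            html_nodes(doc,
               xpath="//td/a[starts-with(., 'Title:')]/../following-sibling::td[1]"))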
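
To see the three XPath steps in isolation, here is a minimal sketch on a made-up HTML fragment. The fragment, `html_src`, and `toy` are assumptions for illustration only; the real page is far more deeply nested, but the label-to-sibling walk is the same idea.

```r
library(rvest)

# Made-up fragment imitating the label/value layout described above
html_src <- '<table>
  <tr><td><a href="#">Title:</a></td><td>Draft resolution A</td></tr>
  <tr><td><a href="#">Imprint:</a></td><td>[New York] : UN, 2015</td></tr>
</table>
<table>
  <tr><td><a href="#">Title:</a></td><td>Draft resolution B</td></tr>
</table>'

toy <- read_html(html_src)

# Step 1: the <a> tags whose text starts with the label
label_links <- html_nodes(toy, xpath = "//td/a[starts-with(., 'Title:')]")

# Step 2: ".." moves up to the <td> that holds each label
label_cells <- html_nodes(toy, xpath = "//td/a[starts-with(., 'Title:')]/..")

# Step 3: the first following <td> sibling is the value cell
value_cells <- html_nodes(
  toy,
  xpath = "//td/a[starts-with(., 'Title:')]/../following-sibling::td[1]")

toy_titles <- html_text(value_cells)
print(toy_titles)
```

Only the Title: rows are picked up; the Imprint: row is skipped because its label does not match the `starts-with()` test.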

You can do the same for these three fields:

imprints <- html_text(html_nodes(doc, xpath="//td/a[starts-with(., 'Imprint:')]/../following-sibling::td[1]"))
un_doc_sym <- html_text(html_nodes(doc, xpath="//td/a[starts-with(., 'UN Doc')]/../following-sibling::td[1]"))
pub_date <- html_text(html_nodes(doc, xpath="//td/a[starts-with(., 'Publication Date')]/../following-sibling::td[1]"))
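
Since the three queries differ only in the label, the pattern can be folded into a small function. This is just a sketch: `field_values()` is a hypothetical helper, not part of rvest, and the fragment it is tried on is made up. It also assumes the label contains no single quote, since the label is pasted straight into the XPath string.

```r
library(rvest)

# Hypothetical helper: text of the value cells that follow a given label
field_values <- function(doc, label) {
  xpath <- sprintf(
    "//td/a[starts-with(., '%s')]/../following-sibling::td[1]", label)
  html_text(html_nodes(doc, xpath = xpath))
}

# Made-up fragment with the same label/value layout
toy <- read_html('<table>
  <tr><td><a href="#">Imprint:</a></td><td>[New York] : UN, 2015</td></tr>
  <tr><td><a href="#">UN Document Symbol:</a></td><td>S/2015/562</td></tr>
  <tr><td><a href="#">Publication Date:</a></td><td>20150729</td></tr>
</table>')

toy_imprints <- field_values(toy, "Imprint:")
toy_symbols  <- field_values(toy, "UN Doc")
toy_dates    <- field_values(toy, "Publication Date")
```

On the real page the calls would take `doc` instead of `toy`, with the labels spelled exactly as the page spells them.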

And wrap them in a data_frame / data.frame:

# data_frame() is deprecated in current dplyr/tibble; tibble() is its replacement
unbisnet <- data_frame(titles, imprints, un_doc_sym, pub_date)

glimpse(unbisnet)

## Observations: 50
## Variables:
## $ titles     (chr) "Draft resolution [on establishment of internati...
## $ imprints   (chr) "[New York] : UN, 29 July 2015", "[New York] : UN, ...
## $ un_doc_sym (chr) "S/2015/562", "A/69/L.80", "A/69/L.78", "A/69/L.76"...
## $ pub_date   (chr) "20150729", "20150715", "20150630", "20150625", "20...

Targeting the others won't be as easy, but it's doable.
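
One thing that can get in the way is a field that is missing from some records, because the page-wide vectors then end up with different lengths and no longer line up row by row. I have not checked whether that happens on this page, so treat the following as a sketch under assumptions: the `class="record"` wrapper, the fragment, and `get_field()` are all made up for illustration. The idea is to query each record separately and fill in NA where a label is absent.

```r
library(rvest)

# Made-up fragment: the second record has no Imprint: row
toy <- read_html('<table class="record">
  <tr><td><a href="#">Title:</a></td><td>Draft resolution A</td></tr>
  <tr><td><a href="#">Imprint:</a></td><td>[New York] : UN, 2015</td></tr>
</table>
<table class="record">
  <tr><td><a href="#">Title:</a></td><td>Draft resolution B</td></tr>
</table>')

# Hypothetical helper: one value per record, NA when the label is absent
get_field <- function(record, label) {
  # the leading "." keeps the search inside this record
  xpath <- sprintf(
    ".//td/a[starts-with(., '%s')]/../following-sibling::td[1]", label)
  hit <- html_nodes(record, xpath = xpath)
  if (length(hit) == 0) NA_character_ else html_text(hit[[1]])
}

records <- html_nodes(toy, xpath = "//table[@class='record']")

toy_df <- data.frame(
  title   = vapply(records, get_field, character(1), label = "Title:"),
  imprint = vapply(records, get_field, character(1), label = "Imprint:"),
  stringsAsFactors = FALSE
)
print(toy_df)
```

The hard part on the real site is finding an XPath that selects exactly one node per record; once you have that, the per-record lookup stays the same.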
