Extracting an HTML Table from a Website in R
Solution 1:
Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want, but if you use PhantomJS as a headless browser within RSelenium, it's not all that complicated (by RSelenium standards):
library(RSelenium)
library(rvest)
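# initialize browser and driver with RSelenium
# (phantom() is defunct in recent RSelenium releases, which point to
# wdman::phantomjs() or rsDriver() instead)
ptm <- phantom()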
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()
# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]
# clean up
rd$close()
ptm$stop()
# parse with rvest
df <- html %>% read_html() %>%
    html_node('#ismr-event-history table.ism-table') %>%
    html_table() %>%
    setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>%    # clean column names
    setNames(gsub('\\s', '_', names(.)))
str(df)
## 'data.frame':    20 obs. of  10 variables:
##  $ Gameweek                : chr  "GW1" "GW2" "GW3" "GW4" ...
##  $ Gameweek_Points         : int  34 47 53 51 66 66 65 63 48 90 ...
##  $ Points_Bench            : int  1 6 9 7 14 2 9 3 8 2 ...
##  $ Gameweek_Rank           : chr  "2,406,373" "2,659,789" "541,258" "905,524" ...
##  $ Transfers_Made          : int  0 0 2 0 3 2 2 0 2 0 ...
##  $ Transfers_Cost          : int  0 0 0 0 4 4 4 0 0 0 ...
##  $ Overall_Points          : chr  "34" "81" "134" "185" ...
##  $ Overall_Rank            : chr  "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
##  $ Value                   : chr  "£100.0" "£100.0" "£99.9" "£100.0" ...
##  $ Change_Previous_Gameweek: logi  NA NA NA NA NA NA ...
As always, more cleaning is necessary, but overall it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well; parse_number comes from readr.) The arrows are images, which is why the last column is all NA, but you can calculate those anyway.
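The cleanup described above can be sketched roughly as follows. To run without a browser, the snippet retypes the first four rows from the str() output by hand, converts the character columns with parse_number(), and derives the change in overall rank from the previous gameweek. The column names rank_change and direction are made up for illustration, and treating a positive rank_change as "moved up" is an assumption about what the arrows mean.

```r
library(dplyr)
library(readr)

# First four rows from the str() output above, typed in by hand.
# "\u00a3" is the pound sign, escaped to sidestep encoding problems.
df <- data.frame(
  Gameweek        = c("GW1", "GW2", "GW3", "GW4"),
  Gameweek_Points = c(34L, 47L, 53L, 51L),
  Gameweek_Rank   = c("2,406,373", "2,659,789", "541,258", "905,524"),
  Overall_Points  = c("34", "81", "134", "185"),
  Overall_Rank    = c("2,406,373", "2,448,674", "1,914,025", "1,461,665"),
  Value           = c("\u00a3100.0", "\u00a3100.0", "\u00a399.9", "\u00a3100.0"),
  stringsAsFactors = FALSE
)

clean <- df %>%
  # Every character column becomes numeric; note that "GW1" turns into 1.
  mutate_if(is.character, parse_number) %>%
  mutate(
    # Previous rank minus current rank: positive when the rank number shrank.
    rank_change = dplyr::lag(Overall_Rank) - Overall_Rank,
    # Map the sign (-1, 0, 1) onto a label; the first week stays NA.
    direction   = c("down", "same", "up")[sign(rank_change) + 2]
  )

print(clean)
```

The first gameweek has no previous rank to compare against, so both derived columns are NA there.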
Solution 2:
This solution uses RSelenium along with the XML package. It also assumes that you have a working installation of RSelenium that can properly work with Firefox. Just make sure you have the path to the firefox starter script added to your PATH.
If you are using OS X, you will need to add /Applications/Firefox.app/Contents/MacOS/ to your PATH. Or, if you're on an Ubuntu machine, it's likely /usr/lib/firefox/. Once you're sure this is working, you can move on to R with the following:
# Install RSelenium and XML for R
#install.packages("RSelenium")
#install.packages("XML")
# Import packages
library(RSelenium)
library(XML)
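# Check and start servers for Selenium
# (checkForServer() and startServer() are defunct in recent RSelenium
# releases, which provide rsDriver() instead)
checkForServer()
startServer()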
# Use firefox as a browser and a port that's not used
remote_driver <- remoteDriver(browserName = "firefox", port = 4444)
remote_driver$open(silent = TRUE)
# Use RSelenium to browse the site
epl_link <- "https://fantasy.premierleague.com/a/entry/767830/history"
remote_driver$navigate(epl_link)
elem <- remote_driver$findElement(using = "class name", value = "ism-table")
# Get the HTML source (getElementAttribute returns a list, so take its first element)
elemtxt <- elem$getElementAttribute("outerHTML")[[1]]
# Use the XML package to work with the HTML source
elem_html <- htmlTreeParse(elemtxt, useInternalNodes = TRUE, asText = TRUE)
# Convert the table into a dataframe
games_table <- readHTMLTable(elem_html, header = TRUE, stringsAsFactors = FALSE)[[1]]
# Change the column names into something legible
names(games_table) <- unlist(lapply(strsplit(names(games_table), split = "\\n\\s+"), function(x) x[2]))
names(games_table) <- gsub("£", "Value", gsub("#", "CPW", gsub("Â","",names(games_table))))
# Convert the fields into numeric values
games_table <- transform(games_table, GR = as.numeric(gsub(",","",GR)),
                         OP = as.numeric(gsub(",","",OP)),
                         OR = as.numeric(gsub(",","",OR)),
                         Value = as.numeric(gsub("£","",Value)))
This should yield:
  GW GP PB      GR TM TC   OP      OR Value CPW
 GW1 34  1 2406373  0  0   34 2406373 100.0
 GW2 47  6 2659789  0  0   81 2448674 100.0
 GW3 53  9  541258  2  0  134 1914025  99.9
 GW4 51  7  905524  0  0  185 1461665 100.0
 GW5 66 14  379438  3  4  247  958889 100.1
 GW6 66  2  303704  2  4  309  510376  99.9
 GW7 65  9  138792  2  4  370  232474  99.8
 GW8 63  3  108363  0  0  433   87967 100.4
 GW9 48  8 1114609  2  0  481   75385 100.9
GW10 90  2   71210  0  0  571   27716 101.1
GW11 71  2  421706  3  4  638   16083 100.9
GW12 35  9 2798661  2  4  669   31820 101.2
GW13 41  8 2738535  1  0  710   53487 101.1
GW14 82 15  308725  0  0  792   29436 100.2
GW15 55  9 1048808  2  4  843   29399 100.6
GW16 49  8 1801549  0  0  892   35142 100.7
GW17 48  4 2116706  2  0  940   40857 100.7
GW18 42  2 3315031  0  0  982   78136 100.8
GW19 41  9 2600618  0  0 1023   99048 100.6
GW20 53  0 1644385  0  0 1076  113148 100.8
Please note that the column CPW (change from previous week) is a vector of empty strings.
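If a number is wanted in place of the missing arrows, CPW can be derived from the overall rank. A minimal base R sketch, using a hand-typed stand-in for games_table with the first four OR values from the output above, and assuming that a positive value should mean the rank improved:

```r
# Stand-in for the scraped table: first four gameweeks only, typed by hand.
games_table <- data.frame(
  GW  = c("GW1", "GW2", "GW3", "GW4"),
  OR  = c(2406373, 2448674, 1914025, 1461665),
  CPW = "",
  stringsAsFactors = FALSE
)

# diff() gives current rank minus previous rank; negating it makes a move
# up the table (a smaller rank number) positive. GW1 has no previous week,
# so it gets NA.
games_table$CPW <- c(NA, -diff(games_table$OR))

print(games_table)
```

The same line should work on the full 20-row table once OR has been converted to numeric by the transform() call.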
I hope this helps.