Rotten Tomatoes Data in R

A few months ago, I decided it would be fun to do some predictive modeling of the quality of upcoming Hollywood films. There’s tons of data out there, but some of it can be hard to find. As part of that project, I wrote a short R script to scrape some data from Rotten Tomatoes. Feed the function an actor’s name, and it will return all of their film and TV work along with the corresponding Tomatometer scores, release years, and a few other things. We can use this to find out whether anybody’s career is in real decline (or on the upswing): for example, Charlie Sheen’s or M. Night Shyamalan’s.

Reviews of films featuring or by Charlie Sheen and M. Night Shyamalan are highly negatively correlated with time.
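
One way to put a number on a trend like this is the correlation between release year and Tomatometer score. The sketch below assumes a data frame with the year and score columns that actorMovies returns. careerTrend is a hypothetical helper name, and the example rows are made-up placeholder values, not real Rotten Tomatoes scores.

```r
careerTrend <- function(films)
{
  # Summarizes how scores move over time for one celebrity.
  # Args:
  #  films: A data frame with numeric columns "year" and "score".
  # Returns:
  #  A list with the year-score correlation, the fitted change in
  #  score per year, and the number of credits used.
  ok <- complete.cases(films[, c("year", "score")])
  films <- films[ok, ]
  fit <- lm(score ~ year, data = films)
  list(correlation = cor(films$year, films$score),
       slope = unname(coef(fit)["year"]),
       n = nrow(films))
}

# Placeholder rows for illustration only (NOT real Tomatometer data).
toy <- data.frame(year  = c(2000, 2002, 2004, 2006, 2008, 2010),
                  score = c(85, 74, NA, 50, 43, 24))
trend <- careerTrend(toy)
```

A negative correlation and a negative slope both point to declining reviews. Credits without a score are dropped before either is computed.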

It turns out that it is much more difficult to find celebrities with positive career trajectories. I suspect this is because celebrities who start out in films with poor reviews don’t get the opportunity to improve over time.

Actors and directors who got better with age.

Note: To run the code below, you will need R, a free program for statistical computing, along with the XML package that the script loads. Also, R is excellent and there’s no good reason not to have it.

library(XML)

cleanString <- function(string)
{
  # Replaces newlines and tabs with spaces and collapses repeated spaces.
  # Args:
  #  string: A character vector to be cleaned of stray whitespace.
  #
  # Returns:
  #  The supplied character vector with newlines, tabs, and runs of
  #  spaces each reduced to a single space.
  string <- gsub("\\n"," ",string)
  string <- gsub("\\t"," ",string)
  string <- gsub(" +"," ",string)
  return(string)
}

actorMovies <- function(name, film.table=4, tv.table=4)
{
  # Downloads a celebrity's data from www.rottentomatoes.com
  # Args:
  #  name: The celebrity's name or Rotten Tomatoes ID.
  #  film.table: The number of the table that corresponds to
  #              celebrity film credits on the rotten tomatoes page.
  #  tv.table: The number of the table that corresponds to
  #            celebrity TV credits on the rotten tomatoes page.
  # Returns:
  #  A dataframe with celebrity credits.
  name <- gsub(" ","_",name)
  name <- gsub("\\.","",name)
  url <- paste0("http://www.rottentomatoes.com/celebrity/",name)
  page <- htmlParse(url)
  tabs <- readHTMLTable(page)
  n <- length(tabs)
  films <- tabs[[film.table]]
  names(films) <- c("score","title","role","gross","year")
  films$score <- as.numeric(gsub("%","",as.character(films$score)))
  films$year <- as.numeric(substr(as.character(films$year),1,4))
  films$gross <- as.numeric(gsub("M","",gsub("\\$","",as.character(films$gross))))
  films$title <- cleanString(films$title)
  films$role <- cleanString(films$role)
  films$film <- 1
  tv <- data.frame()
  if(n==tv.table)
  {
    tv <- tabs[[tv.table]]
    names(tv) <- c("score","title","role","blank","year")
    tv$score <- as.numeric(gsub("%","",as.character(tv$score)))
    tv$year <- as.numeric(substr(as.character(tv$year),1,4))
    tv$film <- 0
    tv$gross <- NA
    tv <- tv[,c("score","title","role","gross","year","film")]
    tv$title <- cleanString(tv$title)
    tv$role <- cleanString(tv$role)
  }
  films <- rbind(films,tv)
  films$director <- grepl("Director",films$role)
  films$producer <- grepl("Producer",films$role)
  films <- films[films$title!="TITLE",]
  return(films)
}
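
A call might look like the sketch below. Because the scraper depends on a page layout that may have changed, the runnable part only reproduces the name-to-URL step from the top of actorMovies. celebrityUrl is a hypothetical helper introduced here for illustration; it is not part of the original script.

```r
celebrityUrl <- function(name)
{
  # Builds the celebrity page URL the same way actorMovies does:
  # spaces become underscores and periods are dropped.
  name <- gsub(" ", "_", name)
  name <- gsub("\\.", "", name)
  paste0("http://www.rottentomatoes.com/celebrity/", name)
}

url <- celebrityUrl("M. Night Shyamalan")
print(url)

# With a working connection and a matching page layout, the full
# scrape would be:
# sheen <- actorMovies("Charlie Sheen")
# head(sheen[, c("year", "title", "score")])
```

Case is left as typed, so a Rotten Tomatoes ID such as m_night_shyamalan passes through unchanged.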

Rotten Tomatoes does provide an API, but it doesn’t appear to offer a celebrity query option.

Written by:

Benjamin Radford is a data scientist and political scientist. He received his Ph.D. in Political Science from Duke University where he studied security, peace, & conflict and political methodology. He specializes in data science, cybersecurity, political forecasting, and arms proliferation. He is currently a Principal Data Scientist with Sotera Defense Solutions.

4 Comments

  1. Frank C Jones
    10/9/2014

    You’re having way too much fun with R. Don’t you know that stats is supposed to be boring? BTW, this site is great.

    • Ben Radford
      10/9/2014

      Thanks, Frank! It’s a bit sparse now, but I’m hoping to get some more content up here soon.

  2. #Tried running
    actorMovies(name = "m_night_shyamalan", film.table = 3, tv.table = 4)
    #returned
    Error in names(films) <- c("score", "title", "role", "gross", "year") :
    'names' attribute [5] must be the same length as the vector [2]
    #Did I do the wrong table #, and if so how do I find the right one?

  3. 4/19/2015

    Hey @THSTUPIDFACTORY – try film.table=4. That works for me. RottenTomatoes occasionally changes the layout of their site which messes with this code. The way to figure it out for yourself is to set global variables:

    name="m_night_shyamalan"
    film.table=4

    Then walk through the first few lines of the actorMovies function by hand. When you get to this line:
    films <- tabs[[film.table]]
    run the command print(tabs). Then you can look through the result and see which entry (table) is the one with all the movies and stuff. That number is film.table. So if it is the third entry, then film.table=3. Once you figure it out, it should be consistent for all actors until they change the page layout again.
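
    The manual search described above can also be scripted. The sketch below assumes that the credits tables are the ones with exactly five columns, since that is what the names(films) <- ... line in actorMovies expects. findCreditTables is a hypothetical helper, and the toy list stands in for the output of readHTMLTable(page).

```r
findCreditTables <- function(tabs, n.cols = 5)
{
  # Returns the positions of the tables in `tabs` that have n.cols columns.
  # Args:
  #  tabs: A list of data frames, e.g. the result of readHTMLTable(page).
  #  n.cols: The number of columns a credits table is expected to have.
  # Returns:
  #  An integer vector of candidate values for film.table / tv.table.
  is.candidate <- vapply(tabs,
                         function(tab) is.data.frame(tab) && ncol(tab) == n.cols,
                         logical(1))
  unname(which(is.candidate))
}

# Toy stand-in for readHTMLTable(page): only the third entry has five columns.
toy.tabs <- list(data.frame(a = 1, b = 2),
                 NULL,
                 data.frame(score = "50%", title = "x", role = "y",
                            gross = "$1.0M", year = "2001"),
                 data.frame(a = 1))
candidates <- findCreditTables(toy.tabs)
print(candidates)
```

    If more than one position comes back, printing each candidate table is still the way to tell film credits from TV credits.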
