TDM 10100: Project 7 — 2023
Motivation: A couple of bread-and-butter functions that are a part of the base R are: subset
, and merge
. subset
provides a more natural way to filter and select data from a data.frame. merge
brings the principals of combining data that SQL uses, to R.
Context: We’ve been getting comfortable working with data in within the R environment. Now we are going to expand our tool set with these useful functions, all the while gaining experience and practice wrangling data!
Scope: r, subset, merge, tapply
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/icecream/combined/products.csv
-` /anvil/projects/tdm/data/icecream/combined/reviews.csv` -
/anvil/projects/tdm/data/movies_and_tv/titles.csv
-
/anvil/projects/tdm/data/movies_and_tv/episodes.csv
-
/anvil/projects/tdm/data/movies_and_tv/ratings.csv
Questions
Please select 3 cores when launching JupyterLab for this project. |
Data can come in a lot of different formats and from a lot of different locations. It is common to have several files that need to be combined together, before analysis is performed. The merge
function is helpful for this purpose. The way that we merge files is different in each language and data science too. With R, there is a built-in merge
function that makes things easy! (Of course students in TDM 10100 have not yet learned about SQL databases, but many of you will learn SQL databases someday too. The merge
function is very similar to the ability to merge
tables in SQL databases.)
Read the data in using the following code. We used
|
Please remember to run the Error in fread: could not find function "fread" |
Question 1 (1 pt)
We will use the products
data.frame for this question.
-
What are all the different ingredients in the first record of
products
? -
Consider the
rating
column and theingredients
column. Consider only the products in which "GUAR GUM" is one of the ingredients. (You will need to use eithergrep
orgrepl
or something similar, to find these products. Hint: You should have 85 such products.) List the ratings for these 85 products, in decreasing order.
Please find out the distribution of ratings for the ice cream which ingredients include "GUAR GUM", display the result in descending order
Question 2 (1 pt)
We will use the products
and reviews
data.frames for this question.
-
Use the
brand
andkey
columns from both theproducts
data.frame andreviews
data.frame tomerge
the two data frames. This will give a new data.frame that contains the product details and their associated reviews.
If you do not specify the |
|
Question 3 (3 pts)
We will use the episodes
, titles
and ratings
data.frames for questions 3 through Question 5
-
Use
merge
(a few times) to create a new data.frame that contains at least the following four columns for only the episodes of the show called "Stranger Things". The show itself called "Stranger Things" has atitle_id
of tt4574334. You can find this on IMDB here: www.imdb.com/title/tt4574334/ Each episode of Stranger Things has its owntitle_id
that contains the information for the specific episode as well. For your output: Show the top 5 rows of your final data.frame, containing the top 5 rated episodes of Stranger Things.-
The
primary_title
of the show itself — call itshow_title
. -
The
primary_title
of the episode — call itepisode_title
. -
The
rating
of the show itself — call itshow_rating
. -
The
rating
of the episode — call itepisode_rating
.
-
Start by getting a subset of the |
Make sure to show the top 5 rows of your final data.frame, containing the top 5 rated episodes of Stranger Things!
In the videos, I did not rename the columns. You might want to rename them, because it might help you, but you do not need to rename them. It’s up to you. I’m trying to be a little flexible and to provide guidance without being too strict either. |
Question 4 (1 pt)
For question 4, use the data frame that you built in Question 3.
-
Use regular old indexing to find all episodes of "Stranger Things" with an
episode_rating
less than 8.5 andseason_number
of exactly 3. -
Repeat the process, but this time use the
subset
function instead.
Make sure that the dimensions of the data frames that you get in question 4a and 4b are the same sizes!
Question 5 (2 pts)
For question 5, use the data frame that you built in Question 3.
The subset
function allows you to index data.frame’s in a less verbose manner. Read this.
While it maybe appears to be a clean way to subset data, I’d suggest avoiding it over explicit long-form indexing. Read this fantastic article by Dr. Hadley Wickham on non-standard evaluation. Take for example, the following (a bit contrived) example using the dataframe we got in question (3).
Note: You do not need to write much for your answer. It is OK if you try the example below, and you see that it fails (and it will fail for sure!), and then you say something like, "I will try hard to not use variable names that overlap with other variable names". Or something like that! We simply want to ensure that students are choosing to use good variable names.
season_number <- 3
subset(StrangerThingsBigMergedDF, (season_number == season_number) & (rating.y < 8.5))
-
Read that provided article and do your best to explain why
subset
gets a different result than our example that uses regular indexing.
Project 07 Assignment Checklist
-
Jupyter Lab notebook with your code, comments and output for the assignment
-
firstname-lastname-project07.ipynb
.
-
-
R code and comments for the assignment
-
firstname-lastname-project07.R
.
-
-
Submit files through Gradescope
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |