TDM 20100: Project 10 — 2023
Motivation: Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN
, MAX
, and AVG
in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.
Context: We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values!
Scope: SQL, SQL in R
Dataset(s)
For this project, we will be using the lahman
sqlite database. This database contains the data in the directory
-
/anvil/projects/tdm/data/lahman
You may get some more lahman
database information from this youtube video youtube.com/watch?v=tS_-oTbsDzs
To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.
%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db
For every following cell where you want to run a SQL query, prepend %%sql
to the top of the cell — just like we do for R or bash cells.
Questions
Question 1 (1 pt)
-
Let’s say we are interested in the total number of baseball players for each year, from year 2018 to year 2022, respectively. Please write a query to count the total number of players in the appearances table (by year), and display these totals in, descending order, by year.
year num_of_players
2022 1495
…
SELECT column_name AS alias_name FROM table_name WHERE [condition] |
Question 2 (2 pts)
Now, let’s look into the teams
table. The attendance
column provides the total number of audiences that attended a team’s home games. We may say that a team is more popular if it has more attendance at its home games.
-
Please find out what is the average attendance number for each team in the
teams
table, during games from 2022 (only). You should have one average attendance number per team. -
Now use a subquery to compute the average attendance across all teams and games. Then modify question 2a, to only include teams whose average attendance for the team is larger than the average across all teams and games. Using an alias, change the attendance column in your query to appear as "average_attendance".
The AVG function will be useful to calculate average attendance
|
We can achieve this using a subquery. A subquery is a query that is used to embed a query within another query. |
Question 3 (1 pt)
If you answered question (2) correctly, you should find that team Los Angeles Dodgers
, with team ID 'LAN`, had the highest average attendance. We can consider this team as the most popular team in 2022.
-
Please calculate the winning percentage for this team in 2022, using the fields 'W' and 'L' from the
teams
table with the formula:
winning_percentage = W/(W+L)
Use the name winning_per
for the resulting column.
Some of you might get a |
Question 4 (2 pts)
You now know 2 different applications of the AS
keyword, and you also know how to use a query as a subquery. Great!
In the previous project, we were introduced to aggregate functions. We know we can use the WHERE
clause to filter our results, but what if we wanted to filter our results based on an aggregated column?
-
Update the query from question (3) to print all teams that have winning percentage from year 2012 to 2022 (inclusive) greater than 55%. You should get 3 teams. Display the results, by win percentage, in descending order.
See this article for more information on the |
Question 5 (2 pts)
-
Now let’s look at
allstarfull
table. Please list all players who attended 20 or more All Star games. List the players in descending order, by the number of All Star games that they attended. -
Please explore the tables in the database and write a query about some information that you are interested in. Please make sure to use aliasing, a subquery, and at least one aggregate function.
Project 10 Assignment Checklist
-
Jupyter notebook with your code, comments and output for questions 1 to 5
-
firstname-lastname-project10.ipynb
-
-
Submit files through Gradescope
-
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |