STAT 19000: Project 2 — Spring 2021
Motivation: In Python it is important to understand some of the data types in a little more depth than you would in R. Many of the data types in Python will seem very familiar. A `character` in R is similar to a `str` in Python. An `integer` in R is an `int` in Python. A `numeric` in R is similar to a `float` in Python. A `logical` in R is similar to a `bool` in Python. In addition, there are some very popular classes that packages like `numpy` and `pandas` introduce. On the other hand, there are some data types in Python, like `tuple`s, `list`s, `set`s, and `dict`s, that diverge from R a little more. It is essential to understand some basic concepts before jumping too far into everything.
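As a rough illustration of the types named above, the following sketch creates one value of each and prints its type. All variable names and values are made up for the example and are not taken from the dataset.

```python
# Scalar types and their rough R counterparts.
name = "sedan"        # str   ~ R character
doors = 4             # int   ~ R integer
price = 15999.99      # float ~ R numeric
is_clean = True       # bool  ~ R logical

# Container types that diverge more from R.
point = (40.42, -86.91)                    # tuple: ordered, immutable
colors = ["red", "blue", "red"]            # list: ordered, mutable
unique_colors = set(colors)                # set: unordered, no duplicates
vehicle = {"make": "ford", "year": 2012}   # dict: maps keys to values

for value in (name, doors, price, is_clean, point, colors, unique_colors, vehicle):
    print(type(value).__name__, value)
```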
Context: This is the second project introducing some basic data types, and demonstrating some familiar control flow concepts, all while digging right into a dataset.
Scope: tuples, lists, if statements, opening files
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/craigslist/vehicles.csv
Questions
Question 1
Read in the dataset `/class/datamine/data/craigslist/vehicles.csv` into a `pandas` DataFrame called `myDF`. `pandas` is an integral tool for various data science tasks in Python. You can read a quick intro here. We will be slowly introducing bits and pieces of this package throughout the semester. Similarly, we will try to introduce byte-sized (ha!) portions of plotting packages to slowly build up your skills.
How big is the dataset (in MB or GB)?
If you didn’t do [optional question 6 in project 1](#p1-06), we recommend taking a look.
Remember to check out a question’s relevant topics. We try very hard to link you to content and examples that will get you up and running as quickly as possible.
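One possible way to measure a file's size from Python is `os.path.getsize`, which returns the size in bytes. The sketch below is only an illustration: it writes a temporary stand-in file of known size instead of touching the real dataset, and the helper name `size_in_mb` is hypothetical. It also assumes 1 MB = 1,000,000 bytes.

```python
import os
import tempfile

def size_in_mb(path):
    """Return the size of the file at `path` in megabytes (1 MB = 1,000,000 bytes)."""
    return os.path.getsize(path) / 1_000_000

# Stand-in file; on Scholar you would pass the path to vehicles.csv instead.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"x" * 2_500_000)   # write exactly 2.5 million bytes
    tmp_path = tmp.name

mb = size_in_mb(tmp_path)
print(f"{tmp_path} is {mb:.2f} MB")

os.remove(tmp_path)   # clean up the stand-in file
```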
- Python code used to solve the problem.
Question 2
In Question 1, we read our data into a `pandas` DataFrame. Use one of the `pandas` DataFrame attributes to get the number of columns and rows of our dataset. How many columns and rows are there? Use f-strings to print a message, for example:
`
There are 123 columns in the DataFrame!
There are 321 rows in the DataFrame!
`
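A minimal sketch of the f-string pattern follows. It assumes the row and column counts arrive as a `(rows, columns)` tuple, which is the form the `shape` attribute of a `pandas` DataFrame takes. A plain tuple with the placeholder numbers from the example message stands in for the real DataFrame.

```python
# Hypothetical stand-in for the (rows, columns) tuple from a DataFrame's shape.
shape = (321, 123)
n_rows, n_cols = shape   # tuple unpacking

col_message = f"There are {n_cols} columns in the DataFrame!"
row_message = f"There are {n_rows} rows in the DataFrame!"
print(col_message)
print(row_message)
```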
In project 1, we learned how to read a csv file in, line-by-line, and print values. Use the `csv` package to print just the first row, which should contain the names of the columns, OR instead of using the `csv` package, use one of the `pandas` attributes from `myDF` (to print the column names).
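For the `csv` route, one way to grab only the first row is to call `next` on the reader. The sketch below uses a small in-memory string as a hypothetical stand-in for the real file, so the column names shown are illustrative only.

```python
import csv
import io

# Hypothetical two-line CSV standing in for vehicles.csv.
fake_file = io.StringIO("id,price,year\n1,5000,2012\n")

reader = csv.reader(fake_file)
header = next(reader)   # the first row holds the column names
print(header)
```

With a real file, the same pattern applies inside a `with open(...)` block.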
- The output from printing the f-strings.
- Python code used to solve the problem.
Question 3
Use the `csv` or `pandas` package to get a list called `our_columns` that contains the column names. Add a string, "extra", to the end of `our_columns`. Print the second value in the list. Without using a loop, print the 1st, 3rd, 5th, etc. elements of the list. Print the last four elements of the list ("state", "lat", "long", and "extra") by accessing their negative index.
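The indexing and slicing operations described here can be sketched on a short, hypothetical list of column names (the real list is longer). Python indices start at 0, so the second value sits at index 1, and a step of 2 picks out every other element.

```python
# Hypothetical, shortened list of column names.
our_columns = ["id", "url", "region", "price", "state", "lat", "long"]
our_columns.append("extra")          # add a value to the end of the list

second = our_columns[1]              # second element (index 1)
odd_positions = our_columns[::2]     # 1st, 3rd, 5th, ... elements, no loop needed
last_four = our_columns[-4:]         # negative index counts from the end

print(second)
print(odd_positions)
print(last_four)
```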
"extra" doesn’t belong in our list, you can easily remove this value from our list by doing the following…
our_columns.pop(25)
# or even this, as pop removes the last value by default
our_columns.pop()
BUT the problem with this solution is that you must know the index of the value you’d like to remove, and sometimes you do not know the index of the value. Instead, please show how to use a list method to remove "extra" by value rather than by index.
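As one hedged illustration of removing by value, the list method `remove` deletes the first element equal to its argument and raises `ValueError` if no such element exists. The short list below is a hypothetical stand-in for the full list of column names.

```python
# Hypothetical, shortened list of column names.
our_columns = ["state", "lat", "long", "extra"]

our_columns.remove("extra")   # removes the first matching value; no index needed
print(our_columns)

# Removing a value that is no longer present raises ValueError.
try:
    our_columns.remove("extra")
    removed_twice = True
except ValueError:
    removed_twice = False
print(removed_twice)
```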
-
Python code used to solve the problem.
-
The output from running your code.
Question 4
`matplotlib` is one of the primary plotting packages in Python. You are provided with the following code:
my_values = tuple(myDF.loc[:, 'odometer'].dropna().to_list())
The result is a tuple containing the odometer readings from all of the vehicles in our dataset. Create a line plot of the odometer readings.
Well, that plot doesn’t seem too informative. Let’s first sort the values in our tuple:
my_values.sort()
What happened? A tuple is immutable. What this means is that once the contents of a tuple are declared they cannot be modified. For example:
# This will fail because tuples are immutable
my_values[0] = 100
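To see both failures without stopping the program, the attempts can be wrapped in `try`/`except`. This is a sketch with made-up odometer values: item assignment on a tuple raises `TypeError`, and calling `sort` raises `AttributeError` because tuples have no such method.

```python
my_values = (30000.0, 120000.0, 85000.0)   # hypothetical odometer readings
caught = []

try:
    my_values[0] = 100          # tuples do not support item assignment
except TypeError as err:
    caught.append(type(err).__name__)
    print(f"Item assignment failed: {err}")

try:
    my_values.sort()            # tuples have no sort method
except AttributeError as err:
    caught.append(type(err).__name__)
    print(f"Sorting in place failed: {err}")

print(my_values)                # the tuple is unchanged
```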
You can read a good article about this here. In addition, here is a great post that gives you an idea of when using a tuple might be a good idea. Okay, so let’s go back to our problem. We know that lists are mutable (and therefore sortable), so convert `my_values` to a list, sort it, and re-plot.
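A minimal sketch of the conversion, using made-up values: `list(...)` copies the tuple into a mutable list, and `sort` then reorders that list in place. The built-in `sorted` is an alternative that accepts the tuple directly and returns a new sorted list.

```python
my_values = (30000.0, 120000.0, 85000.0)   # hypothetical odometer readings

as_list = list(my_values)   # copy the tuple into a mutable list
as_list.sort()              # sorts in place and returns None

also_sorted = sorted(my_values)   # returns a new list; the tuple is untouched

print(as_list)
print(also_sorted)
```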
It looks like there are some (potential) outliers that are making our plot look a little wonky. For the sake of seeing how the plot would look, use negative indexing to plot the sorted values minus the last 50 values (the 50 highest values). The new plot may not look that different; that is okay.
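Dropping the last 50 values of a sorted list can be sketched with a negative slice bound. The 200 integers below are a hypothetical stand-in for the sorted odometer readings.

```python
sorted_values = list(range(1, 201))   # hypothetical: 200 values, already sorted

trimmed = sorted_values[:-50]   # everything except the last 50 values
dropped = sorted_values[-50:]   # the 50 highest values that were left out

print(len(trimmed), trimmed[-1])
print(len(dropped), dropped[0])
```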
To prevent plotting values on the same plot, close your plot with `plt.close()`, for example:
import matplotlib.pyplot as plt
my_values = [1, 2, 3, 4, 5]
plt.plot(my_values)
plt.show()
# close the current figure so the next plot starts fresh
plt.close()
- Python code used to solve the problem.
- The output from running your code.
Question 5
We’ve covered a lot in this project! Use what you’ve learned so far to do one (or more) of the following tasks:
- Create a cool graphic using `matplotlib` that summarizes some data from our dataset.
- Use `pandas` and your investigative skills to sift through the dataset and glean an interesting factoid.
- Create some commented coding examples that highlight the differences between lists and tuples. Include at least 3 examples.

- Python code used to solve the problem.
- The output from running your code.