TDM 30200: Project 4 — 2023
snow way, that is pretty quick
Motivation: We got some exposure to parallelizing code in the previous project. Let’s keep practicing in this project!
Context: This is the last in a series of projects focused on parallelizing code using joblib and Python.
Scope: Python, joblib
Questions
Interested in being a TA? Please apply: purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE
Question 1
Check out the data located here: www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/
Normally, you are provided with clean (or semi-clean) sets of data: you read it in, do something, and call it a day. In this project, you are going to go get your own data, and although the questions won't be difficult, they will have less guidance than normal. Try to tap into what you learned in previous projects, and of course, if you get stuck, just shoot us a message in Piazza and we will help!
As you can see, the yearly datasets vary greatly in size. What is the average size (in MB) of the datasets? How many datasets are there (excluding non-year datasets)? What is the total download size (in GB)? Use the requests library and either beautifulsoup4 or lxml to scrape and parse the webpage so you can calculate these values.
Make sure to exclude any non-year files.
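Below is a minimal sketch of one way to approach this, not a prescribed solution. It assumes the by_year/ page is a standard directory listing where each yearly link (for example, 1893.csv.gz) is followed by plain text containing a human-readable size such as 437K or 1.2M; verify those details against the actual HTML before relying on them.

```python
import re

import requests
from bs4 import BeautifulSoup

# Sketch only: assumes each yearly file link is followed by plain text
# that contains a size like "437K" or "1.2M" (typical directory listing).
url = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# convert the K/M/G size suffixes to megabytes
to_mb = {"K": 1 / 1024, "M": 1, "G": 1024}

sizes_mb = []
for link in soup.find_all("a"):
    name = link.get("href", "")
    # keep only files named like 1750.csv.gz ... 2022.csv.gz (skip non-year files)
    if not re.fullmatch(r"\d{4}\.csv\.gz", name):
        continue
    # the date/size text usually follows the link as a plain string
    trailing = str(link.next_sibling or "")
    match = re.search(r"([\d.]+)\s*([KMG])", trailing)
    if match:
        sizes_mb.append(float(match.group(1)) * to_mb[match.group(2)])

print(f"number of yearly datasets: {len(sizes_mb)}")
print(f"average size (MB): {sum(sizes_mb) / len(sizes_mb):.2f}")
print(f"total download size (GB): {sum(sizes_mb) / 1024:.2f}")
```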
- Code used to solve this problem.
- Output from running the code.
Question 2
The 1893.csv.gz dataset appears to be about 1/1000th of our total download size. Perfect! Use the requests package to download the file, write the file to disk, and time the operations using this package (which is already installed).
If you had a single CPU, approximately how long would it take to download and write all of the files (in minutes)?
The following is an example of how to write a scraped file to disk.
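A minimal sketch along these lines (assumptions: the raw gzipped bytes are written to disk unchanged, the standard-library time module is used for timing, and the single-CPU estimate leans on 1893.csv.gz being roughly 1/1000th of the total):

```python
import time

import requests

# Sketch only: download one gzipped file, write its raw bytes to disk,
# and time the whole operation.
url = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/1893.csv.gz"

start = time.time()
resp = requests.get(url)
with open("1893.csv.gz", "wb") as f:
    f.write(resp.content)
elapsed = time.time() - start

print(f"downloaded and wrote 1893.csv.gz in {elapsed:.2f} seconds")
# crude single-CPU extrapolation, assuming this file is ~1/1000th of the total
print(f"rough single-CPU estimate for all files: {elapsed * 1000 / 60:.1f} minutes")
```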
- Code used to solve this problem.
- Output from running the code.
Question 3
You can request up to 128 cores using OnDemand and our Jupyter Lab app. Save your work and request a new session with a minimum of 4 cores. Write parallelized code that downloads all of the datasets (from 1750 to 2022) into your $SCRATCH directory. Before running the code, estimate the total amount of time this should take, given the number of cores you plan to use.
If you request more than 4 cores, please make sure to delete your Jupyter Lab session once your code has run and instead use a session with 4 or fewer cores.
There aren’t datasets for 1751-1762, so be sure to handle this. Perhaps you could look at the status code of each response.
Time how long it takes to download and write all of the files. Was your estimate close (within a minute or two)? If not, do you have any theories as to why? Your results may vary quite a bit; that is OK. Ultimately, we can only download as fast as the website will allow us to. Any time you are dealing with another website, or some other factor outside of your control, unexpected behavior is to be expected.
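Here is a minimal sketch of the parallel download, assuming 4 cores, that $SCRATCH is set in your environment, and that the missing years simply come back with a non-200 status code that can be skipped:

```python
import os
import time
from pathlib import Path

import requests
from joblib import Parallel, delayed

# Sketch only: download every yearly file into $SCRATCH using 4 workers.
BASE = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/"
SCRATCH = Path(os.environ["SCRATCH"])  # assumes $SCRATCH is set

def download_year(year: int) -> None:
    resp = requests.get(f"{BASE}{year}.csv.gz")
    if resp.status_code != 200:  # no dataset for this year (e.g. 1751-1762)
        return
    (SCRATCH / f"{year}.csv.gz").write_bytes(resp.content)

start = time.time()
Parallel(n_jobs=4)(delayed(download_year)(y) for y in range(1750, 2023))
print(f"downloaded everything in {(time.time() - start) / 60:.1f} minutes")
```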
- Code used to solve this problem.
- Output from running the code.
Question 4
In a previous question, I provided you with the code to extract and save content from our requests response. This is the sort of task where it may not be obvious how to proceed. Learning how to use search engines like Google or Kagi to figure this out is critical.
Figure out how to extract the csv file from each of the datasets using Python. Write parallelized code that loops through and extracts all of the data. The end result should be 1 csv file per year. Like in the previous question, measure the time it takes to extract 1 csv, and attempt to estimate how long it should take to extract all of them. Time the extraction and compare your estimate to reality. Were you close? Note that with something like this, where the work is purely computational, parallelization should be much more predictable than, for example, downloading files from a website.
If you request more than 4 cores, please make sure to delete your Jupyter Lab session once your code has run and instead use a session with 4 or fewer cores.
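A minimal sketch of the parallel extraction, assuming the .csv.gz files from question 3 live in $SCRATCH and that one plain .csv per year should be written alongside them:

```python
import gzip
import os
import shutil
import time
from pathlib import Path

from joblib import Parallel, delayed

# Sketch only: decompress each .csv.gz in $SCRATCH to a plain .csv.
SCRATCH = Path(os.environ["SCRATCH"])  # assumes $SCRATCH is set

def extract_year(gz_path: Path) -> None:
    csv_path = gz_path.with_suffix("")  # e.g. 1893.csv.gz -> 1893.csv
    with gzip.open(gz_path, "rb") as src, open(csv_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

start = time.time()
Parallel(n_jobs=4)(delayed(extract_year)(p) for p in sorted(SCRATCH.glob("*.csv.gz")))
print(f"extracted all files in {time.time() - start:.1f} seconds")
```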
- Code used to solve this problem.
- Output from running the code.
Question 5
Unzipped, your datasets total 100 GB! That is a lot of data!
You can read here about what the data means.
- 11 character station ID
- 8 character date in YYYYMMDD format
- 4 character element code (you can see the element codes here in section III)
- value of the data (varies based on the element code)
- 1 character M-flag (10 possible values, see section III here)
- 1 character Q-flag (14 possible values, see section III here)
- 1 character S-flag (30 possible values, see section III here)
- 4 character observation time (HHMM) (0700 = 7:00 AM); may be blank
It has (maybe?) been a snowy week, so use your parallelization skills to figure out something about snowfall. For example, maybe you want to find the last time (or the last year) in which X amount of snow fell. Or maybe you want to find the station ID for the location that has had the most instances of over X amount of snow. Get creative! You may create plots to supplement your work (if you want).
Any good effort will receive full credit.
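For instance, here is a minimal sketch of one possible analysis (the 250 mm threshold, the use of pandas, and the file locations are all assumptions, not requirements): for each year, count each station's daily SNOW observations of at least 250 mm, then report which stations have had the most such days overall.

```python
import os
from pathlib import Path

import pandas as pd
from joblib import Parallel, delayed

# Sketch only: assumes the extracted .csv files from question 4 are in
# $SCRATCH and have the eight columns described above, in order.
SCRATCH = Path(os.environ["SCRATCH"])
COLS = ["id", "date", "element", "value", "mflag", "qflag", "sflag", "obstime"]

def heavy_snow_days(csv_path: Path, threshold_mm: int = 250) -> pd.Series:
    # count, per station, the SNOW observations at or above the threshold
    df = pd.read_csv(csv_path, names=COLS, usecols=["id", "element", "value"])
    snow = df[(df["element"] == "SNOW") & (df["value"] >= threshold_mm)]
    return snow["id"].value_counts()

counts = Parallel(n_jobs=4)(
    delayed(heavy_snow_days)(p) for p in sorted(SCRATCH.glob("*.csv"))
)
total = pd.concat(counts).groupby(level=0).sum().sort_values(ascending=False)
print(total.head())
```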
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted was what you actually submitted. In addition, please review our submission guidelines before submitting your project.