As you may or may not know, I have been working through the JHU Data Science Signature Track to learn R programming. Right now I don’t have a professional reason to know R; I just felt it would be fun to learn, given my general interest in science and my recent work at a company that handles data from clinical trials (think treatments for cancer, AIDS, etc.). I also have an interest in astronomy, astrophysics, and machine learning, and perhaps I can find a savvy way to link everything together once I have all the information pinging around in my brain.
Now that I’ve worked through the first two classes, I’m going to review two of the assignments. It’s worth noting that I waited until the course had ended before posting this article, mostly because I wanted to finish learning all the material myself so that I could explain my thoughts better. To that end, I am writing two blog posts for this class, choosing one script from assignment 1 and one script from assignment 3 (both from the second class).
Disclaimer: Please use the information here to learn, not to cheat. Yes, I am using the exact problems and working them through, so the scripts should actually work if you copy them. The only reason I am not altering the scripts in any way is that enough people have posted their solutions elsewhere on the web that I feel this post is a “drop in the bucket” in that regard.
The prompt
The prompt to this homework assignment was the following:
Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA. A prototype of the function is as follows:
pollutantmean <- function(directory, pollutant, id = 1:332) {
    ## 'directory' is a character vector of length 1 indicating
    ## the location of the CSV files

    ## 'pollutant' is a character vector of length 1 indicating
    ## the name of the pollutant for which we will calculate the
    ## mean; either "sulfate" or "nitrate".

    ## 'id' is an integer vector indicating the monitor ID numbers
    ## to be used

    ## Return the mean of the pollutant across all monitors list
    ## in the 'id' vector (ignoring NA values)
}
The file needs to be saved as “pollutantmean.R” (this is only so the submission script will recognize the file).
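To make the prompt concrete, here is a minimal sketch of one way the prototype could be filled in. It is not the course’s reference solution, and it rests on two assumptions that the prompt does not state: that each monitor’s file is named with a zero-padded ID (such as “001.csv”), and that each file has a column named after each pollutant.

```r
# Hypothetical sketch of one possible approach, not a reference solution.
# ASSUMPTION: monitor files are named "001.csv", "002.csv", ... and each
# contains columns named "sulfate" and "nitrate".
pollutantmean <- function(directory, pollutant, id = 1:332) {
    # Build the path to each requested monitor's file
    files <- file.path(directory, sprintf("%03d.csv", id))

    # Read every file and keep only the requested pollutant column
    values <- lapply(files, function(f) read.csv(f)[[pollutant]])

    # Pool the readings from all monitors, then average, dropping NAs
    mean(unlist(values), na.rm = TRUE)
}
```

Note that the readings are pooled before averaging, so a monitor with more observations carries more weight than one with fewer. That is how I read “the mean of the pollutant across all of the monitors”; a mean of per-monitor means would give a different number. A call would look something like pollutantmean("specdata", "sulfate", 1:10), where “specdata” is a placeholder for wherever you unpacked the data.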
What you need
Assuming you already have R and RStudio, all you need at this point is the data (a few hundred small CSV files). I recommend getting the data directly from the course website on Coursera. (Note: You can join the course for free to do this; you only need to pay if you would like a certificate after completing all the coursework.)
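Once the data is downloaded, a quick sanity check can confirm that R sees the files before you write any real code. The helper below is only a sketch: the name check_data is my own, and the directory you pass in is whatever folder you unpacked the CSV files into.

```r
# Hypothetical helper: count the CSV files in a directory and peek at the first.
check_data <- function(directory) {
    # List every file ending in ".csv", with its full path
    csv_files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
    cat("Found", length(csv_files), "CSV files in", directory, "\n")

    # Show the first few rows of the first file, if there is one
    if (length(csv_files) > 0) {
        print(head(read.csv(csv_files[1]), 3))
    }

    # Return the paths invisibly so they can be reused
    invisible(csv_files)
}
```

If the count comes back as zero, the most likely culprit is the working directory; getwd() will show where R is currently looking.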
