The Coding Mant.is

Smashing Through Code

R programming: pollutantmean – Initial Error Handling (Part 2) — 13-July-2014

Now that we have the script checking for invalid pollutants, let’s work on handling the other two inputs: the IDs and the directory.

Initial Errors: Invalid IDs

I’m going to do this with a slightly different approach than the pollutants. The way I see it, there are a couple of ways that the input could be incorrect. The user could either supply only invalid IDs or invalid IDs could be supplied along with valid IDs. For example, let’s say the user specifies the range 300:400. For our purposes, the range 300:332 is valid, but 333:400 is invalid. I could have the program error out if any invalid IDs are supplied, but instead what I’m going to do is first strip the invalid IDs and then only terminate the program if no valid IDs remain.

To do this I’m going to write another function. Since this function is going to bound the ids provided between 1 and 332, I’m going to name it “boundID”.

R makes this reasonably straightforward. What I’m going to do is create a boolean vector that represents which values in the supplied ID list are both greater than 0 and less than 333. In R, if you supply a boolean vector as the “index” of another vector, it returns only the elements at the positions where the boolean vector is TRUE.
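As a minimal sketch of that idea, here is boolean indexing on a small, made-up ID vector (the values and the names ids, inRange, and validIds are only for illustration and are not part of the assignment):

```r
# Hypothetical ID vector mixing in-range and out-of-range values
ids <- c(-5, 0, 1, 150, 332, 333, 400)

# TRUE only where an ID is both greater than 0 and less than 333
inRange <- ids > 0 & ids < 333

# Indexing with the boolean vector keeps only the TRUE positions
validIds <- ids[inRange]
print(validIds)
```

Only 1, 150, and 332 survive, since every other value fails one of the two comparisons.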

As an aside, it is worth mentioning that you should only do this when the boolean vector and the vector you are subsetting are the same length. If they are not, R will try to “make” them the same size, and the results may not be what you expect. As a quick example, try the following in R:

> boolVector <- c(TRUE, FALSE)
> abcdVector <- c("a", "b", "c", "d")
> abcdefgVector <- c("a", "b", "c", "d", "e", "f", "g")
> abcdVector[boolVector]
[1] "a" "c"
> abcdefgVector[boolVector]
[1] "a" "c" "e" "g"

What happens is that R repeats the elements of boolVector until there are the same number of elements as the vector you are trying to subset. So when used with abcdVector, boolVector is used as [TRUE, FALSE, TRUE, FALSE] and when used with abcdefgVector, boolVector is used as [TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE]. Since boolVector in this case only has two elements, it’s relatively easy to see what will happen depending on whether the vector you are going to subset has an even or odd number of elements. That said, more complicated vectors with hundreds of elements make it harder to predict the results (as would be the case with our data).
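If you want to see the repetition spelled out, base R’s rep_len() can build the stretched vector explicitly. This is just a sketch to illustrate the behavior; the name recycled is my own:

```r
boolVector <- c(TRUE, FALSE)
abcdefgVector <- c("a", "b", "c", "d", "e", "f", "g")

# Build the repeated index by hand instead of letting R do it implicitly
recycled <- rep_len(boolVector, length(abcdefgVector))
print(recycled)

# Subsetting with the explicit vector gives the same result as the short one
print(identical(abcdefgVector[recycled], abcdefgVector[boolVector]))
```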

In this case the boolean vector I am constructing will always be the same length as the id vector since I will construct the bool vector from the id vector using a simple conditional:

boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  } 
  idList[use]
}
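As a quick sanity check, here is the function applied to the 300:400 example from earlier and to a range with no valid IDs at all. The function is repeated so the snippet runs on its own; the names mixed and noneLeft are just for this sketch:

```r
boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  }
  idList[use]
}

# Partly valid range: only 300 through 332 should remain
mixed <- boundID(300:400)
print(range(mixed))

# Entirely invalid range: nothing should remain
noneLeft <- boundID(350:400)
print(length(noneLeft))
```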

So for every element of idList that is in range (i.e. less than 333 and greater than 0), the vector “use” will have the value TRUE and all other entries will have the value FALSE. I also have the function output a message letting the user know when values had to be removed and what the current valid range is. The complete code now looks like:

boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  } 
  idList[use]
}

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

pollutantmean <- function(directory = "~/Development/JHUDataScience/ProgrammingAssignment1/specdata", pollutant, id = 1:332) {
  id <- boundID(id)
  if(length(id) < 1) {
    stop("No valid IDs remain.")
  }

  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }
}

You may also notice the stop condition in the program only executes if the length of the id vector is less than 1. This is because, once the invalid IDs have been removed, the only reason to stop is if no IDs remain (e.g. if the user specifies the range 350:400). Let’s see how this looks with a quick test:

> pollutantmean(id = 350:400)
Some IDs have been removed from this list as they are out of range.
Current range is 1-332.

Error in pollutantmean(id = 350:400) : No valid IDs remain.

Initial Errors: Invalid Directory

I’m actually going to tackle this problem in a somewhat indirect way, since the path to the data can vary between computers. This program is expecting the data directory to be dedicated only to the data files, i.e. 001.csv, 002.csv, and so on. I think this is a safe assumption to keep in the program, so the easiest way I can think of to check that the data directory is correct is to check that the first file in the directory is in fact “001.csv”. To do that I will write another function (this is probably no longer a surprise):

directoryIsNotValid <- function(directory){
  fileList <- list.files(path = directory)
  # identical() gives FALSE (never NA) if the directory is empty or missing
  !identical(fileList[1], "001.csv")
}
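Here is a minimal sketch for trying the directory check without the real data set. It uses throwaway directories under tempdir() and a copy of directoryIsNotValid written with identical(), so that an empty or nonexistent directory counts as invalid rather than producing an NA. The directory names and the empty placeholder files are made up for the demonstration:

```r
directoryIsNotValid <- function(directory){
  fileList <- list.files(path = directory)
  # identical() returns FALSE (never NA) when fileList is empty
  !identical(fileList[1], "001.csv")
}

# Throwaway directories standing in for the real data directory
goodDir <- file.path(tempdir(), "specdataDemo")
emptyDir <- file.path(tempdir(), "emptyDemo")
dir.create(goodDir, showWarnings = FALSE)
dir.create(emptyDir, showWarnings = FALSE)

# Empty placeholder files; only their names matter for this check
file.create(file.path(goodDir, c("001.csv", "002.csv")))

print(directoryIsNotValid(goodDir))   # directory whose first file is 001.csv
print(directoryIsNotValid(emptyDir))  # directory with no files at all
```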

Now the complete code is:

boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  } 
  idList[use]
}

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

directoryIsNotValid <- function(directory){
  fileList <- list.files(path = directory)
  # identical() gives FALSE (never NA) if the directory is empty or missing
  !identical(fileList[1], "001.csv")
}

pollutantmean <- function(directory = "~/Development/JHUDataScience/ProgrammingAssignment1/specdata", pollutant, id = 1:332) {
  id <- boundID(id)
  if(length(id) < 1) {
    stop("No valid IDs remain.")
  }

  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }

  if(directoryIsNotValid(directory)){
    stop("Invalid data directory provided. Please supply the correct path to the data.")
  }
}

In case you are wondering why I broke off such a simple check into a separate function, the reason is so the code is self-documenting. It may not be immediately obvious to someone else reading it why I’m checking that the first element in the fileList is 001.csv, but the check “if directory is not valid” tells the reader exactly why. Let’s test this last bit out and see what happens:

> pollutantmean()
> pollutantmean(directory = "~/")
Error in pollutantmean(directory = "~/") : 
  Invalid data directory provided. Please supply the correct path to the data.

For the first test I didn’t supply any arguments because the default directory is the correct directory for my installation. Since there is nothing for the script to do (yet) when the directory, pollutant, and id are all valid (which they are by default), there is no output, as I expect. When I supply the user directory, I receive an error because the “001.csv” file is not the first element of the files grabbed from that directory.

I think that the preliminaries are all taken care of now! Next I’ll start addressing the purpose of the original prompt: returning the mean of the pollutant data for one or more IDs (files).

R programming: pollutantmean – Initial Error Handling (Part 1) — 12-July-2014

Initial Analysis: Error Conditions

The program will need to use the directory to locate the files, the id to identify the files (“centers”) of interest, and then the pollutant to determine which mean value is desired.

In this case, I want to consider some of the error behaviors first. This isn’t necessarily always the way to go, but right off the top of my head I can think of three events that I know should stop the program before it even starts:

  • What will your program do if you provide an invalid pollutant type?
  • What will your program do if you provide one or more invalid ids?
  • What will your program do if you provide an invalid directory?

Initial Errors: Invalid Pollutant

Let’s dig into the first one. First, what is an “invalid pollutant type”? Well, the prompt states that there are two types: “sulfate” and “nitrate”. If you enter anything else, even something of a differing case, e.g. “SULFATE” or “SuLFaTE,” you will receive an error since the columns in the files are named in lowercase (specifically the error is: “undefined columns selected”).

So it’s probably worthwhile to have pollutants like “lead” rejected, but we probably want to keep ones like “Sulfate”. To do this, we can force the pollutant to lowercase and check if it matches either “sulfate” or “nitrate”.

The first part, changing the pollutant to lowercase, is reasonably easy since R has a built-in “tolower()” function. This will convert all alphabetic characters to lowercase, e.g. tolower("A1") returns "a1":

pollutantmean <- function(directory, pollutant, id = 1:332) {
  pollutant <- tolower(pollutant)
}
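To see what tolower() does with the kinds of input discussed above, here is a small sketch (the vector names inputs and lowered are just for illustration):

```r
# Mixed-case spellings plus a value containing a digit
inputs <- c("SULFATE", "SuLFaTE", "nitrate", "A1")

# tolower() is vectorized: each element is converted separately
lowered <- tolower(inputs)
print(lowered)
```

Digits and characters that are already lowercase pass through unchanged.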

Now to check the updated (lowercase) pollutant, I want to write another function, specifically to keep the code as readable as possible. What I am going to do is use an “if” statement to check if the pollutant is valid or not. If it is, then the program will continue. If it is not, then the program will stop.

pollutantmean <- function(directory, pollutant, id = 1:332) {
  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    ## Some code to terminate the program will go here.
  }
}

Notice that I have already chosen the name of my new function and that the function name both gives the reader an idea of what it does and “reads nicely” in combination with the if statement. My goal here is to try to choose names that are self-descriptive, so for example here the reader can immediately understand what is going on without hunting down the new function.

Many languages have the ability to search a vector/list/etc. and see if it contains a certain element. So I intend to write a function that has a vector with the values “sulfate” and “nitrate” and then see if the pollutant the user provided matches either of those items.

The function that R has available to do this is “match()”. Match() returns either the position of the value in the vector or NA for a value that is not in the vector. In this case, the value I need must be both an integer greater than or equal to 1 and not an NA:

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

pollutantmean <- function(directory, pollutant, id = 1:332) {
  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    ## Some code to terminate the program will go here.
  }
}
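To make the behavior of match() concrete, here is a short sketch of what it returns for a known and an unknown pollutant. It also shows base R’s %in% operator, which is built on match() and returns TRUE or FALSE directly. I am not using %in% in this walkthrough; it is shown only as a possible alternative, and the variable names are my own:

```r
possiblePollutants <- c("sulfate", "nitrate")

# A value that is in the vector: match() returns its position
found <- match("nitrate", possiblePollutants)
print(found)

# A value that is not in the vector: match() returns NA
missing <- match("lead", possiblePollutants)
print(missing)

# %in% collapses the same lookup into a plain TRUE/FALSE
print("nitrate" %in% possiblePollutants)
print("lead" %in% possiblePollutants)
```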

To terminate the script in the IF statement, I am going to use R’s “stop()” function. Stop() will stop the script and will print the error message that is provided as an argument. In this case, all I need to do is tell the user they’ve supplied an invalid pollutant:

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

pollutantmean <- function(directory, pollutant, id = 1:332) {
  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }
}
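If you want to look at the error from stop() without halting your session, one option is to wrap the call in tryCatch(). The sketch below uses a hypothetical helper, checkPollutant, which is not part of the assignment code; it only mimics the check so the snippet runs on its own:

```r
# Hypothetical stand-in for the validation step inside pollutantmean
checkPollutant <- function(pollutant){
  if(!(tolower(pollutant) %in% c("sulfate", "nitrate"))){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }
  "ok"
}

# tryCatch() intercepts the error and hands back its message as a string
result <- tryCatch(checkPollutant("lead"),
                   error = function(e) conditionMessage(e))
print(result)

# A valid pollutant, in any case, passes straight through
print(checkPollutant("Sulfate"))
```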

There is one final change I’m going to make before I show some sample output below. Rather than enter the directory (path to the data) every time I test the script, I’m going to set the default value to the directory where I am keeping the data on my machine. This way I can just tweak the pollutant and id values for now:

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

pollutantmean <- function(directory = "~/Development/JHUDataScience/ProgrammingAssignment1/specdata", pollutant, id = 1:332) {
  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }
}

To execute the code, I am going to change the working directory, source the file, and use a couple of test values:

> setwd("~/Development/JHUDataScience/ProgrammingAssignment1")
> source("pollutantmean.R")
> pollutantmean(pollutant = "nitrate")
> pollutantmean(pollutant = "NItrate")
> pollutantmean(pollutant = "nitrte")
Error in pollutantmean(pollutant = "nitrte") : 
  Invalid pollutant provided. Valid pollutants are sulfate and nitrate.

There was no output for “nitrate” or “NItrate” since those are both valid and the program doesn’t have anything to execute for those yet. The program did successfully stop with the correct error message when the incorrect pollutant value was supplied. So far, so good!

R programming: pollutantmean – Setup and prompt — 8-July-2014

As you may or may not know, I have been working through the JHU Data Science Signature track to learn R Programming. Right now I don’t really have a professional reason to know R; I just felt that it would be a fun thing to learn due to my interest in science in general as well as my recent work at a company that handles data from clinical data trials (think treatments for cancer, AIDS, etc.). I also have an interest in astronomy, astrophysics, and machine learning, and perhaps I can find a savvy way to link everything together once I have all the information pinging around in my brain.

Now that I’ve worked through the first two classes, I’m going to review two of the assignments. It’s worth noting that I’ve waited until after the course had ended before posting this article. This is mostly because I wanted to finish learning all the material myself to explain my thoughts better. To that end, I am writing two blog posts for this class, choosing one script from assignment 1 and one script from assignment 3 (both from the second class).

Disclaimer: Please do not use the information here to cheat, only to learn. Yes, I am using the exact problems and working them through such that they should actually work if you cheat. The only reason I am not going to edit the scripts in any way is that enough people have posted their solutions elsewhere on the web that I feel this post is a “drop in the bucket” insofar as that is concerned.

The prompt

The prompt to this homework assignment was the following:

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA. A prototype of the function is as follows:

pollutantmean <- function(directory, pollutant, id = 1:332) {
  ## 'directory' is a character vector of length 1 indicating
  ## the location of the CSV files

  ## 'pollutant' is a character vector of length 1 indicating
  ## the name of the pollutant for which we will calculate the
  ## mean; either "sulfate" or "nitrate".

  ## 'id' is an integer vector indicating the monitor ID numbers
  ## to be used

  ## Return the mean of the pollutant across all monitors list
  ## in the 'id' vector (ignoring NA values)
}

The file needs to be saved as “pollutantmean.R” (this is only so the submission script will recognize the file).

What you need

Assuming you already have R and RStudio, all you will need at this point is the data (a few hundred small CSV files). For that, I recommend going ahead and getting the data directly from the course website on Coursera. (Note: You can join the course for free to do this; you only need to pay if you would like a certificate after completing all the coursework.)
