Now that we have the script checking for invalid pollutants, let’s work on handling the other two inputs: the IDs and the directory.

Initial Errors: Invalid IDs

I’m going to do this with a slightly different approach than the pollutants. The way I see it, there are a couple of ways that the input could be incorrect. The user could supply only invalid IDs, or could supply invalid IDs along with valid ones. For example, let’s say the user specifies the range 300:400. For our purposes, the range 300:332 is valid, but 333:400 is invalid. I could have the program error out if any invalid IDs are supplied, but instead what I’m going to do is first strip the invalid IDs and then only terminate the program if no valid IDs remain.

To do this I’m going to write another function. Since this function is going to bound the IDs provided between 1 and 332, I’m going to name it “boundID”.

R makes this reasonably straightforward. What I’m going to do is create a boolean vector that represents which values in the supplied ID list are both greater than 0 and less than 333. In R, if you supply a boolean vector as the “index” of another vector, it will only return the elements at the positions that are TRUE.
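Here is a minimal sketch of that idea on its own, before it gets wrapped in a function. The vector name ids and its values are made up purely for illustration:

```r
# Hypothetical ID list containing both in-range and out-of-range values.
ids <- c(-5, 0, 1, 150, 332, 333, 400)

# Logical vector: TRUE where the ID is greater than 0 AND less than 333.
inRange <- ids > 0 & ids < 333

# Using the logical vector as the index keeps only the TRUE positions,
# so only 1, 150 and 332 survive.
validIds <- ids[inRange]
print(validIds)
```

Note that both conditions have to hold at once, which is why they are joined with & rather than |.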

As an aside, it is worth mentioning that you should only do this when the boolean vector and the vector you are subsetting are the same length. If they are not, R will try to “make” them the same size, and the results may not be what you expect. As a quick example, try the following in R:

> boolVector <- c(TRUE, FALSE)
> abcdVector <- c("a", "b", "c", "d")
> abcdefgVector <- c("a", "b", "c", "d", "e", "f", "g")
> abcdVector[boolVector]
[1] "a" "c"
> abcdefgVector[boolVector]
[1] "a" "c" "e" "g"

What happens is that R repeats the elements of boolVector until there are the same number of elements as the vector you are trying to subset. So when used with abcdVector, boolVector is used as [TRUE, FALSE, TRUE, FALSE] and when used with abcdefgVector, boolVector is used as [TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE]. Since boolVector in this case only has two elements, it’s relatively easy to see what will happen depending on whether the vector you are going to subset has an even or odd number of elements. That said, more complicated vectors with hundreds of elements make the results harder to predict (as would be the case with our data).
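If you want R to complain instead of quietly recycling, one option is to check the lengths yourself before subsetting. The helper below, subsetStrict, is a hypothetical name of my own and is not part of the assignment script. Treat it as a sketch of the idea, not something the rest of this series relies on:

```r
# Hypothetical helper (not part of the original script): subset x with a
# logical mask, but refuse to run if R would have to recycle the mask.
subsetStrict <- function(x, mask){
  stopifnot(is.logical(mask), length(mask) == length(x))
  x[mask]
}

abcdVector <- c("a", "b", "c", "d")

# Same length as abcdVector, so this works like ordinary subsetting.
kept <- subsetStrict(abcdVector, c(TRUE, FALSE, TRUE, FALSE))
print(kept)

# A two-element mask would be silently recycled by abcdVector[mask];
# here the length check raises an error, which tryCatch turns into a label.
result <- tryCatch(subsetStrict(abcdVector, c(TRUE, FALSE)),
                   error = function(e) "length mismatch")
print(result)
```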

In this case the boolean vector I am constructing will always be the same length as the id vector since I will construct the bool vector from the id vector using a simple conditional:

boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  } 
  idList[use]
}

So for every element of idList that is in range (i.e. less than 333 and greater than 0), the vector “use” will have the value TRUE and all other entries will have the value FALSE. I also have the function output a message that tells the user when values had to be removed and what the current valid range is. The complete code now looks like:

boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  } 
  idList[use]
}

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

pollutantmean <- function(directory = "~/Development/JHUDataScience/ProgrammingAssignment1/specdata", pollutant, id = 1:332) {
  id <- boundID(id)
  if(length(id) < 1) {
    stop("No valid IDs remain.")
  }

  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }
}

You may also notice that the stop condition in the program only executes if the length of the id vector is less than 1. This is because the only reason to stop, once the invalid IDs have been removed, is if there are no remaining IDs (e.g. if the user specifies the range 350:400). Let’s see how this looks with a quick test:

> pollutantmean(id = 350:400)
Some IDs have been removed from this list as they are out of range.
Current range is 1-332.

Error in pollutantmean(id = 350:400) : No valid IDs remain.
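The mixed case from earlier (300:400) can be checked by calling boundID directly. This is just a quick sketch that repeats the function so it runs on its own. The variable names mixed and none are mine:

```r
# boundID as defined in this post.
boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  }
  idList[use]
}

# Mixed input: 300:332 is valid, 333:400 is not.
mixed <- boundID(300:400)
print(range(mixed))    # smallest and largest surviving ID
print(length(mixed))   # number of surviving IDs

# Entirely out-of-range input leaves an empty vector, which is what
# triggers the stop() in pollutantmean.
none <- boundID(350:400)
print(length(none))
```

In the mixed case the message is printed but the call still returns the valid part of the range, so pollutantmean would carry on with IDs 300 through 332.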

Initial Errors: Invalid Directory

I’m actually going to tackle this problem in a somewhat indirect way, since the path to the data can vary between computers. This program is expecting the data directory to be dedicated only to the data files, i.e. 001.csv, 002.csv, and so on. I think this is a safe assumption to keep in the program, so the easiest way I can think of to check that the data directory is correct is to check that the first file in the directory is in fact “001.csv”. To do that I will write another function (this is probably no longer a surprise):

directoryIsNotValid <- function(directory){
  fileList <- list.files(path = directory)
  # identical() yields FALSE (not NA) when the directory is empty or missing
  !identical(fileList[1], "001.csv")
}
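
To try the check without touching the real data, I can build a couple of throwaway directories under tempdir(). The directory names below are hypothetical, and the function is written with identical() so that an empty directory gives a plain TRUE instead of NA. This is a sketch for experimenting, not part of the assignment:

```r
# Variant of the check written with identical(), which returns a plain
# TRUE/FALSE even when the directory is empty (fileList[1] is then NA).
directoryIsNotValid <- function(directory){
  fileList <- list.files(path = directory)
  !identical(fileList[1], "001.csv")
}

# Hypothetical throwaway directories built under tempdir() for the test.
goodDir <- file.path(tempdir(), "specdata_good")
badDir  <- file.path(tempdir(), "specdata_bad")
dir.create(goodDir, showWarnings = FALSE)
dir.create(badDir, showWarnings = FALSE)
invisible(file.create(file.path(goodDir, c("001.csv", "002.csv"))))
invisible(file.create(file.path(badDir, "notes.txt")))

goodResult <- directoryIsNotValid(goodDir)   # first file is 001.csv
badResult  <- directoryIsNotValid(badDir)    # first file is notes.txt
print(c(goodResult, badResult))

# Clean up the temporary directories.
unlink(c(goodDir, badDir), recursive = TRUE)
```

This works because list.files returns the file names sorted alphabetically, so in a directory that holds only the data files, “001.csv” is always the first element.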

Now the complete code is:

boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  } 
  idList[use]
}

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

directoryIsNotValid <- function(directory){
  fileList <- list.files(path = directory)
  # identical() yields FALSE (not NA) when the directory is empty or missing
  !identical(fileList[1], "001.csv")
}

pollutantmean <- function(directory = "~/Development/JHUDataScience/ProgrammingAssignment1/specdata", pollutant, id = 1:332) {
  id <- boundID(id)
  if(length(id) < 1) {
    stop("No valid IDs remain.")
  }

  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }

  if(directoryIsNotValid(directory)){
    stop("Invalid data directory provided. Please supply the correct path to the data.")
  }
}

Just to hammer this point home, in case you are wondering why I broke off such a simple check into a separate function: the reason is so the code is self-documenting. It may not be immediately obvious to someone else reading it why I’m checking that the first element in the fileList is 001.csv, but the check “if directory is not valid” tells the reader exactly why. Let’s test this last bit out and see what happens:

> pollutantmean()
> pollutantmean(directory = "~/")
Error in pollutantmean(directory = "~/") : 
  Invalid data directory provided. Please supply the correct path to the data.

For the first test I didn’t supply any arguments because the default directory is the correct directory for my installation. Since there is nothing for the script to do (yet) when the directory, pollutant, and id are all valid (which they are by default), there is no output, as expected. When I supply the user directory, I receive an error because “001.csv” is not the first element of the files grabbed from that directory.
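
If you would rather not eyeball the console, the three error paths can also be checked in one go by capturing the message from each stop(). The sketch below makes a few assumptions: errorMessage and wrongDir are hypothetical names of my own, the machine-specific default directory is left out of the signature, and directoryIsNotValid is written with identical() so it always returns TRUE or FALSE:

```r
# Definitions follow the listing above; the default directory is omitted
# here because that path only exists on my machine.
boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  }
  idList[use]
}

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

directoryIsNotValid <- function(directory){
  fileList <- list.files(path = directory)
  !identical(fileList[1], "001.csv")
}

pollutantmean <- function(directory, pollutant, id = 1:332) {
  id <- boundID(id)
  if(length(id) < 1) {
    stop("No valid IDs remain.")
  }
  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }
  if(directoryIsNotValid(directory)){
    stop("Invalid data directory provided. Please supply the correct path to the data.")
  }
}

# Hypothetical helper: run an expression and return its error message,
# or NA if it finished without an error.
errorMessage <- function(expr){
  tryCatch({ expr; NA_character_ }, error = function(e) conditionMessage(e))
}

# Hypothetical directory that holds no 001.csv.
wrongDir <- file.path(tempdir(), "not_specdata")
dir.create(wrongDir, showWarnings = FALSE)

idMsg        <- errorMessage(pollutantmean(wrongDir, "sulfate", id = 350:400))
pollutantMsg <- errorMessage(pollutantmean(wrongDir, "ozone"))
directoryMsg <- errorMessage(pollutantmean(wrongDir, "sulfate"))
print(c(idMsg, pollutantMsg, directoryMsg))
```

Because the checks run in order (IDs, then pollutant, then directory), each call stops at the first problem it hits, so each message comes from a different check.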

I think that the preliminaries are all taken care of now! Next I’ll start addressing the purpose of the original prompt: returning the mean of the pollutant data for one or more IDs (files).
