2014-07-13T01:08:36-04:00

R programming: pollutantmean – Initial Error Handling (Part 2)

13-July-2014/qofthings/5 Comments

Now that we have the script checking for invalid pollutants, let’s work on handling the other two inputs: the IDs and the directory.

Initial Errors: Invalid IDs

I’m going to do this with a slightly different approach than the pollutants. The way I see it, there are a couple of ways that the input could be incorrect. The user could either supply only invalid IDs or invalid IDs could be supplied along with valid IDs. For example, let’s say the user specifies the range 300:400. For our purposes, the range 300:332 is valid, but 333:400 is invalid. I could have the program error out if any invalid IDs are supplied, but instead what I’m going to do is first strip the invalid IDs and then only terminate the program if no valid IDs remain.

To do this I’m going to write another function. Since this function is going to bound the ids provided between 1 and 332, I’m going to name it “boundID”.

R makes this reasonably straightforward. What I’m going to do is create a boolean vector that represents which values in the supplied ID list are both greater than 0 and less than 333. In R, if you supply a boolean vector as the “index” of another vector, it will return only the elements whose position is TRUE.
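
To make the idea concrete, here is a minimal sketch of logical indexing on its own. The vector ids and its values are made up for illustration; they are not taken from the assignment data.

```r
# Made-up IDs for illustration: some in range, some out of range.
ids <- c(-2, 0, 1, 150, 332, 333, 400)

# TRUE where an ID is both greater than 0 and less than 333.
inRange <- ids > 0 & ids < 333
inRange        # FALSE FALSE TRUE TRUE TRUE FALSE FALSE

# Using the boolean vector as the index keeps only the TRUE positions.
kept <- ids[inRange]
kept           # 1 150 332
```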

As an aside, it is worth mentioning that you should only do this when the boolean vector and the vector you are subsetting are the same length. If they are not, R will try to “make” them the same size, and the results may not be what you expect. As a quick example, try the following in R:

> boolVector <- c(TRUE, FALSE)
> abcdVector <- c("a", "b", "c", "d")
> abcdefgVector <- c("a", "b", "c", "d", "e", "f", "g")
> abcdVector[boolVector]
[1] "a" "c"
> abcdefgVector[boolVector]
[1] "a" "c" "e" "g"

What happens is that R repeats the elements of boolVector until there are the same number of elements as in the vector you are trying to subset. So when used with abcdVector, boolVector is used as [TRUE, FALSE, TRUE, FALSE] and when used with abcdefgVector, boolVector is used as [TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE]. Since boolVector in this case only has two elements, it’s relatively easy to see what will happen depending on whether the vector you are going to subset has an even or odd number of elements. That said, more complicated vectors with hundreds of elements make it harder to predict the results (as would be the case with our data).
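
If you want to see the recycling spelled out, base R’s rep_len() can build the stretched vector explicitly. This is only a sketch to illustrate the behaviour described above; the name recycled is my own.

```r
boolVector <- c(TRUE, FALSE)
abcdefgVector <- c("a", "b", "c", "d", "e", "f", "g")

# Repeat boolVector until it is as long as the vector being subset,
# which is what R does implicitly during logical indexing.
recycled <- rep_len(boolVector, length(abcdefgVector))
recycled                   # TRUE FALSE TRUE FALSE TRUE FALSE TRUE

# Indexing with the short vector and with the explicit one gives the same result.
implicitResult <- abcdefgVector[boolVector]
explicitResult <- abcdefgVector[recycled]
identical(implicitResult, explicitResult)   # TRUE
```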

In this case the boolean vector I am constructing will always be the same length as the id vector since I will construct the bool vector from the id vector using a simple conditional:

boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  } 
  idList[use]
}

So for every element of idList that is in range (i.e. less than 333 and greater than 0), the vector “use” will have the value TRUE and all other entries will have the value FALSE. I also have the function output a message letting the user know when values had to be removed (and what the current valid range is). The complete code now looks like:

boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  } 
  idList[use]
}

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

pollutantmean <- function(directory = "~/Development/JHUDataScience/ProgrammingAssignment1/specdata", pollutant, id = 1:332) {
  id <- boundID(id)
  if(length(id) < 1) {
    stop("No valid IDs remain.")
  }

  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }
}

You also may notice the stop condition in the program only executes if the length of the id vector is less than 1. This is because the only reason to stop, once the invalid IDs have been removed, is if there are no remaining IDs. (e.g. If the user specifies the range 350:400.) Let’s see how this looks with a quick test:

> pollutantmean(id = 350:400)
Some IDs have been removed from this list as they are out of range.
Current range is 1-332.

Error in pollutantmean(id = 350:400) : No valid IDs remain.
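
The mixed case from the start of this section (the range 300:400) can be checked by calling boundID directly. The function is repeated here only so the snippet runs on its own; the variable name kept is just for illustration.

```r
boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  }
  idList[use]
}

# 300:332 is valid and 333:400 is not, so the message is printed
# and only the valid part of the range comes back.
kept <- boundID(300:400)
range(kept)     # 300 332
length(kept)    # 33
```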

Initial Errors: Invalid Directory

I’m actually going to tackle this problem in a somewhat indirect way, since the path to the data can vary between computers. This program is expecting the data directory to be dedicated only to the data files, i.e. 001.csv, 002.csv, and so on. I think this is a safe assumption to keep in the program, so the easiest way I can think of to check that the data directory is correct is to check that the first file in the directory is in fact “001.csv”. To do that I will write another function (this is probably no longer a surprise):

directoryIsNotValid <- function(directory){
  fileList <- list.files(path = directory)
  !(fileList[1] == "001.csv")
}
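
One edge case worth knowing about: if the directory is empty or does not exist, list.files() returns an empty vector, fileList[1] is NA, and the comparison yields NA, which an if statement cannot evaluate. Below is a hedged sketch of a variant that treats that case as “not valid”. The name directoryIsNotValidSafe and the temporary directories are my own assumptions for illustration, not part of the original script.

```r
# Hypothetical variant of directoryIsNotValid(); not the original function.
directoryIsNotValidSafe <- function(directory){
  fileList <- list.files(path = directory)
  # isTRUE() turns the NA produced by an empty file list into FALSE,
  # so the function always returns a single TRUE or FALSE.
  !isTRUE(fileList[1] == "001.csv")
}

# Throwaway directories: one empty, one that looks like the data directory.
emptyDir <- tempfile("empty")
dir.create(emptyDir)
dataDir <- tempfile("specdata")
dir.create(dataDir)
invisible(file.create(file.path(dataDir, c("001.csv", "002.csv"))))

emptyResult <- directoryIsNotValidSafe(emptyDir)   # TRUE
dataResult <- directoryIsNotValidSafe(dataDir)     # FALSE
```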

Now the complete code is:

boundID <- function(idList){
  use <- idList < 333 & idList > 0
  if (length(idList[use]) != length(idList)){
    cat("Some IDs have been removed from this list as they are out of range.\nCurrent range is 1-332.\n\n")
  } 
  idList[use]
}

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

directoryIsNotValid <- function(directory){
  fileList <- list.files(path = directory)
  !(fileList[1] == "001.csv")
}

pollutantmean <- function(directory = "~/Development/JHUDataScience/ProgrammingAssignment1/specdata", pollutant, id = 1:332) {
  id <- boundID(id)
  if(length(id) < 1) {
    stop("No valid IDs remain.")
  }

  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }

  if(directoryIsNotValid(directory)){
    stop("Invalid data directory provided. Please supply the correct path to the data.")
  }
}

Just to hammer this point home, in case you are wondering why I broke off such a simple check into a separate function: the reason is so the code is self-documenting. It may not be immediately obvious to someone else reading it why I’m checking that the first element in the fileList is 001.csv, but the check “if directory is not valid” tells the reader exactly why. Let’s test this last bit out and see what happens:

> pollutantmean()
> pollutantmean(directory = "~/")
Error in pollutantmean(directory = "~/") : 
  Invalid data directory provided. Please supply the correct path to the data.

For the first test I didn’t supply any arguments because the default directory is the correct directory for my installation. Since there is nothing for the script to do (yet) when the directory, pollutant, and id are all valid (which they are by default), there is no output, as I expect. When I supply the user directory, I receive an error because “001.csv” is not the first element of the files listed from that directory.

I think that the preliminaries are all taken care of now! Next I’ll start addressing the purpose of the original prompt: returning the mean of the pollutant data for one or more IDs (files).


R programming: pollutantmean – Initial Error Handling (Part 1)

12-July-2014 (updated 13-July-2014)/qofthings/Leave a comment

Initial Analysis: Error Conditions

The program will need to use the directory to locate the files, the id to identify the files (“centers”) of interest, and then the pollutant to determine which mean value is desired.

In this case, I want to consider some of the error behaviors first. This isn’t necessarily always the way to go, but right off the top of my head I can think of three events that I know should stop the program before it even starts:

What will your program do if you provide an invalid pollutant type?
What will your program do if you provide one or more invalid ids?
What will your program do if you provide an invalid directory?

Initial Errors: Invalid Pollutant

Let’s dig into the first one. First, what is an “invalid pollutant type”? Well, the prompt states that there are two types: “sulfate” and “nitrate”. If you enter anything else, even something that differs only in case, e.g. “SULFATE” or “SuLFaTE”, you will receive an error since the columns in the files are named in lowercase (specifically the error is: “undefined columns selected”).

So it’s probably worthwhile to have pollutants like “lead” rejected, but we probably want to keep ones like “Sulfate”. To do this, we can force the pollutant to lowercase and check if it matches either “sulfate” or “nitrate”.

The first part, changing the pollutant to lowercase, is reasonably easy since R has a built-in “tolower()” function. This will convert all alphabetic characters to lowercase, e.g. tolower(“A1”) becomes “a1”:

pollutantmean <- function(directory, pollutant, id = 1:332) {
  pollutant <- tolower(pollutant)
}
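
As a quick check of tolower() by itself, here is a small sketch. The input strings are just examples, and the names inputs and normalized are mine.

```r
# Mixed-case spellings a user might type, plus the "A1" example from above.
inputs <- c("SULFATE", "SuLFaTE", "Nitrate", "A1")

# tolower() is vectorised: every element is converted, and digits are left alone.
normalized <- tolower(inputs)
normalized   # "sulfate" "sulfate" "nitrate" "a1"
```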

Now to check the updated (lowercase) pollutant, I want to write another function. Specifically, I want to write another function to keep the code as readable as possible. What I am going to do is use an “if” statement to check if the pollutant is valid or not. If it is, then the program will continue. If it is not, then the program will stop.

pollutantmean <- function(directory, pollutant, id = 1:332) {
  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    ## Some code to terminate the program will go here.
  }
}

Notice that I have already chosen the name of my new function and that the function name both gives the reader an idea of what it does and “reads nicely” in combination with the if statement. My goal here is to try and choose names that are self-descriptive, so for example here the reader can immediately understand what is going on without hunting down the new function.

Many languages have the ability to search a vector/list/etc. and see if it contains a certain element. So I intend to write a function that has a vector with the values “sulfate” and “nitrate” and then see if the pollutant the user provided matches either of those items.

The function that R has available to do this is “match()”. match() returns either the position of the value in the vector or NA for a value that is not in the vector. In this case, the value I need must be both an integer greater than or equal to 1 and not an NA:

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

pollutantmean <- function(directory, pollutant, id = 1:332) {
  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    ## Some code to terminate the program will go here.
  }
}
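
To see what match() actually hands back, here is a minimal sketch. It also shows base R’s %in% operator, which wraps match() and returns TRUE/FALSE directly; the function pollutantIsNotValidAlt is a hypothetical alternative for comparison, not the version used in this post.

```r
possiblePollutants <- c("sulfate", "nitrate")

foundPosition <- match("nitrate", possiblePollutants)
foundPosition     # 2, the position of "nitrate" in the vector

missingPosition <- match("lead", possiblePollutants)
missingPosition   # NA, because "lead" is not in the vector

# Hypothetical alternative: %in% never returns NA, so no is.na() check is needed.
pollutantIsNotValidAlt <- function(pollutantToCheck){
  !(pollutantToCheck %in% possiblePollutants)
}

pollutantIsNotValidAlt("lead")      # TRUE
pollutantIsNotValidAlt("sulfate")   # FALSE
```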

To terminate the script in the if statement, I am going to use R’s “stop()” function. stop() will halt the script and print the error message that is provided as an argument. In this case, all I need to do is tell the user they’ve supplied an invalid pollutant:

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

pollutantmean <- function(directory, pollutant, id = 1:332) {
  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }
}
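
If you want to look at the error without it halting your session, tryCatch() can capture it. This is a sketch only: checkPollutant is a hypothetical stand-in that I am using to keep the snippet short, not a function from the script above.

```r
# Hypothetical helper that stops with the same message as the script.
checkPollutant <- function(pollutant){
  if (!(tolower(pollutant) %in% c("sulfate", "nitrate"))){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }
  invisible(TRUE)
}

# tryCatch() intercepts the error raised by stop(); conditionMessage()
# extracts the text that was passed to stop().
caught <- tryCatch(
  checkPollutant("lead"),
  error = function(e) conditionMessage(e)
)
caught
```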

There is one final change I’m going to make before I show some sample output below. Rather than enter the directory (path to the data) every time I test the script, I’m going to set the default value to the directory where I am keeping the data on my machine. This way I can just tweak the pollutant and id values for now:

pollutantIsNotValid <- function(pollutantToCheck){
  possiblePollutants <- c("sulfate", "nitrate")
  !(match(pollutantToCheck, possiblePollutants) > 0 & !is.na(match(pollutantToCheck, possiblePollutants)))
}

pollutantmean <- function(directory = "~/Development/JHUDataScience/ProgrammingAssignment1/specdata", pollutant, id = 1:332) {
  pollutant <- tolower(pollutant)
  if(pollutantIsNotValid(pollutant)){
    stop("Invalid pollutant provided. Valid pollutants are sulfate and nitrate.")
  }
}

To execute the code, I am going to change the working directory, source the file, and use a couple of test values:

> setwd("~/Development/JHUDataScience/ProgrammingAssignment1")
> source("pollutantmean.R")
> pollutantmean(pollutant = "nitrate")
> pollutantmean(pollutant = "NItrate")
> pollutantmean(pollutant = "nitrte")
Error in pollutantmean(pollutant = "nitrte") : 
  Invalid pollutant provided. Valid pollutants are sulfate and nitrate.

There was no output for “nitrate” or “NItrate” since those are both valid and the program doesn’t have anything to execute for those yet. The program did successfully stop with the correct error message when the incorrect pollutant value was supplied. So far, so good!


R programming: pollutantmean – Setup and prompt

8-July-2014 (updated 12-July-2014)/qofthings/Leave a comment

As you may or may not know, I have been working through the JHU Data Science Signature track to learn R Programming. Right now I don’t really have a professional reason to know R; I just felt that it would be a fun thing to learn due to my interest in science in general as well as my recent work at a company that handles data from clinical data trials (think treatments for cancer, AIDS, etc.). I also have an interest in astronomy, astrophysics, and machine learning and perhaps I can find a savvy way to link everything together once I have all the information pinging around in my brain.

Now that I’ve worked through the first two classes, I’m going to review two of the assignments. It’s worth noting that I’ve waited until after the course had ended before posting this article. This is mostly because I wanted to finish learning all the material myself to explain my thoughts better. To that end, I am writing two blog posts for this class, choosing one script from assignment 1 and one script from assignment 3 (both from the second class).

Disclaimer: Please do not use the information here to cheat, only to learn. Yes, I am using the exact problems and working them through such that they should actually work if you cheat. The only reason I am not going to edit the scripts in any way is that enough people have posted their solutions elsewhere on the web that I feel this post is a “drop in the bucket” insofar as that is concerned.

The prompt

The prompt to this homework assignment was the following:

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA. A prototype of the function is as follows:

pollutantmean <- function(directory, pollutant, id = 1:332) {
  ## 'directory' is a character vector of length 1 indicating
  ## the location of the CSV files

  ## 'pollutant' is a character vector of length 1 indicating
  ## the name of the pollutant for which we will calculate the
  ## mean; either "sulfate" or "nitrate".

  ## 'id' is an integer vector indicating the monitor ID numbers
  ## to be used

  ## Return the mean of the pollutant across all monitors list
  ## in the 'id' vector (ignoring NA values)
}

The file needs to be saved as “pollutantmean.R” (this is only so the submission script will recognize the file).

What you need

Assuming you already have R/RStudio, all you will need at this point is the data (a few hundred small CSV files). For that, I recommend going ahead and getting the data directly from the course website on Coursera. (Note: You can join the course for free to do this; you only need to pay if you would like a certificate after completing all the coursework.)


Working my way through R

15-June-2014/qofthings/Leave a comment

So I just finished another week on the R JHU/Coursera courses that I’m working on. I was able to finish the quizzes and peer assignment for the Intro class. Due to past experience in the sciences and programming tools (e.g. git/github) this class was actually pretty easy for me. And thank goodness for that, because the other class was actually really hard this week.

The lectures are good – they are clear and easy to understand. Likewise the quizzes are nice and straightforward. What made the course hard was the homework assignment. Part of the difficulty was what you’d expect, i.e. that I’m new to R and so it takes a while to figure out how R does things. The other part of the difficulty was in figuring out what was needed to “pass” the assignment.

Actually, the issues that I (and someone else I know who is working through this) ran into remind me a lot of Code Academy. Both this assignment and Code Academy have people working through code assignments without a live person on the other end reviewing the code/output. The easiest way to compensate for that is to test the code written by the students and see if the output from the test matches what is expected. (Of course this is how programmers test their actual code as well.)

A problem that I ran into with Code Academy is that they require the code they want the students to produce to almost exactly match their own – so rather than just testing output, they test the structure as well (i.e. they look for the presence of certain variables with certain names, etc.). While this makes sense in a way, it’s also a bit limiting as there’s always more than one way to achieve a goal.

This brings me back to the Coursera course. Like Code Academy, the code has to be reviewed. Also like Code Academy, this is done mainly through tests (although each class I’m taking does have *one* peer review assignment). Unlike Code Academy, the system doesn’t tell you what you’re missing.

For example, let’s say that you’re practicing on Code Academy and try to submit your work. If it fails, you might see a message like “uh oh, it looks like you don’t have a variable named $myvariable!”. This is helpful because it tells you what specifically failed so you can fix it. This isn’t built into the Submit program with the Coursera course – so when it fails, it fails with no other information. Frustrating :(

I did work through it and managed to get my assignment submitted with all parts passed. Although I originally planned to take the R classes in pairs, I think I’m going to take them one at a time for at least the next few so that I can get into a stride. So far, with these two classes anyway, I have the grades to earn the certificate with distinction. I’d like to keep it that way. (When I do things, I like to do them full speed ahead.)

In other news, the website for Nickel City Ruby (which I’m helping plan) is finally live! It looks great and I’m incredibly excited! Planning has been going pretty smoothly thus far, so here’s to hoping that trend continues until October :)

An update for the week

8-June-2014/qofthings/Leave a comment

As a throwback to last week’s post, it looks like Reading Rainbow is up to ~$3.5mil of their new $5mil goal. How incredible is that? I think this is so wonderful and exciting. It’s great to see “Reading Rainbow 2.0” in the works to help bolster a love of reading in today’s kids. I think this is especially important because it seems like today is so noisy. Constant status updates on social media with small character limitations practically train young minds to stop after the first paragraph.

But the first paragraph is only an introduction! All the good stuff doesn’t happen until you read all the subsequent paragraphs! /newparagraphonpurpose

On this week’s social radar I came across this lovely campaign:

http://www.foodrecoverynetwork.org/

The goal is to help coordinate donations of unused/leftover food. It looks like it’s not wholly unique: there is an app for a similar effort dedicated to San Francisco (disclaimer: I’m not in San Francisco):

http://feeding-forward-node.herokuapp.com/

I love this idea. I think it’s awful that so much food is wasted because it’s too close to the expiration, or it can’t be put out the next day (e.g. prepared food at college campuses or grocery stores that is never bought). I’m not sure if we have anything like this locally, but I think it would be a great thing to start if we don’t. I’ll look into it :D

In other news this week I started the first two classes of the R Signature Track by JHU on Coursera:

https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop

So far, so good. It’s nice to ease into (especially since the first course is pretty dedicated to terminology and learning to use some well-known developer tools like git). The video lectures are very well done – clear, easy to understand, and captioned! So far the assignments have been on par with the course material – everything ties together nicely.

This is one of those things where I’m really excited to be learning something new, but I’m not sure how it will apply to anything I’m working on (yet!). Of course, all that means is that I need to figure out a project to work on it with. I could always put some exercises up here once I get far enough along to do so.

Which brings me to one of my main hopes for this blog: once I get into my “flow”, I really want to make this blog into an online tutorial of sorts. Basically: as I learn and grow as a programmer, I want to share the wealth. It will probably be a while before I start posting any exercises so I can get far enough into the course for the exercises to be useful, but hopefully they will make good references for others. Here’s to hoping!

The Coding Mant.is

Smashing Through Code

Tag: r
