R - Calculating a mean from data held in multiple files
I am trying to write an R script that calculates the mean of a specified pollutant (nitrate or sulfate) based on data from 1 or more of 332 monitor stations. The data for each station is held in a separate file, numbered 1:332. I am new to R and, to be fair to anyone who chooses to help me, I should say that this is a homework problem. I have written the script below, which works for 1 file:
    pollutantmean <- function(directory, pollutant, id = 1:332) {
        filepath <- "/users/jim/documents/coursera/2_r_prog/data"
        for(i in seq_along(id)) {
            if(id < 10) {
                name <- paste("00", id[i], sep = "")
            }
            if(id >= 10 && id < 100) {
                name <- paste("0", id[i], sep = "")
            }
            if(id >= 100) {
                name <- id[i]
            }
        }
        file <- paste(name, "csv", sep = ".")
        station <- paste(filepath, directory, file, sep = "/")
        monitor <- read.csv(station)
        if(pollutant == "nitrate") {
            x <- mean(monitor$nitrate, na.rm = TRUE)
        }
        if(pollutant == "sulfate") {
            x <- mean(monitor$sulfate, na.rm = TRUE)
        }
        x
    }
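However, if I enter more than 1 file (e.g. 70:72), I get the mean of the last file (72). This suggests to me that it is calculating the mean of each file and overwriting it with the mean of the next, so that only the last is output. I might be able to solve this using rbind(), but I can't figure out how to assign unique names to each variable, which would then become the arguments to rbind(). I would be grateful for any help you can offer. Cheers, Jim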
You don't loop over the files.
You get the mean of the last file because, when you loop over the ids to create the names, only the last name created survives the loop.
You should create a vector of names for the stations and loop over that!
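To make the overwriting concrete, here is a small sketch contrasting a loop that reassigns a single variable with one that fills a vector. The `ids` values are just the 70:72 example from the question, and the variable names are chosen for illustration only:

```r
ids <- 70:72

# Reassigning one variable: each iteration replaces the previous value,
# so only the name built from the last id survives the loop.
for (i in seq_along(ids)) {
    name <- paste0("0", ids[i])
}
name

# Filling a pre-allocated vector keeps one name per id.
all_names <- character(length(ids))
for (i in seq_along(ids)) {
    all_names[i] <- paste0("0", ids[i])
}
all_names
```

After the first loop `name` holds a single string, while `all_names` has one entry per id and can be looped over to read every file.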
Tip: you don't need a loop and conditional statements to create the names. Just use sprintf, specifying the expected width of the string (3) and the character used to pad it (0):
    > id <- c(1, 10, 100)
    > names <- sprintf("%03d", id)
    > names
    [1] "001" "010" "100"
And this should work:
    pollutantmean <- function(directory, pollutant, id = 1:332) {
        filepath <- "/users/jim/documents/coursera/2_r_prog/data"
        names <- sprintf("%03d", id)
        files <- paste0(names, ".csv")
        # or directly: files <- sprintf("%03d.csv", id)
        station <- file.path(filepath, directory, files)
        # one mean per station file
        means <- numeric(length(station))
        for (i in seq_along(station)) {
            monitor <- read.csv(station[i])
            if(pollutant == "nitrate") {
                means[i] <- mean(monitor$nitrate, na.rm = TRUE)
            } else if(pollutant == "sulfate") {
                means[i] <- mean(monitor$sulfate, na.rm = TRUE)
            }
        }
        return(means)
    }
Edit: if you want a single mean, you can use the code above and weight each mean by the number of non-NA rows in its file. Just replace the loop with:
    means <- numeric(length(station))
    counts <- numeric(length(station))
    for (i in seq_along(station)) {
        monitor <- read.csv(station[i])
        if(pollutant == "nitrate") {
            means[i] <- mean(monitor$nitrate, na.rm = TRUE)
            counts[i] <- sum(!is.na(monitor$nitrate))
        } else if(pollutant == "sulfate") {
            means[i] <- mean(monitor$sulfate, na.rm = TRUE)
            counts[i] <- sum(!is.na(monitor$sulfate))
        }
    }
    # weight each file's mean by its number of non-NA observations;
    # na.rm = TRUE skips files with no non-NA values (their mean is NaN)
    mymean <- sum(means * counts, na.rm = TRUE) / sum(counts)
    return(mymean)
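The weighting matters because a plain average of the per-file means would give a file with few observations the same influence as a file with many. A minimal sketch with two made-up vectors (`a` and `b` are hypothetical stand-ins for one pollutant column from two files) suggests that the weighted version matches the mean of the pooled data, while the unweighted one does not:

```r
a <- c(1, 2, NA, 3)   # 3 non-NA values
b <- c(10, NA)        # 1 non-NA value

means  <- c(mean(a, na.rm = TRUE), mean(b, na.rm = TRUE))
counts <- c(sum(!is.na(a)), sum(!is.na(b)))

weighted   <- sum(means * counts) / sum(counts)  # (2*3 + 10*1) / 4
pooled     <- mean(c(a, b), na.rm = TRUE)        # (1 + 2 + 3 + 10) / 4
unweighted <- mean(means)                        # (2 + 10) / 2

# TRUE when the weighted mean agrees with the pooled mean
all.equal(weighted, pooled)
```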
Since your first intention was to gather all the data into 1 vector, here is a solution that creates a list in which each element is the desired "pollutant" variable of one data frame. unlist then gathers these vectors into 1, and you can compute the mean on that vector.
    pollutantmean <- function(directory, pollutant, id = 1:332) {
        filepath <- "/users/jim/documents/coursera/2_r_prog/data"
        names <- sprintf("%03d", id)
        files <- paste0(names, ".csv")
        # or directly: files <- sprintf("%03d.csv", id)
        station <- file.path(filepath, directory, files)
        # one vector of pollutant values per station file
        li <- lapply(station, function(x) {
            monitor <- read.csv(x)
            if(pollutant == "nitrate") {
                monitor$nitrate
            } else if(pollutant == "sulfate") {
                monitor$sulfate
            }
        })
        # na.rm = TRUE is needed here, otherwise any NA makes the result NA
        mymean <- mean(unlist(li), na.rm = TRUE)
        return(mymean)
    }
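As a possible simplification (not part of the answer above), the column can be looked up by name with `[[`, which removes the `if`/`else` branches. The sketch below is self-contained: it writes two tiny made-up CSV files to a temporary directory so it can be run without the course data. The function name `pollutantmean2`, the demo folder name, and the choice to pass the full folder path as `directory` are all assumptions for illustration.

```r
# Hypothetical variant: `directory` is the full path to the folder of CSV
# files, and the pollutant column is selected by name with [[ ]].
pollutantmean2 <- function(directory, pollutant, id = 1:332) {
    files <- file.path(directory, sprintf("%03d.csv", id))
    values <- unlist(lapply(files, function(f) read.csv(f)[[pollutant]]))
    mean(values, na.rm = TRUE)
}

# Made-up data in a temporary directory, just to exercise the function.
tmp <- file.path(tempdir(), "specdata_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(2, 4, NA)),
          file.path(tmp, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, 7), nitrate = c(NA, 6)),
          file.path(tmp, "002.csv"), row.names = FALSE)

result_sulfate <- pollutantmean2(tmp, "sulfate", id = 1:2)  # mean of 1, 3, 5, 7
result_nitrate <- pollutantmean2(tmp, "nitrate", id = 1:2)  # mean of 2, 4, 6
```

One trade-off of this approach is that a misspelled pollutant name makes `[[` return NULL for every file, so the function yields a missing value with a warning instead of stopping with an error; an explicit check on `pollutant` could be added if that matters.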