The Azimuth Project
Blog - El Niño project (part 5)

This is a blog article in progress, written by John Baez. To see discussions of the article as it was being written, visit the Azimuth Forum. For the final polished article, go to the Azimuth Blog.

If you want to write your own article, please read the directions on How to blog.

Here is the tale of how I downloaded some weather data and started analyzing it, using programs written by Graham Jones. Graham is an Azimuth member who has done a lot of work replicating and extending the El Niño prediction paper by Ludescher et al. I’ll talk about his work in future articles. He’s actually good at this stuff. I’m not.

Then why am I the one writing this?

Precisely because I’m almost completely ignorant about programming, I’m hoping my tale will prove that you too can do this stuff… provided you have smart friends, or read this article.

More precisely, this article explains how to:

• download and run software that runs the programming language R;

• download temperature data from the National Center for Atmospheric Research;

• use R to create a file of temperature data for a given latitude/longitude rectangle for a given time interval.

If you want to copy what I’m doing, please remember that a few details depend on the operating system. Since I don’t care about operating systems, I use a Windows PC. If you use something better, some details will differ for you.

Also: at the end of this article there are some programming puzzles. If you explain how to solve these I will be in your debt! So will Graham, since I won’t have to ask him.

A sad history

To back up my claim of almost complete ignorance, a word about my history.

I first programmed a bit in BASIC at the Lawrence Hall of Science in Berkeley back when I was visiting my uncle in the summer of 1978. I was excited because he’d given me the book Computer Lib by Ted Nelson. I did more the next year in high school, sitting in a concrete block room with a teletype terminal that was connected to a mainframe somewhere far away. I stored my programs on paper tape.

My excitement gradually dwindled, because I was having more fun doing math and physics using just pencil and paper. My own brain was easier to program than the machine. I did not start a computer company, I did not get rich. Later I did some programming in APL in college, and still later I did a bit in Mathematica in the early 1990s… but nothing much, and nothing sophisticated. Indeed, none of these languages would be the ones you’d choose to explore sophisticated ideas in computation!

I’ve just never been very interested… until now. I now want to do a lot of data analysis, and it will be embarrassing to keep asking other people to do all of it for me. I need to learn how to do it myself.

Maybe you’d like to do this stuff too—or at least watch me make a fool of myself. So here’s my tale, from the start.

Downloading and running R

To use the programs written by Graham, I need to use R, a language currently popular among statisticians. It is not the language my programmer friends would want me to learn—they’d want me to use something like Python. But tough!

To download R to my Windows PC, I cleverly type download R into Google and go to the top website it recommends:

http://cran.r-project.org/bin/windows/base/

I click the big fat button on top saying

Download R 3.1.0 for Windows

and get asked to save a file R-3.1.0-win.exe. I save it in my Downloads folder; it takes a while to download, since it’s 57 megabytes. When I get it, I click on it and follow the easy default installation instructions. My Desktop window now has a little icon on it that says R.

Clicking this, I get an interface where I can type commands after a red

>

symbol. Following Graham’s advice, I start by trying

> 2^(1:8)

which generates a list of powers of 2 from 2^1 to 2^8, like this:

[1] 2 4 8 16 32 64 128 256

Then I try

> mean(2^(1:8))

which gives the arithmetic mean of this list. Somewhat more fun is

> plot(rnorm(20))

which plots a bunch of points, apparently 20 standard normal deviates. When I hear “20 standard normal deviates” I think of the members of a typical math department… but no, those are deviants. Standard normal deviates are random numbers chosen from a Gaussian distribution of mean zero and variance 1.

Downloading climate data

To do something more interesting, I need to input data.

The papers by Ludescher et al use surface air temperatures in a certain patch of the Pacific, so I want to get ahold of those. They’re here:

NCEP/NCAR Reanalysis 1: Surface.

NCEP is the National Centers for Environmental Prediction and NCAR is the National Center for Atmospheric Research. They have a bunch of files here containing worldwide daily average temperatures on a 2.5 degree latitude × 2.5 degree longitude grid (144 × 73 grid points), from 1948 to 2010. And if you go here, the website will help you get data from within a chosen rectangle in a grid, for a chosen time interval.

These are NetCDF files, where NetCDF stands for Network Common Data Form:

NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.

According to my student Blake Pollard:

... the method of downloading a bunch of raw data via ftp (file transfer protocol) is a great one to become familiar with. If you poke around on ftp://ftp.cdc.noaa.gov/Datasets or some other ftp servers maintained by government agencies you will find all the data you could ever want. Examples of things you can download for free: raw multispectral satellite images, processed data products, 're-analysis' data (which is some way of combining analysis/simulation to assimilate data), sea surface temperature anomalies at resolutions much higher than 2.5 degrees (although you pay for that in file size). Also, believe it or not, people actually use NetCDF files quite widely, so once you know how to play around with those you'll find the world quite literally at your fingertips!

I know about ftp since I’m an old guy and this was around before the web existed. But I need to get R to accept data from these NetCDF files: that’s what scares me!

Graham said that R has a “package” called RNetCDF for using NetCDF files. So, I need to get ahold of this package, download some files in NetCDF format, and somehow get R to eat those files with the help of this package.

At first I’m utterly clueless! However, after a bit of messing around, I notice that right on top of the R interface there’s a menu item called Packages. I boldly click on this and choose Install Package(s).

I am rewarded with an enormous alphabetically ordered list of packages… obviously statisticians have lots of stuff they like to do over and over! I find RNetCDF, click on that and click something like “OK”.

I’m asked if I want to use a “personal library”. I click “no”, and get an error message. So I click “yes”. The computer barfs out some promising text:

utils:::menuInstallPkgs()
trying URL 'http://cran.stat.nus.edu.sg/bin/windows/contrib/3.1/RNetCDF_1.6.2-3.zip'
Content type 'application/zip' length 548584 bytes (535 Kb)
opened URL
downloaded 535 Kb

package ‘RNetCDF’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in C:\Users\JOHN\AppData\Local\Temp\Rtmp4qJ2h8\downloaded_packages

Success!
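In hindsight, there’s also a one-line way to do this from the R prompt, which should amount to the same thing (handy if your version of R lacks these menus):

> install.packages("RNetCDF")

and then, to actually load the package in a session:

> library(RNetCDF)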

But now I need to figure out how to download a file and get R to eat it and digest it with the help of RNetCDF.

At this point my deus ex machina, Graham, descends from the clouds and says:

You can download the files from your browser. It is probably easiest to do that for starters. Put ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/ into the browser, then right-click a file and Save link as…

This code will download a bunch of them:

for (year in 1950:1979) { 
  download.file(url=paste0("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.", year, ".nc"), destfile=paste0("air.sig995.", year, ".nc"), mode="wb") 
  }

It will put them into the “working directory”, probably C:\Users\JOHN\Documents. You can find the working directory using getwd(), and change it with setwd(). But you must use / not \ in the filepath.

Compared to UNIX, the Windows operating system has the strange feature of using \ instead of / in path names, but R uses the UNIX conventions even on Windows.

So, after some mistakes, in the R interface I type

> setwd("C:/Users/JOHN/Documents/My Backups/azimuth/el nino")

and then type

> getwd()

to see if I’ve succeeded. I’m rewarded with

[1] "C:/Users/JOHN/Documents/My Backups/azimuth/el nino"

Good!

Then, following Graham’s advice, I cut-and-paste this into the R interface:

for (year in 1950:1979) {
  download.file(url=paste0("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.", year, ".nc"), destfile=paste0("air.sig995.", year, ".nc"), mode="wb")
}

It seems to be working! A little bar appears showing how each year’s data is getting downloaded. It chugs away, taking a couple minutes for each year’s worth of data.
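Once it’s done, a quick way to check which files actually arrived—this would have saved me some grief later, as you’ll see—is:

> list.files(pattern = "air\\.sig995\\..*\\.nc$")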

Using R to process NetCDF files

Okay, now I’ve got all the worldwide daily average temperatures on a 2.5 degree latitude × 2.5 degree longitude grid (144 × 73 grid points), from 1950 to 1979.

The world is MINE!
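Before going further: if you want to peek inside one of these files yourself, here’s a minimal sketch using RNetCDF. I’m assuming the temperature variable in these particular files is named "air"—if that’s wrong, print.nc will tell you the real names:

> library(RNetCDF)
> nc <- open.nc("air.sig995.1950.nc")  # open one year of data
> print.nc(nc)                         # list its dimensions and variables
> air <- var.get.nc(nc, "air")         # read the temperature array
> dim(air)                             # 144 x 73 x 365: longitude x latitude x day
> close.nc(nc)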

But what do I do with it? Graham’s advice is again essential, along with a little R program, or script, that he wrote:

The R script netcdf-convertor.R from

https://github.com/azimuth-project/el-nino/tree/master/R

will eat the file, digest it, and spit it out again. It contains instructions.

I go to this URL, which is on GitHub, a popular free web-based service for software development. You can store programs here, edit them, and GitHub will help you keep track of the different versions. I know almost nothing about this stuff, but I’ve seen it before so I’m not completely intimidated.

I click on the blue thing that says netcdf-convertor.R and see something that looks like the right script. Unfortunately I can’t see how to download it! I eventually see a button I’d overlooked, cryptically labelled “Raw”. I figure that since I don’t want a roasted or oven-broiled piece of software, I should click on this. I indeed succeed in downloading netcdf-convertor.R. Graham later says I could have done something better, but oh well. I’m happy nothing has actually exploded yet.
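In case you’d rather skip the clicking: I suspect the “something better” is to download the raw file directly, along these lines—though the exact raw URL here is my guess at GitHub’s usual naming scheme:

> download.file("https://raw.githubusercontent.com/azimuth-project/el-nino/master/R/netcdf-convertor.R", destfile = "netcdf-convertor.R")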

Once I’ve downloaded this script, I open it using a text editor and look at it. At the top are a bunch of comments written by Graham:

     You should be able to use this by editing this section only.

setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")

lat.range <- 13:14

lon.range <- 142:143

firstyear <- 1957

lastyear <- 1958

outputfilename <- paste0("Scotland-", firstyear, "-", lastyear, ".txt")

    Explanation

  1. Use setwd() to set the working directory to the one containing the .nc files such as air.sig995.1951.nc. Example:

setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")

  2. Supply the latitude and longitude range. The NOAA data is every 2.5 degrees. The ranges are supplied as the number of steps of this size. For latitude, 1 means the North Pole and 73 means the South Pole. For longitude, 1 means 0° East, 37 is 90°E, 73 is 180°, 109 is 90°W (that is, 270°E), and 144 is 2.5°W.

These roughly cover Scotland:

lat.range <- 13:14

lon.range <- 142:143

This is the area used by Ludescher et al (2013). It is 27 × 69 points, which is then subsampled to 9 × 23:

lat.range <- 24:50

lon.range <- 48:116

  3. Supply the years:

firstyear <- 1950

lastyear <- 1952

  4. Supply the output file name as a text string. paste0() concatenates strings, which you may find handy:

outputfilename <- paste0("Pacific-", firstyear, "-", lastyear, ".txt")

Example of output

S013E142 S013E143 S014E142 S014E143
Y1950P001 281.60000272654 281.570002727211 281.60000272654 280.970002740622
Y1950P002 280.740002745762 280.270002756268 281.070002738386 280.49000275135
Y1950P003 280.100002760068 278.820002788678 281.120002737269 280.070002760738
Y1950P004 281.070002738386 279.420002775267 281.620002726093 280.640002747998
...
Y1950P193 285.450002640486 285.290002644062 285.720002634451 285.75000263378
Y1950P194 285.570002637804 285.640002636239 286.070002626628 286.570002615452
Y1950P195 285.92000262998 286.220002623275 286.200002623722 286.620002614334
...
Y1950P364 276.100002849475 275.350002866238 276.37000284344 275.200002869591
Y1950P365 276.990002829581 275.820002855733 276.020002851263 274.72000288032
Y1951P001 278.220002802089 277.470002818853 276.700002836064 275.870002854615
Y1951P002 277.750002812594 276.890002831817 276.650002837181 275.520002862439
...
Y1952P365 280.35000275448 280.120002759621 280.370002754033 279.390002775937

There is one row for each day, and 365 days in each year (leap days are omitted). In each row, you have temperatures in Kelvin for each grid point in a rectangle.

S013E142 means 13 steps South from the North Pole and 142 steps East from Greenwich. The points are in reading order, starting at the top-left (Northmost, Westmost) and going along the top row first.

Y1950P001 means year 1950, day 1. (P because longer periods might be used later.)

The instructions are admirably detailed concerning what I should do, but they don’t say where the output will appear when I do it. Is this typical of software documentation? I guess I should just try it. After all, the program is not called DestroyTheWorld.

Unfortunately, at this point a lot of things start acting weird.

It’s too complicated and boring to explain in detail, but basically, I keep getting a file missing error message. I don’t understand why this happens under some conditions and not others. I try lots of experiments.

Eventually I discover that one year of temperature data failed to download—the year 1949, right after the first year available! So, I’m getting the error message whenever I try to do anything involving that year of data.

To fix the problem, I simply download the 1949 data by hand from here:

ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/.

(You can open ftp addresses in a web browser just like http addresses.) I put it in my working directory for R, and everything is fine again. Whew!
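I could equally have done it with the same command Graham gave me, restricted to that one year:

> download.file(url = "ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.1949.nc", destfile = "air.sig995.1949.nc", mode = "wb")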

By the time things are finally working, I sort of know what to do—after all, I’ve spent about an hour trying lots of different things.

I decide to create a file listing temperatures where I live in Riverside from 1948 to 1979. To do this, I open Graham’s script netcdf-convertor.R in a word processor and change this section:

setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")

lat.range <- 13:14

lon.range <- 142:143

firstyear <- 1957

lastyear <- 1958

outputfilename <- paste0("Scotland-", firstyear, "-", lastyear, ".txt")

to this:

setwd("C:/Users/JOHN/Documents/My Backups/azimuth/el nino")

lat.range <- 23:23

lon.range <- 98:98

firstyear <- 1948

lastyear <- 1979

outputfilename <- paste0("Riverside-", firstyear, "-", lastyear, ".txt")

Why? Well, I want it to put the file in my working directory. I want the years from 1948 to 1979. And I want temperature data from where I live! Googling the info, I see Riverside, California is at 33.9481° N, 117.3961° W.

34° N is about 56 degrees south of the North Pole, which is 22 steps of size 2.5°. And because some idiot decided everyone should count starting at 1 instead of 0 even in contexts like this, the North Pole itself is step 1, not step 0… so Riverside is latitude step 23. That’s why I write:

lat.range <- 23:23

Similarly, 117.5° W is 242.5° E, which is 97 steps of size 2.5°… which counts as step 98 according to this braindead system. That’s why I write:

lon.range <- 98:98
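If you’d rather not redo this arithmetic every time, here’s a little function that should do the conversion. This is my own sketch, not part of Graham’s script:

> grid.steps <- function(lat, lon) {
    # lat in degrees North, lon in degrees East (negative for South/West);
    # returns step numbers on the 2.5-degree grid, counting from 1
    # at the North Pole and at the Greenwich meridian.
    c(lat.step = round((90 - lat) / 2.5) + 1,
      lon.step = round((lon %% 360) / 2.5) + 1)
  }
> grid.steps(33.9481, -117.3961)
lat.step lon.step
      23       98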

Having done this, I save the file netcdf-convertor.R under another name, Riverside.R.

And then I do some stuff that took some fiddling around to discover.

First, in my R interface I go to the menu item File, at far left, and click on Open script. It lets me browse around, so I go to my working directory for R and choose Riverside.R. A little window called R editor opens up in my R interface, containing this script.

I’m probably not doing this optimally, but I can now right-click on the R editor and see a menu with a choice called Select all. If I click this, everything in the window turns blue. Then I can right-click again and choose Run line or selection. And the script runs!
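(I later realize there’s probably a less clicky way: R’s source() command runs a whole script file, so this one line should do the same job:)

> source("Riverside.R")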

Voilà!

It huffs and puffs, and then stops. I peek in my working directory, and see that a file called

Riverside-1948-1979.txt

has been created. When I open it, it has lots of lines, starting with these:

S023E098
Y1948P001 279.95
Y1948P002 280.14
Y1948P003 282.27
Y1948P004 283.97
Y1948P005 284.27
Y1948P006 286.97

As Graham promised, each line has a year and day label, followed by a vector… which in my case is just a single number, since I only wanted the temperature in one location. I’m hoping this is the temperature near Riverside, in kelvin.

A small experiment

To see if this is working, I’d like to plot these temperatures and see if they make sense. Unfortunately I have no idea how to get R to take a file containing data of the sort I have and plot it! I need to learn how, but right now I’m exhausted, so I use another method to get the job done— a method that’s too suboptimal and embarrassing to describe here. (Hint: it involves the word “Excel”.)

I do a few things, but here’s the most interesting one (which is still not very interesting): I plot the temperatures for 1963:

I compare it to some publicly available data, not from Riverside, but from nearby Los Angeles:

As you can see, there was a cold day on January 13th, when the temperature dropped to 33°F. That seems to be visible on the graph I made, and looking at the data from which I made the graph, I see the temperature dropped to 251.4 kelvin on the 13th: that’s -7°F, very cold. It does get colder around Riverside than in Los Angeles in the winter, since it’s a desert, with temperatures not buffered by the ocean. So, this does seem compatible with the public records. That’s mildly reassuring.

But other features of the graph don’t match, and I’m not quite sure if they should or not. So, all this is very tentative and unimpressive. However, I’ve managed to get over some of my worst fears, download some temperature data, and graph it! Now I need to learn how to use R to do statistics with this data, and graph it in a better way.

Puzzles

You can help me out by answering these puzzles. Later I might pose puzzles where you can help us write really interesting programs. But for now it’s just about learning R.

Puzzle 1. Given a file with lots of lines of this form:

S023E098
Y1948P001 279.95
Y1948P002 280.14
Y1948P003 282.27
Y1948P004 283.97
Y1948P005 284.27
Y1948P006 286.97

write an R program that creates a huge vector, or list of numbers, like this:

279.95, 280.14, 282.27, 283.97, 284.27, 286.97, …

Puzzle 2. Extend the above program so that it plots this list of numbers, or outputs it to a new file.

If you want to test your programs, here’s the actual file:

Riverside-1948-1979.txt
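For Puzzle 1, here’s my own untested guess, just to give you something to improve on. I’m hoping read.table notices that the first line has one fewer field than the rest, treats it as a header, and uses the Y1948P001-style labels as row names:

> x <- read.table("Riverside-1948-1979.txt")
> temps <- x[[1]]   # the single column of temperatures, as a plain vector

And for Puzzle 2, maybe something like:

> plot(temps, type = "l")                                   # plot the numbers...
> write(temps, file = "Riverside-temps.txt", ncolumns = 1)  # ...or save them to a file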

category: blog, climate, software