SAS Introduction - Mendota Ice Record
Doug Hemken
February 2020
Introduction
This example looks at the Mendota Lake ice records to illustrate what an ordinary analysis might look like using SAS.
Question:
When will the ice fishing season end ("ice out")?
The data are published by the Wisconsin State Climatology Office
Data Step
The first step to our analysis is to read the data into SAS format. This uses a "global statement", filename
, to tell SAS where to find the data on the web, and a DATA
step, a block of SAS code that reads and manipulates the data into a format suitable for analysis.
filename ice url
"http://www.ssc.wisc.edu/~hemken/SASworkshops/data/Mendota2020.txt";
data MendotaIce; * Creating a data set;
infile ice firstobs=2;
input Winter $ Closed $ Opened $ Days;
year=input(substr(winter,1,4), 4.)+1;
if closed ne "" then close =
input(compress(closed||put(year-1, 4.0), '-', 's'), date9.);
if opened ne "" then open =
input(compress(opened||put(year, 4.0), '-', 's'), date9.);
icein = mdy(month(close),day(close), 1960);
if month(close) in (1, 2, 3) then close =
intnx('year', close, 1, 'sameday');
if month(close) in (11, 12) then icein =
intnx('year', icein, -1, 'sameday');
if month(open) eq 12 then open =
intnx('year', open, -1, 'sameday');
dur = open - close;
iceout = icein + dur;
format close open mmddyy10.;
format icein iceout date5.;
drop closed opened;
run;
Most of the DATA step details are about getting dates into a convenient form for analysis. Open
and close
are in a date format. The data values are integers to which we have attached date labels, so that the integers 0 = "1/1/1960", 1 = "1/2/1960", -1 = "12/31/1959",
366 = "1/1/1961", etc. The numeric coding makes it easy for us to calculate the difference between two dates or the average of a collection of dates. The format (the labels) help us keep track of what those numbers mean.
Icein
and iceout
are coded in "day of the year" format, so that 0 = 1Jan, 1 = 2Jan, -1 = 31Dec, etc. This is similar to date format, but without the year.
Here is what the first few observations look like. While most of the data is complete, some of the early data is not.
Obs Winter Days year close open icein dur iceout
1 1852-53 . 1853 . 04/05/1853 . . .
2 1853-54 . 1854 12/27/1853 . 27DEC . .
3 1854-55 . 1855 . . . . .
4 1855-56 118 1856 12/18/1855 04/14/1856 18DEC 118 14APR
5 1856-57 151 1857 12/06/1856 05/06/1857 06DEC 151 05MAY
Proc Steps
With the data in SAS format, we can use various statistical PROC
s to graph and analyze the data.
Sgplot
proc sgplot data=MendotaIce;
series y=iceout x=year;
yaxis tickvalueformat= data;
run;
There are seven suspiciously early ice out dates, which should make us look more closely at the data.
Obs Winter Days year close open icein dur iceout
85 1936-37 . 1937 12/07/1936 12/30/1936 07DEC 23 30DEC
86 1936-37 121 1937 01/05/1937 04/13/1937 05JAN 98 12APR
In some winters the lake freezes over, opens a little, then refreezes. Let's agree that we are interested in the ultimate ice out date, so we want to drop a few observations in our analysis.
Means
A simple answer to our question can be found with PROC MEANS
.
proc means data=MendotaIce; * simple descriptives;
where iceout gt "12Jan60"d;
var iceout;
run;
The MEANS Procedure
Analysis Variable : iceout
N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------
164 92.3414634 11.6768551 57.0000000 125.0000000
-------------------------------------------------------------------
Over the last 150+ years, the ice usually breaks up around the 92nd or 93rd day of the year.
We could apply our date format to interpret these results more easily.
iceout
02APR
The hefty standard deviation tells us we have to allow a large margin of error in guessing an ice out date.
To see the distribution of the ice out date, we can use a PROC
step that produces graphs.
proc sgplot data=MendotaIce;
where iceout gt "12Jan60"d;
histogram iceout;
density iceout /type=kernel;
run;
Freq
Many procedures produce both tables and plots. To look at the distribution of ice out by month we could do this:
proc freq data=MendotaIce;
where iceout gt "12Jan60"d;
tables iceout / plots=freqplot;
format iceout monname.;
run;
Reg
Can we do better if we take global warming into account?
proc reg data=MendotaIce;
where iceout gt "12Jan60"d;
model iceout = year;
run; quit;
Number of Observations Read | 164 |
---|---|
Number of Observations Used | 164 |
Analysis of Variance | |||||
---|---|---|---|---|---|
Source | DF |
Sum of Squares |
Mean Square |
F Value | Pr > F |
Model | 1 | 2961.12546 | 2961.12546 | 24.90 | <.0001 |
Error | 162 | 19264 | 118.91205 | ||
Corrected Total | 163 | 22225 |
Root MSE | 10.90468 | R-Square | 0.1332 |
---|---|---|---|
Dependent Mean | 92.34146 | Adj R-Sq | 0.1279 |
Coeff Var | 11.80908 |
Parameter Estimates | |||||
---|---|---|---|---|---|
Variable | DF |
Parameter Estimate |
Standard Error |
t Value | Pr > |t| |
Intercept | 1 | 266.24285 | 34.85918 | 7.64 | <.0001 |
year | 1 | -0.08976 | 0.01799 | -4.99 | <.0001 |
What does the intercept, about 270, mean?
Estimate
23SEP
While mathematically correct, and usable to calculate a prediction for 2020, it might make sense to rescale year
to simply give us our answer.
data recentered;
set mendotaice;
yearc = year-2020;
run;
Notice that the intercept changes but the slope does not.
proc reg data=recentered plots=none;
where iceout gt "12Jan60"d;
model iceout = yearc;
run; quit;
Number of Observations Read | 164 |
---|---|
Number of Observations Used | 164 |
Analysis of Variance | |||||
---|---|---|---|---|---|
Source | DF |
Sum of Squares |
Mean Square |
F Value | Pr > F |
Model | 1 | 2961.12546 | 2961.12546 | 24.90 | <.0001 |
Error | 162 | 19264 | 118.91205 | ||
Corrected Total | 163 | 22225 |
Root MSE | 10.90468 | R-Square | 0.1332 |
---|---|---|---|
Dependent Mean | 92.34146 | Adj R-Sq | 0.1279 |
Coeff Var | 11.80908 |
Parameter Estimates | |||||
---|---|---|---|---|---|
Variable | DF |
Parameter Estimate |
Standard Error |
t Value | Pr > |t| |
Intercept | 1 | 84.93663 | 1.71084 | 49.65 | <.0001 |
yearc | 1 | -0.08976 | 0.01799 | -4.99 | <.0001 |
Now the intercept means the expected ice out date for this year:
Estimate
25MAR
About a week earlier than if we ignore the trend.
Over the course of a lifetime, ice fishing season had ended about a week earlier, and winter is about two weeks shorter.