Freq Used SAS Proc with exmaples
PROC DATASETS
PROC DATASETS is a utility
procedure that helps to manage the SAS datasets in various libraries. The
multi-user environments are constrained by the system resources like SAS
Workspace or the shared folders. To remove unnecessary files and manage
the datasets, Proc Datasets is often used in the main programs.
With PROC DATASETS, you can
·
Delete
SAS files
·
List the
SAS files that are contained in a SAS library
·
Copy SAS
files from one SAS library to another
·
Rename
SAS files
·
List the
attributes of a SAS data set, information such as the date the data were last
modified, whether the data are compressed, whether the data are indexed etc.
·
Append
SAS data sets
·
Create
and delete indexes on SAS data sets
The example below demonstrates
three frequently used features of Proc Contents – Delete, copy and Rename.
libname mylib 'D:\mydata';
DATA mylib.intial_cl;
INPUT account credit_limit;
DATALINES;
1002 2000
1003 4000
1004 3000
;
DATA mylib.new_cl;
INPUT account credit_limit;
DATALINES;
1002 3000
1004 5000
1005 2500;
proc datasets library=mylib details;
change
new_cl=brand_new_cl;
delete intial_cl;
copy in=mylib out
=work; select brand_new_cl;
run;
Line 1: specifying the library to
list the detail (filenames and attributes)
Line 2: change the name ‘new_cl’
into ‘brand_new_cl’
Line 3: delete intial_cl from
mylib
Line 4: copy from ‘mylib’ library
to ‘work’ library the file specified in select statement (brand_new_cl)
Proc PRINT is used to understand
how the data looks and also for a variety of reporting purposes. Proc print
in conjunction with ODS features can produce impressive reports. PROC
Print has various options and features to control the display, filter
data/variables and to do additional computation on the data.
When we work with data, it’s a
good practice to have a look at a sample data extract. If the dataset is
large, we would want to restrict the number of column and rows in our print
report. The below example show how to do that.Assume that ‘statement’ is a
dataset with 100 variables and 1 Million records. We want to see any 10
observations and 5 variables viz. account_code, current_balance,
credit_limit, status_code and fico_score. This is how we do it.
Proc print data=statement (obs=10);
title 'Sample records-Statement';
var account_code current_balance
credit_limit status_code fico_score;
run;
Line 1: tells SAS the dataset
name. Also restricts the number of observations printed through OBS
statement.
Line 2: specifies a title
for the report.
Line 3: Specifies the variables to
be printed
What we have seen is a basic
version of PROC PRINT. Some of the features we use with are: BY
statement, SUM statement and PAGE statement. Here are some examples on how to
use them.
BY statement: Produces a separate
section of the report for each BY group.
PAGEBY: Controls page ejects that
occur before a page is full. Computes the totals by page, if specified.
SUM: Computes the total for the
specified variables
DATA account_perf;
INPUT account current_os
ext_status_code $ int_status_code $;
cards;
1002 300 A C
1003 20 A
C
1004 1200 A D
1005 800 Z A
1006 450 Z A
1007 560 Z A
1008 450 A D
1009 900 Z D
1110 300 Z D
;run;
proc sort data = account_perf;
by ext_status_code;
run;
Proc print data = account_perf;
Title "Example for SUM, BY
and PAGEBY Statements";
by ext_status_code ;
pageby ext_status_code;
sum current_os;run;
Note that BY statement in Proc
Print require the data to be sorted by the variables in BY statement. PROC
SORT procedure should be used to do the sorting prior to the print.
PROC SORT is a very useful utility
we often use when work with data and procedures. PROC SORT sorts the
variable(s) in a dataset in an ascending or descending order. Additionally it
performs some other operations like deleting the duplicate rows and ordering
the variables for certain procedures. When we do the data cleaning,
PROC SORT is used to remove all the duplicate records or duplicate observation.
In the example shown below, the accounts are appearing multiple times
(duplicate and non-duplicate rows) and we want to keep the record last
updated only. Using PROC SORT we can filter those accounts and create a clean
SAS dataset.
Let us create a SAS dataset to
demonstrate some of the capabilities of PROC SORT.
DATA statement;
INPUT account current_os
ext_status_code $ Dt_updt mmddyy10. ;
format dt_updt mmddyy10.;
cards;
1002 300 A 03/15/2005
1003 20 A 03/15/2005
1003 20 A 03/15/2005
1004 1200 A 03/15/2005
1005 800 Z 03/15/2005
1006 450 Z 03/15/2005
1007 560 Z 03/15/2005
1002 300 Z 03/25/2005
1002 300 Z 03/25/2005
1009 900 Z 03/15/2005
1110 300 Z 03/15/2005
1004 1200 Z 03/26/2005
;run;
The DATA step above creates a
dataset with 12 records. Note that account number 1003 and 1002 have
duplicate repords. Accounts 1002 and 1004 are repeated but they are not
duplicated. In order to do any analysis or reporting, we should
first clean this dataset to make sure no doble couting is done on measurement
variables. Such issues are common with statement table as the customer can
request to change the billing cycle for their credit card statements or
problems with data loading at ETL stage.
This is how we use a PROC SORT
statement:
proc sort data = statement out=statement1 nodup;
by account descending Dt_updt;
run;
Line 1: specifies the data to read
in and ‘out’ statement specifies the dataset to be created after sorting.
’NOHUP’ is a keyword tells SAS to remove all records that are identical and
keep only the first one.
Line 2: BY
statement specifies the variables to be used for sorting. Here account
variable is sorted in an ascending order (default) and Dt_updt is sorted in a
descending order. Our objective is to keep the record last updated so when
SAS deletes the duplicates it keeps the first record in the order sorted, and
in this case ‘descending’ sort keyword brings the latest record first.
Notice that we still have multiple
records for couple of accounts. To remove them we use the ‘nodupkey’ keyword
as follows:
proc sort data = statement1 out=statement2 nodupkey;
by account ;
run;
Line 1: ‘nodupkey’ keyword is
specified to instruct SAS to remove all the records appearing repetitively
for the variables in BY statement.
Line 2: BY statement specifies the
variables where SAS should look for duplicates. If an account is repeated
twice the fist record in the sort order is kept and rest are deleted from the
output dataset.
So with these two sort steps we
have a clean dataset with only latest information. It is possible to specify
multiple variables in the BY statement and it is possible to control the
order (ascending or descending) of the individual variables while performing
the sorting.
Some Procedures use BY statements
to categorize the output – For example, PROC PRINT, PROC SUMMARY, PROC
UNIVARIATE. Make sure you sort the data by BY variables before you
perform a procedure on it.
TIP: PROC SORT is very
time/resource consuming process in Consumer Finance environment as we
typically work with millions of records. We don’t have to sort data for PROC
SUMMARY, unless it has a BY statement.
The TRANSPOSE procedure creates an
output data set by restructuring the values in a SAS data set, transposing
selected variables into observations. It converts the row elements to
columns.
Let us create a dataset to
demonstrate an example:
DATA statement_summary;
INPUT Perfmonth date9.
tot_accounts newacct actives current_balance;
format Perfmonth MONYY7. ;
label
Perfmonth ="Performance
Month"
tot_accounts ="Total Number
of Accounts"
actives= " #Active
Accounts"
newacct
="# New Accounts"
current_balance ="$ Current
Balances";
cards;
01-Apr-03 8860303 86521
1366811 32932422.08
01-May-03 8947743 90272
1376739 30994371.23
01-Jun-03 9035228 90436
1397670 31551717.63
01-Jul-03 9123510 92157
1424469 31788023.01
01-Aug-03 9199904 79118
1417949 32329068.83
01-Sep-03 9289735 92682
1369723 33772968.41
01-Oct-03 9390095 10294 1371610
34349188.42
01-Nov-03 9493607 10673
1383296 35398736.71
01-Dec-03 9583579 93288
1427501 36525256.12
01-Jan-04 9643529 63447
1432429 39782001.88
01-Feb-04 9706194 66283
1340457 38625107.74
01-Mar-04 9723757 20907
1294083 37758060.3
;run;
proc print data = statement_summary label;run;
This is how the output looks like
. Now we want to produce a report that displays al metric in a time series
format ie. you want the Performance months as columns . Then it becomes a
typical case where we use PROC Transpose.
Total
Performance Number of #
New #Active $ Current
Obs Month
Accounts
Accounts Accounts Balances
1 APR2003
8860303 86521
1366811 32932422.08
2 MAY2003
8947743
90272 1376739 30994371.23
3 JUN2003
9035228
90436 1397670 31551717.63
4 JUL2003
9123510
92157 1424469 31788023.01
5 AUG2003
9199904
79118 1417949 32329068.83
6 SEP2003
9289735
92682 1369723 33772968.41
7 OCT2003
9390095
10294 1371610 34349188.42
8 NOV2003
9493607
10673 1383296 35398736.71
9 DEC2003
9583579
93288 1427501 36525256.12
10 JAN2004
9643529
63447 1432429 39782001.88
11 FEB2004
9706194
66283 1340457 38625107.74
12 MAR2004
9723757
20907 1294083 37758060.30
This how we use a PROC TRANSPOSE
proc transpose data =statement_summary out =
ts_statement;
id Perfmonth;
var tot_accounts
newacct actives current_balance ;
run;
Line 1: Specifies the dataset and output dataset – Proc
Transpose does not print the results
Line 2: Spcifies the field to be
transposed through an ‘ID’ statement.
Line 3: Specifies the variables to
be transposed
This is how it looks if we print a
part of the data transposed (ts_statement)
Obs
_NAME_
_LABEL_
APR2003 MAY2003 JUN2003
1
tot_accounts Tot Number of Accounts 860303
8947743 9035228
2
newacct
# New Accounts
86521 90272. 90436
3
actives #Active
Accounts 136681 1376739
397670
4
current_balance $ Current Balances 3293242 30994371
31551717
Note that two variables are
created by SAS ‘_Name_ ‘ and ‘_Label_ ‘ . This field has all variable
names and their respective labels so that we can identify them. Variable we
specified in ID statement is now converted into columns.
PROC Transpose is often used to
convert the variables into a time series format as shown above. It is
possible to use a ‘BY’ statement in a PROC Transpose to transpose them
in a grouped manner. When we work with SAS programs for automation, there are
several instances we would do a PROC transpose to reshape the data in to a
structure that helps in automation.
PROC DOWNLOAD is used to download
data from a remote server, when you are working with SAS remote submit
session. SAS remote sign on to a Unix server enables the user to compose and
submit the program in their windows clients and see the SAS log and Output in
their workstation. When you are submitting a program to remote server, the
Sas datasets are created in the remote server (Stoner Unix server for
example). PROC Download enables to download the data into your local windows
folders.
Let us look at a program :
libname loc 'c:\datasets';
rsubmit;
libname user01 '/projects/cmart'
options obs =100;
Proc downlod data = user01.cmart_auth out =
loc.cmart_auth;
run;
endrsubmit;
The above program assumes
that a signon to a remote server is already estabilished. There are two
librariesdeclared – ‘loc’ is a local SAS library and ‘user01’ is a remote Sas
library . Options obs=100 restricts the number of observations downloaded to
100. OBS=Max should be used if the intention is to download the complete
dataset.
PROC FREQ is used to produce
frequency counts and cross tabulation tables for specified variables/field.
It can also produce the statistics like Chi Square to analyze relationships
among variables.
As a good practice, at a data
processing stage we use PROC Freq on categorical fields to understand the
data and their frequency distributions. For example a cross tabulation table
of FICO score segments to External Status code will tell us how the accounts
are distributed across various score segments and statuses. PROC FREQ offers
various features to control the output and the way the tables are produced.
Let us start with an example to understand them better.
DATA account_perf;
INPUT client $ account current_os
ext_status_code $ int_status_code $;
cards;
Cmart 1002 300 A
C
Cmart 1003 20
A C
JCP 1004
1200 A D
JCP 1005
800 Z A
GIA 1006
450 Z A
GIA 1007
560 Z A
JCP 1008
450 A D
GIA 1009
900 . D
Cmart 1110 300 Z
D
;run;
proc sort DATA = account_perf;
BY client;
run;
proc freq DATA = account_perf;
TITLE 'Frequency of External
Status Code by Client';
TABLES ext_status_code*
int_status_code /missing norow nocol nopercent;
BY client;
WHERE client ne 'JCP';
LABEL client ='Client Name';
run;
The data step creates a SAS
dataset named ‘account_perf’. Proc sort is used to sort the data by client as
we use a ‘BY’ statement in PROC FREQ that follows.
Line 1 and 2 in PROC FREQ
statement tells SAS which dataset to use and the title of the output.
Line 3 specifies SAS to generate a
cross-tab of External status code and Internal Status code trough a
TABLES statement. There are several options specified like missing, norow,
nocol and nopercent to control the way statistics are displayed.
A standard FREQ output shows a column, row and cell percentages.
Line 4: BY
Statement specifies the grouping criteria. In this case, frequencies are
computed for each client group.
Line 5: WHERE statement allows the
conditional processing. Here all clients other than ‘JCP’ are taken for
computation.
Line 5: Labels the output. Its
also possible to use FORMAT statements to manipulate the display.
Here is a list of commonly used
options with TABLES statement
Nocol - suppresses printing of
column percentages of a cross tab.
Norow - suppresses printing of row percentages of a cross tab. Nopercent - suppresses printing of cell percentages of a cross tab.
Missing - interprets missing
values as non-missing and includes them in % and statistics calculations.
List - prints two-way to n-way tables
in a list format rather than as cross tabulation tables
Chisq - performs several
chi-square tests.
Now let us look at the following
requirements:
i) The user wants the output not
to be printed but to be put into a SAS data set for further processing. ii)
Wants to order the values to be changed to ascending order of frequency
counts (highest occurring first) and iii) The output to be listed instead of
cross-tab.
The following program block does
the job:
proc freq DATA = account_perf order=freq;
Title 'Frequency of External
Status Code by Client';
TABLES
int_status_code*ext_status_code / list out = perf_freq ;
WHERE client ne 'JCP';
label client ='Client Name';
run;
At line : 1- please note that
‘order =freq’ is specified to tell SAS to order the output by descending
frequency count
Line: 3- ‘list’ keyword is used to
tell SAS to list the output and not cross-tabulate. ‘Out=perf_freq’ specifies
the output dataset name to store the frequency output.
To sum up, PROC FREQ is a very
useful and the most used of the SAS procedures. In Consumer Finance
environment, PROC FREQ is used mainly in the data preparation stage and for
reporting. Frequency ordering, list and output data creation using OUT are
often used options.
The PROC MEANS produces descriptive
statistics for numeric variables. By default, the MEANS procedure
reports N (the number of non-missing observations), Mean, Std Dev (Standard
Deviation), Minimum and Maximum. We can request additional statistics through
the options. Let us start with a simple example:
proc means DATA = account_perf ;
Title 'Summary of Current
Outstanding' ;
BY client ;
var current_os;
label client ='Client Name'
current_os= 'Current Outstanding'
;
run;
Line:3-Tells SAS to group the
values in client field and produce the statistics separately.
Line:4- Specifies the variable for
which the statistics are to be computed. Here current_os is a field in
the dataset given and that contain data for current Dollar outstanding for
each account in the dataset.
Now we want to request more
statistic like variance, range, sum, mean, median, minimum and maximum this
is how we do it.
proc means DATA = account_perf var range
sum mean median min max ;
Title 'Summary Statistics of
Current Outstanding’;
var current_os;
run;
Suppose we want to create a SAS
dataset with the statistics computed for further processing, this is how we
instruct SAS.
proc means DATA = account_perf var
sum mean ;
Title 'Summary of Current
Outstanding' ;
var current_os;
by client;
output out = perf_mean n = count
sum = total mean = avg ;
run;
The above program requests three
statistics are to be put into the output dataset ‘perf_mean’. Total
count, sum and mean for each client (BY Client) are created as count, total
and avg fields, respectively, in the output dataset.
To Sum up, PROC MEANS is used to
understand the numeric variables, their distributions and other statistical
characteristics. Options like BY and CLASS statements enable the users to
look at them by segmenting into various categories. Just like we use FREQ on
categorical and character fields, Means is used on numeric data.
Additionally, PROC Means is used for summarizing large datasets with a
variety of statistics for reporting purposes. In this way it can be used as a
substitute for PROC SUMMARY.
PROC GPLOT is used to produce the
graphical output in SAS. PROC GPLOT is considered as an improvement over the
PROC PLOT procedure as it provides good quality presentation compared to
simple PROC PLOT outputs.
PROC GPLOT produces a variety of
two-dimensional graphs including
·
Simple
scatter plots
·
Overlay
plots in which multiple sets of data points display on one set of axes
·
Plots
against a second vertical axis
·
Bubble
plots
·
Logarithmic
plots (controlled by the AXIS statement).
In conjunction with the SYMBOL
statement the GPLOT procedure can produce join plots, high-low plots, needle
plots, and plots with simple or spline-interpolated lines. The SYMBOL
statement can also display regression lines on scatter plots.
The GPLOT procedure is useful for
·
Displaying
long series of data, showing trends and patterns
·
Interpolating
between data points
·
Extrapolating
beyond existing data with the display of regression lines and confidence
limits.
The dataset and the program below
gives a good understanding about the GPLOT options and how the graphs are
created.
DATA account_perf;
INPUT client $ account fico_seg
$ current_os tot_payment ext_status_code $ ;
cards;
Cmart 1002 401-500 300 100
A C
Cmart 1003 501-600 200 150
A C
Cmart 1004 601-700 1200 180
A D
Cmart 1005 701-800 800 190
Z A
Cmart 1006 801-900 450 200
Z A
GIA 1007
401-500 560 210 Z A
GIA 1008
501-600 450 180 A D
GIA 1009
601-700 900 145 A D
GIA 1110
701-800 300 148 Z D
;
run;
proc sort data = account_perf;
by client;run;
PART-I
GOPTIONS
RESET
= global
GUNIT
= PCT
CBACK
= BLACK
CTEXT
= WHITE
/*
DEVICE = GIF*/
FONTRES =
PRESENTATION
HTITLE
= 3
HTEXT
= 2
HSIZE
= 8 IN
VSIZE
= 6 IN
INTERPOL = JOIN
NOBORDER;
PART-II
SYMBOL1 VALUE =
DOT COLOR = LIME HEIGHT = 2.5
WIDTH = 2;
SYMBOL2 VALUE =
SQUARE COLOR = VIPK HEIGHT = 2.5 WIDTH =
2;
SYMBOL3 VALUE = TRIANGLE
COLOR = STRO HEIGHT = 2.5 WIDTH = 2;
PART-III
AXIS1 LABEL = ('Current
O/S')
COLOR = WHITE
LENGTH = 60
MAJOR = (NUMBER = 5)
MINOR = NONE ;
AXIS2 LABEL = ('FICO
SEGMENTS')
COLOR = WHITE
LENGTH = 85
MAJOR = (NUMBER = 5)
MINOR = NONE;
PROC GPLOT DATA = Account_perf;
by client;
PLOT
current_os*fico_seg tot_payment*fico_seg/overlay haxis=axis1 vaxis=axis2;
RUN;
GOPTIONS: When working with GPLOT,
the first step is to become familiar with the GOPTIONS. The purpose of the
GOPTIONS statement is to apply levels of specific options to graphs created
in the session or to override specific default values. It can be located
anywhere within your SAS program; however, in order for the requested options
to be applied to it, it must be placed before the graphics procedure. The
GOPTIONS used above and explained below.
reset = option resets graphics options to their default values.
gunit = sets the units of character height measurement to
percentage of display height.
cback = option sets the color background to black.
cfont= option sets the font color to white
device = option selects the device driver to use to create a
graphic file(GIF)
htitle= option sets the text height for the first title to 3 (in
percentage of display height).
htext = option sets the text height for all text to 2 (in
percentage of display height).
Interpol: interpolation method (i.e., how to "connect"
the points)
border option includes a border around the graphics output area.
In Part II, SYMBOL statements
create SYMBOL definitions, which are used by the GPLOT procedure. These
definitions control:
·
The
appearance of plot symbols and plot lines, including bars, boxes, confidence
limit lines, and area fills
·
Interpolation
methods
Part III defines the Axes 1 and 2
where we can name the axis, spiffy the color, length of each axis and
customize the number of major and minor divisions on it.
Having seen various options in GPLOT
now let us look at the GPLOT statement in Part-IV.
Line:2 Tells SAS to produce a
graph for each category in the variable specified in BY statement. Note that
to use BY in GPLOT, the dataset should be sorted by the BY variable. In
this example, a graph will be produced by each client in the dataset.
Line:3 specifies the X and Y axes
in the graph. Here we are plotting two variables – O/S and Payments in the
same Y variable (ie FICO_score) .The OVERLAY option on the PLOT statement
makes sure that both plot lines appear on the same graph. Further,
definitions done in the Part-III are applied using Haxis and Vaxis options.
POC GPLOT provides the flexibility
to plot single or multiple variables on a same graph. Many a time we would
require to name the SAS graph files and store it in a specific directory so
that we have control over the files when we do the automation. This can be
achieved by DEVICE option in GOPTIONS.
In scorecard tracking or
Champion-challenger strategy tracking projects often there are many segmentations
(like score segments or strategy identifiers or vintage segments) and
charting for various performance variables are carried out to compare the
segments. This kind of complicated charting can be automated using various
options of GPLOT
|
Friday, 1 March 2013
Freq Used SAS Proc with exmaples
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment