Friday, 1 March 2013

Freq Used SAS Proc with exmaples


Freq Used SAS Proc with exmaples

PROC DATASETS

PROC DATASETS is a utility procedure that helps to manage the SAS datasets in various libraries. The multi-user environments are constrained by the system resources like SAS Workspace or the shared folders. To remove unnecessary files  and manage the datasets, Proc Datasets is often used in the main programs.
With PROC DATASETS, you can
·         Delete SAS files
·         List the SAS files that are contained in a SAS library
·         Copy SAS files from one SAS library to another
·         Rename SAS files
·         List the attributes of a SAS data set, information such as the date the data were last modified, whether the data are compressed, whether the data are indexed etc.
·         Append SAS data sets
·         Create and delete indexes on SAS data sets 

The example below demonstrates three frequently used features of Proc Contents – Delete, copy and Rename.
libname mylib 'D:\mydata';
DATA mylib.intial_cl;
INPUT account credit_limit;
DATALINES;
1002 2000
1003 4000
1004 3000
;
DATA mylib.new_cl;
INPUT account credit_limit;
DATALINES;
1002 3000
1004 5000
1005 2500;
proc datasets library=mylib details;
   change new_cl=brand_new_cl;
   delete intial_cl;
   copy in=mylib out =work;   select  brand_new_cl;
run;
Line 1: specifying the library to list the detail (filenames and attributes)
Line 2: change the name ‘new_cl’ into ‘brand_new_cl’
Line 3: delete intial_cl from mylib
Line 4: copy from ‘mylib’ library to ‘work’ library the file specified in select statement (brand_new_cl)


Proc PRINT is used to understand how the data looks and also for a variety of reporting purposes. Proc print in conjunction with ODS features can produce impressive reports.  PROC Print has various options and features to control the display, filter data/variables and to do additional computation on the data.
When we work with data, it’s a good practice to have a look at a sample data extract. If the dataset is large, we would want to restrict the number of column and rows in our print report. The below example show how to do that.Assume that ‘statement’ is a dataset with 100 variables and 1 Million records. We want to see any 10 observations and 5 variables viz. account_code, current_balance, credit_limit, status_code and fico_score. This is how we do it.

Proc print data=statement (obs=10);
title 'Sample records-Statement';
var account_code current_balance credit_limit status_code  fico_score;
run;
Line 1: tells SAS the dataset name. Also restricts the number of observations printed through OBS statement.
Line 2:  specifies a title for the report.
Line 3: Specifies the variables to be printed
What we have seen is a basic version of PROC PRINT. Some of the features we use with are:  BY statement, SUM statement and PAGE statement. Here are some examples on how to use them.
BY statement: Produces a separate section of the report for each BY group.
PAGEBY: Controls page ejects that occur before a page is full. Computes the totals by page, if specified.
SUM: Computes the total for the specified variables
DATA account_perf;
INPUT account current_os ext_status_code $ int_status_code $;
cards;
1002 300   A  C
1003 20    A  C
1004 1200  A  D
1005 800   Z  A
1006 450   Z  A
1007 560   Z  A
1008 450   A  D
1009 900   Z  D
1110  300  Z  D
;run;
proc sort data = account_perf;
by ext_status_code;
run;
Proc print data = account_perf;
Title "Example for SUM, BY and PAGEBY Statements";
by ext_status_code  ;
pageby ext_status_code;
sum  current_os;run;
Note that BY statement in Proc Print require the data to be sorted by the variables in BY statement. PROC SORT procedure should be used to do the sorting prior to the print. 


PROC SORT is a very useful utility we often use when work with data and procedures. PROC SORT sorts the variable(s) in a dataset in an ascending or descending order. Additionally it performs some other operations like deleting the duplicate rows and ordering the variables for certain procedures.  When we do the data cleaning, PROC SORT is used to remove all the duplicate records or duplicate observation. In the example shown below, the accounts are appearing multiple times (duplicate and non-duplicate rows) and we want to keep the record last updated only. Using PROC SORT we can filter those accounts and create a clean SAS dataset.
Let us create a SAS dataset to demonstrate some of the capabilities of PROC SORT.

DATA statement;
INPUT account current_os ext_status_code $  Dt_updt mmddyy10. ;
format dt_updt mmddyy10.;
cards;
1002 300  A 03/15/2005
1003 20   A 03/15/2005
1003 20   A 03/15/2005
1004 1200 A 03/15/2005
1005 800  Z 03/15/2005
1006 450  Z 03/15/2005
1007 560  Z 03/15/2005
1002 300  Z 03/25/2005
1002 300  Z 03/25/2005
1009 900  Z 03/15/2005
1110 300  Z 03/15/2005
1004 1200 Z 03/26/2005
;run;

The DATA step above creates a dataset with 12 records. Note that account number 1003 and 1002 have duplicate repords. Accounts 1002 and 1004 are repeated but they are not duplicated.   In order to do any analysis or reporting, we should first clean this dataset to make sure no doble couting is done on measurement variables. Such issues are common with statement table as the customer can request to change the billing cycle for their credit card statements or problems with data loading at ETL stage.
This is how we use a PROC SORT statement:
proc sort data = statement out=statement1 nodup;
by account descending Dt_updt;
run;
Line 1: specifies the data to read in and ‘out’ statement specifies the dataset to be created after sorting. ’NOHUP’ is a keyword tells SAS to remove all records that are identical and keep only the first one.
Line 2: BY statement specifies the variables to be used for sorting. Here account variable is sorted in an ascending order (default) and Dt_updt is sorted in a descending order. Our objective is to keep the record last updated so when SAS deletes the duplicates it keeps the first record in the order sorted, and in this case ‘descending’ sort keyword brings the latest record first.
Notice that we still have multiple records for couple of accounts. To remove them we use the ‘nodupkey’ keyword as follows: 
proc sort data = statement1 out=statement2 nodupkey;
by account ;
run;
Line 1: ‘nodupkey’ keyword is specified to instruct SAS to remove all the records appearing repetitively for the variables in BY statement.
Line 2: BY statement specifies the variables where SAS should look for duplicates. If an account is repeated twice the fist record in the sort order is kept and rest are deleted from the output dataset.
So with these two sort steps we have a clean dataset with only latest information. It is possible to specify multiple variables in the BY statement and it is possible to control the order (ascending or descending) of the individual variables while performing the sorting.
Some Procedures use BY statements to categorize the output – For example, PROC PRINT, PROC SUMMARY, PROC UNIVARIATE.  Make sure you sort the data by BY variables before you perform a procedure on it.
TIP:  PROC SORT is very time/resource consuming process in Consumer Finance environment as we typically work with millions of records. We don’t have to sort data for PROC SUMMARY, unless it has a BY statement.


The TRANSPOSE procedure creates an output data set by restructuring the values in a SAS data set, transposing selected variables into observations.  It converts the row elements to columns.
Let us create a dataset to demonstrate an example:
DATA statement_summary;
INPUT Perfmonth date9. tot_accounts newacct     actives  current_balance;
format Perfmonth MONYY7. ;
label
Perfmonth ="Performance Month"
tot_accounts ="Total Number of Accounts"
actives= " #Active Accounts"
newacct     ="# New Accounts"
current_balance ="$ Current Balances";
cards;
01-Apr-03 8860303 86521 1366811     32932422.08
01-May-03 8947743 90272 1376739     30994371.23
01-Jun-03 9035228 90436 1397670     31551717.63
01-Jul-03 9123510 92157 1424469     31788023.01
01-Aug-03 9199904 79118 1417949     32329068.83
01-Sep-03 9289735 92682 1369723     33772968.41
01-Oct-03 9390095 10294 1371610     34349188.42
01-Nov-03 9493607 10673 1383296     35398736.71
01-Dec-03 9583579 93288 1427501     36525256.12
01-Jan-04 9643529 63447 1432429     39782001.88
01-Feb-04 9706194 66283 1340457     38625107.74
01-Mar-04 9723757 20907 1294083     37758060.3
;run;
proc print data = statement_summary label;run;
This is how the output looks like . Now we want to produce a report that displays al metric in a time series format ie. you want the Performance months as columns . Then it becomes a typical case where we use PROC Transpose.
                          Total
         Performance    Number of      # New      #Active     $ Current
  Obs       Month        Accounts    Accounts    Accounts      Balances

    1      APR2003       8860303       86521      1366811    32932422.08
    2      MAY2003       8947743       90272      1376739    30994371.23
    3      JUN2003       9035228       90436      1397670    31551717.63
    4      JUL2003       9123510       92157      1424469    31788023.01
    5      AUG2003       9199904       79118      1417949    32329068.83
    6      SEP2003       9289735       92682      1369723    33772968.41
    7      OCT2003       9390095       10294      1371610    34349188.42
    8      NOV2003       9493607       10673      1383296    35398736.71
    9      DEC2003       9583579       93288      1427501    36525256.12
   10      JAN2004       9643529       63447      1432429    39782001.88
   11      FEB2004       9706194       66283      1340457    38625107.74
   12      MAR2004       9723757       20907      1294083    37758060.30

This how we use a PROC TRANSPOSE
proc transpose data =statement_summary out = ts_statement;
id Perfmonth;
var tot_accounts newacct      actives  current_balance ;
run;
 Line 1: Specifies the dataset and output dataset – Proc Transpose does not print the results
Line 2: Spcifies the field to be transposed through an ‘ID’ statement.
Line 3: Specifies the variables to be transposed

This is how it looks if we print a part of the data transposed (ts_statement)
Obs    _NAME_            _LABEL_               APR2003 MAY2003  JUN2003
  1     tot_accounts    Tot Number of Accounts 860303  8947743  9035228
  2     newacct         # New Accounts         86521   90272.   90436
  3     actives         #Active Accounts       136681  1376739  397670
  4     current_balance $ Current Balances     3293242 30994371 31551717

Note that two variables are created by SAS ‘_Name_ ‘ and  ‘_Label_ ‘ . This field has all variable names and their respective labels so that we can identify them. Variable we specified in ID statement is now converted into columns.
PROC Transpose is often used to convert the variables into a time series format as shown above. It is possible to use a  ‘BY’ statement in a PROC Transpose to transpose them in a grouped manner. When we work with SAS programs for automation, there are several instances we would do a PROC transpose to reshape the data in to a structure that helps in automation.


PROC DOWNLOAD is used to download data from a remote server, when you are working with SAS remote submit session. SAS remote sign on to a Unix server enables the user to compose and submit the program in their windows clients and see the SAS log and Output in their workstation. When you are submitting a program to remote server, the Sas datasets are created in the remote server (Stoner Unix server for example). PROC Download enables to download the data into your local windows folders.
Let us look at a program :
libname loc 'c:\datasets';
rsubmit;
libname user01 '/projects/cmart'
options obs =100;
Proc downlod data = user01.cmart_auth out = loc.cmart_auth;
run;
endrsubmit;
The above program  assumes that a signon to a remote server is already estabilished. There are two librariesdeclared – ‘loc’ is a local SAS library and ‘user01’ is a remote Sas library . Options obs=100 restricts the number of observations downloaded to 100. OBS=Max should be used if the intention is to download the complete dataset.


PROC FREQ is used to produce frequency counts and cross tabulation tables for specified variables/field. It can also produce the statistics like Chi Square to analyze relationships among variables.
As a good practice, at a data processing stage we use PROC Freq on categorical fields to understand the data and their frequency distributions. For example a cross tabulation table of FICO score segments to External Status code will tell us how the accounts are distributed across various score segments and statuses. PROC FREQ offers various features to control the output and the way the tables are produced. Let us start with an example to understand them better.

DATA account_perf;
INPUT client $ account current_os ext_status_code $ int_status_code $;
cards;
Cmart 1002 300   A  C
Cmart 1003 20    A  C
JCP     1004 1200  A  D
JCP     1005 800   Z  A
GIA     1006 450   Z  A
GIA     1007 560   Z  A
JCP     1008 450   A  D
GIA     1009 900   .  D
Cmart 1110 300   Z  D
;run;
proc sort DATA =  account_perf;
BY client;
run;
proc freq DATA = account_perf;
TITLE 'Frequency of External Status Code by Client';
TABLES ext_status_code* int_status_code /missing norow nocol nopercent;
BY client;
WHERE client ne 'JCP';
LABEL client ='Client Name';
run;

The data step creates a SAS dataset named ‘account_perf’. Proc sort is used to sort the data by client as we use a ‘BY’ statement in PROC FREQ that follows.
Line 1 and 2 in PROC FREQ statement tells SAS which dataset to use and the title of the output.
Line 3 specifies SAS to generate a cross-tab of External status code and Internal Status code  trough a TABLES statement. There are several options specified like missing, norow, nocol and nopercent  to control the way statistics are displayed.  A standard FREQ output shows a column, row and cell percentages.
Line 4:  BY Statement specifies the grouping criteria. In this case, frequencies are computed for each client group.
Line 5: WHERE statement allows the conditional processing. Here all clients other than ‘JCP’ are taken for computation.
Line 5: Labels the output. Its also possible to use FORMAT statements to manipulate the display.
Here is a list of commonly used options with TABLES statement
Nocol - suppresses printing of column percentages of a cross tab.
Norow - suppresses printing of row percentages of a cross tab.
Nopercent - suppresses printing of cell percentages of a cross tab.
Missing - interprets missing values as non-missing and includes them in % and statistics calculations.
List - prints two-way to n-way tables in a list format rather than as cross tabulation tables
Chisq - performs several chi-square tests.
Now let us look at the following requirements:
i) The user wants the output not to be printed but to be put into a SAS data set for further processing. ii) Wants to order the values to be changed to ascending order of frequency counts (highest occurring first) and iii) The output to be listed instead of cross-tab.
The following program block does the job:
proc freq DATA = account_perf order=freq;
Title 'Frequency of External Status Code by Client';
TABLES int_status_code*ext_status_code  / list out = perf_freq ;
WHERE client ne 'JCP';
label client ='Client Name';
run;
At line : 1- please note that ‘order =freq’ is specified to tell SAS to order the output by descending frequency count
Line: 3- ‘list’ keyword is used to tell SAS to list the output and not cross-tabulate. ‘Out=perf_freq’ specifies the output dataset name to store the frequency output.
To sum up, PROC FREQ is a very useful and the most used of the SAS procedures. In Consumer Finance environment, PROC FREQ is used mainly in the data preparation stage and for reporting. Frequency ordering, list and output data creation using OUT are often used options.


The PROC MEANS produces descriptive statistics for numeric variables.  By default, the MEANS procedure reports N (the number of non-missing observations), Mean, Std Dev (Standard Deviation), Minimum and Maximum. We can request additional statistics through the options. Let us start with a simple example:
proc means DATA = account_perf  ;
Title 'Summary of Current Outstanding' ;
BY client ;
var current_os;
label client ='Client Name'
current_os= 'Current Outstanding' ;
run;
Line:3-Tells SAS to group the values in client field and produce the statistics separately.
Line:4- Specifies the variable for which the statistics are to be computed. Here current_os is a  field in the dataset given and that contain data for current Dollar outstanding for each account in the dataset.
Now we want to request more statistic like variance, range, sum, mean, median, minimum and maximum this is how we do it.
proc means DATA = account_perf  var range sum  mean  median min max ;
Title 'Summary Statistics of Current Outstanding’;
var current_os;
run;
Suppose we want to create a SAS dataset with the statistics computed for further processing, this is how we instruct SAS.
proc means DATA = account_perf  var  sum  mean ;
Title 'Summary of Current Outstanding' ;
var current_os;
by client;
output out = perf_mean n = count sum = total mean = avg  ;
run;
The above program requests three statistics are to be put into the output dataset ‘perf_mean’.  Total count, sum and mean for each client (BY Client) are created as count, total and avg fields, respectively, in the output dataset.
To Sum up, PROC MEANS is used to understand the numeric variables, their distributions and other statistical characteristics. Options like BY and CLASS statements enable the users to look at them by segmenting into various categories. Just like we use FREQ on categorical and character fields, Means is used on numeric data.  Additionally, PROC Means is used for summarizing large datasets with a variety of statistics for reporting purposes. In this way it can be used as a substitute for PROC SUMMARY.


PROC GPLOT is used to produce the graphical output in SAS. PROC GPLOT is considered as an improvement over the PROC PLOT procedure as it provides good quality presentation compared to simple PROC PLOT outputs.
PROC GPLOT produces a variety of two-dimensional graphs including
·         Simple scatter plots
·         Overlay plots in which multiple sets of data points display on one set of axes
·         Plots against a second vertical axis
·         Bubble plots
·         Logarithmic plots (controlled by the AXIS statement).
In conjunction with the SYMBOL statement the GPLOT procedure can produce join plots, high-low plots, needle plots, and plots with simple or spline-interpolated lines. The SYMBOL statement can also display regression lines on scatter plots.
The GPLOT procedure is useful for
·         Displaying long series of data, showing trends and patterns
·         Interpolating between data points
·         Extrapolating beyond existing data with the display of regression lines and confidence limits.
The dataset and the program below gives a good understanding about the GPLOT options and how the graphs are created.

DATA account_perf;
INPUT client $ account fico_seg $  current_os tot_payment  ext_status_code $ ;
cards;
Cmart 1002 401-500 300  100 A  C
Cmart 1003 501-600 200  150 A  C
Cmart 1004 601-700 1200 180 A  D
Cmart 1005 701-800 800  190 Z  A
Cmart 1006 801-900 450  200 Z  A
GIA     1007 401-500 560  210 Z  A
GIA     1008 501-600 450  180 A  D
GIA     1009 601-700 900  145 A  D
GIA     1110 701-800 300  148 Z  D
;
run;
proc sort data = account_perf;
by client;run;

PART-I
GOPTIONS
 RESET            = global
 GUNIT            = PCT
 CBACK            = BLACK
 CTEXT            = WHITE
/* DEVICE         = GIF*/
 FONTRES    = PRESENTATION
 HTITLE           = 3
 HTEXT            = 2
 HSIZE      = 8 IN
 VSIZE      = 6 IN
 INTERPOL   = JOIN
 NOBORDER;

PART-II
SYMBOL1  VALUE = DOT      COLOR = LIME    HEIGHT = 2.5 WIDTH = 2;
SYMBOL2  VALUE = SQUARE   COLOR = VIPK    HEIGHT = 2.5 WIDTH = 2;
SYMBOL3  VALUE = TRIANGLE COLOR = STRO    HEIGHT = 2.5 WIDTH = 2;

PART-III
AXIS1 LABEL  = ('Current O/S')
      COLOR  = WHITE
      LENGTH = 60
      MAJOR  = (NUMBER = 5)
      MINOR  = NONE       ;
AXIS2 LABEL  = ('FICO SEGMENTS')
      COLOR  = WHITE
      LENGTH = 85
      MAJOR  = (NUMBER = 5)
      MINOR  = NONE;

PROC GPLOT DATA = Account_perf;
by client;
PLOT   current_os*fico_seg tot_payment*fico_seg/overlay haxis=axis1 vaxis=axis2;
RUN;
GOPTIONS: When working with GPLOT, the first step is to become familiar with the GOPTIONS. The purpose of the GOPTIONS statement is to apply levels of specific options to graphs created in the session or to override specific default values. It can be located anywhere within your SAS program; however, in order for the requested options to be applied to it, it must be placed before the graphics procedure. The GOPTIONS used above and explained below.
reset = option resets graphics options to their default values.
gunit = sets the units of character height measurement to percentage of display height.
cback = option sets the color background to black.
cfont= option sets the font color to white
device = option selects the device driver to use to create a graphic file(GIF)
htitle= option sets the text height for the first title to 3 (in percentage of display height).
htext = option sets the text height for all text to 2 (in percentage of display height).
Interpol: interpolation method (i.e., how to "connect" the points)
border option includes a border around the graphics output area.
In Part II, SYMBOL statements create SYMBOL definitions, which are used by the GPLOT procedure. These definitions control:
·         The appearance of plot symbols and plot lines, including bars, boxes, confidence limit lines, and area fills
·         Interpolation methods
Part III defines the Axes 1 and 2 where we can name the axis, spiffy the color, length of each axis and customize the number of major and minor divisions on it.
Having seen various options in GPLOT now let us look at the GPLOT statement in Part-IV.
Line:2 Tells SAS to produce a graph for each category in the variable specified in BY statement. Note that to use BY in GPLOT, the dataset should be sorted by the BY variable.  In this example, a graph will be produced by each client in the dataset.
Line:3 specifies the X and Y axes in the graph. Here we are plotting two variables – O/S and Payments in the same Y variable (ie FICO_score) .The OVERLAY option on the PLOT statement makes sure that both plot lines appear on the same graph.  Further, definitions done in the Part-III are applied using Haxis and Vaxis options.
POC GPLOT provides the flexibility to plot single or multiple variables on a same graph. Many a time we would require to name the SAS graph files and store it in a specific directory so that we have control over the files when we do the automation. This can be achieved by DEVICE option in GOPTIONS.
In scorecard tracking or Champion-challenger strategy tracking projects often there are many segmentations (like score segments or strategy identifiers or vintage segments) and charting for various performance variables are carried out to compare the segments. This kind of complicated charting can be automated using various options of GPLOT

No comments:

Post a Comment