SAS: Reading Census Data

Census data is exceptionally large. The 2016 census profile for Ontario is 4.5G and has more than 46,694,909 lines of records. To extract the data efficiently, StatsCan provides a csv file that identifies the starting row number for each geography. Using this file, you can compile the parameter list for the geographical area of interest at the selected geographic level, eg. province level, census division level, and census subdivision level.
Census file can be downloaded at link
Step 1: Compile parameter lists

%put &name.;
Canada Ontario Durham York Toronto Peel Halton
 %put &start.;
2 2249 7287023 9513800 12198965 20521853 25289987
 %put &end.;
2248 4495 7289269 9516046 12201211 20524099 25292233

Step 2: Extract and Compile data

%macro ext (namelst=, startlst=, endlst=);
proc datasets library=work noprint;
delete census;
quit;
%let i=1;
%do %while (%scan(&namelst., &i, ' ') ne );
%let parm1=%scan(&namelst., &i, ' ');
%let parm2=%scan(&startlst., &i, ' ');
%let parm3=%scan(&endlst., &i, ' ');
data census_&parm1.;
infile 'X:\Work\Stats Can\98-401-X2016044_ONTARIO_eng_CSV\98-401-X2016044_ONTARIO_English_CSV_data.csv'
delimiter = ',' firstobs=&parm2. obs=&parm3. TRUNCOVER  DSD LRECL=32767 ;
INFORMAT 
year 8.
geo_code  $13.
geo_level 8.
geo_name $80.
gnr 8.1
gnr_lf 8.1
quality_flag $5. 
alt_geo_code 8.
Item $50.
itemID 8.
Notes 8.
Total 8.
Male $8.
Female $8.
;
FORMAT 
year 8.
geo_code  $13.
geo_level 8.
geo_name $50.
gnr 8.1
gnr_lf 8.1
quality_flag $5. 
alt_geo_code 8.
Item $80.
itemID 8.
Notes 8.
Total 8.
Male $8.
Female $8.
;
input
year 
geo_code $
geo_level 
geo_name $
gnr
gnr_lf
quality_flag 
alt_geo_code 
Item $
itemID 
Notes 
Total
Male $
Female $
;
run;
proc append base = census data = census_&parm1.;
run;
%let i = %eval (&i +1);
%end;
%mend ext;
%ext (namelst=&name., startlst=&start., endlst=&end. );

The resulting dataset is only 4.1mb, which you can manipulate efficiently.

SAS: Convert format catalogs from 32bit to 64bit

Reference: http://support.sas.com/kb/44/047.html

Error message:
ERROR: File FORMATS.CATALOG was created for a different operating system.
Step 1: In windows 32-bit SAS, create a transport file (.cpt) with PROC CPORT and file option.

libname my32 'X:\Work\SAS\formats'; /* path where commonfmt.sas7bcat exists */
filename cat1 'X:\Work\SAS\formats\commonfmt.cpt';  /* transport file you are creating */

proc cport lib=my32 file=cat1 memtype=catalog;
   select commonfmt;
run;

The .cpt file will contain the format information of commonfmt.sas7bcat catalog file.

Step 2: In windows 64-bit SAS, unload the transport file (.cpt) using PROC CIMPORT and infile option.

libname my64 'X:\Work\SAS\format64';  /* path to store the new Formats.sas7bcat file */
filename trans1 'X:\Work\SAS\formats\commonfmt.cpt';  /* same as in Step 1 above */

proc cimport infile=trans1 lib=my64;
run;

Step 3: Check the formats in windows 64-bit SAS.

libname sasfmt  'X:\Work\SAS\format64';

PROC CATALOG CATALOG = SASFMT.COMMONFMT;
CONTENTS;
QUIT;

SAS 9.4 ODS RTF file corrupt

Recently, I updated from SAS 9.3 to 9.4. The code was working fine on 9.3. It contains a macro running do loops, producing graphs for each loop, and outputting all the graphs into a rtf file. When I ran the code in 9.4, I can’t open the file in Word. Google search recommends update the sasv9.cfg file to increase the -memsize and -memmaxsz to 6G (https://communities.sas.com/t5/ODS-and-Base-Reporting/SAS-ODS-RTF-file-corrupt/td-p/282779). Another alternative I find is to just change ODS RTF to ODS PDF without changing any code. When running the ODS PDF statements, I got the following warning, but the PDF file was generated without a problem.

Warning: Unsupported device "ACTIVEX" for LISTING destination.  Using device "ACTXIMG".

To get rid of the warning, in SAS EG, navigating to Tools/Options…/Results/Graph. In the “Graph Format” dropdown list, changing “ActiveX” to “ActiveX image”.

SAS Installation: ” Connect: Class not registered “

Issue: Installed SAS 9.4 on the machine where SAS 9.3 has already been installed. Not able to import Excel file in SAS 9.4.
Error message:
ERROR: Connect: Class not registered
ERROR: Error in the LIBNAME statement
Configuration:
– office 365 (32 bit)
– office 2010 (64 bit)
– Windows 7 (64 bit)
– SAS 9.3
– SAS 9.4 (64 bit)
– odbc 64bit: C:\Windows\System32\odbcad32.exe
– odbc 32bit: C:\Windows\SysWOW64\odbcad32.exe

Possible Reason: SAS PC Files Server not installed or not working properly. Need Excel ODBC driver in the same bitness as in SAS.
reference: http://support.sas.com/kb/60/356.html
Attempt 1: Stop SAS PC Files Server in Service (administrative tools); uninstal SAS PC Files Server; download SAS PC Files Server from (pcfilesrv__94200__wx6__en__web__1.zip) from https://support.sas.com/downloads/package.htm?pid=2167; unzip, install from setup.exe; get Java private environment error when installing PC Files Server.
Results 1: not working

Possible Reason: SAS 9.4 is not connected to the right ODBC. ODBC 32bit has Excel in the driver list but ODBC 64bit doesn’t have Excel in the driver list.
reference: https://support.microsoft.com/en-ca/help/942976/odbc-administrator-tool-displays-both-the-32-bit-and-the-64-bit-user-d
Attempt 2: run as administrator on both ODBC files; add excel driver as the system data source in the “System DSN” tab; change target from “\system32\odbcad32.exe” to \syswow64\odbcad32.exe” in the property of “Data Sources (ODBC) in Administrative Tools
Results 2: not working

Possible Reason: office 365 doesn’t have the match Access Engine for program external to Microsoft Office to access the Office 365 application data
Attempt 3:
— download Access Database engine 2016 distribution (32bit)
— install to cmd, cd to where the installation file is saved, run AccessDatabaseEngine.exe /quiet /passive
— open registry editor on window 7: windows key + r -> run regedit -> yes
— navigate to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office\16.0\Common\FilesPaths, delete mso.dll
reference: http://support.sas.com/kb/43/933.html
https://knowledge.autodesk.com/support/autocad-map-3d/learn-explore/caas/screencast/Main/Details/7132d430-dc09-40bc-9148-94ff9db41c24.html
Results 3: not sure why but WORKED, but doesn’t work on the Office 365 32bit installation.

Attempt 4:
— uninstall Microsoft Office 365 (32bits)
— install Microsoft Office 365 (64bits)
— uninstall Access Database engine 2016 distribution (32bits)
— install Access Database engine 2016 distribution (64bits)
— install to cmd, cd to where the installation file is saved, run AccessDatabaseEngine.exe /quiet /passive
— open registry editor on window 7: windows key + r -> run regedit -> yes
— navigate to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office\16.0\Common\FilesPaths, delete mso.dll
Results 4: worked. matched SAS 64bits with Office 365 64bits

SAS: Find String Characters within String Variable

Find “Mike” in the Name variable. Use scan function when you know the position of the word within the variable and the delimiter. Use contains for general search. Use substr function when you know the starting position of the word within the variable.

  • Scan
data a;
set b;
where scan(name, 1, ' ') = "Mike";
run;
/* nested Scan example */
/* data string is like 'XX XX ABCD>XXXX' */ 
/* need to get the 'ABCD' portion of the string, but it can be different lengths */
data temp;
set set1;
temp =scan(scan(var, 3, ' '),1, '>');
keep temp;
run;
  • Contains
data a;
set b;
where upcase(name) contains "MIKE";
run;
  • Substr
data a;
set b;
where substr(name, 1, 4) = "Mike";
run;

GitHub: Download a file from github

  1. Using Jupyter Notebook
    • Open the file in the github and copy the url
    • Open Jupyter Notebook
    • Click “upload” button and paste the url in the “File Name:” field in the “File Upload” window. Click “Open”.
    • Click “upload” button where the file entry is.

SAS STAT: getting output statistics (2)

This post focus on the ODS Output for regression, including ANOVA, least square means, parameter estimates etc.

Each output from a SAS procedure has an associated name and label.  Each name is part of a name path, and each label is part of a label path.  In order to select, exclude, or modify an output table, you must first know its name.  Each level in the name path corresponds to a part of the procedure’s hierarchy of output that you can view in the Results window.

ods output 
parameterestimates = outputtable1 
fitstatistics =  outputtable2
overallanova = outputtable3
lsmeans = outputtable4
nobs = outputtable5
means = outputtable6;
ODS table name  Statement / Option Key Stats
ParameterEstimates Model / Solution Estimated linear model coefficients, t Value, Probability T, standard error
FitStatistics Default r-square, Coeff Var, Root MSE, dependent mean
OverallAnova Default DF, sum of squares, mean square, F value, Probability F
LsMeans LsMeans Least squares means are requested for each effect listed in the LSMEANS statement. LSMEANS can perform on classification variables and multiple comparison on interactions. LSMEANS are predicted population margins; that is, they estimate the marginal means over a balanced population.
Nobs Default Number of Observations
Means Means When variables in the Means statement is the same as in the Model statement, Means statement will provide the count of observations for each model effect or combination of effects.

Principal Component Analysis and Factor Analysis

Reference: A Step-by-Step Approach to Using SAS for Factor Analysis and Structural Equation Modeling

Options:

  • flag: output to flag factor loading with absolute values greater than the specified size. eg flag = .40
  • method: extracting method
  • prin: principal factors for a principal component analysis
  • mineigen: critical eigenvalue to retain and rotate any component whose eigenvalue is specified or lager.
  • nfact: number of components to be retained and rotated

SAS: Standardize Variables

Standardized variables can be referred as Z score with a mean of 0 and standard deviation of 1. When multi-scale variables are enter into Regression analysis, the variables with larger variances will have more importance and influence on the results than the variables with small variances. It is important to standardized variables in the preprocessing step for regression analysis, cluster analysis, and neural network.

To calculate the standardized variable, use the non-standardized variable minus the mean and then divided by standard deviation.

Get means and standard deviation before standardizing the variables.

proc univariate data = data1 out=stat;
var var1 var2 var3;
run;

In the data set, the variables are not measured in the same units and cannot be assumed to have equal variance. We use PROC STDIZE to standardize the variables.

proc stdize data= data1 out=Stand method =std;
var var1 var2 var3;
run;

Get means and standard deviation after standardizing the variables. Means of the variables should equal 0 and standard deviations equal to 1.

proc univariate data = Stand out=stat;
var var1 var2 var3;
run;

Eigenvalue and Eigenvector

Google Definition: 1) each of a set of values of parameter of which a differential equation has a nonzero solution (an eigenfuction) under given conditions. 2) any number such that a given matrix minus that number times the identity matrix has a zero determinant.

reference: https://www.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors

Kaiser-Guttman Criterion: ‘Eigenvalues greater than one’ (Guttan, 1954; Kaiser, 1960, 1970) is commonly used to determine number of factors to retain. The thinking behind the criterion is “that a factor must account for at least as much variance as an individual variable” per Nunnally and Bernstein (1994)