SAS String Functions

SCAN/COMPRESS/COMPBL

SCAN(string, count<,charlist <,modifiers>>)

Column: deptname
‘AB-Department X’
‘CD-Department Y’

data faculty;
set admin1516;
where divdesc = “Faculties”;
length fac $4.;
fac = scan(deptname, 1, ‘-‘);
run;

New Column: fac
‘AB’
‘CD’

COMPRESS(source<, chars><, modifiers>)

compress all the blanks

a= ‘A  B    C  D’

b= compress(a)

results for b is : ‘ABCD’

COMPBL: removes multiple blanks

a= ‘A  B  C D ‘

b= compbl(a)

results for b is : ‘A B C D’

 

 

Remote submit and library

libname   datalib   remote '/directory1/folder' server=servername.spawner; run;

ORACLE DATA

libname datalib oracle user=username password="password" schema=schemaname;

Start remote process.

rsubmit;

Download dataset to local.

proc download data=serverdataset; run;
proc download data=serverdataset out=locallib.dataset; run;

End remote process from the client session.

endrsubmit;

 

 

 

Delete dataset

Library has to be specified using library= option.  Use “delete libraryname.tablea” doesn’t work. Option “noprint” causes error.

rsubmit;
proc datasets library=libraryname nodetails nolist;
delete tablea;
run;
endrsubmit;

 

Import CSV, DBF, and xlsx datasets with data compression

  1. Import table1.xlsx
PROC IMPORT OUT= WORK.table1
DATAFILE= "X:\GROUPS\Surveys\
results\4. Data\table1.xlsx"
DBMS=EXCEL REPLACE;
RANGE="table1$";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;
  1. Import table1.csv. Data file is systematically saved in the year folder

Method 1:

%macro survey (strtyear, stopyear);
%do yr= &strtyear %to &stopyear;
%put &yr.;
data WORK.table_&yr. ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile "X:\GROUPS\Surveys\&yr.\results\4. Data\table1.csv" delimiter = ',' TRUNCOVER DSD lrecl=32767
firstobs=2 ;
informat field1 $1. ;
informat field2 $4. ;

format field1 $1. ;
format field2 $4. ;

input
field1 $
field2 $;

label
field1 = "Label 2"
field2= "Label 2"

if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
%end;
%mend survey;
%survey (2007, 2011);

Method 2:

PROC IMPORT OUT= WORK.table_2013
DATAFILE= “L:\GROUPS\Surveys\2013\Results\4. Data\table1.csv”
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

  1. Import dbf dataset
PROC IMPORT OUT= WORK.table_2013 (compress=yes)
DATAFILE= "L:\GROUPS\Surveys\2013\Results\4. Data\table1.dbf"
DBMS=DBF REPLACE;
GETDELETED=NO;
RUN;

Compression datasets

  1. Compressing Text data

(compress = yes)

  1. Compressing Numeric data

(compress=binary)

 

R Online Reference and Resources

reference site:

http:/www.bi-notes.com

Tutorial: From 0 to R with Google Analytics

https://developers.google.com/analytics/solutions/r-google-analytics

Blog

http://www.tmllr.com/?utm_source=analyticsdemystified.com&utm_medium=social&utm_content=tims_r_post&utm_campaign=gilliganisawesome

From level[1] to level[2] with R and Google Analytics: ggplot2

http://www.tmllr.com/from-level1-to-level2-with-r-and-google-analytics-ggplot2/

Crawling Websites

Found a very interesting blog on “Simple Web Crawler with SAS Macro and SAS Data Step” at http://www.sascommunity.org/wiki/Simple_Web_Crawler_with_SAS_Macro_and_SAS_Data_Step with the following code to strip url from the listed websites.

data work.links_to_crawl;
length url $256;
input url $;
datalines;
http://www.yahoo.com
http://www.sas.com
http://www.google.com
;
run;

%macro crawler();
%let html_num = 1;

data work.links_crawled;
length url $256;
run;

%next_crawl:
/* pop the next url off */
%let next_url = ;

data work.links_to_crawl;
set work.links_to_crawl;
if _n_ eq 1 then call symput(“next_url”, url);
else output;
run;

%let next_url = %trim(%left(&next_url));

%if “&next_url” ne “” %then %do;

%put crawling &next_url … ;

/* crawl the url */
filename _nexturl url “&next_url”;

/* put the file we crawled here */
filename htmlfile “file%trim(&html_num).html”;

/* find more urls */
data work._urls(keep=url);
length url $256 ;
file htmlfile;
infile _nexturl length=len;
input text $varying2000. len;

put text;

start = 1;
stop = length(text);

if _n_ = 1 then do;
retain patternID;
pattern = ‘/href=”([^”]+)”/i’;
patternID = prxparse(pattern);
end;

/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances. */
/* PRXNEXT changes the start parameter so that searching */
/* begins again after the last match. */
call prxnext(patternID, start, stop, text, position, length);
do while (position ^= 0);
url = substr(text, position+6, length-7);
* put url=;
output;
call prxnext(patternID, start, stop, text, position, length);
end;
run;

/* add the current link to the list of urls we have already crawled */
data work._old_link;
url = “&next_url”;
run;
proc append base=work.links_crawled data=work._old_link;
run;

/* only add urls that we haven’t already crawled or that aren’t queued up to be crawled */
proc sql noprint;
create table work._append as
select url
from work._urls
where url not in (select url from work.links_crawled)
and url not in (select url from work.links_to_crawl);
quit;

/* only add urls that are absolute (http://…) */
data work._append;
set work._append;
absolute_url = substrn(url, 1, 7);
put absolute_url=;
if absolute_url eq “http://” ;
drop absolute_url;
run;

/* add new links */
proc append base=work.links_to_crawl data=work._append force;
run;

/* increment our file number */
%let html_num = %eval(&html_num + 1);

/* loop */
%goto next_crawl;
%end;

%mend crawler;

%crawler();

 

Use PROC Univariate to get histogram by group

proc univariate data =sample.college NOPRINT;
class region type;
histogram tuition / normal (color=red)
cfill =itgray
ctext =blue;
inset N= ‘Number of colleges’ Median (8.2) mean (8.2) std = ‘Standard Deviation’ (8.3) / position = ne;
run;

The following histograms illustrate the distribution of tuition by region and type of college.  You can get a general understanding that private colleges are much more expensive than the public colleges in certain region.

Histogram5