SAS: Proc SQL Format

  • pull a variable with the required format from another table, with a SQL join and the format applied in one statement.
  • the example shows that the new table contains the new character variable newvar1 with the $1. format and the new date variable newvar2 with the ddmmyy10. format
 proc sql;
create table new as
select a.* , b.var1 as newvar1 format=$1. , datepart(b.var2) as newvar2 format=ddmmyy10. 
from old1 as a 
left join lib.old2 as b
on a.var = b.var;
quit;
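For readers outside SAS, the same join-and-derive pattern can be sketched with Python's built-in sqlite3 module. The tables and row values below are hypothetical; substr() stands in for the $1. format and strftime() for datepart() plus the ddmmyy10. display format.

```python
import sqlite3

# In-memory analogue of the SAS tables; table and column names mirror the
# PROC SQL example, the row contents are made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE old1 (var TEXT);
CREATE TABLE old2 (var TEXT, var1 TEXT, var2 TEXT);  -- var2 holds a datetime string
INSERT INTO old1 VALUES ('A'), ('B');
INSERT INTO old2 VALUES ('A', 'Yes', '2020-03-15 10:30:00');
""")

# Left join and derive both new columns in one statement, as in the PROC SQL step.
rows = con.execute("""
SELECT a.var,
       substr(b.var1, 1, 1)          AS newvar1,  -- analogue of format=$1. (first character)
       strftime('%d/%m/%Y', b.var2)  AS newvar2   -- analogue of datepart() + ddmmyy10.
FROM old1 AS a
LEFT JOIN old2 AS b ON a.var = b.var
ORDER BY a.var
""").fetchall()
print(rows)  # rows of old1 without a match keep None for the joined columns
```

As in PROC SQL, the left join keeps every row of old1 and fills the derived columns with nulls where old2 has no match.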

SAS: Create All Possible Combination of Categorical Variables and Generate Summary Statistics for Each Combination

A. Scenario

The data set contains personal responses on attribute 1 to attribute 4. Each response is either ‘Y’ or ‘N’, and some of the records are blank. Attributes 1 to 4 therefore have 2x2x2x2 = 16 possible combinations.

I would like to know how many people belong to each of the 16 combinations of the attributes. The distribution of people across the combinations indicates the prominent attributes of the population (where the distribution is dense) and the outlier attributes (where the distribution is sparse).

B. Code

  • create a table with all combinations
data combination;
format attr1 $1. attr2 $1. attr3 $1. attr4 $1.; *Important to add the format statement. The variables need to have exactly the same format as in the data table that you want to run summary statistics on;
do attr1= 'Y','N';
do attr2 = 'Y','N';
do attr3 = 'Y','N';
do attr4 = 'Y','N';
output;
end;
end;
end;
end;
run;
  • run the summary using Proc tabulate with the classdata option
proc tabulate data=data out=sum missing classdata=combination; *the classdata option uses the combination table created in the last step;
class attr1 attr2 attr3 attr4;
var count;
table attr1*attr2*attr3*attr4, count*(sum="");
run;

C. Output

  • the classdata table named combination
  • the summary table sum built from the classdata table combination

D. Notes

  • make sure that the variables used in the class statement of the Proc tabulate procedure are included in the classdata table;
  • The output table sum automatically adds the combinations with null values for attr1 to attr4.
  • make sure that the variables used in the classdata table have exactly the same names and formats as the variables in the table that you want to run statistics on. In the above example, attr1-4 in the classdata table combination need to match the variables attr1-4 in the data table. Use the Proc contents procedure on both data sets to make sure these variables match. Merely checking the format in the column attributes is not enough. If you omit the format statement when creating the classdata table combination, you will get the following error.
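The same all-combinations summary can be sketched in a few lines of Python, with itertools.product playing the role of the nested DO loops and a Counter standing in for the classdata-driven tabulation. The records below are hypothetical, not from the post.

```python
from itertools import product
from collections import Counter

# Hypothetical survey records: each person's responses on attr1-attr4 ('Y'/'N').
records = [
    ('Y', 'Y', 'N', 'N'),
    ('Y', 'Y', 'N', 'N'),
    ('N', 'N', 'N', 'Y'),
]

# All 2x2x2x2 = 16 combinations, like the nested DO-loop data step.
combinations = list(product('YN', repeat=4))

# Count people per combination; CLASSDATA= in PROC TABULATE plays the same
# role, forcing zero rows for combinations absent from the data.
observed = Counter(records)
summary = {combo: observed.get(combo, 0) for combo in combinations}

print(len(combinations))                 # 16
print(summary[('Y', 'Y', 'N', 'N')])     # 2
print(summary[('Y', 'N', 'Y', 'N')])     # 0
```

Without the full combination list, combinations with no records would silently disappear from the summary, which is exactly the problem the classdata option solves in SAS.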

Statistics: Tools for Systematic Review and Meta-Analysis

A. Resource

B. Guidelines

  • Cooper & Hedges, 1994
  • Hedges & Olkin, 1985
  • Lipsey & Wilson, 2001
  • Borenstein, Hedges, Higgins, & Rothstein, 2008: Comprehensive Meta-Analysis Version 2.2.048

C. Review Process

  • Identification of studies
    • Name of the reviewer
    • Date of the review
    • Article: Author, date of publication, title, journal, issue number, pages, and credentials
  • General Information
    • Focus of study
    • Country of study
    • Variables being measured
    • Age range of participants
    • Location of the study
  • Study Research Questions
    • hypothesis
    • theoretical/empirical basis
  • Methods designs
    • Independent variables
    • Outcome variables
    • Measurement tools
  • Methods groups
    • Nonrandomized with treatment and control groups/repeated measures design
    • Number of groups
  • Methods sampling strategy
    • Explicitly stated/Implicit/not stated/unclear
    • sampling frame (telephone directory, electoral register, postcode, school listing); random selection/systematic/convenience
  • Sample information
    • number of participants in the study
    • if more than one group, the number of participants in each group
    • sex
    • socioeconomic status, ethnicity
    • special educational need
    • region
    • control for bias from confounding variables and groups
    • baseline value for longitudinal study
  • Recruitment and consent
    • Method: letters of invitation, telephone, face-to-face
    • incentives
    • consent sought
  • Data collection
    • Methods: experimental, curriculum-based assessment, focus group, group interview, one-to-one interview, observation, self-completion questionnaire, self-completion report or diary, exams, clinical test, practical test, psychological test, school records, secondary data etc.
    • who collected the data
    • reliability
    • validity
  • Data analysis
    • statistical methods: descriptive, correlation, group differences (t test, ANOVA), growth curve analysis/multilevel modeling(HLM), structural equation modeling(SEM), path analysis, regression
  • Results and conclusion
    • Group means, SD, N, estimated effect size, appropriate SD, F, t test, significance, inverse variance weight

D. Statistics

  • Cohen’s kappa
  • Cohen’s d
  • effect size
  • aggregate/weighted mean effect size
  • 95% confidence interval: upper and lower
  • homogeneity of variance (Q statistic): Tests whether the effect sizes of the studies are significantly heterogeneous (p<.05), which means that there is more variability in the effect sizes than would be expected from sampling error and that the effect sizes do not estimate a common population mean (Lipsey & Wilson, 2001)
  • df: degrees of freedom
  • I square (%): the percentage of variability in the effect sizes that is attributable to true heterogeneity, that is, over and above sampling error.
  • Outlier detection
  • mixed-effects model (consider studies as random effects): moderator analysis for heterogeneity (allow for population parameters to vary across studies, reducing the probability of committing a Type I error)
  • Proc GLM/ANOVA (consider studies as fixed effects): moderator analysis for heterogeneity
    • Region
    • Socioeconomic status
    • Geographical location
    • Education level
    • Setting
    • Language
    • sampling method
  • Statistical difference in the mean effect size of methodological feature of the study
    • confidence in effect size derivation (medium, high)
    • reliability (not reported, reported)
    • validity (not reported vs. reported)
  • classic fail-safe N/Orwin’s fail-safe N: The number of missing null studies needed to bring the current mean effect size of the meta-analysis to .04. The threshold is 5k+10, where k is the number of studies in the meta-analysis. If N is greater than the 5k+10 limit, it is unlikely that publication bias poses a significant threat to the validity of the findings of the meta-analysis.
    • Used to assess publication bias. eg. control for bias in studies (tightly controlled, loosely controlled, not controlled)
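To make the Q statistic, I², and fail-safe threshold concrete, here is a small Python sketch using inverse-variance weighting as in Lipsey & Wilson. The per-study effect sizes and variances are hypothetical, purely for illustration.

```python
# Hypothetical per-study effect sizes and their variances (not from the post).
effect_sizes = [0.30, 0.45, 0.10, 0.60]
variances    = [0.04, 0.02, 0.05, 0.03]

# Inverse-variance weights and the weighted mean effect size.
weights = [1 / v for v in variances]
mean_es = sum(w * d for w, d in zip(weights, effect_sizes)) / sum(weights)

# Q statistic: weighted squared deviations from the weighted mean effect size.
q = sum(w * (d - mean_es) ** 2 for w, d in zip(weights, effect_sizes))
df = len(effect_sizes) - 1  # degrees of freedom: k - 1

# I^2 (%): variability attributable to true heterogeneity, over and above sampling error.
i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# Fail-safe N threshold: 5k + 10, where k is the number of studies.
threshold = 5 * len(effect_sizes) + 10

print(round(mean_es, 3), round(q, 2), round(i_squared, 1), threshold)
```

With these made-up numbers Q barely exceeds its degrees of freedom, so I² is small and the heterogeneity would not look alarming; real meta-analytic software adds the chi-square p-value for Q on top of this arithmetic.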

E. Purpose/Research Questions

  • Is the treatment associated with a single effect or with multiple effects?
  • Understand the variability of studies on the association of the treatment with single or multiple effects, and explain the variable effects through study features (moderators). How do the effects of the treatment vary across different study features?

F. Reference

Statistics Basics: Odds Ratio and Relative Risk Calculation in Excel

I have done a post on meta-analysis using Proc Freq to produce the cross tabulation of a 2 by 2 table, the odds ratio, and the relative risks (column 1, column 2). In this post, I will show how the odds ratio and relative risk statistics are calculated using Excel.

A. SAS output

  • Frequency table
  • Odds table

B. Excel

  • Frequency and Odds Ratio
  • Relative Risk Column 1
  • Relative Risk Column 2

C. Interpretation

Drug: people who felt better/people who did not improve = 58/30 = 1.93333
Placebo: people who felt better/people who did not improve = 24/3 = 8
An odds ratio of 0.241667 means patients who take the drug are 0.24 times as likely to improve.
The probability of improvement with the drug is 0.74 times the probability of improvement with the placebo.
The probability of no improvement in the symptoms with the drug is 3.07 times the probability of no improvement with the placebo.
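The arithmetic in the Excel sheet can be reproduced in a short Python sketch, with the cell counts taken from the frequency table above.

```python
# 2x2 table from the SAS frequency output: rows = treatment, columns = response.
#            Better   Not better
# Drug         58         30
# Placebo      24          3
drug_b, drug_n = 58, 30
plac_b, plac_n = 24, 3

odds_drug = drug_b / drug_n             # 58/30 = 1.93333
odds_placebo = plac_b / plac_n          # 24/3  = 8
odds_ratio = odds_drug / odds_placebo   # 0.241667

# Relative risk, column 1: probability of improvement, drug vs placebo.
rr_col1 = (drug_b / (drug_b + drug_n)) / (plac_b / (plac_b + plac_n))
# Relative risk, column 2: probability of no improvement, drug vs placebo.
rr_col2 = (drug_n / (drug_b + drug_n)) / (plac_n / (plac_b + plac_n))

print(round(odds_ratio, 6), round(rr_col1, 2), round(rr_col2, 2))
```

Note the distinction: the odds ratio compares odds (improved/not improved within each arm), while the relative risks compare probabilities (improved/arm total), which is why 0.24 and 0.74 differ.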

SAS: Meta-Analysis CMH Example for Categorical Variable

A. Reference

B. Meta-Analysis

A meta-analysis is a statistical analysis that combines the results of multiple scientific studies. Meta-analysis can be performed when there are multiple scientific studies addressing the same question, with each individual study reporting measurements that are expected to have some degree of error. The aim then is to use approaches from statistics to derive a pooled estimate closest to the unknown common truth based on how this error is perceived.

Wikipedia
  • In meta-analysis, studies become observations.
  • Researchers collect data for meta-analysis by systematic review of the literature in the field, and compile data directly from the summary statistics in the publications.

C. Problem with simply lumping the data from different studies together

  • Not consider treatment-by-study interaction
  • Assume response rates are the same in all studies.

D. SAS Solution (follow Hamer and Simpson’s paper, but corrected the output from the paper)

  • Create data set with the results of 2 studies. B: Remitted; N:Not remitted; P: Placebo; D: Drug.
  • I have used B (Better) to indicate Remitted cases because the Proc Freq tests are based on column 1 and row 1 of the 2 by 2 table, so if we coded R for Remitted cases, the remitted cases would fall in column 2 because the table is ordered alphabetically and R comes after N.
  • The Hamer and Simpson paper actually tested the null hypothesis for the non-effective cases rather than the effective cases.
data chm;
input study $ response $ trt $ cellfreq @@;
datalines;
study1	B	P	24	study1	N	P	3
study1	B	D	58	study1	N	D	30
study2	B	P	16	study2	N	P	57
study2	B	D	2	study2	N	D	10
;
run;
  • Run Cochran-Mantel-Haenszel Statistics using Proc Freq procedure with cmh option.
proc freq data=chm;
tables study*trt*response /cmh;
weight cellfreq;
run;

E. SAS Output

  • SAS chm table
  • Frequency table
  • Cochran-Mantel-Haenszel test

F. Notes

  • The Mantel-Haenszel estimator of the common odds ratio assumes the odds ratio to be homogeneous across the studies.
  • The Mantel-Haenszel statistic tests the null hypothesis that the response rate is the same for the two treatments, after adjusting for possible differences in study response rates.
  • For the Proc Freq testing options, make sure the group that you want to test is in row 1 and column 1. It is also important to crosstab treatment as the row and response as the column, so that the interpretation of the relative risk of improvement makes sense. In Hamer and Simpson’s paper the crosstab has been transposed, so the relative risk output doesn’t make sense.

G. Interpretation

  • The CMH test statistic is 4.65 with a p-value of 0.03; therefore, we can reject the null hypothesis that there is no association between treatment and response. A p-value lower than 0.05 indicates that the association between treatment and response remains strong after adjusting for study.
  • Relative Risk (Column 1) equals 0.74, which means the probability of improvement with the drug is 0.74 times the probability of improvement with the placebo.
  • Relative Risk (Column 2) equals 1.51, which means the probability of no improvement in the symptoms with the drug is 1.51 times the probability of no improvement with the placebo.
  • The Breslow-Day test has a large p-value of 0.295 which indicates there is no significant difference in the odds ratios among the studies.
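As a cross-check on the SAS output, the CMH chi-square for stratified 2x2 tables can be computed by hand. This Python sketch (no continuity correction) reproduces the statistic and p-value up to rounding from the two study tables above.

```python
import math

# Stratified 2x2 tables, (a, b, c, d) = (Drug-Better, Drug-Not, Placebo-Better, Placebo-Not).
strata = [
    (58, 30, 24, 3),   # study1
    (2, 10, 16, 57),   # study2
]

num = 0.0  # accumulates a - E[a] per stratum
den = 0.0  # accumulates Var(a) per stratum
for a, b, c, d in strata:
    n = a + b + c + d
    r1, r2 = a + b, c + d   # row totals (drug, placebo)
    c1, c2 = a + c, b + d   # column totals (better, not better)
    num += a - r1 * c1 / n                            # observed minus expected count
    den += r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))     # hypergeometric variance

cmh = num ** 2 / den                      # CMH chi-square statistic, 1 df
p_value = math.erfc(math.sqrt(cmh / 2))   # chi-square(1) survival function

print(round(cmh, 2), round(p_value, 3))
```

Summing the observed-minus-expected counts across strata before squaring is what lets the two studies pull in a common direction while each study's own margins are held fixed.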

* I will show the odds ratio and relative risk calculation in Excel in another post.

Tableau: Show Table with Row Level Records with No Aggregation

A. Use Hyper Tableau Data Extracts

  • Follow post to create Hyper file

B. Worksheet configuration

  • Create new measure DISTINCTID
  • Analysis Menu: Aggregated Measures -> Uncheck
  • Analysis Menu: Stack Marks -> Off
  • Drag Measure Names from Dimensions to the Filters Shelf -> Measure Names will show in the Filters Shelf -> Edit Filter -> show only Effect Size and N.
  • Drag Measure Values from Measures over Text on Mark Shelf -> Measure Values Shelf will appear with Effect Size and N
  • Drag Measure Names to Column Shelf
  • Drag DISTINCTID, Study, Grade range, Languages (bilingual) and Dominant cognitive benefits measured to Row Shelf
  • Make sure DISTINCTID -> computing using -> Table (down)

C. Worksheet Result

D. Important Notes

  • Measure Values and Measure Names need to work together to be effective.
  • The configuration for Aggregate Measures and Stack Marks can be different for every worksheet.

Tableau Special: Extract and Union Join Tables in PDF

A. Software

  • tableau prep builder 2019.2

B. Data Connection

  • PDF

C. Sample PDF document

page 9 of Odesope 2010

D. Data Interpreter

  • select “Use Data Interpreter”
  • drag “Page 9 Table 1” to “Add Data” canvas to start the flow
  • Effect size is recognized as character variable
  • In Input section, under Data Sample tab, select “Use all data”

E. Flow Development

  • Click “+” next to “Page 9 Table 1” in the flow and select “Add Union”
  • Union 1 shows there are 28 Rows of data with 7 Fields from “Page 9 Table 1”.
  • drag “Page 10 Table 1” over the Union 1 icon in the flow which means that “Page 10 Table 1” will be added to Union 1.
  • Union 1 now shows there are 58 Rows of data with 7 Fields from “Page 9 Table 1” and “Page 10 Table 1”.
  • add “Page 11 Table 1” to Union 1.

F. Add Calculation Field

  • Create a Calculated Field named [Effect Size] from the existing [Effect Size (g)]
  • Use split function to trim the ‘*’ from the string and then use float function to convert string to float.
  • Click ‘+’ next to Union 1 icon and select “Add Output”
  • Click Output icon to run flow
  • The output.hyper file will be generated under the c:\user\username\documents\My Tableau Prep Repository\Datasources\ folder
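The calculated-field logic can be sketched in Python with a hypothetical helper that mirrors Tableau's SPLIT and FLOAT functions.

```python
# Python analogue of the Tableau calculated field [Effect Size]:
# trim the trailing '*' significance markers from the effect-size string,
# then convert the remaining text to a float.
def clean_effect_size(raw: str) -> float:
    # SPLIT([Effect Size (g)], '*', 1) keeps the text before the first '*';
    # FLOAT() then converts that string to a number.
    return float(raw.split('*')[0])

print(clean_effect_size('0.52**'))  # 0.52
print(clean_effect_size('1.07'))    # 1.07
```

The function name and sample strings are made up for illustration; the point is only that splitting on '*' before the numeric conversion is what turns the PDF's character column into a usable measure.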

G. Use Case

  • Systematic Review/Meta-analysis

Regression Basics – MSR, MSE, Cook’s Distance

How do you calculate the MSR, MSE, and Cook’s Distance directly from the output dataset with the model predictions? SAS produces a general summary of ANOVA results as below, but how are these statistics calculated?

Y = Intercept + beta1*x1 + beta2*x2 + beta3*x3; n = 49; observations 20 to 47 are omitted from the display.

The Excel spreadsheet shows the model prediction as yhat_old, the residual values as resid_old, and the Cook’s distance values as cook_old. The 19th observation has a high value (1.29) against the 4/n threshold (4/49 = 0.08).

  • Mean Square due to Regression (MSR) = 24,948,880,090
    • Column model^2 contains the squares of the differences between yhat_old and the mean of Y, which is 186,509, as shown in the Excel;
    • Take the sum of the above, which equals 74,846,640,270, as shown in the Excel.
    • Divide by the degrees of freedom, which is 3 (the 3 parameters x1, x2, x3).
    • MSR is the variance that is explained by the model.
  • Mean Square Error (MSE) = 139,372,106
    • Column error^2 contains the squares of the differences between yhat_old and Y;
    • Take the sum of the above, which equals 6,271,744,784, as shown in the Excel.
    • Divide by the degrees of freedom, which is the total number of observations minus the number of parameters, counting the intercept as a parameter (intercept, x1, x2, x3).
    • MSE is the variance that is unexplained by the model.
  • Root MSE or s: The estimated standard deviation of the random error. s=sqrt(139372106)=11805.6
  • F value: MSR/MSE =179.01
  • Cook’s Distance (how the Cook’s D for observation 19 is calculated)
    • yhat_new is the model prediction excluding observation 19 from the model.
    • Column diff^2 contains the squares of the differences between yhat_old and yhat_new;
    • Take the sum of the above, which equals 723,926,182;
    • For the denominator, multiply MSE (139,372,106) by the number of parameters (4), counting the intercept as a parameter (intercept, x1, x2, x3).
    • Dividing the results of the above 2 items equals 1.2985.
    • Cook’s Distance measures an individual observation’s influence on the model by removing the observation from the model and measuring the squared difference in predictions as a proportion of the MSE per parameter. A Cook’s Distance close to 0 means that removing the observation from the model has little impact on the model.

The Excel calculation matches the SAS output.
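Those spreadsheet steps can be replayed as a short Python sketch; the sums of squares are the ones quoted above from the Excel worksheet.

```python
import math

# Sums of squares quoted from the Excel worksheet above.
n = 49                     # number of observations
p = 3                      # slope parameters (x1, x2, x3)
ss_model = 74_846_640_270  # sum of (yhat_old - mean(Y))^2
ss_error = 6_271_744_784   # sum of (Y - yhat_old)^2
ss_diff  = 723_926_182     # sum of (yhat_old - yhat_new)^2, observation 19 removed

msr = ss_model / p                   # variance explained by the model
mse = ss_error / (n - (p + 1))       # error df counts the intercept as a parameter
root_mse = math.sqrt(mse)            # estimated SD of the random error
f_value = msr / mse
cooks_d = ss_diff / (mse * (p + 1))  # Cook's D for observation 19

print(round(msr), round(mse), round(root_mse, 1), round(f_value, 2), round(cooks_d, 4))
```

The only subtlety is the two denominators: MSR divides by the 3 slope parameters, while the error degrees of freedom (n - 4) and the Cook's D denominator (MSE x 4) both count the intercept as a parameter.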

PowerBI and R Integration – Connect SAS dataset

Our database infrastructure is in SAS. Although many data requests can be handled directly through Base SAS, SAS doesn’t have good visualization capability.

Directly connecting PowerBI to SAS raw datasets, or to analytical results stored in SAS datasets, through R integration can efficiently demonstrate the insights in the data.

Here is the R script to connect the SAS dataset to PowerBI.

PowerBI and R Integration – Data Manipulation using R script

Reference: Power BI Introduction: Working with R Scripts in Power BI Desktop — Part 3, by Robert Sheldon.

In the previous post I focused on using R scripts to develop visuals in PowerBI directly. In that case, the R scripts integrated with the visuals don’t generate datasets for PowerBI, and the R scripts are isolated to each visual.

PowerBI and R scripts can also be integrated to get and transform datasets. The caveat is that an R script can only generate or transform one dataset at a time through the Power Query Editor.

Get Data:

  • File / Home Ribbon > Get Data > More … > Other > R Script > Connect
  • The “cars” dataset appears in the “Model” view.
3 examples: cars (simple R script in Source step to GET DATA); cars_mean (R script in Source step and Run R script step); cars_co (more complex R script in Source step)
  • Click “Edit Queries” on the Home Ribbon to get into the “Power Query Editor”
  • Select “cars” from the “Queries” list
  • In the “Query Settings” pane the “Applied Steps” will show up in sequence
  • To edit the original R script that gets the dataset, click the * beside the “Source” step (the first step) and the “R script” box will appear. If the “Query Settings” pane is not showing, click the “View” ribbon and then “Query Settings”.
Source step in Power Query Editor for table cars.

Transform Data:

  • To create another dataset that contains the mean distance by speed group.
  • Repeat the same GET DATA steps above and name the dataset “cars_mean”
  • Get into “Power Query Editor”, select “cars_mean” table.
  • Click “Run R Script” on the “Transform” ribbon and enter the R codes that are showing below.
  • Unlike in the Source step, you have to reference the dataset as “dataset” rather than by the name of the table, e.g. “cars_mean”.
Run R script step in the Power Query Editor for table cars_mean
cars_mean table in the Data View.

More Complicated R script for Source step:

  • In the Source step, you can refer to the dataset by its name. In this example, it is cars.
  • The R script needs to return a table. If the script only produces a value, you need the data.frame() function to convert it to a table.
  • Make sure the required R packages have been installed in the R library.
  • Use library() function to attach all the required libraries.
R script in the Source step for table cars_co in Power Query Editor
cars_co in Data View

Advanced Editor: