
Regression is a Univariate General Linear Model Subsuming Other Parametric Methods as Special Cases

Sherry Vidal

Texas A&M University, January 1997

Abstract

Although the concept of the general linear model has existed since the 1960s, other univariate analyses such as the t-test and OVA methods have remained popular over the years. Certain univariate analyses require some variables to be measured on a nominal rather than an interval scale, and they provide limited information about the data compared with other data analytic tools. This paper explains how regression subsumes these univariate analyses and how regression can provide the researcher with a greater understanding of the data. A heuristic data set is used to clarify the discussion.

Paper presented at the annual meeting of the Southwest Educational Research Association, Austin, January, 1997.

Regression is a Univariate General Linear Model Subsuming Other Parametric Methods as Special Cases

Year after year, graduate students learn statistics with a relatively limited conceptual understanding of the foundations of elementary univariate analyses. Maxwell, Camp, and Arvey (1981) emphasized that "researchers are not well acquainted with the differences among the various measures (of association) or the assumptions that underlie their use" (p. 525). Frequently, researchers and graduate students make assertions such as "I would rather use analysis of variance (ANOVA) than regression in my study because it is simpler and will provide me with all the information I need." Unfortunately, comments such as these are ill informed and can result in the use of less desirable data analytic tools.

All parametric univariate analyses, such as the t-test, Pearson correlation, ANOVA, and planned contrasts, are subsumed by correlational analysis. In 1968 Cohen acknowledged that ANOVA is a special case of regression; he stated that within regression analyses "lie possibilities for more relevant and therefore more powerful exploitation of research data" (p. 426). Thus, an understanding of a model which subsumes these univariate analyses is not only pertinent for any researcher, but imperative for a researcher who wants to maximize the findings from research data.

The general linear model (GLM) is a model which subsumes many univariate analyses. The GLM "is a linear equation which expresses a dependent (criterion) variable as a function of a weighted sum of independent (predictor) variables" (Falzer, 1974, p. 128). Simply stated, the GLM yields the equation whose weights minimize the sum of squared differences between the predicted and observed values of the dependent variable. From a computer printout of a regression analysis, the researcher can obtain the weight that applies to each variable and then construct this equation. Regression as a general linear model can provide exactly the same information as a t-test or ANOVA, but it also provides other information which can be useful. In addition, the GLM allows the researcher more flexibility regarding the types of variables that can be entered (e.g., interval versus nominally scaled variables).

The purpose of the present paper is to illustrate the foundations of the general linear model, in terms of regression, and the advantages this analytic tool provides over other commonly used univariate methods. The present paper outlines the general linear model conceptually; further computational detail can be found in Tatsuoka (1975). Although Cohen (1968) and Falzer (1974) acknowledged the importance of the general linear model in the 1960s and 1970s, ANOVA methods remained popular because of their computational simplicity relative to methods such as regression. Computational aids such as high-powered computers were unavailable to many researchers until the 1980s; therefore researchers used analytic methods congruent with the existing technology.

Today computers can easily compute complex analyses such as regression; however, the shift from OVA methods to the general linear model has been gradual. Willson (1980) found that during the years 1969-1978, 41% of the articles in an educational research journal used OVA methods, as compared with 25% during the years 1978-1987 (Elmore & Woehlke, 1988). Researchers are beginning to recognize that the general linear model

can be used equally well in experimental or non-experimental research. It can handle continuous and categorical variables. It can handle two, three, four or more independent variables. Finally, as we will abundantly show, multiple regression analysis can do anything that the analysis of variance does (sums of squares, mean squares, F ratios) and more. (Kerlinger & Pedhazur, 1973, p. 3)

To use OVA methods, by contrast, researchers must often reduce intervally scaled predictor variables to nominal categories, and eliminating variance in this way can lead to misleading results. Cliff (1987) stated:

such divisions are not infallible; think of the persons near the borders. Some who should be highs are actually classified as lows, and vice versa. In addition, the "barely highs" are classified the same as the "very highs," even though they are different. Therefore, reducing a reliable variable to a dichotomy makes the variable more unreliable, not less. (p. 130)

Furthermore, Thompson (1986) established that ANOVA methods tend to overestimate smaller effect sizes: "OVA methods tend to reduce power against type II errors by reducing reliability levels of variables that were originally higher than nominally scaled. Statistically significant effects are theoretically possible only when variables are reliably measured" (p. 919). Conversely, regression analyses in general "did tend to provide more accurate estimates of explained variance than did the OVA analyses. The pattern was most noticeable when sample size was small" (Thompson, 1986, p. 924).
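Cliff's point about lost reliability is easy to demonstrate with a small simulation. The sketch below is illustrative only and is not from the paper; it uses Python with numpy, and the true effect size of .5 is an arbitrary choice:

```python
# Sketch: median-splitting a continuous predictor attenuates its correlation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                     # a reliable interval-scale predictor
y = 0.5 * x + rng.normal(size=10_000)           # criterion with a true linear relation
x_split = (x > np.median(x)).astype(float)      # reduce x to "lows" vs. "highs"

print(round(np.corrcoef(x, y)[0, 1], 2))        # about .45 on the full scale
print(round(np.corrcoef(x_split, y)[0, 1], 2))  # about .36 after dichotomizing
```

The "barely highs" and "very highs" receive the same score after the split, so part of the systematic variance in x is discarded and the observed correlation shrinks.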

To examine specifically how regression and correlation subsume univariate analyses, a heuristic data set is provided in Table 1. The fictitious data for this example were taken from Daniel (1989). The two experimental conditions are represented by the variable group (1 = control, 2 = experimental). A second independent variable is sex (1 = male, 2 = female); the sample (n = 16) consisted of eight girls and eight boys. A reading posttest scored on an interval scale from 1 to 100, where 1 represents a low score and 100 a high score, served as the dependent variable. These data are used to explore how the variables can help determine which of the two classrooms is more appropriate for students.

Table 1

Heuristic Data

GROUP   PTEST   SEX   IQ    OVAIQ
  1       18     1     93     1
  1       84     2     88     1
  2       64     1     85     1
  2       81     2     95     1
  1       98     1     93     1
  1       55     2     95     1
  2       49     1     85     1
  2       14     2     87     1
  1       99     1    130     2
  1       84     2    117     2
  2       47     1    118     2
  2       99     2    106     2
  1       83     1    118     2
  1       81     2    112     2
  2       74     1    103     2
  2       99     2    104     2
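For readers who wish to reproduce the analyses outside SPSS, the Table 1 data can be entered directly as arrays. The brief sketches later in this paper assume arrays like the following (Python with numpy is used purely for illustration):

```python
# Heuristic data from Table 1 (Daniel, 1989) entered as numpy arrays.
import numpy as np

group = np.array([1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2])  # 1 = control, 2 = experimental
sex   = np.array([1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2])  # 1 = male, 2 = female
ptest = np.array([18, 84, 64, 81, 98, 55, 49, 14, 99, 84, 47, 99, 83, 81, 74, 99])
iq    = np.array([93, 88, 85, 95, 93, 95, 85, 87, 130, 117, 118, 106, 118, 112, 103, 104])
ovaiq = np.where(iq > 100, 2, 1)  # OVAIQ appears to be IQ dichotomized at 100
```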

Analysis of the Data Set

The data were analyzed using SPSS for Windows (SPSS, 1995). The following analyses were conducted: a t-test, a one-way ANOVA, a two-way ANOVA, a Pearson correlation, a planned contrast, and a regression analysis. Appendix A presents the computer program used to analyze the data.

t-Test

Since a t-test is restricted to the comparison of two means, the means of the two levels of the independent variable, group, were examined in relation to the dependent variable, posttest. The results are shown in Table 2. In this example, the researcher is attempting to understand possible differences in posttest scores between subjects in the control group and subjects in the experimental group.

Table 2

t-Test SPSS Printout

Variable    N    Mean      SD       SE of Mean
GROUP 1     8    75.2500   26.768    9.464
GROUP 2     8    65.8750   28.847   10.199

Mean Difference = 9.3750

Levene's Test for Equality of Variances: F = .132, P = .722

t-test for Equality of Means

Variances   t-value   df      2-Tail Sig   SE of Diff   95% CI for Diff
Equal         .67     14        .511        13.913      (-20.466, 39.216)
Unequal       .67     13.92     .511        13.913      (-20.482, 39.232)

The t value of .67 is the statistic commonly reported in research journals for this type of analysis. Tatsuoka (1975) illustrated that the t value is simply a function of the correlation coefficient, as shown in the following formula:

t = r√(N-2) / √(1-r²)

t-Test Done Using Regression

Step-by-step statistics from the regression output will be used to verify this formula for the heuristic data set. Table 3 presents the statistics that result from the regression analysis of these data. Initially many researchers, especially graduate students, may feel overwhelmed by this information, but this paper will highlight a few important areas of the results.

Table 3

Multiple Regression SPSS Printout

Variable   Mean     Std Dev
PTEST      70.563   27.315
GROUP       1.500     .516

N = 16

Correlation

          PTEST    GROUP
PTEST     1.000    -.177
GROUP     -.177    1.000

Multiple R            .17723
R Square              .03141
Adjusted R Square    -.03777
Standard Error      27.82647

R Square Change       .03141
F Change              .45403
Signif F Change       .5114

Analysis of Variance

Source        DF   Sum of Squares   Mean Square
Regression     1        351.56250     351.56250
Residual      14      10840.37500     774.31250

F = .45403 Signif F = .5114

Variable      B            SE B        95% CI B                   Beta
GROUP        -9.375000    13.913236   (-39.215922, 20.465922)    -.177235
(Constant)   84.625000    21.998757   (37.442360, 131.807640)

First, refer to the area titled Correlation and notice that the correlation coefficient between group and posttest equals -.177. If this correlation coefficient is inserted into the formula described previously, the following result is obtained:

t = (-.177)√(16-2) / √(1-.031) = -.662/.984 = -.67.

Except for its sign, which merely reflects the arbitrary numerical coding of group membership, this t value is identical to the t value reported in Table 2, thereby supporting the premise that a t-test is a function of correlational analysis. One can refer to the common formula for regression for a proof that regression analysis is also a function of the correlation coefficient (Thompson, 1992).
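The identity can also be checked computationally. The following minimal sketch (Python with scipy, purely illustrative) recomputes both the classic two-group t and the correlation-based t from the Table 1 data:

```python
# Sketch: the classic t-test statistic vs. t recovered from the correlation r.
import numpy as np
from scipy import stats

group = np.array([1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2])
ptest = np.array([18, 84, 64, 81, 98, 55, 49, 14, 99, 84, 47, 99, 83, 81, 74, 99])

t_classic, p = stats.ttest_ind(ptest[group == 1], ptest[group == 2])
r, _ = stats.pearsonr(group, ptest)
t_from_r = r * np.sqrt(len(ptest) - 2) / np.sqrt(1 - r**2)

print(round(t_classic, 2), round(r, 3), round(t_from_r, 2))
# 0.67  -0.177  -0.67  (same magnitude; the sign reflects the group coding)
```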

In addition, Table 3 reports an R² value of .0314, which can be interpreted as "the proportion of Y that we can explain with the predictors [independent variables]" (Thompson, 1992, p. 10). Furthermore, an adjusted R² of -.0378 is reported. This adjustment is an attempt to account for various biases (see Snyder & Lawson, 1993). However, conceptually a squared value cannot be negative, so this negative value may lead the researcher to infer that the predictor variable (group) is a poor predictor for this sample.
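For the interested reader, the printed value can be reproduced with the commonly used correction formula, here with N = 16 cases and k = 1 predictor:

adjusted R² = 1 - (1 - R²)(N - 1)/(N - k - 1) = 1 - (1 - .03141)(15/14) = -.0378.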

Lastly, the regression output gives the researcher information about the sums of squares and the weights for the regression equation. These figures aid our understanding of how, and through which variables, the explained variance is accounted for. In addition, the sum of squares values can be used to calculate an effect size, which is "the degree to which the phenomenon is present in the population" (Cohen, 1988, p. 12). Dividing the sum of squares for a given source by the total sum of squares yields an effect size. Further specifics of the regression equation and beta weights will be discussed later.
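For example, from Table 3 the total sum of squares is 351.5625 + 10840.3750 = 11191.9375, so the effect size for group is 351.5625 / 11191.9375 = .0314, which is exactly the R² reported above.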

One-Way ANOVA Analysis

A one-way ANOVA using group as the independent variable and "ptest" as the dependent variable was executed, with the results reported in Table 4. Since a one-way ANOVA with two groups is conceptually identical to a t-test, an extensive discussion of how regression subsumes the one-way ANOVA will not be presented. However, a brief proof can demonstrate how an ANOVA, specifically the F statistic, is a function of correlational analysis, via the following formula:

F = t² = (r√(N-2) / √(1-r²))²

In other words, F = .454 = t² = (.674)².

Table 4

One-Way ANOVA SPSS Printout

Analysis of Variance

Source            D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
Between Groups      1         351.5625       351.5625     .4540     .5114
Within Groups      14       10840.3750       774.3125
Total              15       11191.9375

Group    N    Mean      Standard Deviation   Standard Error   95% CI
Grp 1     8   75.2500         26.7675             9.4637      52.8718 to 97.6282
Grp 2     8   65.8750         28.8466            10.1988      41.7587 to 89.9913
Total    16   70.5625         27.3154             6.8288      56.0072 to 85.1178
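This equality is easy to verify computationally as well; the sketch below (Python with scipy, illustrative only) reuses the Table 1 arrays:

```python
# Sketch: the one-way ANOVA F equals the squared t-test statistic.
import numpy as np
from scipy import stats

group = np.array([1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2])
ptest = np.array([18, 84, 64, 81, 98, 55, 49, 14, 99, 84, 47, 99, 83, 81, 74, 99])

F, p = stats.f_oneway(ptest[group == 1], ptest[group == 2])
t, _ = stats.ttest_ind(ptest[group == 1], ptest[group == 2])
print(round(F, 4), round(t**2, 4), round(p, 4))   # 0.454  0.454  0.5114
```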

Two-Way ANOVA Analysis

Next, a two-way ANOVA was conducted. The two ways were sex (male/female) and group (control/experimental). The dependent variable was the reading posttest score. Recall that ANOVA requires both independent variables to be nominally scaled; thus sex and group are appropriate variables. Table 5 lists the SPSS output for the two-way ANOVA for the heuristic data.

Table 5

Two-Way ANOVA SPSS Printout

Source of Variation       SS         DF    MS        F     Sig of F
Main Effects
  Sex                      264.06     1    264.06    .30     .591
  Group                    351.56     1    351.56    .41     .536
  (Combined)               615.63     2    307.81    .35     .708
2-Way Interactions
  Sex By Group             175.56     1    175.56    .20     .661
Model                      791.19     3    263.73    .30     .822
Residual                 10400.8     12    866.73
Total                    11191.94    15    746.13

The two-way ANOVA gives the same information as the one-way ANOVA, but it also reports main effects and an interaction effect for the variables sex and group. A sum of squares is reported for each source, as well as an F statistic. Notice that the sum of squares for group has not changed from the one-way ANOVA to the two-way ANOVA. An effect size can also be calculated from this sum of squares information; for these data the effect size for the two-way interaction is 175.56/11,191.94 = .0157. The interaction term provides the researcher with further information about how the independent variables interact with each other in relation to the dependent variable.

Two-Way ANOVA Using Planned-Contrast Regression

In order to recreate the interaction in regression, a new variable must be created (see Appendix A for the appropriate SPSS commands). The variable "A1" now represents group membership, sex is represented by "B1", and the new variable "A1B1" represents the group-by-sex interaction. Moreover, a planned contrast is used to create orthogonal comparisons; a sketch of this coding appears below. The results are reported in Table 6.
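As an illustration, the contrasts can be built with ±1 effect coding, an assumption that is consistent with the means of .000 and variances of 1.067 printed for A1, B1, and A1B1 in Table 6. The Python sketch below recovers the regression weights reported there:

```python
# Sketch: effect-coded contrasts and the regression they imply.
import numpy as np

group = np.array([1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2])
sex   = np.array([1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2])
ptest = np.array([18, 84, 64, 81, 98, 55, 49, 14, 99, 84, 47, 99, 83, 81, 74, 99])

A1 = 2 * group - 3      # group: 1 -> -1, 2 -> +1 (sign convention assumed)
B1 = 2 * sex - 3        # sex:   1 -> -1, 2 -> +1
A1B1 = A1 * B1          # group-by-sex interaction contrast

X = np.column_stack([np.ones(16), A1, B1, A1B1])   # add the constant term
b, *_ = np.linalg.lstsq(X, ptest, rcond=None)      # ordinary least squares
print(np.round(b, 4))   # [70.5625 -4.6875  4.0625  3.3125], as in Table 6
```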

Table 6

Planned Comparison SPSS Printout

Variable   Mean     Std Dev   Variance
PTEST      70.563   27.315    746.129
A1           .000    1.033      1.067
B1           .000    1.033      1.067
A1B1         .000    1.033      1.067

Correlation

          PTEST    A1      B1      A1B1
PTEST     1.00     -.177   .154    .125

Variable(s) Entered on Step Number 3.. A1B1

Multiple R            .26588
R Square              .07069
Adjusted R Square    -.16163
Standard Error      29.44026

Analysis of Variance

Source        DF   Sum of Squares   Mean Square
Regression     3        791.18750     263.72917
Residual      12      10400.75000     866.72917

F = .30428 Signif F = .8218

Variable      B           SE B       Beta        T       Sig T
A1           -4.687500   7.360066   -.177235    -.637    .5362
B1            4.062500   7.360066    .153603     .552    .5911
A1B1          3.312500   7.360066    .125246     .450    .6607
(Constant)   70.562500   7.360066               9.587    .0000

The first half of the output furnishes basic descriptive statistics and the correlations between the contrasts. Notice that the correlation of A1 (group) with "ptest" equals -.177, the same result reported earlier in Table 3. Although SPSS prints a new summary for each variable entered (e.g., group, sex, group by sex), only the final summary, which includes all of the variables, is reproduced in Table 6. Refer to the column of T values: squaring these values yields the F statistics reported in the two-way ANOVA (e.g., (-.637)² = .41 for group), demonstrating that multiple regression computes the same statistics as an ANOVA without requiring the predictor variables to be nominally scaled.

As stated earlier, a t-test may tell a researcher that two means differ, but regression can inform the researcher more precisely how the variables relate to the dependent variable. In regression the researcher can determine what part of the dependent variable (Y) is explained (Y') by the independent variables and what part is unexplained (error). The Venn diagram in Figure 1 illustrates this concept in terms of the example presented earlier. The Y' area is a synthetic variable which describes the total area explained by the three variables ("A1", "B1", "A1B1").

_____________________________

Insert Figure 1 about here.

_____________________________

As you may notice, because the contrasts are orthogonal (uncorrelated), the correlation of each contrast with Y equals its beta weight, and the squared correlations sum to R² (.177² + .154² + .125² ≈ .071). The Y' area can also be referred to as R². An R² of .07069 is reported in the regression output, indicating that about 7% of the variance can be explained by the predictors. The Venn diagram can help a researcher visualize this percentage of variance explained by the predictors.

Ultimately regression provides the researcher with an equation which gives the best possible prediction of Y' for the sample data. The basic linear equation for regression is:

Y' = a + b1(A1) + b2(B1) + b3(A1B1)

In standardized form, the regression equation would be:

Y' = β1(ZA1) + β2(ZB1) + β3(ZA1B1).

See Thompson (1992) for a further discussion of beta weights and structure coefficients in terms of interpreting the results of a regression equation. Since the variables group and sex are not in z-score form, the appropriate equation here is the unstandardized regression equation:

Y' = 70.56 - 4.69(A1) + 4.06(B1) + 3.31(A1B1).

The B weights are given in Table 6. For this sample, the equation provides the best possible prediction of Y', the reading posttest score, given the group condition (control/experimental) and sex (male/female). Hence, the researcher is able to make more informed decisions about the contribution of each variable in relation to the dependent variable. While these weights could be constructed from the statistics in an ANOVA analysis, regression provides them without any further computation and does not require the researcher to dichotomize variables.
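To see the equation in action, consider a control-group male, which under the coding sketched earlier has A1 = -1, B1 = -1, and therefore A1B1 = +1:

```python
# Sketch: predicted posttest score for a control-group male (A1=-1, B1=-1).
y_hat = 70.56 - 4.69 * (-1) + 4.06 * (-1) + 3.31 * (1)
print(round(y_hat, 2))   # 74.5, which is that cell's mean posttest score
```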

Summary

To conclude, there are many similarities across univariate analyses, and correlation is the link that ties them together: regression is the general linear model that acts as an umbrella for all of these analyses. That is, all analyses are correlational, even though not all designs are.

References

Cliff, N. (1987). Analyzing multivariate data. San Diego: Harcourt Brace Jovanovich.

Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426-443.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Daniel, L. (1989, March). Commonality analysis with multivariate data sets. Paper presented at the annual meeting of the American Educational Research Association, San Francisco. (ERIC Document Reproduction Service No. ED 314 483)

Elmore, R., & Woehlke, P. (1988). Statistical methods employed in the American Educational Research Journal, Educational Researcher, and Review of Educational Research from 1978 to 1987. Educational Researcher, 17(9), 19-20.

Falzer, P. (1974). Representative design and the general linear model. Speech Monographs, 41, 127-138.

Kerlinger, F. N., & Pedhazur, E. J. (1973). Multiple regression in behavioral research. New York: Holt, Rinehart, and Winston.

Maxwell, S., Camp, C., & Arvey, R. (1981). Measures of strength of association: A comparative examination. Journal of Applied Psychology, 66(5), 525-534.

Pedhazur, E. J. (1982). Multiple regression in behavioral research: Explanation and prediction (2nd ed.). New York: Holt, Rinehart, and Winston.

Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. The Journal of Experimental Education, 61(4), 334-349.

Statistical Package for the Social Sciences (SPSS) [Computer software]. (1995). Chicago, IL: SPSS Inc.

Tatsuoka, M. (1975). The general linear model: A "new" trend in analysis of variance. Champaign, IL: Institute for Personality and Ability Testing.

Thompson, B. (1986). ANOVA versus regression analysis of ATI designs: An empirical investigation. Educational and Psychological Measurement, 46, 917-928.

Thompson, B. (1992, April). Interpreting regression results: Beta weights and structure coefficients are both important. Paper presented at the annual meeting of the American Educational Research Association, San Francisco. (ERIC Document Reproduction Service No. ED 344 897)

Willson, V. (1980). Research techniques in AERJ articles: 1969 to 1978. Educational Researcher, 9(6), 5-10.
