DR. BARRY E. LANGFORD

Associate Professor of Marketing

FGCU

DR. L's MARKETING RESEARCH LECTURE NOTES Ch. 13 through 15

MAR 3613 / 6646

 


CHAPTER 13

DATA PROCESSING AND FUNDAMENTAL DATA ANALYSIS;

DATA ANALYSIS: STATISTICAL TESTING OF DIFFERENCES

I. Validation & Editing (This is called QUALITY CONTROL)

A. Validation - the process of ascertaining that interviews actually were conducted as specified.

Done to determine the validity (correctness) of the interview process.

B. Editing - the process of ascertaining that questionnaires were filled out properly and completely.

Done to check for interviewer/respondent mistakes.

1. Did the interviewer ask all questions?

2. Were skip patterns followed?

SKIP PATTERNS - are requirements to pass over certain questions based upon the respondent's answer to a previous question.

3. Responses to open-ended questions are checked for the quality of responses, which shows the quality of the interviewer's effort, among other things.

II. Coding - the process of grouping and assigning numeric codes to the various responses to questions.

Closed-ended questions are usually precoded on the questionnaire.

However, open-ended questions cannot be precoded because the researcher does not know what the responses will be until the survey is performed and the responses edited and analyzed.

A. Coding Process For Open-Ended Questions:

1. List all forms (types) of responses;

2. Consolidate (group) responses into homogeneous groups;

3. Set codes for each group;

4. Enter codes for each respondent into a computer for statistical analysis.
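As a rough illustration (not from the text), the four coding steps above might look like this in Python. The question, the responses, the group names, and the numeric codes are all made up for the example; the grouping itself is always the researcher's judgment call.

```python
# Hypothetical open-ended answers to "Why do you shop here?", keyed by questionnaire #.
responses = {
    101: "low prices",
    102: "close to home",
    103: "cheap",
    104: "friendly staff",
    105: "near my house",
}

# Step 1: list all distinct forms (types) of responses.
distinct = sorted(set(responses.values()))

# Step 2: consolidate the responses into homogeneous groups (assumed grouping).
groups = {
    "low prices": "PRICE",
    "cheap": "PRICE",
    "close to home": "LOCATION",
    "near my house": "LOCATION",
    "friendly staff": "SERVICE",
}

# Step 3: set a numeric code for each group (assumed codes).
codes = {"PRICE": 1, "LOCATION": 2, "SERVICE": 3}

# Step 4: the code that would be entered into the computer for each respondent.
coded = {rid: codes[groups[text]] for rid, text in responses.items()}
print(coded)
```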

III. Data Entry - physically entering numerical values into the computer.

Normally done directly from the questionnaires. Sometimes the data are first transferred to a computer coding sheet, but this extra transfer step can introduce errors that may not be caught and corrected.

A. Types Of Data Entry:

1. Intelligent Data Entry Systems (or Devices) - the logical checking of information by the machine (computer) as the data are entered into the data entry device (computer), which is programmed to check for and avoid certain types of common errors at the point of data entry.

Sometimes a second machine is attached to the data entry device to perform this checking routine.

Either way, the computer checks for data entry errors such as invalid or wild codes and violations of skip patterns.

2. Dumb Data Entry - no automatic checking by the computer; all checks are done manually by data entry personnel and the researchers in an iterative fashion to eliminate ALL data entry errors. This is used less frequently than Intelligent Data Entry.

B. Data Entry Process (not really in the book):

1. The questionnaire's code # is first entered (from the upper right corner of its first page) in the first column of the data matrix in the data storage medium;

2. That questionnaire's response to each question/statement (item) is then entered on the same line (which is called a record) in the matrix in its designated column(s).

3. Each subsequent questionnaire is then entered into the matrix in its individual record in the same order that it was received by the researcher.

4. If Dumb Data Entry was used, another check of all entered data is then conducted for all questionnaires that will be included in the study; the researcher manually checks all entries against each questionnaire's responses to all items.

This entire manual checking process is then repeated by at least two other people for every response before the data are submitted to analyses.

IV. Machine Cleaning Of Data - if a checking program is available, a final computerized error check of tabulated data is then performed using ERROR CHECKING ROUTINES, which are computer programs that accept logic instructions from the user to check for errors in the data.

So, these routines check for the presence of various conditions that may have been violated.
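A minimal sketch of what such an error checking routine could look like, assuming a made-up two-question layout (q1 and q2) with a skip pattern. The valid codes and the records are hypothetical; a real routine would take its logic instructions from the actual questionnaire.

```python
# Assumed layout: q1 = owns a car? (1 = yes, 2 = no).
# q2 = brand code 1-4, asked only if q1 == 1; otherwise it must be blank (None).
VALID_CODES = {"q1": {1, 2}, "q2": {1, 2, 3, 4, None}}

records = [
    {"id": 1, "q1": 1, "q2": 3},
    {"id": 2, "q1": 2, "q2": None},
    {"id": 3, "q1": 7, "q2": None},  # wild (invalid) code on q1
    {"id": 4, "q1": 2, "q2": 4},     # q2 answered although it should have been skipped
]

def check_record(rec):
    """Return a list of problems found in one record (empty list = clean)."""
    problems = []
    for item, valid in VALID_CODES.items():
        if rec[item] not in valid:
            problems.append(f"wild code {rec[item]!r} on {item}")
    if rec["q1"] == 2 and rec["q2"] is not None:
        problems.append("skip pattern violated: q2 answered although q1 == 2")
    return problems

# Keep only the records that have at least one problem.
errors = {rec["id"]: check_record(rec) for rec in records if check_record(rec)}
for rec_id, problems in errors.items():
    print(rec_id, problems)
```

Each flagged questionnaire would then be pulled and checked against the original paper copy.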

V. Tabulation of Survey Results - is the next research step.

A. One-Way Frequency Table - a table showing the number of responses to each possible answer of a survey question, and the percentage (proportion) of respondents who gave each possible response to the question.

This analysis step is called FREQUENCIES ANALYSIS. It is followed often by a crosstabulation analysis.
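For instance, a one-way frequency table can be sketched in a few lines of Python. The ten coded answers below are invented for illustration.

```python
from collections import Counter

# Hypothetical coded answers from 10 respondents to one question.
answers = [1, 2, 2, 3, 1, 2, 3, 3, 2, 2]

counts = Counter(answers)
n = len(answers)

# For each possible answer: (number of responses, percentage of respondents).
table = {code: (counts[code], 100 * counts[code] / n) for code in sorted(counts)}

for code, (freq, pct) in table.items():
    print(f"code {code}: n = {freq:2d}  ({pct:.1f}%)")
```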

B. Crosstabulation - examination of the responses to one question (a categorical variable) relative to responses to one or more other questions (categorical variables).

Crosstabulation combines two or more categorical variables into one table to show frequencies and percentages (proportions), (and the percentages are based on column totals).

The purpose is to study some basic relationships among variables.

Even though this is a low level of research, it is still a powerful and easily understood approach to the SUMMARIZATION AND ANALYSIS OF SURVEY RESEARCH RESULTS.

Unfortunately, many marketing research projects stop here in their analyses.
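A small sketch of a crosstabulation with percentages based on column totals, using invented data (gender as the column variable, a yes/no answer as the row variable):

```python
from collections import Counter

# Hypothetical (gender, answer) pairs, one per respondent.
data = [("M", "yes"), ("M", "no"), ("M", "yes"), ("F", "no"),
        ("F", "no"), ("F", "yes"), ("F", "no"), ("M", "yes")]

cell = Counter(data)                        # frequency in each cell
col_totals = Counter(g for g, _ in data)    # column (group) totals
cols = sorted(col_totals)
rows = sorted({a for _, a in data})

# Percentage of each column (group) that gave each answer.
col_pct = {(g, a): 100 * cell[(g, a)] / col_totals[g] for g in cols for a in rows}

for a in rows:
    line = "  ".join(f"{g}: {cell[(g, a)]} ({col_pct[(g, a)]:.0f}%)" for g in cols)
    print(f"{a:>3}  {line}")
```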

VI. Graphic Representations of Data

A. Line Charts

B. Pie Charts

C. Bar Charts

1. Plain (Figure 13.4)

2. Clustered (Figure 13.6)

3. Stacked (Figure 13.7)

4. Multiple Row, Three Dimensional (Figure 13.8)

VII. Descriptive Statistics -- statistical measures which summarize the main, basic characteristics of large sets of data.

A. Measures of Central Tendency -- describe the centrality of a frequency distribution:

1. Mean - the sum of the values for all observations of a variable, divided by the # of observations.

The mean is simply the arithmetic average of the observations.

The mean only has "real" value when calculated from interval or ratio scaled (metric) data.

The mean is said to be of no statistical value for nominal or ordinal scaled (nonmetric) data developed from nonmetric data scales (measurements).

2. Median - The observation below which 50% of all observations fall. It is used often for variables such as income where the arithmetic mean would be abnormally affected by a few extreme values.

The median is said to be worthless only for nominal data.

3. Mode - the value that occurs most frequently. It is the value that has the highest frequency in a frequency distribution.

The mode can be calculated for and represent any type of data.
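To see why the median is preferred for variables like income, here is a short sketch using Python's standard statistics module. The incomes (in thousands) are made up, with one extreme value on purpose.

```python
import statistics

# Hypothetical incomes in $000s; the last value is an extreme observation.
incomes = [28, 30, 32, 35, 35, 40, 150]

mean = statistics.mean(incomes)      # pulled upward by the extreme value
median = statistics.median(incomes)  # middle observation, unaffected by it
mode = statistics.mode(incomes)      # most frequent value

print(mean, median, mode)
```

Here the single income of 150 pulls the mean well above what a typical respondent earns, while the median and mode stay with the bulk of the data.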

B. Measures of Dispersion -- The variability of data around the mean in a frequency distribution. It indicates how spread out the data are.

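Note that a larger sample size (N) does not by itself make the data less variable; what shrinks as N grows is the variability of the sample mean from sample to sample (its standard error, which is S divided by the square root of N).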

1. Sample Standard Deviation for a variable (S) - the square root of the sample variance; that is, the square root of the quantity [the sum of the squared deviations of all observations from the mean of those observations, divided by the degrees of freedom (which is the number of observations minus 1)].

2. Sample Variance for a variable (S²) - the sum of the squared deviations of all observations from the mean of those observations, divided by the degrees of freedom (the number of observations minus 1).

3. Range for a variable - the maximum value for a variable, minus the minimum value for that variable.
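The three measures of dispersion can be computed by hand as follows. The scores are invented; the point is only to show where the "n minus 1" enters.

```python
import math

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

n = len(scores)
mean = sum(scores) / n

# Sum of squared deviations of all observations from their mean.
ss = sum((x - mean) ** 2 for x in scores)

variance = ss / (n - 1)               # divide by degrees of freedom (n - 1)
std_dev = math.sqrt(variance)         # square root of the variance
value_range = max(scores) - min(scores)

print(variance, std_dev, value_range)
```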

DATA ANALYSIS: STATISTICAL TESTING OF DIFFERENCES (Chapter 13 -- Continued)

I. Statistical Significance

Making statistical inferences refers to generalizing (applying) the statistical results from a small sample to the entire population from which that sample was drawn.

That is, we want to be able to confidently assume that the population would give us virtually the same responses as did the sample. This is often referred to as GENERALIZING the sample results to the population of interest.

However, there are almost always some mathematical differences between the responses of the sample and that of the population (if it were also surveyed),

or between two samples from the same population, or between any two groups;

but this does not mean that the mathematical differences have any real meaning to a researcher.

It is only when the mathematical differences between the groups' responses concerning a variable are "statistically" significant that they matter, because the presence of a statistically significant difference suggests that we cannot generalize the sample results to the population of interest, or to any other group of interest.

 Finally, if a difference "is" statistically significant, a practitioner (such as a marketing manager) can legitimately choose to ignore the difference, if the absolute mathematical difference is so small that the manager's financial risk of relying on the numbers is almost nil.

A. Mathematical Differences - only show that the numbers (responses) are not exactly the same.

This does not, ceteris paribus, suggest that the mathematical difference is either important or statistically significant. It may be either or both, but simply being different is not a research problem in itself. It is normal and almost always exists.

B. Statistical Significance - if a particular mathematical difference is large enough to be unlikely to have occurred simply due to chance (a sampling error), then the difference is statistically significant.

This "can be" a meaningful finding in marketing research.

C. Managerially Important Differences - a mathematical difference, whether statistically significant or not, may or may not be important to management.

For instance, a small and statistically significant difference between consumers' responses to two types of packages may be of no concern to marketing managers who view the difference to have no monetary value.

It is only LARGE AND SIGNIFICANT differences that must never be ignored by marketers.

II. Hypothesis Testing

A. Hypotheses - are specific assumptions or theories that a researcher or manager makes about some characteristic of the population under study.

It is sometimes called the "Research Hypothesis."

There are "basically" four types of hypotheses (but notice how similar even these are):

1. Ho: There is no difference between the mean of the sample and the mean of the population (or other sample). [A=B]

Ha: There is a difference. [A not= B]

2. Ho: The population (or other sample) mean equals the sample mean. [A=B]

Ha: The population (or other sample) mean does not equal the sample mean. [A not= B]

3. Ho: The population (or other sample) mean is different than the sample mean. [A not = B]

Ha: There is no difference. [A = B]

4. Ho: The population mean is 6. [A = 6]

Ha: The population mean is some other number. [A not= 6]

B. In HYPOTHESIS TESTING, the researcher determines whether a hypothesis concerning some characteristic of the population is valid (i.e., correct enough not to change the researcher's mind about that characteristic).

A statistical hypothesis test allows us to calculate the probability, called P-value, of observing a result at least as extreme as the one we generated by our research on a sample, if the stated null hypothesis is actually true.

We usually want that P-value to be less than .05 (5%) in order to reject the null hypothesis (which is the result we hope we get from all hypothesis tests). [More on this throughout the remainder of the text material]

Now, back to a discussion of differences between the expected (hypothesized) population value for a variable and the actual value observed for that same variable from a sample of that population.

That is, we are discussing having a Null Hypothesis that there is "no difference" (the expected value) and finding there is a difference per our sample, or having a Null Hypothesis that a value is 6 and it really is some other number like 15.

Remember: Generating a p < .05 (based on our choosing an alpha = .05) calls for us to reject the truth of our Null.

There are 2 explanations for observing a difference between our hypothesized value and a particular research result (value) from a sample. Either,

1. The hypothesis is true, and the observed difference in the sample results is likely enough (i.e., p = or > .05) due to chance (which is a sampling error); Or,

2. The hypothesis is most likely false (i.e., p < .05), and the true value is some other value, other than the one hypothesized.

[p < .05 calls for us to reject the Null that the value is 6, for instance, and accept the possibility that the real value is some other number above or below 6.]

Why do we have only these two explanations for observing a difference?

Because your research design and implementation reasonably eliminated the only other possible source of difference - bias.

Recall that the observed difference in our sample result versus that specified in the Null can only be attributed to three things: real difference, bias-created difference, and chance difference.

Thus, when we eliminate bias, any observed difference must either be from chance or it must be real.

FINALLY, IF THERE IS LESS THAN A 5% CHANCE THAT A DIFFERENCE THIS LARGE WOULD ARISE BY CHANCE ALONE WHEN THE NULL IS TRUE, WE CONCLUDE WITH AT LEAST 95% CONFIDENCE THAT IT IS REAL.

SO, WE REJECT THE NULL (THAT THERE IS NO DIFFERENCE, FOR INSTANCE) AND ACCEPT THE ALTERNATE HYPOTHESIS (THAT THERE MAY BE A REAL DIFFERENCE).

In other words, we first attempt to eliminate bias as an explanation for observed difference, then we statistically test to see if there is at least a 95% probability (based on our chosen alpha = .05) that the observed difference is real. If there is at least this probability of the difference being real, we can confidently state that the observed difference is, in fact, real. And we state this by rejecting a Null that states that there is no difference, and accepting the Alternate that there is a difference.

As a side note, remember that sampling error is composed of errors (i.e., differences) that come from just two things: bias and chance. Thus, we first used a combination of our research design and implementation to eliminate bias as an explanation for our observed difference, and then we used statistics to eliminate chance as an explanation. So, in these ways, we created a 95%+ probability that the observed difference must be real, because we have less than a 5% probability that it is due to sampling error.

 C. Initial Steps in Hypothesis Testing:

1. Stating the Hypothesis - it is the alternate hypothesis that the researcher wants to be true.

Remember, the null hypothesis is stated in a form that we want to be false.

Then, a statistically significant result, such as a P-value of less than .05, allows us the pleasure of rejecting the null in favor of the result we want -- the correctness of our alternate hypothesis.

2. Choosing The Appropriate Test Statistic -- [see the "Flow Diagram For Choosing A Statistical Test" in the notes on page 48.]

3. Developing A Decision Rule

A. Level of Significance -- we most often choose a significance level (called ALPHA) of .05 (which is 5%) as our criterion for testing a hypothesis about a variable.

Our chosen level of significance (alpha) is the maximum risk we are willing to accept in making a Type I error.

Our specific purpose for setting this criterion (alpha level) is to determine whether the difference between our observed result (such as an actual sample mean score for all responses for a variable) and its expected value (which is the mean score we stated [hypothesized] in our null hypothesis) could have occurred by chance (a sampling error) less than 5 times out of 100.

If our hypothesis test shows this to be true, we have determined that there is less than a 5% chance that we would be wrong to reject the null hypothesis in favor of accepting our alternate hypothesis.

This result would be shown by statistically developing an "actual" significance level (called P-value) of less than the cutoff point (called alpha) that we chose as our criterion for testing the null hypothesis.

4. Test Of Significance.

The 4th step in a hypothesis test is actually finding out the results of our research -- significant or not significant. In our "test of significance," if our calculated P-value (actual significance level) is less than .05 (or any other chosen alpha level), the difference we observed in our sample is considered statistically significant (i.e., is a real difference).

This finding causes us to reject the null hypothesis in favor of our alternate hypothesis. This is the result we wanted when we conducted the test of significance for the variable.

This significant result indicates that the probability that the observed sample result (such as an observed sample mean that is different than the mean we hypothesized for another sample or a population) is due to chance (a sampling error) is less than 5%.

Since the observed difference is not likely to be due to chance, we can assume that the observed difference is a "real" difference between our sample mean and the one we specified (hypothesized) in the null hypothesis.

Thus, we confidently reject the null hypothesis in favor of our alternate hypothesis.

Now think of this in terms of a simple example. Let's assume that our null hypothesis states that there is no difference between the mean of our sample and the mean of some other sample or the population, and our alternate hypothesis states that there is a difference. Then our hypothesis test of our sample result shows a significant difference because it generated a P-value less than .05.

This result causes us to reject the null hypothesis that there is no difference, and accept the alternate hypothesis that there is a difference.

In other words, we accept our alternate hypothesis because there is too great a probability that the acceptance of the null would be committing a Type II error (or Beta error),

which is defined as the acceptance of a false null hypothesis (of "no difference" in this example).

Now look at this in reverse, but for an example where the Null states a "value" for the mean rather than a "no difference" specification.

If the observed difference of our sample mean from the mean specified in our null hypothesis has a "greater" than 5% probability of being due to chance, the test result is considered not significant, simply because there is "less" than a 95% chance that we would be correct to say that the observed difference actually exists (i.e., that there is = or > a 5% probability that the observed difference is due simply to chance).

Thus, we will not reject the null hypothesis because we have too great a chance (> 5%) of incorrectly rejecting a true null hypothesis,

which is called committing a Type I Error (or Alpha Error).

But a word of caution: this not-significant result does "not" mean that the null hypothesis that we accept is true.

It only means that our data does not provide us with enough evidence to confidently (i.e., > a 95% probability) reject it.

In other words, in terms of our example, if there is more than a 5% chance that the difference we observed between the expected (hypothesized) mean and our actual sample mean is due to chance, then we do not have enough confidence that the difference is real.

Thus, we do not reject the possibility that there is really no difference between the means stated in the Null and that found in the sample. That is, we do not reject the null hypothesis that there is "no difference between the means," or that the real "mean = 6" as hypothesized in our other example.

[I know how confusing this is. It is important that you review this material many times until you are certain that you fully understand it, because statistical procedures are conducted most often in order to test hypotheses about variables.]
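To make the reject / do-not-reject logic concrete, here is a minimal sketch of a two-tailed test of the Null "the population mean is 6". It assumes a large sample so that a Z-test (normal distribution) is a reasonable stand-in; with a small sample a t-test would normally be used instead. The function name z_test_mean and both data sets are invented for the example.

```python
import math
from statistics import NormalDist, mean, stdev

def z_test_mean(sample, hypothesized_mean, alpha=0.05):
    """Two-tailed large-sample Z-test of Ho: population mean == hypothesized_mean."""
    n = len(sample)
    std_error = stdev(sample) / math.sqrt(n)
    z = (mean(sample) - hypothesized_mean) / std_error
    # Probability of a result at least this far from the Null value, if the Null is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha   # True in the last slot means "reject the Null"

# Hypothetical sample of 36 ratings whose mean is 6.5.
sample = [5, 6, 7, 8] * 9
z, p_value, reject = z_test_mean(sample, 6)
print(round(z, 2), round(p_value, 4), reject)

# Hypothetical sample whose mean is exactly 6: nothing to reject.
z0, p0, reject0 = z_test_mean([5, 6, 7] * 12, 6)
print(z0, p0, reject0)
```

In the first case P-value < .05, so we reject the Null that the mean is 6. In the second case we do not reject it, which (as cautioned above) is not proof that the Null is true.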

B. The Cost Of Committing Type I and Type II Errors.

1. If a marketer commits a Type II error, the actual monetary loss to the firm may be relatively unimportant because to accept a false null hypothesis of no difference usually results in the marketer making no changes in the firm's products or operations. Thus, s/he does not undertake the new project which avoids monetary investment and risk. The only real cost of not doing something is an unknown amount of lost opportunity cost of not engaging in some strategy that "may" have produced a profit.

2. In contrast, if the marketer commits a Type I error, the monetary loss to the firm most likely will be substantial because rejection of a true null hypothesis (such as there is no difference between our product and that of the competition) causes the marketer to think there is a real difference and thus to invest time and money in a project that may fail if, in fact, the two products are not different.

For instance, if the marketer implements the manufacture and sale of a new product thinking that it is better than the competition (which was the essence of the alternate hypothesis which s/he accepted in committing a Type I error), and the product really isn't different (remember that the null of no difference was REJECTED in error), then the marketer may lose money on the introduction of the product which, in reality, is not different from that of the competition.

In this example, the TYPE I error was committed because the marketer spent resources to take marketing actions based upon some observed difference between the products when, in fact, the discovered difference occurred "by chance" rather than due to real differences between the two products.

III. Hypothesis Testing

A. Independent Samples - samples in which the measurement of a variable in one sample has no effect on the measurement of the same variable in the other sample, whether or not the samples were drawn from the same population.

B. Related Samples - samples in which the measurement of a variable in one sample may influence the measurement of the same variable in the other sample.

For example, experimental designs (experiments) often have this problem when they measure the same sample using the same measures before and after treatments to the predictor variable(s).

The problem is that the first measurement often affects the second measurement of the same participants due to their ability to learn from the first measurement and then adjust their responses to the second measurement.

C. Degrees of Freedom - the number of observations in a statistical problem that are not restricted (i.e., that are free to vary).

df = # of observations minus the # of assumptions or constraints necessary to calculate the statistic.

For example, if you know the total # of observations and the mean, and you are given the values of all but one of those observations, you can calculate the value of that last observation.

That is, the value of that last observation is "not free to vary" from what it really is. So, ceteris paribus, df = n - 1, in this case.
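The "not free to vary" idea can be checked with a few made-up numbers:

```python
n = 5                      # total number of observations
mean = 10                  # known mean of all 5 observations
known = [8, 12, 9, 11]     # hypothetical values of all but one observation

# Since mean = total / n, the total is fixed, and so is the last observation.
last = n * mean - sum(known)
df = n - 1                 # only n - 1 observations were free to vary

print(last, df)
```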

IV. Goodness of Fit -- refers to the fit of the observed distribution of responses (or pattern of data) on a variable in our sample in relation to either the expected distribution of responses for the same variable for the population of interest or the observed distribution of responses for the same variable in some other sample.

A. Chi-square (χ²) -- tests the "goodness of fit" between the observed distribution (which is a pattern of frequencies of the observed values we found in our sample), and the expected distribution of responses to the same variable.

In other words, the Chi-square statistic determines whether an "observed" pattern of responses corresponds to (or fits) an "expected" pattern of responses.

If Calculated Chi-square value is > Table Chi-square value, reject the null hypothesis that "there is no difference between the variable's frequencies of responses in each cell" of a matrix of the responses to the statement (or question) pertaining to that variable.

In contrast, if Calculated (Calc.) Chi-square value is < Table (or Critical, or Crit) Chi-square value, don't reject the null hypothesis.

[Note: This same > < reject/don't reject rule applies to all significance tests]

Chi-square tests the difference between two frequency distributions, not between the means of those distributions. Thus, Chi-square tests are commonly used with nonmetric data because the means of the responses to nonmetric (nominal and ordinal) data have no meaning.

In fact, Chi-square is used most often as a nonparametric test for statistically examining the distributions of nominal (nonmetric) data. However, Chi-square also can be used to test the distributions of crosstabulated categorical data.

In addition to Chi-square, there also exist other nonparametric tests of ordinal (nonmetric) data, such as the Kolmogorov-Smirnov test described briefly below.

B. Five Steps in the Chi-square Test of "Two" Independent Samples:

1. State the null and alternate hypotheses.

The null will suggest that there is no relationship between the frequencies of responses of the two samples (such as a male sample versus a female sample) with respect to the variable of interest.

The alternative suggests that a significant relationship does exist between the frequencies (of responses) of the two samples (groups). In other words, the distributions of responses (the frequencies) differ between the two groups.

2. Produce a k X r crosstabulation or contingency table of the observed sample frequencies.

k is columns, which represent the 2 samples (groups);

r is rows, which represent the treatments (or conditions) applied to those two groups.

Then use the marginal totals and the total for the entire table (N) in the next step.

3. Determine the expected frequency for each cell in the contingency table.

4. Calculate the Chi-square value.

5. Find the Crit Chi-square value in the table, based upon your chosen alpha (level of significance) and df = (r-1)(k-1).

If Calc value < Crit value, the null is "not" rejected, suggesting that there is no significant difference between the two groups (k's) in terms of the conditions or treatments (r's). So, the two groups are similar.

The purpose of this test is to determine whether there is a statistical difference between the responses of the two groups (samples of males and females) with respect to the variable of interest, where each group was measured in such a way that each group's responses were independent of the responses provided by the other group.

Two examples of groups that often are measured and tested in this way are a sample of men v. a sample of women, or shoppers v. nonshoppers.
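The five steps can be sketched as follows for a 2 x 2 table. The observed frequencies are invented, and the small table of critical values (alpha = .05) is copied from a standard Chi-square table rather than computed.

```python
# Step 2: observed k x r table. Columns = the 2 groups (men, women);
# rows = response categories (e.g., prefer package A, prefer package B).
observed = [
    [30, 20],
    [20, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)

# Step 3: expected frequency for each cell = (row total x column total) / N.
expected = [[rt * ct / N for ct in col_totals] for rt in row_totals]

# Step 4: Calc Chi-square = sum over all cells of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Step 5: compare with the Crit value for alpha = .05 and df = (r-1)(k-1).
df = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}  # from a standard table
reject_null = chi_square > CRITICAL_05[df]

print(chi_square, df, reject_null)
```

Here Calc value (4.0) > Crit value (3.841), so the Null is rejected: the two groups' distributions of responses differ.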

C. Kolmogorov-Smirnov Test - is a test of the goodness of fit between the observed distribution and the expected distribution of values. It compares the two distributions of "Cumulative Frequency Proportions."

Thus, the K-S test is similar to the Chi-square test, but K-S is a nonparametric test used for ordinal (nonmetric) data, while the Chi-square test was intended for nominal data (but often is used for all types (levels) of data).
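A rough sketch of the K-S idea, comparing cumulative frequency proportions for a five-point ordinal rating. The counts are invented, the expected distribution is assumed to be equal proportions in every category, and the critical value uses the common large-sample approximation 1.36 divided by the square root of n for alpha = .05; for small samples the tabled K-S values should be used instead.

```python
import math
from itertools import accumulate

# Hypothetical observed counts for 5 ordered rating categories (n = 100).
observed = [10, 15, 25, 30, 20]
n = sum(observed)

# Assumed Null: responses are spread equally over the 5 categories.
expected_props = [0.2] * 5

# Cumulative frequency proportions for each distribution.
obs_cum = [c / n for c in accumulate(observed)]
exp_cum = list(accumulate(expected_props))

# K-S statistic D = largest absolute gap between the two cumulative distributions.
D = max(abs(o - e) for o, e in zip(obs_cum, exp_cum))

critical = 1.36 / math.sqrt(n)   # large-sample approximation, alpha = .05
reject_null = D > critical

print(round(D, 3), round(critical, 3), reject_null)
```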

V. Hypothesis Tests About Proportions (Percentages)

A. Test of One Proportion, One Sample -- a Z-Test of whether or not the mean proportion of a variable in a single sample is likely to accurately represent the mean proportion of that same variable for the population of interest.

For example, the Null and Alternate Hypotheses for a test of whether the population variable proportion is > than some important level (percentage) based on a single sample would look like this:

Ho: P =< 60% [where P is the symbol for the population proportion; so obviously, Ha: P > 60%]

In other words, it is a test of whether or not we can rely on the mean proportion (60%) we developed from the sample as being very close to the real mean of the population of interest. The test is calculated in 5 steps.
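A minimal sketch of this one-tailed Z-test for Ho: P =< 60%, using Python's standard library. The function name and the survey numbers (260 of 400 respondents) are hypothetical, and the normal approximation assumes a reasonably large sample.

```python
import math
from statistics import NormalDist

def one_proportion_z(successes, n, p0):
    """One-tailed Z-test of Ho: P <= p0 against Ha: P > p0."""
    p_hat = successes / n                      # observed sample proportion
    std_error = math.sqrt(p0 * (1 - p0) / n)   # standard error if the Null is true
    z = (p_hat - p0) / std_error
    p_value = 1 - NormalDist().cdf(z)          # upper-tail probability
    return z, p_value

# Hypothetical survey: 260 of 400 respondents (65%) said yes.
z, p_value = one_proportion_z(260, 400, 0.60)
print(round(z, 2), round(p_value, 4))
```

With these numbers the P-value is below .05, so the Null that the population proportion is 60% or less would be rejected.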

B. Test of Two Proportions, From Two Independent Samples -- a Z-Test of the difference between the proportions of people in two different groups (samples) that engage in a certain activity (a single variable) or have a certain characteristic (a single variable).

Ho: p1 - p2 =< 0 [where p is sample proportion]

Ha: p1 - p2 > 0 <---[incidentally, this shows it is a one-tailed test of significance].

We want to know if the difference between the two sample proportions is real (significant) and greater than zero, with respect to this variable.

This test is also calculated in 5 steps.
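A comparable sketch for two independent samples, using the usual pooled proportion to estimate the standard error under the Null. The function name and the sample numbers are made up for illustration.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """One-tailed Z-test of Ho: p1 - p2 <= 0 against Ha: p1 - p2 > 0."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # combined proportion, assuming the Null is true
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical: 120 of 250 in group 1 (48%) vs. 95 of 250 in group 2 (38%).
z, p_value = two_proportion_z(120, 250, 95, 250)
print(round(z, 2), round(p_value, 4))
```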

VI. P-Values & Significance Testing

In all of the above tests, we traditionally conducted manual tests of significance via an arbitrarily set, standard level of significance (or of allowable sampling error) called alpha.

Alpha was then used, along with the appropriate degrees of freedom, to select the critical value of the appropriate test statistic from the tables in the Appendices.

We next calculated the value of the appropriate statistic to see if it was > or < the standard, critical value.

We then rejected the null hypothesis when the calc value was greater than the crit value, which demonstrated the significance of the test, or we did not reject the null when the calc value was less than the crit value.

However, this did "not" tell us the "exact" probability of getting a calculated test statistic that large due to chance.

Fortunately, SPSS computes this exact probability for us, and it is called the P-VALUE. SPSS automatically prints the actual P-Value for each statistic you choose when you write and run your program.

The P-VALUE identifies the probability that such a large distance (or variation) between the hypothesized population parameter (as stated in the Null Hypothesis) and the observed sample test statistic could have occurred by chance.

THE SMALLER THE P-VALUE, THE SMALLER IS THE PROBABILITY THAT THE OBSERVED RESULT OCCURRED DUE TO CHANCE (which is a sampling error).

So, the smaller the P-value, the more confident we can be that the observed result in our sample is real, and therefore accurately represents the real value on that variable in our population of interest.
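The table approach and the P-value approach always lead to the same decision, as this small sketch shows for a one-tailed Z-test. The calculated Z of 2.04 is a made-up value.

```python
from statistics import NormalDist

alpha = 0.05
z_calc = 2.04   # hypothetical calculated test statistic

# Table approach: look up the critical value for the chosen alpha.
z_crit = NormalDist().inv_cdf(1 - alpha)      # about 1.645 for a one-tailed test
reject_by_table = z_calc > z_crit

# P-value approach: exact probability of a Z this large if the Null is true.
p_value = 1 - NormalDist().cdf(z_calc)
reject_by_p_value = p_value < alpha

print(round(z_crit, 3), round(p_value, 4), reject_by_table, reject_by_p_value)
```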

------------------------------------------------------------------------

CHAPTER 14

DATA ANALYSIS: BIVARIATE CORRELATION & REGRESSION

I. Bivariate Analysis of Association

A. Bivariate Analysis - statistical methods of determining the amount of association between two variables.

B. Independent (Predictor) Variable (X) - the variable believed to affect (not cause) the value of the dependent variable.

C. Dependent (Criterion) Variable (Y) - the variable whose value is believed to change in response to changes in the independent variable.

II. Bivariate Regression

A. Bivariate Regression Analysis - analysis of the strength of the linear relationship between two variables when one is considered the independent variable and the other the dependent variable.

B. Nature of the Relationship (see Figure 14.1, page 403) - In a scatter plot (or scatter diagram), the dependent variable Y is plotted on the vertical axis, while the independent variable X is plotted on the horizontal axis. The dots are the X-Y plots of all observations in a sample.

If the relationship looks like a straight line, or a fairly straight line can be drawn (fit) through the center of the plots of the observations, we say that the relationship looks linear, suggesting that linear regression is appropriate.

If the line is curvilinear (nonlinear), linear regression is inappropriate, but curve-fitting non-linear regression techniques are appropriate [not covered here].

The scatter plot also shows the point where the regression line meets the Y axis, and this is called the estimated Y-intercept which is noted as a-hat in linear regression.

The rise divided by the run is b-hat (the estimated regression coefficient) in linear regression.

1. Vary Directly (+) - means that a positive linear relationship exists between the two variables X and Y, as in Figure 14.1 (A) & (B).

2. Vary Indirectly (-) - means that a negative linear relationship exists between the two variables X and Y, as in Figure 14.1 (C).

3. Independent - means that no linear relationship exists between the two variables X and Y, as in Figure 14.1(F).

C. Least Squares Estimation Procedure - a mathematical procedure that fits a line to the data in a scatter diagram (scatter plot) that best represents the linear relationship between the 2 variables being investigated.

Its use requires (assumes) that the errors (e) are normally and independently distributed.

1. The Regression Equation -

Y = a + bX + e

where:

Y = Criterion (Dependent) Variable.

a = Estimated Intercept.

b = Regression Coefficient For Each Predictor (Independent) Variable (X). It is the slope of the regression line.

X = Predictor (Independent) Variable.

e = Error Term. It is the difference between the actual values of Y determined from the sample and the values of Y predicted by the regression line.

This is the official regression equation. However, the fitted equation that is actually used for prediction does not contain an error term (e).

The errors do not disappear. The average effect on Y of all variables not in the equation is absorbed into the intercept term (a), and whatever remains shows up for each observation as a residual (the actual value of Y minus the predicted value). Thus,

Y-hat = a + bX is the "functional" least squares regression equation in statistical packages like SPSS, where Y-hat is the predicted value of Y. This is the regression line discussed below.
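As a rough illustration of the least squares procedure, the sketch below computes a-hat and b-hat by hand for a bivariate regression. The five observations and the function name fit_line are hypothetical and chosen only for illustration; in practice SPSS reports these values for you.

```python
def fit_line(x, y):
    """Return (a_hat, b_hat) for the least squares line Y-hat = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products, and sum of squares of X, around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b_hat = sxy / sxx                # slope: rise divided by run
    a_hat = mean_y - b_hat * mean_x  # intercept: where the line meets the Y axis
    return a_hat, b_hat


# Hypothetical sample of five X-Y observations (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a_hat, b_hat = fit_line(x, y)

# Residual = actual Y minus the Y predicted by the fitted line.
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]

print(f"Y-hat = {a_hat:.2f} + {b_hat:.2f}X")
print("residuals:", [round(e, 2) for e in residuals])
```

The residuals of a least squares line with an intercept sum to zero, which is one quick check that the fit was computed correctly.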

D. The Regression Line - is estimated by the regression equation.

It is the line that best fits through the scatter diagram plots of the X-Y observations from the sample. [see Table 14.3 on p. 408, & Figure 14.3 on p. 409]

E. The Strength of the Association -- indicates how much (widely) the actual values of Y (determined by our sample) differ from the values predicted by our estimated model.

The Coefficient of Determination (r2) -- is a measure of the strength of the linear association between the Xs and Ys.

It shows the percent of the total variation in Y (the dependent or criterion variable) that is "explained" by the variation in X (the independent or predictor variable).

The Coefficient of Determination ranges from 0, suggesting no linear association between X and Y, to 1, suggesting a perfect linear association between X and Y.

For example: r2 = .72 suggests that 72% of the "total" variance in Y is explained by the variance in variable X.

Total Variance (total sum of squares --> SST) = Explained Variance (sum of squares due to regression ---> SSR) + Unexplained Variance (error sum of squares --> SSE).

SST = SSR + SSE

 r2 = Explained Variance divided by Total Variance

r2 = SSR / SST
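The decomposition SST = SSR + SSE can be checked numerically. The sketch below is a minimal example on a small hypothetical data set; the function name variance_decomposition is an assumption made for illustration.

```python
def variance_decomposition(x, y):
    """Return (SST, SSR, SSE, r2) for a bivariate least squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    y_hat = [a + b * xi for xi in x]

    sst = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    ssr = sum((yh - mean_y) ** 2 for yh in y_hat)           # explained by regression
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # left unexplained
    return sst, ssr, sse, ssr / sst


# Hypothetical sample (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sst, ssr, sse, r2 = variance_decomposition(x, y)
print(f"SST={sst:.2f}  SSR={ssr:.2f}  SSE={sse:.2f}  r2={r2:.2f}")
```

For these numbers the regression explains 60% of the total variation in Y and leaves 40% unexplained.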

The larger the coefficient of determination, the more the total variance in the criterion (dependent) variable is explained by our regression equation (i.e., by our predictor (independent) variable). This also means that less total variance is left unexplained.

Unexplained variance in the criterion variable may be the result of not including some other predictor variable(s) in our equation that may have some explanatory power.

In addition, some of the unexplained variance may simply be due to randomness in the behavior of our dependent variable, which, if true, would be impossible to explain by any predictor variables.

F. Statistical Significance of Regression Results

Now that we have determined the strength of the explanatory power of our developed regression equation (i.e., of our predictor variable), we can test the null hypothesis that "There is no linear relationship between X and Y" which, if true, would suggest that the equation does not have explanatory power.

We perform this test of statistical significance using the value of the F statistic generated by SPSS and shown on The Analysis Of Variance Table.

NOTE: The larger the F statistic, the smaller the P-value.

When the P-value is less than our chosen level of significance (i.e., smaller than our chosen alpha, such as .05), the results of this hypothesis test are considered significant.

A significant F statistic (also called the "value of F" or the "F ratio") causes us to reject the null hypothesis in favor of the alternate hypothesis that "There is a linear relationship between X and Y."

Note: This is almost always the result you will obtain when the coefficient of determination (r2) is large, provided the sample is not very small.
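To make the F test concrete, the sketch below computes the F ratio from the sums of squares as (SSR / k) divided by (SSE / (n - k - 1)), where k is the number of predictors. The SSR, SSE, and n values are hypothetical. The critical value 10.13 is the tabled F value for alpha = .05 with 1 and 3 degrees of freedom; SPSS reports the exact P-value instead, so the table lookup here is only a stand-in.

```python
def f_statistic(ssr, sse, n, k=1):
    """F ratio = mean square regression / mean square error."""
    msr = ssr / k              # regression degrees of freedom = k predictors
    mse = sse / (n - k - 1)    # error degrees of freedom = n - k - 1
    return msr / mse


# Hypothetical sums of squares from a bivariate regression on n = 5 cases.
F = f_statistic(ssr=3.6, sse=2.4, n=5)

# Tabled critical value for alpha = .05 with (1, 3) degrees of freedom.
F_CRITICAL = 10.13

significant = F > F_CRITICAL
print(f"F = {F:.2f}, reject the null hypothesis: {significant}")
```

Here r2 is .60, yet with only five observations the F ratio falls short of the critical value, so the null hypothesis is not rejected. With a larger sample the same r2 would usually be significant.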

III. Correlation Analysis - analysis of the degree to which changes in one variable are associated with changes in another.

A. Correlation -- is the measurement of the degree to which changes in one variable are associated with changes in another variable.

Correlation Coefficients (r) take values from:

-1 for perfect negative correlation, to 0 for no association, to +1 for perfect positive correlation.

B. Simple (or Bivariate) Correlation Analysis - refers to a correlation analysis of just two variables.

C. Pearson's Product Moment Correlation Coefficient (r) -- is the correlation statistic created for use with interval and ratio scaled data (metric data).

[In practice, however, many researchers also use it for ordinal (nonmetric) data.]
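A minimal sketch of how Pearson's r is computed from raw data follows. The data are hypothetical and the function name pearson_r is an assumption for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical metric data (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.3f}, r squared = {r ** 2:.2f}")
```

In the bivariate case, squaring r gives the coefficient of determination from the regression of Y on X.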

D. Spearman's Rank Order Correlation Coefficient (rs) -- is the correlation statistic created for use with ordinal scaled data (nonmetric data).

Its null hypothesis (no rank-order association) is tested using the t-test. It is sometimes also called the coefficient of rank correlation.
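The sketch below illustrates Spearman's coefficient using the common shortcut formula rs = 1 - 6(sum of d squared) / (n(n squared - 1)), which assumes there are no tied ranks, together with the t statistic used to test it (n - 2 degrees of freedom). The two sets of ranks are hypothetical.

```python
import math


def spearman_rs(rank_x, rank_y):
    """Spearman's rank order correlation (shortcut formula, no tied ranks)."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


def t_statistic(rs, n):
    """t statistic for testing the null hypothesis of no rank association."""
    return rs * math.sqrt((n - 2) / (1 - rs ** 2))


# Hypothetical rankings of six brands by two consumer groups.
rank_group_a = [1, 2, 3, 4, 5, 6]
rank_group_b = [2, 1, 4, 3, 6, 5]

rs = spearman_rs(rank_group_a, rank_group_b)
t = t_statistic(rs, n=len(rank_group_a))
print(f"rs = {rs:.3f}, t = {t:.2f} with {len(rank_group_a) - 2} df")
```

The computed t is then compared with the tabled t value for the chosen alpha, just as in any other t-test.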

MULTIVARIATE DATA ANALYSIS (Chapter 14 - Continued)

I. Multivariate Analysis - statistical procedures that simultaneously analyze multiple measurements (variables, or predictor variables) on each individual or object (i.e., the case) under study.

Types of Multivariate Analysis & Their Applications:

1. Multiple Regression Analysis -- enables the researcher to predict the level (or magnitude) of a criterion (dependent) variable based upon the levels of more than one predictor (independent) variable.

2. Multiple Discriminant Analysis -- enables the researcher to predict group membership on the basis of two or more independent variables.

3. Factor Analysis -- permits the analyst to reduce a set of independent (predictor) variables to a smaller set of factors (or composite variables) by identifying dimensions underlying the data.

4. Cluster Analysis -- is a procedure for identifying subgroups of individuals or items that are homogeneous within subgroups and different (heterogeneous) from other subgroups.

5. Perceptual Mapping (or Multidimensional Scaling) -- is appropriate when the goal is to analyze consumer perceptions of companies, products, brands, and so on.

6. Conjoint Analysis -- provides a basis to estimate the utility that consumers associate with different product features or attributes.

II. Multiple Regression Analysis (MRA) -- examines the relationship between a set of continuous (metric) predictor variables and a single continuous (metric) criterion variable.

MRA is a procedure for predicting the level or magnitude of a single criterion variable (Y) based upon the levels of more than one predictor variable (Xs).

A. General Equation: Y = a + b1X1 + b2X2 + ... + bnXn

B. Two Basic Purposes of Multiple Regression Analysis:

1. Prediction of the level of the dependent variable based upon given (observed) levels of the predictor variables.

2. Understanding the relationships between predictor variables and the criterion variable.
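As a sketch of how the general equation is estimated and then used for prediction, the example below fits Y = a + b1X1 + b2X2 by least squares for the two-predictor case and predicts Y for a new case. The data are hypothetical and were constructed to lie exactly on the plane Y = 1 + 2X1 + 3X2, so that the result is easy to check. The function names are assumptions, and real analyses would rely on SPSS or a similar package.

```python
def fit_two_predictors(x1, x2, y):
    """Least squares estimates (a, b1, b2) for Y = a + b1*X1 + b2*X2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(u * u for u in d1)
    s22 = sum(u * u for u in d2)
    s12 = sum(u * v for u, v in zip(d1, d2))
    s1y = sum(u * v for u, v in zip(d1, dy))
    s2y = sum(u * v for u, v in zip(d2, dy))
    det = s11 * s22 - s12 ** 2  # zero when X1 and X2 are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2


def predict(a, b_values, x_values):
    """Apply the general equation Y = a + b1*X1 + ... + bn*Xn."""
    return a + sum(b * x for b, x in zip(b_values, x_values))


# Hypothetical data built so that Y = 1 + 2*X1 + 3*X2 holds exactly.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]

a, b1, b2 = fit_two_predictors(x1, x2, y)
y_new = predict(a, (b1, b2), (6, 5))
print(f"Y = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2; predicted Y = {y_new:.2f}")
```

The fitted b values serve both purposes listed above: they are plugged into the equation for prediction, and their signs and sizes describe how each predictor relates to the criterion variable.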

C. Multiple Regression Measures:

1. Coefficient of Determination (r2) - measures the percentage of the variation in the criterion variable explained by variations in the predictor variables as a group.

2. Regression Coefficients (b values) - are values that indicate the effect of the individual predictor variables on the criterion variable, considering (statistically allowing for) the presence of the other predictor variables in the equation.

3. Dummy (Binary) Variables (0/1) - refers to the coding of dichotomous variables 0 or 1 so that they can be used in multiple regression analysis.

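This is necessary because regression analysis is designed for use with metric data (interval and ratio), while categorical variables are nonmetric (nominal). Dummy variables are used for variables such as gender, marital status, occupation, race, and so on. A variable with k categories needs k - 1 dummy variables; the category coded 0 on every dummy variable serves as the reference group.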

For Example: These Two Dummy Variables (X1 & X2) Are Created To Represent Three Races:

Race        X1   X2
Black        1    0
Hispanic     0    1
White        0    0
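The coding above can be sketched in a few lines. The function name dummy_code is hypothetical; the category left out of the dummy variables (White here, as in the example) is the one coded 0 on every dummy.

```python
def dummy_code(value, categories, baseline):
    """Return the 0/1 dummy variables for one respondent's category.

    One dummy is created for every category except the baseline,
    so k categories yield k - 1 dummy variables.
    """
    return [1 if value == c else 0 for c in categories if c != baseline]


categories = ["Black", "Hispanic", "White"]

# Build the coding table: each category maps to its [X1, X2] values.
coding = {c: dummy_code(c, categories, baseline="White") for c in categories}

for race, (x1, x2) in coding.items():
    print(f"{race:<10} X1={x1} X2={x2}")
```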

D. Potential Problems Interpreting Multiple Regression Analysis

1. Collinearity (Multicollinearity) - the degree to which the predictor variables are correlated among themselves.

Its presence in the data can inflate the standard errors of the estimated regression coefficients (b) and make them unreliable (unstable): their size, and even their sign, can change markedly when a predictor is added or dropped.

Thus, we cannot trust the individual b values in the presence of strong collinearity.

When any pair of predictor variables correlate with each other by .30 or greater, many analysts suggest that you check for distortions of the b values (the regression coefficients).

When such distortions exist, collinearity must be dealt with.
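The .30 rule of thumb can be applied by scanning the correlations among all pairs of predictor variables. The sketch below is one way to do this; the predictor names and values are hypothetical, and flag_collinear_pairs is an illustrative helper, not a function from SPSS or any other package.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def flag_collinear_pairs(predictors, threshold=0.30):
    """List predictor pairs whose correlation is at or above the threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(predictors.items(), 2):
        r = pearson_r(xa, xb)
        if abs(r) >= threshold:  # strong negative correlations count too
            flagged.append((name_a, name_b, round(r, 2)))
    return flagged


# Hypothetical predictor variables for five respondents.
predictors = {
    "income": [1, 2, 3, 4, 5],
    "age": [2, 1, 4, 3, 6],
    "visits": [4, 1, 3, 5, 2],
}

flagged = flag_collinear_pairs(predictors)
print(flagged)
```

Any pair that is flagged is a candidate for the distortion check described above.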

Two Strategies For Dealing With Multicollinearity:

a) One of the variables can be dropped from analysis; Or,

b) Factor analyze the correlated predictor variables to combine them into a smaller number of new, composite predictor variables (called factors) that are not highly correlated with each other.

These composite variables are then used in subsequent analyses (such as in regression analysis); or, these factors may be the end-result of the analysis.

2. Causation - cannot be determined by regression analysis.

Using this statistical procedure, you can only determine indicators of causation between predictor variables and the criterion variable.

Indicators of causation only suggest influence on the criterion variable, not causation.

3. Scaling of Coefficients - the magnitude of regression coefficients associated with various predictor variables can only be compared directly (literally) if they are scaled in the exact same units, or if the data have been standardized.
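To see why scaling matters, the sketch below rescales one predictor from dollars to thousands of dollars. The raw b value changes by a factor of 1,000, while the standardized coefficient (b multiplied by the standard deviation of X and divided by the standard deviation of Y) stays the same. The data, the b values, and the function name beta_weight are hypothetical.

```python
import statistics


def beta_weight(b, x, y):
    """Standardized coefficient: b re-expressed in standard deviation units."""
    return b * statistics.stdev(x) / statistics.stdev(y)


# Hypothetical criterion variable and one predictor in two different units.
y = [9, 8, 19, 18, 29]
income_dollars = [1000, 2000, 3000, 4000, 5000]
income_thousands = [v / 1000 for v in income_dollars]

# The same relationship expressed in each unit (hypothetical b values).
b_dollars = 0.002    # change in Y per one dollar of income
b_thousands = 2.0    # change in Y per one thousand dollars of income

beta_dollars = beta_weight(b_dollars, income_dollars, y)
beta_thousands = beta_weight(b_thousands, income_thousands, y)
print(f"raw b: {b_dollars} vs {b_thousands}")
print(f"standardized: {beta_dollars:.3f} vs {beta_thousands:.3f}")
```

Because standardized coefficients do not depend on the units of measurement, they can be compared across predictors in a way that raw b values cannot.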

 _____________________________________________________________________

 CHAPTER 15

COMMUNICATING THE RESEARCH RESULTS

No special notes here; the text is self-explanatory.

After this chapter, read GUIDELINES FOR WRITING A RESEARCH REPORT.

 END OF MARKETING RESEARCH "LECTURE" NOTES

