The Role of Assessment in

Research on Teaching and Learning Statistics

 

Joan Garfield

Department of Educational Psychology
University of Minnesota
332 Burton Hall, 178 Pillsbury Drive S.E.
Minneapolis, MN 55455

e-mail: jbg@tc.umn.edu

Phone: (612) 625-0337


Robert C. delMas

General College
354 Appleby Hall
University of Minnesota
128 Pleasant Street SE
Minneapolis, MN 55455

e-mail: delma001@tc.umn.edu

Phone: (612) 625-2076


Beth L. Chance

Department of Mathematics
University of the Pacific
Stockton, CA 95211

e-mail: bchance@uop.edu

Phone: (209) 946-3030


Paper presented at the AERA Annual Meeting

Montreal, 1999

 

Abstract

This paper examines the role of assessment in research studies focused on the teaching and learning of statistics at the undergraduate or graduate level. Some advantages and limitations of the types of assessment methods typically used in statistics education research studies are summarized. An alternative framework, based on the theory of conceptual change, is offered for conceptualizing assessment and its role in studies of statistics education. The framework is illustrated with a study of the impact of computer simulations on learning statistical inference, and examples of the types of assessment embedded in this ongoing research project are shared.

 

Background

Over the past 25 years, assessment has been used in several different ways in statistics education research projects at the postsecondary level. An examination of papers published in the Journal of Statistics Education, presented at AERA (particularly in this SIG for Educational Statisticians), and presented at other statistics education meetings, reveals some consistent patterns in terms of types of assessments used, the nature of the research questions used to structure the studies, and the types of results. Each of these is described in the following sections.

 

Types of instruments

    Measures of course performance

The assessment methods typically used in research studies conducted in undergraduate or graduate statistics courses are measures of course performance such as final exams, midterms, and quizzes (e.g., Giraud, 1997; Magel, 1998; Potthast, 1995). The advantages of using typical measures of course performance are clear: they are already part of the courses, they are developed to assess the desired learning outcomes, and students take them seriously and typically apply their best effort to demonstrate what they have learned. The disadvantages of using exams are that they often provide a shallow view of what students know and can do with their knowledge, and that they tend to focus more on formulas, definitions, and following prescribed procedures than on reasoning and problem solving. An additional concern is that students who have good study habits, attend class regularly, and do their homework assignments may do well on these performance measures without necessarily demonstrating higher levels of statistical thinking and reasoning.

 

    Measures of attitudes and anxiety

Several studies use measures of students' attitudes towards statistics or anxiety about statistics as dependent variables in their research studies. Negative attitudes and high levels of anxiety have often been associated with learning statistics. Therefore, citing increased positive attitudes or decreased anxiety as outcome measures may make sense to some researchers. However, most instructors would agree that their main goal as teachers is to help students learn statistics, whereas a secondary goal might be to improve students' attitudes. Therefore, if attitudes and anxiety are used as outcome measures they should only be examined in conjunction with assessments of learning outcomes. In addition, Gal, Ginsburg, & Schau (1997) point out that these types of studies are not actually designed to provide information about causal factors and sources of students' attitudes and beliefs. Gal, Ginsburg, & Schau suggest several approaches that can be taken to obtain this type of necessary information, e.g., interviews, focus group discussions, think-aloud protocols, and/or journals or histories of present or past mathematical experiences.

A second way that attitude and anxiety measures have been used is to predict or model student performance in statistics courses. Most published studies have used Likert-type scales or semantic differential scales designed to measure students' attitudes toward statistics as independent variables in the prediction of statistics performance. In a survey of research studies related to attitudes toward statistics, Mayer (1999) found that overall results ranged from attitudes accounting for little or none of the variance in performance (Wise, 1985; Perney & Ravid, 1990; Green, 1994) to attitudes accounting for 79% of the variance in performance (Woehlke, 1991). She concluded that the difference in these findings may be attributable to a number of variables, such as the time at which attitude measures were administered, the types of students in the study, the use of different measures, and the use of different subscales within measures. Some of these variables are apparent in the studies described below.

Some researchers (Waters et al., 1998; Perney & Ravid, 1990; Green, 1994; Wisenbaker, 1997) ran correlations at both the beginning and the end of the course. In most cases, the post-scores on the attitude scales correlated more highly with course performance than did the pre-scores (Waters et al., 1998; Green, 1994; Wisenbaker, 1997). In the case of the highest correlations, Woehlke (1991) gathered information from students in four levels of graduate statistics courses. The attitude scores of students in the course required only for educational measurement and statistics majors (also the course with the fewest students) correlated most highly with course performance. Mayer suggested that another reason for the differences in correlations between the attitude measures and course performance can be linked to the use of the attitude subscales or factors. For example, one measure, the ATS, has two subscales: Field (attitudes towards the field of statistics) and Course (attitudes towards the specific statistics course students are taking). Researchers have found consistently higher correlations when using the Course subscale as opposed to the Field subscale (Wise, 1985; Waters et al., 1998; Perney & Ravid, 1990; Green, 1994). Another source of variability is the statistics achievement measure used. Mayer found that most researchers defined course performance as course grade, although some used only the scores on a midterm or final exam.

In summary, the different research designs and assessment scales described above may account for the inconsistency in results relating student attitudes to performance in statistics.

 

    Course evaluations and anecdotes

Several statistics education studies focus on the description of a new method of teaching (e.g., the use of simulation software, cooperative groups, or a particular software package) and how it affected students in their courses (e.g., Gaines and Lawrence, 1997; Kenney, 1998; Smith, 1998). Often course evaluations are shared along with anecdotal reports of the instructor/researcher's perceptions of the student outcomes (e.g., that students are more engaged and interested in learning about statistics when the new activity or teaching approach is used). This information is often interesting and illuminating to the course instructor. However, this type of assessment information cannot be used to assess the impact of the activity on student learning, which is usually viewed as the most important course outcome.

 

Types of research questions asked

In the research studies described above, there tend to be two general types of questions addressed:

  1. Does a new activity (e.g., a method for teaching the central limit theorem) or a new teaching approach (e.g., cooperative groups) improve student performance and/or attitudes as compared to a traditional approach?

  2. Can student performance in a statistics class be modeled or predicted by measures of attitudes, and/or anxiety related to statistics, as well as other related measures such as mathematics achievement or gender?

 

Research Results

The results of these types of research studies appear to have a limited impact on the teaching of statistics because they rarely illuminate why a particular teaching method or activity works, explain how students' understanding and reasoning are affected or unaffected by the learning experience, or suggest changes for improved teaching practice.

We suggest that the types of questions posed and answered in these studies are limited by the types of assessment methods used, and that in modeling students' performance, many important variables are often overlooked, such as type of course, teaching approach, use of technology, and proficiency of the instructor. Certainly, one might conclude that we should focus on writing final exams that are better able to assess statistical thinking, and on designing research studies that include the teacher and instructional variables described above. However, this type of empirical study would be very hard to carry out in true experimental fashion. Instead, we suggest a different way of using assessment in research, one that informs the instructor, improves student learning, and builds a growing set of empirical data for research.

 

How we stumbled on an alternative approach to assessment

Our own use of assessment has changed dramatically over the past few years. During this time, we have been conducting collaborative, classroom-based research on the role of computer simulation software in the development of students' understanding of statistical inference. We began by developing software and instructional materials to teach sampling distributions. Our method was to demonstrate the software and guide students through an activity in which populations were created and sampling distributions were generated. Students filled out an activity packet, which instructed them to change parameters such as population shape and sample size, and to comment on the resulting sampling distribution. At the end of the activity, students were to summarize what they had learned, for example: sampling distributions of sample means tend to look more normal than the population from which the samples were drawn; and as the sample size increases, the variability of the sampling distribution decreases, but the center (mean) stays the same.
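
The kind of exploration described in this activity can be approximated in a few lines of code. The sketch below is not the software used in our courses; it is a minimal illustration, using only the Python standard library, of drawing repeated samples from a population and summarizing the resulting sample means. The function names, the exponential population, the number of samples, and the sample sizes are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_sample_means(draw, sample_size, num_samples=2000, seed=0):
    """Return num_samples sample means, each based on sample_size random draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


def skewed_population(rng):
    """Hypothetical right-skewed population: exponential with mean 1 and SD 1."""
    return rng.expovariate(1.0)


# Vary the sample size, as students do in the activity, and summarize
# the center and spread of the simulated sampling distribution.
for n in (1, 4, 16, 64):
    means = simulate_sample_means(skewed_population, n)
    print(
        f"n={n:>2}  mean of sample means={statistics.fmean(means):.3f}  "
        f"SD of sample means={statistics.stdev(means):.3f}"
    )
```

Under these assumptions, the mean of the sample means should stay close to the population mean for every sample size, while the standard deviation of the sample means should shrink as the sample size grows. Plotting a histogram of `means` for each sample size would show the change in shape as well.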

We developed pretests and posttests to administer to students to evaluate changes in their understanding of concepts related to sampling distributions, as a result of using our software and activity. By examining the changes in reasoning and understanding from pretest to posttest, which were not as strong as we had hoped, we realized that modifications were needed in both the software and activity. After making several rounds of improvements based on continually examining test results over several courses, we were disappointed that not enough students appeared to be developing what we considered to be correct reasoning. This examination of assessment data led us back to the research literature to explore other models for helping students develop statistical reasoning.

As we reviewed the educational and psychological literature, we became familiar with the ideas of Posner, Strike, Hewson, and Gertzog (1982), who describe a model of conceptual change that has been applied in science classroom learning. The model proposes that students who have misconceptions or misunderstandings need to experience an anomaly, or contradictory evidence, before they will change their current conceptions. A variety of related research studies indicate that people, in general, are resistant to change and are very likely to find ways to either assimilate information or discredit contradictory evidence rather than restructure their thinking in order to accommodate the contradictions (Lord, Ross, & Lepper, 1979; Jennings, Amabile, & Ross, 1982; Ross & Anderson, 1982). Modern information processing theories (e.g., Holland, Holyoak, Nisbett, & Thagard, 1987) suggest that it may be necessary to direct attention toward the features of the discrediting experience in order for the contradictory evidence to be encoded. Left to their own devices, people will attend only to those features that are predicted by their current information structure.

Guided by this literature, we reexamined our instructional materials, and looked for a way to have students make their own predictions and then test them out using our software programs. We decided to use our pretest as part of the instructional activity, to promote conceptual change, and modified our activity in the following way. Students worked on items based on five pretest situations. Once a student responded to a pretest item, the student would immediately model the situation described by the item using simulation software to see if the prediction was correct. For each item the students were asked to evaluate their predictions and to compare the actual simulation results to their predictions. We administered the same posttest as before, and were pleased to see that these scores improved dramatically when this approach was used.
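
The predict-then-test cycle described above can be sketched in code. The example below is a hypothetical illustration rather than our actual activity or software: a student predicts whether the variability of sample means will be "smaller", "same", or "larger" when the sample size increases, and the prediction is then compared with a simulation. The uniform population, the two sample sizes, the 10% tolerance for calling two standard deviations the "same", and all function names are assumptions.

```python
import random
import statistics


def simulated_sd_of_means(sample_size, num_samples=2000, seed=1):
    """Estimate the SD of sample means drawn from a uniform(0, 10) population."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.uniform(0, 10) for _ in range(sample_size))
        for _ in range(num_samples)
    ]
    return statistics.stdev(means)


def check_prediction(predicted_change, small_n=4, large_n=25):
    """Compare a prediction ('smaller', 'same', 'larger') with simulated results."""
    sd_small = simulated_sd_of_means(small_n)
    sd_large = simulated_sd_of_means(large_n)
    # Assumed tolerance: changes within 10% are treated as "same".
    if sd_large < 0.9 * sd_small:
        observed = "smaller"
    elif sd_large > 1.1 * sd_small:
        observed = "larger"
    else:
        observed = "same"
    return {
        "predicted": predicted_change,
        "observed": observed,
        "prediction_held": predicted_change == observed,
        "sd_small_n": sd_small,
        "sd_large_n": sd_large,
    }


# A student who expects variability to be unaffected by sample size
# is confronted with the simulated result.
result = check_prediction("same")
print(result["predicted"], result["observed"], result["prediction_held"])
```

The point of the design is that the prediction is recorded before the simulation is run, so that a mismatch between `predicted` and `observed` is made explicit to the student rather than left to be noticed.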

We decided to implement this model in subsequent activities related to statistical inference, using our simulation software. As new activities were developed and used with students, we explored alternative types of assessment to help us better understand what students were learning, how their conceptions or misconceptions were affected by the activities and software, and how their prior knowledge affected their experience with the learning activity. We moved from having a pretest of the knowledge students would be learning to a pretest of the knowledge needed to benefit from the learning activity, which could instead be used for diagnostic purposes. For our posttests, we developed items that would assess not only our desired learning goals for students, but also whether students had developed misconceptions related to the topic being studied.

As we developed new ways of assessing students, we realized that we had created an assessment framework that helps us better understand and document student learning, which therefore allows us to share more detailed and informative results with the research community. This assessment framework also helps us better inform our students about their level of learning and understanding, and ultimately, improves their knowledge of statistical inference.

Our assessment framework consists of the following components:

An example of the collection of instruments for one unit, Sampling Distributions of Means, is described in the following section.

 

An Example: Sampling Distributions of Means

In order to develop assessment instruments, we first needed to carefully analyze the different types of knowledge students should have (or not have), before, during, and after the instructional unit. This led to the following three lists:

  • prerequisite knowledge that students need before beginning the unit

  • common misconceptions that students should not have

  • desired student outcomes after completing the unit

 

Examples of these three lists are shown in Tables 1, 2, and 3.

Once we had described these three types of knowledge, we began to refine current items and construct new items, leading to our current assessment instruments. This process often began by developing open-ended questions and gathering students' responses either on paper or in individual interviews. This process allowed us both to refine the items and to develop representative multiple-choice responses to several items.

Our current pretest of prerequisite knowledge (see Appendix 1) consists of a set of items that may be used for diagnostic purposes. The instructor may administer all or some of these items before beginning the simulation activity, to determine whether students have misunderstandings that may be corrected before beginning the activity. For example, many students reveal a misunderstanding of the term "variability," thinking that it refers to the bumpiness of a distribution rather than to its spread. The instructor may address this by discussing the item with the class or by clearing up the confusion individually. For research purposes, the data gathered from this pretest may allow for comparisons of different classes of students before beginning a particular intervention being investigated.
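
The distinction between spread and bumpiness can be made concrete with a small sketch. The two data sets below are invented for illustration and are not pretest items; the `bumpiness` function is likewise a hypothetical measure (the average difference in height between adjacent histogram bars for integer-valued data), introduced only to contrast with the standard deviation.

```python
from collections import Counter
import statistics


def bar_heights(data):
    """Counts for each integer value from the minimum to the maximum of data."""
    counts = Counter(data)
    return [counts.get(v, 0) for v in range(min(data), max(data) + 1)]


def bumpiness(data):
    """Hypothetical measure: mean absolute difference between adjacent bar heights."""
    heights = bar_heights(data)
    return statistics.fmean(abs(a - b) for a, b in zip(heights, heights[1:]))


# Invented data: a jagged histogram over a narrow range of values...
bumpy_narrow = [4] * 8 + [5] * 2 + [6] * 8 + [7] * 2
# ...and a perfectly flat histogram over a wide range of values.
smooth_wide = [v for v in range(1, 11) for _ in range(2)]

for name, data in (("bumpy_narrow", bumpy_narrow), ("smooth_wide", smooth_wide)):
    print(
        f"{name}: SD={statistics.pstdev(data):.2f}  "
        f"bumpiness={bumpiness(data):.2f}"
    )
```

A student who equates variability with bumpiness would judge the first data set to be the more variable one, whereas the standard deviation, which measures spread, is larger for the second.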

The posttest is designed to evaluate students' understanding of key concepts and their ability to use these concepts in solving statistical problems. This test (see Appendix 2) consists of a variety of items to assess students' learning after completing the unit. Most items are multiple-choice or matching format for easier scoring. The use of these items varies according to research and/or instructional purpose. Some or all of these items may be included in a posttest at the end of the activity, a unit quiz, or final exam.

 

Summary

The assessment instruments described in this paper, as well as related instructional materials and simulation software tools on topics of statistical inference, are freely available on our web site: http://www.gen.umn.edu/faculty_staff/delmas/stat_tools/index.htm

The instruments we have developed may be borrowed and adapted for instructional and/or research purposes.

We believe that assessment instruments play an extremely important role in statistics education research studies. We encourage researchers to seriously consider the instruments they use, making sure that they are valid and reliable measures of student variables and will yield useful results. We hope that the framework we have created will help other researchers who are interested in conducting studies of statistics education.

A final comment is that we have found that conducting collaborative, classroom-focused research can be a productive model for considering assessment issues and for developing and refining instruments. We encourage other statistics education researchers to explore this model as a way to better understand and improve student learning of statistics.

 

References

Gaines, Leonard, and Lawrence, Elizabeth. (1997). Statistics: An activity based approach for learners at a distance. Paper presented at the Joint Statistical Meetings, Anaheim.

Giraud, Gerard. (1997). Cooperative Learning and Statistics Instruction. Journal of Statistics Education, v.5, n.3.

Hawkins, Anne, Jolliffe, Flavia, and Glickman, Leslie. (1992). Teaching Statistical Concepts. London: Longman Publishers.

Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. (1987). Induction: Processes of inference, learning, and discovery. Cambridge, Massachusetts: The MIT Press.

Jennings, D., Amabile, T. and Ross, L. (1982). Informal covariation assessment: Data-based versus theory-based judgments. In D. Kahneman, P. Slovic and A. Tversky (eds.), Judgment Under Uncertainty: Heuristics and Biases. Cambridge: Cambridge University Press.

Kenney, P. (1998). Teaching introductory statistical methods for understanding: Techniques, assessment, and affect. Paper presented at AERA, San Diego.

Lord, C., Ross, L., and Lepper, M. (1979). Biased assimilation and attitude polarization: The effects of prior theories on subsequently considered evidence. Journal of Personality and Social Psychology, 37, 2098-2109.

Magel, R. C. (1998). Using Cooperative Learning in a Large Introductory Statistics Class. Journal of Statistics Education, v.6, n.3.

Posner, G. J., Strike, K. A., Hewson, P. W., and Gertzog, W. A. (1982). Accommodation of a scientific conception: Toward a theory of conceptual change. Science Education, 66(2), 211-227.

Potthast, M. (1995). Cooperative learning experiences in introductory statistics. Paper presented at AERA, San Francisco.

Ross, L., and Anderson, C. (1982). Shortcomings in the attribution process: On the origins and maintenance of erroneous social assessments. In D. Kahneman, P. Slovic and A. Tversky (eds.), Judgment Under Uncertainty: Heuristics and Biases. Cambridge: Cambridge University Press.

Schutz, P. A., Drogosz, L. M., White, V. E., and Distefano, C. (1997). Prior knowledge, attitude and strategy use in an introduction to statistics course. Paper presented at AERA Annual Meeting, Chicago.

Smith, Gary. (1998). Learning statistics by doing statistics. Journal of Statistics Education, v.6, n.3.

Wisenbaker, Joseph, and Scott, Janice. (1997). Modeling aspects of students' attitudes and achievement in introductory statistics courses. Paper presented at AERA Annual Meeting, Chicago.

Acknowledgment

The research reported in this presentation was partially supported by NSF Grant No. DUE-9752523.


Table 1:
Prerequisite knowledge for learning sampling distributions


Students should understand:

  • the idea of a distribution

  • what variability means (spread vs. smoothness)

  • the idea of the center of a distribution

  • a random sample

  • a sample statistic vs. a population parameter

  • common shapes of distributions: normal, skewed, uniform, bimodal

  • being able to see between the data, to recognize overall shapes of distributions

  • the idea of area under a curve and how it represents likelihood of outcomes

  • properties of the normal distribution

  • normal distributions can look different due to different variability (shape vs. variance)

  • how to read and interpret histograms

  • the idea of sampling variability

Table 2:
Common misconceptions that students may have


Some common misconceptions students should NOT have:

  • Sampling distribution should look like the population (for n > 1)

  • Sampling distributions for small and large sample sizes have the same variability

  • Sampling distributions for large samples have more variability

  • Don't understand that a sampling distribution is a distribution of sample statistics

  • Confuse one sample (real data) with all possible samples (in distribution) or potential samples

  • representing a population (better for large samples) is confused with a sampling distribution for large samples, better representing a population (have more area)

  • students pay attention to the wrong things: e.g., heights of bars

  • the mean of a positively skewed distribution will be greater than the mean of the sampling distribution for samples taken from this population

Table 3:
Desired student outcomes

What students should understand about sampling distributions:

  • That a sampling distribution for means (based on quantitative data) is a distribution of sample means (statistics) of a given sample size, randomly sampled from a population with mean μ and standard deviation σ. It is a probability distribution for the sample mean.

  • The sampling distribution for means has the same mean as the population (parameter)

  • As n gets larger, variability of the sample means gets smaller (a statement, a visual recognition, and predicting what will happen or how the next picture will differ)

  • The standard error of the mean is a measure of variability of sample statistic values

  • How to interpret area/apply areas under curve as probability statements about sample statistics

  • The building block of a sampling distribution is a sample statistic

  • Some values of statistics are more or less likely than others to be drawn from a particular population

  • When it is reasonable to use a normal approximation

  • Different sample sizes lead to different probabilities for the same value (know how sample size affects the probability of different outcomes for a statistic)

  • Sampling distributions tend to look more normal than like the population, even for small samples

  • As sample sizes get very large, all sampling distributions for means look alike (i.e., have the same shape), regardless of the population from which they are drawn

  • Averages are more normal and less variable than individual observations

  • Distinguish between a distribution of observations in one sample and a distribution of statistics (sample means) from many samples (n greater than 1) that have been randomly selected.

What students should be able to DO with this knowledge:

  • Describe what a sampling distribution would look like for different populations and sample sizes (in terms of shape, center and spread, and where the majority of values would be found). What values of the sample mean are likely, and which are less likely.

  • Describe the size of the standard error of the mean

  • Describe the likelihood of different values of the sample mean

  • Describe the mean of the sample means for different shaped populations
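
Several of the outcomes in Table 3 (the size of the standard error, and how sample size changes the probability of the same value of the sample mean) can be illustrated with a short calculation. The sketch below assumes that a normal approximation to the sampling distribution is reasonable, which is itself one of the listed outcomes students must learn to judge. The population mean of 100, the standard deviation of 15, the cutoff of 103, and the function names are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(population_sd, sample_size):
    """Standard error of the mean: population SD divided by sqrt(n)."""
    return population_sd / sqrt(sample_size)


def prob_mean_exceeds(value, population_mean, population_sd, sample_size):
    """P(sample mean > value), assuming a normal sampling distribution."""
    sampling_dist = NormalDist(
        population_mean, standard_error(population_sd, sample_size)
    )
    return 1.0 - sampling_dist.cdf(value)


# Hypothetical population: mean 100, SD 15. The same cutoff (103) becomes
# less likely to be exceeded by the sample mean as the sample size grows.
for n in (4, 25, 100):
    se = standard_error(15, n)
    p = prob_mean_exceeds(103, 100, 15, n)
    print(f"n={n:>3}  standard error={se:.2f}  P(sample mean > 103)={p:.4f}")
```

For small samples from strongly skewed populations the normal approximation may be poor, in which case a simulation of the sampling distribution would be a more trustworthy basis for such probability statements.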


Appendix 1: Pretest of prerequisite knowledge

 

Appendix 2: Sampling Distributions Posttest