Thursday, 15 June 2017

Assessing an assessment strategy: Two-stage collaborative exams



Confession

Two-stage collaborative exams (see below) have become quite common in science courses at our institution, especially at the First- and Second-Year levels. I have seen first-hand the engagement and excitement that the collaborative portion can generate, and I have always been an advocate for this assessment strategy, simply because students seemed to like collaborative work, because collaboration was aligned with a lot of the teaching techniques that we were using, and because a colleague across the street, Brett Gilley, had mentioned that "they had done a study that showed that students learn more if they are tested collaboratively".

Embarrassingly enough, I had never actually read the aforementioned study, and somehow just assumed that it was a broad study demonstrating beyond the shadow of a doubt that two-stage collaborative exams enhance learning. (Now, of course, I know better...).


Overview

Two-stage exams are becoming widely used across post-secondary institutions, and are believed to benefit students in multiple ways (e.g. Zimbardo, Butler, & Wolfe, 2003). The first author of this study is an instructor in a science course that has been employing two-stage collaborative exams; in this study, he and his co-author conduct an in-class crossover study to establish whether or not this assessment strategy benefits student learning.   



The assessment strategy

There are several kinds of two-stage exams, each with multiple possible variations (Knierim, Turner, & Davis, 2015). The specific strategy discussed in the paper by Gilley and Clarkson (2014) is a two-stage collaborative exam in which students first write a multiple-choice test individually, hand it in, form groups of about four, and then re-write the exam in their groups (one answer sheet per group).
This assessment strategy adds some authenticity to otherwise very traditional assessments (tests) in classes that have students work and learn collaboratively throughout the course.


Purpose of this study

Prior to this study, research on the effects of collaborative testing on student learning had yielded contrasting results (e.g. Cortright, Collins, Rodenbaugh, & DiCarlo, 2003; Kapitanoff, 2009; Leight, Saunders, Calkins, & Withers, 2012; Sandahl, 2010). The authors set out to test, as rigorously as they could, whether two-stage collaborative tests improve student learning in a First-Year Earth and Ocean Science course.


Methodology


This study was embedded into the normal flow of the course during one semester and used a crossover design. Students wrote their Midterm exam, then the class was split into Groups A and B and, within each of these two groups, students organized themselves into smaller groups of 3-5 (Gilley & Clarkson, 2014). They then wrote the individual re-test (different subsets of the Midterm questions), and finally wrote the remainder of the Midterm collaboratively within their small groups (group re-test). All students later wrote a "Learning Test" consisting of 10 specific Midterm questions.

A more detailed summary of the data collection methodology is shown in the flowchart below. Note that only the portion of the study that occurred in conjunction with Midterm 1 is represented, but the same procedure was adopted in conjunction with Midterm 2.






This study design ensured that each student saw the five "topic 1" and the five "topic 2" questions three times, although under different conditions during the re-test. For instance, students in Group A saw the five "topic 1" questions individually each of the three times, and the five "topic 2" questions individually on the Midterm and Learning Test, but in a small group setting during the group re-test. Conversely, Group B students saw the five "topic 2" questions individually each of the three times, and the five "topic 1" questions individually on the Midterm and Learning Test, but in a small group setting during the group re-test.
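To make the crossover assignment easier to keep straight, here is a minimal Python sketch of the design as I understand it from the description above. The function name `condition`, the stage labels, and the group/topic encoding are my own illustrative choices, not anything taken from the paper.

```python
# Hypothetical encoding of the crossover design described above.
# Stage labels and names are illustrative assumptions, not the authors' terms.
STAGES = ("midterm", "retest", "learning_test")

# The one topic each half of the class re-tests collaboratively.
COLLABORATIVE_TOPIC = {"A": 2, "B": 1}


def condition(group, topic, stage):
    """Return 'individual' or 'group' for a class half, topic set, and stage."""
    if group not in COLLABORATIVE_TOPIC:
        raise ValueError(f"unknown group: {group!r}")
    if topic not in (1, 2):
        raise ValueError(f"unknown topic: {topic!r}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    # Only the re-test is ever collaborative, and only for one topic per group.
    if stage == "retest" and topic == COLLABORATIVE_TOPIC[group]:
        return "group"
    return "individual"


if __name__ == "__main__":
    for group in ("A", "B"):
        for topic in (1, 2):
            row = [condition(group, topic, stage) for stage in STAGES]
            print(f"Group {group}, topic {topic}: {row}")
```

Laid out this way, each student acts as their own control: one topic is re-tested individually and the other collaboratively, and the two halves of the class swap which topic gets which treatment.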

Student performances and learning were measured in multiple ways. First, students' scores on the five "topic 1" and five "topic 2" questions were compared (t-test) between their Midterm and the Learning Test. Then, the mean difference between Midterm and Learning Test for each of the two sets of questions was compared between Group A and Group B, and the effect size was calculated.
Finally, the authors calculated the normalized change (Marx & Cummings, 2007) from Midterm to Learning Test for every student. This is a measure of student performance improvement that takes into consideration how much a student can possibly improve.
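For readers who want to see these two measures in concrete terms, here is a rough Python sketch. The normalized change follows the piecewise definition in Marx and Cummings (2007): a gain is divided by the room left to improve, and a loss is divided by the room there was to drop. The effect size is Cohen's d with a pooled standard deviation, which is an assumption on my part, as is the default `max_score` of 5 (one point per question in a five-question set). The function names and the numbers in the usage lines are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def normalized_change(pre, post, max_score=5):
    """Normalized change for one student on one question set.

    pre, post: raw scores between 0 and max_score.
    Returns None when pre == post at the floor or ceiling, because such
    students are dropped under the published definition.
    """
    if post > pre:
        # Gain as a fraction of the room left to improve.
        return (post - pre) / (max_score - pre)
    if post < pre:
        # Loss as a fraction of the room there was to drop.
        return (post - pre) / pre
    if pre in (0, max_score):
        return None
    return 0.0


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled SD (assumed)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Illustrative numbers only, not data from the study.
    print(normalized_change(3, 4))          # 0.5: gained 1 of the 2 points available
    print(normalized_change(4, 2))          # -0.5: lost 2 of the 4 points held
    print(cohens_d([1, 2, 3], [0, 1, 2]))   # 1.0
```

The point of the normalization is that a student who goes from 4/5 to 5/5 and one who goes from 0/5 to 5/5 both get a normalized change of 1, since each gained everything they possibly could.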

Importantly (in my opinion), in each case the Learning Test took the form of an unannounced pop quiz three days after the Midterm exam.


Main findings

Learning Test vs. Midterm performance: On average, students' performance on the five "topic 1" and five "topic 2" questions on the Learning Test was higher than on that same subset of questions in the Midterm exam. In Midterm 1 the difference was significant regardless of whether students completed the re-test individually or as a group; in Midterm 2 it was only significant for students in the group condition.

Individual re-test vs. group re-test: On average, the performance improvement from Midterm to Learning Test was significantly higher, with a medium effect size, when students completed the re-test in the group condition (so, higher improvement on "topic 2" questions for Group A students and on "topic 1" questions for Group B students).

These results are highlighted in Table 2 below.


Reproduced from: Gilley & Clarkson, 2014


High-, medium- and low-performing students: On average, normalized change (i.e., 'standardized performance improvement') was significantly higher for questions re-tested in the group condition than in the individual condition for students in the upper, middle, and lower tertiles of the class.


Reproduced from: Gilley & Clarkson, 2014    


Analysis

Was this paper useful? Yes, for at least two reasons. First, it is an example of a crossover study conducted within the context and scope of a course. As such it could serve as a very useful "template" if I wanted to assess the effectiveness of a similar strategy in my class. Second, I now know exactly what the evidence is that many of us have been "invoking" to justify and promote two-stage collaborative exams (more about this below).    

Strengths and limitations: I believe that the biggest strength of this study is its design, and the care that the authors took to control all the variables that they could possibly control. Moreover, the study was conducted in a classroom setting, with minimal deviation from the customary assessment plan, which adds ecological validity to the results.


However, I also think there are two major limitations. The first is that the results cannot be extrapolated to any exam in any course or discipline. In particular, the exam and course in question used multiple-choice questions (MCQs) exclusively. The process of collaboratively answering an MCQ likely devotes a larger proportion of time to discussion among group members, presentation of arguments for or against a certain answer, and eventually coming to a consensus. This is different from what may be involved in collaboratively formulating an open-ended answer (e.g. discussion of what the answer is, but also of how to express it in writing and what terminology to use, with an overall higher proportion of time spent writing the answer, typically done by one person, than discussing it as a group). This difference in process may translate into a difference in learning, so we should be careful not to assume that, just because two-stage collaborative multiple-choice exams result in enhanced learning, the same applies to all types of two-stage collaborative exams.
In addition, the study was conducted on only one class, for one term, and only for a subset of questions. What is more, it was a compressed Summer term, which may not be representative of a regular term, since students in a compressed term usually take only one course at a time.

The second limitation of the paper, in my opinion, is that the "Learning Test" took place only three days after the Midterm and the re-test. Claiming that students "learned" because they scored higher on the Learning Test than on the Midterm exam therefore seems a bit of a stretch: the score improvement may be due to students simply remembering the answers that their group deemed correct, rather than understanding why those answers are correct.
    
Missing from this paper: Following from my point above, I think a longer-term retention test/Learning Test would have made this study stronger; for example, the "topic 1" and "topic 2" questions could have been administered on the last day of class. In addition, I would be very curious to see data over several semesters, particularly in light of the fact that previous studies yielded inconsistent results.

Potential for implementation in a technology-rich environment: Having implemented a form of two-stage collaborative exams before (and realizing that it was quite different from the format investigated in this study), I am now very interested in implementing this assessment strategy in the biweekly formative assessments used in one of the classes I teach, which does not have midterm exams. I believe that the technology is available; it would be a matter of having students assigned to groups through the Learning Management System, assigning a deadline for the individual online quiz, and having a window of time when groups can collaborate (e.g. through a group online discussion board, a chatroom, or even Skype) to complete the quiz again, collaboratively. There would definitely be challenges, such as ensuring that all group members participate, but I believe that it would be worth a try.


References

Cortright, R. N., Collins, H. L., Rodenbaugh, D. W., & DiCarlo, S. E. (2003). Student retention of course content is improved by collaborative-group testing. Advances in Physiology Education, 27(3), 102–108.

Gilley, B. H., & Clarkson, B. (2014). Collaborative testing: Evidence of learning in a controlled in-class study of undergraduate students. Journal of College Science Teaching, 43(3), 83–91.

Kapitanoff, S. (2009). Collaborative testing: Cognitive and interpersonal processes related to enhanced test performance. Active Learning in Higher Education, 10(1), 56–70.

Knierim, K., Turner, H., & Davis, R. K. (2015). Two-stage exams improve student learning in an introductory geology course: Logistics, attendance, and grades. Journal of Geoscience Education, 63, 157–164.

Leight, H., Saunders, C., Calkins, R., & Withers, M. (2012). Collaborative testing improves performance but not content retention in a large-enrolment introductory biology class. CBE—Life Sciences Education, 11, 392–401.

Marx, J. D., & Cummings, K. (2007). Normalized change. American Journal of Physics, 75, 87–91.

Sandahl, S. S. (2010). Collaborative testing as a learning strategy in nursing education. Nursing Education Perspectives, 31(3), 142–147.

Zimbardo, P. G., Butler, L. D., & Wolfe, V. A. (2003). Cooperative college examinations: More gain, less pain when students share information and grades. The Journal of Experimental Education, 71(2), 101–125.