Test Scoring and Analysis Using SAS (2014)
Chapter 9. Tips on Writing Multiple-Choice Items
Testing has been conceptualized in a number of different fashions. It has been likened to taking a measure like height. It is related to a statistical sampling procedure where you have a conceptually infinite number of possible test items, and it has been described as trying to determine the shape of an object that resides within the brain. However it is conceptualized, a test consists of a set of one or more items. In the British tradition, testing often consists of a small number of essays (sometimes just one) on a topic. In America, and increasingly in the world, a test is thought of as a collection of individual items whose scores are summed together. Thus, testing requires the writing of test items. Although the primary focus of this book is how to analyze items and test scores once the test has been administered, we take some time here to talk about the construction of test items. This chapter is concerned with writing items for tests: course examinations, certification tests, admissions tests, or tests designed to assist in the learning process. What should a good test look like?
Tests, like all measures, should be fit for purpose. That is, they should do what you want them to do. Sometimes that will be an agglomeration of a bunch of different areas of your course, and other times it might be an attempt to create a scale that measures one well-defined ability, what measurement specialists often call a trait. There are many books that look at how to develop achievement tests and how to do measurement theory in general. We would recommend Haladyna and Rodriguez (2013) for test development; for measurement theory in general, see Brennan (2006).
Before you start writing items themselves, it is often useful to step back and get organized for the test.
It is very helpful to start an achievement measure with some sort of outline of what you want to accomplish. This is sometimes referred to as a test blueprint. A blueprint should be specific enough that if given to a colleague in the same field, that person could do a good job of writing the test. It does not need to be more explicit than that. Not only is the test blueprint very helpful for writing the test, it allows you to take a look at the blueprint and determine whether you are really testing students on the material that you feel is important. You should be able to look at a blueprint and conclude, "Yep, that’s the course, the whole course, and nothing but the course."
A blueprint is simply an elaborated outline of what will be on the test. There are approaches that use a kind of matrix design for this activity, but that isn’t necessary. Once the outline has been constructed, you can look at it and assign weightings to the various components of the test. We always have our weightings sum to 100%, but that isn’t an absolute, depending on how you construct your overall grading system.
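As a rough illustration of how weightings translate into a test, the following SAS sketch converts blueprint weights into item counts. The topic names, the weights, and the 40-item test length are all hypothetical, chosen only for illustration; they are not taken from any actual course.

```sas
%let n_items = 40;   /* hypothetical test length (an assumption) */

data blueprint;
   input Topic :$12. Weight;                  /* Weight is a percentage */
   Items = round(&n_items * Weight / 100);    /* items allotted to the topic */
datalines;
Causes 30
Campaigns 25
Homefront 20
Aftermath 25
;

proc print data=blueprint noobs;
   sum Weight Items;                          /* column totals as a check */
run;
```

The SUM statement prints column totals, so you can confirm that the weights total 100 and see whether rounding has left the item counts short of, or over, the intended test length.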
It is also often very useful to distribute the test blueprint to students in the course in advance of the test. That way you know that their studying will be focused on those aspects of the test that are important to you. It also eliminates the "you tested us on things you didn’t teach us" and "you taught us stuff that wasn’t on the test" complaints. Some faculty get concerned that one is spoon-feeding students by providing this information. We look at it from the opposite side of the coin: Once you have told students what to study, you are pretty much free to write a rigorous assessment of that material. It allows you to write a test that, if students do well on it, you will be pleased to assign them good grades.
Taxonomy of Objectives and Items
A second useful starting point is to think about the level of difficulty and complexity you want your items to have. What do you want your students to do? Should they be able to recall facts and figures, should they comprehend the material presented (restate it in their own words, perhaps?), should they be able to contrast a given theory to alternatives? One way to think rigorously about issues such as this is to consult a taxonomy of objectives and/or test items. The idea of a taxonomy of educational objectives (what you want to accomplish in a course) was first proposed by Bloom (1956), and a useful revision of that has been developed by his colleagues, Anderson and Krathwohl (2001). The revised taxonomy has six levels of complexity, or levels of thinking, into which course objectives and test items might be classified (the original taxonomy also had six levels). Very briefly, these levels are:
• Remembering: Simple recall of information such as facts and figures.
• Understanding: Does the student understand the information? Can he/she rewrite it in his/her own terms?
• Applying: Can the student take the information or ideas presented and use them in a novel setting? Can the student apply the information to a new task?
• Analyzing: Can the student take the ideas/theory/information apart and analyze components, compare to other ideas?
• Evaluating: Can the student make a critical judgment about the theory/ideas and defend that judgment?
• Creating: Can the student create a new idea/product/concept that is applicable to the task or situation at hand?
There are a variety of web sites that present information on this taxonomy and others that have been developed since Bloom’s ground-breaking work. The underlying idea here is to think about levels of understanding and ability that go beyond remembering and understanding information. We are particularly enamored of the application level of the taxonomy. Can the student take what has been learned and use it in a new situation?
There are probably other things that one should do before writing a test, but this is a good beginning for now, so we will move on to actually looking at different possibilities for test items and how to write them.
Types of Items for Achievement Tests
Very broadly speaking, one can consider two basic types of test questions: ones that require recognition of correct answers and ones that require the generation of correct answers. There are a variety of formats possible within these two broad categories, and we will examine a number of them.
Recognition format simply means that the correct answer is presented to the examinee, who has to identify it among distracters, determine whether it is true or false, or match one characteristic to another (body parts and their names, for example). This format has the distinct advantage of being very easy to score, both in terms of assigning a value to a response (correct or incorrect) and the ability to use machine-based scoring of responses. It has several disadvantages, the main one being that examinees can guess the correct answer. Another important disadvantage is that it is somewhat limiting in terms of the type of abilities that can be elicited (although we will see that this limitation is not as severe as some believe).
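To get a feel for how much guessing can contribute, a minimal sketch follows. It assumes purely blind guessing, where every option is equally likely to be picked, on a hypothetical 40-item test with three to five options per item. Real examinees can usually eliminate some options, so treat the result as a baseline rather than a prediction.

```sas
data guessing;
   N_Items = 40;                             /* hypothetical test length */
   do N_Choices = 3 to 5;
      P_Guess = 1 / N_Choices;               /* chance of a correct blind guess */
      Expected_Correct = N_Items * P_Guess;  /* expected items correct by chance */
      output;
   end;
run;

proc print data=guessing noobs;
   format P_Guess 5.3 Expected_Correct 5.1;
run;
```

Under these assumptions, the expected number of items answered correctly by chance alone falls as the number of options rises.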
The multiple-choice (MC) item is ubiquitous in American education and becoming more popular worldwide. The MC item has basically two components: the stem, or question, which poses the problem; and the options, typically three to five choices made up of the correct answer and the incorrect alternatives, or distracters. The examinee reads the stem and has to select the option that best answers the question posed in the stem. Some examples of MC stems:
• When did World War I begin? This stem requires recalling information, the lowest level of the taxonomy. The student simply has to remember when the war began.
• How did the killing of Archduke Franz Ferdinand provide the spark for the war to begin? This stem requires the student to understand why the assassination of the Archduke caused the war to begin.
• Which of the following "hotspots" in the world today might be thrown into war with the assassination of a world leader? This question requires the student to understand the situation in a number of places in the world and apply what he/she knows about the causes of armed conflict to those situations.
It is not difficult to see that the three stems listed here vary not only in their level of the taxonomy but also in their difficulty: the higher the level of the taxonomy, the more difficult the question. One often, but not always, sees that relationship.
But we don’t yet really know the difficulty of these items. Indeed, each of them could be posed in generative, or what is often called constructed response, format. That is, they could just be presented as is, and the student would have to generate an answer for them. That would make them much more difficult than if presented in multiple-choice format. But, in multiple-choice format, the choice of options to use greatly affects the difficulty of the item. Consider the first stem above:
When did World War I begin?
Now, one could write a fairly easy set of options:
Or, one could write an incredibly challenging set of options:
a) July, 1914
b) August, 1914
c) September, 1914
d) October, 1914
Although most people would feel that the second set of choices might be a bit unfair, the point here has to do with what you expect the students to know. If you want them to generally understand that the war started roughly in 1914, you might use the following options:
That gives a reasonable set and will measure whether a student understands that the war started in 1914. Well, it started in 1914 in Europe. The US didn’t join until 1917, and so 1917 may be considered by some students to be an unfair choice! One of the rules of writing MC items is to make sure that the right answer is right and that the wrong answers are wrong!
Another format for the MC item that is very useful is to start with a setting or context or a set of information that sets up the question. This can be done on a question-by-question basis, or one can set up a more elaborate setting and ask several questions about it (think of the reading comprehension questions from the SAT where a passage is presented and then four or more questions are asked about it). The use of introductory material in this fashion can often allow the test constructor to ask questions requiring the use of higher order thinking skills – those at the top of the taxonomy presented before. Imagine that we wanted to measure how well you have understood some of the more sophisticated programming issues presented in this text. We could present an example of some code used to solve a particular analysis problem. And that code could have an error in it. We could ask you to analyze the code and locate the error. Depending on how subtle we made the error, this could be a very difficult problem, requiring a high level of knowledge and analytical skill. Here is an example:
Given the following SAS program to read multiple observations from one line of data, identify the line that contains an error:
a) data temperature;
b) input Temp_C;
c) Temp_F = (9/5)*Temp_C + 32;
d) datalines;
e) 15 16 17 18 19 20
The answer is b. You need a double trailing at sign (@@) before the semicolon to prevent the program from moving to a new line of data at each iteration of the DATA step.
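For reference, a corrected version of the program might look like the sketch below. It uses the double trailing at sign described in the answer and the standard Celsius-to-Fahrenheit factor of 9/5. The PROC PRINT step is added here only to display the result and is not part of the item.

```sas
data temperature;
   input Temp_C @@;               /* @@ holds the data line across iterations */
   Temp_F = (9/5)*Temp_C + 32;    /* Celsius to Fahrenheit */
datalines;
15 16 17 18 19 20
;

proc print data=temperature noobs;
run;
```

With the @@ in place, each of the six values on the data line becomes its own observation instead of only the first value being read.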
Another possibility is to display a diagram or illustration, and then pose questions about it. However, when writing multiple questions about a single source or context, make sure that you are not giving away the answer to a question by the way another question is posed.
Giving away answers to questions brings up the issue of general do’s and don’ts for writing MC items. One can find a number of these lists (some quite long) on the Internet, but here is a shortened version that focuses on key elements of good item writing:
• Alignment: Make sure each item is important and can be clearly linked to your test blueprint. Ask yourself, "Do I really want to know if the examinee can do this?"
• Item Baggage: Item baggage is the ease with which a person who knows what you are interested in can get the item wrong. Avoid double negatives – actually, it is best to avoid all negatives; they introduce noise into the answering of the item. Avoid stems like, "Which of the following is not used in…." The general rule is no tricks. If an examinee gets an item wrong, you want it to be for one reason and one reason only: the person did not know the material.
• Clear Problems: Put the information necessary to do the problem/question in the stem of the item. That is, try to avoid overly long choices in the option list. You want the examinee to understand the nature of the problem once he/she has read the stem.
• Clearly Correct: As mentioned above, make certain that your right answers are right and your wrong answers are wrong. The easiest way to do this is to give the test to a colleague to answer. You will be amazed at how often something you thought was perfectly clear confuses someone who knows as much about the material as you do!
• Parsimony: Don’t repeat the same phrase time and again in the option list. Just put it up in the stem.
• Similarity of Options: The similarity of the choices in a multiple-choice item often determines how difficult the item is: the more similar, the more difficult. Make sure the surface features in your choices are roughly similar. Don’t have some choices use extreme language ("always," "never," "could not possibly," etc.) and others use very moderate language ("might," "can sometimes be seen to," etc.). Consistency is the key.
• Grammar: Make sure all choices are grammatically consistent with the stem.
• Number of Choices: Don’t force yourself to have five choices for every MC item, or even four. For course exams, you also don’t need the number of choices to be the same for all items.
Examples of Strong and Poor MC Items
MC items are the workhorse of the examination field. Let’s take a look at several and make some commentary on their quality.
Which of the following best describes the transmission of information from one neuron to another:
A. The axon of one neuron is attached to the dendrites of a number of postsynaptic neurons, which lets electrical current be passed among neurons.
B. An electrical impulse down the axon of one neuron causes neurotransmitters to be sent across a synaptic gap and received by dendrites of another neuron.
C. The nucleus of the first cell divides, and the information contained in the DNA of that cell passes on to the DNA of the second cell through the synaptic cleft.
D. The axon of one neuron sheds its outermost glial cell, which is received by the postsynaptic neuron through the transmitters contained at the end of the dendrites of the receiving cell.
This item, although it seems to be pretty substantial, suffers from three problems. First, the nature of the task resides as much in the alternatives as it does in the problem itself. The examinee has to read all the choices and make comparisons in order to answer the question. This isn’t always a bad thing, but it is when combined with the second problem this item has: the choices are very long and complex. Making comparisons among complex alternatives is not what the item is supposed to measure. Instead, we want to know about the examinee’s understanding of how neurons work. Finally, the incorrect choices aren’t really reasonable (plausible) answers to the question.
A second example:
Three-year-old Maria’s parents are native speakers of two different languages. They each speak to Maria in their native languages. Maria is learning to speak both languages fluently. What is this an example of:
A. Experience-expectant learning
B. Tabula rasa
C. Functional organization
D. Grammatical processing
This second item is an example of applying a concept. The examinee has to be able to interpret a new situation and determine which of the four concepts listed it describes. The problem is clearly stated in the stem, and the incorrect choices are plausible while still being clearly incorrect.
And a third example:
The pegword method is a good approach for learning lists of items. What is the essential feature of the pegword method that makes it an effective mnemonic device:
C. Disinhibitory effects
D. Visual imagery
This third example requires the examinee to look at the characteristics (features) of the pegword method and determine which of those characteristics makes it work as a mnemonic device. Exactly what level of thinking this item requires depends a bit on what has been taught in the course. If the features and why they work have been taught explicitly, then this is a recall-type item. If the pegword approach has been taught, but not explained, then this item is at least at an understanding level. If the basic ideas behind mnemonic strategies and the information processing theory that lies behind them have not been taught, one could argue that this is an analyzing item. Again, the level of thinking involved depends on what has gone on in the course.
The best place to start in analyzing test data is with a good test! Test analysis will help to point out a number of areas where there are weaknesses with a test, or a test question, or even just an aspect of a test question. But it is essential to start with your very best effort. In this brief chapter, we look at the most commonly used examination question, the multiple-choice question, and provide some help on how to organize yourself for the test and how to think about writing test items, as well as some practical help on test construction.
Anderson, L. W., and Krathwohl, D. R., Eds. 2001. A Taxonomy for Learning, Teaching and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. Abridged edition. New York: Longman.
Bloom, B. S. 1956. Taxonomy of Educational Objectives. Vol. 1: Cognitive Domain. New York: McKay.
Brennan, R. L., Ed. 2006. Educational Measurement, 4th ed. ACE/Praeger Series on Higher Education. MD: Rowman and Littlefield.
Haladyna, T. M., and Rodriguez, M. C. 2013. Developing and Validating Test Items. New York: Routledge.