Test Scoring and Analysis Using SAS (2014)
Chapter 1. What This Book Is About
Introduction
This book has two purposes. The first is to describe basic (and some advanced) ideas about how tests are evaluated. These ideas range from simple tasks, such as scoring a test and producing frequencies on each of the multiple-choice items, to advanced tasks, such as measuring how well test items are performing (item analysis) and estimating test reliability. You will even find some programs to help you determine if someone cheated on your test. In addition to discussing test theory and SAS programs to analyze tests, we have included a chapter on how to write good test items.
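To give you a taste of how little code some of these tasks require, here is a minimal sketch that produces a frequency table for each of five multiple-choice items. The data set name Score and the item variables Ans1-Ans5 are made-up names for illustration:

proc freq data=Score;
   title "Frequencies for Each Multiple-Choice Item";
   tables Ans1-Ans5;   /* one frequency table per item */
run;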
Tests are used in schools to assess competence in various subjects and in professional settings to determine whether a person should be certified (or recertified) in a profession such as nurse, EMT, or physician. Too many of these tests are never evaluated. Why not? Because, until now, easy-to-use programs were not readily available.
The second purpose is to provide you with a collection of programs that perform most of the tasks described in this book. Even without a complete knowledge of SAS programming, you will learn enough to use the included programs to create better tests and assessment instruments. If you know how to write SAS programs (at a beginning or intermediate level), you will find detailed explanations of how these programs work. Feel free to skip these sections if you like. The last chapter of this book contains listings of programs that you can use to score tests, print student rosters, perform item analysis, and carry out all of the other tasks developed in this book. Along with these listings, you will find instructions telling you how to run each of the programs.
An Overview of Item Analysis and Test Reliability
Testing is used for an incredibly wide range of purposes in society today. Tests are used for certification into professions, admission into universities and graduate programs, grading at all levels of schooling, formative assessment to help students learn, and classification to determine whether students need special forms of assistance. In each of these settings, it is critical that the use and interpretation of the test scores be valid. Simply put, the test should be doing what it is supposed to do.
Part of “doing what it is supposed to do” relates to the notions of validity and reliability. These concepts are analogous to the concepts of “antique” and “old.” For something to be an antique, it has to be old, but just because something is old doesn’t mean it is an antique. In the same way, reliability is a necessary but not sufficient condition for validity. The use and interpretation of a test is valid if it leads to the proper decisions: the right people being certified, the best qualified candidates being selected, grades being fair and accurate, and instructional assistance being on target and useful.
What we discuss in this book primarily has to do with the development of tests that are reliable and the analysis of test items to ensure that reliability. We also discuss the idea of test validity, but test validation research in general is beyond the scope of what we cover here. For a good idea of what test validity and validation studies are about, we recommend any of these references:
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. 1999. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Shepard, L. A. 1993. “Evaluating Test Validity.” Review of Research in Education 19: 405-450.
Kane, M. T. 2006. “Validation.” In Educational Measurement, 4th ed., edited by R. L. Brennan, 17-64. ACE/Praeger Series on Higher Education. MD: Rowman and Littlefield.
The basic idea of reliability has to do with consistency of information. If you tested again, would you get fundamentally the same results? It’s like stepping on the scale a second time to make sure that you are really the weight that the scale reported the first time! To achieve that reliability, you need to ensure that all of the items that make up the test measure the same underlying trait, or construct. In this book, you will see how to write items that make good tests and how to statistically analyze your items using SAS to guide the revision of items and the generation of tests with strong reliability.
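As a preview of what is to come, one widely used index of this internal consistency is Cronbach’s coefficient alpha, which PROC CORR computes when you specify the ALPHA option. The sketch below assumes a data set named Score that contains one scored variable per item; the names Score1-Score10 are made up for illustration:

proc corr data=Score nomiss alpha;
   title "Coefficient Alpha: An Estimate of Internal Consistency";
   var Score1-Score10;   /* scored items: 1=correct, 0=incorrect */
run;

The NOMISS option excludes students with missing item scores before alpha is computed.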
A Brief Introduction to SAS
This section is intended for those readers who are not familiar with SAS. SAS is many things: a programming language, a collection of statistical procedures, a set of programs to produce a variety of graphs and charts, and a set of tools that provides businesses with advanced analytics. The programs developed in this book mostly use Base SAS (the programming part of SAS), some graphics procedures to produce charts and graphs, and a new SAS procedure that analyzes tests using item response theory (IRT).
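For example, a minimal call to that procedure, PROC IRT (available with SAS/STAT under SAS 9.4), could look like the sketch below; the data set name Responses and the item variables Item1-Item20 are assumptions for illustration:

proc irt data=Responses;
   var Item1-Item20;   /* fit a default IRT model to the scored items */
run;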
SAS programs are composed of statements that do the following (a short example appears after the list):
1. Instruct the computer to read data (from a variety of sources such as text files or Excel workbooks)
2. Perform calculations
3. Make logical decisions
4. Output data to files or printers
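Here is a minimal sketch of a program that does all four; the variable names and the passing score are made up for illustration:

data quiz;
   input ID $ RawScore;                     /* read data */
   Percent = 100*RawScore/20;               /* perform a calculation (20-item test) */
   if Percent ge 70 then Status = 'Pass';   /* make a logical decision */
   else Status = 'Fail';
datalines;
001 18
002 12
;

proc print data=quiz;                       /* output the results */
   title "Quiz Results";
run;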
There are many books that can teach you how to program using SAS. We recommend the following (all available from SAS Institute at www.support.sas.com/publishing):
Delwiche, Lora D., and Susan J. Slaughter. 2012. The Little SAS Book: A Primer, Fifth Edition. Cary, NC: SAS Press.
Cody, Ron. 2007. Learning SAS by Example: A Programmer's Guide. Cary, NC: SAS Press.
SAS Institute Inc. 2013. SAS 9.4 Language Reference: Concepts. Cary, NC: SAS Press.
SAS Institute Inc. 2013. SAS 9.4 Macro Language Reference. Cary, NC: SAS Press.