
Have you heard about standard setting approaches such as the Hofstee method, or perhaps the Angoff, Ebel, Nedelsky, or Bookmark methods?  There are certainly various ways to set a defensible cutscore on a professional credentialing or pre-employment test.  Today, we are going to discuss the Hofstee method.

Why Standard Setting?

Certification organizations that care about the quality of their examinations need to follow best practices and international standards for test development, such as the Standards laid out by the National Commission for Certifying Agencies (NCCA).  One component of that is standard setting, also known as a cutscore study.  One of the most common and respected approaches for this is the modified-Angoff methodology.

However, the Angoff approach has one flaw: the subject matter experts (SMEs) tend to expect too much out of minimally competent candidates, and sometimes set a cutscore so high that even they themselves would not pass the exam.  There are several reasons this can occur.  For example, raters might think “I would expect anyone that worked for me to know how to do this” and not consider the fact that people who work for them might have 10 years of experience, while test candidates could be fresh out of training/school and may have had the topic touched on for only five minutes.  SMEs often forget what it was like to be a much younger and less experienced version of themselves.

For this reason, several compromise methods have been suggested to compare the Angoff-recommended cutscore with a “reality check” of actual score performance on the exam, allowing the SMEs to make a more informed decision when setting the official cutscore of the exam.  I like to use the Beuk method and the Hofstee method.

The Hofstee Method

One method of adjusting the cutscore based on raters’ impressions of the difficulty of the test and possible pass rates is the Hofstee method (Mills & Melican, 1987; Cizek, 2006; Burr et al., 2016).  This method requires the raters to estimate four values:

  1. The minimum acceptable failure rate
  2. The maximum acceptable failure rate
  3. The minimum cutscore, even if all examinees failed
  4. The maximum cutscore, even if all examinees passed

The first two values are failure rates, and are therefore between 0% and 100%, with 100% indicating a test that is too difficult for anyone to pass.  The latter two values are on the raw score scale, and therefore range between 0 and the number of items in the test, again with a higher value indicating a more difficult cutscore to achieve.

These values are paired, and a line is drawn through the two points they define: (minimum cutscore, maximum failure rate) and (maximum cutscore, minimum failure rate).  The intersection of this line with the observed failure rate function is the recommended adjusted cutscore.
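For illustration, here is a minimal sketch of that calculation in Python, assuming integer raw scores and a sample of observed examinee scores; the function and variable names are my own, not from any particular package.

```python
import numpy as np

def hofstee_cutscore(scores, k_min, k_max, f_min, f_max):
    """Find the Hofstee compromise cutscore: the point where the line from
    (k_min, f_max) to (k_max, f_min) crosses the observed failure-rate curve."""
    scores = np.asarray(scores)
    candidates = np.arange(k_min, k_max + 1)
    # Observed failure rate at each candidate cutscore (fail = score below cut)
    fail_rate = np.array([(scores < cut).mean() for cut in candidates])
    # The Hofstee line descends from the maximum to the minimum acceptable failure rate
    line = f_max + (f_min - f_max) * (candidates - k_min) / (k_max - k_min)
    # Approximate intersection: the candidate where the two curves are closest
    idx = int(np.argmin(np.abs(fail_rate - line)))
    return candidates[idx], fail_rate[idx]

# Example with simulated raw scores on a 100-item test
rng = np.random.default_rng(42)
scores = rng.binomial(100, 0.72, size=500)
cut, rate = hofstee_cutscore(scores, k_min=60, k_max=80, f_min=0.05, f_max=0.30)
print(f"Hofstee cutscore: {cut} (projected failure rate {rate:.0%})")
```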

How can I use the Hofstee Method?

Unlike the Beuk, the Hofstee method does not utilize the Angoff ratings, so it represents a completely independent reality check.  In fact, it is sometimes used as a standalone cutscore setting method itself, but because it does not involve rating of every single item, I recommend it be used in concert with the Angoff and Beuk approaches.

How can you perform all the calculations that go into the Hofstee method?  Well, you don’t need to program it all from scratch.  Just head over to our Angoff Analysis Tool page and download a copy for yourself.

Psychometrics is the cornerstone of any high-quality assessment program.  However, most organizations can’t afford an in-house Ph.D. psychometrician, which then necessitates the search for psychometric consulting.  Most organizations, when first searching, are new to the topic and not sure what role the psychometrician plays. 

In this article, we’ll talk about how psychometricians and their tools can help improve your assessments, whether you just want to check on test reliability or pursue the lengthy process of accreditation.

Why ASC?

Whether you are establishing or expanding a credentialing program, streamlining operations, or moving from paper to online testing, ASC has a proven track record of providing practical, cost-efficient solutions with uncompromising quality. We offer a free consultation with our team of experts to discuss your needs and determine which solutions are the best fit, including our enterprise SaaS platforms, consulting on sound psychometrics, or recommending you to one of our respected partners.

At the heart of our business are our people.

Our collaborative team of Ph.D. psychometricians, accreditation experts, and software developers has diverse experience developing solutions that drive best practices in assessment. This real-world knowledge enables us to provide your organization with consulting solutions tailored specifically to your goals, timeline, and budget.

Comprehensive Solutions to Address Specific Measurement Problems

Much of psychometric consulting is project-based around solving a specific problem.  For example, you might be wondering how to set a cut score on a certification/licensure exam that is legally defensible and meets accreditation standards. 

This is a very specific issue, and the scientific literature has suggested a number of sound approaches.  Here are some of the topics where psychometricians can really help:

  • Test Design: Job Analysis & Blueprints
  • Standard and Cutscore Setting Studies
  • Item Writing and Review Workshops
  • Test and Item Statistical Analysis
  • Equating Across Years and Forms
  • Adaptive Testing Research
  • Test Security Evaluation
  • NCCA/ANSI Accreditation

Why psychometric consulting?

All areas of assessment can be smarter, faster, and fairer.

Develop Reliable and Valid Assessments
We’ll help you understand what needs to be done to develop defensible tests and how to implement them in a cost-efficient manner.  Much of the work revolves around establishing a sound test development cycle.

Increase Test Security
We have specific expertise in psychometric forensics, allowing you to flag suspicious candidates or groups in real-time, using our automated forensics report.

Achieve Accreditation
Our dedicated experts will assist in setting your organization up for success with NCCA/ANSI accreditation of professional certification programs.

Comprehensive Psychometric Analytics
We use CTT and IRT with principles of machine learning and AI to deeply understand your data and provide actionable recommendations.

We can help your organization develop and publish certification and licensure exams, based on best practices and accreditation standards, in a matter of months.


Item and Test Statistical Analysis
If you are not doing this process at least annually, you are not meeting best practices or accreditation standards. But don’t worry, we can help! We can perform these analyses for you, or you can run them yourself in our FastTest platform or with our psychometric software, Iteman and Xcalibre.

Job Analysis
How do you know what a professional certification test should cover?  Well, let’s get some hard data by surveying job incumbents. Knowing and understanding this information and how to use it is essential if you want to test people on whether they are prepared for the job or profession.

Cut score Studies (Standard Setting)
Sound psychometric practices like the modified-Angoff, Beuk Compromise, Bookmark, and Contrasting Groups methods will help you establish a cutscore that meets professional standards.

 

It’s all much easier if you use the right software!

Once we help you determine the best solutions for your organization, we can train you on best practices, and it’s extremely easy to use our software yourself.  Software like Iteman and Xcalibre is designed to replace much of the manual work done by psychometricians for item and test analysis, and FastTest automates many aspects of test development and publishing.  We even offer free software like the Angoff Analysis Tool.

However, our ultimate goal is your success: Assessment Systems is a full-service company that continues to provide psychometric consulting and support even after you’ve made a purchase. Our team of professionals is available to provide you with additional support at any point in time. We want to ensure you’re getting the most out of our products!  Click below to sign up for a free account in FastTest and see for yourself.

Sign up for a Free Account

If you are involved with certification testing and are accredited by the National Commission for Certifying Agencies (NCCA), you have come across the term decision consistency.  NCCA requires you to submit a report of 11 important statistics each year, for each active test form.  These 11 statistics provide a high-level summary of the psychometric health of each form; more on that report here.  One of the 11 is decision consistency.

Decision consistency is an estimate of how consistent the pass/fail decision is on your test.  That is, if someone took your test today, had their brain wiped of that memory, and took the test again next week, what is the probability that they would obtain the same classification both times?  This is often estimated as a proportion or percentage, and we would of course hope that this number is high, but if the test is unreliable it might not be.

The reasoning behind the need for an index specifically for this is that the psychometric quantity we are trying to estimate is different from the reliability of point scores (Moltner, Timbil, & Junger, 2015; Downing & Mehrens, 1978).  The argument is that examinees near the cutscore are of primary interest, while reliability evaluates the entire scale.  It’s for this reason that, if you are using item response theory, the NCCA allows you to instead submit the conditional standard error of measurement (CSEM) function at the cutscore.  All of the classical decision consistency indices, however, evaluate all examinees, and since most candidates are not near the cutscore, this inflates the baseline.  Only the CSEM – from IRT – follows the line of reasoning of focusing on examinees near the cutscore.

An important distinction that stems from this dichotomy is that of decision consistency vs. accuracy.  Consistency refers to receiving the same pass/fail classification each time if you take the test twice.  But what we really care about is whether your pass/fail based on the test matches with your true state.  For a more advanced treatment on this, I recommend Lathrop (2015).

There are a number of classical methods for estimating an index of decision consistency that have been suggested in the psychometric literature.  A simple and classic approach is Hambleton (1972), which is based on an assumption that examinees actually take the same test twice (or equivalent forms).  Of course, this is rarely feasible in practice, so a number of methods were suggested over the next few years on how to estimate this with a single test administration to a given set of examinees.  These include Huynh (1976), Livingston (1972), and Subkoviak (1976).  These are fairly complex.  I once reviewed a report from a psychometrician that faked the Hambleton index because they didn’t have the skills to figure out any of the indices.
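To make the parallel-forms definition concrete, here is a minimal Monte Carlo sketch in Python.  It is an illustration under textbook CTT assumptions (normally distributed scores, parallel forms correlating at the reliability coefficient), not an implementation of any of the published indices above.

```python
import numpy as np

def simulate_consistency(mean, sd, reliability, cutscore, n=100_000, seed=1):
    """Proportion of simulated examinees classified the same (pass/fail)
    on two parallel forms whose scores correlate at the reliability."""
    rng = np.random.default_rng(seed)
    true_scores = rng.normal(mean, sd * np.sqrt(reliability), n)
    error_sd = sd * np.sqrt(1 - reliability)
    form1 = true_scores + rng.normal(0, error_sd, n)
    form2 = true_scores + rng.normal(0, error_sd, n)
    return ((form1 >= cutscore) == (form2 >= cutscore)).mean()

print(simulate_consistency(mean=75, sd=10, reliability=0.90, cutscore=70))
```

Playing with the inputs shows the point made above: consistency rises with reliability, and also with the distance of the score distribution from the cutscore, which is why a baseline computed over all examinees can look inflated.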

How does decision consistency relate to reliability?

The note I made above about unreliability is worth another visit, however.  After the rash of publications on the topic, Mellenbergh and van der Linden (1978; 1980) pointed out that if you assume a linear loss function for misclassification, the conventional estimate of reliability – coefficient alpha – serves as a solid estimate of decision consistency.  What is a linear loss function?  It means that a misclassification is worse the further the person’s score is from the cutscore.  That is, if the cutscore is 70, failing someone with a true score of 80 is twice as bad as failing someone with a true score of 75.  Of course, we never know someone’s true score, so this is a theoretical assumption, but the researchers make an excellent point.

But while research amongst psychometricians on the topic has cooled since they made that point, NCCA still requires one of these statistics – most from the 1970s – to be reported.  The only other well-known index on the topic is Hanson and Brennan (1990).  While the indices have been shown to be different from classical reliability, I remain unconvinced that they are the right approach.  Of course, I’m not much of a fan of classical test theory in the first place; the acceptance of CSEM from IRT is definitely aligned with my views on how psychometrics should tackle measurement problems.

 


Item banking refers to the purposeful creation of a database of assessment items to serve as a central repository of all test content, improving efficiency and quality. The term item refers to what many call questions, though their content need not be restricted as such and can include problems to solve or situations to evaluate in addition to straightforward questions. As a critical component of the test development cycle, item banking is the foundation for developing valid, reliable content and defensible test forms.

Automated item banking systems, such as Assess.ai or FastTest, result in significantly reduced administrative time for developing/reviewing items and assembling/publishing tests.  Contact us to request a free account.

What is Item Banking?

While there are no absolute standards in creating and managing item banks, best practice guidelines are emerging. Here are the essentials you should be looking for:

   Items are reusable objects; when selecting an item banking platform it is important to ensure that items can be used more than once; ideally, item performance should be tracked not only within a test form but across test forms as well.

   Item history and usage are tracked; the usage of a given item, whether it is actively on a test form or dormant waiting to be assigned, should be easily accessible for test developers to assess, as the over-exposure of items can reduce the validity of a test form. As you deliver your items, their content is exposed to examinees. Upon exposure to many examinees, items can then be flagged for retirement or revision to reduce cheating or teaching to the test.

   Items can be sorted; as test developers select items for a test form, it is imperative that they can sort items based on their content area or other categorization methods, so as to select a sample of items that is representative of the full breadth of constructs we intend to measure.

   Item versions are tracked; as items appear on test forms, their content may be revised for clarity. Any such changes should be tracked and versions of the same item should have some link between them so that we can easily review the performance of earlier versions in conjunction with current versions.

   Review process workflow is tracked; as items are revised and versioned, it is imperative that the changes in content, and the users who made them, are tracked. In post-test assessment, there may be a need for further clarification, and the ability to pinpoint who took part in reviewing an item can expedite that process.

   Metadata is recorded; any relevant information about an item should be recorded and stored with the item. The most common applications for metadata that we see are author, source, description, content area, depth of knowledge, IRT parameters, and CTT statistics, but there are likely many data points specific to your organization that are worth storing. A sketch of such an item record follows below.
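To make these essentials concrete, here is a minimal sketch of what one item record might look like as a data structure; the field names mirror the list above and are hypothetical, not the schema of any particular platform.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ItemRecord:
    """One record in an item bank; field names are illustrative only."""
    item_id: str                       # unique, sortable name, e.g. "ANAT-0042"
    author: str
    source: str
    content_area: str                  # maps to the test blueprint
    depth_of_knowledge: int
    status: str = "draft"              # e.g. draft / in review / active / retired
    version: int = 1
    prior_versions: List[str] = field(default_factory=list)  # IDs of earlier versions
    exposure_count: int = 0            # administrations to date, for exposure control
    p_value: Optional[float] = None    # CTT difficulty (proportion correct)
    point_biserial: Optional[float] = None  # CTT discrimination
    irt_a: Optional[float] = None      # IRT discrimination parameter
    irt_b: Optional[float] = None      # IRT difficulty parameter
```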

Managing an Item Bank

Names are important. As you create or import your item banks it is important to identify each item with a unique, but recognizable name. Naming conventions should reflect your bank’s structure and should include numbers with leading zeros to support true numerical sorting.  You might want to also add additional pieces of information.  If importing, the system should be smart enough to recognize duplicates.

Search and filter. The system should also have a reliable sorting mechanism. 


Prepare for the Future: Store Extensive Metadata

Metadata is valuable. As you create items, take the time to record simple metadata like author and source. Having this information can prove very useful once the original item writer has moved to another department or left the organization. Later in your test development life cycle, as you deliver items, you have the ability to aggregate and record item statistics. Values like discrimination and difficulty are fundamental to creating better tests, driving reliability and validity.

Statistics are used in the assembly of test forms: classical statistics can be used to estimate mean, standard deviation, reliability, standard error, and pass rate, while item response theory parameters come in handy when calculating test information and standard error functions. Data from both psychometric theories can be used to pre-equate multiple forms.
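For instance, here is a minimal sketch of how test information and conditional standard error might be computed from stored 3PL item parameters; the parameter values are made up for illustration.

```python
import numpy as np

def test_information(thetas, a, b, c):
    """Test information function for the 3PL model (D = 1.702),
    summed over items; a, b, c are arrays of item parameters."""
    thetas = np.asarray(thetas, dtype=float).reshape(-1, 1)
    p = c + (1 - c) / (1 + np.exp(-1.702 * a * (thetas - b)))
    item_info = (1.702 * a) ** 2 * ((p - c) / (1 - c)) ** 2 * (1 - p) / p
    return item_info.sum(axis=1)

# Made-up parameters for a 3-item example
a = np.array([1.0, 0.8, 1.3])
b = np.array([-0.5, 0.0, 0.7])
c = np.array([0.20, 0.25, 0.20])

thetas = np.linspace(-3, 3, 7)
tif = test_information(thetas, a, b, c)
csem = 1 / np.sqrt(tif)   # conditional standard error of measurement
print(np.round(csem, 2))
```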

In the event that your organization decides to publish an adaptive test, utilizing CAT delivery, item parameters for each item will be essential, because they are used for intelligent selection of items and scoring of examinees. Additionally, in the event that the integrity of your test or scoring mechanism is ever challenged, documentation of validity is essential to defensibility, and the storage of metadata is one such vital piece of documentation.

Increase Content Quality: Track Workflow

Utilize a review workflow to increase quality. Using a standardized review process will ensure that all items are vetted in a similar manner. Have a step in the process for grammar, spelling, and syntax review, as well as content review by a subject matter expert. As an item progresses through the workflow, its development should be tracked, as workflow results also serve as validity documentation.

Accept comments and suggestions from a variety of sources. It is not uncommon for each item reviewer to view an item through their distinctive lens. Having a diverse group of item reviewers stands to benefit your test-takers, as they are likely to be diverse as well!


Keep Your Items Organized: Categorize Them

Identify items by content area. Creating a content hierarchy can also help you to organize your item bank and ensure that your test covers the relevant topics. Most often, we see content areas defined first by an analysis of the construct(s) being tested. For a high school science test, this may include an evaluation of the content taught in class. A high-stakes certification exam almost always includes a job-task analysis. Both methods produce what is called a test blueprint, indicating how important various content areas are to the demonstration of knowledge in the areas being assessed.

Once content areas are defined, we can assign items to levels or categories based on their content. As you are developing your test, and invariably referring back to your test blueprint, you can use this categorization to determine which items from each content area to select.

Why Item Banking?

There is no doubt that item banking is a key aspect of developing and maintaining quality assessments. Utilizing best practices, and caring for your items throughout the test development life cycle, will pay great dividends as it increases the reliability, validity, and defensibility of your assessment. Moreover, good item banking will make the job easier and more efficient thus reducing the cost of item development and test publishing.

Ready to improve assessment quality through item banking?

Click below to visit our Contact page, where you can request a demonstration or a free account (up to 500 items).

With so many things to consider, it’s no wonder psychometricians often recommend the retirement of poorly performing items. Here are some of the most common issues we see, along with our tried-and-true methods for designing good, psychometrically sound items.  We could all use some reminders on good item writing, and these are made easier if you use a strong item banking system like FastTest.

Issue: Key is invalid due to multiple correct answers.
Recommendation: Consider each answer option individually; the key should be fully correct and each distractor fully incorrect.

Issue: Item was written in a hard-to-comprehend way, so examinees were unable to apply their knowledge because of poor wording.
Recommendation: Ensure that the item can be understood after just one read-through. If you have to read the stem multiple times, it needs to be rewritten.

Issue: Grammar, spelling, or syntax errors direct savvy test takers toward the correct answer (or away from incorrect answers).
Recommendation: Read the stem, followed by each answer option, aloud. Each answer option should fit with the stem.

Issue: Information was introduced in the stem text that was not relevant to the question.
Recommendation: After writing each question, evaluate the content of the stem. It should be clear and concise without introducing irrelevant information.

Issue: Item emphasizes trivial facts.
Recommendation: Work off of a test blueprint to ensure that each of your items maps to a relevant construct. If you are using Bloom’s taxonomy or a similar approach, items should be from higher-order levels.

Issue: Numerical answer options overlap.
Recommendation: Carefully evaluate numerical ranges to ensure there is no overlap among options.

Issue: Examinees noticed the answer was most often A.
Recommendation: Distribute the key evenly among the answer options. This can be avoided with FastTest’s randomized delivery functionality.

Issue: Key was overly specific compared to distractors.
Recommendation: Answer options should all be about the same length and contain the same amount of information.

Issue: Key was the only option to include a key word from the item stem.
Recommendation: Avoid re-using key words from the stem text in your answer options. If you do use such words, distribute them evenly among all of the answer options so as to not call out individual options.

Issue: A rare exception can be argued to invalidate a true/false always/never question.
Recommendation: Avoid using “always” or “never,” as there can be unanticipated or rare scenarios. Opt for less absolute terms like “most often” or “rarely.”

Issue: Distractors were not plausible, so the key was obvious.
Recommendation: Review each answer option and ensure that it has some bearing in reality. Distractors should be plausible.

Issue: Idiom or jargon was used; non-native English speakers did not understand.
Recommendation: It is best to avoid figures of speech; keep the stem text and answer options literal to avoid introducing undue discrimination against certain groups.

Issue: Key was significantly longer than distractors.
Recommendation: There is a strong tendency to write a key that is very descriptive. Be wary of this, and evaluate distractors to ensure that they are approximately the same length.


Want more item writing tips, and psychometrics in general?

Sign up for our newsletter and hear about our free tools, product updates, and blog posts first! Don’t worry, we would never sell your email address, and we promise not to spam you with too many emails.


There are a number of acceptable methodologies in the psychometric literature for standard-setting studies, which establish a cutscore or passing point.  Some examples include Angoff, modified-Angoff, Bookmark, Contrasting Groups, and Borderline. The modified-Angoff approach is by far the most popular, but it remains a black box to many professionals in the testing industry, especially non-psychometricians in the credentialing field.  This post provides some clarity on the methodology. There is some flexibility in study implementation, but this article describes a sound method.

What to Expect with the Modified-Angoff Approach

First of all, do not expect a straightforward, easy process that leads to an unassailably correct cutscore.  All standard-setting methods involve some degree of subjectivity; the goal of a good method is to reduce that subjectivity as much as possible.  Some methods focus on content, others on data, while some try to meld the two.

Step 1: Prepare Your Team

The modified-Angoff process depends on a representative sample of subject matter experts (SMEs), usually 6-20.  By “representative” I mean they should represent the various stakeholders. For instance, a certification for medical assistants might include experienced medical assistants, nurses, and physicians, from different areas of the country.  You must train them about their role and how the process works, so they can understand the end goal and drive toward it.

Step 2: The Minimally Competent Candidate (MCC)

This concept is the core of the Angoff process, though it is known by a range of terms or acronyms, including minimally qualified candidates (MQC) or just barely qualified (JBQ).  The reasoning is that we want our exam to separate candidates that are qualified from those that are not.  So we ask the SMEs to define what makes someone qualified (or unqualified!) from a perspective of skills and knowledge. This leads to a conceptual definition of an MCC.  We then want to estimate what score this borderline candidate would achieve, which is the goal of the remainder of the study.   This step can be conducted in person, or via webinar.

Step 3: Round 1 Ratings

Next, ask your SMEs to read through all the items on your test form and estimate the percentage of MCCs that would answer each correctly.  A rating of 100 means the item is a slam dunk; it is so easy that every MCC would get it right.  A rating of 40 is very difficult.  Most ratings are in the 60-90 range if the items are well-developed. The ratings should be gathered independently; if everyone is in the same room, let them work on their own in silence.  This can easily be conducted remotely, though.

Step 4: Discussion

This is where it gets fun.  Identify the items with the most disagreement (as defined by grouped frequency distributions or standard deviation) and have the SMEs discuss them.  Maybe two SMEs thought an item was super easy and gave it a 95, while two others thought it was super hard and gave it a 45.  They will try to convince the other side of their folly. Chances are that there will be no shortage of opinions, and you, as the facilitator, will find your greatest challenge is keeping the meeting on track.  This step can be conducted in person, or via webinar.

Step 5: Round 2 Ratings

Raters then re-rate the items based on the discussion.  The goal is greater consensus.  In the previous example, it’s not likely that every rater will settle on a 70, but if your raters all end up between 60 and 80, that’s OK.  How do you know there is enough consensus?  We recommend the inter-rater reliability index (intraclass correlation) suggested by Shrout and Fleiss (1979).

Step 6: Evaluate Results and Final Recommendation

Evaluate the results from Round 2 as well as Round 1.  An example of this is below.  What is the recommended cutscore?  (This is the average or sum of the Angoff ratings, depending on the scale you prefer.)  Did the reliability improve?  Estimate the mean and SD of examinee scores (there are several methods for this). What sort of pass rate do you expect?  Even better, utilize the Beuk Compromise as a “reality check” between the modified-Angoff approach and actual test data.  You should take multiple points of view into account, and the SMEs need to vote on a final recommendation. They, of course, know the material and the candidates, so they have the final say.  This means that standard setting is a political process; again, reduce that effect as much as you can.
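Here is a minimal sketch of the core arithmetic for Steps 4 through 6, using a small hypothetical ratings matrix; the disagreement threshold of 10 points is a judgment call, not a published rule.

```python
import numpy as np

# Hypothetical Round 2 Angoff ratings: rows = raters, columns = items,
# each value = estimated percent of MCCs answering correctly
ratings = np.array([
    [70, 85, 60, 90, 75],
    [65, 80, 45, 95, 70],
    [75, 90, 95, 85, 80],
])

item_means = ratings.mean(axis=0)
item_sds = ratings.std(axis=0, ddof=1)

# Items with high disagreement are candidates for discussion (Step 4)
flagged = np.where(item_sds > 10)[0]
print("Items to discuss:", flagged)

# Recommended cutscore: average of the ratings (percentage metric),
# or the sum of item means divided by 100 (raw-score metric)
cutscore_pct = item_means.mean()
cutscore_raw = (item_means / 100).sum()
print(f"Cutscore: {cutscore_pct:.1f}% ({cutscore_raw:.2f} of {ratings.shape[1]} items)")
```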

Step 7: Write Up Your Report

Validity refers to evidence gathered to support test score interpretations.  Well, you have lots of relevant evidence here.  Document it.  If your test gets challenged, you’ll have all this in place.  On the other hand, if you just picked 70% as your cutscore because it was a nice round number, you could be in trouble.

Additional Topics

In some situations, there are more issues to worry about.  Multiple forms?  You’ll need to equate in some way.  Using item response theory?  You’ll have to convert the Angoff-recommended cutscore onto the theta metric using the Test Response Function (TRF).  New credential and no data available?  That’s a real chicken-and-egg problem there.
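As an illustration of that IRT conversion, here is a minimal sketch that inverts a 3PL Test Response Function with a root-finder; the item parameters are invented for the example.

```python
import numpy as np
from scipy.optimize import brentq

def theta_cutscore(raw_cut, a, b, c):
    """Find the theta at which the 3PL Test Response Function (expected
    raw score) equals the Angoff-recommended raw cutscore."""
    def trf_minus_cut(theta):
        p = c + (1 - c) / (1 + np.exp(-1.702 * a * (theta - b)))
        return p.sum() - raw_cut
    return brentq(trf_minus_cut, -4, 4)   # root-finding over a wide theta range

# Made-up parameters for a 50-item exam with an Angoff cutscore of 35
a = np.full(50, 1.0)
b = np.linspace(-2, 2, 50)
c = np.full(50, 0.20)
print(round(theta_cutscore(35, a, b, c), 3))
```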

Where Do I Go From Here?

Ready to take the next step and actually apply the modified-Angoff process to improving your exams?  Download our free Angoff Analysis Tool. Want to go even further and implement automation in your Angoff study?  Sign up for a free account in our FastTest item banker.

References

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.


Background of the Certification/Recertification Report

The National Commission for Certifying Agencies (NCCA) is a group that accredits certification programs.  Basically, many of the professional associations that had certifications (e.g., Certified Professional Widgetmaker) banded together to form a super-association, which then established an accreditation arm to ensure that certifications were of high quality – as there are many professional certifications in the world of near-zero quality.  So this is a good thing.  Becoming accredited is a rigorous process, and the story doesn’t stop there.  Once you are accredited, you need to submit annual reports to NCCA as well as occasionally re-apply.  This is mentioned in NCCA Standard 24 and is also described on the NCCA webpage regarding annual renewals of accreditation. It requires information on certification and recertification volumes.

The certification program must demonstrate continued compliance to maintain accreditation.

Essential Elements:

  1. The certification program must annually complete and submit information requested of the certification agency and its programs for the previous reporting year.

There are a number of reports that are required, one of which is a summary of Certification/Recertification numbers.  These are currently submitted through an online system, but an example of a previous paper form can be found here, which clearly states some of the tables that you need to fill out.

In the past, you had to pay consultants or staff to manually compile this information.  Our certification management system does it automatically for you – for free.

Overview of the Cert/Recert Report

Our easily generated Certification/Recertification report provides a simple, clean overview of your certification program. The report includes important information required for maintaining accreditation, including the number of applicants, new certifications, and recertifications, as well as percentages signifying the success rate of new candidates and recertification candidates. The report is automatically produced within FastTest’s reporting module, saving your organization thousands of dollars in consultant fees.

Here is a sample report. Assume that this organization has one base level certification, Generalist, with 3 additional specialist areas where certification can also be earned.

Annual Certification/Recertification Report

Date run: 11/10/2016
Timeframe: 1/1/2012 – 12/31/2015

Program | Applicants for first-time certification | First-time certified | Due for recertification | Recertified | Percent due that recertified
Generalist | 4,562 | 2,899 | 653 | 287 | 44%
Specialist A | 253 | 122 | 72 | 29 | 40%
Specialist B | 114 | 67 | 24 | 7 | 36%
Specialist C | 44 | 13 | 2 | 0 | 0%

Let’s examine the data for the Generalist program. Follow the table across the first data line:

Generalist is the name of the program with the data being analyzed. The following data all refers to candidates who were involved in the certification process at the Generalist level within our organization.

4,562 is the number of candidates who registered to be certified for the program within the timeframe indicated above the table. These candidates have never been certified in this program before.

2,899 is the number of candidates who successfully completed the certification process by receiving a passing score and meeting any other minimum requirements set forth by the certification program.

653 is the number of previously certified candidates whose certification expired within the timeframe indicated above the table.

287 is the number of previously certified candidates who successfully completed the recertification process.

44% is the percentage of candidates eligible for recertification within the indicated timeframe who successfully completed the recertification process. Another way to express this value is 287/653: the number who successfully completed recertification divided by the number due for recertification within the given timeframe.

The same format follows for each of the Specialist programs.

If you found this post interesting, you might also be interested in checking out this post on the NCCA Annual Statistics Report.  That report is another one of the requirements, but focuses on statistical and psychometric characteristics of your exams.


I often hear this question about scaling, especially regarding the scaled scoring functionality found in software like FastTest and Xcalibre.  The following is adapted from lecture notes I wrote while teaching a course in Measurement and Assessment at the University of Cincinnati.

Scaling: Sort of a Tale of Two Cities

Scaling at the test level really has two meanings in psychometrics. First, it involves defining the method used to operationally score the test, establishing an underlying scale on which people are measured.  It also refers to score conversions used for reporting, especially conversions designed to carry specific information.  The latter is typically called scaled scoring.

You have all been exposed to this type of scaling, though you might not have realized it at the time. Most high-stakes tests like the ACT, SAT, GRE, and MCAT are reported on scales that are selected to convey certain information, with the actual numbers selected more or less arbitrarily. The SAT and GRE have historically had a nominal mean of 500 and a standard deviation of 100, while the ACT has a nominal mean of 18 and standard deviation of 6. These are actually the same scale, because they are nothing more than a converted z-score (standard score), simply because no examinee wants to receive a score report that says you got a score of -1. The numbers above were arbitrarily selected, and then the score range bounds were selected based on the fact that virtually all of the population (99.7%) is within plus or minus three standard deviations. Hence, the SAT and GRE range from 200 to 800 and the ACT ranges from 0 to 36. This leads to the urban legend of receiving 200 points for writing your name correctly on the SAT; again, it feels better for the examinee. A score of 300 might seem like a big number, 100 points above the minimum, but it just means that someone is roughly in the 2nd percentile.
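As a rough illustration, here is a sketch of such a bounded z-score conversion; the raw-score mean and SD are placeholders, not actual SAT or GRE conversion values.

```python
def scaled_score(raw, raw_mean, raw_sd, new_mean=500, new_sd=100,
                 floor=200, ceiling=800):
    """Linear z-score conversion to a reporting scale, bounded at
    plus/minus three SDs the way the SAT/GRE scales historically were."""
    z = (raw - raw_mean) / raw_sd
    return min(ceiling, max(floor, round(new_mean + new_sd * z)))

print(scaled_score(62, raw_mean=50, raw_sd=10))   # z = +1.2 -> 620
```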

Now, notice that I said “nominal.” I said that because the tests do not actually have those means observed in samples, because the samples have substantial range restriction. Because these tests are only taken by students serious about proceeding to the next level of education, the actual sample is of higher ability than the population. The lower third or so of high school students usually do not bother with the SAT or ACT. So many states will have an observed average ACT of 21 and standard deviation of 4. This is an important issue to consider in developing any test. Consider just how restricted the population of medical school students is; it is a very select group.

How can I select a score scale?

For various reasons, actual observed scores from tests are often not reported, and only converted scores are reported.  If there are multiple forms which are being equated, scaling will hide the fact that the forms differ in difficulty, and in many cases, differ in cutscore.  Scaled scores can facilitate feedback.  They can also help the organization avoid explanations of IRT scoring, which can be a headache to some.

When deciding on the conversion calculations, there are several important questions to consider.

First, do we want to be able to make fine distinctions among examinees? If so, the range should be sufficiently wide. My personal view is that the scale should be at least as wide as the number of items; otherwise you are voluntarily giving up information. This in turn means you are giving up variance, which makes it more difficult to correlate your scaled scores with other variables, as the MCAT is correlated with success in medical school. This, of course, means that you are hampering future research – unless that research is able to revert back to actual observed scores to make sure all possible information is used. For example, suppose a test with 100 items is reported on a 5-point grade scale of A-B-C-D-F. That scale is quite restricted, and therefore difficult to correlate with other variables in research. But you have the option of reporting the grades to students and still using the original scores (0 to 100) for your research.

Along the same lines, we can swing completely in the other direction. For many tests, the purpose of the test is not to make fine distinctions, but only to broadly categorize examinees. The most common example of this is a mastery test, where the examinee is being assessed on their mastery of a certain subject, and the only possible scores are pass and fail. Licensure and certification examinations are an example. An extension of this is the “proficiency categories” used in K-12 testing, where students are classified into four groups: Below Basic, Basic, Proficient, and Advanced. This is used in the National Assessment of Educational Progress (http://nces.ed.gov/nationsreportcard/). Again, we see the care taken for reporting of low scores; instead of receiving a classification like “nonmastery” or “fail,” the failures are given the more palatable “Below Basic.”

Another issue to consider, which is very important in some settings but irrelevant in others, is vertical scaling. This refers to the chaining of scales across various tests that are at quite different levels. In education, this might involve linking the scales of exams in 8th grade, 10th grade, and 12th grade (graduation), so that student progress can be accurately tracked over time. Obviously, this is of great use in educational research, such as the medical school process. But for a test to award a certification in a medical specialty, it is not relevant because it is really a one-time deal.

Lastly, there are three common calculation options: a pure linear conversion (ScaledScore = RawScore * Slope + Intercept), a standardized conversion (from an old mean/SD to a new mean/SD), and nonlinear approaches like equipercentile.
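The linear and standardized options are simple algebra (the bounded z-score sketch above is a standardized conversion); as a sketch of the nonlinear option, here is a minimal equipercentile conversion, assuming you have score distributions from both forms and skipping the smoothing used in operational equating.

```python
import numpy as np

def equipercentile_convert(new_form_scores, ref_form_scores, x):
    """Map score x on the new form to the score on the reference form
    that has the same percentile rank."""
    pr = (np.asarray(new_form_scores) <= x).mean()   # percentile rank of x
    return float(np.quantile(ref_form_scores, pr))   # same rank on the reference form

# Example: a score of 75 on a slightly harder new form maps a bit higher
rng = np.random.default_rng(7)
new_form = rng.binomial(100, 0.70, 1000)
ref_form = rng.binomial(100, 0.73, 1000)
print(equipercentile_convert(new_form, ref_form, 75))
```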

Perhaps the most important issue is whether the scores from the test will be criterion-referenced or norm-referenced. Often, this choice will be made for you because it distinctly represents the purpose of your tests. However, it is quite important and usually misunderstood, so I will discuss this in detail.

Criterion-Referenced vs. Norm-Referenced

This is a distinction between the ways test scores are used or interpreted. A criterion-referenced score interpretation means that the score is interpreted with regards to defined content, blueprint, or curriculum (the criterion), and ignores how other examinees perform (Bond, 1996). A classroom assessment is the most common example; students are scored on the percent of items correct, which is taken to imply the percent of the content they have mastered. Conversely, a norm-referenced score interpretation is one where the score provides information about the examinee’s standing in the population, but no absolute (or ostensibly absolute) information regarding their mastery of content. This is often the case with non-educational measurements like personality or psychopathology. There is no defined content which we can use as a basis for some sort of absolute interpretation. Instead, scores are often either z-scores or some linear function of z-scores.  IQ is historically scaled with a mean of 100 and standard deviation of 15.

It is important to note that this dichotomy is not a characteristic of the test, but of the test score interpretations. This fact is more apparent when you consider that a single test or test score can have several interpretations, some of which are criterion-referenced and some of which are norm-referenced. We will discuss this deeper when we reach the topic of validity, but consider the following example. A high school graduation exam is designed to be a comprehensive summative assessment of a secondary education. It is therefore specifically designed to cover the curriculum used in schools, and scores are interpreted within that criterion-referenced context. Yet scores from this test could also be used for making acceptance decisions at universities, where scores are only interpreted with respect to their percentile (e.g., accept the top 40%). The scores might even do a fairly decent job at this norm-referenced application. However, this is not what they are designed for, and such score interpretations should be made with caution.

Another important note is the definition of “criterion.” Because most tests with criterion-referenced scores are educational and involve a cutscore, a common misunderstanding is that the cutscore is the criterion. It is still the underlying content or curriculum that is the criterion, because we can have this type of score interpretation without a cutscore. Regardless of whether there is a cutscore for pass/fail, a score on a classroom assessment is still interpreted with regards to mastery of the content.  To further add to the confusion, Industrial/Organizational psychology refers to outcome variables as the criterion; for a pre-employment test, the criterion is typically Job Performance at a later time.

This dichotomy also leads to some interesting thoughts about the nature of your construct. If you have a criterion-referenced score, you are assuming that the construct is concrete enough that anybody can make interpretations regarding it, such as mastering a certain percentage of content. This is why non-concrete constructs like personality tend to be only norm-referenced. There is no agreed-upon blueprint of personality.

Multidimensional Scaling

An advanced topic worth mentioning is multidimensional scaling (see Davison, 1998). The purpose of multidimensional scaling is similar to factor analysis (a later discussion!) in that it is designed to evaluate the underlying structure of constructs and how they are represented in items. This is therefore useful if you are working with constructs that are brand new, so that little is known about them, and you think they might be multidimensional. This is a pretty small percentage of the tests out there in the world; I encountered the topic in my first year of graduate school – only because I was in a Psychological Scaling course – and have not encountered it since.

Summary of scaling

Scaling is the process of defining the scale on which your measurements will take place. It raises fundamental questions about the nature of the construct. Fortunately, in many cases we are dealing with a simple construct that has well-defined content, like an anatomy course for first-year medical students. Because it is so well-defined, we often take criterion-referenced score interpretations at face value. But as constructs become more complex, like job performance of a first-year resident, it becomes harder to define the scale, and we start to deal more in relatives than absolutes. At the other end of the spectrum are completely ephemeral constructs where researchers still can’t agree on the nature of the construct and we are pretty much limited to z-scores. Intelligence is a good example of this.

Some sources attempt to delineate the scaling of people and the scaling of items or stimuli as separate things, but this is really impossible, as they are confounded: people define item statistics (the percent of people that get an item correct) and items define people’s scores (the percent of items a person gets correct). It is for this reason that IRT, the most advanced paradigm in measurement theory, was designed to place items and people on the same scale. It is also for this reason that item writing should consider how items will be scored and therefore lead to person scores. But because we start writing items long before the test is administered, and the nature of the construct is caught up in the scale, the issues presented here need to be addressed at the very beginning of the test development cycle.


SIFT: Software for Investigating Test Fraud

Test fraud is an extremely common occurrence.  We’ve all seen articles like this one.  However, there are very few defensible tools to help detect it.  I once saw a webinar from an online testing provider that proudly touted their reports on test security… but it turned out that all they provided was a simple export of student answers that you could subjectively read and form conjectures about.  The goal of SIFT is to provide a tool that implements real statistical indices from the corpus of scientific research on statistical detection of test fraud, yet is user-friendly enough to be used by someone without a Ph.D. in psychometrics and experience in data forensics.  SIFT provides more collusion indices and other analyses than any other software on the planet, making it the standard in the industry from the day of its release.  The science behind SIFT is also being implemented in our world-class online testing platform, FastTest.  It is also worth noting that FastTest supports computerized adaptive testing, which is known to increase test security.

Interested?  Download a free trial version of SIFT!

What is Test Fraud?

As long as tests have been around, people have been trying to cheat them.  This is only natural; anytime there is a system with some sort of stakes/incentive involved (and maybe even when not), people will try to game that system.  Note that the root culprit is the system itself, not the test.  Blaming the test is just shooting the messenger.  However, in most cases, the system serves a useful purpose.  In the realm of assessment, that means that K12 assessments provide useful information on curriculum and teachers, certification tests identify qualified professionals, and so on.  In such cases, we must minimize the amount of test fraud in order to preserve the integrity of the system.

When it comes to test fraud, the old cliche is true: an ounce of prevention is worth a pound of cure.  You’ll undoubtedly see that phrase at conferences and in other resources.  So I of course recommend that your organization implement reasonable preventative measures to deter test fraud.  Nevertheless, there will always be some cases; SIFT is intended to help find those.  Some examinees might also be deterred by the knowledge that such analysis is being done.

How can SIFT help me with statistical detection of test fraud?

Like other psychometric software, SIFT does not interpret results for you.  For example, software for item analysis like Iteman and Xcalibre do not specifically tell you which items to retire or revise, or how to revise them.  But they provide the output necessary for a practitioner to do so.  SIFT provides you a wide range of output that can help you find different types of test fraud, like copying, proctor help, suspect test centers, brain dump usage, etc.  It can also help find other issues, like low examinee motivation.  But YOU have to decide what is important to you regarding statistical detection of test fraud, and look for relevant evidence.  More information on this is provided in the manual, but here is a glimpse.

Types of data forensics

First, there are a number of intra-individual indices to evaluate.  Consider the third examinee here: they took less than half the time of most examinees, had a very low score, and were flagged for answering Option 4 too often… likely a case of a student giving up and answering D for most of the test.


A certification organization could use SIFT to look for evidence of brain dump makers and takers by evaluating similarity between examinee response vectors and answers from a brain dump site – especially if those were intentionally seeded by the organization!  We also might want to find adjacent examinees or examinees in the same location that group together in the collusion index output.  Unfortunately, these indices can differ substantially in their conclusions.
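As a simple illustration of the response-vector idea, here is a sketch that screens examinees against a seeded answer key; it is a descriptive screen of my own for illustration, not one of SIFT’s collusion indices.

```python
import numpy as np

def dump_similarity(responses, seeded_key):
    """Proportion of each examinee's answers that match a seeded
    brain-dump key (examinees x items vs. a single key vector)."""
    responses = np.asarray(responses)
    return (responses == np.asarray(seeded_key)).mean(axis=1)

responses = [list("ABDCA"), list("BCDDA"), list("BCDDB")]
seeded_key = list("BCDDB")                     # answers planted on the dump site
print(dump_similarity(responses, seeded_key))  # [0.2, 0.8, 1.0]
```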

 


 

Finally, we can roll up many of these statistics to the group level.  Below is an example that provides a portion of SIFT output regarding teachers.  Note that Gutierrez has suspiciously high scores without spending much more time.  Cheating?  Possibly.  On the other hand, that is the smallest N, so perhaps the teacher just had a group of accelerated students.  Worthington, on the other hand, also had high scores but notably shorter times – perhaps the teacher was helping?


These are only the descriptive statistics – this doesn’t even touch on the collusion indices yet!

Still interested?  Download or purchase SIFT here.

The Story of SIFT

I started SIFT in 2012.  Years ago, ASC sold a software program called Scrutiny!  We had to stop selling it because it did not work on recent versions of Windows, but we still received inquiries for it.  So I set out to develop a program that could perform the analysis from Scrutiny! (the Bellezza & Bellezza index) but also much more.  I quickly finished a few collusion indices and planned to publish SIFT in March 2013, as my wife and I were expecting our first child on March 25.  Alas, he arrived a full month early and all plans went out the window!  Then unfortunately I had to spend a few years dealing with the realities of business, wasting hundreds of hours in pointless meetings and other pitfalls.  I finally set a goal to release SIFT before the second child arrived in July 2016.  I unfortunately failed at that too, but the delay this time was 3 weeks, not 3 years.  Whew!

Version 1.0 of SIFT includes 10 collusion indices (5 probabilistic, 5 descriptive), response time analysis, group level analysis, and much more to aid in the statistical detection of test fraud.  This is obviously not an exhaustive list of the analyses from the literature, but still far surpasses other options for the practitioner, including the choice to write all your own code.  Suggestions?  I’d love to hear them.  Email me at nthompson@54.89.150.95.
