Dr. Frank Williams lies for money. His specialization is lying about psychometrics. He trades on his credentials and reputation to lie for the contractors who pay him.
Dr. Williams testified against examination reform in Maryland on behalf of his contractor, the Association of Social Work Boards. What was galling about his testimony was that he lied about the testing standards publishers like ASWB are supposed to adhere to.
Perhaps he thought no one would bother to read the standards he misstated? He was wrong!
The AERA, APA, and NCME (2014) Standards are cited by ASWB as its principal source of psychometrics in its examination guidebook. Additionally, the Standards themselves specify:
All professional test developers, sponsors, publishers, and users should make reasonable efforts to satisfy and follow the Standards and should encourage others to do so. All applicable standards should be met by all tests and in all test uses unless a sound professional reason is available to show why a standard is not relevant or technically feasible in a particular case (p. 1)
Dr. Williams did not encourage others to satisfy and follow the standards in his Maryland testimony (start recording at 4:05:07). Instead, he lied and said the standards I cited were actually only relevant for educational tests.
“What the gentleman said about the conditional standard educational [sic] measurement part…I deal with a lot of other clients too through the accreditation process…the standards he raised are more for educational testing.” [4:06:20]
Like much of Dr. Williams’ testimony, this claim is easy to refute by simply reading the applicable testing standards. They very obviously cover all psychometric tests, including educational tests, psychological tests, and licensure examinations, with specific standards and interpretations for credentialing and licensure examinations. Dr. Williams is either profoundly ignorant of his own area of expertise, which seems unlikely, or a paid liar.
In my testimony, I highlighted one example of how ASWB violates testing standards: it does not measure the conditional standard error of measurement at the cut score. I cited these standards; a sketch of the calculation they call for follows below:
Standard 2.14: Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score.
Standard 2.15: When there is credible evidence for expecting that conditional standard errors of measurement or test information functions will differ substantially for various subgroups, investigation of the extent and impact of such differences should be undertaken and reported as soon as is feasible… If differences are found, they should be clearly indicated in the appropriate documentation. In addition, if substantial differences do exist, the test content and scoring models should be examined to see if there are legally acceptable alternatives that do not result in such differences.
Standard 2.16: When a test score or composite score is used to make classification decisions (e.g., pass/fail, achievement levels), the standard error of measurement at or near the cut scores has important implications for the trustworthiness of these decisions.
Standard 3.6: Where credible evidence indicates that test scores may differ in meaning for relevant subgroups in the intended examinee population, test developers and/or users are responsible for examining the evidence for validity of score interpretations for intended uses for individuals from those subgroups…Subgroup mean differences do not in and of themselves indicate lack of fairness, but such differences should trigger follow-up studies, where feasible, to identify the potential causes of such differences…When sample sizes are sufficient, studies of score precision and accuracy for relevant subgroups also should be conducted.
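To make concrete what Standard 2.14 asks for: under item response theory, the conditional standard error of measurement is the reciprocal square root of the test information function, evaluated at any point on the ability scale, including the cut score. Here is a minimal sketch assuming a Rasch model; the item difficulties, form length, and cut score are all hypothetical, since I do not have access to ASWB’s actual parameters.

```python
import numpy as np

def rasch_test_information(theta, b):
    """Test information at ability theta under a Rasch model:
    item information is p*(1-p), summed across items."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return np.sum(p * (1.0 - p))

def csem(theta, b):
    """Conditional SEM on the theta scale: 1 / sqrt(test information)."""
    return 1.0 / np.sqrt(rasch_test_information(theta, b))

# Hypothetical 170-item form whose items sit well below the cut score.
rng = np.random.default_rng(0)
b = rng.normal(loc=-1.0, scale=1.0, size=170)  # illustrative difficulties
cut = 0.5                                       # illustrative cut score

for theta in (-1.0, 0.0, cut, 1.5):
    print(f"theta = {theta:+.2f}  CSEM = {csem(theta, b):.3f}")
```

On a form targeted like this one, precision is best where the items cluster and degrades toward the cut, which is exactly why Standard 2.14 demands that the error be reported in the vicinity of the cut score rather than as one overall average.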
In addition to dismissing these testing standards as applying mostly to educational tests, Dr. Williams talked about “good enough” reliability:
“When it comes to the reliability of most credential and licensure exams, what’s deemed good enough is just the overall reliability and reliability around the cut score…When I’m assisting clients go through their accreditation process, the accreditors, which a lot of times is the NCCA, what they’re looking for is a reliability, and from a statistical measure, we’re looking for something greater than .80 which is deemed as acceptable. However, when you look at ASWB’s analyses, there’s actually approaches 0.9 [sic]. So, although they don’t have, they are not showing the conditional standard errors of measurement, they actually are showing that their exam is reliable enough–what we would normally deem as good enough for an accreditation exam” [4:06:40]
Putting aside the nursing and teacher licensing examinations that do report the conditional standard error of measurement, Dr. Williams’ lie is not a matter of interpretation: the testing standards clearly state that one type of reliability is not substitutable for another.
Standard 2.6: A reliability or generalizability coefficient (or standard error) that addresses one kind of variability should not be interpreted as interchangeable with indices that address other kinds of variability, unless their definitions of measurement error can be considered equivalent…Error variances derived via item response theory are generally not equivalent to error variances estimated via other approaches.
Item response theory provides additional insight because it treats reliability/precision as conditional on the test-taker’s overall ability, whereas Cronbach’s alpha yields a single coefficient, as if precision were the same across ability levels and demographic groups.
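For contrast, here is a minimal sketch of Cronbach’s alpha and the single overall standard error it implies, computed on simulated responses; the examinee abilities, item difficulties, and sample size are all hypothetical.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (examinees x items) matrix of 0/1 scores."""
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

# Simulated responses: 2,000 hypothetical examinees, 170 Rasch items.
rng = np.random.default_rng(2)
theta = rng.normal(0.0, 1.0, size=(2000, 1))  # illustrative abilities
b = rng.normal(-1.0, 1.0, size=(1, 170))      # illustrative difficulties
p = 1.0 / (1.0 + np.exp(-(theta - b)))
x = (rng.random(p.shape) < p).astype(float)

alpha = cronbach_alpha(x)
overall_sem = x.sum(axis=1).std(ddof=1) * np.sqrt(1.0 - alpha)
print(f"alpha = {alpha:.3f}, overall SEM = {overall_sem:.2f} raw-score points")
```

The point: one alpha and one standard error get reported for everyone, regardless of where an examinee sits relative to the cut score, which is precisely the substitution Standard 2.6 warns against.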
You know who agrees with me? ASWB’s previous psychometricians–who, by the way, happened to be social workers:
For licensing examinations like ASWB’s (where all candidates who pass are considered competent and candidates who fail are considered incompetent), the decision consistency in pass/fail decisions is considered a more appropriate form of reliability than the traditional classical concept of reliability, the Kuder-Richardson Formula 20…The ASWB examinations have shown high reliability estimates, in the nineties, both by the preferred advanced IRT model (decision consistency in pass/fail decisions) and the less relevant classical standards (KR-20, test reliability measure as shown by its internal consistency) (Marson et al., 2011, p. 89).
That’s right! We used more advanced reliability and precision tools in the 1990s than we do now. Perhaps it is because of the deficient consulting of PSI and Dr. Williams that ASWB decided in 2014 to switch its reliability approach to the very statistic its own psychometricians had deemed less relevant and appropriate in the 1990s!
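Decision consistency itself is easy to illustrate. Below is a minimal simulation comparing two parallel administrations; the ability distribution, cut score, and the constant error term standing in for the conditional SEM near the cut are all hypothetical.

```python
import numpy as np

# Simulate two parallel administrations and count consistent pass/fail calls.
rng = np.random.default_rng(1)
n = 100_000
true_theta = rng.normal(0.0, 1.0, size=n)  # hypothetical ability distribution
cut = 0.5                                   # hypothetical cut score
error_sd = 0.45                             # stand-in for the CSEM near the cut

form1 = true_theta + rng.normal(0.0, error_sd, size=n)
form2 = true_theta + rng.normal(0.0, error_sd, size=n)
consistency = ((form1 >= cut) == (form2 >= cut)).mean()
print(f"decision consistency: {consistency:.3f}")
```

Even with respectable-looking overall reliability, candidates near the cut flip between pass and fail across administrations, which is why Marson et al. treated decision consistency, not internal consistency, as the more appropriate form of reliability for a pass/fail examination.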
Unfortunately, there is no penalty for lying in testimony. Someone could contact the Association of Test Publishers and tell them their Certification and Licensure Division chair is out there lying about psychometric standards, but I get the impression that the psychometrics industry does a lot of lying for money.
So, if you want to know why licensure exams are so broken, it’s because of consultants like Dr. Williams who knowingly discard required psychometric tools when building licensure exams. Psychometricians went through painstaking effort to specify exactly what developers and boards need to measure (and why). That work is social justice work, and it is essential to enabling evidence-based practice.
Test-makers simply ignore the analyses that conflict with their financial interests. If ASWB actually performed the required analyses, it might find a biased cut score. That result would immediately remove the legal defensibility of licensure decisions made with the exam, and with it the ability of states to license social workers using the ASWB exam.
These problems persist because there is no inherent check in the system to ensure compliance with psychometric best practices. Institutional inertia keeps the licensure system moving.
It is a myth that all standardized measures are biased. The 2014 Standards specifically sought to properly unite fairness, validity, and reliability using newer psychometric tools that ASWB itself previously applied. These psychometricians and many social workers have spent lifetimes working on tools to measure people more accurately for safe, culturally responsive, evidence-based practice.
That work was spit on yesterday.