ASWB lies about DIF & DTF

Recently, the Association of Social Work Boards posted an article of misinformation about differential item functioning (whether an item on a test is biased) as compared to differential test functioning (whether the entire test is biased). I’m going to go over what is inaccurate about their blog post and why those misrepresentations matter.

First, the blog post is authored by a marketer, not a methodologist. As is common practice for ASWB, marketers and managers are using psychometric terminology to mislead social workers and regulators. A methodologist might have cited recent sources to talk about currently accepted psychometric practices. Of course, that would require understanding that it is unethical for Bobbie Hartman, the Marketing and Content Strategy Manager, to be speaking authoritatively about psychometrics.

This is a consistent theme for ASWB–their officers testify and make public statements about psychometrics but do not actually perform or receive enough training in psychometrics to make competent statements. Because ASWB outsources psychometrics to contractors, social workers asking about psychometrics end up with unsatisfying answers like “we’ll check with our psychometricians” and “our psychometricians assure us that DIF is a robust process” that are meaningless without data and procedure to objectively evaluate.

DTF is not about item elimination

The first section of the blog post badly misrepresents the purpose of DTF analysis, again due to the author’s lack of competence to write about the topic and their organizational self-interest in misrepresenting psychometrics. The purpose of DTF is to identify multivariate properties of the entire examination, whereas DIF provides multivariate properties of individual items. Bobbie could have read the standards she cites to find this definition:

Differential test functioning (DTF) refers to differences in the functioning of tests (or sets of items) for different specially defined groups. When DTF occurs, individuals from different groups who have the same standing on the characteristic assessed by the test do not have the same expected test score.

Standards for Educational and Psychological Testing (page 51)

One could use the nearly 30-year-old approach cited by the author (Raju, 1995) to engage in item elimination. Indeed, Chalmers et al. (2016) revised and updated the DTF approach used by Raju. Let’s see what they say about DIF vs. DTF:

It is also possible, however, to obtain nontrivial DTF in applications where little to no DIF effects have been detected. Meaningful DTF can occur in testing situations where DIF analyses suggest that no individual item appears to demonstrate a large amount of DIF. Specifically, substantial DTF can occur when the freely estimated parameters systematically favor one group over another. The aggregate of these small and individually insignificant item differences can become quite substantial at the test level, and in turn bias the overall test in favor of one population over another. Therefore, studying DTF in isolation and in conjunction with DIF analyses can be a meaningful and informative endeavor for test evaluators.

Chalmers, Counsell, & Flora, 2016 (page 118)
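To make that aggregation point concrete, here is a minimal sketch of how tiny item-level differences can add up at the test level. Everything in it is an assumption for illustration: the 150-item length, the two-parameter logistic (2PL) model, the item parameters, the 0.1-logit shift, and the equally weighted ability grid are all made up, and none of it is ASWB data or the exact estimator from Chalmers et al. It only shows the arithmetic behind the quote.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: chance of a correct answer at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_score(theta, items):
    """Test characteristic curve: expected number-correct score at ability theta."""
    return sum(p_correct(theta, a, b) for a, b in items)

# Hypothetical 150-item exam (illustrative parameters, NOT ASWB data).
# Each item is an (a, b) pair: discrimination 1.0, difficulty from -0.5 to 0.5.
N_ITEMS = 150
reference_items = [(1.0, -0.5 + 0.1 * (i % 11)) for i in range(N_ITEMS)]

# Assumption: every item is just 0.1 logits harder for the focal group.
DIF_SHIFT = 0.1
focal_items = [(a, b + DIF_SHIFT) for a, b in reference_items]

thetas = [-2.0 + 0.5 * k for k in range(9)]  # ability grid from -2 to 2

# Largest gap in correct-answer probability on any single item, at any theta.
max_item_gap = max(
    p_correct(t, a, b) - p_correct(t, a_f, b_f)
    for t in thetas
    for (a, b), (a_f, b_f) in zip(reference_items, focal_items)
)

# Gap in expected total score for test-takers of identical ability.
score_gaps = [
    expected_score(t, reference_items) - expected_score(t, focal_items)
    for t in thetas
]
gap_at_average_ability = score_gaps[thetas.index(0.0)]

# Crude signed DTF: the mean gap over the grid (equal weights, an assumption).
signed_dtf = sum(score_gaps) / len(score_gaps)

print(f"largest single-item gap: {max_item_gap:.3f}")
print(f"expected score gap at theta = 0: {gap_at_average_ability:.2f} points")
print(f"signed DTF over the grid: {signed_dtf:.2f} points")
```

With these made-up numbers, no single item differs by even three percentage points, yet two test-takers of identical average ability end up more than three points apart in expected score. An item-by-item screen could wave every one of those items through.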

Perhaps if ASWB had bothered to update themselves on Item Response Theory scholarship from this century, they would know that DTF is important because it answers a different question from item-level functioning…and one that ASWB (because of their incompetence and self-interest) refuses to recognize…that differential functioning (e.g., bias) can happen at the item level, content area level, or at the whole-exam level because, as ASWB says, “different types of biases should be evaluated independently of one another because they are not necessarily related.” Right on, ASWB. Now do the work!

Looking for differential functioning only at the item level assumes that each item is independent of the last one. It does not investigate patterns at the content area or subset level. Questions on child welfare, social work theory, policing, supervision, and other hot topics on the exam may carry different meanings across groups because test-takers bring differing perspectives to them. Looking only item-by-item would miss patterns that emerge from relationships between questions. Using only DIF data would lead ASWB to mistakenly conclude they have an unbiased examination when in reality, there may be substantial DTF for clinicians who are older, not white, or English language learners.

ASWB prevents researchers from investigating DTF

To be fair, there are citations from the 2000s in the blog post. Indeed…the cited studies develop and test procedures for investigating DTF using Item Response Theory and Confirmatory Factor Analysis. These are approaches to exam bias that ASWB refuses to perform! Yet, they cite the studies that established the standards for effect sizes that determine whether a test is biased.

Cruelly and hilariously, they do so because “DIF does not typically favor one examinee group consistently.” Does ASWB’s examination fall into the typical case, or does it fail the test? Of course, it is impossible to know. ASWB cites, but does not perform, the DTF tests from Nye (CFA methods) and Stark (IRT methods). The methods section is there for a reason, ASWB! I’m pretty sure ASWB has the money to pay someone competent to perform the analysis.

Although they recently released a Request for Proposals to investigate ASWB’s exam data, ASWB suggests researchers investigating exam bias “address correlating external [emphasis added] variables that may influence the disparities in the licensing exam pass rate data. Such variables could include upstream [emphasis added] factors such as differences in education programs; considerations of intersectionality, including age, gender, race, health, socioeconomic status; and social determinants of health, including life experiences from early childhood to post-graduate.”

Missing from these exculpatory hypotheses is the actual psychometric functioning of the examination. Indeed, the areas of focus only welcome studies that investigate “pipeline” and “upstream” factors, not problems with the psychometric properties of the examination. Weird, because the ASWB’s examination guidebook already states that external factors are the reason for different test scores across groups–not a broken examination.

ASWB works to ensure the fairness of each of its exam questions but acknowledges that there may be differences in exam performance outcomes for members of different demographic groups because exam performance is influenced by many factors external [emphasis added] to the exams. ASWB has committed to contributing to the conversation around diversity, equity, and inclusion by investing in a robust analysis of examination pass rate data.

ASWB Examination Guidebook (01/2023) pg. 12

It certainly sounds like ASWB would like to fund studies that use its data to investigate the statement it already uses to explain exam bias data to test-takers. Would researchers investigating DTF be able to kludge their proposal in under bullet #1 below? I dunno. I guess they could try…

  • Variables associated with the results reported in the 2022 ASWB Examination Pass Rate Analysis
  • The impact of licensure on the social work profession
  • Supervision’s role in social work licensure
  • Professional practice standards
  • Electronic practice
  • Regulatory enforcement

Without a measurement equivalence study that investigates the multivariate properties of the exam, the statements ASWB makes about the quality of the examination will continue to rest on the best guesses of the exam developer, rather than the test’s actual performance in the real world. I’m not optimistic that this analysis will be performed. The RFP is administered by…ASWB. And were any psychometrics researchers to get the data, ASWB retains final say over any publication created using their data (ASWB, “Methods of Operation” 7.14 Research Support pt. #7).

DTF analysis is required. No, really, read page 52

ASWB publicly lies about the definition of bias. They maintain that the Testing Standards that everyone uses define bias as Differential Item Functioning. Here are a few examples of ASWB lying about that:

“Protocols for standardized testing require that bias be accounted for throughout the exam development process at the individual test question level. It is not the final pass rate data that is used to identify bias in exams.”

email from Jacqueline Braxton, MSW LCSW Licensed Examination Development Project Coordinator, to a test-taker.

“ASWB uses a testing industry statistical measurement called Differential Item Functioning (DIF). DIF indicates whether an exam question shows tendencies to advantage or disadvantage one group of test-takers over another (ASWB, 2020). DIF is identified by statistically analyzing responses to the exam questions—called items—during pretesting. Scored items are continually monitored for DIF. On an annual basis, less than 5% of all items released show DIF. Items flagged for DIF are removed from the bank of potential exam questions.”

Stacey Owens in the New Social Worker.

And here is how bias is actually defined in the testing standards. (Note how ASWB didn’t actually quote the standards in their blog post!)

The term predictive bias may be used when evidence is found that differences exist in the patterns of associations between test scores and other variables for different groups, bringing with it concerns about bias in the inferences drawn from the use of test scores.

Standards for Educational and Psychological Testing (pages 51-52)

Clearly, the Standards define bias as differences in test scores, not individual exam items. Now that we understand that ASWB lies constantly about the conceptual definition of bias, we can understand why their bias detection methodology is similarly harebrained. When ASWB states in their blog post that DTF analysis is not required, they are not telling the whole truth. Let’s read together!

First, here is how the standards define “credible evidence [indicating] potential bias in measurement.” It is one of three factors: (a) “inconsistent item meaning across groups,” (b) Differential Item Functioning, and (c) Differential Test Functioning. Next, the standards state that when credible evidence of measurement bias and predictive bias exists, these three sources (a-c) must be investigated independently. So far, there is no mandate for DTF analysis.

ASWB restates this…but leaves off the next sentence…let’s see why!

The presence or absence of one form of such bias may have no relationship with other forms of bias. For example, a predictor test may show no significant levels of DIF, yet show group differences in regression lines in predicting a criterion.

Standards for Educational and Psychological Testing (page 52)

“Regression lines predicting a criterion” refers to differential test functioning predicting the criterion: social work competence. The standards state that differential test functioning can exist without differential item functioning, and that is why they need to be investigated separately.
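For readers who want to see what “group differences in regression lines” looks like, here is a minimal sketch. All of it is assumed for illustration: the scores, the criterion ratings, the two groups, and the helper `fit_line` are hypothetical, and no real exam or competence data is involved. It fits a separate least-squares line per group and compares the predictions at the same test score.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical test scores, shared by both groups (NOT real exam data).
scores = list(range(60, 141, 10))

# Small alternating wiggle so the points do not sit exactly on a line.
wiggle = [0.2 if i % 2 == 0 else -0.2 for i in range(len(scores))]

# Hypothetical criterion ratings (say, a later competence rating).
# Assumption: group B is rated one point higher at every test score.
criterion_a = [3.0 + 0.05 * s + w for s, w in zip(scores, wiggle)]
criterion_b = [4.0 + 0.05 * s + w for s, w in zip(scores, wiggle)]

slope_a, intercept_a = fit_line(scores, criterion_a)
slope_b, intercept_b = fit_line(scores, criterion_b)

# Predicted criterion for two people with the SAME test score of 100.
predicted_a = intercept_a + slope_a * 100
predicted_b = intercept_b + slope_b * 100
prediction_gap = predicted_b - predicted_a

print(f"group A line: criterion = {intercept_a:.2f} + {slope_a:.3f} * score")
print(f"group B line: criterion = {intercept_b:.2f} + {slope_b:.3f} * score")
print(f"gap in predicted criterion at score 100: {prediction_gap:.2f}")
```

In this toy data the two lines have the same slope but different intercepts, so the same test score predicts a different criterion value depending on group. That is a test-score-level problem, and nothing in an item-by-item DIF screen is designed to catch it.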

ASWB does not agree! According to the marketer writing their blog post, the literature can be summarized thusly:

Although it is theoretically possible that DIF analyses may fail to identify some problematic items and small amounts of bias may accumulate to produce DTF, it is very unlikely that practically important DTF will result, because there is often high power to detect small magnitudes of DIF,

ASWB, Bobbie Hartman

As we read, this omits a large part of the truth. Contra ASWB, the standards foresee this exact circumstance: an entire test demonstrates bias while showing little bias at the item level. That context seems necessary for understanding why independently conducting DIF & DTF analyses would be recommended by the plain language of the standards.

Wait, I’m now remembering the first part of the blog post… ASWB said DIF and DTF are independent and need to be analyzed independently. Yet, the conclusion clearly states that because ASWB’s DIF approach is so good and DIF is rarely systematically biased, no DTF analysis needs to be performed. Did the first part of the blog post meet the second part?

Contrary to the incompetent writing at ASWB, the standards actually spell out what ASWB needs to do, now that it has finally bowed to decades of advocate pressure (while patting itself on the back as groundbreaking) and uncovered important evidence of problems with the entire exam.

Especially where credible evidence of potential bias exists, small sample methodologies should be considered. For example, potential bias for relevant subgroups may be examined through small-scale tryouts that use cognitive labs and/or interviews or focus groups to solicit evidence on the validity of interpretations made from the test scores

Standards for Educational and Psychological Testing (page 52)

ASWB is not performing any small-scale tryouts, using cognitive labs, or interviewing test-takers. While they are holding focus groups, those groups do not address the test-taking experience. Participants publicly report that they were directed not to talk about the exam bias report or issues of racism or ethnocentrism in the examination by the group facilitators. Instead, the focus groups address the broader social work journey through licensure. The focus groups are also facilitated by psychometric contractors employed by ASWB to consult on the examinations–a fact hidden from focus group participants until #StopASWB advocates complained.

DTF evaluates whether the test is biased…not which items are biased. ASWB does not want to perform a DTF analysis because it would test the hypothesis of whether the exam actually does what it says it does–assesses (fairly and impartially) the entry-level competence of social work practitioners. ASWB has the data necessary to perform this analysis–it is all contained in their descriptive report on examination bias. However, they will never commit to testing any hypotheses about the examination’s internal properties.

Right now, ASWB could also perform differential functioning analysis on the four content areas. Here is a screenshot of a failed examination report. Which content area displays the highest degree of differential functioning? We’ll never know.

Here is a study of Differential Test Functioning of the Praxis examination used to license teachers. It compares scores across components of the exam using data provided by the exam developers over five years. Yet, ASWB’s blog post makes it seem like no exam developer ever provides this information to the research community. Curious….

ASWB is unlikely to perform this analysis because it tests hypotheses that threaten its bottom line: that their exam tests a single construct, entry-level competence, fairly and accurately. Looking beyond item-by-item analysis opens the possibility that whole content areas of the examination, rather than individual exam items, would need to be removed due to invalidity and bias. Instead, ASWB seems content to throw up its hands and merely report the damage publicly while making vague gestures towards the need for greater social equity in society as a curative to its exam’s shoddy psychometrics.

ASWB: Chokehold capitalists

To perform a DTF analysis and test hypotheses on examination bias data using multivariate methods would create the real possibility that the examination is shown to be hopelessly biased and removed from use. There is no alternative to the ASWB exam, so removing the examination would grind licensure to a halt in every state.

ASWB calls looking only at the item-level for bias a “conservative approach”…which is apt, though not for the reasons they imply. It conserves the oppressive status quo by not exposing the multivariate properties of the entire examination (or its four separately-scored content areas) to differential functioning analysis.

ASWB has a financial stake in the status quo. Here are some facts:

