References

AAMC. (n.d.). AAMC PREview Professional Readiness Exam. https://students-residents.aamc.org/aamc-preview/aamc-previewprofessional-readiness-exam

Abrams, Z. (2024). Addressing equity and ethics in artificial intelligence. Monitor on Psychology, 55(3), 24–29. https://www.apa.org/monitor/2024/04/addressing-equity-ethics-artificial-intelligence

Abyaa, A., Khalidi Idrissi, M., & Bennani, S. (2019). Learner modelling: Systematic review of the literature from the last 5 years. Educational Technology Research and Development, 67, 1105–1143.

Acar, S. (2023). Creativity assessment, research, and practice in the age of artificial intelligence. Creativity Research Journal, 1–7. Advance online publication. https://doi.org/10.1080/10400419.2023.2271749

Acuity Insights. (n.d.). What is Casper? https://acuityinsights.app/casper/

Acuity Insights. (2023). Casper technical manual. https://acuityinsights.com/casper-technical-manual/

Adesope, O. O., Trevisan, D. A., & Sundararajan, N. (2017). Rethinking the use of tests: A meta-analysis of practice testing. Review of Educational Research, 87(3), 659–701. https://doi.org/10.3102/0034654316689306

Agrawal, A., Gans, J., & Goldfarb, A. (2022). Power and prediction: The disruptive economics of artificial intelligence. Harvard Business Review Press.

Aguilar, S. J., Karabenick, S. A., Teasley, S. D., & Baek, C. (2021). Associations between learning analytics dashboard exposure and motivation and self-regulated learning. Computers & Education, 162, Article 104085. https://doi.org/10.1016/j.compedu.2020.104085

Ahn, T., Arcidiacono, P., Hopson, A., & Thomas, J. R. (2019). Equilibrium grade inflation with implications for female interest in STEM majors (Working Paper 26556). National Bureau of Economic Research. https://doi.org/10.3386/w26556

Alan, S., Boneva, T., & Ertac, S. (2019). Ever failed, try again, succeed better: Results from a randomized educational intervention on grit. The Quarterly Journal of Economics, 134(3), 1121–1162. https://doi.org/10.1093/qje/qjz006

Ali, U. S., & van Rijn, P. W. (2016). An evaluation of different statistical targets for assembling parallel forms in item response theory. Applied Psychological Measurement, 40(3), 163–179. https://doi.org/10.1177/0146621615613308

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. American Educational Research Association.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.

American Psychological Association. (2018). Top 20 principles from psychology for preK-12 teaching and learning: Coalition for psychology in schools and education. https://www.apa.org/ed/schools/teaching-learning/top-twenty-principles.pdf

Association of Test Publishers. (2022). Guidelines for technology-based assessment. https://www.testpublishers.org/assets/TBA%20Guidelines%203-14-2022%20draft%20numbered.pdf

Attali, Y., & van der Kleij, F. (2017). Effects of feedback elaboration and feedback timing during computer-based practice in mathematics problem solving. Computers & Education, 110, 154–169. https://doi.org/10.1016/j.compedu.2017.03.012

Attali, Y., Runge, A., LaFlair, G. T., Yancey, K., Goodwin, S., Park, Y., & von Davier, A. A. (2022). The interactive reading task: Transformer-based automatic item generation. Frontiers in Artificial Intelligence, 5, Article 903077. https://doi.org/10.3389/frai.2022.903077

Autor, D. H., Levy, F., & Murnane, R. J. (2003). The skill content of recent technological change: An empirical exploration. The Quarterly Journal of Economics, 118(4), 1279–1333. https://doi.org/10.1162/003355303322552801

Autor, D., Chin, C., Salomons, A., & Seegmiller, B. (2024). New frontiers: The origins and content of new work, 1940–2018. The Quarterly Journal of Economics. Advance online publication. https://doi.org/10.1093/qje/qjae008

Azevedo, R., & Bernard, R. M. (1995). The effects of computer-presented feedback on learning from computer-based instruction: A meta-analysis. Journal of Educational Computing Research, 13(2), 111–127. https://doi.org/10.2190/9LMD-3U28-3A0G-FTQT

Bailey, T., Jeong, D. W., & Cho, S. W. (2010). Referral, enrollment, and completion in developmental education sequences in community colleges. Economics of Education Review, 29(2), 255–270. https://doi.org/10.1016/j.econedurev.2009.09.002

Baker, R. S. J. d., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1(1), 3–17. https://doi.org/10.5281/zenodo.3554657

Bangert-Drowns, R. L., Kulik, C. L. C., Kulik, J. A., & Morgan, M. (1991). The instructional effect of feedback in test-like events. Review of Educational Research, 61(2), 213–238. https://doi.org/10.3102/00346543061002213

Bauer, M. S., Damschroder, L., Hagedorn, H., Smith, J., & Kilbourne, A. M. (2015). An introduction to implementation science for the non-specialist. BMC Psychology, 3(32), 1–12. https://doi.org/10.1186/s40359-015-0089-9

Bejar, I. I., Lawless, R. R., Morley, M. E., Wagner, M. E., Bennett, R. E., & Revuelta, J. (2002). A feasibility study of on-the-fly item generation in adaptive testing (Research Report No. RR-02-03). ETS. https://doi.org/10.1002/j.2333-8504.2002.tb01890.x

Bennett, R. E. (1993). On the meanings of constructed response. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 1–27). Lawrence Erlbaum Associates.

Bennett, R. E. (1998). Reinventing assessment: Speculations on the future of large-scale educational testing (Policy Information Perspective). ETS. http://www.ets.org/Media/Research/pdf/PICREINVENT.pdf

Bennett, R. E. (2011). Formative assessment: A critical review. Assessment in Education: Principles, Policy & Practice, 18(1), 5–25. https://doi.org/10.1080/0969594X.2010.513678

Bennett, R. E. (2023). Toward a theory of socioculturally responsive assessment. Educational Assessment, 28(2), 83–104. https://doi.org/10.1080/10627197.2023.2202312

Berman, A. I., Feuer, M. J., & Pellegrino, J. W. (2019). What use is educational assessment? The Annals of the American Academy of Political and Social Science, 683(1), 8–20. https://doi.org/10.1177/0002716219843871

Bernacki, M. L. (2018). Examining the cyclical, loosely sequenced, and contingent features of self-regulated learning: Trace data and their analysis. In D. H. Schunk & J. A. Greene (Eds.), Handbook of self-regulation of learning and performance (2nd ed., pp. 370–387). Routledge. https://doi.org/10.4324/9781315697048-24

Bicknell, K., Brust, C., & Settles, B. (2023, February 5). How Duolingo’s AI learns what you need to learn. IEEE Spectrum. https://spectrum.ieee.org/duolingo

Biddle, D. A., & Nooren, P. M. (2006). Validity generalization vs. Title VII: Can employers successfully defend tests without conducting local validation studies? Labor Law Journal, 57, 216–237. https://testgenius.com/articles/validity-generalization.pdf

Bjork, E. L., & Bjork, R. A. (2011). Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. In M. A. Gernsbacher, R. W. Pew, L. M. Hough, & J. R. Pomerantz (Eds.), Psychology and the real world: Essays illustrating fundamental contributions to society (pp. 56–64). Worth Publishers.

Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7–74. https://doi.org/10.1080/0969595980050102

Blackman, R., & Ammanath, B. (2022, March 21). Ethics and AI: 3 conversations companies need to have. Harvard Business Review. https://hbr.org/2022/03/ethics-and-ai-3-conversations-companies-need-to-be-having

Bloom, B. S. (1984). The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13(6), 4–16. https://doi.org/10.3102/0013189X013006004

Bolsinova, M., Deonovic, B., Arieli-Attali, M., Burr, S., Hagiwara, M., & Maris, G. (2022). Measurement of ability in adaptive learning and assessment systems when learners use on-demand hints. Applied Psychological Measurement, 46(3), 219–235. https://doi.org/10.1177/01466216221084208

Bradley, M. (1975). Scientific education versus military training: The influence of Napoleon Bonaparte on the Ecole Polytechnique. Annals of Science, 32(5), 415–449. https://doi.org/10.1080/00033797500200381

Bratsberg, B., & Rogeberg, O. (2018). Flynn effect and its reversal are both environmentally caused. Proceedings of the National Academy of Sciences, 115(26), 6674–6678. https://doi.org/10.1073/pnas.1718793115

Bresnahan, T. (2010). General purpose technologies. In B. H. Hall & N. Rosenberg (Eds.), Handbook of the economics of innovation (Vol. 2, pp. 761–791). https://doi.org/10.1016/S0169-7218(10)02002-2

Brookhart, S., Stiggins, R., McTighe, J., & Wiliam, D. (2020). The future of assessment practices: Comprehensive and balanced assessment systems. Learning Sciences International. https://testing123.education.mn.gov/cs/groups/communications/documents/document/mdaw/mdaw/~edisp/000231.pdf

Buckley, J., Colosimo, L., Kantar, R., McCall, M., & Snow, E. (2021). Game-based assessment for education. In OECD digital education outlook 2021: Pushing the frontiers with artificial intelligence, blockchain and robots (pp. 195–208). OECD. https://read.oecd-ilibrary.org/education/oecd-digital-education-outlook-2021_9289cbfd-en#page1

Bull, S., & Kay, J. (2016). SMILI☺: A framework for interfaces to learning data in open learner models, learning analytics and related fields. International Journal of Artificial Intelligence in Education, 26, 293–331. https://doi.org/10.1007/s40593-015-0090-8

Burning Glass Technologies. (2019). Mapping the genome of jobs: The Burning Glass skills taxonomy [White paper]. https://www.voced.edu.au/content/ngv%3A84406

Burrus, J., Rikoon, S. H., & Brenneman, M. W. (Eds.). (2022). Assessing competencies for social and emotional learning: Conceptualization, development, and applications. Routledge. https://doi.org/10.4324/9781003102243

BusinessWire. (2024). Carnegie Learning wins 2024 EdTech Award for MATHstream [Press release]. https://www.businesswire.com/news/home/20240327088407/en/Carnegie-Learning-Wins-2024-EdTech-Award-for-MATHstream

Buyse, T., & Lievens, F. (2011). Situational judgment tests as a new tool for dental student selection. Journal of Dental Education, 75(6), 743–749. https://doi.org/10.1002/j.0022-0337.2011.75.6.tb05101.x

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105. https://doi.org/10.1037/h0046016

cApStAn & Halleux, B. (2019). PISA 2021 translation and adaptation guidelines. OECD. https://www.oecd.org/pisa/pisaproducts/PISA-2022-Translation-and-Adaptation-Guidelines.pdf

Cao, M., Drasgow, F., & Cho, S. (2015). Developing ideal intermediate personality items for the ideal point model. Organizational Research Methods, 18(2), 252–275. https://doi.org/10.1177/1094428114555993

Casner-Lotto, J., & Barrington, L. (2006). Are they really ready to work? Employers’ perspectives on the basic knowledge and applied skills of new entrants to the 21st century US workforce. Partnership for 21st Century Skills.

Cattell, R. B. (1965). A biometrics invited paper. Factor analysis: An introduction to essentials I. The purpose and underlying models. Biometrics, 21(1), 190–215. https://doi.org/10.2307/2528364

Cattell, R. B., & Warburton, F. W. (1967). Objective personality and motivation tests: A theoretical introduction and practical compendium. University of Illinois Press.

Chakraborty, M., Tonmoy, T. I., Zaman, M., Gautam, S., Kumar, T., Sharma, K., Barman, N., Gupta, C., Jain, V., Chadha, A., Sheth, A., & Das, A. (2023). Counter Turing test (CT2): AI-generated text detection is not as easy as you may think—Introducing AI detectability index (ADI). In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 2206–2239). ACL. https://aclanthology.org/2023.emnlp-main.136/

Cengage. (2019, January 16). New survey: Demand for “uniquely human skills” increases even as technology and automation replace some jobs [Press release]. https://www.cengagegroup.com/news/press-releases/2019/new-survey-demand-for-uniquely-human-skills-increases-even-as-technology-and-automation-replace-some-jobs/

Chamorro-Premuzic, T. (2021, May 26). The problem with job interviews. Forbes. https://www.forbes.com/sites/tomaspremuzic/2021/05/26/the-problem-with-job-interviews/?sh=4292b1224dee

Chan, S., Somasundaran, S., Ghosh, D., & Zhao, M. (2022). AGReE: A system for generating automated grammar reading exercises. In W. Che & E. Shutova (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 169–177). ACL. https://aclanthology.org/2022.emnlp-demos.17/

Charness, G., Gneezy, U., & Henderson, A. (2018). Experimental methods: Measuring effort in economics experiments. Journal of Economic Behavior & Organization, 149, 74–87. https://doi.org/10.1016/j.jebo.2018.02.024

Chen, L., Feng, G., Joe, J., Leong, C.W., Kitchen, C., & Lee, C. M. (2014). Towards automated assessment of public speaking skills using multimodal cues. In ICMI ’14: Proceedings of the 16th International Conference on Multimodal Interaction (pp. 200–203). ACM. https://doi.org/10.1145/2663204.2663265

Chen, Y., Lee, Y.-H., & Li, X. (2022). Item pool quality control in educational testing: Change point model, compound risk, and sequential detection. Journal of Educational and Behavioral Statistics, 47(3), 322–352. https://doi.org/10.3102/10769986211059085

Cheng, K. H. C., Hui, C. H., & Cascio, W. F. (2017). Leniency bias in performance ratings: The Big-Five correlates. Frontiers in Psychology, 8, Article 521. https://doi.org/10.3389/fpsyg.2017.00521

Chernyshenko, O. S., Kankaraš, M., & Drasgow, F. (2018). Social and emotional skills for student success and well-being: Conceptual framework for the OECD study on social and emotional skills (OECD Education Working Paper No. 173). OECD. https://one.oecd.org/document/EDU/WKP(2018)9/En/pdf

Chetty, R., Deming, D. J., & Friedman, J. N. (2023). Diversifying society’s leaders? The determinants and causal effects of admission to highly selective private colleges (Working Paper No. 31492). National Bureau of Economic Research. https://doi.org/10.3386/w31492

Chi, M. T. H., & Wylie, R. (2014). The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist, 49(4), 219–243. https://doi.org/10.1080/00461520.2014.965823

Choi, I., Hao, J., Deane, P., & Zhang, M. (2021). Benchmark keystroke biometrics accuracy from high-stakes writing tasks (Research Report No. RR-21-15). ETS. https://doi.org/10.1002/ets2.12326

Chopade, P., Edwards, D., Khan, S. M., Andrade, A., & Pu, S. (2019, November). CPSX: Using AI-machine learning for mapping human-human interaction and measurement of CPS teamwork skills. In 2019 IEEE International Symposium on Technologies for Homeland Security (HST) (pp. 1–6). IEEE.

Christian, M. S., Edwards, B. D., & Bradley, J. C. (2010). Situational judgment tests: Constructs assessed and a meta-analysis of their criterion-related validities. Personnel Psychology, 63(1), 83–117. https://doi.org/10.1111/j.1744-6570.2009.01163.x

Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220. https://doi.org/10.1037/h0026256

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159. https://doi.org/10.1037/0033-2909.112.1.155

College Board. (2023, September 27). SAT suite: Everything you need to know about the Digital SAT. College Board Blog. https://blog.collegeboard.org/everything-you-need-know-about-digital-sat

Connelly, B. S., & Ones, D. S. (2010). An other perspective on personality: Meta-analytic integration of observers’ accuracy and predictive validity. Psychological Bulletin, 136(6), 1092–1122. https://doi.org/10.1037/a0021212

Cooper, W. H. (1981). Ubiquitous halo. Psychological Bulletin, 90(2), 218–244. https://doi.org/10.1037/0033-2909.90.2.218

Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4, 253–278. https://doi.org/10.1007/BF01099821

Cotra, A. (2023, August 29). Language models surprised us. Planned Obsolescence. https://www.planned-obsolescence.org/language-models-surprised-us/

Cox, C. B., Barron, L. G., Davis, W., & de la Garza, B. (2017). Using situational judgment tests (SJTs) in training: Development and evaluation of a structured, low-fidelity scenario-based training method. Personnel Review, 46(1), 36–45. https://doi.org/10.1108/PR-05-2015-0137

Cronbach, L. J. (1975). Five decades of public controversy over mental testing. American Psychologist, 30(1), 1–14. https://doi.org/10.1037/0003-066X.30.1.1

Darling-Hammond, L. (2001). Inequality in teaching and schooling: How opportunity is rationed to students of color in America. In B. D. Smedley, A. Y. Stith, L. Colburn, & C. H. Evans (Eds.), The right thing to do, the smart thing to do: Enhancing diversity in health professions—Summary of the Symposium on Diversity in Health Professions in Honor of Herbert W. Nickens, M. D. (pp. 208–233). National Academies Press. http://www.nap.edu/catalog/10186.html

Davey, T. (2023). Automated test assembly. In R. J. Tierney, F. Rizvi, & K. Ercikan (Eds.), International encyclopedia of education: Vol. 14. Quantitative research and educational measurement (pp. 201–208). Elsevier. https://doi.org/10.1016/B978-0-12-818630-5.10027-2

Davoli, M., & Entorf, H. (2018). The PISA shock, socioeconomic inequality, and school reforms in Germany (IZA Policy Paper No. 140). IZA – Institute of Labor Economics. https://docs.iza.org/pp140.pdf

De Boeck, P. (2023, July 25–28). Pervasive DIF and DIF detection bias [Paper presentation]. International Meeting of the Psychometric Society (IMPS 2023), University of Maryland, College Park, MD, United States.

De Boeck, P., & Cho, S.-J. (2021). Not all DIF is shaped similarly. Psychometrika, 86(3), 712–716. https://doi.org/10.1007/s11336-021-09772-3

Dell. (2018, January 30). 3,800 business leaders declare: It’s a tale of two futures. https://www.dell.com/en-us/perspectives/3800-business-leaders-declare-its-a-tale-of-two-futures/

Deming, D. J. (2017). The growing importance of social skills in the labor market. The Quarterly Journal of Economics, 132(4), 1593–1640. https://doi.org/10.1093/qje/qjx022

Deming, D. (2024, March 7). The worst way to do college admissions: Making standardized test scores optional has harmed the disadvantaged applicants it was intended to help. The Atlantic. https://theatlantic.com/ideas/archive/2024/03/standardized-testing-requirements-act-sat/677667/

Deming, D., & Kahn, L. B. (2018). Skill requirements across firms and labor markets: Evidence from job postings for professionals. Journal of Labor Economics, 36(S1), S337–S369. https://doi.org/10.1086/694106

Deonovic, B., Yudelson, M., Bolsinova, M., Attali, M., & Maris, G. (2018). Learning meets assessment. Behaviormetrika, 45(2), 457–474. https://doi.org/10.1007/s41237-018-0070-z

Diao, Q., & van der Linden, W. J. (2013). Integrating test-form formatting into automated test assembly. Applied Psychological Measurement, 37(5), 361–374. https://doi.org/10.1177/0146621613476157

Di Battista, A., Grayling, S., Hasselaar, E., Leopold, T., Li, R., Rayner, M., & Zahidi, S. (2023, May). Future of jobs report 2023. World Economic Forum. https://www.weforum.org/reports/the-future-of-jobs-report-2023

DiCerbo, K. (2024, March 7). How we built AI tutoring tools. Khan Academy Blog. https://blog.khanacademy.org/how-we-built-ai-tutoring-tools/

Dietrichson, J., Bøg, M., Filges, T., & Klint Jørgensen, A.-M. (2017). Academic interventions for elementary and middle school students with low socioeconomic status: A systematic review and meta-analysis. Review of Educational Research, 87(2), 243–282. https://doi.org/10.3102/0034654316687036

Dobrescu, L., Holden, R., Motta, A., Piccoli, A., Roberts, P., & Walker, S. (2021). Cultural context in standardized tests (Working Paper 2021-08). University of New South Wales Business School. https://doi.org/10.2139/ssrn.3983663

Duolingo Team. (2023, March 14). Introducing Duolingo Max, a learning experience powered by GPT-4. Duolingo Blog. https://blog.duolingo.com/duolingo-max/

Eberly Center. (n.d.). Learning principles: Theory and research-based principles of learning. Carnegie Mellon University. https://www.cmu.edu/teaching/principles/learning.html

Elliott, S.W. (2017). Computers and the future of skill demand. OECD. https://doi.org/10.1787/9789264284395-en

Eloundou, T., Manning, S., Mishkin, P., & Rock, D. (2023). GPTs are GPTs: An early look at the labor market impact potential of large language models. arXiv. https://arxiv.org/abs/2303.10130v4

Embretson, S. (1994). Applications of cognitive design systems to test development. In C. R. Reynolds (Ed.), Cognitive assessment: A multidisciplinary perspective (pp. 107–135). Springer.

Emerson, A., Houghton, P., Chen, K., Basheerabad, V., Ubale, R., & Leong, C. W. (2022). Predicting user confidence in video recordings with spatio-temporal multimodal analytics. In ICMI ’22 companion: Companion publication of the 2022 International Conference on Multimodal Interaction (pp. 98–104). ACM. https://doi.org/10.1145/3536220.3558007

Erwin, T. D., & Sebrell, K. W. (2003). Assessment of critical thinking: ETS’s tasks in critical thinking. Journal of General Education, 52(1), 50–70. https://doi.org/10.1353/jge.2003.0019

ETS. (n.d.). Demonstrate program effectiveness with the ETS® Major Field Tests. https://www.ets.org/mft.html

ETS. (2014). ETS standards for quality and fairness. https://ets.org/pdfs/about/standards-quality-fairness.pdf

ETS. (2022). ETS guidelines for developing fair tests and communications. https://www.ets.org/pdfs/about/fair-tests-and-communications.pdf

ETS. (2023a). ETS human progress study [Unpublished data set].

ETS. (2023b). Your at home testing. https://www.ets.org/gre/test-takers/general-test/register/at-home-testing.html

Falk, A., Becker, A., Dohmen, T., Enke, B., Huffman, D., & Sunde, U. (2018). Global evidence on economic preferences. The Quarterly Journal of Economics, 133(4), 1645–1692. https://doi.org/10.1093/qje/qjy013

Feuer, M. J. (2012). No country left behind: Rhetoric and reality of international large-scale assessment. ETS. http://www.ets.org/Media/Research/pdf/PICANG13.pdf

Feuer, M., Holland, P. W., Green, B. F., Bertenthal, M. W., & Hemphill, F. C. (Eds.). (1999). Uncommon measures: Equivalence and linkage among educational tests. National Academies Press. https://doi.org/10.17226/6332

Flanagan, C. (2021, July 22). The University of California is lying to us. The Atlantic. https://www.theatlantic.com/ideas/archive/2021/07/why-university-california-dropping-sat/619522/

Flynn, M. (2023, May 30). The soft skills “debate” is over. Forbes. https://www.forbes.com/sites/mariaflynn/2023/05/30/the-soft-skills-debate-is-over/?sh=5baa274b7308

Foster, N., & Piacentini, M. (Eds.). (2023). Innovating assessments to measure and support complex skills. OECD Publishing. https://doi.org/10.1787/e5f3e341-en

Frensch, P. A., & Funke, J. (1995). Complex problem solving: The European perspective. Routledge.

Frey, C. B., & Osborne, M. A. (2017). The future of employment: How susceptible are jobs to computerisation? Technological Forecasting and Social Change, 114, 254–280. https://doi.org/10.1016/j.techfore.2016.08.019

Friedland, N. S., Allen, P. G., Matthews, G., Witbrock, M., Baxter, D., Curtis, J., Shepard, B., Miraglia, P., Angele, J., Staab, S., Moench, E., Oppermann, H., Wenke, D., Israel, D., Chaudhri, V., Porter, B., Barker, K., Fan, J., Chaw, S., … Clark, P. (2004). Project Halo: Towards a digital Aristotle. AI Magazine, 25(4), 29–47. https://doi.org/10.1609/aimag.v25i4.1783

Fu, J., Kyllonen, P. C., & Tan, X. (2024). From Likert to forced choice: Statement parameter invariance and context effects in personality assessment. Measurement: Interdisciplinary Research and Perspectives. Advance online publication. https://doi.org/10.1080/15366367.2023.2258482

Fuchs, L. S., & Fuchs, D. (1986). Effects of systematic formative evaluation: A meta-analysis. Exceptional Children, 53(3), 199–208. https://doi.org/10.1177/001440298605300301

Fyfe, E. R., Borriello, G. A., & Merrick, M. (2023). A developmental perspective on feedback: How corrective feedback influences children’s literacy, mathematics, and problem solving. Educational Psychologist, 58(3), 130–145. https://doi.org/10.1080/00461520.2022.2108426

Fyfe, E. R., De Leeuw, J. R., Carvalho, P. F., Goldstone, R. L., Sherman, J., Admiraal, D., Alford, L.K., Bonner, A., Brassil, C. E., Brooks, C. A., Carbonetto, T., Chang, S. H., Cruz, L., Czymoniewicz-Klippel, Daniel, F., Driessen, M., Habashy, N., Hanson-Bradley, C. L., Hirt, E. R., … Motz, B. A. (2021). Many Classes 1: Assessing the generalizable effect of immediate feedback versus delayed feedback across many college classes. Advances in Methods and Practices in Psychological Science, 4(3), Article 25152459211027575. https://doi.org/10.1177/25152459211027575

Gao, L., Ghosh, D., & Gimpel, K. (2022). What makes a question inquisitive? A study on type-controlled inquisitive question generation. In V. Nastase, E. Pavlick, M. T. Pilehvar, J. Camacho-Collados, & A. Raganato (Eds.), Proceedings of the 11th Joint Conference on Lexical and Computational Semantics (pp. 240–257). ACL. https://doi.org/10.18653/v1/2022.starsem-1.22

Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2414–2423). IEEE. https://doi.org/10.1109/CVPR.2016.265

Geerlings, H., Glas, C. A. W., & van der Linden, W. J. (2011). Modeling rule-based item generation. Psychometrika, 76, 337–359. https://doi.org/10.1007/s11336-011-9204-x

Geiger, M., Bärwaldt, R., & Wilhelm, O. (2021). The good, the bad, and the clever: Faking ability as a socio-emotional ability? Journal of Intelligence, 9(1), 1–22. https://doi.org/10.3390/jintelligence9010013

Geisinger, K. F. (2011). The future of high-stakes testing in education. In J. A. Bovaird, K. F. Geisinger, & C.W. Buckendahl (Eds.), High-stakes testing in education: Science and practice in K–12 settings (pp. 231–248). American Psychological Association. https://doi.org/10.1037/12330-014

Gierl, M. J., & Haladyna, T. M. (Eds.). (2013). Automatic item generation: Theory and practice. Routledge.

Gil, Y., & Selman, B. (2019). A 20-year community roadmap for artificial intelligence research in the US. arXiv. https://doi.org/10.48550/arXiv.1908.02624

Glas, C. A. W., & van der Linden, W. J. (2001, June 2–4). Modeling variability in item parameters in CAT [Paper presentation]. North American Psychometric Society Meeting, King of Prussia, PA, United States.

Godwin, K. E., Almeda, M. V., Seltman, H., Kai, S., Skerbetz, M. D., Baker, R. S., & Fisher, A. V. (2016). Off-task behavior in elementary school children. Learning and Instruction, 44, 128–143. https://doi.org/10.1016/j.learninstruc.2016.04.003

Goldberg, B., & Sinatra, A. M. (2023). Generalized intelligent framework for tutoring (GIFT) SWOT analysis. In A. M. Sinatra, A. C. Graesser, X. Hu, G. Goodwin, & V. Rus (Eds.), Design recommendations for intelligent tutoring systems: Vol. 10. Strengths, weaknesses, opportunities and threats (SWOT) analysis of intelligent tutoring systems (pp. 9–26). U.S. Army Combat Capabilities Development Command—Soldier Center. https://gifttutoring.org/documents/163

Goodhart, C. A. E. (1984). Monetary theory and practice: The U.K. experience. Springer. https://doi.org/10.1007/978-1-349-17295-5

The Gordon Commission on the Future of Assessment in Education. (2013). To assess, to teach, to learn: A vision for the future of assessment. ETS. https://www.ets.org/Media/Research/pdf/gordon_commission_technical_report.pdf

Gosling, S. D., Augustine, A. A., Vazire, S., Holtzman, N., & Gaddis, S. (2011). Manifestations of personality in online social networks: Self-reported Facebook-related behaviors and observable profile information. Cyberpsychology, Behavior, and Social Networking, 14(9), 483–488. https://doi.org/10.1089/cyber.2010.0087

Gosling, S. D., Ko, S. J., Mannarelli, T., & Morris, M. E. (2002). A room with a cue: Personality judgments based on offices and bedrooms. Journal of Personality and Social Psychology, 82(3), 379–398. https://doi.org/10.1037/0022-3514.82.3.379

Graf, E. A., & Fife, J. H. (2012). Difficulty modeling and automatic generation of quantitative items: Recent advances and possible next steps. In M. J. Gierl & T. M. Haladyna (Eds.), Automatic item generation (pp. 157–178). Routledge.

Greiff, S., Gašević, D., & von Davier, A. (2017). Using process data for assessment in intelligent tutoring systems: A cognitive psychologist, psychometrician, and computer scientist perspective. In R. Sottilare, A. Graesser, X. Hu, & G. Goodwin (Eds.), Design recommendations for intelligent tutoring systems: Vol. 5. Assessment methods (pp. 171–179). U.S. Army Research Laboratory. https://gifttutoring.org/attachments/download/2410/Design%20Recommendations%20for%20ITS_Volume%205%20-%20Assessment_final_errata%20corrected.pdf

Grigorenko, E. L., & Sternberg, R. J. (1998). Dynamic testing. Psychological Bulletin, 124(1), 75–111. https://doi.org/10.1037/0033-2909.124.1.75

Grose, J. (2024, January 17). Don’t ditch standardized tests: Fix them. The New York Times. https://www.nytimes.com/2024/01/17/opinion/standardized-tests.html

Grossmann, I., Rotella, A., Sharpinskyi, K., Browne, D. T., & Fong, G. T. (2023). Insights into the accuracy of social scientists’ forecasts of societal change. Nature Human Behaviour, 7, 484–501. https://doi.org/10.1038/s41562-022-01517-1

Haberman, S. J., & Lee, Y.-H. (2017). A statistical procedure for testing unusually frequent exactly matching responses and nearly matching responses (Research Report No. RR-17-23). ETS. https://doi.org/10.1002/ets2.12150

Haberman, S. J., Lee, Y.-H., Papierman, P., Zhou, Y., & Subhedar, R. (2022). Systems and methods for detecting unusually frequent exactly matching and nearly matching test responses (U.S. Patent 11,398,161). U.S. Patent and Trademark Office. https://ppubs.uspto.gov/pubwebapp/external.html?q=(11398161).pn.&db=USPAT&type=ids

Hambleton, R. K. (2002). Adapting achievement tests into multiple languages for international assessments. In National Research Council (Ed.), Methodological advances in cross-national surveys of educational achievement (pp. 58–79). National Academies Press. https://nap.nationalacademies.org/read/10322/chapter/4

Hao, J., Liu, L., Kyllonen, P. C., Flor, M., & von Davier, A. A. (2019). Psychometric considerations and a general scoring strategy for assessments of collaborative problem solving (Research Report No. RR-19-41). ETS. https://doi.org/10.1002/ets2.12276

Hao, J., Liu, L., von Davier, A. A., Lederer, N., Zapata-Rivera, D., Jakl, P., & Bakkenson, M. (2017). EPCAL: ETS platform for collaborative assessment and learning (Research Report No. RR-17-49). ETS. https://doi.org/10.1002/ets2.12181

Hao, J., von Davier, A. A., Yaneva, V., Lottridge, S., von Davier, M., & Harris, D. J. (2024). Transforming assessment: The impacts and implications of large language models and generative AI. Educational Measurement: Issues and Practice. Advance online publication.

Hattie, J. A. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Routledge.

Hattie, J., & Gan, M. (2011). Instruction based on feedback. In R. E. Mayer & P. A. Alexander (Eds.), Handbook of research on learning and instruction (pp. 249–271). Routledge.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112. https://doi.org/10.3102/003465430298487

He, J., Bartram, D., Inceoglu, I., & van de Vijver, F. J. R. (2014). Response styles and personality traits: A multilevel analysis. Journal of Cross-Cultural Psychology, 45(7), 1028–1045. https://doi.org/10.1177/0022022114534773

He, Q., Borgonovi, F., & Paccagnella, M. (2019). Using process data to understand adults’ problem-solving behaviour in the Programme for the International Assessment of Adult Competencies (PIAAC): Identifying generalised patterns across multiple tasks with sequence mining (OECD Education Working Paper No. 205). OECD. https://one.oecd.org/document/EDU/WKP(2019)13/en/pdf

Heckman, J., & Zhou, J. (2021). Interactions as investments: The microdynamics and measurement of early childhood learning [Manuscript submitted for publication].

Hedlund, J., Wilt, J. M., Nebel, K. L., Ashford, S. J., & Sternberg, R. J. (2006). Assessing practical intelligence in business school admissions: A supplement to the graduate management admissions test. Learning and Individual Differences, 16(2), 101–127. https://doi.org/10.1016/j.lindif.2005.07.005

Herman, J. L., Martínez, J. F., & Bailey, A. L. (2023). Fairness in educational assessment and the next edition of the standards: Concluding commentary. Educational Assessment, 28(2), 128–136. https://doi.org/10.1080/10627197.2023.2215980

Hilton, M., & Herman, J. (Eds.). (2017). Supporting students’ college success: The role of assessment of intrapersonal and interpersonal competencies. National Academies Press.

Himelfarb, I. (2019). A primer on standardized testing: History, measurement, classical test theory, item response theory, and equating. Journal of Chiropractic Education, 33(2), 151–163. https://doi.org/10.7899/JCE-18-22

Hinnant-Crawford, B. N. (2020). Improvement science in education: A primer. Myers Education Press.

Hitt, C., Trivitt, J., & Cheng, A. (2016). When you say nothing at all: The predictive power of student effort on surveys. Economics of Education Review, 52, 105–119. https://doi.org/10.1016/j.econedurev.2016.02.001

Holland, P.W. (1996). Assessing unusual agreement between the incorrect answers of two examinees using the K-index: Statistical theory and empirical support (Research Report No. RR-96-07). ETS. https://doi.org/10.1002/j.2333-8504.1996.tb01685.x

Hood, S. (1998). Culturally responsive performance-based assessment: Conceptual and psychometric considerations. Journal of Negro Education, 67(3), 187–196. https://doi.org/10.2307/2668188

Hoyt, W. T., & Kerns, M.-D. (1999). Magnitude and moderators of bias in observer ratings: A meta-analysis. Psychological Methods, 4(4), 403–424. https://doi.org/10.1037/1082-989X.4.4.403

Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., & Xing, E. P. (2017). Toward controlled generation of text. In D. Precup & Y. W. Teh (Eds.), Proceedings of machine learning research: Vol. 70. Proceedings of the 34th International Conference on Machine Learning (pp. 1587–1598). https://proceedings.mlr.press/v70/hu17e.html

IMS Global. (2022). Question & test interoperability (QTI) 3.0: Best practices and implementation guide. https://www.imsglobal.org/spec/qti/v3p0/impl/

Institute of Medicine. (2015). Psychological testing in the service of disability determination. The National Academies Press. https://doi.org/10.17226/21704

International Test Commission. (2001). International guidelines for test use. International Journal of Testing, 1(2), 93–114. https://doi.org/10.1207/S15327574IJT0102_1

International Test Commission. (2013). ITC guidelines for test use. Final version. https://www.intestcom.org/files/guideline_test_use.pdf

International Test Commission. (2017). The ITC guidelines for translating and adapting tests (2nd ed.). https://www.intestcom.org/files/guideline_test_adaptation_2ed.pdf

International Test Commission & Association of Test Publishers. (2022). Guidelines for technology-based assessment. https://www.intestcom.org/upload/media-library/guidelines-for-technology-based-assessment-v20221108-16684036687NAG8.pdf

Irvine, S. H., & Kyllonen, P. C. (Eds.). (2013). Item generation for test development. Routledge.

Jackson, C. K. (2018). What do test scores miss? The importance of teacher effects on non-test score outcomes. Journal of Political Economy, 126(5), 2072–2107. https://doi.org/10.1086/699018

Jiang, Y., Martin-Raugh, M., Yang, Z., Hao, J., Liu, L., & Kyllonen, P. C. (2023). Do you know your partner’s personality through virtual collaboration or negotiation? Investigating perceptions of personality and their impacts on performance. Computers in Human Behavior, 141, Article 107608. https://doi.org/10.1016/j.chb.2022.107608

John, O. P., & Srivastava, S. (1999). The Big-Five trait taxonomy: History, measurement, and theoretical perspectives. In L. A. Pervin & O. P. John (Eds.), Handbook of personality: Theory and research (Vol. 2, pp. 102–138). Guilford Press.

Johnson, M. S. (2024). How do we demonstrate AI responsibility: The devil is in the details [Manuscript in preparation].

Johnson, M. S., Liu, X., & McCaffrey, D. F. (2022). Psychometric methods to evaluate measurement and algorithmic bias in automated scoring. Journal of Educational Measurement, 59(3), 338–361. https://doi.org/10.1111/jedm.12335

Johnson, M. S., & McCaffrey, D. F. (2023). Evaluating fairness of automated scoring in educational measurement. In S. Lane (Ed.), Advancing natural language processing in educational assessment (pp. 143–164). Routledge. https://doi.org/10.4324/9781003278658-12

Johnson, M. S., & Sinharay, S. (2005). Calibration of polytomous item families using Bayesian hierarchical modeling. Applied Psychological Measurement, 29(5), 369–400. https://doi.org/10.1177/0146621605276675

Jung, J. Y., Tyack, L., & von Davier, M. (2022). Automated scoring of constructed-response items using artificial neural networks in international large-scale assessment. Psychological Test and Assessment Modeling, 64(4), 471–494.

Karay, Y., Reiss, B., & Schauber, S. K. (2020). Progress testing anytime and anywhere: Does a mobile-learning approach enhance the utility of a large-scale formative assessment tool? Medical Teacher, 42(10), 1154–1162. https://doi.org/10.1080/0142159X.2020.1798910

Karpicke, J. D., & Blunt, J. R. (2011). Retrieval practice produces more learning than elaborative studying with concept mapping. Science, 331(6018), 772–775. https://doi.org/10.1126/science.1199327

Kautz, T., & Zanoni, W. (2014). Measuring and fostering non-cognitive skills in adolescence: Evidence from Chicago public schools and the OneGoal program. University of Chicago.

Kell, H. J., Martin-Raugh, M. P., Carney, L. M., Inglese, P. A., Chen, L., & Feng, G. (2017). Exploring methods for developing behaviorally anchored rating scales for evaluating structured interview performance (Research Report No. RR-17-28). ETS. https://doi.org/10.1002/ets2.12152

Kessler, J. B., Low, C., & Sullivan, C. D. (2019). Incentivized resume rating: Eliciting employer preferences without deception. American Economic Review, 109(11), 3713–3744. https://doi.org/10.1257/aer.20181714

King, G., & Wand, J. (2007). Comparing incomparable survey responses: Evaluating and selecting anchoring vignettes. Political Analysis, 15(1), 46–66. https://doi.org/10.1093/pan/mpl011

Kingston, N., & Nash, B. (2011). Formative assessment: A meta-analysis and a call for research. Educational Measurement: Issues and Practice, 30(4), 28–37. https://doi.org/10.1111/j.1745-3992.2011.00220.x

Klieger, D. M., Kell, H. J., Rikoon, S., Burkander, K. N., Bochenek, J. L., & Shore, J. R. (2018). Development of the behaviorally anchored rating scales for the skills demonstration and progression guide (Research Report No. RR-18-24). ETS. https://doi.org/10.1002/ets2.12210

Klinger, D. A., McDivitt, P. R., Howard, B. B., Munoz, M. A., Rogers, W. T., & Wylie, E. C. (2015). The classroom assessment standards for preK-12 teachers. Kindle Direct Press.

Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin, 119(2), 254–284. https://doi.org/10.1037/0033-2909.119.2.254

Klute, M., Apthorp, H., Harlacher, J., & Reale, M. (2017). Formative assessment and elementary school student academic achievement: A review of the evidence (Report No. REL 2017-259). Regional Educational Laboratory Central.

Koedinger, K. R., Carvalho, P. F., Liu, R., & McLaughlin, E. A. (2023). An astonishing regularity in student learning rate. Proceedings of the National Academy of Sciences, 120(13), Article e2221311120. https://doi.org/10.1073/pnas.2221311120

Kosinski, M., Bachrach, Y., Kohli, P., Stillwell, D., & Graepel, T. (2014). Manifestations of user personality in website choice and behaviour on online social networks. Machine Learning, 95, 357–380. https://doi.org/10.1007/s10994-013-5415-y

Krachman, S. B., Arnold, R., & LaRocca, R. (2016). Expanding the definition of student success: A case study of the CORE districts. Transforming Education. https://transformingeducation.org/wp-content/uploads/2017/04/TransformingEducationCaseStudyFINAL1.pdf

Kukea Shultz, P., & Englert, K. (2021). Cultural validity as foundational to assessment development: An indigenous example. Frontiers in Education, 6, Article 701973. https://doi.org/10.3389/feduc.2021.701973

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: A meta-analytic review. Review of Educational Research, 86(1), 42–78. https://doi.org/10.3102/0034654315581420

Kumar, V., & Boulanger, D. (2020). Explainable automated essay scoring: Deep learning really has pedagogical value. Frontiers in Education, 5, Article 572367. https://doi.org/10.3389/feduc.2020.572367

Kuncel, N. R., Kochevar, R. J., & Ones, D. S. (2014). A meta-analysis of letters of recommendation in college and graduate admissions: Reasons for hope. International Journal of Selection and Assessment, 22(1), 101–107. https://doi.org/10.1111/ijsa.12060

Kyllonen, P. C. (2016). Socio-emotional and self-management variables in learning and assessment. In A. A. Rupp & J. P. Leighton (Eds.), The Wiley handbook of cognition and assessment: Frameworks, methodologies, and applications (pp. 174–197). John Wiley & Sons. https://doi.org/10.1002/9781118956588.ch8

Kyllonen, P. (2021). Taxonomy of cognitive abilities and measures for assessing artificial intelligence and robotics capabilities. In AI and the future of skills: Volume 1. Capabilities and assessments (pp. 50–76). OECD Publishing. https://doi.org/10.1787/feecd512-en

Kyllonen, P. C., & Bertling, J. P. (2013). Innovative questionnaire assessment methods to increase cross-country comparability. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis (pp. 277–285). CRC Press.

Kyllonen, P., Hao, J., Weeks, J., Fauss, M., & Kerzabi, E. (2023). Collaborative problem solving (CPS) skill: Estimating an individual’s contribution to small group performance [Unpublished manuscript]. ETS.

Kyllonen, P., Hartman, R., Sprenger, A., Weeks, J., Bertling, M., McGrew, K., Kriz, S., Bertling, J., Fife, J., & Stankov, L. (2019). General fluid/inductive reasoning battery for a high-ability population. Behavior Research Methods, 51(2), 507–522. https://doi.org/10.3758/s13428-018-1098-4

Kyllonen, P. C., & Kell, H. (2018). Ability tests measure personality, personality tests measure ability: Disentangling construct and method in evaluating the relationship between personality and ability. Journal of Intelligence, 6(3), Article 32. https://doi.org/10.3390/jintelligence6030032

Kyriazos, T. A. (2018). Applied psychometrics: The application of CFA to multitrait-multimethod matrices (CFA-MTMM). Psychology, 9(12), 2625–2648. https://doi.org/10.4236/psych.2018.912150

Landers, R. N., Armstrong, M. B., Collmus, A. B., Mujcic, S., & Blaik, J. (2022). Theory-driven game-based assessment of general cognitive ability: Design theory, measurement, prediction of performance, and test fairness. Journal of Applied Psychology, 107(10), 1655–1677. https://doi.org/10.1037/apl0000954

Landers, R. N., & Sanchez, D. R. (2022). Game-based, gamified, and gamefully designed assessments for employee selection: Definitions, distinctions, design, and validation. International Journal of Selection and Assessment, 30(1), 1–13. https://doi.org/10.1111/ijsa.12376

Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.). (2016). Handbook of test development (2nd ed.). Routledge.

Lang, J. W. B., & Tay, L. (2021). The science and practice of item response theory in organizations. Annual Review of Organizational Psychology and Organizational Behavior, 8, 311–338. https://doi.org/10.1146/annurev-orgpsych-012420-061705

Langer, C., & Wiederhold, S. (2023). The value of early-career skills (CESifo Working Paper No. 10288). CESifo Network. https://doi.org/10.2139/ssrn.4369987

Lassébie, J., & Quintini, G. (2022). What skills and abilities can automation technologies replicate and what does it mean for workers? New evidence (OECD Social, Employment and Migration Working Papers, No. 282). OECD Publishing. https://doi.org/10.1787/646aad77-en

Law, K. S., Mobley, W. H., & Wong, C.-S. (2002). Impression management and faking in biodata scores among Chinese job-seekers. Asia Pacific Journal of Management, 19, 541–556. https://doi.org/10.1023/A:1020521726390

Lederman, O., Calacci, D., MacMullen, A., Fehder, D. C., Murray, F. E., & Pentland, A.S. (2016). Open badges: A low-cost toolkit for measuring team communication and dynamics. In The online proceedings of the 2016 International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (SBP-BriMS 2016). http://sbp-brims.org/2016/proceedings/IN_105.pdf

Lee, G. H., Lee, K. J., Jeong, B., & Kim, T. (2024). Developing personalized marketing service using generative AI. IEEE Access, 12, 22394–22402. https://doi.org/10.1109/ACCESS.2024.3361946

Lee, H. A. (2023, January 23). This is why Microsoft Kinect was a complete failure. SVG. https://www.svg.com/301470/this-is-why-microsoft-kinect-was-a-complete-failure/

Lee, Y.-H., & Haberman, S. J. (2013). Harmonic regression and scale stability. Psychometrika, 78(4), 815–829. https://doi.org/10.1007/s11336-013-9337-1

Lee, Y.-H., & Haberman, S. J. (2021). Studying score stability with a harmonic regression family: A comparison of three approaches to adjustment of examinee-specific demographic data. Journal of Educational Measurement, 58(1), 54–82. https://doi.org/10.1111/jedm.12266

Lee, Y.-H., & Lewis, C. (2021). Monitoring item performance with CUSUM statistics in continuous testing. Journal of Educational and Behavioral Statistics, 46(5), 611–648. https://doi.org/10.3102/1076998621994563

Lee, Y.-H., Lewis, C., & von Davier, A. A. (2014). Monitoring the quality and security of multistage tests. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 285–300). CRC Press.

Lee, Y.-H., & von Davier, A. A. (2013). Monitoring scale scores over time via quality control charts, model-based approaches, and time series techniques. Psychometrika, 78(3), 557–575. https://doi.org/10.1007/s11336-013-9317-5

Leenknecht, M., Hompus, P., & van der Schaaf, M. (2019). Feedback seeking behaviour in higher education: The association with students’ goal orientation and deep learning approach. Assessment & Evaluation in Higher Education, 44(7), 1069–1078. https://doi.org/10.1080/02602938.2019.1571161

Lehman, B., Sparks, J. R., & Zapata-Rivera, D. (2018). When should an adaptive assessment care? In N. Guin & A. Kumar (Eds.), Proceedings of ITS 2018: Intelligent Tutoring Systems 14th International Conference, Workshop on Exploring Opportunities for Caring Assessments (pp. 87–94). ITS. https://ceur-ws.org/Vol-2354/w3paper1.pdf

Leonhardt, D. (2024, January 7). The misguided war on the SAT. The New York Times. https://www.nytimes.com/2024/01/07/briefing/the-misguided-war-on-the-sat.html

Lewin, T. (2002, December 4). Henry Chauncey dies at 97; Shaped admission testing for the nation’s colleges. The New York Times. https://www.nytimes.com/2002/12/04/nyregion/henry-chauncey-dies-at-97-shaped-admission-testing-for-the-nation-s-colleges.html

Lewis, C. (2001). Expected response functions. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp. 163–171). Springer. https://doi.org/10.1007/978-1-4613-0169-1_9

Lewis, C., & Thayer, D. T. (1998). The power of the K-index (or PMIR) to detect copying (Research Report No. RR-98-49). ETS. https://doi.org/10.1002/j.2333-8504.1998.tb01798.x

LinkedIn Talent Solutions. (2019). Global talent trends: The 3 trends transforming your workplace. https://business.linkedin.com/content/dam/me/business/en-us/talent-solutions/resources/pdfs/global_talent_trends_2019_emea.pdf

Linzarini, A., & Catarino da Silva, D. (2024). Innovative assessments for Social Emotional Skills [webinar slides]. SlideShare. https://www.slideshare.net/slideshow/webinar-innovative-assessments-for-social-emotional-skills/270083576

Lira, B., O’Brien, J. M., Peña, P. A., Galla, B. M., D’Mello, S., Yeager, D. S., Defnet, A., Kautz, T., Munkacsy, K., & Duckworth, A. L. (2022). Large studies reveal how reference bias limits policy applications of self-report measures. Scientific Reports, 12, Article 19189. https://doi.org/10.1038/s41598-022-23373-9

Lissitz, R.W. (2009). Introduction. In R.W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 1–15). IAP Information Age Publishing.

Liu, O. L., Bridgeman, B., & Adler, R. M. (2012). Measuring learning outcomes in higher education: Motivation matters. Educational Researcher, 41(9), 352–362. https://doi.org/10.3102/0013189X12459679

Liu, O. L., Kell, H. J., Liu, L., Ling, G., Wang, Y., Wylie, C., Sevak, A., Sherer, D., LeMahieu, P., & Knowles, T. (2023). A new vision for skills-based assessment. ETS. https://ets.org/pdfs/rd/new-vision-skills-based-assessment.pdf

Liu, O. L., Mao, L., Frankel, L., & Xu, J. (2016). Assessing critical thinking in higher education: The HEIghten approach and preliminary validity evidence. Assessment & Evaluation in Higher Education, 41(5), 677–694. https://doi.org/10.1080/02602938.2016.1168358

Liu, X., Zhang, Z., Wang, Y., Pu, H., Lan, Y., & Shen, C. (2023). COCO: Coherence-enhanced machine-generated text detection under low resource with contrastive learning. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 16167–16188). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.1005

Loewus, L. (2016). What is digital literacy? Education Week. https://www.edweek.org/teaching-learning/what-is-digital-literacy/2016/11

Loukina, A., Yoon, S.-Y., Sakano, J., Wei, Y., & Sheehan, K. (2016). Textual complexity as a predictor of difficulty of listening items in language proficiency tests. In Y. Matsumoto & R. Prasad (Eds.), Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical papers (pp. 3245–3253). https://aclanthology.org/C16-1306

Ludlow, L. H., O’Keefe, T., Braun, H., Anghel, E., Szendey, O., Matz, C., & Howell, B. (2022). An enhancement to the theory and measurement of purpose. Practical Assessment, Research, and Evaluation, 27(1), Article 4. https://doi.org/10.7275/c5jb-rr95

Ma, W., Adesope, O. O., Nesbit, J. C., & Liu, Q. (2014). Intelligent tutoring systems and learning outcomes: A meta-analysis. Journal of Educational Psychology, 106(4), 901–918. https://doi.org/10.1037/a0037123

MacCann, C., & Roberts, R. D. (2008). New paradigms for assessing emotional intelligence: Theory and data. Emotion, 8(4), 540–551. https://doi.org/10.1037/a0012746

Madnani, N., & Cahill, A. (2018). Automated scoring: Beyond natural language processing. In E. M. Bender, L. Derczynski, & P. Isabelle (Eds.), Proceedings of the 27th International Conference on Computational Linguistics (pp. 1099–1109). ACL. https://aclanthology.org/C18-1094

Mammadov, S. (2022). Big Five personality traits and academic performance: A meta-analysis. Journal of Personality, 90(2), 222–255. https://doi.org/10.1111/jopy.12663

Mankki, V. (2023). Research using teacher or teacher educator job advertisements: A scoping review. Cogent Education, 10(1), Article 2223814. https://doi.org/10.1080/2331186X.2023.2223814

Martin-Raugh, M. P., Kyllonen, P. C., Hao, J., Bacall, A., Becker, D., Kurzum, C., Yang, Z., Yan, F., & Barnwell, P. (2020). Negotiation as an interpersonal skill: Generalizability of negotiation outcomes and tactics across contexts at the individual and collective levels. Computers in Human Behavior, 104, Article 105966. https://doi.org/10.1016/j.chb.2019.03.030

Martín-Raugh, M., Roohr, K. C., Leong, C. W., Molloy, H., McCulla, L., Ramanarayan, V., & Mladineo, Z. (2023). Better understanding oral communication skills: The impact of perceived personality traits. American Journal of Distance Education. Advance online publication. https://doi.org/10.1080/08923647.2023.2235950

Mattingly, S. M., Gregg, J. M., Audia, P., Bayraktaroglu, A. E., Campbell, A. T., Chawla, N. V., Das Swain, V., De Choudhury, M., D’Mello, S. K., Dey, A. K., Gao, G., Jagannath, K., Jiang, K., Lin, S., Liu, Q., Mark, G., Martinez, G. J., Masaba, K., Mirjafari, S., … Striegel, A. (2019, May). The tesserae project: Large-scale, longitudinal, in situ, multimodal sensing of information workers. In Extended abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1–8). ACM. https://doi.org/10.1145/3290607.3299041

McLaughlin, K., Ainslie, M., Coderre, S., Wright, B., & Violato, C. (2009). The effect of differential rater function over time (DRIFT) on objective structured clinical examination ratings. Medical Education, 43(10), 989–992. https://doi.org/10.1111/j.1365-2923.2009.03438.x

McWhorter, J. (2024, March 14). No, the SAT isn’t racist. The New York Times. https://www.nytimes.com/2024/03/14/opinion/sat-college-admissions-antiracism.html

Mervosh, S. (2022, September 1). The pandemic erased two decades of progress in math and reading: The results of a national test showed just how devastating the last two years have been for 9-year-old schoolchildren, especially the most vulnerable. The New York Times. https://www.nytimes.com/2022/09/01/us/national-test-scores-math-reading-pandemic.html

Meyer, R. H., Wang, C., & Rice, A. B. (2018). Measuring students’ social-emotional learning among California’s CORE districts: An IRT modeling approach [Working paper]. Policy Analysis for California Education. https://edpolicyinca.org/sites/default/files/Measuring_SEL_May-2018.pdf

Mignogna, G., Carey, C. E., Wedow, R., Baya, N., Cordioli, M., Pirastu, N., Bellocco, R., Mlerbi, K. F., Nivard, M. G., Neale, B. M., Walters, R. K., & Ganna, A. (2023). Patterns of item nonresponse behaviour to survey questionnaires are systematic and associated with genetic loci. Nature Human Behaviour, 7, 1371–1387. https://doi.org/10.1038/s41562-023-01632-7

Millsap, R. (2011). Statistical approaches to measurement invariance. Routledge.

Mirjafari, S., Masaba, K., Grover, T., Wang, W., Audia, P., Campbell, A. T., Chawla, N. V., Das Swain, V., De Choudhury, M., Dey, A. K., D’Mello, S. K., Gao, G., Gregg, J. M., Jagannath, K., Jiang, K., Lin, S., Qiang, L., Mark, G., Martinez, G. J., Martinez, S. M., … Striegel, A. (2019). Differentiating higher and lower job performers in the workplace using mobile sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(2), 1–24. https://doi.org/10.1145/3328908

Mislevy, R. (2018). Sociocognitive foundations of educational measurement. Routledge.

Mislevy, R. J., Oranje, A., Bauer, M. I., von Davier, A., Hao, J., Corrigan, S., Hoffman, E., DiCerbo, K., & Michael, J. (2014). Psychometric considerations in game-based assessment. GlassLab Research, Institute of Play. https://web.archive.org/web/20160320151604/http://www.instituteofplay.org/wp-content/uploads/2014/02/GlassLab_GBA1_WhitePaperFull.pdf

Mislevy, R. J., Sheehan, K.M., & Wingersky, M. (1993). How to equate tests with little or no data. Journal of Educational Measurement, 30(1), 55–78. https://doi.org/10.1111/j.1745-3984.1993.tb00422.x

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62. https://doi.org/10.1207/S15366359MEA0101_02

Molenaar, I., de Mooij, S., Azevedo, R., Bannert, M., Järvelä, S., & Gašević, D. (2023). Measuring self-regulated learning and the role of AI: Five years of research using multimodal multichannel data. Computers in Human Behavior, 139, Article 107540. https://doi.org/10.1016/j.chb.2022.107540

Morell, Z. (2017). Introduction to the New York State next generation early learning standards. https://www.nysed.gov/sites/default/files/introduction-to-the-nys-early-learning-standards.pdf

Moreno, R. (2004). Decreasing cognitive load for novice students: Effects of explanatory versus corrective feedback in discovery-based multimedia. Instructional Science, 32(1–2), 99–113. https://doi.org/10.1023/B:TRUC.0000021811.66966.1d

Moro, E., Frank, M. R., Pentland, A., Rutherford, A., Cebrian, M., & Rahwan, I. (2021). Universal resilience patterns in labor markets. Nature Communications, 12, Article 1972. https://doi.org/10.1038/s41467-021-22086-3

Mumford, M. D., & Owens, W. A. (1987). Methodology review: Principles, procedures, and findings in the application of background data measures. Applied Psychological Measurement, 11(1), 1–31. https://doi.org/10.1177/014662168701100101

Murphy, S. C., Klieger, D. M., Borneman, M. J., & Kuncel, N. R. (2009). The predictive power of personal statements in admissions: A meta-analysis and cautionary tale. College and University, 84(4), 83–86.

Narciss, S. (2004). The impact of informative tutoring feedback and self-efficacy on motivation and achievement in concept learning. Experimental Psychology, 51(3), 214–228. https://doi.org/10.1027/1618-3169.51.3.214

Narciss, S., Sosnovsky, S., Schnaubert, L., Andrès, E., Eichelmann, A., Goguadze, G., & Melis, E. (2014). Exploring feedback and student characteristics relevant for personalizing feedback strategies. Computers & Education, 71, 56–76. https://doi.org/10.1016/j.compedu.2013.09.011

National Academies of Sciences, Engineering, and Medicine. (2018). How people learn II: Learners, contexts, and cultures. The National Academies Press. https://doi.org/10.17226/24783

National Academies of Sciences, Engineering, and Medicine. (2019). Monitoring educational equity. The National Academies Press. https://doi.org/10.17226/25389

National Association of Colleges and Employers. (2022). NACE job outlook 2022. https://www.naceweb.org/uploadedFiles/files/2022/resources/nace-job-outlook-2022.pdf

National Research Council. (1999a). High stakes: Testing for tracking, promotion, and graduation. The National Academies Press. https://doi.org/10.17226/6336

National Research Council. (1999b). Myths and tradeoffs: The role of tests in undergraduate admissions. The National Academies Press. https://doi.org/10.17226/9632

National Research Council. (2000). How people learn: Brain, mind, experience, and school (expanded ed.). The National Academies Press. https://doi.org/10.17226/9853

National Research Council. (2001). Knowing what students know: The science and design of educational assessment. The National Academies Press. https://doi.org/10.17226/10019

National Research Council. (2012). Education for life and work: Developing transferable knowledge and skills in the 21st century. The National Academies Press. https://doi.org/10.17226/13398

Nesbit, J. C., Adesope, O. O., Liu, Q., & Ma, W. (2014, July). How effective are intelligent tutoring systems in computer science education? In 2014 IEEE 14th International Conference on Advanced Learning Technologies (pp. 99–103). IEEE. https://doi.org/10.1109/ICALT.2014.38

Nguyen, T. H., Han, H.-R., Kim, M. T., & Chan, K. S. (2014). An introduction to item response theory for patient-reported outcome measurement. The Patient – Patient-Centered Outcomes Research, 7(1), 23–35. https://doi.org/10.1007/s40271-013-0041-0

Nickow, A., Oreopoulos, P., & Quan, V. (2020). The impressive effects of tutoring on PreK-12 learning: A systematic review and meta-analysis of the experimental evidence (NBER working paper No. 27476). National Bureau of Economic Research. https://doi.org/10.3386/w27476

Niessen, A. S. M., Meijer, R. R., & Tendeiro, J. N. (2017). Measuring non-cognitive predictors in high-stakes contexts: The effect of self-presentation on self-report instruments used in admission to higher education. Personality and Individual Differences, 106, 183–189. https://doi.org/10.1016/j.paid.2016.11.014

Noor, N., Beram, S., Yuet, F. K. C., Gengatharan, K., Syafiq, M., & Rasidi, M. S. M. (2023). Bias, halo effect and horn effect: A systematic literature review. International Journal of Academic Research in Business & Social Sciences, 13(3), 1116–1140. https://doi.org/10.6007/IJARBSS/v13-i3/16733

Norville, V. (2022). States sketch ‘portraits of a graduate.’ State Innovations, 27(1), 1–4.

Novarese, M., & Di Giovinazzo, V. (2013). Promptness and academic performance (MPRA Paper No. 49746). Munich Personal RePEc Archive. https://mpra.ub.uni-muenchen.de/49746/

Ober, T. M., Lehman, B. A., Gooch, R., Oluwalana, O., Solyst, J., Phelps, G., & Hamilton, L. S. (2023). Culturally responsive learning: Recommendations for a working definition and framework (Research Report No. RR-23-09). Educational Testing Service. https://doi.org/10.1002/ets2.12372

O’Dwyer, E., Sparks, J. R., & Nabors Oláh, L. (2023). Enacting a process for developing culturally relevant classroom assessments. Applied Measurement in Education, 36(3), 286–303. https://doi.org/10.1080/08957347.2023.2214652

OECD. (n.d.). Education & Skills Online Assessment. https://www.oecd.org/skills/ESonline-assessment/abouteducationskillsonline/

OECD. (2015). Skills for social progress: The power of social and emotional skills. OECD Publishing. https://doi.org/10.1787/9789264226159-en

OECD. (2019). An OECD learning framework 2030. In G. Bast, E. G. Carayannis, & D. F. J. Campbell (Eds.), The future of education and labor. Arts, research, innovation and society (pp. 23–35). Springer. https://doi.org/10.1007/978-3-030-26068-2_3

OECD. (2021). AI and the future of skills: Volume 1. Capabilities and assessments. OECD Publishing. https://doi.org/10.1787/5ee71f34-en

OECD. (2022a). Building the future of education. OECD Publishing. https://web-archive.oecd.org/2022-11-30/618066-future-of-education-brochure.pdf

OECD. (2022b). PISA 2022 results. https://www.oecd.org/publication/pisa-2022-results#pisa2022results

OECD. (2023). OECD skills outlook 2023: Skills for a resilient green and digital transition. OECD Publishing. https://doi.org/10.1787/27452f29-en

Oh, I.-S., Wang, G., & Mount, M. K. (2011). Validity of observer ratings of the five-factor model of personality traits: A meta-analysis. Journal of Applied Psychology, 96(4), 762–773. https://doi.org/10.1037/a0021832

O’Neil, H., Baker, E. L., Wainess, R., Chen, C., Mislevy, R., & Kyllonen, P. (2004). Final report on plan for the assessment and evaluation of individual and team proficiencies developed by the DARWARS Environments. Office of Naval Research; Defense Advanced Research Project Agency. https://apps.dtic.mil/sti/tr/pdf/ADA432802.pdf

OPM. (n.d.). Other assessment methods. U.S. Office of Personnel Management. https://www.opm.gov/policy-data-oversight/assessment-and-selection/other-assessment-methods/

Ormerod, C. M., Malhotra, A., & Jafari, A. (2021). Automated essay scoring using efficient transformer-based language models. arXiv. https://arxiv.org/pdf/2102.13136.pdf

Ortner, T. M., & Proyer, R. T. (2015). Objective personality tests. In T. M. Ortner & F. J. R. van de Vijver (Eds.), Behavior-based assessment in psychology: Going beyond self-report in the personality, affective, motivation, and social domains (pp. 133–149). Hogrefe.

Ortner, T. M., Proyer, R. T., & Kubinger, K. D. (2006). Theorie und Praxis objektiver Persönlichkeitstests [Theory and practice of objective personality tests]. Verlag Hans Huber.

Ostini, R., & Nering, M. L. (2006). Polytomous item response theory models (No. 144). Sage.

Oswald, F. L., Schmitt, N., Kim, B. H., Ramsay, L. J., & Gillespie, M. A. (2004). Developing a biodata measure and situational judgment inventory as predictors of college student performance. Journal of Applied Psychology, 89(2), 187–207. https://doi.org/10.1037/0021-9010.89.2.187

Panadero, E. (2023). Toward a paradigm shift in feedback research: Five further steps influenced by self-regulated learning theory. Educational Psychologist, 58(3), 193–204. https://doi.org/10.1080/00461520.2023.2223642

Panadero, E., & Lipnevich, A. A. (2022). A review of feedback models and typologies: Towards an integrative model of feedback elements. Educational Research Review, 35, Article 100416. https://doi.org/10.1016/j.edurev.2021.100416

Panthier, C., & Gatinel, D. (2023). Success of ChatGPT, an AI language model, in taking the French language version of the European Board of Ophthalmology examination: A novel approach to medical knowledge assessment. Journal Français d’Ophtalmologie, 46(7), 706–711. https://doi.org/10.1016/j.jfo.2023.05.006

Patrick, S. (2021). Transforming learning through competency-based education. State Education Standard, 21(2), 23–29.

Paulhus, D. L. (2002). Socially desirable responding: The evolution of a construct. In H. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 49–69). Erlbaum.

Phelps, R. P. (2019). Test frequency, stakes, and feedback in student achievement: A meta-analysis. Evaluation Review, 43(3–4), 111–151. https://doi.org/10.1177/0193841X19865628

Poropat, A. E. (2014). A meta-analysis of adult-rated child personality and academic performance in primary education. British Journal of Educational Psychology, 84(2), 239–252. https://doi.org/10.1111/bjep.12019

Posso, A. (2016). Internet usage and educational outcomes among 15-year-old Australian students. International Journal of Communication, 10, 3851–3876. https://ijoc.org/index.php/ijoc/article/view/5586/1742

Powers, D. E., & Fowles, M. E. (1997). The personal statement as an indicator of writing skill: A cautionary note. Educational Assessment, 4(1), 75–87. https://doi.org/10.1207/s15326977ea0401_3

Prabhumoye, S., Tsvetkov, Y., Salakhutdinov, R., & Black, A. W. (2018). Style transfer through back-translation. In I. Gurevych & Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Volume 1. Long Papers (pp. 866–876). Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1080

Qian, Y., Tao, J., Suendermann-Oeft, D., Evanini, K., Ivanov, A. V., & Ramanarayanan, V. (2018a). Computer-implemented systems and methods for speaker recognition using a neural network (U.S. Patent 10,008,209). U.S. Patent Office and Trademark Office. https://ppubs.uspto.gov/pubwebapp/external.html?q=(10008209).pn.&db=USPAT&type=ids

Qian, Y., Tao, J., Suendermann-Oeft, D., Evanini, K., Ivanov, A. V., & Ramanarayanan, V. (2018b). Noise and metadata sensitive bottleneck features for improving speaker recognition with non-native speech input. In Proceedings of INTERSPEECH 2016: 17th Annual Conference of the International Speech Communication Association (pp. 3648–3652). https://doi.org/10.21437/Interspeech.2016-548

RAND. (2020). RAND education assessment finder. https://www.rand.org/education-and-labor/projects/assessments/tool.html

Randall, J. (2023). It ain’t near ’bout fair: Re-envisioning the bias and sensitivity review process from a justice-oriented antiracist perspective. Educational Assessment, 28(2), 68–82. https://doi.org/10.1080/10627197.2023.2223924

Rees, A. (2021, December 27). The history of predicting the future. Wired. https://www.wired.com/story/history-predicting-future/

Rios, J. A., Ling, G., Pugh, R., Becker, D., & Bacall, A. (2020). Identifying critical 21st-century skills for workplace success: A content analysis of job advertisements. Educational Researcher, 49(2), 80–89. https://doi.org/10.3102/0013189X19890600

Roediger, H. L., III, Agarwal, P. K., McDaniel, M. A., & McDermott, K. B. (2011). Test-enhanced learning in the classroom: Long-term improvements from quizzing. Journal of Experimental Psychology: Applied, 17(4), 382–395. https://doi.org/10.1037/a0026252

Roll, I., & Barhak-Rabinowitz, M. (2023). Measuring self-regulated learning using feedback and resources. In N. Foster & M. Piacentini (Eds.), Innovating assessments to measure and support complex skills. OECD Publishing. https://doi.org/10.1787/c93ac64e-en

Rowland, C. A. (2014). The effect of testing versus restudy on retention: A meta-analytic review of the testing effect. Psychological Bulletin, 140(6), 1432–1463. https://doi.org/10.1037/a0037559

Rupp, A. A. (2018). Designing, evaluating, and deploying automated scoring systems with validity in mind: Methodological design decisions. Applied Measurement in Education, 31(3), 191–214. https://doi.org/10.1080/08957347.2018.1464448

Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. Guilford Press.

Salgado, J. F., & Moscoso, S. (2019). Meta-analysis of interrater reliability of supervisory performance ratings: Effects of appraisal purpose, scale type, and range restriction. Frontiers in Psychology, 10, Article 2281. https://doi.org/10.3389/fpsyg.2019.02281

Salgado, J. F., & Tauriz, G. (2014). The Five-Factor model, forced-choice personality inventories and performance: A comprehensive meta-analysis of academic and occupational validity studies. European Journal of Work and Organizational Psychology, 23(1), 3–30. https://doi.org/10.1080/1359432X.2012.716198

Scalise, K., & Gifford, B. (2006). Computer-based assessment in e-learning: A framework for constructing “intermediate constraint” questions and tasks for technology platforms. The Journal of Technology, Learning and Assessment, 4(6). https://ejournals.bc.edu/index.php/jtla/article/view/1653

Scalise, K., Malcom, C., & Kaylor, E. (2023). A tale of two worlds: Machine learning approaches at the intersection with educational measurement. In N. Foster & M. Piacentini (Eds.), Innovating assessments to measure and support complex skills (pp. 229–237). OECD Publishing. https://doi.org/10.1787/d01eb8a4-en

Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 307–353). American Council on Education; Praeger.

Schmill, S. (2022, March 28). We are reinstating our SAT/ACT requirement for future admissions cycles in order to help us continue to build a diverse and talented MIT. MIT Admissions. https://mitadmissions.org/blogs/entry/we-are-reinstating-our-sat-act-requirement-for-future-admissions-cycles/#annotation-10

Schmitt, N., Keeney, J., Oswald, F. L., Pleskac, T. J., Billington, A. Q., Sinha, R., & Zorzie, M. (2009). Prediction of 4-year college student performance using cognitive and noncognitive predictors and the impact on demographic status of admitted students. Journal of Applied Psychology, 94(6), 1479–1497. https://doi.org/10.1037/a0016810

Schrum, L., & Levin, B. B. (2013). Leadership for twenty-first-century schools and student achievement: Lessons learned from three exemplary cases. International Journal of Leadership in Education, 16(4), 379–398. https://doi.org/10.1080/13603124.2013.767380

Schwartz, D. L., Tsang, J. M., & Blair, K. P. (2016). The ABCs of how we learn: 26 scientifically proven approaches, how they work, and when to use them. W.W. Norton & Company.

Segal, C. (2012). Working when no one is watching: Motivation, test scores, and economic success. Management Science, 58(8), 1438–1457. https://doi.org/10.1287/mnsc.1110.1509

Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61(2), 331–354. https://doi.org/10.1007/BF02294343

Shafer, G. W., Viskupic, K., & Egger, A. E. (2023). Critical workforce skills for bachelor-level geoscientists: An analysis of geoscience job advertisements. Geosphere, 19(2), 628–644. https://doi.org/10.1130/GES02581.1

Shen, T., Lei, T., Barzilay, R., & Jaakkola, T. (2017). Style transfer from non-parallel text by cross-alignment. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems 30 (NIPS 2017) (pp. 1–12). Curran Associates. https://papers.nips.cc/paper_files/paper/2017/file/2d2c8394e31101a261abf1784302bf75-Paper.pdf

Shepard, L. A. (2017). Formative assessment: Caveat emptor. In C. A. Dwyer (Ed.), The future of assessment (pp. 279–303). Routledge. https://doi.org/10.4324/9781315086545-12

Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. Routledge. https://doi.org/10.4324/9780203122761

Shute, V. J. (2008). Focus on formative feedback. Review of Educational Research, 78(1), 153–189. https://doi.org/10.3102/0034654307313795

Shute, V. J., & Zapata-Rivera, D. (2012). Adaptive educational systems. In Adaptive technologies for training and education (pp. 7–27). Cambridge University Press. https://doi.org/10.1017/CBO9781139049580.004

Sinatra, A. M., Robinson, R. L., Goldberg, B., & Goodwin, G. (2023). Impact of engaging with intelligent tutoring system lessons prior to class start. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 67(1), 2262–2266. https://doi.org/10.1177/21695067231192709

Sinharay, S. (2023). Statistical methods for detection of test fraud on educational assessments. In R. J. Tierney, F. Rizvi, & K. Ercikan (Eds.), International encyclopedia of education (4th ed., pp. 298–307). Elsevier. https://doi.org/10.1016/B978-0-12-818630-5.10030-2

Sinharay, S., & Johnson, M. S. (2013). Statistical modeling of automatically generated items. In M. J. Gierl & T. M. Haladyna (Eds.), Automatic item generation (pp. 183–195). Routledge.

Sinharay, S., & Johnson, M. S. (2023). Computation and accuracy evaluation of comparable scores on culturally responsive assessments. Journal of Educational Measurement, 61(1), 5–46. https://doi.org/10.1111/jedm.12381

Sireci, S. G. (2020). Standardization and UNDERSTANDardization in educational assessment. Educational Measurement: Issues and Practice, 39(3), 100–105. https://doi.org/10.1111/emip.12377

Slavich, G. (2019). Stressnology: The primitive (and problematic) study of life stress exposure and pressing need for better measurement. Brain, Behavior, and Immunity, 75, 3–5. https://doi.org/10.1016/j.bbi.2018.08.011

Society for Industrial and Organizational Psychology. (2018). Principles for the validation and use of personnel selection procedures (5th ed.). https://www.apa.org/ed/accreditation/personnel-selection-procedures.pdf

Soland, J., & Kuhfeld, M. (2021). Do response styles affect estimates of growth on social-emotional constructs? Evidence from four years of longitudinal survey scores. Multivariate Behavioral Research, 56(6), 853–873. https://doi.org/10.1080/00273171.2020.1778440

Solano-Flores, G. (2019). Examining cultural responsiveness in large-scale assessment: The matrix of evidence for validity argumentation. Frontiers in Education, 4, Article 43. https://doi.org/10.3389/feduc.2019.00043

Solano-Flores, G. (2023). How serious are we about fairness in testing and how far are we willing to go? A response to Randall and Bennett with reflections about the Standards for Educational and Psychological Testing. Educational Assessment, 28(2), 105–117. https://doi.org/10.1080/10627197.2023.2226388

Soto, C. J., Napolitano, C. M., Sewell, M. N., Yoon, H. J., & Roberts, B. W. (2022). An integrative framework for conceptualizing and assessing social, emotional, and behavioral skills: The BESSI. Journal of Personality and Social Psychology, 123(1), 192–222. https://doi.org/10.1037/pspp0000401

Sottilare, R. A., Baker, R. S., Graesser, A. C., & Lester, J. (2018). Special issue on the generalized intelligent framework for tutoring (GIFT): Creating a stable and flexible platform for innovations in AIED research. International Journal of Artificial Intelligence and Education, 28(1), 139–151. https://doi.org/10.1007/s40593-017-0149-9

Sparks, J. R., Lehman, B., & Zapata-Rivera, D. (2024). Caring assessments: Challenges and opportunities. Frontiers in Education, 9, Article 1216481. https://doi.org/10.3389/feduc.2024.1216481

Stankov, L., Kleitman, S., & Jackson, S. A. (2015). Measures of the trait of confidence. In G. J. Boyle, D. H. Saklofske, & G. Matthews (Eds.), Measures of personality and social psychological constructs (pp. 158–189). Elsevier Academic Press. https://doi.org/10.1016/B978-0-12-386915-9.00007-3

Stecher, B. M., & Hamilton, L. S. (2014). Measuring hard-to-measure student competencies: A research and development plan (Research Report No. RR-863-WFHF). RAND Corporation. https://doi.org/10.7249/RR863

Steenbergen-Hu, S., & Cooper, H. (2013). A meta-analysis of the effectiveness of intelligent tutoring systems on K–12 students’ mathematical learning. Journal of Educational Psychology, 105(4), 970–987. https://doi.org/10.1037/a0032447

Sternberg, R. J., Forsythe, G. B., Hedlund, J., Horvath, J. A., Wagner, R. K., Williams, W. M., Snook, S. A., & Grigorenko, E. L. (2000). Practical intelligence in everyday life. Cambridge University Press.

Stocking, M. L., & Swanson, L. (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17(3), 277–292. https://doi.org/10.1177/014662169301700308

Stowe, K., Ghosh, D., & Zhao, M. (2022). Controlled language generation for language learning items. arXiv. https://doi.org/10.48550/arXiv.2211.15731

Straub, L. M., Lin, E., Tremonte-Freydefont, L., & Schmid, P. C. (2023). Individuals’ power determines how they respond to positive versus negative performance feedback. European Journal of Social Psychology, 53(7), 1402–1420. https://doi.org/10.1002/ejsp.2985

Su, R., Tay, L., Liao, H.-Y., Zhang, Q., & Rounds, J. (2019). Toward a dimensional model of vocational interests. Journal of Applied Psychology, 104(5), 690–714. https://doi.org/10.1037/apl0000373

Tang, R., Chuang, Y.-N.,&Hu, X. (2023). The science of detecting LLM-generated texts. arXiv. https://doi.org/10.48550/arXiv.2303.07205

Tang, Z., & Kirman, B. (2023). Exploring curiosity in games: A framework and questionnaire study of player perspectives. International Journal of Human-Computer Interaction. Advance online publication. https://doi.org/10.1080/10447318.2024.2325171

Tannenbaum, R. J., & Kane, M. T. (2019). Stakes in testing: Not a simple dichotomy but a profile of consequences that guides needed evidence of measurement quality (Research Report No. RR-19-19). ETS. https://doi.org/10.1002/ets2.12255

Tenison, C., & Sparks, J. R. (2023). Combining cognitive theory and data driven approaches to examine students’ search strategies in simulated digital environments. Large-Scale Assessments in Education, 11, Article 28. https://doi.org/10.1186/s40536-023-00164-w

Trull, T. J., & Ebner-Priemer, U. (2013). Ambulatory assessment. Annual Review of Clinical Psychology, 9, 151–176. https://doi.org/10.1146/annurev-clinpsy-050212-185510

Turchin, D. (Host). (2023, March 6). Andi Mann, Sageable CEO and AIOps pioneer, discusses enterprise AI wins and the impact of automation on jobs [Audio podcast episode]. In AI and the Future of Work. Apple Podcasts. https://podcasts.apple.com/us/podcast/andi-mann-sageable-ceo-and-aiops-pioneer-discusses/id1476885647?i=1000602978601

U.S. Congress, Office of Technology Assessment. (1992). Testing in American schools: Asking the right questions (Report No. OTA-SET-519). U.S. Government Printing Office.

U.S. Office of Personnel Management. (n.d.). Situational judgment tests. https://www.opm.gov/policy-data-oversight/assessment-and-selection/other-assessment-methods/situational-judgment-tests/

van der Linden, W. J. (2005). Linear models for optimal test design. Springer. https://doi.org/10.1007/0-387-29054-0

van der Linden, W. J., & Glas, C. A. W. (Eds.). (2010). Elements of adaptive testing. Springer. https://doi.org/10.1007/978-0-387-85461-8

van de Vijver, F. J. R., & He, J. (2016). Bias assessment and prevention in noncognitive outcome measures in context assessments. In S. Kuger, E. Klieme, N. Jude, & D. Kaplan (Eds.), Assessing contexts of learning: International perspectives (pp. 229–253). Springer. https://doi.org/10.1007/978-3-319-45357-6_9

van de Vijver, F., & Poortinga, Y. H. (2005). Conceptual and methodological issues in adapting tests. In R. H. Hambleton, P. F. Merenda, & C. D. Spielberger (Eds.), Adapting educational and psychological tests for cross-cultural assessment. Taylor & Francis. https://doi.org/10.4324/9781410611758

VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221. https://doi.org/10.1080/00461520.2011.611369

VanLehn, K., Graesser, A. C., Jackson, G. T., Jordan, P., Olney, A., & Rosé, C. P. (2007). When are tutorial dialogues more effective than reading? Cognitive Science, 31(1), 3–62. https://doi.org/10.1080/03640210709336984

von Davier, M. (2010). Hierarchical mixtures of diagnostic models. Psychological Test and Assessment Modeling, 52(1), 8–28.

von Davier, M., Tyack, L., & Khorramdel, L. (2023). Scoring graphical responses in TIMSS 2019 using artificial neural networks. Educational and Psychological Measurement, 83(3), 556–585. https://doi.org/10.1177/00131644221098021

Waheed, H., Hassan, S.-U., Aljohani, N. R., Hardman, J., Alelyani, S., & Nawaz, R. (2020). Predicting academic performance of students from VLE big data using deep learning models. Computers in Human Behavior, 104, Article 106189. https://doi.org/10.1016/j.chb.2019.106189

Wainer, H. (1987). The first four millennia of mental testing: From ancient China to the computer age (Research Report No. RR-87-34). ETS. https://doi.org/10.1002/j.2330-8516.1987.tb00238.x

Wainer, H., & Thissen, D. (2001). True score theory: The traditional method. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 35–84). Routledge. https://doi.org/10.4324/9781410604729

Walker, M. E., Olivera-Aguilar, M., Lehman, B., Laitusis, C., Guzman-Orth, D., & Gholson, M. (2023). Culturally responsive assessment: Provisional principles (Research Report No. RR-23-11). ETS. https://doi.org/10.1002/ets2.12374

Walkington, C., & Bernacki, M. L. (2020). Appraising research on personalized learning: Definitions, theoretical alignment, advancements, and future directions. Journal of Research on Technology in Education, 52(3), 235–252. https://doi.org/10.1080/15391523.2020.1747757

Wang, F., Liu, Q., Chen, E., Huang, Z., Chen, Y., Yin, Y., Huang, Z., & Wang, S. (2020). Neural cognitive diagnosis for intelligent education systems. Proceedings of the AAAI Conference on Artificial Intelligence, 34(4), 6153–6161. https://doi.org/10.1609/aaai.v34i04.6080

Wang, J., Jou, M., Lv, Y., & Huang, C. C. (2018). An investigation on teaching performances of model-based flipping classroom for physics supported by modern teaching technologies. Computers in Human Behavior, 84, 36–48. https://doi.org/10.1016/j.chb.2018.02.018

Weinberger, C. J. (2014). The increasing complementarity between cognitive and social skills. Review of Economics and Statistics, 96(5), 849–861. https://doi.org/10.1162/REST_a_00449

Weirich, S., Hecht, M., Penk, C., Roppelt, A., & Böhme, K. (2017). Item position effects are moderated by changes in test-taking effort. Applied Psychological Measurement, 41(2), 115–129. https://doi.org/10.1177/0146621616676791

Weiss, S., Wilhelm, O., & Kyllonen, P. (2021). An improved taxonomy of creativity measures based on salient task attributes. Psychology of Aesthetics, Creativity, and the Arts. Advance online publication. https://doi.org/10.1037/aca0000434

West, M., Pier, L., Fricke, H., Hough, H. J., Loeb, S., Meyer, R. H., & Rice, A. B. (2018). Trends in student social-emotional learning: Evidence from the CORE districts (Working paper). Policy Analysis for California Education. https://edpolicyinca.org/publications/trends-student-social-emotional-learning

Wilkie, D. (2023, December 21). Employers say students aren’t learning soft skills in college. SHRM. https://www.shrm.org/topics-tools/news/employee-relations/employers-say-students-arent-learning-soft-skills-college

Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x

Wise, S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52–61. https://doi.org/10.1111/emip.12165

Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10(1), 1–17. https://doi.org/10.1207/s15326977ea1001_1

Wisniewski, B., Zierer, K., & Hattie, J. (2020). The power of feedback revisited: A meta-analysis of educational feedback research. Frontiers in Psychology, 10, Article 3087. https://doi.org/10.3389/fpsyg.2019.03087

Wolcott, M. D., Lobczowski, N. G., Zeeman, J. M., & McLaughlin, J. E. (2020). Situational judgment test validity: An exploratory model of the participant response process using cognitive and think-aloud interviews. BMC Medical Education, 20, Article 506. https://doi.org/10.1186/s12909-020-02410-z

World Economic Forum. (2021). Building a common language for skills at work: A global taxonomy. https://www3.weforum.org/docs/WEF_Skills_Taxonomy_2021.pdf

World Economic Forum. (2022). Catalysing Education 4.0: Investing in the future of learning for a human-centric recovery. https://www3.weforum.org/docs/WEF_Catalysing_Education_4.0_2022.pdf

World Economic Forum. (2023). Defining Education 4.0: A taxonomy for the future of learning. https://www3.weforum.org/docs/WEF_Defining_Education_4.0_2023.pdf

Xuan, Q., Cheung, A., & Sun, D. (2022). The effectiveness of formative assessment for enhancing reading achievement in K-12 classrooms: A meta-analysis. Frontiers in Psychology, 13, Article 990196. https://doi.org/10.3389/fpsyg.2022.990196

Yang, Z., Hu, Z., Dyer, C., Xing, E. P., & Berg-Kirkpatrick, T. (2018). Unsupervised text style transfer using language models as discriminators. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems 31 (NeurIPS 2018) (pp. 7287–7289). Curran Associates. https://papers.neurips.cc/paper_files/paper/2018/hash/398475c83b47075e8897a083e97eb9f0-Abstract.html

Yeung, C. (2019). Deep-IRT: Make deep learning based knowledge tracing explainable using item response theory. arXiv. https://doi.org/10.48550/arXiv.1904.11738

Yeung, C.-K., & Yeung, D.-Y. (2019). Incorporating features learned by an enhanced deep knowledge tracing model for STEM/non-STEM job prediction. International Journal of Artificial Intelligence and Education, 29, 317–341. https://doi.org/10.1007/s40593-019-00175-1

Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences, 112(4), 1036–1040. https://doi.org/10.1073/pnas.1418680112

Zapata-Rivera, D., & Forsyth, C. M. (2022). Learner modeling in conversation-based assessment. In R. A. Sottilare & J. Schwarz (Eds.), Adaptive instructional systems: International Conference on Human-Computer Interaction. HCII 2022 (pp. 73–83). Springer. https://doi.org/10.1007/978-3-031-05887-5_6

Zapata-Rivera, D., Graesser, A. C., Kay, J., Hu, X., & Ososky, S. J. (2020). Visualization implications for the validity of intelligent tutoring systems. In A. M. Sinatra, A. C. Graesser, X. Hu, B. Goldberg, & A. J. Hampton (Eds.), Design recommendations for intelligent tutoring systems: Volume 8. Data visualization (pp. 61–68). US Army Combat Capabilities Development Command – Soldier Center.

Zapata-Rivera, D., & Hu, X. (2022). Assessment in intelligent tutoring systems SWOT analysis. In A. M. Sinatra, A. C. Graesser, X. Hu, G. Goodwin, & V. Rus (Eds.), Design recommendations for intelligent tutoring systems: Vol. 10. Strengths, weaknesses, opportunities and threats (SWOT) analysis of intelligent tutoring systems (pp. 83–90). US Army Combat Capabilities Development Command – Soldier Center. https://gifttutoring.org/attachments/download/4751/Vol%2010_DesignRecommendationsforITSs_SWOTAnalysisofITSs.pdf#page=87

Zapata-Rivera, D., Lehman, B., & Sparks, J. R. (2020). Learner modeling in the context of caring assessments. In R. A. Sottilare & J. Schwarz (Eds.), Adaptive instructional systems: Second International Conference (AIS) 2020 (pp. 422–431). Springer.

Zhan, J., Her, Y. W., Hu, T., & Du, C. (2018). Integrating data analytics into the undergraduate accounting curriculum. Business Education Innovation Journal, 10(2), 169–178. http://www.beijournal.com/images/V10N2_draft_5.pdf

Zhang, Z. (2012). Microsoft Kinect sensor and its effect. IEEE MultiMedia, 19(2), 4–10. https://doi.org/10.1109/MMUL.2012.24

Zu, J., & Choi, I. (2023a, April 12–15). Utilizing deep language models to predict item difficulty of language proficiency tests [Paper presentation]. Annual meeting of the National Council on Measurement in Education, Chicago, IL, United States.

Zu, J., & Choi, I. (2023b, July 25–28). Predicting the psychometric properties of automatically generated items [Paper presentation]. International Meeting of the Psychometric Society, College Park, MD, United States.

Zu, J., Choi, I., & Hao, J. (2023). Automated distractor generation for fill-in-the-blank items using a prompt-based learning approach. Psychological Testing and Assessment Modeling, 65(2), 55–75.

Zu, J., & Kyllonen, P. C. (2020). Nominal response model is useful for scoring multiple-choice situational judgment tests. Organizational Research Methods, 23(2), 342–366. https://doi.org/10.1177/1094428118812669