Continuous Evaluation Frameworks for AI in Educational Settings

Diane Gavin, Executive Director – Center for Academic Innovation, Texas A&M University-San Antonio

In an era when many colleges and universities have closed and others are making significant cuts due to budget deficits, does investing in AI make fiscal sense? As higher education leaders make difficult decisions about spending and resources, can implementing AI at our institutions be justified in terms of cost-benefit and return on investment (ROI)? Answering those questions requires ongoing, rigorous evidence of impact, which is exactly what continuous evaluation frameworks are designed to provide.

Continuous evaluation frameworks for AI in educational technology have evolved considerably in 2025. The fundamental shift is moving from evaluation as a periodic activity to evaluation as an integrated, continuous process that directly feeds back into system improvement based on rigorous documentation of both positive and negative impact.

Compared to traditional continuous evaluation frameworks such as the Kirkpatrick Model for training evaluation, the COSO framework for internal controls, and the CDC Program Evaluation Framework for public health programs, continuous evaluation frameworks for AI use automated evaluation processes that rely on data orchestration tools and pre-defined metrics applied to model predictions, drawing on input, output, and real-time, real-world (or ground truth) data. “Ground truth data” are data known to be real or true, gathered through direct observation and measurement rather than through interpretation or inference.

For AI systems, ground truth data support evaluation by comparing the system’s outputs to the “correct answer” derived from real-world observations. Educational technology professionals can explain that, within a continuous evaluation framework rooted in AI, the system compares each learner’s recorded responses against those observed correct answers to evaluate what the student’s performance is, or could be, grounded in the data gathered from the AI’s observations of learner responses.
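To make this concrete, the minimal sketch below shows how an evaluation pipeline might score an AI system’s predictions against ground-truth observations; the field names and data are illustrative assumptions, not drawn from any specific platform.

```python
# Minimal sketch (illustrative only): scoring an AI system's predictions
# against ground-truth observations as part of a continuous evaluation loop.
# The field names (student_id, predicted_mastery, observed_mastery) are
# hypothetical, not a real product's schema.
from dataclasses import dataclass


@dataclass
class EvaluationRecord:
    student_id: str
    predicted_mastery: bool   # what the AI system predicted for the learner
    observed_mastery: bool    # ground truth from direct observation/measurement


def accuracy(records: list[EvaluationRecord]) -> float:
    """Fraction of predictions that match the ground-truth observation."""
    if not records:
        return 0.0
    correct = sum(r.predicted_mastery == r.observed_mastery for r in records)
    return correct / len(records)


# Example: three learners; two predictions matched what was actually observed.
records = [
    EvaluationRecord("s1", predicted_mastery=True, observed_mastery=True),
    EvaluationRecord("s2", predicted_mastery=True, observed_mastery=False),
    EvaluationRecord("s3", predicted_mastery=False, observed_mastery=False),
]
print(f"Prediction accuracy vs. ground truth: {accuracy(records):.2f}")  # 0.67
```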

“AI-driven continuous evaluation in education transforms feedback into real-time insights, driving ongoing system improvement”

There are various AI-driven continuous evaluation frameworks that educational technologists can choose from to understand learner performance. The following areas – multi-dimensional assessment, stakeholder feedback loops, comparative benchmarking, cross-institutional data sharing (with privacy protections) and adaptation protocols – are current good practice in applying continuous evaluation frameworks for AI use across various education environments.

Multi-dimensional Assessment Approach

With a multi-dimensional assessment approach, contemporary evaluation frameworks no longer focus solely on academic metrics but incorporate multiple dimensions:

• Learning outcome measurements track subject mastery using adaptive assessments that adjust to user/learner progress.

• Engagement metrics measure not just time spent but the quality of interactions with AI systems.

• Equity indicators show whether AI systems are serving all student populations effectively.

• Well-being factors, inherent in some multi-dimensional approaches, check for potential negative impact on student mental health or social development.

Currently, multi-dimensional or multi-modal assessments incorporate various data inputs that integrate traditional assessment with digital learning behaviors such as biometrics and verbal/non-verbal cues. Process-based analytics that capture not only answers but also keystroke patterns, eye tracking, and time-on-task metrics also align with a multi-dimensional approach to AI systems. These assessment practices, along with collaboration assessment models that evaluate group dynamics and collaborative learning through interaction analysis, offer avenues to capture holistic views of student learning.
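As a rough illustration, the sketch below combines several of these dimensions into a single evaluation snapshot; the dimension names, scales, and thresholds are hypothetical, not taken from any published framework.

```python
# Illustrative sketch: combining several assessment dimensions into one
# evaluation snapshot. The dimension names, scales, and thresholds are
# hypothetical assumptions for demonstration only.
from dataclasses import dataclass


@dataclass
class AssessmentSnapshot:
    outcome_score: float       # adaptive-assessment mastery, 0-1
    engagement_quality: float  # quality of interactions with the AI, 0-1
    equity_gap: float          # performance gap across groups, 0-1 (lower is better)
    time_on_task_min: float    # process-based analytic, minutes


def flag_for_review(s: AssessmentSnapshot) -> list[str]:
    """Return the dimensions that fall outside the illustrative thresholds."""
    flags = []
    if s.outcome_score < 0.6:
        flags.append("learning outcomes")
    if s.engagement_quality < 0.5:
        flags.append("engagement quality")
    if s.equity_gap > 0.2:
        flags.append("equity gap")
    return flags


snapshot = AssessmentSnapshot(0.72, 0.45, 0.25, 38.0)
print(flag_for_review(snapshot))  # ['engagement quality', 'equity gap']
```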

Stakeholder Feedback Loops

With continuous evaluation frameworks that include AI, effective evaluation now includes structured input from all parties to provide 360° feedback for learning. In a K-12 environment, for instance, stakeholder feedback loops with AI often look like the following:

• Student feedback mechanisms for assignments built directly into AI interfaces.

• Teacher observation protocols to document classroom impact on learning.

• Parent portals that gather insights about home learning experiences based on parental use of the portal.

• Administrator review cycles that assess institutional integration and student performance.
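A simple way to close such a loop is to aggregate ratings by stakeholder group before each review cycle. The sketch below assumes a hypothetical flat feedback schema; real portals and interfaces would supply richer data.

```python
# Minimal sketch (hypothetical schema): aggregating ratings from multiple
# stakeholder groups into a per-group summary ahead of a review cycle.
from collections import defaultdict

feedback = [
    {"source": "student", "rating": 4},
    {"source": "teacher", "rating": 3},
    {"source": "parent", "rating": 5},
    {"source": "student", "rating": 2},
    {"source": "administrator", "rating": 4},
]

ratings_by_source: dict[str, list[int]] = defaultdict(list)
for item in feedback:
    ratings_by_source[item["source"]].append(item["rating"])

for source, ratings in sorted(ratings_by_source.items()):
    mean = sum(ratings) / len(ratings)
    print(f"{source}: mean rating {mean:.1f} (n={len(ratings)})")
```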

Comparative Benchmarking

Similar to TIMSS (the Trends in International Mathematics and Science Study) for K-12 mathematics and science achievement, the Programme for International Student Assessment (PISA) uses a strong comparative benchmarking framework that includes contextual performance analysis to measure student performance over time and to gauge whether education reforms and policies produce positive changes across its three-year assessment cycles. PISA-style comparative benchmarking is considered a best-practice model to guide the formation of national or federal assessment practices.

Comparative benchmarking has been found to:

• Compare validated national data against international benchmarks.

• Help in evaluating the effectiveness of educational reform.

• Set and monitor school system performance targets and indicators.

• Establish cross-institutional data sharing (with privacy protections).

• Construct control group comparisons where appropriate.

• Generate longitudinal tracking to show long-term impact beyond immediate school year gains when using AI tools.
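For example, a minimal longitudinal comparison against an external benchmark might look like the sketch below; the scores and cycle years are invented for illustration.

```python
# Illustrative sketch: tracking a school system's scores against an external
# benchmark across assessment cycles. All numbers are invented for illustration.
cycles = {
    2019: {"system": 487, "benchmark": 500},
    2022: {"system": 495, "benchmark": 498},
    2025: {"system": 503, "benchmark": 501},
}

for year in sorted(cycles):
    gap = cycles[year]["system"] - cycles[year]["benchmark"]
    trend = "above" if gap >= 0 else "below"
    print(f"{year}: {abs(gap)} points {trend} the benchmark")
```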

Adaptation Protocols

Modern evaluation frameworks connected to AI-enabled learning use adaptation protocols to close the loop. A range of adaptive assessment protocols exists for AI-enabled learning, from knowledge graph navigation, which maps students’ conceptual understanding and generates targeted questions to expose weak areas in a student’s knowledge structure, to scenario-based adaptive assessment, which relies on dynamic simulations that branch based on students’ demonstrated competencies and decisions. Some adaptation protocols use continuous calibration systems that recalibrate questions in real time; the GMAT and GRE exams, for example, adjust difficulty based on the test-taker’s immediate performance.
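The sketch below shows a deliberately simplified continuous-calibration rule in the spirit of computer-adaptive testing; it is an illustrative assumption, not the actual algorithm used by the GMAT or GRE.

```python
# Deliberately simplified continuous-calibration rule (an illustrative
# assumption, not the actual GMAT/GRE algorithm): item difficulty steps up
# after a correct response and down after an incorrect one, clamped to a
# 1-10 scale.
def next_difficulty(current: float, answered_correctly: bool, step: float = 0.5) -> float:
    """Return the difficulty of the next item on a 1-10 scale."""
    adjusted = current + step if answered_correctly else current - step
    return max(1.0, min(10.0, adjusted))


difficulty = 5.0
for correct in [True, True, False, True]:
    difficulty = next_difficulty(difficulty, correct)
    print(f"Next item difficulty: {difficulty}")
```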

Conclusion

Regardless of which continuous evaluation framework for AI is being reviewed, education technologists need to consider learner level when selecting a protocol. Be sure to:

• Select version control systems that track adjustments to AI models.

• Consider the impact on current assessment requirements before implementing significant changes.

• Develop transparency reporting to all stakeholders about system performance and changes.
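One lightweight way to support both version tracking and transparency reporting is to keep a structured change log for the AI model; the record fields below are hypothetical and not tied to any particular version control system.

```python
# Illustrative sketch: a lightweight change log supporting version tracking and
# transparency reporting for an AI model. The fields are hypothetical and not
# tied to any particular version control system.
from dataclasses import dataclass


@dataclass
class ModelChangeRecord:
    version: str
    date: str
    summary: str
    accuracy_before: float
    accuracy_after: float

    def report_line(self) -> str:
        """One-line summary suitable for a stakeholder-facing report."""
        delta = self.accuracy_after - self.accuracy_before
        return f"{self.date} v{self.version}: {self.summary} (accuracy {delta:+.2%})"


change = ModelChangeRecord("2.3.1", "2025-09-01",
                           "Recalibrated reading-level items", 0.81, 0.84)
print(change.report_line())
# 2025-09-01 v2.3.1: Recalibrated reading-level items (accuracy +3.00%)
```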
