( ・_・)ノ Ritchot's Corner

The evaluation framework for my AI literacy course


I have taken a fair amount of corporate training, the mandatory compliance and upskilling kind, and I have a confession about how I do it. I do not watch the videos, and I rarely read the content. I open the assessment, read the questions, and reason out which answers the course is looking for, which is usually not hard, because the questions telegraph their own keys. I pass. The system records a completion. I understood nothing I did not already know, and I changed nothing about how I worked the next day. The usual approach to corporate training is broken, and it is broken in a specific place: what it chooses to measure.

When I rebuilt my AI literacy course, the needs analysis ended on that same complaint. The programs already on the market, I wrote, measure completion rather than understanding or behavior change, and almost none of them ship a manager observation protocol or any behavioral follow-up at all. So before I wrote a single module, I had to decide how the course's efficacy would be judged.

Designing the evaluation first is a method called backward design, used at every level of teaching: you specify the understanding and the behavior change you intend to produce, and the evidence that would show it, before you build the content meant to get you there. An evaluation assembled at the end can only ask whether people liked the experience and clicked through it, which is exactly the data my compliance-training trick generates and exactly the data that tells you nothing. So I started by deciding what would count as success.

Why completion is the wrong metric

The standard artifact of corporate training is the completion rate, and it is useful for exactly one thing: confirming that people clicked through the modules. A 95% completion rate is a real number, and an L&D team that reports it is not lying. The problem is everything the number cannot see. It cannot tell whether anyone understood a concept, whether they changed a single workflow, or whether the organization is any different on the far side of the spend, which is how someone like me passes a course while learning nothing. Donald Kirkpatrick's model, the dominant evaluation taxonomy in the field since the 1950s (hypothetically, at least; the number of awful corporate L&D courses may beg to differ), separates the questions that matter into four levels: reaction (did they like it), learning (did they understand it), behavior (did they change what they do), and results (did the organization benefit).

AI training tends to stop at the first two levels, and I think that is because Levels 1 and 2 are where a course is cheapest to make look successful. A satisfaction survey returns warm numbers almost by default, and a multiple-choice post-quiz returns a knowledge gain almost by construction, particularly the recall-style quiz a motivated test-taker can reverse-engineer without understanding anything (I would know). The behavior and the business case sit at Levels 3 and 4, where measurement gets expensive, slow, and uncomfortable, because that is where a program can be shown not to have worked. So I built the harder instruments first. If I was going to claim the course produces understanding and changes behavior, I needed measures that could catch me being wrong about both.

Level 1: Reaction

Level 1 is the easy tier, and most programs waste it. A generic "I enjoyed the program" item produces a number that maps to no design decision and therefore cannot be acted on. So every item in the reaction survey was written against a single test: can this response drive a specific revision? "The scenarios reflected situations I encounter in my role" maps straight to the scenario layer and can be fixed if it scores low. "How would you rate this program overall" maps to nothing. The instrument runs eight to thirteen scaled items across four dimensions (relevance, confidence change, content quality, intent to apply), plus a standard Net Promoter Score item so the program can be benchmarked against an organization's other L&D initiatives.
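
The Net Promoter Score item is the one piece of the reaction survey with a fixed formula, so it can be sketched directly. This is a minimal illustration that assumes the standard NPS convention (a 0 to 10 "how likely are you to recommend" rating, promoters at 9 or 10, detractors at 0 through 6); the function name and the sample ratings are hypothetical, not part of the course's item bank.

```python
def net_promoter_score(ratings):
    """Return NPS (from -100 to 100) for a list of 0-10 recommendation ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 0 or r > 10 for r in ratings):
        raise ValueError("ratings must be between 0 and 10")
    promoters = sum(1 for r in ratings if r >= 9)    # 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)   # 0 through 6
    # Passives (7 or 8) count toward the total but toward neither group.
    return 100 * (promoters - detractors) / len(ratings)


# Hypothetical responses from a ten-person cohort.
sample_ratings = [10, 9, 9, 8, 7, 7, 6, 5, 10, 3]
print(f"NPS: {net_promoter_score(sample_ratings):+.0f}")
```

Because the score is a difference of two percentages, a small cohort can swing it sharply, which is one more reason to read it as a benchmark against other programs rather than as a verdict on this one.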

One structural decision matters here: the reaction survey is not a live instrument. What the page presents is the proposed item bank, the questions a deploying organization would draw from, not a running survey. It stays a design for two reasons. First, the course is built for the general workforce, and a reaction survey only works once it is tuned to the institution administering it, so shipping a fixed, generic version as if it were deployable would misrepresent how reaction measurement works. Second, privacy: the course collects no personally identifiable information and runs no server-side database (a deliberate choice for a solo-built portfolio piece), and a reaction survey with free-text fields is precisely the data I am not equipped to hold as a solo practitioner. A deploying organization would adapt the items to its own context and run them through its existing survey infrastructure.

Level 2: Learning

Level 2 is where the course measures whether anyone understood anything, and it is the level with empirical evidence behind it (albeit rewritten for a new audience). The instrument is a ten-item, scenario-based pre/post assessment built on a parallel-form design, which means the pre and post tests measure the same ten constructs through different workplace scenarios so participants cannot pass by remembering their earlier answers. The scenario format is also the part that resists the reverse-engineering trick I opened with: a stem that asks what you should do in a specific, unfamiliar situation, with distractors drawn from documented misconceptions, is far harder to game than a definitional recall item, because you have to recognize the construct rather than pattern-match the phrasing. The correct answer letter shifts on eight of the ten construct pairs, with only tokenization and output verification keeping their position, which closes off answer recall. The vocabulary is staged across the two forms as well: the pre-assessment uses no course terminology, since learners have not met it yet, while the post-assessment leans on the program's language (tokenization, next-token prediction, the context window, the augmentation-automation spectrum), so reasoning fluently in that vocabulary is itself part of what the post measures. Every item traces backward to a documented capability gap and forward to one of the program's seventeen performance objectives, the traceability I keep returning to.
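
The answer-position property of a parallel-form pair is easy to check mechanically. The sketch below assumes each form's key is stored as a mapping from construct to correct letter. Only the two construct names that keep their position come from the description above; the other eight names and every answer letter are hypothetical placeholders, not the real keys.

```python
def constructs_keeping_position(pre_key, post_key):
    """Return the constructs whose correct answer letter is the same on both forms."""
    if pre_key.keys() != post_key.keys():
        raise ValueError("both forms must cover the same constructs")
    return sorted(c for c in pre_key if pre_key[c] == post_key[c])


# Hypothetical answer keys: letters are invented for illustration only.
pre_key = {
    "tokenization": "B", "output_verification": "C",
    "construct_03": "A", "construct_04": "D", "construct_05": "B",
    "construct_06": "A", "construct_07": "C", "construct_08": "D",
    "construct_09": "A", "construct_10": "B",
}
post_key = {
    "tokenization": "B", "output_verification": "C",
    "construct_03": "C", "construct_04": "A", "construct_05": "D",
    "construct_06": "B", "construct_07": "A", "construct_08": "B",
    "construct_09": "D", "construct_10": "C",
}

unchanged = constructs_keeping_position(pre_key, post_key)
shifted = len(pre_key) - len(unchanged)
print(f"{shifted} of {len(pre_key)} keys shift position; unchanged: {unchanged}")
```

A check like this is worth rerunning whenever an item is revised, since editing a distractor set can silently move a key back to its old position.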

The real data comes from the prior version. My M.Ed. capstone piloted the tokenization instruction this course is built on with ten participants in a mixed-methods pre/post design: pre-assessment scores averaged 3.1 out of 5, post-assessment scores averaged 4.5, and nine of ten participants showed positive gains, with triangulation across reflections and a scenario task confirming the knowledge transferred.[1] I want to be precise: this validates the instructional design overall, not this specific ten-item corporate instrument, and it came from a different population, secondary students rather than the mid-career professionals this program targets, so it counts as suggestive evidence rather than proof. The corporate assessment has no reliability statistics yet, partly because the privacy-by-design choice keeps every response on the participant's own device, and partly because those statistics require a first deployment cohort it has not had. What I can say is that this is a second iteration on a validated foundation, not a first-generation prototype, which is a meaningfully different epistemic position from "trust me, it works."
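
For a first deployment cohort, the same three summary figures reported for the pilot (pre mean, post mean, and the count of participants with a positive gain) can be computed from paired scores. This is a sketch under stated assumptions: the function name is hypothetical, scores are paired by participant and listed in the same order, and the numbers below are invented scores out of 10 for a six-person cohort, not the capstone data.

```python
from statistics import mean


def summarize_pre_post(pre, post):
    """Summarize paired pre/post scores (same participants, same order)."""
    if len(pre) != len(post) or not pre:
        raise ValueError("pre and post must be non-empty and the same length")
    gains = [after - before for before, after in zip(pre, post)]
    return {
        "pre_mean": mean(pre),
        "post_mean": mean(post),
        "mean_gain": mean(gains),
        "positive_gains": sum(1 for g in gains if g > 0),
        "participants": len(gains),
    }


# Hypothetical scores out of 10; NOT the pilot results.
pre_scores = [4, 6, 5, 7, 3, 6]
post_scores = [7, 8, 5, 9, 6, 7]

summary = summarize_pre_post(pre_scores, post_scores)
print(f"pre {summary['pre_mean']:.2f} -> post {summary['post_mean']:.2f}, "
      f"{summary['positive_gains']} of {summary['participants']} improved")
```

With cohorts this small, the count of participants who improved is often more informative than the mean, because one large gain can carry an average on its own.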

Level 3: Behavior

Level 3 measures whether participants do something differently at work, sampled at 30, 60, and 90 days, through two parallel instruments: a manager evidence review and a participant self-assessment, each rating the same behavioral indicators on a four-point frequency scale. The scale has no neutral midpoint, which forces a directional call between "occasionally" and "consistently," and its top rating is "modeling for others," because peer modeling is the mechanism by which one person's competence becomes a team's. The indicators themselves are organized by the program's 4D framework (Delegation, Description, Discernment, Diligence) and phrased as concrete actions a manager can witness or a participant can document. "Verifies factual claims against independent sources before incorporating them" is observable. "Values accuracy" is not, and the instrument refuses to score it.

A great many serious companies now measure AI adoption through usage volume, and I "get" the logic: usage is cheap to capture, it scales without managers in the loop, and more usage plausibly means more value. Through early 2026 this became its own genre of management. Reporting in CIO described internal leaderboards ranking employees by token consumption, AI usage folded into performance reviews, and a coinage, tokenmaxxing, for the predictable result: people firing tools at low-value busywork to climb the board. One leaked Meta dashboard reportedly ranked roughly 85,000 employees by tokens burned before it was taken down. The logic breaks at the assumption buried in the multiplication step, that each unit of usage yields a uniform unit of value.

The research the course is built on contradicts that assumption directly. Tamkin and McCrory found task-time savings ranging from 20% to 95% depending on the task, and Handa et al. found that 57% of AI-touched tasks are augmentation while 43% are automation, which means the value of an interaction depends on what was delegated and how it was structured. Those are the exact competencies the program trains, Delegation and Description. So the most literate user often shows lower token consumption, not higher, because they decompose the task, prompt it well enough to need fewer iterations, and route the parts the model cannot do reliably back to their own hands. A leaderboard would rank them below the colleague pasting whole documents into a chatbot and accepting the first answer, which is backwards and entirely unhelpful.

The companies running these leaderboards have started dismantling them. By late May 2026, Amazon had deprecated an employee-built usage leaderboard, with a senior engineering VP reportedly telling staff to "please don't use AI just for the sake of using AI," Meta had quietly taken its own leaderboard down, and Fortune had declared tokenmaxxing over as a way to measure AI return. None of this should have required a reversal to see. Measuring competence by token volume was a bad idea on day one, and plenty of practitioners, myself among them, said so while the leaderboards were still going up: it rewards the loudest, least disciplined user and punishes the one who gets more done with less. I built that conviction into the evaluation framework in mid-May, before the start of the public climbdown. When a metric collapses this fast under its own incentives, it was never measuring the right thing. The course measures what a participant does with AI, not how much, for that reason.

Behavior is hard to measure for a reason that predates AI: the transfer gap. Philippa Hardman's synthesis of the transfer literature estimates that only 10 to 20% of learning investment produces measurable on-the-job behavior change, against a global L&D spend she puts near $400 billion.[2] An evaluation that stops at Level 2 cannot even see this leak. Mine is built to catch it: the parallel manager and self instruments produce convergence data, and the interesting signal is where they disagree. When a participant rates a behavior high and the manager rates it low, that gap is diagnostic rather than embarrassing, because it prompts the question of which of three problems is in play: a legibility problem (the behavior is happening but leaves no visible artifact), self-report inflation, or a manager who was never oriented on what to look for. Each diagnosis points to a different fix, which is the entire point of measuring at this level instead of asserting success from a completion log.
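
The convergence check between the two instruments can be sketched as a per-indicator gap. This sketch assumes both instruments score each indicator from 1 to 4 on the four-point frequency scale; the two-point flagging threshold is my assumption, not a rule from the instrument. The first indicator is quoted from the Level 3 description, the other two are hypothetical placeholders, and all ratings are invented.

```python
def convergence_report(self_ratings, manager_ratings, threshold=2):
    """Compare self and manager ratings (1-4) and flag large disagreements."""
    if self_ratings.keys() != manager_ratings.keys():
        raise ValueError("both instruments must rate the same indicators")
    rows = []
    for indicator, self_score in self_ratings.items():
        gap = self_score - manager_ratings[indicator]
        if gap >= threshold:
            status = "self higher: follow up"
        elif gap <= -threshold:
            status = "manager higher: follow up"
        else:
            status = "converged"
        rows.append((indicator, gap, status))
    return rows


# Invented ratings for one participant at one checkpoint.
self_ratings = {
    "Verifies factual claims against independent sources before incorporating them": 4,
    "hypothetical indicator A": 3,
    "hypothetical indicator B": 2,
}
manager_ratings = {
    "Verifies factual claims against independent sources before incorporating them": 2,
    "hypothetical indicator A": 3,
    "hypothetical indicator B": 3,
}

report = convergence_report(self_ratings, manager_ratings)
flagged = [row for row in report if row[2] != "converged"]
for indicator, gap, status in report:
    print(f"{gap:+d}  {status}  {indicator}")
```

The code only locates the disagreement. Deciding which explanation applies still takes a conversation and a look at the artifacts, which is why the flag says "follow up" rather than naming a cause.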

The honest weakness of a manager review is the manager. The instrument only works if managers spend real attention on it, reading the artifacts and the written examples rather than speed-running the rating scale, and a manager who treats it as one more compliance checkbox produces confident noise. That is a design risk worth naming, so the framework offers alternatives a deploying organization can layer in or substitute: sampled artifact audits, where an L&D reviewer or quality lead scores a random sample of AI-assisted deliverables against the same indicators; peer review, where colleagues close enough to the work can actually see it; and self-assessment with spot-checks, where a participant's specific examples are verified against real artifacts rather than taken on faith. None of these is bias-free either, but each fails differently, and the more independent the measurement surfaces, the less any single blind spot decides the result.

Level 4: Results

Level 4 asks the question an executive cares about: did the organization benefit, and was the spend worth it. The failure mode here is the volume-ROI calculation, where a vendor multiplies seats by an assumed time saving and returns a number with a dollar sign. It is the structural equivalent of estimating a gym's health benefit by counting badge swipes at the door, since the turnstile count tells you nothing about whether anyone trained. The bill comes due literally: Uber's COO told Fortune in May 2026 that the company had burned its entire annual AI budget in four months and could not yet draw a clean line from that spend to anything its users would feel, which is the volume-ROI fallacy arriving as an invoice. So the Level 4 blueprint measures outcomes downstream of competent use instead: time-to-first-draft and revision cycles read together (a faster draft that needs more rounds is not a gain), AI-related error rate as the highest-consequence quality measure, and a concealment-reduction metric that tracks whether AI use is moving from hidden to visible (the 69% concealment finding from Anthropic's interviewer study, now measured as an organizational KPI).

A fair question is whether any of this is measurable at enterprise scale, and the answer cuts against the easy assumption. What the AI vendors expose to administrators is usage telemetry, not outcomes. ChatGPT Enterprise reports active users, message volumes, and top tools; Claude Enterprise reports conversation and message counts, projects, and seat utilization; Gemini for Workspace reports active users and per-app feature usage across Gmail and Docs. Every one of those is an adoption metric, the same volume signal the leaderboards ran on. The conversation content that might let you judge quality sits behind a compliance or eDiscovery interface built for litigation, not performance review. So the Level 4 measures that actually matter, revision cycles, time-to-first-draft, AI-attributable error rate, do not come from the AI provider at all. They come from the organization's own systems: the project tracker, the version history, the QA and incident logs, the review workflow. Some, like a clean count of AI-caused errors, do not exist in most organizations yet and have to be built before they can be measured. That is the unglamorous reality of results measurement: the provider hands you an adoption dashboard, and the outcomes you actually need are yours to instrument.

The honest part is the attribution, which is where most ROI claims quietly cheat. You cannot prove a program caused an outcome without a randomized trial no company will run, so the blueprint reports a range rather than a point estimate, built from comparison groups where deployment is phased, trend-line analysis against pre-program baselines, and explicitly hedged manager and participant estimates adjusted downward for optimism. The worked example runs a 200-person deployment with deliberately conservative inputs at every step (25% of task time AI-eligible against a documented 57%, 30% time savings against an 81% in-conversation median, 40% of improvement attributed to the program rather than 100%) and still projects a 191% return. I trust that number about as far as its inputs, which is to say I present it as a model the deploying organization repopulates with its own data, not as a result I am claiming. The point of the chain is not the figure. It is that each level hands the next one its evidence: reaction establishes credibility, learning establishes knowledge gain, behavior establishes that the knowledge moved into practice, and only then does a results claim have a mechanism to stand on. Strip out Level 3 and the Level 4 number is just a correlation waiting for someone to ask how you know.
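
The structure of that chain, and the reason to report a range, can be sketched in a few lines. The headcount, the eligible share, the time savings, and the 40% attribution come from the worked example above. The annual hours, hourly cost, program cost, and the 20% low-end attribution are hypothetical placeholders I chose for illustration, so this sketch does not reproduce the blueprint's 191% figure and is not meant to.

```python
def roi_percent(headcount, annual_hours, hourly_cost,
                eligible_share, time_savings, attribution, program_cost):
    """Return program ROI as a percentage of program cost."""
    if program_cost <= 0:
        raise ValueError("program_cost must be positive")
    # Hours saved across the cohort on AI-eligible work.
    hours_saved = headcount * annual_hours * eligible_share * time_savings
    # Only the share of the improvement credited to the program counts.
    attributed_benefit = hours_saved * hourly_cost * attribution
    return 100 * (attributed_benefit - program_cost) / program_cost


inputs = dict(
    headcount=200,           # from the worked example
    eligible_share=0.25,     # from the worked example
    time_savings=0.30,       # from the worked example
    annual_hours=1800,       # hypothetical placeholder
    hourly_cost=50.0,        # hypothetical placeholder
    program_cost=200_000.0,  # hypothetical placeholder
)

# Report a range by varying the attribution share instead of fixing it.
low = roi_percent(attribution=0.20, **inputs)   # hypothetical low end
high = roi_percent(attribution=0.40, **inputs)  # worked example's share
print(f"ROI range: {low:.0f}% to {high:.0f}%")
```

Every input multiplies through to the result, so halving the attribution share moves the return far more than any rounding choice would. That sensitivity is why the inputs have to be repopulated with an organization's own baseline data.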

Limitations

I should be clear about how much of this is built versus proven. One level has real data. Level 2 rests on a ten-person capstone pilot, and even that validated the instructional design on a different population, not the corporate instrument that now carries it. Levels 1, 3, and 4 are instrument designs that no cohort has run through, which means they are arguments about what good measurement would look like rather than evidence that the program measures up. The Level 3 manager review depends on managers who are oriented, willing, and actually reading the qualitative examples rather than scanning the numbers, and an organization that cannot supply that will get noise. The Level 4 ROI model is illustrative, sensitive to three inputs I flagged, and useless without baseline data the organization has to collect before deployment, not after. Three of the twelve behavioral indicators are self-report by nature, mitigated by a specificity requirement but not eliminated by it. And I built all of this solo, around a full-time teaching job, which is the right frame for reading it: rigorous and traceable in design, unproven in deployment, and waiting on the one thing a portfolio piece cannot manufacture, which is an organization willing to run it.

The framework is mostly a promissory note, and an honest one only because it says so. The needs analysis told me what understanding and which behaviors would matter. The evaluation is the apparatus for finding out whether the course produces them, or whether I built another well-made thing that certifies its own completion to people who, like me, never really showed up. I do not get to know which until someone deploys it. The next post turns from how the course is judged to how it is built, the program design that sits between the two. Until a cohort runs the instruments, the most defensible thing I can say about the course is not that it works, but that I have made it falsifiable, and I would rather ship a course that can be proven wrong than one engineered to look right.


The full instrument behind each level is available as a PDF: Level 1: Reaction, Level 2: Learning, Level 3: Behavior, and Level 4: Results.


  1. The capstone (Ritchot, Western Governors University, June 2025) taught tokenization to students in Grades 9 through 11. That is the different population I mean, and I do not want to imply it was the easy version of this work: getting adolescents to engage with a course they took around everything else competing for their attention took months of relationship-building and the soft skills the role runs on. The point is narrower, that secondary students differ enough from mid-career professionals that the capstone data transfers as suggestive support, not direct proof. I discussed the capstone and its limits in the needs analysis post.

  2. Hardman's transfer estimate and the $400 billion global L&D figure are widely cited in the field but worth treating as directional rather than precise, since both depend heavily on how "behavior change" and "L&D spend" are defined across very different organizations.