On Writing, and an MIT Study
Key Points
- Despite widespread interpretations of a recent MIT study as evidence that ChatGPT erodes critical thinking, the study itself is preliminary, highlighting reduced cognitive engagement and memory when users depend on AI tools for short, timed writing tasks—but stops short of declaring AI inherently harmful.
- EEG-measured brain activity during standardized, time-constrained essays may not accurately capture genuine critical thinking or meaningful cognitive engagement; deeper thought and original ideas often emerge through iterative, reflective writing tasks.
- Human evaluators in the study consistently rated essays written without AI assistance as more original and meaningful, yet prior research has shown that human evaluators struggle to distinguish AI-generated writing from human writing, often performing no better than chance.
- Current standardized essay assessments reward superficial structure and fluency rather than genuine learning, idea ownership, or internal revision, a flaw now amplified by generative AI's capability to mimic surface-level polish.
- More authentic assessments—allowing preparation with various tools but requiring unaided writing—would better measure critical thinking and memory retention, challenging educational systems to raise standards rather than simplistically labeling technology as harmful.
When a thought crosses my mind, or I keep coming across a topic I may find value in writing about later, I typically store the idea in a little notepad to revisit. The intermediary time is usually where most of my thinking happens, though it is largely unstructured and, at best, in some vague, ethereal form. Once I hit a critical mass of information or ideas on a topic, I undertake the task of formalizing my thoughts in writing, after which I may share the piece publicly for others to read and respond to with thoughts of their own. I typically give these ideas a title that may not be indicative of the viewpoint of the final piece, but that is a good reminder of where my headspace was when I first added it to the list. Currently, the list is a bit short, but to give a glimpse into what is on my mind, it is as follows:
- Deep Research / The Essay is Dead
- Useless AI Tech Policies
- What bias actually exists in LLMs?
The result is that I end up writing less and probably fall afoul of content-recommendation algorithms, but I hope it results in pieces that are more thoughtful. I want my time and writing to be about more than whatever is trending at the moment. What follows is a partial exception to avoiding trends, but it is at least more substantive than talking about how you can create action-figure images using ChatGPT.
This most recent gap since my last piece, however, was due to completing my Master’s in Educational Technology and Instructional Design before summer, which had me at the edge of my mental bandwidth. Now that I have finished the degree, culminating in a capstone project built to rubric requirements but destined to be tossed into a void, I can get back to this style of personal writing, which I find far more enjoyable and a bit more valuable.
Despite being busy, I still managed to amass a piece of writing approximately six thousand words long titled The Essay is Dead, but Deep Research Didn’t Kill It. I had been slowly piecing together thoughts on essays for some time, with a rather strong thesis: essays, at least as typically taught in secondary institutions, have extremely limited value. Essay assignments have become less an exercise in generating original thought and more an exercise in factual regurgitation and syntax, and generative AI tools have done a great job of exposing their flaws. While I may go into this idea in more depth later (that piece still feels half-cooked), I have decided to move to a topic that is tangentially related and currently making the rounds on social media.
Recently, a piece from Andrew R. Chow, titled ChatGPT May Be Eroding Critical Thinking Skills, According to a New MIT Study, made waves across social media and teaching circles. It was the first thing I saw when heading back to X after a break, and, the very next day, a coworker sent the piece my way. If you are interested in the text of the study itself, you can get it from MIT’s own page, or from arXiv. If the conversations on social media and in teaching circles are any indication, detractors of AI usage seem to have found their smoking gun: proof that tools such as LLMs are making people dumber, and that the wide availability of AI tools will be the cause of cognitive atrophy across society.
I feel that the study is being misinterpreted, which has not been helped by the authors' vocal anti-AI viewpoints (they even tried to include a prompt-injection "attack" in the paper, which failed to work). Further, I will go as far as to say that I believe the study methodology is fundamentally flawed for essay-writing tasks and for evaluating the accumulation of cognitive debt.
Before getting to what I view as some fundamental issues with the study methodology, I want to be entirely clear that I applaud the authors for getting their study out publicly, before peer review. Setting aside the fact that peer review is far from infallible (you are asking busy people to do largely uncompensated work on consistently tight deadlines), it shows agency in taking action against something they view as a problem (some policymakers rolling out LLM tools at age levels well before they are likely to be developmentally appropriate). If any skill is going to be most valuable in the age of widespread AI tools, it’s agency, and putting this paper out publicly means it will at the very least be read, and valuable discourse can hopefully happen around the paper and AI tool usage. Releasing their work like this at least ensures it will meet a kinder fate than, for example, the third of World Bank reports that are never downloaded, or the deluge of academic research that starts and ends its life in a void. At the very least, hopefully its data will get scraped for LLM training.
The paper is over 150 pages long, but in short, they took three groups of people and had them write essays on SAT-style prompts. One group used no tools at all (the “brain-only” group). The second group could use Google Search, and the third used only ChatGPT (a controlled version of GPT-4o). EEG headsets measured their brain activity as they wrote. Most participants completed three sessions in their assigned conditions, but a smaller subset returned for a fourth session, where the brain-only and LLM users swapped roles. Each essay was written in 20 minutes, making time a real constraint across all conditions.
From their results, across sessions, the brain-only group showed the highest levels of mental engagement. The LLM group consistently showed the lowest, particularly in brain regions associated with memory recall and attentional control. In session four, when LLM users were asked to write without assistance, they demonstrated significantly lower brain activity than the brain-only group did in their unaided sessions—suggesting that prior exposure to AI may result in sustained reductions in cognitive effort, even when the tool is no longer in use.
Despite this reduction in neural activation, essays produced by the LLM group were frequently rated higher by an AI scoring system, likely due to surface-level fluency and structural polish. However, the human evaluators consistently favored essays written by the brain-only group, scoring them higher for originality, depth of argument, and evidence of independent reasoning. These essays, although sometimes less refined, were perceived as demonstrating greater diversity in vocabulary and thought patterns, as shown through natural language processing analysis.
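The vocabulary-diversity finding above is the kind of thing NLP analysis can quantify. As a rough illustration only (this is not the paper's actual pipeline, which is more involved), one of the simplest such measures is a type-token ratio: the share of words in a text that are distinct.

```python
# Minimal sketch of a lexical-diversity measure: type-token ratio
# (unique words divided by total words). Higher values mean less
# repeated vocabulary. Illustrative only; real pipelines normalize
# punctuation, lemmatize, and correct for text length.
def type_token_ratio(text: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

varied = "the quick brown fox jumps over the lazy dog"
repetitive = "the dog saw the dog and the dog ran"
print(type_token_ratio(varied))      # higher: mostly distinct words
print(type_token_ratio(repetitive))  # lower: heavy repetition
```

A metric this crude is sensitive to essay length, which is one reason the raw ratio alone would not support the paper's claims; it just shows the flavor of analysis involved.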
Interview data further revealed a marked contrast in participants' sense of ownership and memory. Those in the LLM group were frequently unable to recall or quote from their own essays shortly after writing, and many described their work as feeling less personal or meaningful. In contrast, brain-only participants exhibited strong memory recall and expressed a clear sense of authorship. Although the paper does not explicitly call these essays “soulless,” it strongly suggests that LLM-assisted writing lacked the distinctiveness and personal investment that characterize authentic, human-generated work.
The authors themselves are far more measured in their conclusions than much of the commentary circulating online would suggest, and the actual findings are much more restrained. What the study shows is that using ChatGPT for short-form, time-constrained writing tasks appears to reduce cognitive engagement, lower short-term memory recall, and diminish participants' sense of ownership over their writing. These effects were most evident when users transitioned away from AI and continued to show diminished neural activity, suggesting potential dependency on external tools.
However, the paper wisely stops well short of declaring that AI tools are inherently harmful or intellectually corrosive. The authors frame their results as preliminary and exploratory, highlighting important trends that warrant further study—particularly as LLMs are integrated into educational settings with little structured guidance. They raise valid concerns about how frictionless automation may reshape the writing process, especially when paired with shallow assessments, but they do not treat the observed changes as signs of irreversible cognitive decline. If anything, the study underscores the urgent need for intentionality and design in how we use these tools in learning environments.
In that respect, the paper deserves credit not only for the transparency of its release, but also for the tone of its conclusions. The paper itself avoids panic and opens space for productive debate. This is precisely the kind of work we should want circulating publicly: data-rich, largely transparent in its methodology, and cautious in interpretation, yet bold enough to flag early patterns worth paying attention to.
With that being said, there were several things that immediately jumped out at me from this study as odd:
- The reliance on EEG data to infer cognitive engagement during writing and whether this truly reflects meaningful or productive thought.
- The use of standardized, time-constrained essay prompts, which prioritize external form over internal idea development.
- The strength of the claims around memory and authorship, particularly given the impersonal nature of the writing task.
- The lack of clarity around the human scoring process, especially in light of recent findings that suggest people often cannot distinguish between human- and AI-generated writing.
Purpose vs. Real-World Relevance of EEG Use to Measure Brain Activity
From my understanding of EEG (electroencephalography) as a tool for measuring brain activity during writing tasks, higher activity does not necessarily mean better thinking or better writing.
Also, in everyday contexts, most deep thinking happens before and after writing, not always during it (see the start of this piece). Whenever I hit standardized essay assessments as a student, I went into an autopilot trance; nothing I did during them would I call deep thinking, as I was mostly focused on external revision processes. Do EEG readings during a 20-minute typing task truly map to idea-level engagement or synthesis, especially for writing as multifaceted as essay composition?
While I have no evidence for this, I also could not shake the feeling that the authors wanted to do a cool-looking EEG study and then fit a piece of research to it, rather than asking whether EEG readings actually map to the outcomes that matter in essay writing.
Task Suitability as Standardized Essays Emphasize External Revision
The SAT-style prompts in this study emphasized constrained, impromptu writing—a format that trains students to perform under pressure, not to develop rich, internally revised ideas over time.
Research, along with my own experience, supports the following:
- Short-timed essays push students toward surface-level strategies.
- Deep cognitive processing and ownership emerge more in self-selected, iterative projects or collaborative writing.
Would the findings hold for more authentic writing tasks—long-form essays, personal projects, collaborative reports, or inquiry-based writing? The study does not answer this. It implicitly treats SAT-style essays as a proxy for academic writing and critical thinking, which is a flawed assumption in terms of authenticity, autonomy, and memory relevance.
Very simply, our cognitive systems aren't built to retain meaningless content, which is how I would categorize most of the writing prompts used in the study.
Reported Ownership and Memory
The study does show that in interviews, many LLM participants couldn’t quote or summarize their own essays shortly after writing, while brain-only participants did so more effectively. But we should interpret this cautiously:
- 20 minutes is not long enough for durable memory encoding, especially for a non-personal task.
- Memory differences may reflect writing method familiarity, not AI disengagement. LLM users may have focused on prompt engineering or editing rather than internalizing content.
- Interview data is subjective and potentially influenced by participant expectations or confirmation bias—especially if they suspect what the study is testing.
Also, the idea of “ownership” is slippery. What does it mean to “own” an idea generated under time pressure in a lab setting? Would those same participants feel differently about an essay, a poem, or a story co-written with AI over days or weeks? Quite possibly.
Human Consistency and Bias in Essay Scoring
The study says that human teachers “consistently scored brain-only essays higher,” but it’s unclear:
- Who the teachers were (e.g., were they trained scorers, writing instructors, or general educators?).
- What rubric they used (was it holistic? Trait-based? Aligned with SAT standards?).
- Whether essays were blinded or randomized in terms of origin.
In prior research from Clark et al. (2021), humans have consistently struggled to reliably distinguish AI-generated writing from human-authored work across a wide range of formats. Without specific training or prompts, evaluators perform at chance levels, roughly as reliable as flipping a coin. Even when training is introduced, accuracy only improves marginally, often hovering around 55%.
Other research has found similar patterns. Unless the writing contains obvious giveaways—common AI phrases or formulaic sentence structures—evaluators frequently fall victim to anchoring bias. Polished writing is often assumed to be AI-generated (I hate how I now get called out for AI writing due to knowing what an em dash is, or for my general writing patterns I have had for a decade), while disfluency or informality is mistaken for human authenticity. Ironically, attempts to mimic natural "errors" can make AI output more convincing to readers who associate imperfection with sincerity.
Expert reviewers, such as professors, do outperform general audiences but still commonly fall prey to false positives. And both human and algorithmic detectors are easily thrown off by paraphrasing tools or minor stylistic tweaks, which suggests that the boundary between human and machine writing is porous. I find it incredibly suspect that a blind assessor would be able to consistently tell the difference between AI and human writing.
What I Would Like To See
I'm not going to argue that copying and pasting from an LLM like ChatGPT leads to memory retention. But I also don’t think this study evaluated the right thing, because copying and pasting from any source under tight time constraints isn’t going to result in meaningful memory retention. If you gave me twenty minutes and access to any tool, I’d do exactly what most people would: copy, paraphrase, and move on, just to complete the task and score well. The researchers, knowingly or not, gave the different groups different tasks. The LLM group would likely view the task as "generate an essay," while the other groups, given the time constraints involved, would view it as "write an essay."
In that sense, the study’s methodology just reinforced how flawed our current essay-based assessments are when it comes to measuring actual learning or thought. The writing task rewarded speed and surface-level structure, not deep engagement. If anything, the results reflect the task more than the tool.
What I would have liked to see is something closer to how I’ve taught in humanities classrooms, though I’ll admit it’s harder to control experimentally. For example:
Give students five potential essay prompts ahead of time (I typically had the students generate these ideas themselves, to get them thinking about how to tie together overarching themes from class). Allow one group to use AI during their preparation, give another group access to search engines, and give the last group only their written notes or hard-copy sources. Then, under proctored conditions, three of the five prompts would appear on the exam, and students would choose one to write about, with no tools allowed during writing. Finally, compare performance, originality, and recall.
That kind of design would speak more directly to how people prepare, encode, and transfer knowledge and not just how they complete a timed task.
So, does this study prove that ChatGPT atrophies your brain or makes you less intelligent? Not really. It does provide preliminary evidence that over-reliance on AI tools can lead to measurable changes in cognitive engagement, memory encoding, and the perceived authenticity of one’s own writing (though I would argue this could result from over-reliance on any tool).
I don’t think their results are invalid, but I am concerned about how they have been overgeneralized from this sample to broader claims about the atrophy of critical thinking. For decades we have said that each new technology is going to make people dumber, when really each advance means we need to raise the bar. How to do that in an industrial machine like the current education system is a story for another day.
Other Pieces of Interest
Here are some other links that people may find interesting, but didn’t find their way into the main body of this piece:
- Another paper, this one out of Microsoft, which I found much more interesting on AI and critical thinking: The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers.