Em-Dash Usage by LLMs Reveals a Lot About AI Biases: “—”
The training data were written by verbose cognitive elites.
Spend any amount of time with ChatGPT, or its siblings from Anthropic, Google, or Meta, and you’ll start to notice a quirk. These bots love the em dash; you know the one: “—”. They reach for it again and again to pivot, to elaborate, to embed an idea mid-sentence with a dash of drama. The em dash isn’t just a punctuation mark here. It’s a tell. A stylistic fingerprint.
And it opens a window into a deeper, more consequential insight: the training data behind large language models (LLMs) is heavily shaped by cognitive elites. If these systems overuse the em dash, it’s because their training data overrepresents the people who overuse the em dash: highly educated, text-producing elites. And if the style of these voices is overrepresented, so are their assumptions, norms, and biases.
Because the training data is proprietary, the em dash is one of the few clues we have about who wrote the text that trained LLMs, the text they now draw on to generate ideas and prose. Its prevalence suggests that verbose cognitive elites wrote the training data, and that reveals a lot about the potential biases in LLMs.
The Em Dash as Cultural Marker
The em dash is not neutral. It’s a marker of a particular kind of writing, more common in academia, longform journalism, op-ed essays, and cultural criticism. It’s less prevalent in informal writing, technical manuals, or conversational speech. In fact, many people don’t use it at all.
In a 2019 analysis of online English usage, linguist Gretchen McCulloch described the em dash as a feature of "elite internet style", the kind of language used by people who read The Atlantic or The New Yorker, and who engage in long-form, carefully crafted prose online.
When LLMs like ChatGPT use em dashes frequently, they are, in effect, reflecting the voice of the people most likely to write for public consumption: journalists, academics, bloggers, subreddit moderators, and Medium essayists. In other words, LLMs are trained on a corpus shaped by the people who write the most, not necessarily those who represent the full range of opinion or experience.
The Real Source of Bias: Who Writes Online?
When people worry about political bias in AI, they often look at outputs: Does it lean left? Is it pro-DEI? Is it hostile to conservative values? But this approach misses a more foundational question: who created the content that trained the model in the first place?
Here’s the catch: the vast majority of people online are consumers, not producers. They comment occasionally, like posts, maybe write a review. But LLMs aren’t trained on “likes.” They’re trained on text, and that comes disproportionately from those who write a lot.
This means the training data behind LLMs tends to reflect the views of:
Academics and public intellectuals
Journalists and editorial writers
Activists with strong online presences
Reddit power users and subreddit moderators
Wikipedia contributors (an especially elite and narrow group)
These groups are not monolithic, but they do skew in particular ways: highly educated, verbally adept, politically expressive, and often clustered in cultural hubs. Even if the AI itself is not “biased,” it is trained to mimic language created by a non-representative slice of the public.
And so we return to the em dash. The bots’ affinity for it is a small but revealing signal of this deeper dynamic.
Not Just a Quirk—A Proxy for Representation
This is more than a linguistic curiosity. If em dash overuse is a proxy for elite writer dominance, then we can think of it as a canary in the algorithmic coal mine.
It signals that:
Style choices reflect whose voices were loudest in the training set
Ideological priors are subtly encoded through tone, framing, and rhetorical emphasis
Underrepresented groups (rural voices, working-class writers, non-native speakers, etc.) may have limited influence on the AI’s worldview
Of course, this doesn’t mean the models are irrevocably biased or broken. But it does mean we need to reframe the “bias in AI” conversation. The issue isn’t only content moderation or “alignment” at the output stage—it’s corpus construction at the input stage.
Can This Be Measured?
Actually, yes, and it should be. Researchers could run stylometric analyses on chatbot output and compare it to public corpora like the Corpus of Contemporary American English (COCA) or even social media datasets.
Has LLM punctuation converged with elite writing norms?
Do models underrepresent syntactic patterns more common in working-class speech?
Does output replicate the ideological framing of dominant Wikipedia editors?
These questions point to a richer, more empirical understanding of bias, beyond the surface-level “does it say nice things about X ideology?” and toward a more structural view of who gets to write the data.
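As a rough illustration of the first question, here is a minimal sketch of what such a stylometric check might look like. It is not drawn from any published study: the file names bot_sample.txt and reference_sample.txt are placeholders for whatever chatbot output and reference corpus slice a researcher assembles, and the per-1,000-word rate is just one reasonable normalization.

```python
# Minimal sketch: compare em-dash (and related) punctuation rates between a
# chatbot sample and a reference corpus sample. File names are placeholders.
import re

PUNCT = {
    "em_dash": "\u2014",   # —
    "en_dash": "\u2013",   # –
    "semicolon": ";",
    "colon": ":",
}

def punctuation_rates(text: str) -> dict:
    """Return each punctuation mark's frequency per 1,000 words."""
    n_words = max(len(re.findall(r"\b\w+\b", text)), 1)
    return {name: 1000 * text.count(mark) / n_words for name, mark in PUNCT.items()}

def compare(bot_path: str, ref_path: str) -> None:
    """Print side-by-side rates and the bot/reference ratio for each mark."""
    with open(bot_path, encoding="utf-8") as f:
        bot_rates = punctuation_rates(f.read())
    with open(ref_path, encoding="utf-8") as f:
        ref_rates = punctuation_rates(f.read())
    print(f"{'mark':<10}{'bot':>10}{'reference':>12}{'ratio':>8}")
    for name in PUNCT:
        ratio = bot_rates[name] / ref_rates[name] if ref_rates[name] else float("inf")
        print(f"{name:<10}{bot_rates[name]:>10.2f}{ref_rates[name]:>12.2f}{ratio:>8.2f}")

if __name__ == "__main__":
    compare("bot_sample.txt", "reference_sample.txt")
```

Real stylometric work would control for genre, topic, and length, but even a crude rate comparison like this would show whether a model’s punctuation habits sit closer to The New Yorker or to the median Reddit comment.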
Political Implications
In political science, there's a long tradition of analyzing who sets the agenda in public discourse. If LLMs are now major players in shaping political information, we should be asking: whose views are they replicating?
And here’s where things get thorny. Many of the loudest online political voices represent minority factions with outsize attention. These groups produce a large volume of content, but not necessarily representative or consensus views. That volume gets amplified in the training set.
Which raises this uncomfortable question: Are we building chatbots that reflect the voice of the public, or the voice of the hyper-verbal few?
Toward Better Data—and Better AI
What can be done?
Corpus Auditing: AI labs should conduct and publish stylometric and demographic audits of their training sets.
Downweighting Dominant Voices: Contributions from heavily represented writers (e.g., Reddit power users) could be downweighted during training to reduce their influence (a toy sketch follows this list).
Broader Inclusion: Incorporating more writing from underrepresented groups, regions, and linguistic styles can diversify model outputs.
Bias Transparency: Chatbots should come with some disclosure of their linguistic and ideological leanings based on corpus origins.
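To make the downweighting idea concrete, here is a toy sketch, entirely my own illustration rather than any lab’s actual pipeline, of per-author sampling weights that grow sublinearly with output volume. The alpha parameter is an assumed knob: alpha = 1.0 leaves the data untouched, alpha = 0.0 gives every author one effective “vote.”

```python
# Toy sketch (illustrative only): assign each document a sampling weight so
# that an author's total weight grows sublinearly with how much they wrote.
from collections import Counter

def author_weights(docs: list[tuple[str, str]], alpha: float = 0.5) -> list[float]:
    """docs is a list of (author_id, text); returns one weight per document."""
    counts = Counter(author for author, _ in docs)
    # Each author's documents share a total weight of counts[author] ** alpha,
    # so a user with 10,000 posts no longer has 10,000x the influence of
    # someone who wrote a single comment.
    return [counts[author] ** alpha / counts[author] for author, _ in docs]

if __name__ == "__main__":
    sample = [("power_user", f"post {i}") for i in range(100)] + [("casual_user", "one comment")]
    weights = author_weights(sample, alpha=0.5)
    power = sum(w for (a, _), w in zip(sample, weights) if a == "power_user")
    casual = sum(w for (a, _), w in zip(sample, weights) if a == "casual_user")
    print(f"power_user total weight:  {power:.2f}")   # 10.00
    print(f"casual_user total weight: {casual:.2f}")  # 1.00
```

With alpha = 0.5, a power user with 100 posts ends up with ten effective votes instead of a hundred, while the one-comment writer keeps their single vote.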
Final Thoughts: Reading Between the Dashes
The next time ChatGPT throws an em dash into a response, take a second look. It may just be a punctuation mark, but it also carries the weight of millions of training examples written by a specific kind of person with a specific kind of voice.
Bias in AI isn’t always obvious. Sometimes, it hides in the quiet stylistic flourishes we take for granted. But if we care about democratic information systems, we need to pay attention to whose ideas are subtly being replicated and reified.
Sean Richey is a researcher focused on political communication, emerging technologies, and the intersection of AI and society.

