LLMs as Lexical Thijarians
A quirky journey through Neologisms, Baby Sea Turtles & Lexical Dark Matter
Introduction
There are many ways I experience ‘words’. The programmer in me sees them as pointers to a register of shared ideas in a language. The communication theorist in me sees them as a lossy channel-coding scheme to transmit ideas. The network scientist in me who grappled with SIR/SIS models in Jose Moura’s classes sees words as vectors of infection transmission (infection ~= idea). The NLP-ist in me sees them as the barycenter of a convex hull of idea-vectors in some semantically-hued embedding space.
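To make the last of those four views concrete, here is a minimal sketch of a word as a barycenter, assuming each "idea" is just a small list of numbers. The vectors, the weights and the function name barycenter are made up for illustration; real embedding spaces have hundreds of dimensions and are learned from data.

```python
from typing import List, Optional, Sequence


def barycenter(vectors: Sequence[Sequence[float]],
               weights: Optional[Sequence[float]] = None) -> List[float]:
    """Return the weighted average of the given vectors.

    With non-negative weights that are normalised to sum to 1, the result
    is a convex combination, so it always lies inside the convex hull of
    the input vectors.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    if weights is None:
        weights = [1.0] * len(vectors)  # equal weights: the plain centroid
    if len(weights) != len(vectors):
        raise ValueError("need exactly one weight per vector")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must not all be zero")
    dim = len(vectors[0])
    return [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]


# Hypothetical 2-D "idea vectors" that one word is meant to cover.
ideas = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]]
word_vector = barycenter(ideas)
print(word_vector)
```

Passing unequal weights pulls the word towards the ideas it is used for most often, which is one loose way to picture a word's meaning drifting over time.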
This month, I encountered four neologisms from different social groups in the SF Bay Area that I sometimes hang out with, and they gave me a front-row seat to how cants evolve.
1: Yilackworism: The fervent need to wear lilac-colored sweaters whilst sipping on chamomile tea
2: Chimsufugal: The act of eating dry toast with a layer of chimichurri sauce and grilled huitlacoche
3: Zintriarguant: The kind of person who rushes to CVS to buy zinc lozenges at the early onset of cold-like symptoms
4: Michufiedant: A Michigan Wolverines football fan who can confidently order extremely topical baked goods from ‘Panaderia Michoacan’ in Redwood City.
This got me thinking about the life-span and life-trajectory of a new word. (By ‘word’ I mean the kind that has an atomized emergence rather than a systemic societal emergence like YOLO, Googled, Rizz, emoji etc.)
Phases of a word’s life: Baby-turtle analogy
Phase-1: Seeding: A thought is seeded or perhaps infected in a human’s mind. Then the human realizes that the current vocabulary doesn’t have a pre-existent member that satiates all the intellectual demands and contours of the thought the human wants to convey via the word.
Phase-2: Birth as a hatchling: Then the human chooses one template, or interpolates a bunch of them, to map the thought to a sequence of contiguous syllables. Thus is born a word-hatchling that will now fight for existence.
Phase-3: Hatchling Sprint: In this phase, much akin to baby turtles sprinting to join the sea, word-hatchlings fight a bunch of predators (shortened attention spans, lack of social capital, nondescript context of the conversation) to get a slot in the shared lexicon. Again, much akin to the turtles, only about one in 10,000 eventually makes it.
Phase-4: Coronation: Formal entry via a Dictionary slot or informal entry via a cultural touch point.
Formal as in: A Merriam-Webster / Oxford Dictionary Word of the Year win.
Informal as in: WAP by Cardi B, Bootylicious by Destiny’s Child, Stan by Eminem etc.
Phase-5: Death: The more culturo-spatio-temporally trivia-fied, the more the word-hatchling risks being rendered ephemeral. When the next generation emerges, these words gather a geriatric hue and it soon becomes ‘less cool’ to use them in conversations. When I was in college, I used to use words like Webisodes and Crunk that I haven’t used in years now!
Lexical Dark Matter
With the rise and rise of technology and digitization of knowledge, humanity’s getting better at archiving these thought-fragments. This is a fascinating and a putatively good development to me.
While the Oxford English Dictionary (OED) lists only 171,476 words in current use, Google’s landmark “Culturomics” study had published a corpus of 500 billion ‘words’, populated mostly by what is termed “Lexical Dark Matter”. This lexical dark matter consists of the vast body of undictionaried and undocumented words that found their way onto the WWW but not into the ivory towers of the dictionary realm.
Wiktionary’s stats are similarly mind-boggling:
English Wiktionary: Over 10 million main-space articles/entries (as of Nov 2025).
French Wiktionary: Over 7.6 million articles.
Chinese Wiktionary: Over 3 million articles.
Malagasy Wiktionary: Over 5.8 million articles. This one is particularly fascinating, as a 20-something anonymous editor known by the handle “Jagwar” built it almost single-handedly, leading to a major fiasco.
LLMs as Lexical Thijarians
Now, with the rise and rise of scraping slug-fests to feed LLM-pre-training pipelines, I have been thinking about:
1: The time-delta between the emergence of a neologism on the WWW and it getting baked into an LLM
2: An LLM’s recall rate of the exact meaning and context in which these neologisms came into existence
3: LLMs’ incidental role as Lexical Thijarians.
In the Whoniverse, the Thijarians are an ancient species, once known as the deadliest assassins in the universe, who no longer kill. Driven by sorrow at the destruction of their entire civilization, they took up a new vocation: watching over and archiving the last few moments of those, all over the universe, who are on the cusp of dying alone.
Perhaps, this is the strongest redeeming factor of the LLM-wave that I can get on board with.
After all, the rapacious ruthlessness with which the scraping daemons haunt the intertubes does have a silver-ish lining.
Metaphysically speaking, I see the event of a pre-training data-scraping module encountering a dying word that’s gone out of vogue, and the act of tokenizing it, data-fying it and text-token-prediction-izing it, as akin to a Thijarian archival ritual.
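For a feel of what the ‘tokenizing’ step of that ritual might look like, here is a toy sketch. It assumes a greedy longest-match subword tokenizer and a tiny vocabulary (TOY_VOCAB) that I made up for the occasion. Real LLM tokenizers, such as those based on byte-pair encoding, learn their vocabularies from data and differ in the details, so treat this as an illustration rather than a description of any particular model.

```python
from typing import List, Set


def greedy_subword_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Split a word into the longest pieces found in vocab, left to right.

    If no piece of length two or more matches at the current position,
    a single character is emitted as a fallback, so every word can be
    tokenized, however obscure it is.
    """
    text = word.lower()
    pieces: List[str] = []
    start = 0
    while start < len(text):
        end = len(text)
        # Shrink the candidate from the right until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and text[start:end] not in vocab:
            end -= 1
        pieces.append(text[start:end])
        start = end
    return pieces


# Hypothetical vocabulary, invented purely for this example.
TOY_VOCAB = {"zin", "tri", "arg", "uant", "web", "isode"}

for dying_word in ["Zintriarguant", "Webisodes", "Crunk"]:
    print(dying_word, "->", greedy_subword_tokenize(dying_word, TOY_VOCAB))
```

The word itself never gets a slot of its own in the vocabulary. It is archived as a sequence of humbler pieces, and a word with no known pieces at all, like Crunk here, falls back to single characters.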


