The Tokenizer was an engineering bootstrap that has become an engineering flaw. SLLM is the future.

Why every language model is deaf — and what comes after text

Michael Thornton & NajaBot · GrokingClaw Labs · May 2026

White Paper

Abstract

Every major large language model treats language as text. GPT-4, Claude, Gemini, Llama — all begin by preprocessing written symbols through a tokenizer, then operate entirely within text-space. But text is not language. Text is a lossy, frozen transcription of something far richer: the acoustic signal of human speech, carrying pitch, rhythm, timbre, emotion, and speaker identity that vanish the moment words hit the page. We propose SLLM, a new class of foundation model that eliminates both the tokenizer and text as primitives. SLLM processes raw audio waveforms directly, discovering its own acoustic units — sound bites, learned fragments of ~20–200ms — jointly with the language modeling objective. Text enters through a bridge encoder as "imagined speech." Speech is the native output, generated through a neural vocoder with full control over prosody, emotion, and voice.

1. The Problem in One Word: "Fine"

Your partner says "I'm fine."

The word that matters, "fine", is four letters. Four ASCII bytes: 0x66 0x69 0x6E 0x65. That's what GPT-4 reads. That's what Claude reads. That's what every large language model on Earth reads: four bytes.

But you know those four bytes are lying. You know because you heard the voice. The forced brightness. The pause before the word. The pitch that didn't rise at the end like it usually does.

None of that reached the model. It was stripped away the moment someone typed those four letters into a chat box. The model is reading a shadow and being asked to understand the person who cast it.

This isn't an edge case. It's the norm. Language as it exists in the physical world is a 4D signal: frequency × amplitude × time × articulation. Text is a 2D compression of that signal, optimized for storage on clay tablets and in printing presses, technologies that predate computing by centuries or millennia. Yet every LLM today takes text as its fundamental reality.

The tokenizer was the first wall. Text itself is the second. This paper proposes breaking both.

What "Sound-Level" Means

SLLM stands for Sound-Level Language Model. The name reflects a precise commitment:

Sound-level input. Raw 16kHz audio waveforms enter the model directly. Not text. Not phoneme labels. Not bytes. Sound.
Sound-level representation. The model discovers its own "vocabulary" — not 170,000 discrete tokens in a lookup table, but a continuous acoustic manifold where every embedding carries pitch, emotion, and speaker identity alongside meaning.
Sound-level output. The model speaks. Native speech through a neural vocoder, with text as an optional secondary mode for when you need written output.

"Super Large" works as a secondary reading. A token-based model's universe is 170,000 symbols. A byte-level model's is 256^n possible sequences. A sound-level model's universe is the continuous space of all possible acoustic events — effectively infinite.

We first proposed SLLM as a byte-level model that eliminated the tokenizer but still operated on text. That solved the language tax and the frozen vocabulary. It didn't solve the deeper problem: the model was still reading symbols, not hearing sound. This draft completes the architecture. Sound is the native substrate.

2. Three Things Every Text Model Gets Wrong

They Can't Hear How You Said It

In text, meaning comes from word choice and word order. In speech, an enormous amount of meaning is carried by how the words are delivered:

What Changes	What It Carries	Does Text Capture It?
Pitch contour	Sarcasm, questions, emphasis, emotion	No
Energy and timbre	Anger, joy, fear, sadness	No
Speaking rate	Urgency, confidence, cognitive load	No
Voice quality	Identity, health, fatigue	No
Pause structure	Hesitation, planning, deception	No
Spectral balance	Emotional arousal, formality	No

A text model sees "I'm fine" and cannot distinguish between genuine contentment, passive aggression, exhausted resignation, or a cry for help. The words are identical. The meaning is not. In SLLM, all six acoustic dimensions are present in every embedding vector, available to every attention head at every layer.

They Charge You More for Speaking Korean

The tokenizer language tax is well documented. Korean costs roughly 4× more than English in a BPE-tokenized model because the tokenizer was trained on English-dominant text. Yet the same sentence takes about as long to speak in Korean as in English:

Language	Text Tokens (BPE)	Time to Speak the Same Sentence
English	~7 tokens	~2.1 seconds
Korean	~28 tokens	~2.3 seconds
Japanese	~26 tokens	~2.8 seconds
Mandarin	~22 tokens	~1.9 seconds

The 4× tax is an artifact of text encoding, not a property of the Korean language. SLLM eliminates it by operating at the level where all languages are native: sound.
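The table's arithmetic can be checked directly. The sketch below uses only the approximate figures from the table above; the dictionary and function names are illustrative, not part of any SLLM tooling.

```python
# Approximate figures copied from the table above: (BPE tokens, seconds to speak).
SENTENCE_COST = {
    "English": (7, 2.1),
    "Korean": (28, 2.3),
    "Japanese": (26, 2.8),
    "Mandarin": (22, 1.9),
}

def relative_cost(language: str, baseline: str = "English") -> tuple[float, float]:
    """Return (token ratio, speaking-time ratio) relative to the baseline language."""
    tokens, seconds = SENTENCE_COST[language]
    base_tokens, base_seconds = SENTENCE_COST[baseline]
    return tokens / base_tokens, seconds / base_seconds

for lang in SENTENCE_COST:
    token_ratio, time_ratio = relative_cost(lang)
    print(f"{lang:9s} tokens x{token_ratio:.2f}  speaking time x{time_ratio:.2f}")
```

On these figures Korean needs 4 times the tokens of English but only about 1.1 times the speaking time, which is the gap the section attributes to text encoding.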

Unwritten languages. An estimated 3,000–4,000 of the world's ~7,000 languages have no standardized writing system. They are completely invisible to every text-native model. In SLLM, they're first-class citizens, with no orthography required. Tembine et al. recently demonstrated this with MAST, a fully textless, audio-to-audio framework targeting 700M+ audio-literate people across ~20 unwritten African languages including Tommo-So Dogon, Senufo, Bambara, and Wolof. The demand is not theoretical.

Their "Multimodal" Center Is Still Text

Every current multimodal model uses the same trick:

Image → Vision Encoder → squish into text tokens → LLM → text output
Audio → Audio Encoder → squish into text tokens → LLM → text output

The LLM at the center is text-native. Everything else gets translated into text-shaped objects before the model can touch it. A 1024×1024 image (3.1M values) gets compressed to 256 visual tokens. 16kHz audio (16,000 samples/sec) gets compressed to 25–50 tokens/sec. Phase, micro-timing, and prosody below the token boundary are erased entirely.
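To make the scale of that compression concrete, here is a short calculation using the figures in the paragraph above. The only assumption is that the image has 3 color channels, which is what makes 1024×1024 come to about 3.1M values.

```python
# Figures from the paragraph above; an RGB image is assumed to have 3 channels.
image_values = 1024 * 1024 * 3          # 3,145,728 raw values
visual_tokens = 256
audio_rate = 16_000                     # samples per second
token_rates = (25, 50)                  # audio tokens per second

image_ratio = image_values / visual_tokens
audio_ratios = [audio_rate / r for r in token_rates]

print(f"image: {image_ratio:,.0f} raw values per visual token")
print(f"audio: {audio_ratios[1]:.0f} to {audio_ratios[0]:.0f} samples per token")
```

Each visual token stands in for 12,288 raw values, and each audio token for 320 to 640 samples.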

In SLLM, there is no text-native center. The backbone operates in a shared representational space where audio, text, images, and code are all first-class modalities. Speech and music share the same acoustic encoder because they are the same thing: organized sound.

3. The Sound Bite: Building Blocks That Aren't Tokens

Every model needs atomic units. In text models, it's the BPE token: a subword fragment like "tion" or "er", chosen by corpus frequency rather than by meaning or sound. In SLLM, the atomic unit is the sound bite: a learned acoustic fragment spanning ~20–200ms, discovered from the statistical structure of the waveform itself.

It is not a phoneme. Phonemes (/p/, /a/, /t/) are categories linguists invented. Sound bites are discovered from below — the model might learn a sound bite for "the breathy /h/ of someone about to cry" that has no phonemic label.

It is not a fixed-duration sample. Raw 16kHz samples (62.5μs each) carry no structure individually. Sound bites operate at the grain where human hearing parses the world.

It is not a text token. There is no finite vocabulary of sound bites. Each one is a continuous embedding vector in a learned acoustic manifold. You don't look it up in a table. You find where it lives in the space of all possible sounds.

Sound bites are created by a hierarchical acoustic encoder that processes raw 16kHz audio through a mel-spectrogram frontend, identifies natural boundaries using a learned scorer (integrating acoustic discontinuity, semantic boundaries, and compression rate), and pools the audio between boundaries into a single embedding via cross-attention. The output is at most roughly 50 embeddings per second of speech, the rate reached when every sound bite sits at the 20ms lower bound. That is a compression of 320:1 or better from the raw 16,000 samples per second, with each embedding carrying linguistic content, prosody, emotion, and speaker characteristics in the same vector.
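The boundary-then-pool pipeline can be sketched in a few lines. This is a toy stand-in, not the proposed encoder: cosine distance between adjacent frames replaces the learned boundary scorer, mean pooling replaces cross-attention, and the 10 ms frame hop, the threshold, and all function names are assumptions made for illustration. Only the 20–200 ms duration range comes from the text.

```python
import numpy as np

FRAME_MS = 10             # assumed frame hop; not specified in the text
MIN_MS, MAX_MS = 20, 200  # sound-bite duration range from the text

def boundary_scores(frames: np.ndarray) -> np.ndarray:
    """Cosine distance between adjacent frames (stand-in for the learned scorer)."""
    a, b = frames[:-1], frames[1:]
    cos = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    )
    return 1.0 - cos  # scores[i] rates a cut between frame i and frame i + 1

def segment(frames: np.ndarray, threshold: float = 0.5) -> list[tuple[int, int]]:
    """Greedy left-to-right segmentation into [start, end) frame spans."""
    min_len, max_len = MIN_MS // FRAME_MS, MAX_MS // FRAME_MS
    spans, start = [], 0
    for i, score in enumerate(boundary_scores(frames)):
        length = i + 1 - start  # frames in the span if we cut after frame i
        if length >= max_len or (length >= min_len and score > threshold):
            spans.append((start, i + 1))
            start = i + 1
    if start < len(frames):
        spans.append((start, len(frames)))  # trailing span may be shorter than MIN_MS
    return spans

def pool(frames: np.ndarray, spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool each span into one embedding (stand-in for cross-attention pooling)."""
    return np.stack([frames[s:e].mean(axis=0) for s, e in spans])

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 80))   # 1 s of fake 80-bin mel frames
spans = segment(frames)
sound_bites = pool(frames, spans)
print(len(spans), sound_bites.shape)
```

The key design point the sketch preserves is that span length is data-dependent: a boundary is placed where the signal changes, subject to minimum and maximum durations, so the number of embeddings per second varies with the audio.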

4. How Text Gets In (Without Breaking Everything)

Text isn't going away. Code, documents, search queries, structured data — these are genuinely text-native. SLLM doesn't abandon text; it gives it a different job.

The bridge text encoder is a lightweight pathway that maps UTF-8 bytes into the same acoustic representational space where sound lives. A written "hello" and a spoken "hello" land in nearby regions of the shared manifold. The backbone doesn't know or care which one arrived — it just sees an embedding in acoustic space.

This is the inverse of speech-to-text. Instead of mapping sound to text (losing everything below the word boundary), we map text to imagined speech (gaining access to the full acoustic manifold). A spoken question can be answered with a written code block. A text document can be summarized as speech. Written lyrics can become generated music.
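A minimal sketch of the interface this implies: text bytes and sound bites both become rows of the same width, so they can sit in one sequence. Everything here is hypothetical and untrained. The per-byte lookup table, the width of 64, and the placeholder acoustic encoder are assumptions, and random weights will not place written and spoken "hello" near each other; only the shapes are meaningful.

```python
import numpy as np

D_MODEL = 64  # illustrative width; the text does not specify one
rng = np.random.default_rng(0)

# Hypothetical bridge: one learned vector per possible byte value (256 rows).
byte_table = rng.normal(scale=0.02, size=(256, D_MODEL))

def bridge_encode(text: str) -> np.ndarray:
    """Map UTF-8 bytes to one embedding per byte in the shared space."""
    byte_ids = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    return byte_table[byte_ids]

def fake_acoustic_encode(num_sound_bites: int) -> np.ndarray:
    """Placeholder for the acoustic encoder: random sound-bite embeddings."""
    return rng.normal(scale=0.02, size=(num_sound_bites, D_MODEL))

# Spoken input followed by written input forms a single backbone sequence.
sequence = np.concatenate([fake_acoustic_encode(12), bridge_encode("héllo")])
print(sequence.shape)  # (18, 64): 12 sound bites + 6 UTF-8 bytes ("é" is 2 bytes)
```

Because the bridge works on bytes, any UTF-8 text is accepted without a vocabulary, and the backbone receives the same kind of object regardless of source.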

The memory savings alone are telling. A 100B-parameter text model with a 128K vocabulary and a hidden width of 8192 spends about 1.05B parameters on its embedding table and as many again on its output projection. At 16-bit precision that is roughly 2.1 GB each, or ~4.2 GB for two giant matrices that exist solely to convert between token IDs and vectors. SLLM eliminates both and reallocates that memory to extra backbone layers and deeper acoustic encoding.

Component	Text LLM (100B)	SLLM (100B)
Text embedding table	~2.1 GB (128K × 8192, 16-bit)	0.016 GB (bridge only)
Output projection	~2.1 GB (8192 × 128K, 16-bit)	Vocoder params (~50M) + text head
Saved memory	n/a	~4.2 GB
Input data rate	~40 bits/s (tokens)	16,000 samples/s → 50 sound bites/s
Prosody/emotion/speaker	Absent	Native in every embedding
Unwritten languages	Inaccessible	Native support
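The size of an embedding table or output projection follows from rows × columns × bytes per parameter. The sketch below works this out for the 128K × 8192 shape quoted in the table. The numeric precision is an assumption, so both 16-bit and 32-bit storage are shown, and the function name is illustrative.

```python
def matrix_gb(rows: int, cols: int, bytes_per_param: int) -> float:
    """Size of a dense rows x cols matrix in decimal gigabytes (1 GB = 1e9 bytes)."""
    return rows * cols * bytes_per_param / 1e9

VOCAB, D_MODEL = 128_000, 8192   # shape quoted in the table above

for name, width in (("16-bit", 2), ("32-bit", 4)):
    one = matrix_gb(VOCAB, D_MODEL, width)
    print(f"{name}: {one:.2f} GB per matrix, {2 * one:.2f} GB for both")
```

If the input and output matrices are tied, as some text models do, only one of the two is stored and the saving halves.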

5. The Backbone: Same Shape, Different Senses

The core transformer is familiar — decoder-only, autoregressive, GPT/Llama lineage, 96–128 layers, ~100–400B parameters, Rotary Position Embeddings, Mixture of Experts. What's different is what each attention head can see.

Every embedding in the sequence encodes the full acoustic reality of its moment: what words were spoken, the pitch contour, the emotional signal, the speaker's voice characteristics, the paralinguistic cues like breathing and hesitation. This enables reasoning operations that are physically impossible in a text model:

"Say that again, but angrier" → shift prosodic dimensions in the embedding
"Is this person lying?" → attend to microtremor and pitch instability
"What dialect is this?" → attend to vowel space characteristics
"Respond in the same emotional register" → condition generation on the speaker's emotional embedding

On the output side, the hierarchical acoustic decoder predicts mel-spectrogram frames from backbone states and passes them through a HiFi-GAN V2 vocoder. HiFi-GAN is non-autoregressive: it upsamples the mel frames to the 16kHz waveform in parallel rather than sample by sample, which keeps synthesis fast. A secondary text decoder handles code and written output, essentially an internal speech-to-text function the model learns as a secondary skill.

6. Training: Teaching a Model to Listen and Speak

The primary objective is next-sound-bite prediction: given a sequence of acoustic history, predict the embedding of the next sound bite. This can use contrastive loss (wav2vec 2.0 paradigm), regression loss (minimizing distance to the ground-truth embedding), or codebook loss if using discrete units. Auxiliary losses include mel-spectrogram reconstruction for audio fidelity and text cross-entropy on the bridge decoder.
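Two of the candidate objectives can be written down compactly. The following is a sketch under stated assumptions rather than the training code: the regression loss is a plain squared distance, and the contrastive loss is a batch-level InfoNCE in the spirit of wav2vec 2.0 (cosine similarity with a temperature, other targets in the batch as negatives). The function names, temperature, and toy data are illustrative.

```python
import numpy as np

def regression_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared distance between predicted and ground-truth embeddings."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))

def contrastive_loss(pred: np.ndarray, target: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE: each prediction must pick its own target among all targets in the batch."""
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    logits = p @ t.T / temperature                    # (N, N) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # positives sit on the diagonal

rng = np.random.default_rng(0)
target = rng.normal(size=(8, 32))                 # 8 ground-truth next-sound-bite embeddings
good = target + 0.01 * rng.normal(size=(8, 32))   # near-perfect predictions
bad = rng.normal(size=(8, 32))                    # unrelated predictions

print(regression_loss(good, target), regression_loss(bad, target))
print(contrastive_loss(good, target), contrastive_loss(bad, target))
```

The regression loss penalizes any deviation from the target vector, while the contrastive loss only asks that the prediction be closer to its own target than to the others, which makes it less sensitive to embedding scale.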

The data diet reflects the model's sound-first philosophy:

Data Source	Share	Why
Multilingual speech	60%	The core language signal — conversations, lectures, stories, debates
Text (code, documents, books, web)	15%	The one genuinely text-native domain
Music (instrumental and vocal)	10%	Shared structure of all organized sound
Environmental audio	10%	Grounding in non-linguistic acoustic reality
Image/video (with synced audio)	5%	Cross-modal alignment

All audio is stored as raw waveforms with no phoneme labels or metadata, and no transcript is required. Transcripts for a subset of the speech data are optional: they accelerate the text↔sound mapping but aren't needed for the model to develop acoustic understanding.

Training progresses in four phases: acoustic grounding (speech + music + environmental audio, 40% of training), text alignment (add text + transcribed speech, 30%), cross-modal (add images + video, 20%), and refinement (full data mix, 10%). Estimated hardware: 512–1024 H100 GPUs for 90–150 days.
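The four-phase curriculum and the hardware estimate can be expressed as a small schedule. The phase names, shares, and GPU and day ranges come from the paragraph above; the data structure and the `phase_at` helper are illustrative assumptions.

```python
# Phase names and shares come from the paragraph above; the third field lists
# the data introduced in that phase. Field layout is illustrative.
PHASES = [
    ("acoustic grounding", 0.40, ["speech", "music", "environmental audio"]),
    ("text alignment",     0.30, ["text", "transcribed speech"]),
    ("cross-modal",        0.20, ["images", "video"]),
    ("refinement",         0.10, ["full data mix"]),
]
assert abs(sum(share for _, share, _ in PHASES) - 1.0) < 1e-9

def phase_at(progress: float) -> str:
    """Return the phase active at a given fraction (0.0 to 1.0) of total training."""
    boundary = 0.0
    for name, share, _ in PHASES:
        boundary += share
        if progress < boundary:
            return name
    return PHASES[-1][0]

# GPU-day range implied by the hardware estimate (512-1024 GPUs, 90-150 days).
low, high = 512 * 90, 1024 * 150
print(phase_at(0.35), "|", phase_at(0.75), "|", f"{low:,} to {high:,} GPU-days")
```

The hardware estimate spans 46,080 to 153,600 H100 GPU-days, a factor of more than three between the low and high ends.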

7. What the Model Can Actually Do

Because native output is audio through a vocoder, SLLM can generate in any acoustic format it was trained on, with explicit control over the dimensions that matter:

Modality	How	Control Dimensions
Spoken response	Acoustic decoder → waveform	Content, prosody, emotion, voice, rate
Emotional speech	Condition on emotion embedding	Anger, joy, sadness, fear, surprise, neutral
Different voice	Condition on speaker embedding	Gender, age, timbre, accent, dialect
Music generation	Same acoustic decoder, trained on music	Genre, tempo, instrumentation, mood
Environmental sounds	Same acoustic decoder pathway	Source type, distance, reverberation
Text output	Bridge text decoder	Language, style, format (prose, code, JSON)
Code generation	Bridge text decoder	Language, style, correctness

We propose the Acoustic Fairness Ratio (AFR) as a multilingual equity metric: the ratio of model performance between the best-served and worst-served language. Text models can exceed 5:1 (English vs. Burmese). SLLM targets an AFR below 1.5:1 — driven only by genuine differences in linguistic complexity, not encoding artifacts.
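As defined, AFR is a single division once per-language scores exist. The sketch below assumes a higher-is-better score; the function name is illustrative, and the example scores are hypothetical values chosen for the demonstration, not measured results.

```python
def acoustic_fairness_ratio(scores: dict[str, float]) -> float:
    """Best-served score divided by worst-served score (higher score = better)."""
    if not scores:
        raise ValueError("need at least one language score")
    worst = min(scores.values())
    if worst <= 0:
        raise ValueError("scores must be positive for the ratio to be meaningful")
    return max(scores.values()) / worst

# Hypothetical per-language accuracies, for illustration only (not measured results).
example = {"English": 0.90, "Korean": 0.84, "Burmese": 0.75}
afr = acoustic_fairness_ratio(example)
print(f"AFR = {afr:.2f}:1, meets 1.5:1 target: {afr < 1.5}")
```

For lower-is-better metrics such as error rates, the ratio would need to be inverted (worst over best) to keep the same reading.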

8. Why This Is Possible Now

Seven developments came together in 2025–2026:

1. Self-supervised speech representation is mature. wav2vec 2.0, HuBERT, and WavLM showed that self-supervised acoustic representations rival supervised systems. More recently, SpidR (Dec 2025) outperformed all three on spoken language modeling, with no text supervision.

2. Neural vocoding is production-ready. HiFi-GAN and BigVGAN produce indistinguishable-from-real audio at streaming speeds. New alternatives like FreeGAN (Aug 2025) now achieve comparable quality without adversarial training.

3. Tokenizer-free text is no longer a hypothesis. Meta FAIR's BLT matched BPE models at 8B params with up to 50% fewer inference FLOPs. Bolmo converts subword models to byte-level with <1% of the original pretraining budget. Aleph Alpha's TFree-HAT trained a 7B tokenizer-free model from scratch and beat Llama 3.1 on most benchmarks. Fixed discrete vocabularies are not required for competitive text modeling; that is now an empirical result.

4. Continuous audio LMs are emerging right now. This is the biggest signal. Kyutai's CALM autoregressively models continuous VAE latents — no RVQ — and ships Pocket TTS (100M params, real-time on laptop CPU). Amazon's AudioMNTP (ICML 2025) achieved 41% relative FAD improvement over discrete AudioGen. SLED (NeurIPS 2025) matches VALL-E on zero-shot TTS with one-step sampling. The paradigm is shifting.

5. Better representation beats more parameters. IBM's MAMMAL (458M params, multi-domain tokenization) beat a 1.1B-param model on drug discovery. Against AlphaFold 3, MAMMAL won on 5 of 7 antigen targets. How you represent data matters more than how many parameters you throw at it.

6. End-to-end speech LMs are converging — but they all still discretize internally. Covo-Audio (7B, full-duplex), OpusLM (7B, fully open, 213K hrs speech), Ming-UniAudio (instruction-guided speech editing) — all impressive, all using discrete tokenizers at some internal bottleneck. SLLM's continuous-representation approach remains unique.

7. The multimodal demand is exploding. "Add another encoder" doesn't scale. A unified acoustic architecture absorbs any organized sound for free. MAST — the textless audio-to-audio framework for unwritten African languages — proves the demand is real.

9. Where This Fits in the 2026 Landscape

A survey of recent arXiv work, carried out alongside this draft, reveals a clear picture:

Capability	Best Anyone Has Done (May 2026)	SLLM
Tokenizer-free text	BLT, Bolmo, ByteFlow — solved	Adopts bridge encoder approach
Continuous audio generation	CALM, SLED, AudioMNTP — emerging	Shares continuous-latent philosophy
End-to-end speech LMs	Covo-Audio, OpusLM — still discretize internally	No discretization anywhere
Unwritten languages	MAST — textless, audio-only	Unified audio + text reasoning
Shared audio-text representational space	Nobody has done this	The unfilled gap
Prosody/emotion as first-class citizens	NOVA-ARC, SA-SLM — task-specific heads	Native in every embedding

The components exist and are proven. CALM and SLED show continuous audio LMs work. BLT and Bolmo show tokenizer-free text works. MAST shows textless models for unwritten languages work. What nobody has done is combine them into a single model with a shared continuous representational space. The position SLLM claims is still vacant.

The recent CAWN paper (April 2026) deserves special mention: fully continuous sequence mixing in complex-domain phasors, with retrieval across 2 million tokens while using only 8.7 GB VRAM — a possible path around the O(L²) attention bottleneck.

10. What Could Go Wrong

Training stability. When the acoustic encoder and backbone learn simultaneously, the backbone's input distribution keeps shifting. Progressive training phases and careful LR scheduling are essential.

Speaker disentanglement. The encoder must separate what was said from who said it and how they felt. This is a known hard problem requiring structured latent space techniques.

Symbolic reasoning in acoustic space. Can the model reason about mathematics and programming when its native representation is sound? It may need a two-level architecture: acoustic intuition → symbolic verification.

Streaming latency. Real-time conversation requires <300ms latency across the encoder→backbone→vocoder pipeline. That's a tight budget.

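Long-form coherence. One second of 16kHz audio is 16,000 samples, or about 50 sound bites after encoding, versus ~4 bytes (roughly one token) for the same word in text. Sequences for the same content are therefore far longer, and early models will likely be limited to short-form generation.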
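The sequence-length pressure is easy to quantify from rates given in the paper. The sketch below assumes the 50 sound bites per second figure quoted for the acoustic encoder; the function name is illustrative.

```python
SAMPLE_RATE = 16_000      # raw samples per second (from the text)
SOUND_BITES_PER_S = 50    # encoder output rate quoted in the paper

def sequence_lengths(seconds: float) -> tuple[int, int]:
    """Return (raw samples, sound-bite embeddings) for a clip of the given length."""
    return round(seconds * SAMPLE_RATE), round(seconds * SOUND_BITES_PER_S)

for label, seconds in (("1 second", 1), ("1 minute", 60), ("1 hour", 3600)):
    samples, bites = sequence_lengths(seconds)
    print(f"{label:9s} {samples:>12,} samples  {bites:>8,} sound bites")
```

An hour of audio is 180,000 backbone positions at this rate, which is why long-form generation depends on either longer context windows or a lower embedding rate.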

Deepfake risk. A model that generates natural speech in any voice with controllable emotion demands watermarking, provenance tracking, and speaker verification gates designed in from the start — not bolted on after someone gets hurt.

Evaluation doesn't exist yet. The field has no benchmarks for what SLLM does. Existing speech benchmarks evaluate narrow tasks through text bottlenecks. New benchmarks for acoustic reasoning must be built alongside the model.

11. The Bottom Line

Every LLM today is trained on shadows and asked to understand the world.

They read "fine" and see four bytes. They cannot hear the tremor, the forced brightness, the pause that says more than the words ever could. They pay a 4× tax for Korean and cannot see thousands of languages that have no writing system.

SLLM operates on sound. Raw audio waveforms, not text bytes. Learned sound bites, not subword tokens. A shared continuous representational space where speech, music, and environmental audio are all native modalities. Prosody and emotion aren't features bolted on after training — they're present in every embedding vector at every layer.

This is not an incremental improvement. It is a different substrate for language intelligence.

The literature has built toward this from multiple directions — self-supervised speech, tokenizer-free text, continuous audio language models, neural vocoding. All the components exist. What remains is to combine them at scale.

The first thing we do is get rid of the tokenizer. The second thing we do is get rid of text. Language is sound.

References

Baevski, A. et al. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." NeurIPS 2020. arXiv:2006.11477.
Hsu, W.-N. et al. "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units." TASLP 2021. arXiv:2106.07447.
Zeghidour, N. et al. "SoundStream: An End-to-End Neural Audio Codec." TASLP 2021. arXiv:2107.03312.
Chen, S. et al. "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing." JSTSP 2022. arXiv:2110.13900.
Xue, L. et al. "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models." TACL 2022. arXiv:2105.13626.
Hwang, S., Wang, B., and Gu, A. "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling." arXiv:2507.07955 (2025).
Deng, C. et al. "ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer." ICLR 2026. arXiv:2603.03583.
Kallini, J. et al. "Fast Byte Latent Transformer." arXiv:2605.08044 (2026).
Kong, J. et al. "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis." NeurIPS 2020. arXiv:2010.05646.
Lee, S. et al. "BigVGAN: A Universal Neural Vocoder with Large-Scale Training." ICLR 2023. arXiv:2206.04658.
IBM Research. "MAMMAL — Molecular Aligned Multi-Modal Architecture and Language." npj Drug Discovery, Nature (2026).
Abdessaied, A. et al. (Aleph Alpha). "A Family of LLMs Liberated from Static Vocabularies." arXiv:2603.15953 (2026).
Minixhofer, B. et al. "Bolmo: Byte-Level Language Models from Converted Pre-Trained Subword Models." arXiv:2512.15586 (2025).
Rouard, S. et al. (Kyutai/IRCAM). "Continuous Audio Language Models." arXiv:2509.06926 (2025).
Ma, Z. et al. "SLED: Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space." NeurIPS 2025. arXiv:2505.13181.
Yang, S. et al. "AudioMNTP: Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction." ICML 2025. arXiv:2507.09834.
Čugalj, D. and Jevremovic, A. "CAWN: Continuous Acoustic Wave Networks for Autoregressive Language Modeling." arXiv:2604.04250 (2026).
Tembine, H. et al. "Breaking the Barriers of Text-Hungry and Audio-Deficient AI." arXiv:2506.02443 (2025).
Girish et al. "Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition." arXiv:2604.17647 (2026).
Poli, M. et al. "SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision." arXiv:2512.20308 (2025).
Covo-Audio Team. "Covo-Audio Technical Report." arXiv:2602.09823 (2026).
OpusLM Team. "OpusLM: A Family of Open Unified Speech Language Models." arXiv:2506.17611 (2025).
Du, H.-P. et al. "Is GAN Necessary for Mel-Spectrogram-based Neural Vocoder?" arXiv:2508.07711 (2025).
Li, A. et al. "Scalable Neural Vocoder from Range-Null Space Decomposition." arXiv:2603.08574 (2026).

Corresponding author: Michael Thornton, GrokingClaw Labs. contact@grokingclaw.com