Welcome to the World of LLMs
4 min read · Feb 6, 2025
Large Language Models (LLMs) aren’t just fancy autocomplete engines anymore; they’ve grown into powerhouse assistants that can chat, summarize, solve math puzzles, and even craft short stories. Think of them as massive text jugglers — trained on colossal swaths of internet text — predicting the next word in a sequence based on patterns they’ve absorbed.
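At its core, that “next word” step is just a probability distribution over a vocabulary. Here’s a toy sketch in Python; the four-token vocabulary and the scores are invented purely for illustration, since a real model computes its logits with billions of parameters:

```python
import numpy as np

# Toy next-token step. A real LLM produces one score ("logit") per token in a
# vocabulary of 100k+ entries; this tiny vocabulary and these scores are
# made up for illustration only.
vocab = ["Hello", " world", " there", "!"]
logits = np.array([0.2, 2.5, 1.1, 0.3])

# Softmax turns raw scores into a probability distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Sampling from that distribution picks the next token.
next_token = np.random.choice(vocab, p=probs)
print(next_token)  # " world" most of the time
```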
1.1 The Data Hunt: Internet Text, Filtered & Refined
- Gathering the Goods
First, massive datasets are scooped up from the web (e.g., Common Crawl). This includes everything from news articles and forum threads to recipe blogs.
- Clean-Up & Quality Control
Raw HTML is stripped, languages are identified, and explicit or harmful content is weeded out. What’s left is a leaner (yet still gigantic) text collection: think terabytes of data, translating to trillions of tokens. (A toy version of this filtering pass is sketched just below.)
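To make the clean-up step concrete, here’s a minimal, hypothetical filtering pass in Python. The blocklist, the word-count threshold, and the regex-based tag stripping are all stand-ins; production pipelines rely on real HTML parsers, language identifiers, and trained quality classifiers:

```python
import re

# Hypothetical knobs: real pipelines use trained quality classifiers and
# curated filter lists, not a tiny keyword set and a word count.
BLOCKLIST = {"spam-term", "explicit-term"}
MIN_WORDS = 20

def strip_html(raw: str) -> str:
    """Crude tag removal; production systems use proper HTML parsers."""
    return re.sub(r"<[^>]+>", " ", raw)

def keep_document(raw_html: str) -> bool:
    """Return True if a crawled page looks worth keeping."""
    words = strip_html(raw_html).lower().split()
    if len(words) < MIN_WORDS:              # drop near-empty pages
        return False
    if any(w in BLOCKLIST for w in words):  # drop flagged content
        return False
    return True

print(keep_document("<p>" + "useful text " * 30 + "</p>"))  # True
```

Even this toy version shows the shape of the job: most of the raw crawl gets discarded, and only reasonably clean prose survives into the training set.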
1.2 Tokens and Tokenization
Before an LLM can read or generate text, it needs to break down words and phrases into “tokens.” Each token is basically a bite-sized snippet of text. For instance:
- “Hello world” might be tokenized as [“Hello”, “ world”], two tokens total.
- Modern models handle vocabularies of 100,000+ tokens, balancing encoding efficiency (fewer tokens per passage) against vocabulary size; see the tokenizer sketch below.
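If you want to poke at tokenization yourself, the open-source tiktoken library is an easy entry point. Its cl100k_base encoding has a vocabulary of roughly 100k tokens; the exact token IDs you see depend on that encoding choice:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # ~100k-token vocabulary
print(enc.n_vocab)  # 100277

tokens = enc.encode("Hello world")
print(tokens)                             # token IDs, e.g. [9906, 1917]
print([enc.decode([t]) for t in tokens])  # ['Hello', ' world']
```

Note the leading space baked into “ world”: tokenizers fold whitespace into tokens, which is why token counts rarely match word counts one-to-one.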