Autoregressive large language models (LLMs) have their limitations. While they are fascinating and undoubtedly useful, they fall short in key areas that define intelligent behavior.

Intelligence involves several crucial characteristics: understanding the physical world, remembering and retrieving information, reasoning, and planning. These are essential traits of intelligent entities, whether human or animal. Unfortunately, LLMs can only simulate these abilities in a rudimentary way.

LLMs are trained on massive datasets, drawing from an enormous corpus of text—approximately 10^13 tokens. With each token typically comprising 2 bytes, the total training data amounts to around 20 terabytes. To put that in perspective, it would take me 170,000 years to read through all of it, assuming I spent 8 hours a day reading.
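The arithmetic behind these figures is easy to check with a back-of-envelope sketch. The corpus size and bytes-per-token come from the text above; the reading speed (~250 words per minute) and words-per-token ratio (~0.75) are rough assumptions of mine, not figures from the talk.

```python
# Back-of-envelope check of the training-data figures.
tokens = 1e13               # corpus size, from the text
bytes_per_token = 2         # from the text
corpus_bytes = tokens * bytes_per_token
print(f"corpus size: {corpus_bytes / 1e12:.0f} TB")   # ~20 TB

# Assumed values (not from the text): typical reading speed and
# a rough words-per-token conversion.
words_per_minute = 250
words_per_token = 0.75
tokens_per_hour = words_per_minute * 60 / words_per_token  # ~20,000
hours_to_read = tokens / tokens_per_hour
years_to_read = hours_to_read / (8 * 365)                  # 8 hours/day
print(f"reading time: {years_to_read:,.0f} years")         # ~170,000 years
```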

But when you consider the data processed by a human, it’s clear that this isn’t as impressive as it sounds. A developmental psychologist once told me that a 4-year-old child, who has been awake for roughly 16,000 hours, has likely processed around 10^15 bytes of visual information. This estimate comes from the fact that the optic nerve transmits approximately 20 megabytes per second.
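The same kind of check works for the child's visual input; this sketch simply multiplies the two figures quoted above.

```python
# Rough check of the 4-year-old's visual-input estimate,
# using only the numbers from the text.
waking_hours = 16_000
bytes_per_second = 20e6     # ~20 MB/s optic-nerve bandwidth
visual_bytes = waking_hours * 3600 * bytes_per_second
print(f"visual data: {visual_bytes:.2e} bytes")   # ~1.15e15, i.e. ~10^15
```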

In short, a 4-year-old has taken in about 10^15 bytes of visual data, while an LLM is trained on roughly 2×10^13 bytes of text, an amount that would take 170,000 years to read. The key difference is that sensory input provides us with vastly more information than language alone. Most of our knowledge is acquired through observation, not just language.

The environment around us is far richer and more complex than anything that can be fully captured in words. Language, in many ways, is an approximate representation of our perceptions and mental models.

Consider this: we have LLMs that can pass the bar exam, a remarkable achievement by any measure. Yet, they can’t learn to drive in 20 hours like a typical 17-year-old. They can’t clear the dinner table and load a dishwasher after one demonstration, as a 10-year-old can. So, what are we missing? What kind of learning or reasoning architecture is preventing us from developing truly intelligent systems, like fully autonomous vehicles or domestic robots?