A Primer on Symbol Emergence Systems Theory - Part 2: Core Methods and Hypotheses
- symbolemergoutreac
- April 27
- Reading time: 15 min
Updated: May 9

As we saw in Part 1, Symbol Emergence Systems Theory (SEST) addresses the symbol emergence problem, namely, understanding how humans and robots communicate with symbols and how those symbols are created and modified. But how can we actually solve this problem? In Part 2, we’ll examine the methodology behind SEST, along with its core concepts and hypotheses.
2-1 Probabilistic Generative Models
Building models to understand
SEST adopts a constructive approach: we formulate hypotheses, build models accordingly, run them in practice, and then evaluate their behavior. Models can take many forms, but in SEST the primary tools are mathematical models and their implementation in robots. Demonstrating that a model works shows only that it is a “possible solution,” not the only one. Still, discovering a single mechanism capable of solving the problem is a significant advance in our understanding.
For this reason, correctly formulating the problem is crucial. Even if we create a mathematical model or a robot that “solves” the task, if the task itself is defined incorrectly, we haven’t moved closer to our original goal of grasping how symbols emerge.
First, let’s consider the symbol emergence problem at the level of an individual agent. The key question is:
How do agents learn to use symbols meaningfully and communicate with others? In other words, how do they develop internal representations?
A simple scenario: learning an object category
Imagine you arrive in a foreign country for the first time, encounter an unfamiliar fruit, and hear people speaking an unknown language. Your task is to form a new concept of the fruit in your mind and link it with the word they use.

Notice that you aren’t provided the fruit’s name as a “correct answer.” Both the fruit’s appearance and the sounds referring to it are sensory inputs that initially carry no meaning for you. SEST aims to model how such categories can be learned starting from complete ignorance, a setting known in machine learning as unsupervised learning. While adults might rely on interpreters or contextual clues, children learning their first words must build internal representations from multimodal sensory data, entirely in a bottom‑up fashion.
Modeling the World with Probability
SEST models learning using probabilistic generative models (PGM). Let’s briefly review what that means.
First, we accept a key premise: the world as we perceive it is a random variable drawn from some probability distribution. Rolling a die and getting “1” reflects an underlying distribution in which each face has a probability of 1/6. In the same way, every observable event is assumed to be sampled from its own (unknown) probability distribution.

Why make this assumption? In reality, nothing is ever 100% certain. As Yoshitake Shinsuke’s beautiful picture book It Might Be an Apple reminds us, the red object before you that looks like an apple could be made of plastic, or could even be an orange on the side you cannot see. All we can do is raise our subjective certainty that it “might be an apple.” Knowing, then, means updating a subjective probability distribution. When we model cognition from this perspective, it’s both natural and convenient to assume a probability distribution behind the world.
note: If you’d like to explore the philosophical side of this worldview, see Thinking About Statistics: The Philosophical Foundations by Jun Otsuka.
Under this framework, we posit a latent variable z (not directly observed) behind each observed variable x. The latent z itself is sampled from some prior distribution P(z), and once z is set, it determines the conditional distribution P(x|z) from which x is generated. Together,
P(x,z) = P(x|z)P(z)
is called the generative model, capturing how x depends on z.

When we observe actual data x, we invert this process, inferring the likely values of z that generated it. This is Bayesian inference, and although exact computation is often intractable, we can approximate it via sampling or variational methods, techniques widely used beyond cognitive science.
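Written out for the simple model above, this inversion is just Bayes’ rule:
P(z|x) = P(x|z)P(z) / P(x),  where  P(x) = Σ_z P(x|z)P(z).
It is the denominator P(x), a sum (or integral) over every possible value of z, that becomes intractable for rich models, which is why sampling and variational approximations are needed.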

PGMs as models of representation learning
In SEST and certain branches of cognitive science, PGMs model how agents acquire internal representations. In the case of learning new words, the hidden variable z corresponds to an internal concept, and updating its distribution through experience mirrors Bayesian inference in a PGM.
The simple one‑step model z→x can be extended in several ways:
Multi-modal inputs: Increase the number of observed variables to learn from vision, audition, touch, etc.
Temporal dynamics: Add variables over time to capture and predict the world’s changing state.
Hierarchical structure: Stack latent layers to learn representations at multiple levels—crucial for modeling social, cultural, or linguistic structures.
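To give one concrete instance (a standard construction, not spelled out in the original), the temporal extension turns the one‑step model into a state‑space model:
P(x_1:T, z_1:T) = Π_t P(z_t | z_t−1) P(x_t | z_t),  with P(z_1 | z_0) read as the initial prior P(z_1).
Each hidden state z_t both explains the current observation x_t and predicts the next state, which is exactly what an agent needs in order to forecast a changing world.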

Returning to our fruit example:
Observations: The agent sees fruits x and hears word-sounds w.
Latent concept: A latent variable z ties x and w together in a joint model P(x,w,z).
Learning: As many (x,w) pairs are experienced, the agent refines P(x,w,z).
Emergence: Distinct values of z emerge (corresponding to, e.g., “starfruit”), so that P(x|z) predicts appearance and P(w|z) predicts the corresponding word.
Cross‑modal inference: Given a novel fruit appearance, one can compute the probability it’s called “starfruit” by combining both conditionals (see the sketch below).
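Here is a minimal Python sketch of exactly that computation. Everything in it is illustrative: the concept, feature, and word indices are made up, and the conditional tables are assumed to have already been learned from (x, w) pairs, so the code only shows how P(w|x) = Σ_z P(w|z)P(z|x) is computed.

```python
import numpy as np

# Illustrative joint model P(x, w, z): 3 latent concepts z, 4 appearance
# features x, 3 words w. In the article's setting these tables would be
# learned, unsupervised, from (x, w) pairs; here they are simply given.
P_z = np.array([0.4, 0.35, 0.25])          # prior P(z)
P_x_given_z = np.array([                    # P(x | z), one row per concept z
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.05, 0.45, 0.45],
])
P_w_given_z = np.array([                    # P(w | z), one row per concept z
    [0.90, 0.05, 0.05],                     # concept 0 is usually called word 0
    [0.05, 0.90, 0.05],                     # concept 1 -> word 1 ("starfruit", say)
    [0.05, 0.05, 0.90],                     # concept 2 -> word 2
])

def word_given_appearance(x):
    """Cross-modal inference: P(w | x) = sum_z P(w | z) P(z | x)."""
    unnorm = P_z * P_x_given_z[:, x]        # P(z) P(x | z) for each z
    P_z_given_x = unnorm / unnorm.sum()     # posterior over concepts given looks
    return P_z_given_x @ P_w_given_z        # marginalize the concept out

print(word_given_appearance(1))             # word 1 comes out most probable
```

In a full model these tables would themselves be inferred from the stream of paired observations; the sketch only isolates the inference step that links seeing a fruit to predicting its name.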

In Peircean semiotic terms (covered in Part 1), x is the object, w the sign, and z the interpretant—showing how a symbolic process can be captured by a PGM.
Category learning in cognitive systems isn’t an end in itself; it’s a means to predict future sensory inputs. The arrow z→x in our PGM reads naturally as the system’s prediction of the world, which aligns with predictive coding theories.
Remember, a PGM is an abstract, high‑level model (“computational‑level” in Marr’s framework). The latent variable z may be a high‑dimensional vector, not just a scalar, and the PGM doesn’t prescribe how P(x,z) is physically realized in the brain. A robot implementation might prove the concept, but the brain could implement it quite differently.
Using this PGM framework, researchers in Symbol Emergence Robotics have modeled various learning processes—vocabulary acquisition, place‑sense formation, and more—as unsupervised learning in robots. By adding behavioral elements and time series, it’s possible to capture cognitive and behavioral dynamics even more richly.
2-2 The mind predicts the world using generative models
In the previous section, we looked at the process of acquiring new concepts—or internal representations—and learned about probabilistic generative models (PGMs), the mathematical tools for describing that process. We gave the simple example of “learning the names of things,” but the scope of PGM application isn’t limited to this. In fact, in recent years in cognitive science, AI, and robotics, whenever we model aspects of intelligence—perception, behavior, learning—it’s the probabilistic generative model that comes into play. From now on, we’ll simply call it a generative model.
This section is a prelude to the idea of Collective Predictive Coding, which will be discussed in the next section. Here, we’ll briefly touch upon four key terms in contemporary AI and cognitive science: predictive coding, world model, free energy principle, and active inference.
The mind predicts the world using generative models
We can immediately experience that we carry “generative models” within us:
“Aoccdrnig to rscheearch at Cmabrigde uinervtisy, it deosn't mttaer waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteres are at the rghit pclae.” (*source of the example)
If you can more or less read this, you see that our brains don’t process every letter in its raw form. Instead, we subconsciously predict the next sequence of characters and correct any jumbled inputs as we read.
This isn’t limited to reading: whenever we see or hear something, we’re constantly making predictions. The cognitive subject doesn’t perceive “the world as it is,” but rather the output of its generative model combined with actual sensory input.
When incoming sensations deviate too far from our predictions, we update the model. Consider:
Example 1: At a restaurant abroad, I ate a green vegetable that looked like a bell pepper—but it was extremely hot. It wasn’t a bell pepper at all, but a green chili pepper.
Let’s explain this with a simple generative model. As before, let z→x, where:
z (latent) = the type of vegetable
x (observed) = its taste
The joint distribution is P(x,z) = P(z)P(x|z), where P(z) is the prior probability of each vegetable and P(x|z) is the likelihood of tasting x given z. When you actually taste “hot” (x=hot), you compute the posterior P(z|x) ∝ P(x|z)P(z).
Although your prior P(z=bell pepper) was high (because it looked like a bell pepper), the posterior satisfies
P(z=bell pepper | x=hot) < P(z=chili pepper | x=hot)
so you correctly infer that you’ve eaten a green chili pepper.
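To make the update concrete with illustrative numbers (these are not from the article): suppose the prior based on appearance is P(z=bell pepper) = 0.9 and P(z=chili pepper) = 0.1, while the likelihood of a hot taste is P(x=hot | bell pepper) = 0.05 and P(x=hot | chili pepper) = 0.9. The unnormalized posteriors are then 0.9 × 0.05 = 0.045 and 0.1 × 0.9 = 0.09, which normalize to roughly 0.33 and 0.67. A single surprising taste is enough to flip the belief toward “chili pepper.”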

In the example we just saw, the generative model already “knew” about green peppers and green chilies. But what happens when it doesn’t?
Example 2: A visitor to Japan mistakes a dab of wasabi for avocado paste—and is shocked by its heat. Ever after, they only use a tiny amount of wasabi.
In this case, the agent must create a new latent variable z to represent “wasabi.” This is precisely the representation learning we discussed: when an existing model cannot accurately predict an observation, the system learns a new internal representation in latent space. By adding this new representation, future predictions improve.
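One deliberately crude way to operationalize this in code (a sketch only; serious models use principled nonparametric Bayesian machinery rather than a hand‑set threshold) is to spawn a new category whenever no existing one explains the observation well enough:

```python
import numpy as np

# Toy representation learning: each latent category is summarized by a
# prototype in a 2-D feature space [hotness, crunchiness] (illustrative).
# The point is only the control flow: poor prediction -> new representation.
prototypes = {
    "bell pepper":  np.array([0.10, 0.80]),
    "chili pepper": np.array([0.90, 0.70]),
}
THRESHOLD = 0.5  # below this "fit", create a new category

def fit(obs, proto):
    """Crude stand-in for the likelihood of obs under a category."""
    return float(np.exp(-5.0 * np.sum((obs - proto) ** 2)))

def categorize(obs, name_if_new):
    best = max(prototypes, key=lambda k: fit(obs, prototypes[k]))
    if fit(obs, prototypes[best]) < THRESHOLD:
        prototypes[name_if_new] = obs.copy()   # new internal representation
        return name_if_new
    return best

# A hot green paste that is nothing like either pepper:
print(categorize(np.array([0.95, 0.05]), name_if_new="wasabi"))  # -> "wasabi"
```

Once the new representation exists, subsequent encounters with the same observation are no longer surprising, which is exactly the sense in which adding it improves future predictions.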

Cognitive systems don’t passively record sensory data; they perceive by continuously predicting and then updating their generative models through learning. We all experience this in daily life. In fact, Helmholtz’s 19th‑century notion of “unconscious inference” already framed perception as prediction.
Only in recent decades has neuroscience caught up, showing how closely brains approximate Bayesian inference—which quantities serve as predictions versus errors, and which neural mechanisms perform the updates. Beginning in the 1990s, the theory of predictive coding proposed that the brain minimizes prediction errors to resolve sensory ambiguities. More recently, this framework has expanded beyond external inputs to include interoceptive bodily signals—from emotions to heartbeats and visceral sensations—treating them as additional domains of prediction and error correction (see for example How Emotions are Made by Lisa Feldman Barrett).
The “World” of robots
In AI and robotics, too, predictive processing with generative models has gained traction. Just as humans anticipate outcomes before acting, a robot navigating an unfamiliar environment must predict what will happen next—otherwise, it couldn’t react in time.
If one had complete information about the environment, prediction would be unnecessary. In a video game, for instance, an AI character with a full map and perfect knowledge of every other agent’s actions could always choose the optimal move (as Neo momentarily does in The Matrix). Real‑world robots, however, lack that “god’s‑eye view.”
Instead, we equip them with a world model—a generative model that lets an agent with limited observations simulate and predict its surroundings. While the idea of a world model isn’t new, deep learning’s advances have made it practical. In their 2018 paper “World Models,” David Ha and Jürgen Schmidhuber demonstrated that an agent trained via reinforcement learning inside a world simulated by a learned deep generative model can transfer those skills back to the actual environment.

Because the learned latent variable space z is far more compact than the full complexity of the real environment, a robot can efficiently run reinforcement‑learning simulations within it. Analogously, in our bell‑pepper vs. chili‑pepper example, having the concepts “bell pepper” and “green chili pepper” lets you instantly predict taste upon seeing a green vegetable. By mastering the structure of z, both humans and robots can act effectively by continuously predicting—and then updating—their model of the world.
The Free Energy Principle and Active Inference
Returning to cognitive science and neuroscience, the idea of predictive coding was unified into a theoretical framework in the mid‑2000s by British neuroscientist Karl Friston under the banner of the free energy principle. Friston built on variational Bayesian inference—an approximate method for Bayesian updating—and proposed that intelligence itself could be understood as performing variational inference to minimize a single quantity called variational free energy (denoted F).
Concretely, we define a functional F that depends on our generative model, our inference model (i.e., our approximate posterior), and our chosen actions. Perception, action, and learning then all proceed in the direction that reduces F (and its expected future value).
note: You might wonder whether it’s natural to frame cognition as the “optimization” of some function. In physics—Newtonian mechanics, quantum mechanics, etc.—laws of motion can often be reformulated as variational problems (e.g., minimizing an action or energy integral). The free energy principle similarly posits an optimization goal, without claiming that biological systems always find a perfect optimum.
Unlike reinforcement learning, which explicitly maximizes expected reward, the free energy principle focuses on minimizing surprisal—or the “gap” between expected and actual sensory input. An agent can reduce F through three complementary processes:
Perception: Aligning the inference model more closely with the generative model.
Learning: Refining the generative model so it better matches the world’s true statistics.
Action: Selecting or shaping observations so that the generative model’s predictions become self‑fulfilling (or even altering the world directly).
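For reference, the textbook definition of variational free energy, in the notation of Section 2-1 with q(z) standing for the inference model’s approximate posterior, is:
F = E_q(z)[ log q(z) − log P(x,z) ] = D_KL( q(z) || P(z|x) ) − log P(x),
where D_KL is the Kullback–Leibler divergence, a non‑negative measure of mismatch between two distributions. Because D_KL ≥ 0, F is always an upper bound on the surprisal −log P(x): perception lowers F by moving q(z) toward the true posterior, learning lowers it by improving the generative model P, and action lowers its expected future value by steering which x is actually observed.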
Friston’s active inference framework thus treats perception, learning, and action all as facets of the same inferential process. While this unified view is elegant, its philosophical status, and how literally one should interpret free energy minimization in biological systems, remains an active topic of debate.

A converging view of intelligence
What we have sketched above is only a very rough introduction to predictive coding, world models, the free energy principle, and active inference. If you’d like a deeper dive, we recommend works such as The Predictive Mind by Jakob Hohwy or Active Inference by Thomas Parr et al. Here, the key takeaway is this: despite their different emphases, all these theories converge into a common view of intelligence:
An intelligent agent maintains a generative model, perceives and acts by (variational) inference to improve its predictions of the world, and continually updates that model through representation learning from experience.
Which exact form the generative model takes depends on the cognitive phenomenon you want to explain—or the function you want your robot to perform. Yet it’s striking that frameworks born in disparate fields can be unified under this abstract description. Recent papers have begun to map out how world models and active inference connect.

So far, we’ve focused on intelligence as “prediction plus continual model‑updating.” For symbol emergence, however, the crucial question becomes: how is the latent variable space z structured in the first place? In AI, we typically let an algorithm learn representations from massive unlabeled data. Humans, by contrast, do not build concepts entirely from scratch.
Here is where symbols—the categories and words that human cultures have already developed—enter the picture. Concepts like “chili pepper” or “wasabi” were invented collectively by past generations, and we learn them as constraints on our own internal models. In other words, our generative model is shaped not only by raw sensory data but also by the shared symbols of our community. And, as we’ve seen throughout this series, those very symbols emerge through a form of self‑organization.
Next, we’ll tackle Collective Predictive Coding, which models how external, shared symbols arise through joint learning and interaction.
2‑3 The Collective Predictive Coding (CPC) Hypothesis
This is the climax of this part of the series. Here, we introduce Collective Predictive Coding (CPC), a significant new theoretical advance in Symbol Emergence Systems Theory (SEST).
To recap, predictive coding posits that the brain maintains a generative model—its assumptions about how the world works—to forecast incoming sensations and guide actions. This model is continually updated through experience (for example, once you learn about wasabi, you predict its spiciness). Internally, unseen latent variables z capture the world’s hidden structure behind observable features x and y (such as appearance or taste). By mapping perceptions into this latent space—forming internal representations—the brain makes efficient, less surprising predictions. In machine learning, this process is called representation learning.

Representation learning is constrained by symbol systems
However, we do not learn representations in isolation. We are not alone in confronting the world. For instance, our knowledge of wasabi comes from others—someone tells us about it or we read it in a book. Unless we’re the original discoverer, our representation learning always involves social interaction and, often, symbols such as language. The internal representations we form are therefore strongly shaped by the constraints of the symbol systems human groups have collectively created.
In other words, learning internal representations entails both bottom‑up components based on individual experience and top‑down guidance from the shared symbol systems of our community. From the agent’s perspective, these top‑down constraints function as a prior distribution.

At the same time, the symbol system itself is dynamic. The link between the character string “wasabi” and the plant wasn’t generated anew by each person, nor was it handed down by a god; it emerged at some point through collective human activity. Understanding exactly how such symbolic priors come into being is at the heart of SEST—and it’s what Collective Predictive Coding aims to model.
Symbol Emergence in Naming Games
The approach here, too, is constructive: researchers try to create the conditions under which the phenomenon they wish to explain actually arises. There is now a two‑decade history of demonstrating symbol emergence from this constructive perspective. Belgian AI researcher Luc Steels and colleagues pioneered emergent communication by having robots interact and negotiate symbols through simple games.
A 2019 study by Tadahiro Taniguchi’s group (Hagiwara et al. 2019) builds on those works. In their setup, two robots view the same object from different angles via cameras, then play a “naming game” to agree on a symbol. For instance, Robot A sees a plastic bottle and proposes the sign “c.” Robot B—observing the same object—accepts or rejects that proposal according to its own prior distribution. In effect, they are two robots communicating one letter at a time.

When this interaction repeats across multiple objects, the robots gradually converge on the same mappings (e.g., apple = “c,” plastic bottle = “d”). Though highly simplified, this demonstrates symbol emergence through robot interaction.
The unique contribution of this work is a Bayesian reinterpretation of this mechanism. They assume a “whole generative model” (A in the figure below): each robot serves as a distinct sensor, with its internal representation z_A or z_B, and together they infer a shared latent sign w. This is analogous to how a single person might integrate visual and auditory inputs to learn a word. In this view, what the experiment does is “representation learning of the latent variable w when two robots act as one agent.”
In reality, the robots aren’t physically fused—they communicate only via spoken one‑character signals (B). Nonetheless, under a certain assumption this back‑and‑forth can be shown to be mathematically equivalent to performing Bayesian inference on the combined generative model.

Generally speaking, exact Bayesian inference over complex generative models is intractable, so we resort to approximation methods such as sampling or variational inference. The authors show that, by modeling the probability with which B accepts A’s proposal, the two robots are together effectively executing a kind of sampling method called the Metropolis–Hastings (MH) algorithm. This is why the experimental setting was named the Metropolis–Hastings Naming Game (MH Naming Game).
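To give a feel for the acceptance step (and only that step), here is a toy sketch in Python. It is not the model of Hagiwara et al. 2019, which involves perceptual categorization and parameter learning; the agents’ beliefs over candidate signs are fixed, illustrative categorical distributions, and the speaker and listener roles simply alternate.

```python
import random

# Toy MH naming game for a single object. Each agent's belief over candidate
# signs is a fixed categorical distribution here; in the actual studies these
# beliefs come from each robot's own perception and are updated over time.
SIGNS = ["c", "d", "e"]
belief_A = {"c": 0.6, "d": 0.3, "e": 0.1}   # illustrative numbers
belief_B = {"c": 0.5, "d": 0.4, "e": 0.1}

def propose(belief):
    """Speaker samples a sign from its own belief."""
    return random.choices(SIGNS, weights=[belief[s] for s in SIGNS])[0]

def mh_accept(belief_listener, proposed, current):
    """Listener accepts with probability min(1, P_listener(proposed) / P_listener(current))."""
    ratio = belief_listener[proposed] / belief_listener[current]
    return random.random() < min(1.0, ratio)

current_sign = random.choice(SIGNS)
for t in range(20):
    speaker, listener = (belief_A, belief_B) if t % 2 == 0 else (belief_B, belief_A)
    proposed = propose(speaker)
    if mh_accept(listener, proposed, current_sign):
        current_sign = proposed          # the shared sign is updated
print("agreed sign:", current_sign)      # tends to settle on a sign both rate highly
```

In this toy version the accepted sign ends up distributed roughly in proportion to the product of the two agents’ beliefs, which gives the flavor of the equivalence the paper establishes rigorously for the full model.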
Since then, research on the MH Naming Game has branched out in many directions:
Multi‑modal inputs (Hagiwara+ 2023)
Three or more agents (Inukai+ 2023)
Sentence‑level communication (Hoang+ 2024)
Emergence from continuous signals (You+ 2024, etc.)
Symbol emergence in multi‑agent reinforcement learning (Ebara+ 2023)
These extensions continue to explore how simple interactive protocols can give rise to shared symbol systems.
The Collective Predictive Coding (CPC) Hypothesis
What does the MH Naming Game experiment tell us? To be precise, the conclusion we can draw from the 2019 paper alone should be something like the following:
In the particular naming game, if agents adopt signs according to a Metropolis–Hastings (MH) rule, then—even though each agent learns its representations locally—they collectively perform Bayesian inference on the shared latent variable w using the MH algorithm.
However, its implications may reach further. What if the same conclusion holds for interaction protocols other than the MH rule, and for interactions beyond naming games? This thought motivates the Collective Predictive Coding (CPC) Hypothesis.
In a 2024 perspective article, Taniguchi described CPC as follows:
“...we propose the CPC hypothesis, which posits that symbol emergence in a multi-agent system can be regarded as decentralized Bayesian inference through language games. This can be considered social representation learning, as well. This is computationally analogous to the representation learning of multi-modal sensory information conducted by an individual agent, with social representation learning performed through CPC in the same manner as by individual PC.” (Taniguchi 2024)
Although the quote emphasizes “language,” CPC applies to symbols more broadly. Extending the MH Naming Game framework, each utterance or text becomes part of a collective Bayesian inference procedure. In effect, the human group forms a “super‑agent” that jointly learns a generative model of its shared experience.

Viewed from the symbol system’s perspective:
“Are not languages and symbols formed so that, as a group, we can predict and encode the world’s experience via our sensory‑motor systems?” (Taniguchi 2024, translated)
This is the CPC account of the origins of language and symbols. Unlike individual‑level predictive coding—where one agent refines its internal generative model—CPC describes how the symbol system itself learns to encode external reality through distributed human interaction.
Moreover, CPC can be framed as a natural extension of the free energy principle and active inference introduced earlier. In Japan, Yusuke Hayashi at the AI Alignment Network and others are actively developing this unified formulation (see Section 2.3 of Taniguchi et al. 2024 for initial results).
Is language truly the product of collective predictive coding? As individuals, we lack the capacity to forecast complex phenomena—like next week’s weather or the positions of planets a year ahead. Yet, by creating cities, laws, and currencies, we build societies whose dynamics become highly predictable to us. Such collective predictability depends on a shared representation system: language and symbols.
Crucially, no single person knows all the words or symbols. To model the world and act effectively beyond any individual’s cognitive limits, we pool our distributed representations into a larger, communal system. Language and symbols emerge precisely to integrate these diverse internal models, functioning as a generative model that enables us, as a group, to predict and shape our world.

Conclusion: CPC as a candidate solution to the symbol emergence problem
Before we conclude, let’s recap. Symbol Emergence Systems Theory (SEST) addresses the symbol emergence problem, with two central questions:
How do humans learn internal representations and come to grasp the meanings of symbols?
How do external symbol systems emerge in the first place?
Collective Predictive Coding (CPC) offers one possible answer. It suggests that the rise of symbol systems can be described as distributed Bayesian inference among interacting agents.

To be sure, this remains only a candidate solution. Further validation will be needed to assess how well CPC can explain symbol emergence—whether through constructive experiments with robots and simulations, or via empirical data analysis. Moving forward, the theory may need to extend its current generative model or confront phenomena that CPC alone cannot fully capture; alternative hypotheses might also arise as distinct solutions to the symbol‑emergence problem.
Altogether, these diverse explorations ensure that Collective Predictive Coding will be a major frontier worth pursuing. In the next part of this series, we’ll offer a brief overview of ongoing efforts to deepen and broaden the CPC perspective.
【Further reading】
Tadahiro Taniguchi (2024) “Collective predictive coding hypothesis: symbol emergence as decentralized Bayesian inference”
Part 3 is forthcoming.
Written by: Ryuichi Maruyama
Editorial supervision: Tadahiro Taniguchi
Design: Reira Endo, Masaya Shimizu
Translation support: Momoha Hirose.