For folks who aren’t sure how to interpret this, what we’re looking at here is early work establishing an upper bound on the complexity of a problem that a model can handle based on its size. Research like this is absolutely essential for determining whether these absurdly large models are actually going to achieve the results people have already ascribed to them on any sort of consistent basis. Previous work on monosemanticity and superposition are relevant here, particularly with regards to unpacking where and when these errors will occur.
I’ve been thinking about this a lot with regards to how poorly defined the output space they’re trying to achieve is. Currently we’re trying to encode one or more human languages, logical/spatial reasoning (particularly for multimodal models), a variety of writing styles, and some set of arbitrary facts (to say nothing of the nuance associated with these facts). Just by making an informal order of magnitude argument I think we can quickly determine that a lot of the supposed capabilities of these massive models have strict theoretical limitations on their correctness.
This should, however, give one hope for more specialized models. Nearly every one of the above mentioned “skills” is small enough to fit into our largest models with absolute correctness. Where things get tough is when you fail to clearly define your output space and focus training so as to maximize the encoding efficiency for a given number of parameters.
For folks who aren’t sure how to interpret this, what we’re looking at here is early work establishing an upper bound on the complexity of a problem that a model can handle based on its size. Research like this is absolutely essential for determining whether these absurdly large models are actually going to achieve the results people have already ascribed to them on any sort of consistent basis. Previous work on monosemanticity and superposition are relevant here, particularly with regards to unpacking where and when these errors will occur.
I’m not sure this work accomplishes that. Sure, it builds up on previous work that showed that a transformer can be simulated by a TC0 family. However, the limits of this fact are not clear. The paper even admits as such
Our result on the limitations of T-LLMs as general learners comes from Proposition 1 and Theorem 2. On the one hand, T-LLMs are within the TC0 complexity family; on the other hand, general learners require at least as hard as P/ poly-complete. In the field of circuit theory, it is known that TC0 is a subset of P/ poly and commonly believed that TC0 is a strict subset of P/ poly, though the strictness is still an open problem to be proved.
I believe this is one of the weakest points of the paper, as it bases all of its reasoning on an unproven theorem. And you can implement many things with a TC0 algorithm, addition, multiplication, basic logic, heck you can even make transformers.
There still is something that bothers me. Why did it define general learning as being at least a universal circuit for the set of all circuits within a polynomial size ? Why this restriction ? I tried googling general learner and universal circuit and only came up with this paper.
While searching, I found that this paper was rejected, you can find the reviews here : https://openreview.net/forum?id=e5lR6tySR7
If you are searching for a paper on the limits of T-LLMs the paper What Algorithms can Transformers Learn? A Study in Length Generalization may prove more informative. https://arxiv.org/pdf/2310.16028.pdf It explains why transformers are so bad at addition.
Here is the key part of their abstract :
Specifically, we leverage RASP (Weiss et al., 2021)— a programming language designed for the computational model of a Transformer— and introduce the RASP-Generalization Conjecture: Transformers tend to length generalize on a task if the task can be solved by a short RASP program which works for all input lengths.