Remember that shock of seeing some breakthrough for the first time?
Like when an AI was able to complete the black rectangles above with…
… shockingly plausible solutions?
That AI was ImageGPT from OpenAI, where GPT stands for Generative Pretrained Transformer. Now surely I am not the only one who immediately went crazy with ideas about what to find a “plausible solution” to next.
Some saw just yet another AI image generator. But …
I saw an unfathomably big probability space reduced to a thin and focused slice.
And what's more shocking: reduced to that thin slice by following a lot of very complex observed and learned rules. Notice how the position of the light source influenced the lighting in the resulting generated pedestrian scene.
Notice the plausible transparency, length, and perspective of the shadows. The plausible joint anatomy of furry cat paws. That is, if a cat ever wanted to hold a card? Heck, it even added a slight paw shadow over the card.
But what is really the most important problem worth completing to move humanity forward?
Let's create a revolutionary new language model - equally as ginormous as Gpt3… But this time, instead of internet noise, let's train it only on all the best science books, journals, and peer-reviewed papers, along with critical feedback from the submission process.
Today everything is in Pdf format, where equations are internally easily processible texts. A benefit of this dataset could be that virtually all literature and papers contain at their end a list of alternative relevant work or influential sources they deemed important at that time.
Let's help gpt, with attention and explainability by putting arxiv id and date at the beginning of each doc and to all its references at its end. And by forcing the model to prefix each and every answer with a list of influencing doc ids. Example:
“Q: What is the result of 1+1, and why?”
“A: The result of 1+1 is 2, according to rules seen in arxivurl1 and arxivurl2”
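For concreteness, here is a minimal sketch of how such training documents could be assembled. The `[DOC …]` and `[REFS …]` tags are my own invented convention, not anything OpenAI or arXiv actually uses:

```python
# Hypothetical helper: wrap one paper into the proposed layout, with its
# arXiv id and date first and the ids of its references last, so the model
# can learn to attribute its answers to source documents.

def format_training_doc(arxiv_id: str, date: str, body: str, reference_ids: list) -> str:
    header = f"[DOC {arxiv_id} {date}]"
    footer = "[REFS " + " ".join(reference_ids) + "]"
    return "\n".join([header, body, footer])

doc = format_training_doc(
    arxiv_id="2101.00001",  # made-up id, purely for illustration
    date="2021-01-01",
    body="We show that 1+1=2 follows from the Peano axioms...",
    reference_ids=["1901.00001", "1902.00002"],
)
print(doc)
```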
Advantages of training on all scientific peer-reviewed sources:
- Learning to arrive at conclusions by observing the thought process down to its basic elements (arXiv), not just the conclusions themselves (Wikipedia).
- Higher information quality and density: a smaller dataset, faster training, less memory.
- The ability to follow and focus attention on chains of references and citation counts.
- Less bias. No leaks.
(Image generated completely with AI, just by pushing a button.)
After all, it can be trained and tested on an increasingly complex set of existing equations, even going as far as training on all known competing unified-theory proposals. One can then start asking for the completion of the ultimate theory itself.
What will shockingly plausible solutions look like now?
What hidden rules will the AI observe and apply from all the human scientific data and knowledge it was trained on? Will it complete an equation with a term describing a pattern it observed within the data of some LHC paper?
I proposed real-time fact-checking by AI for live TV and the net, and got access to GPT-3.
Here is an example of what GPT-3 produced just for this article. Remember, it was not taught anything explicitly. It was just shown a lot of mostly internet text and a few basic math examples. Yet, somehow, it seems to have learned this…
Test output of the GPT-3 beta as of 6 March 2021
Mind you, not every test worked, and the number of operations and the length of the numbers are so far limited too. But that is not the point. GPT-3 is just a beta and still in development. But the fact that we are observing an integer-addition avalanche from a language model? For me, unexpected and shocking to say the least.
Now, for those skeptics still believing that transformers are somehow just a variant of something like a Hidden Markov Model, i.e., unable to predict math results they have not seen, and not doing actual math-like operations as above: here is an interesting recent paper from AI big shots, and a fast explanation of it for those short on time. But in short…
“With enough complexity there is computational utility in transformers…” (Yannic Kilcher)
Now, a short text like that activates only a part of the model's input and weights. So, out of pure curiosity, I tried to find the maximum number of math equations I could ask GPT-3 to compute at once.
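A rough sketch of that kind of probe, assuming the beta-era openai-python Completions API (the pre-1.0 library; the modern client looks different), with made-up numbers:

```python
import openai

openai.api_key = "YOUR_KEY_HERE"  # beta access key

# Batch several additions into one prompt to see how many the model
# can complete in a single pass before the answers start to degrade.
problems = [(i * 17 + 3, i * 5 + 11) for i in range(10)]
prompt = "\n".join(f"Q: What is {a}+{b}?\nA:" for a, b in problems)

response = openai.Completion.create(
    engine="davinci",   # the original GPT-3 beta engine
    prompt=prompt,
    max_tokens=64,
    temperature=0,      # deterministic output, best for arithmetic probing
)
print(response.choices[0].text)
```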
So, how well would transformers fare if trained on a more complex math dataset? According to a recent study by the Facebook AI team, not half bad.
Think of equations as observed shadows of some complex structure, projected onto our limited plane of understanding.
The brilliant mathematician Antony Garrett Lisi was able to spot patterns within the properties of known particles and map them to the E8 Lie group above, and was thus, in theory, able to predict particles we have not even seen yet. An unbelievable breakthrough, if his theory is one day completed. Right?
Now consider that the human ability to spot patterns has already been largely superseded by AI.
And while GPT-3 did learn to follow and generate text language pretty well, as demonstrated by the math GPT-3 example above, we can conclude that…
Equations are just language too, with equally learnable rules.
If it can learn known math rules just by observation,
then it can learn unknown math rules as well.
The multidimensional nature of AI weights has so far proven able to observe and learn the hidden rules of these higher-dimensional constructs, and to project new shadows of them: new equations.
Just as it did with ImageGPT (up to 6.8 billion parameters) and the images of cats at the top of the article. For example, this AI, exploring math in an automated way, has already found new conjectures.
OpenAI: 175 billion parameters. Google claims 600 billion, but has given us no API to test yet. To gain scalability, Google didn't go for the expensive more-layers approach, but opted for routing between many more task-specific feed-forward expert networks, reportedly reducing training time to about 4 days on 2048 TPUs.
So, is 600 more than 175? Well, the people behind the current GPT-3 clone, EleutherAI, are perhaps a bit skeptical:
“Mixture-of-experts-based models tend to significantly underperform monolithic (regular) models for the same number of parameters.” (EleutherAI FAQ)
GPT-3 undeniably produces often-fascinating answers on subjects that were frequently present in the training dataset. Heck, it even seems to have learned simple math, since it has seen just simple math.
The current GPT-3 dataset, as described in the original paper (tokens in billions):
Common Crawl (filtered by quality): 410
WebText2: 19
Books1: 12
Books2: 55
Wikipedia: 3
Total: 499
But therein lies a problem.
Wikipedia frequently contains only the end results and conclusions,
not the important process of how we arrived at them.
A model that has observed multiple thought processes arriving at multiple conclusions is better than a model that has just observed multiple conclusions.
So what if we trained it on the whole of arXiv.org, including the complex equations that often make up 50% of the content or more?
Update: We are getting there fast. Thanks to EleutherAI and their excellent open-source GPT-3 clone GPT-Neo, the dataset “The Pile” was recently born, including a large chunk of arXiv and the DeepMind math examples, and models are being trained on it. Yay ;D
My approach to the dataset would be completely different from the currently common one of…
“Let's just throw a lot of random text at it.”
You would not start teaching a child high school math without basic math first.
If you teach it the basic rules from one educational book, like an algebra textbook packed with new info on every line, the kid will learn faster than from just reading random stories low on new info, where it would learn eventually too, but it would take way longer.
Unfortunately, that's how inefficiently current GPT models are trained.
If there is such a thing as new-information density, then not all training text is equal.
For example:
Say a text file containing 1 GB of just addition examples will teach it just addition, no matter how long it is. After the 5th occurrence, you are pretty much feeding it duplicate info and wasting CPU and memory reinforcing what is probably not that important.
So if we reorder the dataset so that the text with much higher new-information density comes first, reshuffle a little, and repeat, then we should converge far faster and, more importantly, extract much higher-level information that was not decodable before, due to not having the required lower-level prior knowledge.
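A minimal sketch of that “densest first, reshuffle a little, repeat” loop, assuming we already have some per-document density score (how to estimate one is sketched further below):

```python
import random

def curriculum_order(docs, density, jitter=0.05):
    """Order documents by estimated new-information density (highest first),
    then perturb the order slightly so each pass is not identical."""
    ranked = sorted(docs, key=lambda d: density[d], reverse=True)
    for i in range(len(ranked) - 1):
        if random.random() < jitter:  # small local reshuffle
            ranked[i], ranked[i + 1] = ranked[i + 1], ranked[i]
    return ranked

# Hypothetical training loop:
# for _ in range(num_passes):
#     for doc in curriculum_order(corpus, density_scores):
#         train_step(model, doc)
```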
Why?
In a sense, a book containing higher-level concepts like high school math is comparable to a zip file: it requires you to possess a decoding dictionary, prior external know-how, to decode its now-compressed content. Even a single symbol or duplet here often references complex and info-heavy prior knowledge.
So new-info density and order are extremely important
in order to be able to unpack, understand, and extract this new higher-level information.
Pass 1: Train for many epochs on just thousands of school educational books (the densest sources of new info on the planet), where every line is pretty much new info. And do it in a very specific order, like in school: start by explaining and teaching English language basics, basic math, basic physics, and basic chemistry, then the high school variants of all those.
Alternatively, when you don't have a large set of the desired educational books, just a random but big existing dataset, you can still implement this curriculum training: sort the documents in the dataset by estimated new-information density, for example by tracking big weight changes and their frequency per doc, as in the sketch below.
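One hedged way to implement that estimate, in schematic PyTorch: use the size of the weight update a document induces (its gradient norm) as a cheap proxy for how much new information it still carries for the current model:

```python
import torch

def doc_novelty_score(model, loss_fn, batch) -> float:
    """Bigger induced gradients => the document still teaches the model something new."""
    model.zero_grad()
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.norm().item() ** 2
    return total ** 0.5  # overall gradient norm for this document
```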
Pass 2: Fine-tune on as many synthetic, randomly computer-generated equations as you can.
I can't imagine a currently better candidate for this pass than the excellent DeepMind Mathematics dataset. Here are some of its training examples (a toy generator in its spirit is sketched after them):
Question: Solve -42*r + 27*c = -1167 and 130*r + 4*c = 372 for r.
Answer: 4
Question: Calculate -841880142.544 + 411127.
Answer: -841469015.544
Question: Let x(g) = 9*g + 1. Let q(c) = 2*c + 1. Let f(i) = 3*i - 39. Let w(j) = q(x(j)). Calculate f(w(a)).
Answer: 54*a - 30
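The real dataset and its generator live at github.com/deepmind/mathematics_dataset; as a toy illustration of the synthetic-equation idea, here is a tiny generator of my own that emits only addition and linear-equation questions in a similar format:

```python
import random

def make_addition():
    a = random.randint(-10**9, 10**9)
    b = random.randint(-10**6, 10**6)
    return f"Question: Calculate {a} + {b}.", f"Answer: {a + b}"

def make_linear():
    # Build a*r + b = c backwards from a chosen r, so the answer is an integer.
    r = random.randint(-50, 50)
    a = random.choice([n for n in range(-20, 21) if n != 0])
    b = random.randint(-100, 100)
    c = a * r + b
    return f"Question: Solve {a}*r + {b} = {c} for r.", f"Answer: {r}"

for make in (make_addition, make_linear):
    question, answer = make()
    print(question)
    print(answer)
```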
Pass 3: Now that the model has a firm grasp of mathematics and its computational units have formed, only now train on all the arXiv papers and theory candidates containing complex math equations. Well, complex for humans, since our brain's context and memory buffers are limited.
The model should then, in theory, be able to interactively explain any part of how it came to the conclusion it arrived at. And what's more, like GPT-3, it could…
Contemplate new variants. Papers themselves frequently propose other possible directions that were, due to time constraints, not explored. Potential that AI may definitely tap into, thanks to its near-infinite scalability.
Of course, I immediately went crazy and proposed this to arXiv, OpenAI, and Google in an incoherent and overexcited mail ;D Like this article is.
Because sometimes even the smallest spark can lead to a big fire.
And yes, I know, there is a lot on everyone's plate nowadays. And I also know it's not like you flick a switch and you're done. But when you think about it more, it can reuse already existing text-only toolchains…
And pretty much all papers and books are already in PDF format, where, if you try to select an equation, you will find that it already is in text-like form.
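A quick way to check that claim yourself: pull the raw text layer out of a paper PDF with PyMuPDF (pip install pymupdf). The filename "paper.pdf" is just a placeholder:

```python
import fitz  # PyMuPDF

doc = fitz.open("paper.pdf")
for page in doc:
    text = page.get_text()
    # Inline equations typically show up here as plain unicode text,
    # which is exactly what makes a text-only training toolchain feasible.
    print(text[:500])
```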
Indeed, that's how a 76 GB arXiv dataset was meanwhile born. See this paper.
I know, I know.
But as the Big Rocket Man would say:
We should dream BIG to be excited about every next morning.
Also published on Medium.