Tuesday, 8 November, 2022 UTC


Summary

“A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.”
— Alan Turing
AI has become intertwined with every aspect of our lives. For the last 60 years, countless scientists and philosophers have worked hard to advance the field to what it is today. Their insights often reach us in the form of cryptic yet appealing phrases whose underlying meaning escapes us. We're left nodding at a beautiful simplification of a complex soup of thought.
Large Language Models have been a buzzword in the NLP domain for a few years now. SOTA models such as GPT, BERT, and BART have all been of great use in modern-day applications. With the rise of ever newer architectures, these models have been solving NLP tasks with almost human-level fluency, precision, and accuracy (focus on the word almost).
The current state-of-the-art in language modeling is the Transformer-based GPT-3 model, which was trained on a staggering amount of text data. This model is so large and so powerful that it can generate realistic-looking text on any topic, even if it has never seen that topic before.
This is both a strength and a weakness of the model. On the one hand, it can generate text that is uncannily realistic. On the other hand, it is also capable of generating text that is nonsensical, offensive, or simply untrue.
One problem with large language models is that they often contain factual errors. For example, the GPT-2 model has generated fake news stories that are convincing enough to fool people.
Another problem is bias: large language models can be used to generate offensive or hateful content. For example, the GPT-2 model has been used to generate racist and sexist tweets.
Finally, large language models can be used to generate text that is simply nonsensical. This is not necessarily a problem, but it can be frustrating for people who are trying to use the model for a specific task (such as machine translation) and find that the generated text is gibberish.
What if there were a method by which an LLM could also self-improve using only unlabelled datasets? That is exactly what we will discuss here: a research paper that makes this claim and backs it up with results.
They claim that their approach improves the general reasoning ability of a 540B-parameter LLM (74.4% to 82.1% on GSM8K, 78.2% to 83.0% on DROP, 90.0% to 94.4% on OpenBookQA, and 63.4% to 67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label.
where:
GSM8K = grade school math word problems
DROP = a reading comprehension benchmark requiring discrete reasoning over paragraphs
OpenBookQA = a question-answering dataset modelled after open-book exams for assessing human understanding of a subject (5,957 multiple-choice elementary-level science questions)
ANLI-A3 = an adversarial benchmark designed to be challenging for current state-of-the-art models
What does it mean?
Well, if you are wondering how this will impact the performance of LLMs in the future, here is a take I came across in a blog comment that sums it up well:
  1. It gives us an example of self-improvement on which we can run experiments. We can look for failure modes in current systems, without having to scale them up to dangerous capabilities.
  2. Of all the plausible ways to do self-improvement that I can imagine, this one seems the most stable and least likely to have sudden capabilities spikes. The reason is that the basic building block of the model’s self-improvement is autoregressive pretraining. When the model generates a new dataset and trains on that dataset, the result is to move the model’s median generation in the direction of the dataset. We can simply look at the dataset the model generated to get a pretty good idea of the direction in which this step of self-modification will move the model. Of course, the compounding effects of multiple self-improvement rounds are a different matter, but as I mentioned above, we can run experiments to investigate.
  3. Humans can improve themselves in this manner. If language models had turned out to be incapable of doing so, that would have been an example of non-convergence between LMs and humans. Instead, we see convergence.
How does it work?
The paper gives quite a simple explanation for this:
“We first sample multiple predictions using few-shot Chain-of-Thought (CoT) as prompts, filter high-confidence predictions using majority voting, and finally fine-tune the LLM on these high-confidence predictions.”
Credits: Research Paper
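In code, that one-line procedure roughly amounts to the loop below. This is only a minimal sketch under assumptions: `model.generate`, the `finetune` callable, the answer-parsing convention, and the confidence cut-off are illustrative stand-ins, not the paper's actual implementation.

```python
from collections import Counter

def self_improve(model, questions, cot_prompt, finetune, num_samples=32, min_confidence=0.5):
    """One round of self-improvement: sample CoT answers, vote, fine-tune."""
    training_data = []
    for question in questions:
        prompt = f"{cot_prompt}\nQ: {question}\nA:"
        # sample several reasoning paths with a non-zero temperature
        paths = [model.generate(prompt, temperature=0.7) for _ in range(num_samples)]
        # parse the final answer from each path ("... The answer is X.")
        answers = [p.rsplit("The answer is", 1)[-1].strip(" .") for p in paths]
        # majority voting (self-consistency); the 0.5 cut-off is an illustrative choice
        majority, votes = Counter(answers).most_common(1)[0]
        if votes / num_samples >= min_confidence:
            # keep every reasoning path that reaches the majority answer
            training_data += [(question, p) for p, a in zip(paths, answers) if a == majority]
    finetune(model, training_data)  # standard LM fine-tuning on the self-generated data
    return model
```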
The one-line explanation and image above give us a superficial understanding of the architecture. But we won't leave it at that; we want to understand how it happens in some more depth.
LMSI — Language Model Self-Improved
Step 1> We take a pre-trained model and feed it only a dataset of questions.
As shown in the image above, the question passed is
“Amy is 10. Jake is 8. Alex’s age is right in the middle. How old is Alex?”
Step 2> Along with it, we pass a few-shot Chain-of-Thought (CoT) prompt with worked examples; a sketch of such a prompt follows the explanation below.
Chain of Thought Prompting
Google announced breakthrough research in Natural Language Processing called Chain of Thought Prompting that raises the state of the art of advanced technologies like PaLM and LaMDA to what the researchers call a remarkable level.
Chain of Thought
  1. allows language models to break down complex multi-step problems into a sequence of steps
  2. makes the reasoning process visible, so when things go wrong, engineers can identify where the error occurred and fix it
  3. can solve math word problems, accomplish common-sense reasoning and, according to the research paper, can (in principle) solve any word-based problem that a human can
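To make Step 2 concrete, here is a minimal sketch of how such a few-shot CoT prompt could be assembled. The exemplar wording is an illustrative assumption, not taken from the paper's actual prompt set.

```python
# Illustrative few-shot CoT prompt for Step 2; the exemplar is made up for demonstration.
COT_EXEMPLARS = """\
Q: There are 15 trees in the grove. Workers plant 6 more trees. How many trees are there now?
A: There were 15 trees. 6 more were planted, so 15 + 6 = 21. The answer is 21.
"""

def build_cot_prompt(question: str) -> str:
    """Prepend the CoT exemplars so the model imitates step-by-step reasoning."""
    return f"{COT_EXEMPLARS}\nQ: {question}\nA:"

print(build_cot_prompt("Amy is 10. Jake is 8. Alex's age is right in the middle. How old is Alex?"))
```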
Step 3> This will generate multiple reasoning paths and answers for each question passed in step 1.
This is shown in the image as different paths to get the answer. Some will be wrong, but some will be right.
Step 4> Then we use majority voting (self-consistency) to select the most consistent, highest-confidence answer. If you want to learn more about how self-consistency improves CoT reasoning, you can read the research paper here.
Showing Steps 3 and 4 of the Explanation
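In code, Steps 3 and 4 boil down to sampling several completions and keeping the answer most of them agree on. The sampled answers below are made up to illustrate the voting; in practice they would be parsed from the model's generated reasoning paths.

```python
from collections import Counter

# Toy stand-ins for the final answers parsed from several sampled reasoning paths
sampled_answers = ["9", "9", "8", "9", "10", "9"]

# majority voting (self-consistency): the most frequent answer wins
majority_answer, votes = Counter(sampled_answers).most_common(1)[0]
confidence = votes / len(sampled_answers)

print(majority_answer, confidence)  # -> 9 0.666...
```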
Step 5> Finally, all reasoning paths that lead to the most consistent answer are kept, recast into mixed formats of prompts and answers for data augmentation, and the model is fine-tuned on this self-generated reasoning-answer data.
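As a rough sketch of Step 5, each kept reasoning path can be rewritten into more than one prompt/answer format before fine-tuning. The two templates below illustrate that mixing under my own assumptions; the paper describes its own set of formats (with and without CoT, with and without exemplars).

```python
# Step 5 sketch: turn one kept reasoning path into mixed-format training text.
def make_training_examples(question: str, reasoning: str, answer: str) -> list[str]:
    with_cot = f"Q: {question}\nA: {reasoning} The answer is {answer}."  # keeps the chain of thought
    direct = f"Q: {question}\nA: The answer is {answer}."                # direct question -> answer
    return [with_cot, direct]

examples = make_training_examples(
    "Amy is 10. Jake is 8. Alex's age is right in the middle. How old is Alex?",
    "The middle of 10 and 8 is (10 + 8) / 2 = 9.",
    "9",
)
```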
Results
The paper ends with results for multiple scenarios, explaining the effects of self-consistency, CoT prompting, and so on.
Conclusion
The paper concludes that a Large Language Model (LLM) can improve its performance on reasoning datasets by training on its own generated labels, given input questions only. Experiments using an LLM with 540 billion parameters show that this approach improves accuracy on the six datasets by 1.1% to 7.7%, achieving new state-of-the-art results on ARC, OpenBookQA, and ANLI, without training on ground-truth labels.
Furthermore, the paper shows that it is possible for the LLM to self-improve even on its own generated questions and few-shot Chain-of-Thought prompts.
You can read the original paper here.
If you liked the blog, please give it a Clap, and follow me for more content.