On November 15th, Meta AI and Papers with Code announced the release of Galactica, a game-changing, open-source large language model trained on scientific knowledge with 120 billion parameters.
As one of my friends shared on Twitter, the model can write whitepapers, reviews, Wikipedia pages, and code. It knows how to cite and how to write equations. It’s kind of a big deal for AI and science.
On November 17th, Galactica was shut down.
Why? Because, as with all deep learning models, it didn't understand the task at hand and was wrong in many cases. This shouldn't be an issue, especially if we add a warning saying the model may be wrong and not to trust it blindly, just as nobody trusted Wikipedia and we couldn't use it as a reference in high school projects. The issue is that Galactica was wrong or biased but sounded right and authoritative.
Still, the model is available to researchers, and I believe it is important to keep it open-sourced.
As another one of my friends shared, all the drama around the new model seems a bit excessive. Of course, the model isn't perfect, just like all others that are currently available online. We need it online to test its limitations, work on it, and improve it. We should see these kinds of publications as students and allow for mistakes and improvements without fear of being shut down or canceled.
Anyways, we are not here to discuss that. Hopefully, it will be back online soon.
We are here to see what Galactica is, or was, and how it could achieve writing papers, reviews, code, and more…
►Read the full article: https://www.louisbouchard.ai/galactica/
►Taylor et al., 2022: Galactica, https://galactica.org/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
On November 15th, Meta AI and Papers with Code announced the release of Galactica, a game-changing, open-source large language model trained on scientific knowledge with 120 billion parameters. As one of my friends shared on Twitter, the model can write whitepapers, reviews, Wikipedia pages, and code. It knows how to cite and how to write equations. It really is kind of a big deal for AI and science.

On November 17th, Galactica was shut down. Why? Because, as with all deep learning models, it didn't understand the task at hand and was wrong in many cases. This shouldn't be an issue, especially if we add a warning saying the model may be wrong and not to trust it blindly, just as nobody trusted Wikipedia and we couldn't use it as a reference in high school projects. The issue was that Galactica was wrong and biased but sounded right and authoritative.

Still, the model is available to researchers, and I believe it's important to keep it open-sourced. As another of my friends shared, all the drama around this new model seems a bit excessive. Of course, the model isn't perfect, just like all others that are currently available online. We need it online to test its limitations, work on it, and improve it. We should see these kinds of publications as students and allow for mistakes and improvements without the fear of being shut down or canceled. Anyway, we are not here to discuss that. Hopefully, it will be back online soon. We are here to see what Galactica is, or was, and how it could achieve writing papers, reviews, code, math, and more.
Basically, Galactica is a large language model with a size comparable to GPT-3, but specialized in scientific knowledge. More precisely, it was trained on a large and curated corpus of scientific knowledge, including over 48 million papers, textbooks and lecture notes, millions of compounds and proteins, scientific websites, encyclopedias, and more. As they highlight, the data were of high quality and highly curated, which is one of the big differences with GPT-3. So, in theory, Galactica contains pretty much all of humanity's scientific knowledge. Imagine having an amazing memory and the time to read millions of research papers, remembering most of them. Well, this is Galactica. It seems like its memory isn't so good after all, and it mixes everything up, even though we could assume most of the information present in the training dataset was accurate. Even considering all the biases and failures, Galactica stays pretty powerful and outperforms pretty much all other approaches on science-related tasks. It's just not enough for a product we can have confidence in. Still, it's worth understanding how it works, especially because it will come back even more powerful pretty soon.
As we mentioned, Galactica is a large language model similar to GPT-3 or BLOOM, specifically trained to, as they say, "organize science." There's also a lot of engineering going on in this model, allowing so much versatility in its inputs and outputs, like the special tokenization of citations or protein sequences, which you can learn more about in their paper linked below. Their tokenization effort is by far the biggest contribution of this work. Tokenization basically means the way the model will see the data, instead of the words, math, or shapes that we understand. I will actually share a video on embeddings and tokenization later this week, so if that sounds interesting, stay tuned and subscribe so you don't miss it.
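To make that a bit more concrete, here is a minimal sketch of the idea behind that specialized tokenization: content that isn't plain prose, like a citation or a protein sequence, gets wrapped in dedicated start/end markers so the model can treat it as its own kind of input. The exact token strings and the helper functions below are my own illustration based on my reading of the paper, not the released tokenizer.

```python
# Minimal sketch: wrap non-prose content in special start/end markers before
# tokenization, so the model sees citations and sequences as distinct units.
# The marker strings below ([START_REF], [START_AMINO], ...) are illustrative
# and may not match Galactica's actual vocabulary exactly.

def wrap_citation(title: str, first_author: str) -> str:
    """Mark a citation span so the model learns to predict references as units."""
    return f"[START_REF] {title}, {first_author} [END_REF]"

def wrap_protein(sequence: str) -> str:
    """Mark an amino-acid sequence so it can be tokenized character by character."""
    return f"[START_AMINO]{sequence}[END_AMINO]"

if __name__ == "__main__":
    text = (
        "Self-attention is the key building block of modern language models "
        + wrap_citation("Attention Is All You Need", "Vaswani") + "."
    )
    print(text)
    print(wrap_protein("MIRLGAPQTL"))
```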
So, aside from this special tokenization and the pre-processing steps, what is Galactica, and what does it do? After taking the words or other scientific inputs and preparing them for the model through tokenization, no surprise: Galactica is yet another Transformer-based architecture, like GPT-3, with a couple of variations, including the tokenization differences. So I definitely invite you to watch one of the many videos I or some of my friends made covering the Transformer architecture, as I won't get into how it works here.
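If you just want a feel for what "decoder-only Transformer like GPT-3" means, here is a toy sketch in PyTorch: a stack of causal self-attention blocks that predicts the next token. The sizes are tiny placeholder values and the code omits many details (dropout, weight tying, Galactica's specific choices); it is not Galactica's actual implementation.

```python
# Toy decoder-only Transformer language model (illustrative only, not Galactica's code).
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class Block(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = CausalSelfAttention(dim, n_heads)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # self-attention with residual connection
        x = x + self.mlp(self.ln2(x))   # feed-forward with residual connection
        return x

class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, dim=128, n_heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.blocks = nn.Sequential(*[Block(dim, n_heads) for _ in range(n_layers)])
        self.head = nn.Linear(dim, vocab_size)  # next-token logits

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(positions)
        return self.head(self.blocks(x))

logits = TinyLM()(torch.randint(0, 1000, (1, 16)))  # shape: (batch=1, seq=16, vocab=1000)
```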
The second major difference between Galactica and other large language models is what they call prompt pre-training. This means that they include prompts extracted from the training dataset alongside the data itself, which has been shown to maximize the generality of the model while boosting performance on some tasks of interest.
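Here is a minimal sketch of that prompt pre-training idea: alongside raw documents, the pre-training stream also contains task examples rendered as prompts, so the model sees task-style text during pre-training rather than only afterwards. The template, field names, and mixing ratio below are illustrative guesses, not the actual recipe from the paper.

```python
# Sketch: mix prompt-formatted task examples into the pre-training text stream.
import random

raw_documents = [
    "The transformer architecture relies on self-attention ...",
    "Mitochondria are organelles that generate most of the cell's ATP ...",
]

task_examples = [
    {"question": "What is the powerhouse of the cell?", "answer": "The mitochondrion."},
    {"question": "Who introduced the transformer architecture?", "answer": "Vaswani et al."},
]

def render_prompt(example: dict) -> str:
    """Render a supervised example as plain text, like any other document."""
    return f"Question: {example['question']}\n\nAnswer: {example['answer']}"

def build_pretraining_stream(prompt_ratio: float = 0.2) -> list:
    """Mix raw documents with prompt-formatted examples at a chosen ratio (illustrative)."""
    stream = list(raw_documents)
    n_prompts = int(len(stream) * prompt_ratio) or 1
    stream += [render_prompt(ex) for ex in random.sample(task_examples, n_prompts)]
    random.shuffle(stream)
    return stream

for sample in build_pretraining_stream():
    print(sample, end="\n---\n")
```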
And that's pretty much it. As I said, the architecture is very similar to what you already know; it's mostly the training and pre-processing schemes that vary, which shows that the model isn't everything: how we prepare the data for it might actually matter even more. You can basically see the difference between GPT-3 and Galactica as the same student with a bad science teacher versus a good one. The student has the same capabilities and resources; the teacher just made the material more accessible and understandable.

Of course, this was just an overview of the paper, and I strongly recommend reading it. There are tons of details about the multiple engineering tricks they've implemented, along with analyses of the results, details on all the tasks they tackle with the model, how it handles the input data and its predictions, its limitations, biases, and more. I hope you've enjoyed this video, and I will see you next week with another amazing paper and a special video covering what embeddings are.