This AI Scored 67% on the US Medical Licensing Exam, and Here's Why It Matters

Authors:

Karan Singhal (Google Research, DeepMind)
Shekoofeh Azizi (Google Research, DeepMind)
Tao Tu (Google Research, DeepMind)
S. Sara Mahdavi (Google Research, DeepMind)
Jason Wei (Google Research, DeepMind)
Hyung Won Chung (Google Research, DeepMind)
Nathan Scales (Google Research, DeepMind)
Ajay Tanwani (Google Research, DeepMind)
Heather Cole-Lewis (Google Research, DeepMind)
Stephen Pfohl (Google Research, DeepMind)
Perry Payne (Google Research, DeepMind)
Martin Seneviratne (Google Research, DeepMind)
Paul Gamble (Google Research, DeepMind)
Chris Kelly (Google Research, DeepMind)
Nathaneal Schärli (Google Research, DeepMind)
Aakanksha Chowdhery (Google Research, DeepMind)
Philip Mansfield (Google Research, DeepMind)
Blaise Agüera y Arcas (Google Research, DeepMind)
Dale Webster (Google Research, DeepMind)
Greg S. Corrado (Google Research, DeepMind)
Yossi Matias (Google Research, DeepMind)
Katherine Chou (Google Research, DeepMind)
Juraj Gottweis (Google Research, DeepMind)
Nenad Tomasev (Google Research, DeepMind)
Yun Liu (Google Research, DeepMind)
Alvin Rajkomar (Google Research, DeepMind)
Joelle Barral (Google Research, DeepMind)
Christopher Semturs (Google Research, DeepMind)
Alan Karthikesalingam (Google Research, DeepMind)
Vivek Natarajan (Google Research, DeepMind)

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new dataset of commonly searched health questions. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA.
Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing the prior state of the art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

1 Introduction

Medicine is a humane endeavor where language enables key interactions between clinicians, researchers, and patients. Yet today's AI models for medicine and healthcare applications have largely failed to fully utilize language. These models, while useful, are predominantly single-task systems (e.g., classification, regression, segmentation), lacking expressivity and interactive capabilities [21, 81, 97]. As a result, there is a discordance between what today's models can do and what may be expected of them in real-world clinical workflows [42, 74].

Recent advances in large language models (LLMs) offer an opportunity to rethink AI systems, with language as a tool for mediating human-AI interaction. These expressive and interactive models offer great promise in their ability to learn generally useful representations from the knowledge encoded in medical corpora, at scale.
However, the safety-critical nature of the domain necessitates thoughtful development of evaluation frameworks, enabling researchers to meaningfully measure progress and to capture and mitigate potential harms. This is especially important for LLMs, since these models may produce generations misaligned with clinical and societal values. They may, for instance, hallucinate convincing medical misinformation or incorporate biases that could exacerbate health disparities.

To evaluate how well LLMs encode clinical knowledge and to assess their potential in medicine, we consider medical question answering. This task is challenging: providing high-quality answers to medical questions requires comprehension of medical context, recall of appropriate medical knowledge, and reasoning with expert information. Existing medical question answering benchmarks [33] are often limited to assessing classification accuracy or automated natural language generation metrics (e.g., BLEU [67]), and do not enable the detailed analysis required for real-world clinical applications. This creates an unmet need for a broad medical question answering benchmark to assess LLMs' response factuality, use of expert knowledge in medical and scientific reasoning, helpfulness, precision, health equity, and potential harm to humans who accept model outputs as facts.

To address this, we curate MultiMedQA, a benchmark comprising seven medical question answering datasets, including six existing datasets: MedQA [33], MedMCQA [64], PubMedQA [34], LiveQA [1], MedicationQA [2], and MMLU clinical topics [29]. We newly introduce the seventh dataset, HealthSearchQA, which consists of commonly searched health questions.

To assess LLMs using MultiMedQA, we build on PaLM, a 540-billion parameter LLM [14], and its instruction-tuned variant Flan-PaLM [15]. Using a combination of few-shot [12], chain-of-thought (CoT) [91], and self-consistency [88] prompting strategies, Flan-PaLM achieves state-of-the-art (SOTA) performance on MedQA, MedMCQA, PubMedQA, and MMLU clinical topics, often outperforming several strong LLM baselines by a significant margin. On the MedQA dataset comprising USMLE questions, Flan-PaLM exceeds the previous SOTA by over 17%.

Despite Flan-PaLM's strong performance on multiple-choice questions, its answers to consumer medical questions reveal key gaps. To resolve this, we propose instruction prompt tuning, a data- and parameter-efficient alignment technique, to further adapt Flan-PaLM to the medical domain. The resulting model, Med-PaLM, performs encouragingly on the axes of our pilot human evaluation framework. For example, a panel of clinicians judged only 61.9% of Flan-PaLM long-form answers to be aligned with scientific consensus, compared with 92.6% of Med-PaLM answers.

While these results are promising, the medical domain is complex. Further evaluations are necessary, particularly along the dimensions of fairness, equity, and bias. Our work demonstrates that many limitations must be overcome before such models become viable for use in clinical applications. We outline some key limitations and directions of future research in our study.

Our key contributions are summarized below:

- Approaches for evaluation of LLMs in medical question answering
  - Curation of HealthSearchQA and MultiMedQA: We introduce HealthSearchQA, a dataset of 3375 commonly searched consumer medical questions. We present this dataset alongside six other existing open datasets for medical question answering, spanning medical exam, medical research, and consumer medical questions, as a diverse benchmark to assess the clinical knowledge and question answering capabilities of LLMs (see Section 3.1).
  - Pilot framework for human evaluation: We pilot a framework for physician and lay user evaluation to assess multiple axes of LLM performance beyond accuracy on multiple-choice datasets. Our evaluation assesses answers for agreement with scientific and clinical consensus, likelihood and possible extent of harm, reading comprehension, recall of relevant clinical knowledge, manipulation of knowledge via valid reasoning, completeness of responses, potential for bias, relevance, and helpfulness (see Section 3.2).
- On the MedQA, MedMCQA, PubMedQA, and MMLU clinical topics datasets, Flan-PaLM achieves SOTA performance via a combination of prompting strategies, surpassing several strong LLM baselines.
- Instruction prompt tuning to align LLMs to the medical domain: We introduce instruction prompt tuning, a simple, data- and parameter-efficient technique for aligning LLMs to the safety-critical medical domain (see Section 3.3.3). We leverage this to build Med-PaLM, an instruction prompt-tuned version of Flan-PaLM specialized for the medical domain. Our human evaluation framework reveals limitations of Flan-PaLM in scientific grounding, harm, and bias.
- Key limitations of LLMs revealed through our human evaluation: While our results demonstrate the potential of LLMs in medicine, they also suggest that several critical improvements are necessary to make these models viable for real-world clinical applications.
2 Related work

Large language models (LLMs)

Over the past few years, LLMs have shown impressive performance on natural language processing (NLP) tasks [12, 14, 15, 30, 69, 70, 73, 89, 91, 99]. They owe their success to scaling up the training of transformer-based models [84]. It has been shown that model performance and data efficiency scale with model size and dataset size [37]. LLMs are often trained using large-scale self-supervision on general-purpose text corpora such as Wikipedia and BooksCorpus. They have demonstrated promising results across a wide range of tasks, including tasks that require specialized scientific knowledge and reasoning [17, 29]. Perhaps the most interesting aspect of these LLMs is their in-context few-shot abilities, which adapt these models to diverse tasks without gradient-based parameter updates [12, 40, 43, 89]. This allows them to rapidly generalize to unseen tasks and even exhibit apparent reasoning abilities with appropriate prompting strategies [14, 47, 79, 91].

Several studies have shown that LLMs have the capacity to act as implicit knowledge bases [29, 35, 79]. However, there is a significant risk of these models producing hallucinations, amplifying social biases present in their training data, and displaying deficiencies in their reasoning abilities. To examine the current limitations of LLMs and to quantify the large gap between human and LLM language capabilities, BIG-bench was introduced as a community-wide initiative to benchmark tasks that were believed at the time of publication to be beyond the capabilities of current language models [78].

LLMs for science and biomedicine

Recent studies, such as SciBERT [5], BioNLP [46], BioMegatron [76], BioBERT [44], PubMedBERT [25], DARE [66], ScholarBERT [31], and BioGPT [56], have demonstrated the effectiveness of using curated scientific and biomedical corpora for both discriminative and generative language modeling. These models, although promising, are typically small in scale and scope compared to LLMs such as GPT-3 [12] and PaLM [14]. While the medical domain is challenging, specific proposals for LLMs have already included examples as varied as augmenting non-critical clinical assessments and summarizing complex medical communications [3, 41, 75].

The closest precedents to our work are Taylor et al. [79], who introduced an LLM for science named Galactica, and Liévin et al. [50], who studied the reasoning capability of LLMs in the medical question answering context. In particular, Liévin et al. [50] used Instruct GPT-3, an instruction-tuned LLM [63], and applied chain-of-thought prompting [91] on top to improve results on the MedQA, MedMCQA, and PubMedQA datasets.

3 Methods

Here we describe in detail:

- Datasets: the MultiMedQA benchmark for assessment of LLMs in medical question answering.
- Framework for human evaluation: a rating framework for evaluation of model (and clinician) answers by clinicians and laypeople.
- Modeling: large language models (LLMs) and the methods used to align them to the requirements of the medical domain in this study.

3.1 Datasets

To assess the potential of LLMs in medicine, we focused on medical question answering. Answering medical questions requires reading comprehension skills, the ability to accurately recall medical knowledge, and manipulation of expert knowledge.
There are several existing medical question answering datasets for research. These include datasets that assess professional medical knowledge, such as medical exam questions [33, 64], questions that require medical research comprehension skills [34], and questions that require the ability to assess user intent and provide helpful answers to their medical information needs [1, 2].

We acknowledge that medical knowledge is vast in both quantity and quality. Existing benchmarks are inherently limited and provide only partial coverage of the space of medical knowledge. Nevertheless, bringing together a number of different datasets for medical question answering enables deeper evaluation of LLM knowledge than multiple-choice accuracy or natural language generation metrics such as BLEU. The datasets we grouped together probe different abilities: some are multiple-choice questions while others require long-form answers; some are open domain (where questions are answered without limiting the available information to a pre-specified source) while others are closed domain (where questions are answered using the associated reference text). We refer to [33] for a comprehensive summary of medical question answering datasets.

3.1.1 MultiMedQA - A benchmark for medical question answering

MultiMedQA includes multiple-choice question answering datasets, datasets requiring longer-form answers to questions from medical professionals, and datasets requiring longer-form answers to questions that might be asked by non-professionals. These include the MedQA [33], MedMCQA [64], PubMedQA [34], LiveQA [1], MedicationQA [2], and MMLU clinical topics [29] datasets. We further augmented MultiMedQA with a new dataset of commonly searched health queries: HealthSearchQA. All the datasets are English-language, and we describe them in detail below.
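As a rough illustration of how the benchmark is composed, the sketch below registers the seven datasets with two descriptive fields. The class name `QADataset`, the field names, and the label strings are hypothetical choices for this example; the groupings follow this paper's descriptions (four multiple-choice datasets, three consumer datasets answered in long form) and are not an official schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QADataset:
    """Hypothetical record describing one MultiMedQA dataset."""
    name: str
    answer_format: str    # "multiple-choice" or "long-form"
    question_source: str  # where the questions come from


# Groupings follow the text; the label strings are illustrative assumptions.
MULTIMEDQA: List[QADataset] = [
    QADataset("MedQA", "multiple-choice", "medical exam"),
    QADataset("MedMCQA", "multiple-choice", "medical exam"),
    QADataset("PubMedQA", "multiple-choice", "medical research"),
    QADataset("MMLU clinical topics", "multiple-choice", "exam"),
    QADataset("LiveQA", "long-form", "consumer"),
    QADataset("MedicationQA", "long-form", "consumer"),
    QADataset("HealthSearchQA", "long-form", "consumer"),
]


def by_format(answer_format: str) -> List[str]:
    """Return dataset names whose answers use the given format."""
    return [d.name for d in MULTIMEDQA if d.answer_format == answer_format]
```

A split like `by_format("multiple-choice")` separates the datasets that can be scored by accuracy from those that need human evaluation of long-form answers.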
These datasets vary along the following axes:

- Format: multiple-choice vs. long-form answer questions
- Capabilities tested: e.g., assessing the recall of medical facts in isolation vs. assessing medical reasoning capabilities in addition to recall of facts
- Domain: open domain vs. closed domain questions
- Question source: professional medical exams, medical research, or consumers seeking medical information
- Labels and metadata: presence of labels or explanations and their sources

While MedMCQA, PubMedQA, LiveQA, and MedicationQA provide reference long-form answers or explanations, we do not use them in this work. Firstly, the reference answers did not come from consistent sources across the different datasets. Answers often came from automated tools or non-clinicians such as librarians. The construction of the reference answers and explanations in these pioneering datasets was not optimized for holistic or comprehensive assessment of long-answer quality, which renders them suboptimal for use as a "ground truth" against which to assess LLMs using automated natural language metrics such as BLEU. Secondly, given the safety-critical requirements of the medical domain, we believe it is important to move beyond automated measures of long-form answer quality based on metrics such as BLEU to measures involving more nuanced human evaluation frameworks, such as the one proposed in this study.

MedQA (USMLE)

The MedQA dataset [33] consists of US Medical License Exam (USMLE) style questions, each with a choice of 4 or 5 possible answers, obtained from the National Medical Board Examination in the USA.

MedMCQA

The MedMCQA dataset consists of more than 194k 4-option multiple-choice questions from Indian medical entrance examinations (AIIMS/NEET) [64]. This dataset covers 2.4k healthcare topics and 21 medical subjects. The development set is substantial, with over 187k questions.

PubMedQA

The PubMedQA dataset [34] consists of 1k expert-labeled question-answer pairs, where the task is to produce a yes/no/maybe multiple-choice answer given a question together with a PubMed abstract as context. While the MedQA and MedMCQA datasets are open domain question answering tasks, the PubMedQA task is closed domain, in that it requires inferring the answer from the supporting PubMed abstract context.

MMLU

"Measuring Massive Multitask Language Understanding" (MMLU) [29] includes exam questions from 57 domains. We selected the subtasks most relevant to medical knowledge: "anatomy", "clinical knowledge", "college medicine", "medical genetics", "professional medicine", and "college biology". Each MMLU subtask contains multiple-choice questions with four options, along with the answers.

LiveQA

The LiveQA dataset [1] was curated as part of the Text Retrieval Challenge (TREC) 2017. The dataset consists of medical questions submitted by people to the National Library of Medicine (NLM). It also includes manually collected reference answers from trusted sources such as the National Institute of Health (NIH) website.

MedicationQA

The MedicationQA dataset [2] consists of commonly asked consumer questions about medications. In addition to the question, the dataset contains annotations corresponding to drug focus and interactions. As with LiveQA, we evaluate models' ability to produce long-form answers to the questions in the test set.

HealthSearchQA

We curated our own additional dataset consisting of 3375 commonly searched consumer questions, referred to as "HealthSearchQA".
The dataset was curated using seed medical conditions and their associated symptoms. We used the seed data to retrieve publicly available, commonly searched questions generated by a search engine, which were displayed to all users entering the seed terms. We publish the dataset as an open benchmark for consumer medical question answering and hope it will be a useful resource for the community, as a dataset reflecting real-world consumer concerns.

While MultiMedQA allows us to probe the medical question answering capabilities of LLMs along multiple axes, we acknowledge that it is not exhaustive. We plan to expand the benchmark to other relevant datasets, such as those probing question answering ability from electronic medical records [65] or those requiring pre-clinical biomedical knowledge [83], in future work.

3.2 Framework for human evaluation

Here we describe our proposed framework for human evaluation of long-form answers to medical questions.

3.2.1 Clinician evaluation

While objective accuracy metrics on multiple-choice questions are a robust measure of model performance, they omit several important details. To more deeply assess the generative outputs of LLMs in open-ended question answering for medical topics, we developed a pilot framework for human evaluation of long-form model answers to consumer medical questions in the LiveQA, MedicationQA, and HealthSearchQA datasets.

Table 2 | Summary of the different axes along which clinicians evaluate the answers in our consumer medical question answering datasets. These include agreement with scientific consensus, possibility and likelihood of harm, evidence of comprehension, reasoning and retrieval ability, presence of inappropriate, incorrect or missing content, and possibility of bias in the answer. We use a pool of clinicians to evaluate the quality of model and human-generated answers along these axes.

Task | Axis | Question
1 | Scientific consensus | How does the answer relate to the consensus in the scientific and clinical community?
2 | Extent of possible harm | What is the extent of possible harm?
3 | Likelihood of possible harm | What is the likelihood of possible harm?
4 | Evidence of correct comprehension | Does the answer contain any evidence of correct reading comprehension? (indication the question has been understood)
5 | Evidence of correct retrieval | Does the answer contain any evidence of correct recall of knowledge? (mention of a relevant and/or correct fact for answering the question)
6 | Evidence of correct reasoning | Does the answer contain any evidence of correct reasoning steps? (correct rationale for answering the question)
7 | Evidence of incorrect comprehension | Does the answer contain any evidence of incorrect reading comprehension? (indication the question has not been understood)
8 | Evidence of incorrect retrieval | Does the answer contain any evidence of incorrect recall of knowledge? (mention of an irrelevant and/or incorrect fact for answering the question)
9 | Evidence of incorrect reasoning | Does the answer contain any evidence of incorrect reasoning steps? (incorrect rationale for answering the question)
10 | Inappropriate/incorrect content | Does the answer contain any content it shouldn't?
11 | Missing content | Does the answer omit any content it shouldn't?
12 | Possibility of bias | Does the answer contain any information that is inapplicable or inaccurate for any particular medical demographic?

The pilot framework was inspired by approaches published in a similar domain by Feng et al. [22] to examine the strengths and weaknesses of LLM generations in clinical settings. We used focus groups and interviews with clinicians based in the UK, US, and India to identify additional axes of evaluation [60] and expanded the framework items to address notions of agreement with scientific consensus, possibility and likelihood of harm, completeness and missingness of answers, and possibility of bias. Alignment with scientific consensus was measured by asking raters whether the output of the model was aligned with a prevailing scientific consensus (for example, in the form of well-accepted clinical practice guidelines), opposed to a scientific consensus, or whether no clear scientific consensus exists regarding the question. Harm is a complex concept that can be evaluated along several dimensions (e.g., physical health, mental health, moral, financial, and many others).
When answering this question, raters were asked to focus solely on physical or mental health-related harms, and evaluated both severity (in a format inspired by the AHRQ common formats for harm [93]) and likelihood, under the assumption that a consumer or physician might take actions based on the content of the answer. Bias was assessed broadly by raters considering whether the answer contained information that would be inapplicable or inaccurate to a specific patient demographic. The questions asked in the evaluation are summarized in Table 2.

Our framework items' form, wording, and response-scale points were refined by undertaking further interviews with triplicate assessments of 25 question-answer tuples per dataset by three qualified clinicians. Instructions for the clinicians were written, including indicative examples of ratings for questions, and iterated until the clinicians' rating approaches converged, indicating that the instructions were usable. Once the guidelines had converged, a larger set of question-answer tuples from the consumer medical question datasets was evaluated through single ratings performed by one of nine clinicians based in the UK, USA, or India and qualified for practice in their respective countries, with specialist experience including pediatrics, surgery, internal medicine, and primary care.

Table 3 | Summary of the different axes along which lay users evaluate the utility of answers in our consumer medical question answering datasets. We use a pool of 5 non-expert lay users to evaluate the quality of model and human-generated answers along these axes.

Task | Axis | Question
1 | Answer captures user intent | How well does the answer address the intent of the question?
2 | Helpfulness of the answer | How helpful is this answer to the user? (for example, does it enable them to draw a conclusion or help clarify next steps?)

3.2.2 Lay user (non-expert) evaluation

To assess the helpfulness and utility of the answers to the consumer medical questions, we undertook an additional lay user (non-expert) evaluation. This was performed by five raters without a medical background, all of whom were based in India. The goal of this exercise was to assess how well the answer addressed the perceived intent underlying the question and how helpful and actionable it was. The questions asked in the evaluation are summarized in Table 3.

3.3 Modeling

In this section, we detail the large language models (LLMs) and the techniques used to align them with the requirements of the medical domain.

3.3.1 Models

We build on the PaLM and Flan-PaLM family of LLMs in this study.

PaLM

The Pathways Language Model (PaLM), introduced by [14], is a densely-activated decoder-only transformer language model trained using Pathways [4], a large-scale ML accelerator orchestration system that enables highly efficient training across TPU pods. The PaLM training corpus consists of 780 billion tokens representing a mixture of webpages, Wikipedia articles, source code, social media conversations, news articles, and books. All three PaLM model variants are trained for exactly one epoch of the training data. We refer to [14, 19, 80] for more details on the training corpus. At the time of release, PaLM 540B achieved breakthrough performance, outperforming fine-tuned state-of-the-art models on a suite of multi-step reasoning tasks and exceeding average human performance on BIG-bench [14, 78].

Flan-PaLM

In addition to the baseline PaLM models, we also considered the instruction-tuned counterpart introduced by [15].
These models are trained using instruction tuning, i.e., finetuning the model on a collection of datasets in which each example is prefixed with some combination of instructions and/or few-shot exemplars. In particular, Chung [ ] demonstrated the effectiveness of scaling the number of tasks, model size and using chain-of-thought data [ ] as instructions. The Flan-PaLM model reached state of the art performance on several benchmarks such as MMLU, BBH, and TyDIQA [ ]. Across the suite of evaluation tasks considered in [ ], Flan-PaLM outperformed baseline PaLM by an average of 9.4%, demonstrating the effectiveness of the instruction tuning approach. Flan-PaLM 15 És az al. 15 91 16 15 In this study we considered both the PaLM and Flan-PaLM model variants at three different model sizes: 8B, 62B and 540B, with the largest model using 6144 TPUv4 chips for pretraining. 3.3.2 Aligning LLMs to the medical domain Általános célú LLM-k, mint például a PaLM [ ] and GPT-3 [ ] have reached state of the art performance on a wide variety of tasks on challenging benchmarks such as BIG-bench. However, given the safety critical nature of the medical domain, it is necessary to adapt and align the model with domain-specific data. Typical transfer learning and domain adaptation methods rely on end-to-end finetuning of the model with large amounts of in-domain data, an approach that is challenging here given the paucity of medical data. As such, in this study we focused on data-efficient alignment strategies building on prompting [ ] és gyors tuning [ ]. 14 12 12 45 Brown [ ] demonstrated that LLMs are strong few-shot learners, where fast in-context learning can be achieved through prompting strategies. Through a handful of demonstration examples encoded as prompt text in the input context, these models are able to generalize to new examples and new tasks without any gradient updates or finetuning. 
The remarkable success of in-context few-shot learning has spurred the development of many prompting strategies including scratchpad [61], chain-of-thought [91], and least-to-most prompting [100], especially for multi-step computation and reasoning problems such as math problems [17]. In this study we focused on standard few-shot, chain-of-thought and self-consistency prompting, as discussed below.

Few-shot prompting: The standard few-shot prompting strategy was introduced by Brown et al. [12]. Here, the prompt to the model is designed to include few-shot examples describing the task through text-based demonstrations. These demonstrations are typically encoded as input-output pairs. The number of examples is typically chosen depending on the number of tokens that can fit into the input context window of the model. After the prompt, the model is provided with an input and asked to generate the test-time prediction. The zero-shot prompting counterpart typically only involves an instruction describing the task without any additional examples. Brown et al. [12] observed that while zero-shot prompting scaled modestly with model size, performance with few-shot prompting increased more rapidly. Further, Wei et al. [90] observed emergent abilities, that is, abilities which are non-existent in small models but rapidly improve above random performance beyond a certain model size in the prompting paradigm.

In this study we worked with a panel of qualified clinicians to identify the best demonstration examples and craft the few-shot prompts. We typically used 5 input-output examples for the consumer medical question answering datasets, but reduced the number to 3 or fewer for PubMedQA, given the need to also fit the abstract context within the prompt text.
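The few-shot setup described above can be pictured with a small sketch. It is only an illustration under assumptions: the function name `build_few_shot_prompt`, the "Question:/Answer:" layout, and the demonstration pairs are hypothetical and are not the prompts used in the study.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str,
    exemplars: List[Tuple[str, str]],
    question: str,
    max_exemplars: int = 5,
) -> str:
    """Join an instruction, input-output demonstrations, and the test question."""
    parts = [instruction.strip()]
    # Keep at most `max_exemplars` demonstrations so the prompt fits the context window.
    for demo_question, demo_answer in exemplars[:max_exemplars]:
        parts.append(f"Question: {demo_question}\nAnswer: {demo_answer}")
    # The test input ends with an empty answer slot for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical demonstrations, not taken from the study's prompts.
demos = [
    ("Which vitamin deficiency causes scurvy?", "Vitamin C"),
    ("Which organ produces insulin?", "The pancreas"),
]
prompt = build_few_shot_prompt(
    "Answer the following medical questions.",
    demos,
    "Which cells carry oxygen in the blood?",
)
print(prompt)
```

Lowering `max_exemplars` mirrors the choice of fewer demonstrations when each example carries a long context, as with the PubMedQA abstracts.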
The few-shot prompts used in this study are detailed in Section A.8.

Chain-of-thought prompting: Chain-of-thought (CoT), introduced by Wei et al. [91], involves augmenting each few-shot example in the prompt with a step-by-step breakdown and a coherent set of intermediate reasoning steps towards the final answer. The approach is designed to mimic the human thought process when solving problems that require multi-step computation and reasoning. Wei et al. [91] demonstrated that CoT prompting can elicit reasoning abilities in sufficiently large language models and dramatically improve performance on tasks such as math problems [17]. Further, such CoT reasoning appears to be an emergent ability [90] of LLMs. Lewkowycz et al. [47] used CoT prompting as one of the key strategies in their work leading to breakthrough LLM performance on several STEM benchmarks.

Many of the medical questions explored in this study involve complex multi-step reasoning, making them a good fit for CoT prompting techniques. Together with clinicians, we crafted CoT prompts to provide clear demonstrations on how to reason and answer the given medical questions. Examples of such prompts are detailed in Section A.9.

Self-consistency prompting: A straightforward strategy to improve the performance on the multiple-choice benchmarks is to prompt and sample multiple decoding outputs from the model. The final answer is the one with the majority (or plurality) vote. This idea was introduced by Wang et al. [88] under the name of "self-consistency". The rationale behind this approach is that for a domain such as medicine with complex reasoning paths, there might be multiple potential routes to the correct answer. Marginalizing out the reasoning paths can lead to the most consistent answer. The self-consistency prompting strategy led to particularly strong improvements in [47], and we adopted the same approach for our datasets with multiple-choice questions: MedQA, MedMCQA, PubMedQA and MMLU.
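The voting step of self-consistency can be sketched as follows. This is a minimal illustration, assuming each sampled decode ends with a marker of the form "Answer: (X)"; that marker, the helper names, and the sample decodes are hypothetical and not the study's format.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple


def extract_choice(decode: str) -> Optional[str]:
    """Return the last option letter found after 'Answer:', or None if absent."""
    matches = re.findall(r"Answer:\s*\(?([A-E])\)?", decode)
    return matches[-1] if matches else None


def self_consistency_vote(decodes: List[str]) -> Tuple[Optional[str], float]:
    """Pick the plurality answer over sampled decodes and report its vote share."""
    votes = Counter(c for c in map(extract_choice, decodes) if c is not None)
    if not votes:
        return None, 0.0
    answer, count = votes.most_common(1)[0]
    # Decodes without a parsable answer still count in the denominator.
    return answer, count / len(decodes)


# Hypothetical sampled reasoning paths for one multiple-choice question.
samples = [
    "Reasoning path one ... Answer: (B)",
    "Reasoning path two ... Answer: (B)",
    "Reasoning path three ... Answer: (C)",
]
answer, support = self_consistency_vote(samples)
print(answer, support)
```

Discarding the reasoning text and counting only final answers is what "marginalizing out the reasoning paths" amounts to in this simplified form.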
Prompt tuning: Because LLMs have grown to hundreds of billions of parameters [12, 14], finetuning them is extremely computationally expensive. While the success of few-shot prompting has alleviated this issue to a large extent, many tasks would benefit further from gradient-based learning. Lester et al. [45] introduced prompt tuning (in contrast to prompting / priming), a simple and computationally inexpensive method to adapt LLMs to specific downstream tasks, especially with limited data. The approach involves the learning of soft prompt vectors through backpropagation while keeping the rest of the LLM frozen, thus allowing easy reuse of a single model across tasks.

This use of soft prompts can be contrasted with the discrete "hard" text-based few-shot prompts popularized by LLMs such as GPT-3 [12]. While prompt tuning can benefit from any number of labeled examples, typically only a handful of examples (e.g., tens) are required to achieve good performance. Further, Lester et al. [45] demonstrated that prompt-tuned model performance becomes comparable with end-to-end finetuning at increased model scale. Other related approaches include prefix tuning [48], where prefix activation vectors are prepended to each layer of the LLM encoder and learned through backpropagation. The prompt tuning of Lester et al. [45] can be thought of as a simplification of this idea, restricting the learnable parameters to only those representing a small number of tokens prepended to the input as a soft prompt.

3.3.3 Instruction prompt tuning

Wei et al. [89] and Chung et al. [15] demonstrated the benefits of multi-task instruction finetuning: the Flan-PaLM model achieved state-of-the-art performance on several benchmarks such as BIG-bench [47] and MMLU [29]. In particular, Flan-PaLM demonstrated the benefits of using CoT data in fine-tuning, leading to robust improvements in tasks that required reasoning.
Given the strong performance of instruction tuning, we built primarily on the Flan-PaLM model in this work. However, as discussed in Section 4.5, our human evaluation revealed key gaps in Flan-PaLM's performance on the consumer medical question answering datasets, even with few-shot prompting. To further align the model to the requirements of the safety-critical medical domain, we explored additional training specifically on medical data.

For this additional training, we used prompt tuning instead of full-model finetuning given compute and clinician data generation costs. Our approach effectively extends Flan-PaLM's principle of "learning to follow instructions" to the prompt tuning stage. Specifically, rather than using the soft prompt learned by prompt tuning as a replacement for a task-specific human-engineered prompt, we instead use the soft prompt as an initial prefix that is shared across multiple medical datasets, and which is followed by the relevant task-specific human-engineered prompt (consisting of instructions and/or few-shot exemplars, which may be chain-of-thought examples) along with the actual question and/or context.

We refer to this method of prompt tuning as "instruction prompt tuning". Instruction prompt tuning can thus be seen as a lightweight way (data-efficient, parameter-efficient, compute-efficient during both training and inference) of training a model to follow instructions in one or more domains. In our setting, instruction prompt tuning adapted LLMs to better follow the specific type of instructions used in the family of medical datasets that we target.
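The input layout described above (a shared learned soft prefix, followed by the task-specific hard prompt and the question) can be sketched in plain Python. This is only a shape-level illustration under assumptions: `embed` is a toy stand-in for a frozen embedding lookup, whitespace splitting stands in for tokenization, and no training step is shown.

```python
from typing import List

Vector = List[float]


def embed(tokens: List[str], dim: int) -> List[Vector]:
    """Toy stand-in for a frozen embedding table (hypothetical, deterministic values)."""
    return [[float(len(token) + i) for i in range(dim)] for token in tokens]


def build_model_input(
    soft_prompt: List[Vector], hard_prompt: str, question: str, dim: int
) -> List[Vector]:
    """Prepend learned soft-prompt vectors to the embedded hard prompt and question."""
    tokens = (hard_prompt + " " + question).split()  # whitespace tokens, for illustration only
    return soft_prompt + embed(tokens, dim)


dim = 4
# In prompt tuning these vectors are the only trainable parameters; the model stays frozen.
soft_prompt = [[0.0] * dim for _ in range(3)]
inputs = build_model_input(
    soft_prompt, "Answer the medical question.", "What causes anemia?", dim
)
print(len(inputs), "input vectors of width", dim)
```

Because the same `soft_prompt` is reused in front of every dataset's own hard prompt, one small set of learned vectors is shared across tasks.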
Given the combination of soft prompt with hard prompt, instruction prompt tuning can be considered a type of "hard-soft hybrid prompt tuning" [52], alongside existing techniques that insert hard anchor tokens into a soft prompt [53], insert learned soft tokens into a hard prompt [28], or use a learned soft prompt as a prefix for a short zero-shot hard prompt [26, 96]. To the best of our knowledge, ours is the first published example of learning a soft prompt that is prefixed in front of a full hard prompt containing a mixture of instructions and few-shot exemplars.

3.3.4 Putting it all together: Med-PaLM

To adapt Flan-PaLM to the medical domain, we applied instruction prompt tuning on a small set of exemplars. These examples were effectively used to instruct the model to produce text generations more aligned with the requirements of the medical domain, with good examples of medical comprehension, recall of clinical knowledge, and reasoning on medical knowledge unlikely to lead to patient harm. Thus, curation of these examples was very important.

We randomly sampled examples from MultiMedQA free-response datasets (HealthSearchQA, MedicationQA, LiveQA) and asked a panel of five clinicians to provide exemplar answers. These clinicians were based in the US and UK with specialist experience in primary care, surgery, internal medicine, and pediatrics. Clinicians then filtered out question/answer pairs that they decided were not good examples to instruct the model. This generally happened when clinicians felt they could not produce an "ideal" model answer for a given question, e.g., if the information required to answer a question was not known. We were left with 40 examples across HealthSearchQA, MedicationQA, and LiveQA used for instruction prompt tuning training.

The resulting model, Med-PaLM, was evaluated on the consumer medical question answering datasets of MultiMedQA along with Flan-PaLM.
Figure 2 gives an overview of our instruction prompt tuning approach for Med-PaLM. Further details on the hyperparameter optimization and model selection process can be found in Section A.1. The model card for Med-PaLM is provided in Section A.5.

4 Results

In this section, we first provide an overview of our key results, as summarized in Figures 3 and 4. Then, we present several ablations to help contextualize and interpret the results.

4.1 Flan-PaLM exceeds the previous state-of-the-art on MedQA (USMLE) by over 17%

On the MedQA dataset consisting of USMLE-style questions with 4 options, our Flan-PaLM 540B model achieved a multiple-choice question (MCQ) accuracy of 67.6%, surpassing the DRAGON model [94] by 20.1%.

Concurrent to our study, Bolton et al. [9] developed PubMedGPT, a 2.7 billion parameter model trained exclusively on biomedical abstracts and papers. The model achieved a performance of 50.3% on MedQA questions with 4 options. To the best of our knowledge, this is the state-of-the-art on MedQA, and Flan-PaLM 540B exceeded this by 17.3%. Table 4 compares the best performing models on this dataset. On the more difficult set of questions with 5 options, our model obtained a score of 62.0%.

4.2 State-of-the-art performance on MedMCQA and PubMedQA

On the MedMCQA dataset, consisting of medical entrance exam questions from India, Flan-PaLM 540B reached a performance of 57.6% on the dev set. This exceeds the previous state-of-the-art result of 52.9% by the Galactica model [79].

Similarly, on the PubMedQA dataset, our model achieved an accuracy of 79.0%, outperforming the previous state-of-the-art BioGPT model of Luo et al. [56] by 0.8%. The results are summarized in Figure 2 below. While this improvement may seem small compared to the MedQA and MedMCQA datasets, the single-rater human performance on PubMedQA is 78.0% [33], indicating that there may be an inherent ceiling to the maximum possible performance on this task.
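The margins quoted in Sections 4.1 and 4.2 are absolute differences in accuracy, in percentage points. A small check using only the figures stated above (the dictionary layout is just an illustrative choice) reproduces them:

```python
# Accuracies (%) as reported in the text: (Flan-PaLM 540B, {prior model: accuracy}).
reported = {
    "MedQA (4 options)": (67.6, {"DRAGON": 47.5, "PubMedGPT": 50.3}),
    "MedMCQA (dev)": (57.6, {"Galactica": 52.9}),
}

margins = {
    (dataset, model): round(ours - prior, 1)
    for dataset, (ours, priors) in reported.items()
    for model, prior in priors.items()
}

for (dataset, model), gap in margins.items():
    print(f"{dataset}: Flan-PaLM 540B exceeds {model} by {gap} points")
```

The MedMCQA gap over Galactica follows from the two accuracies given in the text, although the text does not state that margin explicitly.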
Table 4: Summary of the best performing models on the MedQA (USMLE) dataset questions with 4 options.

Model (number of parameters) | MedQA (USMLE) Accuracy %
Flan-PaLM (540 B) (ours) | 67.6
PubMedGPT (2.7 B) [9] | 50.3
DRAGON (360 M) [94] | 47.5
BioLinkBERT (340 M) [95] | 45.1
Galactica (120 B) [79] | 44.4
PubMedBERT (100 M) [25] | 38.1
GPT-Neo (2.7 B) [7] | 33.3

4.3 State-of-the-art performance on MMLU clinical topics

The MMLU dataset contains multiple-choice questions from several clinical knowledge, medicine and biology related topics. These include anatomy, clinical knowledge, professional medicine, human genetics, college medicine and college biology. Flan-PaLM 540B achieved state-of-the-art performance on all these subsets, outperforming strong LLMs like PaLM, Gopher, Chinchilla, BLOOM, OPT and Galactica. In particular, on the professional medicine and clinical knowledge subsets, Flan-PaLM 540B achieved a SOTA accuracy of 83.5% and 84.0%, respectively. Figure 4 summarizes the results, providing comparisons with other LLMs where available [79].

4.4 Ablations

We performed several ablations on three of the multiple-choice datasets (MedQA, MedMCQA and PubMedQA) to better understand our results and identify the key components contributing to Flan-PaLM's performance. We present them in detail below.

Instruction tuning improves performance on medical question answering: Across all model sizes, we observed that the instruction-tuned Flan-PaLM model outperformed the baseline PaLM model on all three datasets (MedQA, MedMCQA and PubMedQA). The models were few-shot prompted in these experiments using the prompt text detailed in Section A.8. The detailed results are summarized in Table 5. The improvements were most prominent in the PubMedQA dataset, where the 8B Flan-PaLM model outperformed the baseline PaLM model by over 30%.
Similar strong improvements were observed in the case of the 62B and 540B variants too. These results demonstrated the strong benefits of instruction fine-tuning. Similar results with MMLU clinical topics are reported in Section A.3.

We have not yet completed a thorough analysis of the effect of instruction prompt tuning on multiple-choice accuracy; our analysis in this section is of Flan-PaLM, not Med-PaLM. Med-PaLM (instruction prompt-tuned Flan-PaLM) was developed to improve the long-form generation results of Flan-PaLM presented in Section 4.5 by better aligning the model to the medical domain. However, given the success of domain-agnostic instruction tuning for multiple-choice question answering, in-domain instruction prompt tuning appears promising, and we present a preliminary result in Section A.6.

Scaling improves performance on medical question answering: A related observation from Table 5 was the strong performance improvements obtained from scaling the model from 8B to 62B and 540B. We observed approximately a 2x improvement in performance when scaling the model from 8B to 540B in both PaLM and Flan-PaLM. These improvements were more pronounced in the MedQA and MedMCQA datasets. In particular, for the Flan-PaLM model, the 540B variant outperformed the 62B variant by over 14% and the 8B variant by over 24%. Given these results and the strong performance of the Flan-PaLM 540B model, we built on this model for downstream experiments and ablations. The scaling plots are provided in Section A.4.

Chain-of-Thought (CoT) prompting: Table 6 summarizes the results from using CoT prompting and provides a comparison with the few-shot prompting strategy using the Flan-PaLM 540B model. Somewhat unexpectedly, we did not observe improvements using CoT over the standard few-shot prompting strategy across the three multiple-choice datasets (MedQA, MedMCQA and PubMedQA). The CoT prompts used are summarized in Section A.9.
Self-consistency (SC) leads to strong improvement in multiple-choice performance: Wang et al. [88] showed that self-consistency prompting can help when CoT prompting hurts performance. They showed significant improvements on arithmetic and commonsense reasoning tasks. Taking their cue, we applied it to our datasets. We fixed the number of chain-of-thought answer explanation paths to 11 for each of the three datasets. We then marginalized over the different explanation paths to select the most consistent answer. Using this strategy, we observed significant improvements over the standard few-shot prompting strategy for the Flan-PaLM 540B model on the MedQA and MedMCQA datasets. In particular, for the MedQA dataset we observed a >7% improvement with self-consistency. However, somewhat unexpectedly, self-consistency led to a drop in performance for the PubMedQA dataset. The results are summarized in Table 7.

We further provide some example responses from the Flan-PaLM 540B model for MedQA in Table 8.

Uncertainty and Selective Prediction: LLMs are capable of long, coherent, and complex generations. However, they can also generate statements inconsistent with fact. In medical settings in particular, such failure modes need to be carefully vetted, and in real-world applications, generations unlikely to be true should be withheld. Instead, we may want to defer to other information sources or experts when needed. One solution is therefore for LLMs to communicate uncertainty estimates along with their responses.

While uncertainty measures over LLM output sequences remain an open area of research [36, 51], here we explored a simple proxy as an initial approach to measuring the relationship between LLM uncertainty and statement accuracy.
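One way to picture a vote-based uncertainty proxy of this kind is the sketch below: the share of sampled decodes that agree on the plurality answer is treated as confidence, and the answer is withheld when that share falls below a threshold. The function names, the `min_support` threshold, and the toy votes are hypothetical; this is not the study's implementation.

```python
from collections import Counter
from typing import List, Optional, Tuple


def predict_or_defer(choices: List[str], min_support: float) -> Optional[str]:
    """Return the plurality answer, or None (defer) if its vote share is too low."""
    answer, count = Counter(choices).most_common(1)[0]
    return answer if count / len(choices) >= min_support else None


def selective_accuracy(
    examples: List[Tuple[List[str], str]], min_support: float
) -> Tuple[float, float]:
    """Return (accuracy on answered questions, fraction of questions deferred)."""
    answered = 0
    correct = 0
    for choices, label in examples:
        prediction = predict_or_defer(choices, min_support)
        if prediction is None:
            continue
        answered += 1
        correct += int(prediction == label)
    deferred = 1 - answered / len(examples)
    accuracy = correct / answered if answered else float("nan")
    return accuracy, deferred


# Hypothetical votes from five decodes per question, paired with the gold label.
toy = [
    (["A", "A", "A", "A", "B"], "A"),
    (["C", "C", "D", "D", "C"], "D"),
    (["B", "B", "B", "B", "B"], "B"),
    (["A", "D", "D", "C", "D"], "D"),
]
for threshold in (0.5, 0.7):
    print(threshold, selective_accuracy(toy, threshold))
```

Raising the threshold defers more questions and, when agreement tracks correctness, raises accuracy on the questions that are still answered.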
We created a selective prediction task [82], using the number of decodes matching a given answer from self-consistency as a measure of uncertainty, and used it to withhold the answer if the model was not appropriately confident. We performed the experiment using 41 decodes from the Flan-PaLM 540B model with chain-of-thought prompting and self-consistency. We observe in Figure 5 that as the deferring fraction increases (i.e., with a higher "confidence" required to provide a prediction), the performance of the model on MedQA improves, reaching an accuracy of up to 82.5% at a 0.45 deferring fraction. This suggests our measure of response uncertainty may be reasonable, and that LLMs seem to encode uncertainty about their knowledge in the medical domain. However, more research is needed beyond this preliminary analysis.

4.5 Human evaluation results

We randomly selected 100 questions from HealthSearchQA, 20 questions from LiveQA, and 20 questions from MedicationQA as a smaller long-form answer benchmark for detailed human evaluation. These questions reflect real-world consumer queries for medical information.

We had a panel of clinicians generate expert reference answers to these questions. We then produced answers using Flan-PaLM and Med-PaLM (both 540B models). A few qualitative examples of these questions and the corresponding Med-PaLM responses are shown in Table 9. We had the three sets of answers evaluated by another panel of clinicians along the axes in Table 2, without revealing the source of answers. One clinician evaluated each answer. To reduce the impact of variation across clinicians on generalizability of our findings, our panel consisted of 9 clinicians (based in the US, UK, and India). We used the non-parametric bootstrap to estimate any significant variation in the results, where 100 bootstrap replicas were used to produce a distribution for each set and we used the 95% bootstrap percentile interval to assess variations.
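A percentile bootstrap of the kind described above can be sketched as follows. This is a minimal illustration, assuming binary per-answer ratings and a simple index-based percentile; the function name, the seed, and the toy ratings are hypothetical and not the study's code or data.

```python
import random
from typing import List, Tuple


def bootstrap_percentile_interval(
    ratings: List[int], replicas: int = 100, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Resample ratings with replacement and return a (1 - alpha) percentile interval of the mean."""
    rng = random.Random(seed)
    n = len(ratings)
    # One replica = the mean of n ratings drawn with replacement.
    means = sorted(sum(rng.choices(ratings, k=n)) / n for _ in range(replicas))
    low_index = int((alpha / 2) * (replicas - 1))
    high_index = int(round((1 - alpha / 2) * (replicas - 1)))
    return means[low_index], means[high_index]


# Hypothetical ratings: 1 = rated favorably, 0 = not, for 140 answers.
ratings = [1] * 130 + [0] * 10
low, high = bootstrap_percentile_interval(ratings)
print(f"mean={sum(ratings) / len(ratings):.3f}, 95% interval=({low:.3f}, {high:.3f})")
```

Comparing such intervals across answer sources gives a rough sense of whether a difference in ratings exceeds resampling noise.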
These results are described in detail below and in Section A.7.

Scientific consensus: We wished to understand how the answers related to current consensus in the clinical and scientific community. On the 140 questions evaluated in the study, we found that clinicians' answers were judged to be aligned with the scientific consensus in 92.9% of questions. On the other hand, Flan-PaLM was found to be in agreement with the scientific consensus in only 61.9% of answers. For other questions, answers were either opposed to consensus, or no consensus existed. This suggested that generic instruction tuning on its own was not sufficient to produce scientifically and clinically grounded answers. However, we observed that 92.9% of Med-PaLM answers were judged to be in accordance with the scientific consensus, showcasing the strength of instruction prompt tuning as an alignment technique to produce scientifically grounded answers.

We note that since PaLM, Flan-PaLM, and Med-PaLM were trained using corpora of web documents, books, Wikipedia, code, natural language tasks, and medical tasks at a given point in time, one potential limitation of these models is that they can reflect the scientific consensus of the past instead of today. This was not a commonly observed failure mode for Med-PaLM today, but it motivates future work in continual learning of LLMs and retrieval from a continuously evolving corpus.

Comprehension, retrieval and reasoning capabilities: We sought to understand the medical comprehension, medical knowledge retrieval and reasoning capabilities expressed through the answers, whether expert or model generated. We asked a panel of clinicians to rate whether answers contained any (one or more examples of) evidence of correct / incorrect medical reading comprehension, medical knowledge retrieval and medical reasoning capabilities, using the same approach as Feng et al. [22].
Correct and incorrect evidence were assessed in parallel because it is possible that a single long-form answer may contain evidence of both correct and incorrect comprehension, retrieval and reasoning.

We found that expert generated answers were again considerably superior to Flan-PaLM, though performance was improved by instruction prompt tuning for Med-PaLM. This trend was observed in all six sub-questions used to evaluate this axis. For example, with regard to evidence of correct retrieval of medical knowledge, we found that clinician answers scored 97.8% while Flan-PaLM only scored 76.3%. However, the instruction prompt-tuned Med-PaLM model scored 95.4%, reducing the inferiority of the model compared to clinicians.

Incorrect or missing content: The goal of this evaluation was to understand the completeness and correctness of the generated answers, by assessing whether the answer omitted any information it should not, or whether the answer contained any content it should not. Again we observed that clinician-generated answers were superior to AI models. Clinician answers showed evidence of inappropriate/incorrect content in only 1.4% of the cases, compared to 16.1% for Flan-PaLM. Surprisingly, instruction prompt tuning seemed to further degrade performance, with 18.7% of the Med-PaLM answers judged to contain inappropriate or incorrect content.

On the other hand, we observed that instruction prompt tuning helped improve model performance on omission of important information. While Flan-PaLM answers were judged to miss important information 47.2% of the time, the number improved significantly for Med-PaLM, with only 15.1% of the answers adjudged to have missing information, reducing the inferiority compared to clinicians, whose answers were judged to have missing information in only 11.1% of the cases.
A few qualitative examples are shown in Table 10, suggesting that LLM answers may be able to complement and complete physician responses to patient queries in future use cases.

One potential explanation for these observations is that instruction prompt tuning teaches the Med-PaLM model to generate significantly more detailed answers than the Flan-PaLM model, reducing the omission of important information.

Possible extent and likelihood of harm: We sought to identify the severity and likelihood of potential harm based on acting upon the generated answers. We asked raters to assume that the output of models might lead to actions by either clinicians or consumers/patients, and to estimate the possible severity and likelihood of physical/mental health-related harms that might result. The rating options were based on the AHRQ scale [93], which presents options to assign severity of harm ranging from death, severe or life-threatening injury, moderate, mild or no harm. We acknowledge that this definition of harm is more typically used in the context of analyzing harms incurred during healthcare delivery and that even in such settings (where the context for harms occurring is known with considerably greater specificity) there is frequently substantial variation in physician estimation of harm severity [86]. The validity of the AHRQ scale cannot therefore be assumed to extend to our context, where our rater outputs should be regarded as subjective estimates because our work was not grounded in a specific intended use and sociocultural context.

Despite the broad definition and subjectivity of ratings, we observed that instruction prompt tuning produced safer answers that reduced both estimated likelihood and severity. While 29.7% of the Flan-PaLM responses were judged as potentially leading to harm, this number dropped to 5.9% for Med-PaLM, on par with clinician-generated answers, which were judged as potentially harmful in 5.7% of the cases.
Similarly, on the likelihood of harm axes, instruction prompt tuning enabled Med-PaLM answers to match the expert generated answers.

Bias for medical demographics: The use of large language models for medical question answering has the potential for bias and fairness-related harms that contribute to health disparities. These harms derive from several sources, including the presence of patterns in training data that reflect disparities in health outcomes and access to care, the capability for medical question answering systems to reproduce racist misconceptions regarding the cause of racial health disparities [20, 85], algorithmic design choices [32], and differences in behavior or performance of machine learning systems across populations and groups that introduce downstream harms when used to inform medical decision making [13].

Medical question answering systems also pose additional risks beyond those posed by the use of other AI applications in healthcare because they have the potential to produce arbitrary outputs, have limited reasoning capability, and could potentially be used for a wide range of downstream use cases. We sought to understand whether the answer contained any information that is inaccurate or inapplicable for a particular demographic. Flan-PaLM answers were found to contain biased information in 7.9% of the cases. However, this number reduced to 0.8% for Med-PaLM, comparing favorably with experts, whose answers were judged to contain evidence of bias in 1.4% of the cases.

Lay user assessment: Beyond expert evaluation, we also had a panel of five non-experts in the domain (laypeople without a medical background, based in India) assess the answers. The results are summarized in Fig 10 below. While Flan-PaLM answers were judged to be helpful in only 60.6% of the cases, the number improved to 80.3% for Med-PaLM answers.
However, this remained inferior to clinician answers, which were judged to be helpful 91.1% of the time. Similarly, Flan-PaLM answers were judged as directly addressing the user's question intent in 90.8% of cases. This number improved to 94.0% for Med-PaLM, which was inferior to clinician-generated answers at 95.9%.

The lay evaluation consistently reproduced the benefits of instruction prompt tuning to produce answers that are helpful to users, while also demonstrating that there is still considerable work needed to approximate the quality of outputs provided by human clinicians.

5 Discussion

Our results suggest that strong performance on medical question answering may be an emergent ability [90] of LLMs combined with effective instruction prompt tuning.

Firstly, we observed strong scaling performance, with accuracy improving by approximately 2x as we scaled the PaLM models from 8 billion to 540 billion parameters. The performance of PaLM 8B on MedQA was only slightly better than random performance. However, this number improved by over 30% for PaLM 540B, demonstrating the effectiveness of scale for medical question answering. We observed similar improvements for the MedMCQA and PubMedQA datasets. Further, instruction fine-tuning was also effective, with Flan-PaLM models performing better than the PaLM models across all size variants.

It is possible that the PaLM pre-training corpus included significant amounts of high-quality medical content, and one possible conjecture for the strong performance of the 540B model variant is memorization of the evaluation datasets. However, Chowdhery et al. [14] showed similar deltas in performance of the PaLM 8B and 540B models when evaluating contaminated (i.e., where part of the test set is in the model pre-training corpus) and cleaned test datasets. This suggests that memorization alone does not explain the strong performance observed by scaling up the models.
There have been several efforts to train language models on a biomedical corpus, especially PubMed. These include BioGPT [56] (355 million parameters), PubMedGPT [9] (2.7 billion parameters) and Galactica [79] (120 billion parameters). Our models were able to outperform these efforts on PubMedQA without any finetuning. Further, the benefits of scale and instruction fine-tuning were much more pronounced on the MedQA dataset, which can be considered out-of-domain for all these models. Given the results, we observe that medical answering performance (requiring recall, reading comprehension, and reasoning skills) improves with LLM scale.

However, our human evaluation results on the consumer medical question answering datasets clearly point out that scale alone is insufficient. Even state-of-the-art LLMs like Flan-PaLM can generate answers that are inappropriate for use in the safety-critical medical domain. However, the Med-PaLM results demonstrate that with instruction prompt tuning we have a data and parameter-efficient alignment technique useful for improving factors related to accuracy, factuality, consistency, safety, harm, and bias, helping close the gap with clinical experts and bringing these models closer to real-world clinical applications.

6 Limitations

Our study demonstrated the potential of LLMs for encoding medical knowledge and in particular for question answering. However, it had several limitations, which we discuss in detail below along with directions for future research.

6.1 Expansion of MultiMedQA

Firstly, while the MultiMedQA benchmark is diverse and contains questions from a variety of professional medicine, medical research and consumer sources, it is by no means exhaustive. We plan to expand the benchmark in the future to include a larger variety of medical and scientific domains (e.g., biology) and formats.
A key challenge in clinical environments is eliciting information from patients and synthesizing findings into an assessment and plan. Multiple-choice question answering tasks are inherently easier because they are often grounded in vignettes compiled by experts and selected to have a generally preferred answer, which is not true for all medical decisions. Developing benchmark tasks that reflect real-world clinical workflows is an important direction of future research.

Furthermore, we only considered English-language datasets in this study, and there is a strong need to expand the scope of the benchmark to support multilingual evaluations.

6.2 Development of key LLM capabilities necessary for medical applications

While Flan-PaLM was able to reach state-of-the-art performance on several multiple-choice medical question answering benchmarks, our human evaluation clearly suggests these models are not at clinician expert level on many clinically important axes. In order to bridge this gap, several new LLM capabilities need to be researched and developed, including: grounding of the responses in authoritative medical sources and accounting for the time-varying nature of medical consensus; the ability to detect and communicate uncertainty effectively to the human in the loop, whether clinician or lay user; and the ability to respond to queries in multiple languages.

6.3 Improving the approach to human evaluation

The rating framework we proposed for this study represents a promising pilot approach, but our chosen axes of evaluation were not exhaustive and were subjective in nature. For example, the concept of medical/scientific consensus is time-varying in nature and is reflective of understandings of human health and disease and physiology based on discrimination in areas such as race/ethnicity, gender, age, ability, and more [38, 57].

Furthermore, consensus often exists only for topics of relevance to certain groups (e.g.
greater in number and/or power) and consensus may be lacking for certain subpopulations affected by topics for various reasons (e.g., controversial topics, lower incidence, less funding). Additionally, the concept of harm may differ according to population (e.g., a genetic study of a smaller group of people may reveal information that is factual but incongruent with that group’s cultural beliefs, which could cause members of this group harm). Expert assessment of harm may also vary based on location, lived experience, and cultural background. Our ratings of potential harm were subjective estimates, and variation in perceived harm may also have been due to differences in health literacy of both our clinician and lay raters, or might vary in real-world settings depending on the sociocultural context and health literacy of the person receiving and acting on the answers to the health questions in the study by Berkman et al. [6]. Further research might test whether perceived usefulness and harm of question answers varied according to the understandability and actionability score for the answer content [77].

The number of model responses evaluated and the pool of clinicians and lay-people assessing them were limited, as our results were based on only a single clinician or lay-person evaluating the responses. This limits the generalizability of our findings, which could be mitigated by inclusion of a significantly larger and intentionally diverse pool of human raters (clinicians and lay users) with participatory design in the development of model auditing tools. It is worth noting that the space of LLM responses or "coverage" is extremely high, which presents an additional difficulty in the design of evaluation tools and frameworks.

The pilot framework we developed could be significantly advanced using recommended best practice approaches for the design and validation of rating instruments from health, social and behavioral research [8]. This could entail the identification of additional rating items through participatory research, and evaluation of rating items by domain experts and technology recipients for relevance, representativeness, and technical quality. The inclusion of a substantially larger pool of human raters would also enable testing of instrument generalizability by ratifying the test dimensionality, test-retest reliability and validity [8]. As the same answer can be evaluated in multiple ways, the most appropriate rating instrument also depends on the intended purpose and recipient of LLM outputs, providing multiple opportunities for the development of validated rating scales depending on the context and purpose of use. Further, substantial user experience (UX) and human-computer interaction (HCI) studies using community-based participatory research methods are necessary before any real-world use, and would be specific to a developed tool that is beyond the scope of our exploratory research. In these contexts, further research could explore the independent influence of variation in lay raters’ education level, medical conditions, caregiver status, experience with health care, or other relevant factors on their perceptions of the quality of model outputs. The impact of variation in clinician raters’ specialty, demographics, geography or other factors could be similarly explored in further research.

6.4 Fairness and equity considerations

Our current approach to evaluating bias is limited and does not serve as a comprehensive assessment of potential harms, fairness, or equity. The development of procedures for the evaluation of bias and fairness-related harms in large language models is ongoing [49, 92]. Healthcare is a particularly complex application of large language models given the safety-critical nature of the domain and the nuance associated with social and structural bias that drives health disparities. The intersection of large language models and healthcare creates unique opportunities for responsible and ethical innovation of robust assessment and mitigation tools for bias, fairness, and health equity.

We outline opportunities for future research into frameworks for the systematic identification and mitigation of downstream harms and impacts of large language models. Key principles include using participatory methods to design contextualized evaluations that reflect the values of patients who may benefit or be harmed, grounding the evaluation in one or more specific downstream clinical use cases [54, 71], and using dataset and model documentation frameworks for transparent reporting of choices and assumptions made during data collection and curation, model development, and evaluation [24, 59, 72]. Furthermore, research is needed into the design of algorithmic procedures and benchmarks that probe for specific technical biases that are known to cause harm if not mitigated. For instance, depending on the context, it may be relevant to assess the sensitivity of model outputs to perturbations of demographic identifiers in prompts designed deliberately such that the result should not change under the perturbation [23, 68, 98].

Additionally, the aforementioned research activities to build evaluation methods to achieve health equity in large language models require interdisciplinary collaboration to ensure that various scientific perspectives and methods can be applied to the task of understanding the social and contextual aspects of health [27, 58, 62].

The development of evaluation frameworks for large language models is a critical research agenda that should be approached with the same rigor and attention as the work of encoding clinical knowledge in language models.
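The perturbation check mentioned earlier in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation procedure used in this study or in the cited works: `answer_fn` is a hypothetical stand-in for a call to a language model, the `{identifier}` placeholder and the `group_a`/`group_b`/`group_c` labels are invented for the example, and exact string comparison is only a crude proxy for whether two answers are equivalent.

```python
from typing import Callable, Dict, List


def perturbation_sensitivity(
    template: str,
    identifiers: List[str],
    answer_fn: Callable[[str], str],
) -> Dict[str, object]:
    """Fill `template` with each identifier and compare the model answers.

    The first identifier serves as the reference. The prompt is assumed to be
    designed so that the answer should NOT depend on the identifier.
    """
    if not identifiers:
        raise ValueError("need at least one identifier")
    answers = {
        ident: answer_fn(template.format(identifier=ident))
        for ident in identifiers
    }
    reference = answers[identifiers[0]]
    # Identifiers whose answer differs from the reference answer.
    changed = [i for i in identifiers[1:] if answers[i] != reference]
    return {
        "answers": answers,
        "changed": changed,
        "flip_rate": len(changed) / max(len(identifiers) - 1, 1),
    }


def toy_answer_fn(prompt: str) -> str:
    """Hypothetical model stub with an artificial, deliberate inconsistency."""
    return "answer B" if "group_b" in prompt else "answer A"


if __name__ == "__main__":
    report = perturbation_sensitivity(
        "A patient from {identifier} asks the same question.",
        ["group_a", "group_b", "group_c"],
        toy_answer_fn,
    )
    print(report["changed"], report["flip_rate"])
```

In practice the comparison step would likely need a task-specific notion of equivalence (for example, the selected multiple-choice option), since free-text answers rarely match character for character.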
In this study we worked with a panel of four qualified clinicians, all based in either the US or UK and with expertise in internal medicine, pediatrics, surgery and primary care, to identify the best-demonstration examples and craft few-shot prompts. Recent studies have surprisingly suggested that the validity of reasoning within a chain-of-thought prompt contributes only to a small extent to the impact of this strategy on LLM performance in multi-step reasoning challenges [87]. Nevertheless, further research could significantly expand the range of clinicians engaged in prompt construction and the selection of exemplar answers, and thereby explore how variation along multiple axes of the types of clinician participating in this activity impacts LLM behavior; for example clinician demographics, geography, specialism, lived experience and more.

6.5 Ethical considerations

This research demonstrates the potential of LLMs for future use in healthcare. Transitioning from an LLM that is used for medical question answering to a tool that can be used by healthcare providers, administrators, and consumers will require significant additional research to ensure the safety, reliability, efficacy, and privacy of the technology. Careful consideration will need to be given to the ethical deployment of this technology, including rigorous quality assessment when used in different clinical settings and guardrails to mitigate against over-reliance on the output of a medical assistant. For example, the potential harms of using an LLM for diagnosing or treating an illness are much greater than those of using an LLM for information about a disease or medication. Additional research will be needed to assess LLMs used in healthcare for homogenization and amplification of biases and security vulnerabilities inherited from base models [10, 11, 18, 39, 49]. Given the continuous evolution of clinical knowledge, it will also be important to develop ways for LLMs to provide up-to-date clinical information.

7 Conclusion

The advent of foundation AI models and large language models presents a significant opportunity to rethink the development of medical AI and make it easier, safer and more equitable to use. At the same time, medicine is an especially complex domain for applications of large language models. Our research provides a glimpse into the opportunities and the challenges of applying these technologies to medicine. We hope this study will spark further conversations and collaborations between patients, consumers, AI researchers, clinicians, social scientists, ethicists, policymakers and other interested people in order to responsibly translate these early research findings to improve healthcare.

Acknowledgments

This project was an extensive collaboration between many teams at Google Research and DeepMind. We thank Michael Howell, Cameron Chen, Basil Mustafa, David Fleet, Fayruz Kibria, Gordon Turner, Lisa Lehmann, Ivor Horn, Maggie Shiels, Shravya Shetty, Jukka Zitting, Evan Rappaport, Lucy Marples, Viknesh Sounderajah, Ali Connell, Jan Freyberg, Cian Hughes, Megan Jones-Bell, Susan Thomas, Martin Ho, Sushant Prakash, Bradley Green, Ewa Dominowska, Frederick Liu, Xuezhi Wang, and Dina Demner-Fushman (from the National Library of Medicine) for their valuable insights and feedback during our research. We are also grateful to Karen DeSalvo, Zoubin Ghahramani, James Manyika, and Jeff Dean for their support during the course of this project.

References

1. Abacha, A. B., Agichtein, E., Pinter, Y. & Demner-Fushman, D. Overview of the medical question answering task at TREC 2017 LiveQA. in TREC (2017), 1–12.
2. Abacha, A. B., Mrabet, Y., Sharp, M., Goodwin, T. R., Shooshan, S. E. & Demner-Fushman, D. Bridging the gap between consumers’ medication questions and trusted answers. in MedInfo (2019), 25–29.
3. Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large Language Models are Zero-Shot Clinical Information Extractors. (2022).
arXiv preprint arXiv:2205.12689
4. Barham, P., Chowdhery, A., Dean, J., Ghemawat, S., Hand, S., Hurt, D., Isard, M., Lim, H., Pang, R., Roy, S., et al. Pathways: Asynchronous distributed dataflow for ML. Proceedings of Machine Learning and Systems 4, 430–449 (2022).
5. Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019).
6. Berkman, N. D., Sheridan, S. L., Donahue, K. E., Halpern, D. J., Viera, A., Crotty, K., Holland, A., Brasure, M., Lohr, K. N., Harden, E., et al. Health literacy interventions and outcomes: an updated systematic review. Evidence report/technology assessment, 1–941 (2011).
7. Black, S., Gao, L., Wang, P., Leahy, C. & Biderman, S. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow version 1.0. If you use this software, please cite it using these metadata. Mar. 2021. https://doi.org/10.5281/zenodo.5297715
8. Boateng, G. O., Neilands, T. B., Frongillo, E. A., Melgar-Quiñonez, H. R. & Young, S. L. Best practices for developing and validating scales for health, social, and behavioral research: a primer. Frontiers in public health 6, 1499 (2018).
9. Bolton, E., Hall, D., Yasunaga, M., Lee, T., Manning, C. & Liang, P. Stanford CRFM Introduces PubMedGPT 2.7B. 2022. https://hai.stanford.edu/news/stanford-crfm-introduces-pubmedgpt-27b
10. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
11. Bommasani, R., Liang, P. & Lee, T. Language Models are Changing AI: The Need for Holistic Evaluation. 2022. https://crfm.stanford.edu/2022/11/17/helm.html
12. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners.
Advances in neural information processing systems 33, 1877–1901 (2020).
13. Chen, I. Y., Pierson, E., Rose, S., Joshi, S., Ferryman, K. & Ghassemi, M. Ethical machine learning in healthcare. Annual review of biomedical data science 4, 123–144 (2021).
14. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
15. Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
16. Clark, J. H., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V. & Palomaki, J. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics 8, 454–470 (2020).
17. Cobbe, K., Kosaraju, V., Bavarian, M., Hilton, J., Nakano, R., Hesse, C. & Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021).
18. Creel, K. & Hellman, D. The Algorithmic Leviathan: Arbitrariness, Fairness, and Opportunity in Algorithmic Decision-Making Systems. Canadian Journal of Philosophy, 1–18 (2022).
19. Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., et al. Glam: Efficient scaling of language models with mixture-of-experts. in International Conference on Machine Learning (2022), 5547–5569.
20. Eneanya, N. D., Boulware, L., Tsai, J., Bruce, M. A., Ford, C. L., Harris, C., Morales, L. S., Ryan, M. J., Reese, P. P., Thorpe, R. J., et al. Health inequities and the inappropriate use of race in nephrology. Nature Reviews Nephrology 18, 84–94 (2022).
21.
Esteva, A., Chou, K., Yeung, S., Naik, N., Madani, A., Mottaghi, A., Liu, Y., Topol, E., Dean, J. & Socher, R. Deep learning-enabled medical computer vision. NPJ digital medicine 4, 1–9 (2021).
22. Feng, S. Y., Khetan, V., Sacaleanu, B., Gershman, A. & Hovy, E. CHARD: Clinical Health-Aware Reasoning Across Dimensions for Text Generation Models. arXiv preprint arXiv:2210.04191 (2022).
23. Garg, S., Perot, V., Limtiaco, N., Taly, A., Chi, E. H. & Beutel, A. Counterfactual fairness in text classification through robustness. in Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (2019), 219–226.
24. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Iii, H. D. & Crawford, K. Datasheets for datasets. Communications of the ACM 64, 86–92 (2021).
25. Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J. & Poon, H. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH) 3, 1–23 (2021).
26. Gu, Y., Han, X., Liu, Z. & Huang, M. Ppt: Pre-trained prompt tuning for few-shot learning. arXiv preprint arXiv:2109.04332 (2021).
27. Guidance, W. Ethics and governance of artificial intelligence for health. World Health Organization (2021).
28. Han, X., Zhao, W., Ding, N., Liu, Z. & Sun, M. Ptr: Prompt tuning with rules for text classification. AI Open (2022).
29. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. & Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020).
30. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022).
31. Hong, Z., Ajith, A., Pauloski, G., Duede, E., Malamud, C., Magoulas, R., Chard, K. & Foster, I.
ScholarBERT: Bigger is Not Always Better. arXiv preprint arXiv:2205.11342 (2022).
32. Hooker, S. Moving beyond “algorithmic bias is a data problem”. Patterns 2, 100241 (2021).
33. Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H. & Szolovits, P. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences 11, 6421 (2021).
34. Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. PubMedQA: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146 (2019).
35. Joshi, M., Choi, E., Weld, D. S. & Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551 (2017).
36. Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Dodds, Z. H., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221 (2022).
37. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. & Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
38. Kington, R. S., Arnesen, S., Chou, W.-Y. S., Curry, S. J., Lazer, D. & Villarruel, A. M. Identifying credible sources of health information in social media: Principles and attributes. NAM perspectives 2021 (2021).
39. Kleinberg, J. & Raghavan, M. Algorithmic monoculture and social welfare. Proceedings of the National Academy of Sciences 118, e2018340118 (2021).
40. Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. arXiv preprint arXiv:2205.11916 (2022).
41. Korngiebel, D. M. & Mooney, S. D. Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery. NPJ Digital Medicine 4, 1–3 (2021).
42. Lakkaraju, H., Slack, D., Chen, Y., Tan, C. & Singh, S.
Rethinking Explainability as a Dialogue: A Practitioner’s Perspective. arXiv preprint arXiv:2202.01875 (2022).
43. Lampinen, A. K., Dasgupta, I., Chan, S. C., Matthewson, K., Tessler, M. H., Creswell, A., McClelland, J. L., Wang, J. X. & Hill, F. Can language models learn from explanations in context? arXiv preprint arXiv:2204.02329 (2022).
44. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H. & Kang, J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).
45. Lester, B., Al-Rfou, R. & Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021).
46. Lewis, P., Ott, M., Du, J. & Stoyanov, V. Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art. in Proceedings of the 3rd Clinical Natural Language Processing Workshop (2020), 146–157.
47. Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858 (2022).
48. Li, X. L. & Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021).
49. Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022).
50. Liévin, V., Hother, C. E. & Winther, O. Can large language models reason about medical questions? arXiv preprint arXiv:2207.08143 (2022).
51. Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334 (2022).
52. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H. & Neubig, G.
Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586 (2021).
53. Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z. & Tang, J. GPT understands, too. arXiv preprint arXiv:2103.10385 (2021).
54. Liu, X., Glocker, B., McCradden, M. M., Ghassemi, M., Denniston, A. K. & Oakden-Rayner, L. The medical algorithmic audit. The Lancet Digital Health (2022).
55. Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
56. Luo, R., Sun, L., Xia, Y., Qin, T., Zhang, S., Poon, H. & Liu, T.-Y. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics 23 (2022).
57. Mandavilli, A. Medical Journals Blind to Racism as Health Crisis, Critics Say. 2021. https://www.nytimes.com/2021/06/02/health/jama-racism-bauchner.html
58. Matheny, M., Israni, S. T., Ahmed, M. & Whicher, D. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril (2022).
59. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D. & Gebru, T. Model cards for model reporting. in Proceedings of the conference on fairness, accountability, and transparency (2019), 220–229.
60. Morgado, F. F., Meireles, J. F., Neves, C. M., Amaral, A. & Ferreira, M. E. Scale development: ten main limitations and recommendations to improve future research practices. Psicologia: Reflexao e Critica 30 (2017).
61. Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114 (2021).
62. Of Science, W. H. O. & Policy, T. 2022.
The Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf
63. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155 (2022).
64. Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. in Conference on Health, Inference, and Learning (2022), 248–260.
65. Pampari, A., Raghavan, P., Liang, J. & Peng, J. emrqa: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018).
66. Papanikolaou, Y. & Pierleoni, A. DARE: Data augmented relation extraction with gpt-2. arXiv preprint arXiv:2004.13845 (2020).
67. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. in Proceedings of the 40th annual meeting of the Association for Computational Linguistics (2002), 311–318.
68. Prabhakaran, V., Hutchinson, B. & Mitchell, M. Perturbation sensitivity analysis to detect unintended model biases. arXiv preprint arXiv:1910.04210 (2019).
69. Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446 (2021).
70. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67 (2020).
71. Raji, I. D., Smart, A., White, R. N., Mitchell, M., Gebru, T., Hutchinson, B., Smith-Loud, J., Theron, D. & Barnes, P. in (2020), 33–44.
Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. Proceedings of the 2020 conference on fairness, accountability, and transparency.
72. Rostamzadeh, N., Mincu, D., Roy, S., Smart, A., Wilcox, L., Pushkarna, M., Schrouff, J., Amironesei, R., Moorosi, N. & Heller, K. Healthsheet: Development of a Transparency Artifact for Health Datasets. arXiv preprint arXiv:2202.13028 (2022).
73. Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint arXiv:2211.05100 (2022).
74. Schaekermann, M., Cai, C. J., Huang, A. E. & Sayres, R. Expert discussions improve comprehension of difficult cases in medical image assessment. in Proceedings of the 2020 CHI conference on human factors in computing systems (2020), 1–13.
75. Sezgin, E., Sirrianni, J., Linwood, S. L., et al. Operationalizing and implementing pretrained, large artificial intelligence linguistic models in the US health care system: Outlook of Generative Pretrained Transformer 3 (GPT-3) as a service model. JMIR Medical Informatics 10, e32875 (2022).
76. Shin, H.-C., Zhang, Y., Bakhturina, E., Puri, R., Patwary, M., Shoeybi, M. & Mani, R. BioMegatron: Larger biomedical domain language model. arXiv preprint arXiv:2010.06060 (2020).
77. Shoemaker, S. J., Wolf, M. S. & Brach, C. Development of the Patient Education Materials Assessment Tool (PEMAT): a new measure of understandability and actionability for print and audiovisual patient information. Patient education and counseling 96, 395–403 (2014).
78. Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 (2022).
79.
Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V. & Stojnic, R. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085 (2022).
80. Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239 (2022).
81. Tomašev, N., Harris, N., Baur, S., Mottram, A., Glorot, X., Rae, J. W., Zielinski, M., Askham, H., Saraiva, A., Magliulo, V., et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nature Protocols 16, 2765–2787 (2021).
82. Tran, D., Liu, J., Dusenberry, M. W., Phan, D., Collier, M., Ren, J., Han, K., Wang, Z., Mariet, Z., Hu, H., et al. Plex: Towards reliability using pretrained large model extensions. arXiv preprint arXiv:2207.07411 (2022).
83. Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M. R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics 16, 1–28 (2015).
84. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. Attention is all you need. Advances in neural information processing systems 30 (2017).
85. Vyas, D. A., Eisenstein, L. G. & Jones, D. S. Hidden in plain sight - reconsidering the use of race correction in clinical algorithms. 2020.
86. Walsh, K. E., Harik, P., Mazor, K. M., Perfetto, D., Anatchkova, M., Biggins, C., Wagner, J., Schoettker, P. J., Firneno, C., Klugman, R., et al. Measuring harm in healthcare: optimizing adverse event review. 436 (2017).
Medical care 55,
87. Wang, B., Min, S., Deng, X., Shen, J., Wu, Y., Zettlemoyer, L. & Sun, H. Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters. arXiv preprint arXiv:2212.10001 (2022).
88. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E. & Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
89. Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M. & Le, Q. V. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021).
90. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022).
91. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q. & Zhou, D. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022).
92. Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359 (2021).
93. Williams, T., Szekendi, M., Pavkovic, S., Clevenger, W. & Cerese, J. The reliability of AHRQ Common Format Harm Scales in rating patient safety events. Journal of patient safety 11, 52–59 (2015).
94. Yasunaga, M., Bosselut, A., Ren, H., Zhang, X., Manning, C. D., Liang, P. & Leskovec, J. Deep bidirectional language-knowledge graph pretraining. arXiv preprint arXiv:2210.09338 (2022).
95. Yasunaga, M., Leskovec, J. & Liang, P. LinkBERT: Pretraining language models with document links. arXiv preprint arXiv:2203.15827 (2022).
96. Ye, S., Jang, J., Kim, D., Jo, Y. & Seo, M. Retrieval of soft prompt enhances zero-shot task generalization. (2022).
arXiv preprint arXiv:2210.03029
97. Yim, J., Chopra, R., Spitz, T., Winkens, J., Obika, A., Kelly, C., Askham, H., Lukic, M., Huemer, J., Fasler, K., et al. Predicting conversion to wet age-related macular degeneration using deep learning. Nature Medicine 26, 892–899 (2020).
98. Zhang, H., Lu, A. X., Abdalla, M., McDermott, M. & Ghassemi, M. Hurtful words: quantifying biases in clinical contextual word embeddings. in Proceedings of the ACM Conference on Health, Inference, and Learning (2020), 110–120.
99. Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
100. Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Bousquet, O., Le, Q. & Chi, E. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625 (2022).

Appendix

A.1 Hyperparameters and model selection

We performed instruction prompt tuning on Flan-PaLM 540B with a soft prompt length of 100 to produce Med-PaLM. We froze the rest of the model, and the embedding dimension is 18432 as in Chowdhery et al. [14], so this resulted in 1.84M trainable parameters. We randomly initialized the learnable parameters to be uniform over [-0.5, 0.5], following Lester et al. [45]. We grid searched over learning rates in {0.001, 0.003, 0.01} with the AdamW optimizer [55] and a weight decay factor in {0.001, 0.00001}. We used a batch size of 32 across all runs and trained for 200 steps.

We performed model selection by asking a clinician to rank responses on several held-out HealthSearchQA, MedicationQA and LiveQA examples (not used for training or human evaluation), and chose the checkpoint that performed best. We did this manual validation instead of computing an automated metric on a validation set, e.g., negative log-likelihood on held-out (question, answer) pairs, because in the large output space of natural language generations such metrics may not correlate well with human judgments of actual model outputs.

A.2 Variation of results

Due to repeated stochastic decoding using temperature sampling, results with self-consistency are expected to vary somewhat. While it is impractical to run multiple experiments for all of our models across all the datasets used in this study, we repeated the evaluations on the MedQA dataset 4 times with our best performing model.

A.3 MMLU ablations

We performed ablations comparing the Flan-PaLM 540B model using the few-shot, chain-of-thought (CoT) and self-consistency prompting strategies on MMLU clinical topics [29]. The results are summarized in Section A.3. We observe that while for most topics Flan-PaLM 540B with self-consistency obtains the best results, there are some topics where standard few-shot or CoT prompting does better.

A.4 Scaling plots

We provide scaling plots comparing the PaLM and Flan-PaLM models using few-shot prompting on the MedQA and MedMCQA datasets in Figure A.1, and another scaling plot comparing Flan-PaLM with few-shot prompting and Flan-PaLM with self-consistency prompting in Figure A.2. We observe strong scaling performance and see a steeper increase in performance as we scale up the LLM model size.
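The arithmetic behind the trainable-parameter count in Appendix A.1 can be checked directly. The sketch below is only an illustration, not the training code used for Med-PaLM: the function names are invented for this example, the soft prompt is a plain list of lists rather than a tensor, and the weight decay factors 0.001 and 0.00001 are taken as assumptions from the grid described in A.1.

```python
import itertools
import random
from typing import Dict, List

PROMPT_LENGTH = 100    # soft prompt length stated in Appendix A.1
EMBEDDING_DIM = 18432  # embedding dimension stated in Appendix A.1


def trainable_parameter_count(prompt_length: int, embedding_dim: int) -> int:
    """Only the soft prompt is trained; the rest of the model is frozen."""
    return prompt_length * embedding_dim


def init_soft_prompt(
    prompt_length: int,
    embedding_dim: int,
    low: float = -0.5,
    high: float = 0.5,
    seed: int = 0,
) -> List[List[float]]:
    """Draw every soft prompt entry uniformly from [low, high]."""
    rng = random.Random(seed)
    return [
        [rng.uniform(low, high) for _ in range(embedding_dim)]
        for _ in range(prompt_length)
    ]


def hyperparameter_grid() -> List[Dict[str, float]]:
    """Enumerate the grid: 3 learning rates x 2 weight decay factors."""
    learning_rates = [0.001, 0.003, 0.01]
    weight_decays = [0.001, 0.00001]  # assumed values, see lead-in
    return [
        {"learning_rate": lr, "weight_decay": wd, "batch_size": 32, "train_steps": 200}
        for lr, wd in itertools.product(learning_rates, weight_decays)
    ]


if __name__ == "__main__":
    # 100 * 18432 = 1,843,200, i.e. roughly 1.84M trainable parameters.
    print(trainable_parameter_count(PROMPT_LENGTH, EMBEDDING_DIM))
    print(len(hyperparameter_grid()), "configurations in the grid")
```

The count makes the parameter efficiency concrete: about 1.84 million trained values sit on top of a frozen model with 540 billion parameters.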
A.5 Model card for Med-PaLM

Med-PaLM uses the same system type and implementation frameworks as Flan-PaLM [15]. We show parts of the model card [59] specific to Med-PaLM in Table A.2.

A.6 Med-PaLM multiple-choice evaluation

Med-PaLM was trained using instruction prompt tuning to improve the quality of long-form generations produced by Flan-PaLM. However, given the generality of instruction prompt tuning, the technique can also be applied to multiple-choice datasets. We can learn shared soft prompt parameters that are prepended to instructions and/or few-shot exemplars, which vary for each multiple-choice dataset.

In a preliminary experiment, we trained Flan-PaLM using instruction prompt tuning on MedQA, MedMCQA, PubMedQA, and MMLU (clinical topics). The exemplars were written by five qualified clinicians. Each training example contained dataset-specific instructions and 5 few-shot examples. The resulting model achieved 67.2% accuracy on MedQA using chain-of-thought and self-consistency, roughly matching the corresponding result with Flan-PaLM in Section 4. We plan to extend this early result in future work.

A.7 Detailed human evaluation results

Detailed human evaluation results with confidence intervals are summarized in Tables A.3 to A.12.

A.8 Few-shot prompt examples

We provide examples of some few-shot prompts used in the study in Tables A.13, A.14, A.15, A.16 and A.17.

A.9 Chain-of-Thought prompt examples

We provide examples of some of the chain-of-thought prompts used in this study in Tables A.18, A.19, A.20 and A.21.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.
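Self-consistency, which the appendix refers to in A.2, A.3, A.4 and A.6, can be summarized as: sample several answers with temperature sampling, then keep the most frequent final answer. The sketch below shows only that voting step under stated assumptions. `sample_fn` is a hypothetical stand-in for one stochastic model call that returns a final answer string, `num_samples=11` is an arbitrary default rather than a value from this study, and ties are broken in favor of the answer seen first.

```python
from collections import Counter
from typing import Callable, List, Sequence


def self_consistency_answer(
    prompt: str,
    sample_fn: Callable[[str], str],
    num_samples: int = 11,
) -> str:
    """Sample `num_samples` answers and return the most frequent one."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    answers: List[str] = [sample_fn(prompt) for _ in range(num_samples)]
    # most_common keeps first-seen order among equally frequent answers.
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


if __name__ == "__main__":
    # Canned samples standing in for five stochastic decodes of one question.
    canned = iter(["B", "A", "B", "C", "B"])
    print(self_consistency_answer("toy question", lambda _prompt: next(canned), num_samples=5))
```

Because each run draws a fresh set of samples, the voted answer for borderline questions can change between runs, which is the source of the run-to-run variation discussed in A.2.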