Authors:
(1) Jinge Wang, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA;
(2) Zien Cheng, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA;
(3) Qiuming Yao, School of Computing, University of Nebraska-Lincoln, Lincoln, NE 68588, USA;
(4) Li Liu, College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA and Biodesign Institute, Arizona State University, Tempe, AZ 85281, USA;
(5) Dong Xu, Department of Electrical Engineer and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA;
(6) Gangqing Hu, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA ([email protected]).
Table of Links
4. Biomedical Text Mining and 4.1. Performance Assessments across typical tasks
4.2. Biological pathway mining
5.1. Human-in-the-Loop and 5.2. In-context Learning
6. Biomedical Image Understanding
7.1. Applications in Applied Bioinformatics
7.2. Biomedical Database Access
7.3. Online Tools for Coding with ChatGPT
7.4. Benchmarks for Bioinformatics Coding
8. Chatbots in Bioinformatics Education
9. Discussion and Future Perspectives
9. DISCUSSION AND FUTURE PERSPECTIVES
The year 2023 marked significant progress in leveraging ChatGPT for bioinformatics and biomedical informatics. Early studies affirmed its capability to draft workable code for basic bioinformatics data analyses[10, 12]. The chatbot has also proven competitive with state-of-the-art (SOTA) models in other bioinformatics areas, including identifying cell types from single-cell RNA-Seq data[16], performing question-answering tasks in biomedical text mining[26], and generating molecular captions in drug discovery[52]. These achievements underscore ChatGPT’s proficiency in text-generative tasks. Meanwhile, other LLMs are catching up. For example, Google developed Gemini and the open-source LLM Gemma, both of which delivered impressive performance across various tasks. Although their applications in bioinformatics and medical informatics have not yet been reported, their potential makes them viable alternatives to ChatGPT.
Current chatbots exhibit limitations in biomedical tasks that require reasoning and quantitative analysis, such as regression and classification, as evidenced by references[30, 32, 67, 68, 104]. Though not yet widely adopted in bioinformatics[18], OpenAI’s fine-tuning APIs for models such as GPT-3.5 and GPT-4 hold great potential for performance improvements when the training dataset is large. Nevertheless, the accuracy of ChatGPT's responses can be significantly improved through strategic design of its input instructions with prompt engineering. Incorporating examples into prompts and employing chain-of-thought (CoT) reasoning have proven effective, as evidenced in various bioinformatics applications[30, 35, 62, 67, 68, 101]. While examples in prompts are sometimes hardcoded, they can also be dynamically and strategically sourced from external knowledge bases or knowledge graphs[62, 63, 65, 109]. This approach, known as retrieval-augmented generation (RAG), improves ChatGPT's reliability by grounding its responses in domain-specific knowledge and represents a promising avenue for future chatbot applications in bioinformatics.
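To make the idea concrete, the sketch below assembles a few-shot CoT prompt whose examples are retrieved from a small in-memory knowledge base. It is a minimal illustration, not a production RAG pipeline: the keyword-overlap retriever stands in for the embedding-based search used in practice, and all function names and knowledge-base entries are hypothetical.

```python
def retrieve_examples(query, knowledge_base, k=2):
    """Rank knowledge-base entries by word overlap with the query
    (a simplified stand-in for embedding-based retrieval)."""
    q_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda ex: len(q_words & set(ex["question"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, knowledge_base):
    """Assemble a few-shot chain-of-thought prompt from retrieved examples."""
    parts = ["Answer the question step by step.\n"]
    for ex in retrieve_examples(query, knowledge_base):
        parts.append(
            f"Q: {ex['question']}\nReasoning: {ex['reasoning']}\nA: {ex['answer']}\n"
        )
    parts.append(f"Q: {query}\nReasoning:")
    return "\n".join(parts)

# Toy knowledge base; real systems would query a curated database or graph.
kb = [
    {"question": "Which gene encodes the p53 protein?",
     "reasoning": "p53 is the protein product of the TP53 tumor-suppressor gene.",
     "answer": "TP53"},
    {"question": "Which pathway does BRAF act in?",
     "reasoning": "BRAF is a kinase in the MAPK/ERK signaling cascade.",
     "answer": "MAPK/ERK"},
]
prompt = build_prompt("Which gene encodes the BRCA1 protein?", kb)
```

The resulting string would then be sent to the chatbot; swapping the retriever for a vector search over a domain knowledge base yields the RAG setup described above.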
In this rapidly evolving domain, ChatGPT has experienced several significant upgrades within its first year alone. We acknowledge that not every upgrade enhances performance across the board[110]. Consequently, prompts that are highly effective with the current version for specific tasks may not maintain the same level of efficacy following future updates. The technique of prompt engineering, which includes strategies like role prompting and in-context learning, offers a way to partially counteract this variability[45]. An innovative approach, rather than manually adjusting the prompts, involves instructing ChatGPT to autonomously optimize prompts to align with its latest model iteration. This strategy has shown promise in tasks such as mining gene relationships[45] but remains largely unexplored in other bioinformatics topics and therefore warrants further exploration to fully leverage ChatGPT's capabilities in the field.
Numerous studies show that pairing ChatGPT with human augmentation significantly improves performance. Iterative human-AI communication plays a pivotal role in this process, where feedback from the human operator grounds the chatbot's responses for improved accuracy. This human-in-the-loop methodology is particularly evident in prompt optimization[10] and molecular optimization[60, 63]. For code-generation tasks, runtime error messages represent commonly used feedback that has been automated into several GPT-based tools[95, 96, 102]. Conversely, the chatbot can also be instructed to provide feedback to human operators. As demonstrated by Chen and Stadler[101], ChatGPT can produce textual descriptions of its generated code through an inverse generation process. Comparing these descriptions with the original instructions from the human operator ensures that the chatbot's output aligns closely with the intended task requirements. This iterative exchange of feedback between AI and human operators enhances the overall quality of the bioinformatics tasks being addressed.
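The inverse-generation comparison can be sketched as a simple consistency check: score the similarity between the operator's instruction and the chatbot's back-translated description of the code, and flag low-scoring pairs for another feedback round. The snippet below uses Python's standard-library difflib as a crude stand-in for the semantic comparison a human (or a stronger model) would perform; the threshold value and function name are illustrative assumptions, not part of the cited method.

```python
from difflib import SequenceMatcher

def description_matches(instruction, description, threshold=0.5):
    """Compare the operator's original instruction with the chatbot's
    description of its own generated code; a low similarity ratio
    suggests the code may have drifted from the intended task."""
    ratio = SequenceMatcher(
        None, instruction.lower(), description.lower()
    ).ratio()
    return ratio >= threshold, round(ratio, 3)

# A near-verbatim restatement of the instruction passes the check.
ok, score = description_matches(
    "Filter reads with mapping quality above 30",
    "Filters reads whose mapping quality is above 30",
)  # ok is True
```

In practice the pass/fail signal would prompt the operator to rephrase the instruction or ask the chatbot to regenerate the code, closing the feedback loop described above.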
The assessment of ChatGPT's capabilities across various bioinformatics tasks has illuminated both its strengths and weaknesses. However, the reliability of these evaluations largely hinges on the quality of the benchmarks used and the methodologies applied in the assessments. Currently, many benchmarks are available for biomedical text mining and chemistry-related tasks, but the development of benchmarks designed specifically for other bioinformatics tasks, including multimodal ones, is still in its infancy. It is important to recognize that in generative tasks like coding, producing the expected results is not the sole criterion for gauging effectiveness and efficiency; factors such as code readability and the inclusion of code examples also play crucial roles[104]. Nonetheless, such comprehensive evaluations can be resource-intensive, underscoring the need for community efforts to broaden their scope. While automated alternatives exist, such as transforming tasks into multiple-choice questions or verifying responses against reference texts through lexical overlap or semantic similarity, each method comes with its own limitations[7]. Consequently, there is a pressing need to develop new, scalable, and accurate evaluation metrics and benchmark datasets that accommodate a wide range of bioinformatics tasks, ensuring that assessments are both meaningful and reflective of real-world, cutting-edge applicability.
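As one concrete instance of the lexical-overlap option, the sketch below implements token-level F1 between a reference answer and a model response, a metric in the same family as those used by existing text-mining benchmarks. It is a deliberate simplification, ignoring stemming, synonyms, and semantics, which is exactly the kind of limitation noted above.

```python
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Token-level F1 between a reference answer and a model response:
    rewards shared tokens while penalizing both omissions and padding."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    # Multiset intersection counts each shared token at most once per occurrence.
    common = sum((Counter(ref) & Counter(cand)).values())
    if common == 0:
        return 0.0
    precision = common / len(cand)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

# A partially overlapping answer scores between 0 and 1.
score = token_f1("TP53 is a tumor suppressor gene", "TP53 is an oncogene")  # ≈ 0.4
```

A semantically correct paraphrase can score poorly under this metric, which is why embedding-based semantic similarity is often used alongside it, each with its own failure modes[7].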
While aiming for comprehensiveness, our review does not encompass areas that, although outside the direct scope of bioinformatics and biomedical informatics, are closely related and significant. These areas include the management of electronic health records[111, 112], emotion analysis through social media[113], and medical consultation[114, 115]. To mitigate transparency and security concerns, fine-tuning open-source language models deployed locally with task-specific instructions presents a practical approach. Our review has spotlighted such advancements for drug discovery; we refer our readers to additional reviews for an expansive understanding of similar developments in other bioinformatics topics, as well as the ethical and legal issues involved[7-9, 116, 117]. Looking ahead, we envision a future where both online proprietary models such as ChatGPT and open-source, locally deployable fine-tuned language models coexist for bioinformatics and biomedical informatics, providing users with the most suitable tools to address their specific needs.
This paper is available on arxiv under CC BY 4.0 DEED license.