Authors:
(1) Mingjie Liu, NVIDIA {Equal contribution};
(2) Teodor-Dumitru Ene, NVIDIA {Equal contribution};
(3) Robert Kirby, NVIDIA {Equal contribution};
(4) Chris Cheng, NVIDIA {Equal contribution};
(5) Nathaniel Pinckney, NVIDIA {Equal contribution};
(6) Rongjian Liang, NVIDIA {Equal contribution};
(7) Jonah Alben, NVIDIA;
(8) Himyanshu Anand, NVIDIA;
(9) Sanmitra Banerjee, NVIDIA;
(10) Ismet Bayraktaroglu, NVIDIA;
(11) Bonita Bhaskaran, NVIDIA;
(12) Bryan Catanzaro, NVIDIA;
(13) Arjun Chaudhuri, NVIDIA;
(14) Sharon Clay, NVIDIA;
(15) Bill Dally, NVIDIA;
(16) Laura Dang, NVIDIA;
(17) Parikshit Deshpande, NVIDIA;
(18) Siddhanth Dhodhi, NVIDIA;
(19) Sameer Halepete, NVIDIA;
(20) Eric Hill, NVIDIA;
(21) Jiashang Hu, NVIDIA;
(22) Sumit Jain, NVIDIA;
(23) Brucek Khailany, NVIDIA;
(24) George Kokai, NVIDIA;
(25) Kishor Kunal, NVIDIA;
(26) Xiaowei Li, NVIDIA;
(27) Charley Lind, NVIDIA;
(28) Hao Liu, NVIDIA;
(29) Stuart Oberman, NVIDIA;
(30) Sujeet Omar, NVIDIA;
(31) Sreedhar Pratty, NVIDIA;
(32) Jonathan Raiman, NVIDIA;
(33) Ambar Sarkar, NVIDIA;
(34) Zhengjiang Shao, NVIDIA;
(35) Hanfei Sun, NVIDIA;
(36) Pratik P Suthar, NVIDIA;
(37) Varun Tej, NVIDIA;
(38) Walker Turner, NVIDIA;
(39) Kaizhe Xu, NVIDIA;
(40) Haoxing Ren, NVIDIA.
A. Considerations for Domain Adaptation
Although domain-adapted ChipNeMo models achieve significant improvements over their corresponding foundation models, we also observe that the larger LLaMA2 70B can sometimes achieve accuracy similar to ChipNeMo's, as seen in Figures 8, 9, and 10. Recent work has leveraged these powerful models to perform chip design tasks.
However, it is important to consider the cost-efficiency benefits of using a smaller model. Pope et al. demonstrate that inference costs on an 8B model are 8-12x lower than on a 62B model for equal latency targets [34]. Furthermore, model size reduction can lead to dramatic increases in inference speed by allowing a model to fit within a single GPU or node where it otherwise could not [35]. Our ChipNeMo 13B model can be loaded within the memory of a single A100 GPU without any quantization, unlike the LLaMA2 70B model. This yields a significant inference speed increase under normal GPU operation, which can instead be traded for a significant reduction in inference cost should the GPU be underclocked.
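To make the single-GPU claim concrete, here is a rough memory estimate (our illustration, assuming FP16/BF16 weights at 2 bytes per parameter and the 80 GB A100 variant; actual serving also needs memory for the KV cache, activations, and framework overhead):

```python
# Back-of-the-envelope estimate of GPU memory needed just to hold model
# weights in FP16/BF16 (2 bytes per parameter). Illustrative only: real
# serving also requires KV-cache, activation, and framework memory.

def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 bytes/GB

A100_MEMORY_GB = 80  # assuming the 80 GB A100 variant

for name, size in [("ChipNeMo 13B", 13), ("LLaMA2 70B", 70)]:
    gb = weight_memory_gb(size)
    verdict = "fits on" if gb < A100_MEMORY_GB else "exceeds"
    print(f"{name}: ~{gb:.0f} GB of weights, {verdict} a single 80 GB A100")
```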
Thus, when deciding between a larger general-purpose model and a smaller specialized model in a production environment, the following criteria must be considered:
• Training and inference trade-off: Smaller domain-adapted models can match the accuracy of larger general-purpose models. While domain adaptation incurs additional up-front costs, the use of smaller models leads to significantly reduced operating costs (see the break-even sketch after this list).
• Uniqueness of use case: As can be seen from Figures 6, 9, and 10, domain-adapted models show the most improvement on tasks that are rarely present in the public domain, such as writing code in proprietary languages or libraries. Indeed, our data shows that even when they are provided with hand-picked contexts, large general-purpose models have difficulty matching the accuracy of domain-adapted models in such scenarios.
• Availability of domain data: Domain adaptation works best when there is a large amount of training data, i.e., billions of training tokens. This is often the case for large corporations and projects that have accumulated a large volume of internal documents and code, but not necessarily for smaller businesses or projects.
• End use case diversity: While it is possible to fine-tune a general-purpose model for a particular task, domain-adapted models are suited to a diverse set of tasks within a domain. Although we demonstrate only three use cases for ChipNeMo models in this work, they can be readily reused for other use cases given sufficient SFT data.
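To illustrate the training and inference trade-off above, a minimal break-even sketch follows; every number is a hypothetical placeholder rather than a measurement from this work:

```python
# Hypothetical break-even analysis for the training/inference trade-off.
# All figures are illustrative placeholders, not data from the paper.

adaptation_cost = 100_000.0  # one-time DAPT + SFT cost ($), hypothetical
cost_small = 0.001           # $/query, small domain-adapted model, hypothetical
cost_large = 0.010           # $/query, large general-purpose model, hypothetical

# The adapted model pays off once cumulative per-query savings cover the
# one-time adaptation cost: adaptation_cost = (cost_large - cost_small) * N
break_even_queries = adaptation_cost / (cost_large - cost_small)
print(f"Break-even at ~{break_even_queries:,.0f} queries")  # ~11,111,111
```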
B. Performance Gap
Although ChipNeMo achieves impressive results in our selected applications, as shown in Appendix E, the evaluation results for all applications still show a considerable gap with human expert performance. We are considering the following approaches to bridge this performance gap:
1) Data Collection: We can expand the DAPT dataset to include more internal proprietary data. In addition, we plan to add more task-specific instruction sets for SFT, as evidence shows that task-specific SFT meaningfully improves evaluation results.
2) Base Model: We expect that better and larger base models, such as LLaMA2 70B, can improve performance. We can also explore applying DAPT to code-specific base models such as Code LLaMA [32] for code generation tasks.
3) Training: We also plan to conduct reinforcement learning from human feedback (RLHF) [36] on the ChipNeMo chat model to make it more versatile, leveraging pretrained reward models trained on general-purpose datasets. We also plan to conduct long-context training [37] to address cases where a long context is needed, e.g., in the bug summarization application. In general, longer context support would help improve retrieval-based methods for chat assistance as well as code generation.
4) Retrieval: We will further investigate better RAG methods for both the engineering assistant chatbot and EDA script generation. For the engineering assistant chatbot, we can create different data stores for different application areas. We can also integrate enterprise search engines with RAG to find relevant context for a diverse set of problems. For code generation, we can investigate automated retrieval of context from existing code and documentation.
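One way to realize per-application-area data stores is to route each query to its own index before retrieval. The sketch below is our illustration; the area names, embed() signature, and store API are hypothetical placeholders, not the paper's implementation:

```python
# Toy per-application-area RAG routing; all names here are hypothetical.
from typing import Callable

class VectorStore:
    """In-memory store of (embedding, text) pairs with cosine retrieval."""
    def __init__(self) -> None:
        self.docs: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], text: str) -> None:
        self.docs.append((embedding, text))

    def top_k(self, query: list[float], k: int = 3) -> list[str]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
            return dot / (norm + 1e-9)
        ranked = sorted(self.docs, key=lambda d: cosine(d[0], query), reverse=True)
        return [text for _, text in ranked[:k]]

# One store per application area, as proposed for the chatbot.
STORES = {area: VectorStore() for area in ("design_docs", "eda_scripts", "bug_reports")}

def retrieve(query: str, area: str, embed: Callable[[str], list[float]], k: int = 3) -> list[str]:
    """Route the query to its application area's store, then retrieve top-k."""
    return STORES[area].top_k(embed(query), k)
```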
C. Agent-Based Design Methodologies
The use cases we experimented with in this work are straightforward applications of the prompt-and-response capability of LLMs. Agents refer to the use of an LLM to choose a sequence of actions to take, where the LLM acts as a reasoning engine driving outside tools. Chip design processes involve many existing EDA tools and methodologies. We believe some of these methodologies can be driven by agents powered by domain-adapted LLMs such as ChipNeMo models. We plan to work on agent-based design methodologies for verification and optimization in the future.
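As a minimal sketch of this pattern (our illustration: the tool functions and the llm_choose_action() interface are hypothetical stand-ins, since the paper proposes the pattern but does not prescribe an interface):

```python
# Minimal LLM-as-reasoning-engine agent loop over EDA tools.
# run_lint(), run_simulation(), and llm_choose_action() are hypothetical.

def run_lint(design: str) -> str:
    return f"lint({design}): 0 errors, 2 warnings"  # placeholder tool output

def run_simulation(design: str, test: str) -> str:
    return f"sim({design}, {test}): PASS"  # placeholder tool output

TOOLS = {"lint": run_lint, "simulate": run_simulation}

def agent_loop(goal: str, llm_choose_action, max_steps: int = 10):
    """The LLM inspects the goal plus the (action, observation) history and
    returns the next action, e.g. {"tool": "lint", "args": {"design": "top.v"}}
    or {"tool": "done", "answer": ...} when the goal is met."""
    history = []
    for _ in range(max_steps):
        action = llm_choose_action(goal, history)
        if action["tool"] == "done":
            return action.get("answer")
        observation = TOOLS[action["tool"]](**action["args"])
        history.append((action, observation))
    return history  # step budget exhausted; return the trace for inspection
```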
This paper is available on arXiv under a CC 4.0 license.