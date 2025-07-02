

III. RESULTS

A. Results from the RandomForestClassifier (RFC)





Of the 2000 contracts in the dataset, the RFC was tested on 800 (40%). It classified 717 of these 800 contracts correctly, for an accuracy of 89.6% and an F1 score of 0.76. The confusion matrix shows that, of the positive (“True”) predictions, 133 were true positives and 23 were false positives; of the negative predictions, 584 were true negatives and 60 were false negatives. The false positive rate was only 3.8%, which meets our goal. This is a significant improvement over static analysis tools used alone, such as Slither, which has a false positive rate of 10.9% [20]. Furthermore, the RFC examines the source code directly rather than relying on a fixed set of vulnerability detectors, which makes it more adaptable to syntax changes.
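The reported metrics can be re-derived from the four confusion-matrix counts. The short Python sketch below does so; the function name `summarize` is ours and is not part of the original pipeline, and we assume the false positive rate is defined as FP / (FP + TN), which is the definition that reproduces the 3.8% figure.

```python
def summarize(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "total": total,
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Assumed definition: share of truly negative contracts flagged as positive.
        "false_positive_rate": fp / (fp + tn),
    }


# Counts reported for the 800 test contracts.
metrics = summarize(tp=133, fp=23, tn=584, fn=60)
print(f"accuracy={metrics['accuracy']:.3f}")
print(f"f1={metrics['f1']:.2f}")
print(f"false_positive_rate={metrics['false_positive_rate']:.3f}")
```

The counts sum to 800 and give an accuracy of 0.896, an F1 score of 0.76, and a false positive rate of 0.038, which agrees with the figures in the text.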





B. Results from the GPT-3.5-Turbo and Llama-2-7B Error Correction Models









To test GPT-3.5-Turbo and the fine-tuned Llama-2-7B model with our prompt, we asked each model to repair the vulnerabilities reported by Slither. The results are shown in the graphs above. Slither checks on the GPT-corrected smart contracts are promising: the CoT-prompted GPT-3.5-Turbo model repaired 97.5% of vulnerabilities. Specifically, of the 40 vulnerabilities found in the source code, only a single medium-impact vulnerability remained. Meanwhile, the fine-tuned Llama-2-7B model corrected all but two of the 60 vulnerabilities it encountered, leaving one medium-impact and one low-impact vulnerability. Thus the Llama-2-7B model reduced the number of vulnerabilities by 96.7%. We reviewed a random third of the repaired smart contracts and found that all of them retained their original functionality, with the models usually correcting syntax-level errors rather than changing underlying structures.
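The two repair rates follow directly from the vulnerability counts before and after repair. As a minimal check, assuming the rate is defined as the fraction of Slither-reported vulnerabilities that are no longer reported after repair (the helper name `repair_rate` is ours):

```python
def repair_rate(found: int, remaining: int) -> float:
    """Fraction of reported vulnerabilities no longer reported after repair."""
    if found <= 0 or not 0 <= remaining <= found:
        raise ValueError("expected 0 <= remaining <= found and found > 0")
    return (found - remaining) / found


gpt_rate = repair_rate(found=40, remaining=1)    # GPT-3.5-Turbo
llama_rate = repair_rate(found=60, remaining=2)  # Llama-2-7B
print(f"GPT-3.5-Turbo: {gpt_rate:.1%}, Llama-2-7B: {llama_rate:.1%}")
```

Note that the two models were evaluated on different numbers of vulnerabilities (40 and 60), so the percentages are not computed over the same denominator.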





The CoT prompts for GPT-3.5-Turbo and the fine-tuning of the Llama-2-7B model were vital to the accuracy of these models. In initial testing, GPT-3.5-Turbo repaired fewer than 85% of smart contracts, and the Llama-2-7B model could not produce code that compiled. With the methods outlined above, however, the results demonstrate a reliable process for repairing smart contracts.





Indeed, these results demonstrate that the LLMs repaired vulnerable smart contracts with near-perfect accuracy, leaving only three vulnerabilities in total. The error correction rate was well above that of any existing method, making them state-of-the-art tools with impressive error reduction capabilities. Moreover, because of the “Two Timin’” framework described above, only malicious contracts were repaired, which cut computing time and maximized the number of secure, reliable smart contracts available. With tens of millions of smart contracts on blockchains such as Ethereum, as reported by Etherscan [21], minimizing computational complexity and cost in an already energy-intensive industry benefits users, companies, and the environment.
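The computational saving comes from gating: a contract reaches the LLM only if the first tier flags it. The Python sketch below illustrates that control flow under our own assumptions. The three stages are passed in as plain callables, and every name here (`two_tier_repair`, `build_prompt`, the stub functions, and the prompt wording) is hypothetical; it does not reproduce the actual Slither invocation, the trained RFC, or the CoT prompt used in this work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class RepairResult:
    name: str
    flagged: bool
    findings: List[str]
    repaired_source: Optional[str]  # None when the contract never reached the LLM


def build_prompt(source: str, findings: List[str]) -> str:
    # Hypothetical template; the actual CoT prompt is not reproduced here.
    listed = "\n".join(f"- {finding}" for finding in findings)
    return (
        "The following Solidity contract has these reported vulnerabilities:\n"
        f"{listed}\n"
        "Reason step by step about each one, then return the corrected contract.\n\n"
        f"{source}"
    )


def two_tier_repair(
    contracts: Dict[str, str],
    is_malicious: Callable[[str], bool],   # tier 1: classifier gate
    detect: Callable[[str], List[str]],    # tier 1: vulnerability findings
    repair: Callable[[str], str],          # tier 2: LLM call
) -> List[RepairResult]:
    results = []
    for name, source in contracts.items():
        if not is_malicious(source):
            # Contracts judged benign skip the expensive LLM call entirely.
            results.append(RepairResult(name, False, [], None))
            continue
        findings = detect(source)
        repaired = repair(build_prompt(source, findings))
        results.append(RepairResult(name, True, findings, repaired))
    return results


# Toy stand-ins so the sketch runs without Slither, a trained RFC, or an LLM.
llm_prompts: List[str] = []


def stub_is_malicious(source: str) -> bool:
    return "call.value" in source


def stub_detect(source: str) -> List[str]:
    return ["reentrancy"]


def stub_repair(prompt: str) -> str:
    llm_prompts.append(prompt)
    return "// repaired contract"


contracts = {
    "Safe.sol": "contract Safe { }",
    "Bank.sol": "contract Bank { /* uses call.value before updating balances */ }",
}
results = two_tier_repair(contracts, stub_is_malicious, stub_detect, stub_repair)
for result in results:
    print(result.name, "repaired" if result.repaired_source else "skipped")
print("LLM calls:", len(llm_prompts))
```

With these stubs, only the flagged contract triggers an LLM call, so the number of LLM calls scales with the number of flagged contracts rather than with the size of the whole corpus.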





IV. CONCLUSION

In this paper, we used the Solidity source code of smart contracts to build a novel approach to identifying and repairing vulnerabilities. The approach used a two-tiered flow. First, the Slither static code analyzer and a Random Forest Classifier were used to identify malicious smart contracts and their specific vulnerabilities. Second, these malicious smart contracts and their vulnerabilities were passed as parameters in a prompt to two separate LLMs, GPT-3.5-Turbo and Llama-2-7B. The prompt was developed through prompt engineering using Chain of Thought reasoning. The two smart contract repair models, one using pre-trained GPT-3.5-Turbo and the other a fine-tuned Llama-2-7B, reduced the overall vulnerability count by 97.5% and 96.7%, respectively. This novel approach, with state-of-the-art accuracy, allows smart contracts to be screened and repaired before they are deployed, so that cybercriminals cannot exploit the repaired vulnerabilities. Indeed, this paper establishes an easy-to-use framework with reliable results, increasing access to safe smart contracts for all. Using the “Two Timin’” framework, businesses and DAOs can use LLMs to repair smart contracts efficiently and effectively, an important step forward as the prevalence of blockchain continues to increase.

FUTURE WORK

Other classifiers, such as those based on transformers or neural networks, could be used to identify malicious smart contracts. These could learn from a broader range of data containing a larger proportion of malicious smart contracts. In addition, Llama-2-7B could be fine-tuned further, with more hidden layers and a larger dataset, to raise its error correction rate above that of GPT-3.5-Turbo. At the time of writing, GPT-3.5-Turbo cannot be fine-tuned; if fine-tuning capabilities are developed, further research could focus on fine-tuning GPT-3.5-Turbo for repairing smart contracts. Moreover, advances in PEFT and/or QLoRA could allow for a less memory-intensive but more accurate LLM for repairing smart contracts.

