Authors:
(1) Yilin Sai, CSIRO Data61 and The University of New South Wales, Sydney, Australia;
(2) Qin Wang, CSIRO Data61 and The University of New South Wales, Sydney, Australia;
(3) Guangsheng Yu, CSIRO Data61;
(4) H.M.N. Dilum Bandara, CSIRO Data61 and The University of New South Wales, Sydney, Australia;
(5) Shiping Chen, CSIRO Data61 and The University of New South Wales, Sydney, Australia.
Abstract—As Artificial Intelligence (AI) integrates into diverse areas, particularly in content generation, ensuring rightful ownership and ethical use becomes paramount. AI service providers are expected to prioritize responsibly sourcing training data and obtaining licenses from data owners. However, existing studies primarily center on safeguarding static copyrights, simply treating metadata/datasets as non-fungible items with transferable/trading capabilities and neglecting the dynamic nature of training procedures that can shape an ongoing trajectory.
In this paper, we present IBIS, a blockchain-based framework tailored for AI model training workflows. IBIS integrates onchain registries for datasets, licenses and models, alongside offchain signing services to facilitate collaboration among multiple participants. Our framework addresses concerns regarding data and model provenance and copyright compliance. IBIS enables iterative model retraining and fine-tuning, and offers flexible license checks and renewals. Further, IBIS provides APIs designed for seamless integration with existing contract management software, minimizing disruptions to established model training processes. We implement IBIS using Daml on the Canton blockchain. Evaluation results showcase the feasibility and scalability of IBIS across varying numbers of users, datasets, models, and licenses.
The proliferation of Large Language Model (LLM)-based applications [1], [2] represents a significant milestone in the integration of Artificial Intelligence (AI) technologies into various facets of daily life, spanning from information retrieval [3], [4] to content generation [5], [6]. Concurrently, AI service providers have made strides in commercializing their services. Nevertheless, as LLMs and other AI models rely on extensive datasets aggregated from diverse sources for training [7], [8], apprehensions have emerged regarding the potential infringement of copyrights [9]–[11] during the data acquisition and model training process. To uphold responsible and ethical AI practices [12], [13], comply with regulations, and reduce legal liabilities, AI service providers must actively collaborate with data owners, including content creators and media industry stakeholders. Establishing licensing agreements [14], [15] and obtaining consent before utilizing data for AI model training are key elements of this collaboration [16]. Hence, there is a growing need for new frameworks addressing data provenance, lineage, and copyright compliance in the AI industry, tailored to its distinct needs and workflows.
However, addressing the concerns of AI data provenance and copyright compliance can be a nontrivial task, particularly when the entire training process occurs locally or within a black-box cloud service [17], limiting transparency for users. To bridge this gap, we harness the properties of blockchain technology, which offers a tamper-proof and trustworthy environment [18] to establish authenticity, provenance, and lineage [19], [20]. Owing to its inherent characteristics of immutability and transparency, blockchain has garnered widespread recognition as a suitable technology for achieving regulatory compliance [21]–[23]. For instance, data recorded on the blockchain is digitally signed and inherently tamper-proof, thereby constituting an authentic and persistent record that accurately reflects one or more events at a specific point in time. This makes blockchain a fitting candidate to address concerns related to data provenance and copyright compliance within the AI industry [24]–[26].
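To make the tamper-evidence property concrete, the following is a minimal Python sketch of a hash-chained, append-only log. It is an illustration only and not part of IBIS: a real blockchain additionally relies on digital signatures and a consensus protocol, which are omitted here, and the function names (`record_hash`, `append`, `verify`) and the event payloads are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first record


def record_hash(prev_hash: str, payload: dict) -> str:
    """Hash a record together with the hash of its predecessor."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list, payload: dict) -> None:
    """Append a record that commits to the current tail of the log."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "payload": payload,
        "prev": prev_hash,
        "hash": record_hash(prev_hash, payload),
    })


def verify(log: list) -> bool:
    """Recompute every hash; any edited record breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != record_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical events, for illustration only.
log = []
append(log, {"event": "dataset_registered", "dataset": "news-corpus-v1"})
append(log, {"event": "license_signed", "dataset": "news-corpus-v1"})
intact = verify(log)

# Altering an earlier record is detected on the next verification.
log[0]["payload"]["dataset"] = "other-corpus"
tampered_ok = verify(log)
```

In this sketch, `intact` is true and `tampered_ok` is false, because changing any earlier payload invalidates the stored hash of that record.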
We have identified a series of functional challenges that must be addressed in the development of such a blockchain-based compliance framework: (i) The framework must be designed to seamlessly integrate with the existing workflow of AI model training. (ii) The framework should support continuous model retraining and fine-tuning with new datasets, allowing for the generation of updated models while maintaining data provenance and lineage. (iii) The framework should support mechanisms for license expiration and renewal, accommodating diverse business models employed by data owners. (iv) The ownership of datasets and models, along with all training actions, should be accompanied by evidence to clarify their licensing scope and ensure accountability for any subsequent actions. (v) The framework should facilitate communication between AI service providers and data owners, enabling efficient attainment and documentation of licensing agreements. (vi) The framework should ensure the effective management and commercial sensitivity of licenses, safeguarding them against unauthorized access by third parties.
In this paper, we design, implement, and evaluate IBIS, a blockchain-based framework for data and model copyright management, provenance, and lineage in AI model training processes. IBIS empowers model owners to establish the provenance and lineage of their AI models and training datasets throughout retraining and fine-tuning processes, to efficiently obtain copyright licenses from the relevant copyright holders, and to securely record and renew bilaterally signed copyright licenses as evidence of legal compliance. Our detailed contributions are as follows:
• We propose a blockchain-integrated framework, IBIS, to track data and model copyright management, provenance, and lineage. IBIS exhibits the following characteristics:
⋄ Seamless integration (addressing c-i): By supporting iterative model retraining and fine-tuning, accommodating diverse copyright agreements through flexible license checks and renewals, and providing a unified API that integrates with existing contract lifecycle management software, the framework ensures minimal disruption to established model training and copyright management processes.
⋄ Adaptability (addressing c-ii and iii): By establishing links between models in the model metadata, and integrating periodic license renewal checks via smart contracts, IBIS supports ongoing model retraining and license renewal. Moreover, the on-chain license registry leverages blockchain’s immutability property, allowing model owners and copyright holders to retrieve their past licenses to prove regulatory compliance and avoid any disputes.
⋄ Traceable registry (addressing c-iv): By deploying three on-chain, immutable registries for dataset metadata, licenses, and model metadata, the framework maintains authentic records of dataset and model relationships, ownership, and their copyright agreements. The bidirectional links between these records enable two-way traceability throughout data and model copyright management, provenance, and lineage processes.
⋄ Blockchain-based multi-party signing (addressing c-v): By leveraging the identity management and digital signature capabilities offered by private-permissioned blockchains, IBIS enables efficient and secure multi-party signing workflows between AI model owners and copyright holders, ensuring the establishment of legally compliant licensing agreements.
⋄ Controllability (addressing c-vi): By implementing on-chain access control mechanisms and adhering to strict permission rules, IBIS ensures that only authorized parties can access the information pertaining to training datasets, models, and licenses. Consequently, IBIS facilitates an ecosystem encompassing many AI models, datasets, and licenses, enabling model and data owners to leverage the network effect of a unified platform while safeguarding their commercial sensitivity needs.
• We implement a fully functional prototype[1] based on the Daml smart contract language [27] and the Canton blockchain protocol [28]. We build on the privacy-preserving capabilities and modular design of Daml and Canton to implement a secure and commercial-sensitivity-preserving framework with six modules dedicated to license registration, management, and updating.
• We conduct a series of performance evaluations of IBIS, especially its performance under a parameterized real-world scenario. Evaluation results show that a model owner can retrieve a model’s datasets and its licenses in approximately 1.5 and 3 seconds, respectively. This is irrespective of the number of model owners, datasets, and licenses hosted within the framework. Additionally, retrieving authorized models for a license takes approximately 1.5 seconds, regardless of the number of training datasets per model, model owners, and licenses within the framework. These results demonstrate scalability under varying numbers of users, datasets, models, and licenses.
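The characteristics listed above (linked registries, model lineage, bilateral signing, license expiry, and owner-only access) can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the Daml implementation: the class names, fields, and methods (`Registries`, `lineage`, `licenses_needing_renewal`, and so on) are hypothetical, and real access control and signatures would be enforced by the ledger rather than by string comparison.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    dataset_id: str
    owner: str                                        # copyright holder
    license_ids: list = field(default_factory=list)  # links to licenses


@dataclass
class License:
    license_id: str
    dataset_id: str
    model_id: str
    licensor: str                                     # dataset owner
    licensee: str                                     # model owner
    expires: date
    signatures: set = field(default_factory=set)

    def is_valid(self, today: date) -> bool:
        # Valid only if both parties signed and the term has not lapsed.
        both_signed = {self.licensor, self.licensee} <= self.signatures
        return both_signed and today <= self.expires


@dataclass
class Model:
    model_id: str
    owner: str
    dataset_ids: list = field(default_factory=list)
    license_ids: list = field(default_factory=list)
    parent_id: Optional[str] = None   # model this one was fine-tuned from


class Registries:
    """In-memory stand-in for the dataset, license, and model registries."""

    def __init__(self) -> None:
        self.datasets: dict = {}
        self.licenses: dict = {}
        self.models: dict = {}

    def add_license(self, lic: License) -> None:
        # Store the license and link it from both the dataset and the model.
        self.licenses[lic.license_id] = lic
        self.datasets[lic.dataset_id].license_ids.append(lic.license_id)
        self.models[lic.model_id].license_ids.append(lic.license_id)

    def lineage(self, model_id: str) -> list:
        # Follow parent links back to the original base model.
        chain = []
        current = model_id
        while current is not None:
            chain.append(current)
            current = self.models[current].parent_id
        return chain

    def datasets_of(self, model_id: str, caller: str) -> list:
        # Only the model owner may read the model's training datasets.
        model = self.models[model_id]
        if caller != model.owner:
            raise PermissionError(f"{caller} may not read {model_id}")
        return list(model.dataset_ids)

    def licenses_needing_renewal(self, model_id: str, today: date) -> list:
        # Licenses that are unsigned by either party or past their expiry.
        model = self.models[model_id]
        return [lid for lid in model.license_ids
                if not self.licenses[lid].is_valid(today)]


# Hypothetical parties and identifiers, for illustration only.
reg = Registries()
reg.datasets["d1"] = Dataset("d1", owner="publisher")
reg.models["m1"] = Model("m1", owner="ai-lab", dataset_ids=["d1"])
reg.models["m2"] = Model("m2", owner="ai-lab", dataset_ids=["d1"],
                         parent_id="m1")
lic = License("l1", "d1", "m1", licensor="publisher", licensee="ai-lab",
              expires=date(2025, 1, 1))
lic.signatures.update({"publisher", "ai-lab"})
reg.add_license(lic)
```

With this setup, `reg.lineage("m2")` walks from the fine-tuned model back to its base model, and `reg.licenses_needing_renewal("m1", today)` returns the licenses that a periodic check would flag once `today` passes the expiry date.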
The rest of the paper is organized as follows: Sec. II provides background and related work. Sec. III gives the system architecture and our design. The construction details of our framework, including data models and functional operations, are presented in Sec. IV. Sec. V and Sec. VI present our implementation with performance evaluations. Sec. VII offers conclusions and suggests avenues for future research.
This paper is
[1] Openly released at https://github.com/yilin-sai/ai-copyright-framework.