With the cost of a cup of Starbucks and two hours of your time, you can own your own trained open-source large-scale model. The model can be fine-tuned according to different training data directions to enhance various skills, such as medical, programming, stock trading, and love advice, making your large-scale model more “understanding” of you. Let’s try training an open-source large-scale model empowered by the open-source DolphinScheduler!
The birth of ChatGPT has undoubtedly filled us with anticipation for the future of AI. Its sophisticated expression and powerful language understanding ability have amazed the world. However, because ChatGPT is provided as Software as a Service (SaaS), personal privacy leaks and corporate data security are concerns for every user and company. More and more open-source large-scale models are emerging, making it possible for individuals and companies to have their own models. However, getting started with, optimizing, and using open-source large-scale models still has high barriers to entry, making it difficult for everyone to use them easily. To address this, we use Apache DolphinScheduler, which provides one-click support for training, tuning, and deploying open-source large-scale models. This enables everyone to train their own large-scale model on their own data at a very low cost and with minimal technical expertise.
Our goal is not only for professional AI engineers but for anyone interested in GPT to enjoy the joy of having a model that “understands” them better. We believe that everyone has the right and ability to shape their own AI assistant. The intuitive workflow of Apache DolphinScheduler makes this possible. As a bonus, Apache DolphinScheduler is a big data and AI scheduling tool with over 10,000 stars on GitHub. It is a top-level project under the Apache Software Foundation, meaning you can use it for free and modify the code without worrying about any commercial issues.
Whether you are an industry expert looking to train a model with your own data, or an AI enthusiast wanting to understand and explore the training of deep learning models, our workflow will provide convenient services for you. It solves complex pre-processing, model training, and optimization steps, and only requires 1–2 hours of simple operations, plus 20 hours of running time to build a more “understanding” ChatGPT large-scale model.
So let’s start this magical journey! Let’s bring the future of AI to everyone.
First, you need a 3090 graphics card. If you have a desktop computer, you can use it directly. If not, there are many GPU hosts available for rent online. Here we use AutoDL as an example. Open https://www.autodl.com/home, register, and log in. After that, you can choose a server in the computing power market following steps 1, 2, and 3 shown on the screen.
Here, it is recommended to choose the RTX 3090 graphics card, which offers a high cost-performance ratio. After testing, it has been found that one to two people can use the RTX 3090 for online tasks. If you want faster training and response speeds, you can opt for a more powerful graphics card. Training once takes approximately 20 hours, while testing requires around 2–3 hours. With a budget of 40 yuan, you can easily get it done.
Click on the community image option, then enter WhaleOps/dolphinscheduler-llm/dolphinscheduler-llm-0521 in the red box below. You can select the image as shown below. Currently, only the V1 version is available. In the future, as new versions are released, you can choose the latest one.
If you need to train the model multiple times, it is recommended to expand the hard disk capacity to around 100GB.
After creating it, wait for the progress bar shown in the following image to complete.
In order to deploy and debug your own open-source large-scale model on the interface, you need to start the DolphinScheduler software, and we need to do the following configuration work:
There are two methods available. You can choose the one that suits your preference:
1. Login via JupyterLab:
Click on the JupyterLab button shown below.
The page will redirect to JupyterLab; from there, you can click “Terminal” to enter.
2. Login via Terminal (for coders):
We can obtain the SSH connection command from the button shown in the following image.
Then, establish the connection through the terminal.
In DolphinScheduler, all metadata is stored in the database, including workflow definitions, environment configurations, tenant information, etc. To make these workflows visible to users as soon as DolphinScheduler is launched, we can directly import the pre-defined workflow metadata.
Using the terminal, navigate to the following directory:
cd apache-dolphinscheduler-3.1.5-bin
Execute the command `vim import_ds_metadata.sh` to open the import_ds_metadata.sh file. The content of the file is as follows:
```shell
# Set variables
# Hostname
HOST="xxx.xxx.xxx.x"
# Username
USERNAME="root"
# Password
PASSWORD="xxxx"
# Port
PORT=3306
# Database to import into
DATABASE="ds315_llm_test"
# SQL filename
SQL_FILE="ds315_llm.sql"

mysql -h $HOST -P $PORT -u $USERNAME -p$PASSWORD -e "CREATE DATABASE $DATABASE;"
mysql -h $HOST -P $PORT -u $USERNAME -p$PASSWORD $DATABASE < $SQL_FILE
```
Replace xxx.xxx.xxx.x and xxxx with the relevant configuration values of a MySQL database on your public network (you can apply for one on Alibaba Cloud, Tencent Cloud, or install one yourself). Then execute:
bash import_ds_metadata.sh
After execution, if interested, you can check the corresponding metadata in the database (connect to MySQL and view, skipping this step if you are not familiar with the code).
In the server command line, open the following file and modify the configuration to connect DolphinScheduler with the previously imported database:
/root/apache-dolphinscheduler-3.1.5-bin/bin/env/dolphinscheduler_env.sh
Modify the relevant configuration in the database section, and leave other sections unchanged. Change the values of ‘HOST’ and ‘PASSWORD’ to the configuration values of the imported database, i.e., xxx.xxx.xxx.x and xxxx:
```shell
export DATABASE=mysql
export SPRING_PROFILES_ACTIVE=${DATABASE}
export SPRING_DATASOURCE_URL="jdbc:mysql://HOST:3306/ds315_llm_test?useUnicode=true&characterEncoding=UTF-8&useSSL=false"
export SPRING_DATASOURCE_USERNAME="root"
export SPRING_DATASOURCE_PASSWORD="xxxxxx"
......
```
After configuring, execute (also in this directory /root/apache-dolphinscheduler-3.1.5-bin):
bash ./bin/dolphinscheduler-daemon.sh start standalone-server
Once executed, we can check the logs using `tail -200f standalone-server/logs/dolphinscheduler-standalone.log`. At this point, DolphinScheduler is officially launched!
After starting the service, we can click on “Custom Services” in the AutoDL console (highlighted in red) to be redirected to a URL:
Upon opening the URL, if it shows a 404 error, don’t worry. Just append the suffix /dolphinscheduler/ui to the URL:
The AutoDL module opens port 6006. After configuring DolphinScheduler’s port to 6006, you can access it through the provided entry point. However, due to the URL redirection, you may encounter a 404 error. In such cases, you need to complete the URL manually.
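Completing the URL manually just means appending the suffix to whatever base address AutoDL gives you. A minimal sketch (the base URL below is a made-up example, not a real AutoDL address):

```shell
# Hypothetical custom-service URL copied from the AutoDL console (example only).
base_url="https://u123456-abcd.autodl.com:6006"

# Append the UI suffix to reach the DolphinScheduler login page.
echo "${base_url}/dolphinscheduler/ui"
```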
Username: admin
Password: dolphinscheduler123
After logging in, click on “Project Management” to see the predefined project named “vicuna”. Click on “vicuna” to enter the project.
Upon entering the Vicuna project, you will see three workflows: Training, Deploy, and Kill_Service. Let’s explore their uses and how to configure large models and train your data.
You can click the run button below to execute corresponding workflows.
By clicking on the training workflow, you will see two definitions. One is for fine-tuning the model through Lora (mainly using alpaca-lora, https://github.com/tloen/alpaca-lora), and the other is to merge the trained model with the base model to get the final model.
This workflow deploys large models (mainly using FastChat, https://github.com/lm-sys/FastChat). It first invokes kill_service to kill any deployed model, then sequentially starts the controller, adds the model, and opens the Gradio web service.
The start parameters are as follows:
This workflow is used to kill the deployed model and release GPU memory. This workflow has no parameters, and you can run it directly. If you need to stop the deployed service (such as when you need to retrain the model or when there is insufficient GPU memory), you can directly execute the kill_service workflow to kill the deployed service.
That covers what each workflow does. Now let’s take a look at the practical operation:
Start the workflow directly by executing the training workflow and selecting the default parameters.
Right-click on the corresponding task to view the logs, as shown below:
You can also view the task status and logs in the task instance panel at the bottom left of the sidebar. During the training process, you can monitor the progress by checking the logs, including the current training steps, loss metrics, remaining time, etc. There is a progress bar indicating the current step, where step = (data size * epoch) / batch size.
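To make the step formula concrete, here is a quick sanity check with illustrative numbers (8,000 samples, 3 epochs, and batch size 128 are made-up values, not the workflow defaults):

```shell
# step = (data size * epoch) / batch size, with illustrative values.
data_size=8000
epochs=3
batch_size=128

# Shell arithmetic truncates, so 24000 / 128 = 187 (integer division).
steps=$(( data_size * epochs / batch_size ))
echo "total training steps: $steps"
```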
Our default data is in /root/demo-data/llama_data.json. The current data source is Huatuo, a medical model fine-tuned using Chinese medical data. Yes, our example is training a family doctor:
If you have data in a specific field, you can point the workflow to your own data. The data format is as follows:
For example:
{"instruction": "calculation", "input": "1+1 equals?", "output": "2"}
Please note that you can merge the instruction and input fields into a single instruction field. The input field can also be left empty.
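For instance, a minimal dataset file can be assembled and sanity-checked like this (the filename my_data.json is hypothetical; point data_path at wherever you save yours):

```shell
# Write a tiny dataset in the format described above (hypothetical filename).
# The second entry merges the question into the instruction field and
# leaves input empty, as the note above allows.
cat > my_data.json <<'EOF'
[
  {"instruction": "calculation", "input": "1+1 equals?", "output": "2"},
  {"instruction": "What is the capital of France?", "input": "", "output": "Paris"}
]
EOF

# Validate that the file is well-formed JSON before training.
python3 -m json.tool my_data.json > /dev/null && echo "data OK"
```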
When training, modify the data_path parameter to use your own data.
Note:
During the first training execution, the base model will be fetched from the specified location, such as TheBloke/vicuna-7B-1.1-HF. There will be a downloading process, so please wait for the download to complete. The choice of this model is determined by the user, and you can also choose to download other open-source large models (please follow the relevant licenses when using them).
Due to network issues, the base model download may fail halfway through the first training execution. In such cases, you can click on the failed task and choose to rerun it to continue the training. The operation is shown below:
To stop the training, you can click the stop button, which will release the GPU memory used for training.
On the workflow definition page, click on the deploy workflow to run it and deploy the model.
If you haven’t trained your own model, you can execute the deploy workflow with the default parameter TheBloke/vicuna-7B-1.1-HF to deploy the vicuna-7b model, as shown in the image below:
If you have trained a model in the previous step, you can now deploy it. After deployment, you can experience your own large model. The startup parameters are as follows, where you need to fill in the output_path of the model from the previous step:
Next, let’s enter the deployed workflow instance. Click on the workflow instance, and then click on the workflow instance with the “deploy” prefix.
Right-click and select “refresh_gradio_web_service” to view the task logs and find the location of our large model link.
The operation is shown below:
In the logs, you will find a link that can be accessed publicly, such as:
Here are two links. The link 0.0.0.0:7860 cannot be accessed because AutoDL only opens port 6006, which is already used for DolphinScheduler. You can directly access the link below it, such as https://81c9f6ce11eb3c37a4.gradio.live.
Please note that this link may change each time you deploy, so you need to find it again from the logs.
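One way to fish the current link out of a log is a simple grep. The log line below is simulated for illustration (and the filename deploy_task.log is hypothetical); in practice, search the log of the refresh_gradio_web_service task:

```shell
# Simulated log line (illustrative); a real deploy log contains a similar one.
printf 'Running on public URL: https://81c9f6ce11eb3c37a4.gradio.live\n' > deploy_task.log

# Extract the latest public gradio.live link from the log.
grep -oE 'https://[a-z0-9]+\.gradio\.live' deploy_task.log | tail -n 1
```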
Once you enter the link, you will see the conversation page of your own ChatGPT!
Yes! Now you have your own ChatGPT, and its data only serves you!
And you only spent less than the cost of a cup of coffee~~
Go ahead and experience your own private ChatGPT!
In this data-driven and technology-oriented world, having a dedicated ChatGPT model has immeasurable value. With the advancement of artificial intelligence and deep learning, we are in an era where personalized AI assistants can be shaped. Training and deploying your own ChatGPT model can help us better understand AI and how it is transforming our world.
In summary, training and deploying a ChatGPT model on your own can help you protect data security and privacy, meet specific business requirements, save on technology costs, and automate the training process using workflow tools like DolphinScheduler. It also allows you to comply with local laws and regulations. Therefore, training and deploying a ChatGPT model on your own is a worthwhile option to consider.