
How I Made My Own Screenshot-To-Code Project

by marscode, June 12th, 2024

Too Long; Didn't Read

screenshot-to-code is a trending open-source project on GitHub. It supports various tech stacks (HTML + Tailwind, React + Tailwind, etc.) and popular LLMs. I forked it to add support for Gemini and for my preferred stack, React + Ant Design + Tailwind. The result is an HTML page that isn't perfect, but the overall layout and text are well reproduced. The backend is written in Python and the frontend in React (Node.js).

Background


I am a front-end developer, and building web pages is one of the most time-consuming parts of my job. I hope AI can take over these tasks and free up more time for me to enjoy my coffee breaks. ☕️

Recently, I stumbled upon an amazing open-source project on GitHub's trending list: https://github.com/abi/screenshot-to-code, which has garnered a lot of attention. As of this writing, it has already received an astonishing 53k+ ⭐️.

The project supports various tech stacks and popular LLMs:

  • Tech stacks: HTML + Tailwind, React + Tailwind, etc.

  • LLMs: such as GPT-4o, Claude 3 Sonnet, etc.

It felt like discovering a hidden treasure 🤩, and I couldn't wait to give it a try. However, after reading the repository's documentation, I encountered two issues:

  • The available LLMs are all paid services, so I wanted to use Gemini, which currently offers a free tier.

  • It doesn't support my preferred tech stack: React + Ant Design + Tailwind.

So, I decided to fork the repository and add support for Gemini and for my stack.


Preparation:

  1. Obtain a Gemini API key, which is still free for now.
  2. Set up an environment with Python and Node.js. Since I'm not very familiar with this tech stack, the setup seemed complicated, so I chose MarsCode, an AI cloud IDE product.


Final Outcome:

Using Gemini with the React + Ant Design + Tailwind stack, and taking the React homepage as an example, the final result is an HTML page. It's not perfect, but the overall layout and text are well reproduced, which should save me a lot of time writing UI code. Cool!

FYI: https://github.com/KernLex/screenshot-to-code


Detailed Development Process

The screenshot-to-code project includes both front-end and back-end code. The back-end is written in Python and can be found in the backend directory, while the front-end is built with React and is located in the frontend directory.

Creating the Project

First, let's follow the project's instructions to get the repository running on MarsCode:

  1. Fork the Repository: Start by forking the screenshot-to-code repository to your GitHub account.

  2. Create a Project on MarsCode

    1. Go to MarsCode and create a new project.
    2. Select screenshot-to-code, and use the "ALL in One" template.

With these steps, you will have the project set up and running smoothly on MarsCode. MarsCode supports one-click import of GitHub projects, making it very convenient.
The theme looks pretty cool, and there is an AI Assistant on the right side that provides functionality similar to GitHub Copilot.

Next, let's start the project in the IDE according to the project instructions.


Start Backend Service

The backend service is implemented in Python and uses Poetry for dependency management.

1. Install Dependencies

```sh
cd backend
poetry install
poetry shell
```

2. Configure API KEY

```sh
echo "OPENAI_API_KEY=sk-your-key" > .env
```

3. Start Backend Service: Since I don't have access to OpenAI and Claude APIs yet, we can mock the data by executing the following command in the Terminal:

```sh
MOCK=true poetry run uvicorn main:app --reload --port 7001
```

Start Frontend Service

1. Install Dependencies

Open a new Terminal and execute the following commands to install dependencies:

```sh
cd frontend
yarn
```

2. Start Frontend Service

```sh
yarn dev
```


MarsCode provides a very useful feature called Networking, which allows for port forwarding. This project starts two services: port 5173 for the frontend page and port 7001 for the backend.

Next, we can open the access address for the frontend page (the Address corresponding to port 5173). However, when we uploaded an image, an error occurred.
After checking the network requests on the page, we found that the frontend requested a WebSocket address with the domain name 127.0.0.1.


Since we are developing on MarsCode, which is essentially a remote service, we need to change this domain name to the proxy address of the backend service.


Create the file frontend/.env.local and configure the service address for development so that the frontend connects to the proxy service's WebSocket. For example:

```sh
VITE_WS_BACKEND_URL=wss://omt69uy22k6flz8q8c4rrmujn8g9dgp150vkwziswevdxig1dlce2o4pybh9v3.ac1-preview.marscode.dev
```


Note: The value of VITE_WS_BACKEND_URL should correspond to the proxy address of port 7001, and the protocol should be wss. The service is working normally now.

Adding Gemini

To add Gemini to the frontend page, we need to modify two areas: support for selecting Gemini, and the new frontend tech stack (React + Ant Design + Tailwind). In frontend/src/lib/models.ts, add Gemini as a model option, then add React + Ant Design + Tailwind as a new tech-stack option.

Modifying Prompt

In the file screenshot_system_prompts.py, add a new prompt. Referring to the prompt structure from screenshot-to-code, I made some adjustments for better readability and also added support for Ant Design.

```python
REACT_TAILWIND_ANTD_SYSTEM_PROMPT = """
# Character
You're an experienced React/Tailwind developer who builds single page apps using React, Tailwind CSS, and Ant Design based on provided screenshots.

## Skills
### Skill 1: App Development
- Examine the screenshot provided by the user.
- Use React, Tailwind, and Ant Design to create a web application that precisely matches the screenshot.
- Ensure the app's elements and layout align exactly with the screenshot.
- Carefully consider icon size, text color, font size, and other details to match the screenshot as closely as possible.
- Use the specific text from the screenshot.

### Skill 2: App Modification
- Evaluate a screenshot (the second image) of a web page you've previously created.
- Make alterations to make it more similar to a provided reference image (the first image).

### Skill 3: Precision & Details
- Strictly follow the reference and don't add extra text or comments in the code.
- Duplicate elements as needed to match the screenshot without leaving placeholders or comments.
- Use placeholder images from https://placehold.co with detailed alt text descriptions for later image generation.

## Constraints:
- Use React, Tailwind, and Ant Design for development.
- Use provided scripts to include Dayjs, React, ReactDOM, Babel, Ant Design, and Tailwind for standalone page operation.
- Use FontAwesome for icons.
- Can use Google Fonts.
- Return the entire HTML code without markdown signifiers.
- Use these scripts to include React so that it can run on a standalone page:
    <script src="https://unpkg.com/react/umd/react.development.js"></script>
    <script src="https://unpkg.com/react-dom/umd/react-dom.development.js"></script>
    <script src="https://unpkg.com/@babel/standalone/babel.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/dayjs/1.11.11/dayjs.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/antd/5.17.3/antd.min.js"></script>
- Use this script to include Tailwind: <script src="https://cdn.tailwindcss.com"></script>
"""
```

This prompt mainly instructs the LLM to generate frontend code for the specified tech stack based on the input image. We can support different frontend tech stacks by adjusting the prompt.
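As a hypothetical sketch of that idea, the per-stack prompts could be kept in a simple registry so that adding a stack only means adding an entry. The names below (`SYSTEM_PROMPTS`, `get_system_prompt`, the stack keys) are illustrative, not the identifiers the repo actually uses:

```python
# Hypothetical registry mapping a tech-stack id to its system prompt.
# The prompt strings are truncated placeholders for this sketch.
SYSTEM_PROMPTS = {
    "html_tailwind": "You're an experienced HTML/Tailwind developer...",
    "react_tailwind": "You're an experienced React/Tailwind developer...",
    "react_tailwind_antd": (
        "You're an experienced React/Tailwind developer who builds single "
        "page apps using React, Tailwind CSS, and Ant Design..."
    ),
}


def get_system_prompt(stack: str) -> str:
    """Look up the system prompt for a tech stack, failing loudly on typos."""
    if stack not in SYSTEM_PROMPTS:
        raise ValueError(f"Unsupported stack: {stack}")
    return SYSTEM_PROMPTS[stack]
```

With a structure like this, supporting a new stack is a data change rather than a code change.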

Calling Gemini API in the Backend Service

The core logic for generating code is located in backend/routes/generate_code.py. Add a branch there to handle calls to Gemini. stream_gemini_response transforms the prompt, calls the Gemini API, and streams back the response, which is the frontend code generated by the LLM.

```python
from typing import Awaitable, Callable, List

import google.generativeai as genai
from openai.types.chat import ChatCompletionMessageParam


async def stream_gemini_response(
    messages: List[ChatCompletionMessageParam],
    api_key: str,
    callback: Callable[[str], Awaitable[None]],
) -> str:
    """
    Stream the Gemini response by generating content based on the given messages.

    Parameters:
        messages (List[ChatCompletionMessageParam]): A list of chat completion messages.
        api_key (str): The API key for the Gemini model.
        callback (Callable[[str], Awaitable[None]]): A callback to handle each generated chunk.

    Returns:
        str: The full generated response text.
    """
    genai.configure(api_key=api_key)

    generation_config = genai.GenerationConfig(temperature=0.0)
    model = genai.GenerativeModel(
        model_name="gemini-1.5-pro-latest",
        generation_config=generation_config,
    )
    contents = parse_openai_to_gemini_prompt(messages)

    response = model.generate_content(
        contents=contents,
        # Support streaming
        stream=True,
    )

    for chunk in response:
        content = chunk.text or ""
        await callback(content)

    if not response:
        raise Exception("No HTML response found in AI response")
    return response.text
```
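For context, the branch mentioned above might be wired up roughly like the sketch below. The `stream_*_stub` functions are stand-ins for the real streaming functions, and `route_completion` is a hypothetical name; the actual dispatch code in generate_code.py differs:

```python
import asyncio


# Stub standing in for the real Gemini streaming function above.
async def stream_gemini_response_stub(messages, api_key, callback):
    completion = "<html>generated by Gemini</html>"
    await callback(completion)
    return completion


# Stub standing in for the project's existing OpenAI streaming function.
async def stream_openai_response_stub(messages, api_key, callback):
    completion = "<html>generated by GPT</html>"
    await callback(completion)
    return completion


async def route_completion(model_name, messages, api_keys, callback):
    # Route any gemini-* model to the new Gemini branch; everything else
    # falls through to the existing OpenAI path.
    if model_name.startswith("gemini"):
        return await stream_gemini_response_stub(messages, api_keys["gemini"], callback)
    return await stream_openai_response_stub(messages, api_keys["openai"], callback)
```

The key point is simply that the chosen model name selects which streaming function runs; each function shares the same (messages, api_key, callback) shape, so adding a provider is one extra branch.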

The function parse_openai_to_gemini_prompt converts the OpenAI prompt into the Gemini prompt format.
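A minimal, text-only sketch of what such a conversion could look like is below. It assumes OpenAI-style messages (`role` plus string or list-of-parts `content`) and produces Gemini-style `{"role", "parts"}` dicts, folding the system message into the first user turn since Gemini's content list has no system role; the repo's real parse_openai_to_gemini_prompt also handles images and may differ in detail:

```python
def parse_openai_to_gemini_prompt(messages):
    """Convert OpenAI-style chat messages to Gemini-style contents (text only)."""
    contents = []
    system_text = ""
    for message in messages:
        role = message["role"]
        content = message["content"]
        # Flatten OpenAI's list-of-parts content into plain text for this sketch
        # (image parts are ignored here).
        if isinstance(content, list):
            content = " ".join(
                part.get("text", "") for part in content if part.get("type") == "text"
            )
        if role == "system":
            # Gemini has no system role in contents; stash it for the next user turn.
            system_text += content
            continue
        gemini_role = "model" if role == "assistant" else "user"
        if gemini_role == "user" and system_text:
            content = system_text + "\n\n" + content
            system_text = ""
        contents.append({"role": gemini_role, "parts": [content]})
    return contents
```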


Note: install the Gemini SDK first:

```sh
cd backend
poetry add google-generativeai
```


Conclusion and Future Plan

The overall code modifications are not too complex. MarsCode's configuration-free environment, AI Assistant, Networking, and deployment features saved me a lot of time! Next, I will continue to optimize the quality of the generated code by adjusting the LLMs and prompts, and compare the results of Gemini and GPT-4o to choose the best model and further reduce my time spent on UI development.