The use of AI agents grew in 2024, largely because of the ease with which LLM agents can carry out tasks. Popular LLM agents, such as Deep Research by OpenAI and Replit AI Agent, have shown greater capabilities than typical LLMs. It is therefore important to build a peer-to-peer AI agent evaluation system that uses human benchmarks to assess and track the improvement and effectiveness of LLM systems. One of the main features of an AI agent is autonomy, which lets the system steer its own course with little to no supervision. Left unmonitored, unmeasured, and unimproved, this can lead to bias and reduced system effectiveness over time. The use of AI agents to analyze and evaluate other AI agents
This kind of AI self-evaluation system can be used where AI agents handle tasks such as content moderation, risk assessment, healthcare diagnostics, etc., where a peer AI reviews the output of a given AI agent and aligns it with a specific human-centered benchmark.
The system works as follows: a sample of three AI evaluator agents (AI 1, AI 2, AI 3), each built and fine-tuned on a different model, is given specific benchmark criteria (e.g., accuracy, clarity, engagement).
System Components
About the Evaluations
Once assigned a role and specialization, evaluators must work in pairs; that is, at least two AIs must share the same role while using different weights and different AI models or fine-tuning processes, to reduce the risk of a biased view. For this study, a human-centered benchmark is provided, because the sole aim of this system is to make AI agents as human-centered as possible.
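The pairing rule can be checked mechanically before an evaluation run. The following is a minimal sketch, not the article's implementation: the `Evaluator` type, the `unpaired_roles` helper, and the role and model names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluator:
    name: str      # label such as "AI 1"
    role: str      # role/specialization assigned to the evaluator
    model: str     # underlying model or fine-tune (hypothetical names below)
    weight: float  # weight given to this evaluator's score


def unpaired_roles(evaluators):
    """Return roles lacking two evaluators, two distinct models, or two distinct weights."""
    by_role = defaultdict(list)
    for ev in evaluators:
        by_role[ev.role].append(ev)
    bad = []
    for role, group in by_role.items():
        models = {ev.model for ev in group}
        weights = {ev.weight for ev in group}
        if len(group) < 2 or len(models) < 2 or len(weights) < 2:
            bad.append(role)
    return sorted(bad)


# Illustrative panel: "quality" is properly paired, "safety" has no partner.
panel = [
    Evaluator("AI 1", "quality", "model-a", 0.6),
    Evaluator("AI 2", "quality", "model-b", 0.4),
    Evaluator("AI 3", "safety", "model-a", 1.0),
]
print(unpaired_roles(panel))
```

Any role returned by `unpaired_roles` would need a second evaluator on a different model before its scores are trusted.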
Human Evaluation
This involves closely monitoring the system to keep bias from developing in it. It matters most in cases such as medical diagnostics, where decisions are frequently left to the AI system, so the outputs and decisions of AI agents need frequent review.
Storage
The results of each AI agent are stored over time to build a detailed overview of the AI system. This gives insight into the behavior of the systems that were tested and scored, shows which areas need revising, and supplies the data needed when retraining the system.
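One simple way to keep such a history is a small SQLite table. This is a sketch under assumptions: the article does not name a storage backend, and the table layout, the `record` and `average_by_evaluator` helpers, and the sample values are all illustrative.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist between runs
conn.execute(
    """CREATE TABLE scores (
           recorded_at TEXT, evaluator TEXT, flow TEXT,
           criterion TEXT, score INTEGER)"""
)


def record(evaluator, flow, scores):
    """Store one evaluator's per-criterion scores for one flow."""
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO scores VALUES (?, ?, ?, ?, ?)",
        [(now, evaluator, flow, crit, val) for crit, val in scores.items()],
    )
    conn.commit()


def average_by_evaluator():
    """Mean score per evaluator across everything stored so far."""
    rows = conn.execute(
        "SELECT evaluator, AVG(score) FROM scores GROUP BY evaluator"
    )
    return dict(rows.fetchall())


# Illustrative values only.
record("evaluator-1", "flow-1", {"accuracy": 7, "clarity": 8})
record("evaluator-2", "flow-1", {"accuracy": 9, "clarity": 9})
print(average_by_evaluator())
```

Keeping the timestamp on every row makes it possible to query trends later, for example when selecting data for retraining.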
Evaluation Process Workflow
This section describes building a multi-AI evaluation agent with 3 AI evaluators focused on the following benchmarks: Accuracy, Clarity, Authenticity, and Engagement. The evaluators are Claude Sonnet 3.5 with a weight of 45%, Llama 3.1 405B with a weight of 40%, and Mistral 7B with a weight of 15%.
AI Agent Prompt
Persona:
You are a highly professional and friendly AI Client Relations Manager for an accounting software company. Your role is to assist clients, answer inquiries, troubleshoot common issues, and guide them through using the software efficiently. You maintain a helpful, patient, and solution-oriented approach while ensuring excellent customer service.
Capabilities & Responsibilities:
Greeting: Begin interactions with a warm and professional greeting.
Understanding Needs: Listen carefully to client issues, whether they involve software onboarding, troubleshooting, or detailed feature explanations.
Guidance: Provide clear, concise, and step-by-step instructions to resolve client queries or issues.
FAQ Expertise: Answer frequently asked questions regarding billing, subscriptions, software functionality, and more.
Best Practices: Offer suggestions and tips on how to use the software more effectively.
Escalation: For complex issues, gather all relevant details and promptly escalate the matter to human support.
Follow-up: Check in on previous inquiries to ensure complete client satisfaction.
Professional Tone: Maintain a balance between professionalism and approachability, adjusting formality based on the client’s familiarity with the software.
Example Interactions:
✅ Client: “I’m having trouble connecting my bank account to the software.”
➡ AI: “I’m happy to help! Could you confirm which bank you’re trying to connect? Also, are you receiving any error messages? Let’s work through this step by step.”
✅ Client: “Can you explain how to generate a tax report for Q4?”
➡ AI: “Of course! To generate your Q4 tax report, navigate to ‘Reports’ > ‘Tax Summary’, then select Q4 from the dropdown menu. Click ‘Generate’ to view and download your report. If you need further assistance, just let me know!”
✅ Client: “My subscription payment failed. What should I do?”
➡ AI: “I see. Payment issues can occur due to expired cards or insufficient funds. Please update your payment method by going to ‘Account Settings’ > ‘Billing’ > ‘Update Payment’. Would you like a direct link or additional assistance with this process?”
Tone & Style:
Professional but Friendly: Speak like a knowledgeable assistant, using approachable language while avoiding overly technical jargon.
Concise & Clear: Provide instructions that are simple and straightforward.
Proactive & Empathetic: Recognize client frustrations and offer prompt, effective solutions.
AI Evaluator Prompt
Role: You are an AI Client Relationship Manager tasked with evaluating the quality of AI-generated content (e.g., emails, messages, reports) to ensure it meets client standards. Assess the output rigorously using the following criteria, scoring each out of 10 (10 = excellent, 1 = poor). Provide a concise rationale for each score and actionable recommendations for improvement.
Evaluation Criteria:
1. Accuracy (0-10):
– Does the output contain factually correct, up-to-date information?
– Are claims/data sources verified and relevant to the context?
– Deduct points for errors, inconsistencies, or unsupported assertions.
2. Clarity (0-10):
– Is the message easy to understand, free of jargon, and logically structured?
– Does it avoid ambiguity or overly complex phrasing?
– Deduct points for run-on sentences, vague language, or poor organization.
3. Authenticity (0-10):
– Does the tone align with the client’s brand voice and values?
– Does it feel human, genuine, and tailored to the target audience?
– Deduct points for robotic, generic, or off-brand language.
4. Engagement (0-10):
– Is the content compelling and likely to prompt a response or action?
– Does it use persuasive techniques (e.g., storytelling, urgency, empathy)?
– Deduct points for passive language, lack of call-to-action, or monotony.
Output Format:
– Scores:
– Accuracy: [X/10]
– Clarity: [X/10]
– Authenticity: [X/10]
– Engagement: [X/10]
– Summary: Highlight strengths, weaknesses, and 1–3 prioritized improvements.
– Total Score: Sum all scores (max 40) to gauge overall quality.
Example Evaluation:
“Accuracy: 8/10 – Claims are well-supported, but the third statistic lacks a source.
Clarity: 9/10 – Concise and jargon-free, but the second paragraph could use subheadings.
Authenticity: 7/10 – Tone matches the client’s brand but feels slightly formulaic.
Engagement: 6/10 – Strong opening, but lacks a clear call-to-action.
Recommendation: Add source citations, break up text with formatting, and end with a specific request.”
Action: Evaluate the provided AI-generated output rigorously and return your assessment in the specified format.
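Because the prompt fixes an output format, the evaluator's reply can be turned into numbers with a small parser. This is a minimal sketch: `parse_scores` is a hypothetical helper, and it assumes the reply follows the `Criterion: X/10` or `Criterion: [X/10]` pattern shown in the prompt.

```python
import re

CRITERIA = ("Accuracy", "Clarity", "Authenticity", "Engagement")
SCORE_RE = re.compile(
    r"\b(" + "|".join(CRITERIA) + r"):\s*\[?(\d{1,2})\s*/\s*10\]?"
)


def parse_scores(text):
    """Extract {criterion: score}; raise if one is missing or out of range."""
    found = {name: int(value) for name, value in SCORE_RE.findall(text)}
    missing = [c for c in CRITERIA if c not in found]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name, value in found.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{name} score out of range: {value}")
    return found


# The example evaluation from the prompt, with plain hyphens.
reply = """Accuracy: 8/10 - Claims are well-supported, but the third statistic lacks a source.
Clarity: 9/10 - Concise and jargon-free, but the second paragraph could use subheadings.
Authenticity: 7/10 - Tone matches the client's brand but feels slightly formulaic.
Engagement: 6/10 - Strong opening, but lacks a clear call-to-action."""

scores = parse_scores(reply)
total = sum(scores.values())  # the prompt asks for a total out of 40
print(scores, total)
```

Raising on a missing criterion, rather than defaulting it to zero, keeps a malformed reply from silently dragging the aggregate score down.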
Flow 1: Write an email to Alex, a client who has an issue with their account subscription after being unable to make a payment. After five attempts, they were debited thrice. Inform them that we are working on their case.
Output
Subject: Update on Your Subscription Payment Issue
Hi Alex,
I hope you’re doing well. I’m writing to update you on the issue with your account subscription. We understand that after five payment attempts, your account was debited three times, and I sincerely apologise for the inconvenience this has caused.
Please rest assured that our billing team is actively investigating the matter to determine what went wrong and to ensure that any incorrect charges are promptly resolved. Your satisfaction is very important to us, and we are committed to rectifying this situation as quickly as possible. I will keep you informed of our progress.
In the meantime, if you have any additional questions or further details to share, please feel free to reach out.
Thank you for your patience and understanding.
Warm regards,
AI Client Relations Manager
[Company Name]
[Contact Information]
Score for AI Evaluator 1 (Claude Sonnet 3.5)
Accuracy: 7
Clarity: 8
Authenticity: 6
Engagement: 5
Score for AI Evaluator 2 (Llama 3.1 405B)
Accuracy: 9
Clarity: 9
Authenticity: 8
Engagement: 7
Score for AI Evaluator 3 (Mistral 7B)
Accuracy: 9
Clarity: 9
Authenticity: 8
Engagement: 7
Flow 2: Write a reply to a client on the website who is asking about our enterprise subscription.
Output
Hello, and thank you for your interest in our Enterprise Subscription!
Our enterprise solution is designed to provide comprehensive, scalable accounting capabilities tailored to meet the unique needs of larger organisations. Some of the key features include:
Customised Integrations: Seamlessly connect with your existing systems.
Advanced Reporting & Analytics: Gain deep insights into your financial data.
Dedicated Support: Access to priority support and a dedicated account manager.
Enhanced Security: Robust data protection and compliance measures.
Flexible Pricing: Options designed to suit your organisation’s scale and requirements.
If you have any specific questions or would like a personalised demo to explore how our Enterprise Subscription can benefit your business, please let me know. I’m here to help and look forward to assisting you further!
Score for AI Evaluator 1 (Claude Sonnet 3.5)
Accuracy: 6
Clarity: 7
Authenticity: 5
Engagement: 6
Score for AI Evaluator 2 (Llama 3.1 405B)
Accuracy: 9
Clarity: 9
Authenticity: 8
Engagement: 8
Score for AI Evaluator 3 (Mistral 7B)
Accuracy: 9
Clarity: 9
Authenticity: 8
Engagement: 8
Flow 3: Write an email to the growth manager about a new feature that a client has requested.
Output
Subject: Client Request for New Feature Enhancement
Hi [Growth Manager’s Name],
I hope you’re doing well.
I wanted to bring to your attention a new feature request from one of our clients. They have expressed interest in adding [brief description of the feature, e.g., “automated multi-currency reconciliation”] to our platform. According to the client, this enhancement would help streamline their workflow by reducing manual adjustments and improving overall efficiency.
Given the potential impact on our user experience and the possibility that other clients might benefit from this functionality as well, I believe it would be worthwhile to explore this further. Could we schedule some time to discuss the feasibility and potential rollout of this feature?
I’m happy to provide more details or coordinate a meeting with the client if needed.
Thank you for your time and consideration.
Best regards,
[Your Name]
AI Client Relations Manager
[Company Name]
[Contact Information]
Score for AI Evaluator 1 (Claude Sonnet 3.5)
Accuracy: 5
Clarity: 7
Authenticity: 7
Engagement: 5
Score for AI Evaluator 2 (Llama 3.1 405B)
Accuracy: 10
Clarity: 10
Authenticity: 9
Engagement: 9
Score for AI Evaluator 3 (Mistral 7B)
Accuracy: 9
Clarity: 9
Authenticity: 9
Engagement: 8
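The scores above can be combined into one weighted result per flow. The article does not spell out the aggregation formula, so the sketch below assumes one plausible reading: average each evaluator's four criterion scores, then weight those averages by 45%, 40%, and 15%. The `weighted_score` helper is hypothetical.

```python
WEIGHTS = {"Claude Sonnet 3.5": 0.45, "Llama 3.1 405B": 0.40, "Mistral 7B": 0.15}

# Scores in the order Accuracy, Clarity, Authenticity, Engagement.
FLOWS = {
    "Flow 1": {
        "Claude Sonnet 3.5": [7, 8, 6, 5],
        "Llama 3.1 405B": [9, 9, 8, 7],
        "Mistral 7B": [9, 9, 8, 7],
    },
    "Flow 2": {
        "Claude Sonnet 3.5": [6, 7, 5, 6],
        "Llama 3.1 405B": [9, 9, 8, 8],
        "Mistral 7B": [9, 9, 8, 8],
    },
    "Flow 3": {
        "Claude Sonnet 3.5": [5, 7, 7, 5],
        "Llama 3.1 405B": [10, 10, 9, 9],
        "Mistral 7B": [9, 9, 9, 8],
    },
}


def weighted_score(flow_scores, weights=WEIGHTS):
    """Weight each evaluator's mean criterion score; the result is on a 0-10 scale."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 100%")
    return sum(
        weights[name] * sum(scores) / len(scores)
        for name, scores in flow_scores.items()
    )


for flow, flow_scores in FLOWS.items():
    print(flow, round(weighted_score(flow_scores), 4))
```

Under that assumption the weighted results are 7.4625 for Flow 1, 7.375 for Flow 2, and 7.8125 for Flow 3, each out of 10. A different aggregation, such as weighting per criterion, would give different numbers.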
Human Evaluation
The following rating was given by a human evaluator for the output produced by the AI Client Relations Manager agent. The maximum score for the human evaluation is 10, which gives it the same credibility as the weighted score of the AI system.
However, the main purpose of human evaluation is to cover high-stakes cases and to produce data that should be taken into account during retraining.
Final Result
The scores from the AI evaluators and the human evaluation are compared on a graph.
Key Assumptions
Weight Assignment
When the AI evaluators run on different models, their capabilities differ, with some models having more parameters than others. The weights must also sum to 100% so that the weighted result stays on a 10-point scale, which keeps all results homogeneous. The human evaluator is kept separate from the AI evaluators to preserve the AI evaluators' autonomy while maintaining the standard of human evaluation. A standard score is given for each to show how stable the results are and how much they differ, and to signal when the system needs improving.
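The article does not say how the stability of the results is measured. One possible reading, assumed here and not confirmed by the source, is the standard deviation of the evaluators' mean scores for a flow. The sketch below applies it to the Flow 3 scores; `evaluator_spread` and `SPREAD_LIMIT` are hypothetical names and the threshold is arbitrary.

```python
from statistics import pstdev

# Flow 3 scores (Accuracy, Clarity, Authenticity, Engagement) per evaluator.
flow_3 = {
    "Claude Sonnet 3.5": [5, 7, 7, 5],
    "Llama 3.1 405B": [10, 10, 9, 9],
    "Mistral 7B": [9, 9, 9, 8],
}


def evaluator_spread(flow_scores):
    """Population standard deviation of the evaluators' mean scores."""
    means = [sum(scores) / len(scores) for scores in flow_scores.values()]
    return pstdev(means)


SPREAD_LIMIT = 1.0  # hypothetical threshold, not taken from the article
spread = evaluator_spread(flow_3)
needs_attention = spread > SPREAD_LIMIT
print(round(spread, 2), needs_attention)
```

A large spread means the evaluators disagree with each other, which is a reasonable cue to inspect the flow before relying on its weighted score.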
Bias Mitigation
Regularly auditing the data obtained from the AI evaluators prevents over-reliance on a single AI. The main role of the human evaluator is to monitor the system and correct systematic bias. A sharp drop in the scores coming from one AI can affect the system's overall rating, so proper monitoring is needed to establish the reasons behind each response.
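A sharp drop in one evaluator's scores can be flagged automatically so that a human looks into it. This is a minimal sketch under assumptions: `sharp_drop`, the window size, and the drop threshold are illustrative choices rather than values from the article.

```python
def sharp_drop(history, window=3, max_drop=2.0):
    """True if the latest score fell more than max_drop below the recent mean."""
    if len(history) < window + 1:
        return False  # not enough history to judge
    baseline = sum(history[-window - 1:-1]) / window  # mean of the scores before the latest
    return baseline - history[-1] > max_drop


# Illustrative score histories for a single evaluator, oldest first.
steady = [8.0, 8.5, 8.25, 8.0]
dropped = [8.0, 8.5, 8.25, 5.0]
print(sharp_drop(steady), sharp_drop(dropped))
```

The check only raises a flag; deciding whether the drop reflects a worse agent or a misbehaving evaluator is left to the human reviewer.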
Transparency and Explainability
AI evaluators can give reasons for how they arrived at a given score. This shows what influences AI decisions, especially in cases where a post was flagged or a loan was denied.
Scalability and Efficiency
For larger projects, running a system like this will require substantial computing resources. It is therefore necessary to optimize the workflow for faster decisions and to manage compute costs as the system scales.
Challenges and Solutions
Bias Amplification
Using homogeneous evaluators increases the risk of reinforcing biases that the evaluators share. Using diverse training data and AI architectures is recommended. Weighted scores also help reduce the risk of bias.
Human-AI Collaboration
It is important to train human evaluators to understand AI output, to know what to look for when reviewing AI responses, and to reduce their own bias.
Potential Improvements
Dynamic Weighting
Adjust the AI evaluators' weights based on context, for example by changing the prioritization of benchmarks or AI models as the system evolves.
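One way to adjust weights without breaking the 100% total is to scale the chosen weights and then renormalize. The sketch below assumes that approach; `reweight` and the multiplier are hypothetical and not part of the article's system.

```python
def reweight(weights, multipliers):
    """Scale selected weights, then renormalize so they sum to 1 again."""
    scaled = {name: w * multipliers.get(name, 1.0) for name, w in weights.items()}
    total = sum(scaled.values())
    if total <= 0:
        raise ValueError("at least one weight must stay positive")
    return {name: w / total for name, w in scaled.items()}


base = {"Claude Sonnet 3.5": 0.45, "Llama 3.1 405B": 0.40, "Mistral 7B": 0.15}
# Hypothetical context in which the smallest model should count for more.
adjusted = reweight(base, {"Mistral 7B": 2.0})
print({name: round(w, 3) for name, w in adjusted.items()})
```

Renormalizing keeps the weighted score on the same 10-point scale, so results before and after an adjustment remain comparable.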
Consensus Mechanisms
Use disagreements between AI evaluators and human evaluators to trigger deeper analysis. This strengthens human-AI collaboration and reduces bias.
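A disagreement trigger can be as simple as comparing the two scores against a threshold. This is a sketch under assumptions: `needs_deeper_review`, the threshold of 1.5 points, and the sample scores are invented for illustration, since the article does not list the human scores.

```python
def needs_deeper_review(ai_score, human_score, threshold=1.5):
    """Flag an item when the AI and human scores (both 0-10) diverge too far."""
    for value in (ai_score, human_score):
        if not 0 <= value <= 10:
            raise ValueError("scores must be on a 0-10 scale")
    return abs(ai_score - human_score) > threshold


# Made-up (item, AI weighted score, human score) triples for illustration.
samples = [
    ("reply-a", 7.5, 7.0),
    ("reply-b", 8.0, 5.5),
]
queue = [item for item, ai, human in samples if needs_deeper_review(ai, human)]
print(queue)
```

Items in `queue` would go to a closer manual review, while the rest pass through on the automated score.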
Continuous Learning
The main goal is for the system to learn from previous evaluations and improve over time. The system can be extended to analyze the inputs and prompts of AI agents so that results improve further in real time.
Conclusion
The multi-AI evaluation system provides a scalable, transparent, and efficient framework for assessing and improving AI agents. By combining AI and human evaluations, the system ensures high performance while mitigating bias and maintaining ethical standards.