Hello! Today I’d like to explain how to solve one of the most challenging tasks in NLP: question answering. We’ll be labeling the SQuAD2.0 dataset with the help of Toloka-Kit, a Python library for data labeling projects that helps data scientists and ML engineers build scalable ML pipelines. Feel free to go with a different option, such as Vertex AI. Let’s dive right in.
The Stanford Question Answering Dataset (SQuAD) is used to test NLP models and their ability to understand natural language. SQuAD2.0 consists of a set of paragraphs from Wikipedia articles, along with 100,000 question-answer pairs derived from these paragraphs, and 50,000 unanswerable questions. To show good results on SQuAD2.0, a model must not only answer questions correctly, but also determine whether a question has an answer in the first place, and refrain from responding if it doesn’t.
SQuAD2.0 is the most popular question answering dataset: it’s been cited in over 1000 articles, and in the three years since its release, 85 models have been published on its leaderboard.
Our task is to get the correct answer to a question based on a fragment of a Wikipedia article. The answer is a segment of text from the corresponding passage, or the question may not have an answer at all. Here’s an example of text, question, and answer:
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny’s Child. Managed by her father, Mathew Knowles, the group became one of the world’s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé’s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles “Crazy in Love” and “Baby Boy”.
question: When did Beyonce start becoming popular?
answer: [in the late 1990s]
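To make the format concrete, here is a minimal sketch (my own illustration, not part of the tutorial code) of how SQuAD stores an answer: the answer text plus its character offset in the paragraph, so the answer can always be recovered by slicing the context.
# Minimal illustration of the SQuAD answer format (hypothetical shortened context)
context = "... and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child."
answer = {'text': 'in the late 1990s', 'answer_start': context.index('in the late 1990s')}
# The answer is always a span of the context, recoverable by slicing
assert context[answer['answer_start']:answer['answer_start'] + len(answer['text'])] == answer['text']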
Crowdsourcing can be extremely useful in solving Q&A tasks. If you’re building a virtual assistant, a chatbot, or any other system that’s supposed to answer questions posed in natural language, you need to train your model on a dataset like SQuAD2.0. But using an open dataset is not always an option (for instance, there may be nothing available in the language you’re working with). You can use crowdsourcing to build your own dataset and make your labeling process easier.
Our labeling pipeline will consist of two projects: a main project where performers find answers in the text, and a verification project where other performers check those answers. First, let’s set up the environment and connect to Toloka:
token = input("Enter your token:")
if token == '':
    print('The token you entered may be invalid. Please try again.')
else:
    print('OK')
# Prepare an environment and import everything we need
!pip install toloka-kit==0.1.3
import datetime
import json
import time
import toloka.client as toloka
import toloka.client.project.template_builder as tb
# Create a Toloka client instance
# All API calls will pass through it
toloka_client = toloka.TolokaClient(token, 'PRODUCTION') # or switch to SANDBOX
# We check the money available in your account, which also checks the validity of the OAuth token
requester = toloka_client.get_requester()
# How much money you need for one question
PRICE_PER_TASK = 0.2
tasks_num = int(input("Enter the number of questions:"))
print('You have enough money on your account:', requester.balance >= tasks_num * PRICE_PER_TASK)
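As a rough sanity check on this budget (a back-of-the-envelope estimate of my own, assuming the pool settings used later in this tutorial and ignoring rejections and platform fees), the cost per question should land well below PRICE_PER_TASK:
# Rough per-question cost estimate (assumes the pool settings used later in this tutorial)
marking_cost = 0.02 * 3 / 4   # $0.02 per page of 5 tasks (4 real + 1 golden), overlap 3 -> ~$0.015 per question
verification_cost = 0.01 * 3  # ~3 answers per question to verify, ~$0.01 per verification task -> ~$0.03 per question
print(marking_cost + verification_cost)  # ~$0.045 per question, well below PRICE_PER_TASK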
# Download datasets
!curl https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json --output train-v2.0.json
!curl https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json --output dev-v2.0.json
with open('dev-v2.0.json') as f:
    data = json.load(f)
with open('train-v2.0.json') as f:
    train_data = json.load(f)
Our dataset is a collection of texts and questions, where each question comes with a list of possible answers. Let’s look at the first entry:
{
    'title': data['data'][0]['title'],
    # Printing only the first paragraph for review
    'paragraphs': [data['data'][0]['paragraphs'][0]]
}
{'title': 'Normans',
'paragraphs': [{'qas': [{'question': 'In what country is Normandy located?',
'id': '56ddde6b9a695914005b9628',
'answers': [{'text': 'France', 'answer_start': 159},
{'text': 'France', 'answer_start': 159},
{'text': 'France', 'answer_start': 159},
{'text': 'France', 'answer_start': 159}],
'is_impossible': False},
{'question': 'When were the Normans in Normandy?',
'id': '56ddde6b9a695914005b9629',
'answers': [{'text': '10th and 11th centuries', 'answer_start': 94},
{'text': 'in the 10th and 11th centuries', 'answer_start': 87},
{'text': '10th and 11th centuries', 'answer_start': 94},
{'text': '10th and 11th centuries', 'answer_start': 94}],
'is_impossible': False},
{'question': 'From which countries did the Norse originate?',
'id': '56ddde6b9a695914005b962a',
'answers': [{'text': 'Denmark, Iceland and Norway', 'answer_start': 256},
{'text': 'Denmark, Iceland and Norway', 'answer_start': 256},
{'text': 'Denmark, Iceland and Norway', 'answer_start': 256},
{'text': 'Denmark, Iceland and Norway', 'answer_start': 256}],
'is_impossible': False},
{'question': 'Who was the Norse leader?',
'id': '56ddde6b9a695914005b962b',
'answers': [{'text': 'Rollo', 'answer_start': 308},
{'text': 'Rollo', 'answer_start': 308},
{'text': 'Rollo', 'answer_start': 308},
{'text': 'Rollo', 'answer_start': 308}],
'is_impossible': False},
{'question': 'What century did the Normans first gain their separate identity?',
'id': '56ddde6b9a695914005b962c',
'answers': [{'text': '10th century', 'answer_start': 671},
{'text': 'the first half of the 10th century', 'answer_start': 649},
{'text': '10th', 'answer_start': 671},
{'text': '10th', 'answer_start': 671}],
'is_impossible': False},
{'plausible_answers': [{'text': 'Normans', 'answer_start': 4}],
'question': "Who gave their name to Normandy in the 1000's and 1100's",
'id': '5ad39d53604f3c001a3fe8d1',
'answers': [],
'is_impossible': True},
{'plausible_answers': [{'text': 'Normandy', 'answer_start': 137}],
'question': 'What is France a region of?',
'id': '5ad39d53604f3c001a3fe8d2',
'answers': [],
'is_impossible': True},
{'plausible_answers': [{'text': 'Rollo', 'answer_start': 308}],
'question': 'Who did King Charles III swear fealty to?',
'id': '5ad39d53604f3c001a3fe8d3',
'answers': [],
'is_impossible': True},
{'plausible_answers': [{'text': '10th century', 'answer_start': 671}],
'question': 'When did the Frankish identity emerge?',
'id': '5ad39d53604f3c001a3fe8d4',
'answers': [],
'is_impossible': True}],
'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'}]}
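As an optional sanity check (my own addition, not required for the pipeline), we can count how many dev-split questions are answerable using the is_impossible flag shown above:
# Count answerable vs. unanswerable questions in the dev split
answerable = 0
unanswerable = 0
for article in data['data']:
    for paragraph in article['paragraphs']:
        for qa in paragraph['qas']:
            if qa['is_impossible']:
                unanswerable += 1
            else:
                answerable += 1
print(f'Dev split: {answerable} answerable and {unanswerable} unanswerable questions')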
In this project, performers look for the answer to each question in the text. If the text contains an answer, they paste it; otherwise, they mark the question as unanswerable.
# How performers will see the task
radio_group_field = tb.fields.RadioGroupFieldV1(
    data=tb.data.OutputData(path='is_possible'),
    label='Does the text contain an answer?',
    validation=tb.conditions.RequiredConditionV1(),
    options=[
        tb.fields.GroupFieldOption(label='Yes', value='yes'),
        tb.fields.GroupFieldOption(label='No', value='no')
    ]
)
helper = tb.helpers.IfHelperV1(
    condition=tb.conditions.EqualsConditionV1(
        to='yes',
        data=tb.data.OutputData(path='is_possible')
    ),
    then=tb.fields.TextareaFieldV1(
        data=tb.data.OutputData(path='answer'),
        label='Paste an answer',
        validation=tb.conditions.RequiredConditionV1()
    )
)
project_interface = toloka.project.view_spec.TemplateBuilderViewSpec(
    config=tb.TemplateBuilder(
        view=tb.view.ListViewV1(
            items=[
                tb.view.TextViewV1(label='Text', content=tb.data.InputData(path='text')),
                tb.view.TextViewV1(label='Question', content=tb.data.InputData(path='question')),
                tb.view.ListViewV1(items=[radio_group_field, helper])
            ]
        )
    )
)
public_instruction = open('marking_public_instruction.html').read().strip()
# Set up the project
marking_project = toloka.project.Project(
    assignments_issuing_type=toloka.project.Project.AssignmentsIssuingType.AUTOMATED,
    public_name='Find the answer in the text',
    public_description='Read the text and find the text fragment that answers the question',
    public_instructions=public_instruction,
    # Set up the task: view, input, and output parameters
    task_spec=toloka.project.task_spec.TaskSpec(
        input_spec={
            'text': toloka.project.field_spec.StringSpec(),
            'question': toloka.project.field_spec.StringSpec(),
            'question_id': toloka.project.field_spec.StringSpec(required=False)
        },
        output_spec={
            'answer': toloka.project.field_spec.StringSpec(required=False),
            'is_possible': toloka.project.field_spec.StringSpec(allowed_values=['yes', 'no'])
        },
        view_spec=project_interface,
    ),
)
# Call the API to create a new project
# If you have already created all pools and projects you can just get it using toloka_client.get_project('your marking project id')
marking_project = toloka_client.create_project(marking_project)
print(f'Created marking project with id {marking_project.id}')
print(f'To view the project, go to: https://toloka.yandex.com/requester/project/{marking_project.id}')
How performers will see the tasks
How performers see the instructions
Next, we want to create a training pool to help performers complete the tasks correctly. We will add several training tasks and require performers to complete them before moving on to the real tasks.
# Set up the training pool
marking_training = toloka.training.Training(
    project_id=marking_project.id,
    private_name='SQUAD2.0 training',
    may_contain_adult_content=True,
    assignment_max_duration_seconds=10000,
    mix_tasks_in_creation_order=True,
    shuffle_tasks_in_task_suite=True,
    training_tasks_in_task_suite_count=3,
    task_suites_required_to_pass=1,
    retry_training_after_days=1,
    inherited_instructions=True,
    public_instructions='',
)
marking_training = toloka_client.create_training(marking_training)
print(f'Created training with id {marking_training.id}')
print(f'To view the training, go to: https://toloka.yandex.com/requester/project/{marking_project.id}/training/{marking_training.id}')
We need to upload training tasks with hints that help performers find the correct answers.
training_tasks = [
toloka.task.Task(
input_values={
'question_id': '56be85543aeaaa14008c9063',
'question': 'When did Beyonce start becoming popular?',
'text': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'
},
known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_possible': 'yes', 'answer': 'in the late 1990s'})],
message_on_unknown_solution='the answer can be found after "and rose to fame..."',
infinite_overlap=True,
pool_id=marking_training.id
),
toloka.task.Task(
input_values={
'question_id': '56be86cf3aeaaa14008c9076',
'question': 'After her second solo album, what other entertainment venture did Beyonce explore?',
'text': 'Following the disbandment of Destiny\'s Child in June 2005, she released her second solo album, B\'Day (2006), which contained hits "Déjà Vu", "Irreplaceable", and "Beautiful Liar". Beyoncé also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for "Single Ladies (Put a Ring on It)". Beyoncé took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyoncé (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.'
},
known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_possible': 'yes', 'answer': 'acting'})],
message_on_unknown_solution='the answer can be found before "... with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009)"',
infinite_overlap=True,
pool_id=marking_training.id
),
toloka.task.Task(
input_values={
'question_id': '5a8d7bf7df8bba001a0f9ab1',
'question': 'What category of game is Legend of Zelda: Australia Twilight?',
'text': 'The Legend of Zelda: Twilight Princess (Japanese: ゼルダの伝説 トワイライトプリンセス, Hepburn: Zeruda no Densetsu: Towairaito Purinsesu?) is an action-adventure game developed and published by Nintendo for the GameCube and Wii home video game consoles. It is the thirteenth installment in the The Legend of Zelda series. Originally planned for release on the GameCube in November 2005, Twilight Princess was delayed by Nintendo to allow its developers to refine the game, add more content, and port it to the Wii. The Wii version was released alongside the console in North America in November 2006, and in Japan, Europe, and Australia the following month. The GameCube version was released worldwide in December 2006.[b]'
},
known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_possible': 'no'})],
message_on_unknown_solution='There is no game called Legend of Zelda: Australia Twilight',
infinite_overlap=True,
pool_id=marking_training.id
)
]
tasks_op = toloka_client.create_tasks_async(training_tasks)
toloka_client.wait_operation(tasks_op)
Now we need to create a pool with real tasks.
We want manual acceptance of solutions (based on the results of the verification project) and some overlap, so that we get several answer variants for every question.
We want to filter performers by their knowledge of English and by their training results.
We also want to set up quality control:
marking_pool = toloka.pool.Pool(
    project_id=marking_project.id,
    private_name='Pool 1',
    may_contain_adult_content=True,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    reward_per_assignment=0.02,
    auto_accept_solutions=False,
    auto_accept_period_day=3,
    assignment_max_duration_seconds=60*20,
    defaults=toloka.pool.Pool.Defaults(
        default_overlap_for_new_task_suites=3
    ),
    filter=toloka.filter.Languages.in_('EN'),
)
marking_pool.set_mixer_config(real_tasks_count=4, golden_tasks_count=1, training_tasks_count=0) # 5 tasks per page
# Require at least 30% of training tasks to be completed correctly (i.e., at least 1 of the 3)
marking_pool.quality_control.training_requirement=toloka.quality_control.QualityControl.TrainingRequirement(training_pool_id=marking_training.id, training_passing_skill_value=30)
# Increase overlap for the task if the assignment was rejected
marking_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentsAssessment(),
    conditions=[toloka.conditions.AssessmentEvent == toloka.conditions.AssessmentEvent.REJECT],
    action=toloka.actions.ChangeOverlap(delta=1, open_pool=True)
)
# Ban performers whose accuracy on the golden yes/no classification is lower than random chance
marking_pool.quality_control.add_action(
    collector=toloka.collectors.GoldenSet(),
    conditions=[
        toloka.conditions.GoldenSetCorrectAnswersRate < 50.0,
        toloka.conditions.GoldenSetAnswersCount > 4
    ],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration=1,
        duration_unit=toloka.user_restriction.DurationUnit.DAYS,
        private_comment='Golden set'
    )
)
# Ban performers who answer too fast
marking_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=120),
    conditions=[toloka.conditions.FastSubmittedCount > 2],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Fast responses'
    )
)
# Another criterion for banning performers who answer too fast
marking_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(fast_submit_threshold_seconds=60),
    conditions=[toloka.conditions.FastSubmittedCount > 0],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Fast responses'
    )
)
marking_pool = toloka_client.create_pool(marking_pool)
print(f'Created pool with id {marking_pool.id}')
print(f'To view the pool, go to: https://toloka.yandex.com/requester/project/{marking_project.id}/pool/{marking_pool.id}')
Let’s generate the real tasks from the dev dataset and the golden tasks from the training dataset. In the golden set we compare only the binary yes/no classification of whether an answer exists, because a question can have several different correct answers, so we can’t compare them directly with the performer’s answer.
golden_tasks = []
for d in train_data['data']:
    if len(golden_tasks) >= tasks_num // 2:
        break
    for paragraph in d['paragraphs']:
        if len(golden_tasks) >= tasks_num // 2:
            break
        for question in paragraph['qas']:
            if len(golden_tasks) >= tasks_num // 2:
                break
            golden_tasks.append(
                toloka.task.Task(
                    input_values={
                        'text': paragraph['context'],
                        'question': question['question'],
                        'question_id': question['id']
                    },
                    known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_possible': 'no' if question['is_impossible'] else 'yes'})],
                    pool_id=marking_pool.id
                )
            )
tasks = []
for d in data['data']:
    if len(tasks) >= tasks_num:
        break
    for paragraph in d['paragraphs']:
        if len(tasks) >= tasks_num:
            break
        for question in paragraph['qas']:
            if len(tasks) >= tasks_num:
                break
            tasks.append(
                toloka.task.Task(
                    input_values={
                        'text': paragraph['context'],
                        'question': question['question'],
                        'question_id': question['id']
                    },
                    pool_id=marking_pool.id,
                )
            )
# Create the golden and real tasks in the marking pool
tasks_op = toloka_client.create_tasks_async(golden_tasks + tasks, allow_defaults=True)
toloka_client.wait_operation(tasks_op)
Our second project is for verifying the answers. Performers read the text and the question, then check whether the suggested answer is correct.
# How performers will see the task
helper = tb.helpers.IfHelperV1(
    condition=tb.conditions.EqualsConditionV1(to='yes', data=tb.data.InputData(path='is_possible')),
    then=tb.view.TextViewV1(label='Answer', content=tb.data.InputData(path='answer')),
    else_=tb.view.TextViewV1(label='Answer', content='No answer in the text')
)
radio_group_field = tb.fields.RadioGroupFieldV1(
    data=tb.data.OutputData(path='is_correct'),
    label='Is the answer correct?',
    validation=tb.conditions.RequiredConditionV1(),
    options=[
        tb.fields.GroupFieldOption(label='Yes', value='yes'),
        tb.fields.GroupFieldOption(label='No', value='no')
    ]
)
verification_project_interface = toloka.project.view_spec.TemplateBuilderViewSpec(
    config=tb.TemplateBuilder(
        view=tb.view.ListViewV1(
            items=[
                tb.view.TextViewV1(label='Text', content=tb.data.InputData(path='text')),
                tb.view.TextViewV1(label='Question', content=tb.data.InputData(path='question')),
                helper,
                radio_group_field
            ]
        )
    )
)
public_instruction = open('verification_public_instruction.html').read().strip()
# Set up the project
verification_project = toloka.project.Project(
    assignments_issuing_type=toloka.project.Project.AssignmentsIssuingType.AUTOMATED,
    public_name='Check if the answer is correct',
    public_description='Read the text, the question, and the answer. Check if the answer is correct',
    public_instructions=public_instruction,
    # Set up the task: view, input, and output parameters
    task_spec=toloka.project.task_spec.TaskSpec(
        input_spec={
            'text': toloka.project.field_spec.StringSpec(),
            'question': toloka.project.field_spec.StringSpec(),
            'question_id': toloka.project.field_spec.StringSpec(required=False),
            'assignment_id': toloka.project.field_spec.StringSpec(required=False),
            'answer': toloka.project.field_spec.StringSpec(required=False),
            'is_possible': toloka.project.field_spec.StringSpec(allowed_values=['yes', 'no'])
        },
        output_spec={'is_correct': toloka.project.field_spec.StringSpec(allowed_values=['yes', 'no'])},
        view_spec=verification_project_interface,
    ),
)
verification_project = toloka_client.create_project(verification_project)
print(f'Created verification project with id {verification_project.id}')
print(f'To view the project, go to: https://toloka.yandex.com/requester/project/{verification_project.id}')
How performers see the tasks
How performers see the instructions
Training is necessary for this project because it is hard to build a golden set for it (there is no ready source of correct and incorrect answer examples). So we create a training pool with different types of answers to prepare performers for the variety of possible tasks and to filter out performers who complete it poorly.
verification_training = toloka.training.Training(
    project_id=verification_project.id,
    private_name='SQUAD2.0 training',
    may_contain_adult_content=True,
    assignment_max_duration_seconds=10000,
    mix_tasks_in_creation_order=True,
    shuffle_tasks_in_task_suite=True,
    training_tasks_in_task_suite_count=5,
    task_suites_required_to_pass=1,
    retry_training_after_days=1,
    inherited_instructions=True,
    public_instructions='',
)
verification_training = toloka_client.create_training(verification_training)
print(f'Created training with id {verification_training.id}')
print(f'To view the training, go to: https://toloka.yandex.com/requester/project/{verification_project.id}/training/{verification_training.id}')
Let’s create several training tasks that cover as many correct and incorrect answer variants as possible.
training_tasks = [
toloka.task.Task(
input_values={
'question_id': '',
'question': 'Who wrote later papers studying problems solvable by Turning machines?',
'answer': 'Hisao Yamada',
'is_possible': 'yes',
'text': 'Earlier papers studying problems solvable by Turing machines with specific bounded resources include John Myhill\'s definition of linear bounded automata (Myhill 1960), Raymond Smullyan\'s study of rudimentary sets (1961), as well as Hisao Yamada\'s paper on real-time computations (1962). Somewhat earlier, Boris Trakhtenbrot (1956), a pioneer in the field from the USSR, studied another specific complexity measure. As he remembers:'
},
known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_correct': 'no'})],
message_on_unknown_solution='The text is about earlier papers not later ones',
infinite_overlap=True,
pool_id=verification_training.id
),
toloka.task.Task(
input_values={
'question_id': '',
'question': 'Who wrote the paper "Reducibility Among Combinatorial Problems" in 1974?',
'answer': 'Richard Karp',
'is_possible': 'yes',
'text': 'In 1967, Manuel Blum developed an axiomatic complexity theory based on his axioms and proved an important result, the so-called, speed-up theorem. The field really began to flourish in 1971 when the US researcher Stephen Cook and, working independently, Leonid Levin in the USSR, proved that there exist practically relevant problems that are NP-complete. In 1972, Richard Karp took this idea a leap forward with his landmark paper, "Reducibility Among Combinatorial Problems", in which he showed that 21 diverse combinatorial and graph theoretical problems, each infamous for its computational intractability, are NP-complete.'
},
known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_correct': 'no'})],
message_on_unknown_solution='"Reductibility Among Combinatorial Problems" was written in 1972',
infinite_overlap=True,
pool_id=verification_training.id
),
toloka.task.Task(
input_values={
'question_id': '',
'question': 'What category of game is Legend of Zelda: Australia Twilight?',
'answer': '',
'is_possible': 'no',
'text': 'The Legend of Zelda: Twilight Princess (Japanese: ゼルダの伝説 トワイライトプリンセス, Hepburn: Zeruda no Densetsu: Towairaito Purinsesu?) is an action-adventure game developed and published by Nintendo for the GameCube and Wii home video game consoles. It is the thirteenth installment in the The Legend of Zelda series. Originally planned for release on the GameCube in November 2005, Twilight Princess was delayed by Nintendo to allow its developers to refine the game, add more content, and port it to the Wii. The Wii version was released alongside the console in North America in November 2006, and in Japan, Europe, and Australia the following month. The GameCube version was released worldwide in December 2006.[b]'
},
known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_correct': 'yes'})],
message_on_unknown_solution='There is no game called Legend of Zelda: Australia Twilight',
infinite_overlap=True,
pool_id=verification_training.id
),
toloka.task.Task(
input_values={
'question_id': '',
'question': 'What is the name of the state that the megaregion expands to in the east?',
'answer': 'Las Vegas',
'is_possible': 'yes',
'text': 'The 8- and 10-county definitions are not used for the greater Southern California Megaregion, one of the 11 megaregions of the United States. The megaregion\'s area is more expansive, extending east into Las Vegas, Nevada, and south across the Mexican border into Tijuana.'
},
known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_correct': 'no'})],
message_on_unknown_solution='The state is actually called Nevada',
infinite_overlap=True,
pool_id=verification_training.id
),
toloka.task.Task(
input_values={
'question_id': '',
'question': 'Which city is the most populous in California?',
'answer': 'Los Angeles',
'is_possible': 'yes',
'text': 'Within southern California are two major cities, Los Angeles and San Diego, as well as three of the country\'s largest metropolitan areas. With a population of 3,792,621, Los Angeles is the most populous city in California and the second most populous in the United States. To the south and with a population of 1,307,402 is San Diego, the second most populous city in the state and the eighth most populous in the nation.'
},
known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_correct': 'yes'})],
message_on_unknown_solution='"With a population of 3,792,621, Los Angeles is the most populous city in California"',
infinite_overlap=True,
pool_id=verification_training.id
)
]
tasks_op = toloka_client.create_tasks_async(training_tasks)
toloka_client.wait_operation(tasks_op)
Now we need to create a pool with the real verification tasks. We want a large enough overlap to aggregate a verdict for every answer, and we want to filter performers by their knowledge of English and their training results. We also want to ban performers who answer too fast or solve captchas poorly.
verification_pool = toloka.pool.Pool(
    project_id=verification_project.id,
    private_name='Pool 1',
    may_contain_adult_content=True,
    will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    reward_per_assignment=0.01,
    auto_accept_solutions=True,
    assignment_max_duration_seconds=60*20,
    defaults=toloka.pool.Pool.Defaults(
        default_overlap_for_new_task_suites=5
    ),
    filter=toloka.filter.Languages.in_('EN'),
)
verification_pool.set_mixer_config(real_tasks_count=5, golden_tasks_count=0, training_tasks_count=0)
verification_pool.set_captcha_frequency('MEDIUM')
# Ban performers who answer too fast
verification_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=100),
    conditions=[toloka.conditions.FastSubmittedCount > 2],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Fast responses'
    )
)
# Another criterion for banning performers who answer too fast
verification_pool.quality_control.add_action(
    collector=toloka.collectors.AssignmentSubmitTime(fast_submit_threshold_seconds=45),
    conditions=[toloka.conditions.FastSubmittedCount > 0],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
        private_comment='Fast responses'
    )
)
# Ban performers based on captcha results
verification_pool.quality_control.add_action(
    collector=toloka.collectors.Captcha(history_size=5),
    conditions=[toloka.conditions.FailRate >= 60],
    action=toloka.actions.RestrictionV2(
        scope=toloka.user_restriction.UserRestriction.PROJECT,
        duration=3,
        duration_unit=toloka.user_restriction.DurationUnit.DAYS,
        private_comment='Captcha'
    )
)
verification_pool = toloka_client.create_pool(verification_pool)
print(f'Created pool with id {verification_pool.id}')
print(f'To view the pool, go to: https://toloka.yandex.com/requester/project/{verification_project.id}/pool/{verification_pool.id}')
Let’s run a pipeline that verifies the answers and accepts or rejects assignments based on the verification results.
def wait_pool_for_close(pool):
    sleep_time = 60
    pool = toloka_client.get_pool(pool.id)
    while not pool.is_closed():
        print(
            f'\t{datetime.datetime.now().strftime("%H:%M:%S")}\t'
            f'Pool {pool.id} has status {pool.status}.'
        )
        time.sleep(sleep_time)
        pool = toloka_client.get_pool(pool.id)
def prepare_verification_tasks():
    verification_tasks = []  # Tasks that we will send for verification
    request = toloka.search_requests.AssignmentSearchRequest(
        status=toloka.assignment.Assignment.SUBMITTED,  # Only take completed tasks that haven't been accepted or rejected
        pool_id=marking_pool.id,
    )
    # Create and store new tasks
    for assignment in toloka_client.get_assignments(request):
        for task, solution in zip(assignment.tasks, assignment.solutions):
            verification_tasks.append(
                toloka.task.Task(
                    input_values={
                        'text': task.input_values['text'],
                        'question': task.input_values['question'],
                        'question_id': task.input_values['question_id'],
                        'is_possible': solution.output_values['is_possible'],
                        'answer': solution.output_values.get('answer', '').strip(),
                        'assignment_id': assignment.id,
                    },
                    pool_id=verification_pool.id,
                )
            )
    print(f'Generated {len(verification_tasks)} new verification tasks')
    return verification_tasks
def run_verification_pool(verification_tasks):
    verification_tasks_op = toloka_client.create_tasks_async(
        verification_tasks,
        toloka.task.CreateTasksParameters(allow_defaults=True)
    )
    toloka_client.wait_operation(verification_tasks_op)
    verification_tasks_result = [task for task in toloka_client.get_tasks(pool_id=verification_pool.id) if not task.known_solutions]
    task_to_assignment = {}
    for task in verification_tasks_result:
        task_to_assignment[task.id] = task.input_values['assignment_id']
    # Open the verification pool
    run_pool2_operation = toloka_client.open_pool(verification_pool.id)
    run_pool2_operation = toloka_client.wait_operation(run_pool2_operation)
    print(f'Verification pool status - {run_pool2_operation.status}')
    return task_to_assignment
def get_aggregation_results(pool_id):
    print('Start aggregation in the verification pool')
    aggregation_operation = toloka_client.aggregate_solutions_by_pool(
        type='DAWID_SKENE',
        pool_id=pool_id,
        fields=[toloka.aggregation.PoolAggregatedSolutionRequest.Field(name='is_correct')]
    )
    aggregation_operation = toloka_client.wait_operation(aggregation_operation)
    print('Results aggregated')
    return list(toloka_client.get_aggregated_solutions(aggregation_operation.id))
def set_answers_status(verification_results):
    print('Started adding results to marking tasks')
    assignment_results = dict()
    for r in verification_results:
        # task_to_assignment is the mapping returned by run_verification_pool
        if r.task_id not in task_to_assignment:
            continue
        assignment_id = task_to_assignment[r.task_id]
        assignment_result = assignment_results.get(assignment_id, 0)
        # Increase the number of correct tasks in the assignment
        if r.output_values['is_correct'] == 'yes':
            assignment_result += 1
        assignment_results[assignment_id] = assignment_result
    for assignment_id, correct_num in assignment_results.items():
        assignment = toloka_client.get_assignment(assignment_id)
        if assignment.status.value == 'SUBMITTED':
            # If 4 or 5 of the tasks in the assignment were marked as correct, we accept the assignment
            if correct_num >= 4:
                toloka_client.accept_assignment(assignment_id, 'Well done!')
            else:
                toloka_client.reject_assignment(assignment_id, 'Incorrect answers')
    print('Finished adding results to marking tasks')
toloka_client.open_pool(marking_training.id)
toloka_client.open_pool(verification_training.id)
toloka_client.open_pool(marking_pool.id)
# Run the pipeline
while True:
    print('\nWaiting for marking pool to close')
    wait_pool_for_close(marking_pool)
    print(f'Marking pool {marking_pool.id} is finally closed!')
    # Prepare the verification tasks
    verification_tasks = prepare_verification_tasks()
    # Make sure all the tasks are done
    if not verification_tasks:
        print('All the tasks in our project are done')
        break
    # Add the tasks to the verification pool and run it
    task_to_assignment = run_verification_pool(verification_tasks)
    print('\nWaiting for verification pool to close')
    wait_pool_for_close(verification_pool)
    print(f'Verification pool {verification_pool.id} is finally closed!')
    # Aggregate the verification results
    verification_results = get_aggregation_results(verification_pool.id)
    # Reject or accept assignments in the marking pool
    set_answers_status(verification_results)
    print(f'Results received at {datetime.datetime.now()}')
Now let’s evaluate the results. We have several answers for every question, so we need to aggregate them: we take a majority vote on whether the question has an answer at all, and if it does, we pick the shortest non-empty answer.
request_for_result = toloka.search_requests.AssignmentSearchRequest(
    status=toloka.assignment.Assignment.ACCEPTED,
    pool_id=marking_pool.id,
)
answers = dict()
for assignment in toloka_client.get_assignments(request_for_result):
    for i, sol in enumerate(assignment.solutions):
        answer = sol.output_values['answer'].strip() if sol.output_values['is_possible'] == 'yes' else ''
        current_list = answers.get(assignment.tasks[i].input_values['question_id'], [])
        current_list.append(answer)
        answers[assignment.tasks[i].input_values['question_id']] = current_list
final_answers = dict()
for key, value in answers.items():
    sorted_value = sorted(value, key=lambda x: len(x))
    n = len(sorted_value) // 2
    if sorted_value[n] == '':
        final_answers[key] = ''
    else:
        final_answers[key] = next(filter(lambda x: x != '', sorted_value))
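To see why this trick implements the rule, here is a toy example with hypothetical answers (not from the dataset): empty strings sort first, so the middle element of the length-sorted list is empty only when more than half of the performers said there is no answer; otherwise the shortest non-empty answer wins.
# Toy illustration of the aggregation rule above (hypothetical answers)
toy = ['', 'in the late 1990s', 'the late 1990s']
toy_sorted = sorted(toy, key=len)           # ['', 'the late 1990s', 'in the late 1990s']
median = toy_sorted[len(toy_sorted) // 2]   # 'the late 1990s': the majority says an answer exists
final = '' if median == '' else next(a for a in toy_sorted if a != '')
print(final)  # 'the late 1990s', the shortest non-empty answer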
# Download evaluation script
!curl 'https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/' --output evaluate.py
from evaluate import make_qid_to_has_ans, get_raw_scores, apply_no_ans_threshold, make_eval_dict, merge_eval
# Implement a `score` function using helpers from the official SQuAD2.0 evaluation script
def score(dataset, preds):
    na_probs = {k: 0.0 for k in preds}
    qid_to_has_ans = {k: v for k, v in make_qid_to_has_ans(dataset).items() if k in preds}  # Maps qid to True/False
    has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
    no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
    exact_raw, f1_raw = get_raw_scores(dataset, preds)
    exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans, 1)
    f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans, 1)
    out_eval = make_eval_dict(exact_thresh, f1_thresh)
    if has_ans_qids:
        has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
        merge_eval(out_eval, has_ans_eval, 'HasAns')
    if no_ans_qids:
        no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
        merge_eval(out_eval, no_ans_eval, 'NoAns')
    print(json.dumps(out_eval, indent=2))
score(data['data'], final_answers)
Even though this project is still a work in progress, we’re already seeing promising results, and we’re confident that with incremental changes and improvements we can even beat SOTA models. If you have any ideas on how to improve this labeling project’s architecture, settings, instructions, or result aggregation methods, or if you have any other suggestions, feel free to leave a comment.