When Did Beyoncé Start Becoming Popular? - Tackling One of the Most Common Problems in NLP: Q/A

Hello! Today I’d like to explain how to solve one of the most troublesome tasks in NLP: question answering. We’ll be labeling the SQuAD2.0 dataset with the help of Toloka-Kit, a Python library for data labeling projects that helps data scientists and ML engineers build scalable ML pipelines. Feel free to go with a different option, though, such as Vertex AI. Let’s dive right in.

What is SQuAD?

The Stanford Question Answering Dataset (SQuAD) is used to test NLP models and their ability to understand natural language. SQuAD2.0 consists of a set of paragraphs from Wikipedia articles, along with 100,000 question-answer pairs derived from these paragraphs and 50,000 unanswerable questions. To show good results on SQuAD2.0, a model must not only answer questions correctly, but also determine whether a question has an answer in the first place, and refrain from responding if it doesn’t.

SQuAD2.0 is the most popular question answering dataset: it’s been cited in over 1000 articles, and in the three years since its release, 85 models have been published on its leaderboard.

The Problem

Our task is to get the correct answer to a question based on a fragment of a Wikipedia article. The answer is either a segment of text from the corresponding passage, or nothing at all if the question has no answer in the text. Here’s an example of a text, question, and answer:

Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny’s Child. Managed by her father, Mathew Knowles, the group became one of the world’s best-selling girl groups of all time.
Their hiatus saw the release of Beyoncé’s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles “Crazy in Love” and “Baby Boy”.

question: When did Beyonce start becoming popular?
answer: [in the late 1990s]

Let’s Talk about Crowdsourcing

Crowdsourcing can be extremely useful in solving Q&A tasks. If you’re building a virtual assistant, a chatbot, or any other system that’s supposed to answer questions posed in natural language, you need to train your model on a dataset like SQuAD2.0. But using an open dataset is not always an option (for instance, there may be nothing available in the language you’re working with). You can use crowdsourcing to build your own dataset and make your labeling process easier.

The Solution

Let’s create two projects for our labeling pipeline:

- Marking project: we will collect answers to the questions from the test dataset
- Verification project: we will verify these answers to improve the final quality

    token = input("Enter your token:")
    if token == '':
        print('The token you entered may be invalid. Please try again.')
    else:
        print('OK')

    # Prepare an environment and everything we need
    !pip install toloka-kit==0.1.3

    import datetime
    import json
    import time

    import toloka.client as toloka
    import toloka.client.project.template_builder as tb

    # Create a Toloka client instance
    # All API calls will pass through it
    toloka_client = toloka.TolokaClient(token, 'PRODUCTION')  # or switch to SANDBOX

    # We check the money available in your account, which also checks the validity of the OAuth token
    requester = toloka_client.get_requester()

    # How much money do you need for one question
    PRICE_PER_TASK = 0.2
    tasks_num = int(input("Enter the number of questions:"))
    print('You have enough money on your account - ', requester.balance >= tasks_num * PRICE_PER_TASK)

    # Download datasets
    !curl https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json --output train-v2.0.json
    !curl https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json --output dev-v2.0.json

    with open('dev-v2.0.json') as f:
        data = json.load(f)
    with open('train-v2.0.json') as f:
        train_data = json.load(f)

Review the dataset

Our dataset is a collection of texts and questions, each with a list of possible answers.

    {
        'title': data['data'][0]['title'],
        # Printing only the first paragraph for review
        'paragraphs': [data['data'][0]['paragraphs'][0]]
    }

Output:

    {'title': 'Normans',
     'paragraphs': [{'qas': [
        {'question': 'In what country is Normandy located?',
         'id': '56ddde6b9a695914005b9628',
         'answers': [{'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}],
         'is_impossible': False},
        {'question': 'When were the Normans in Normandy?',
         'id': '56ddde6b9a695914005b9629',
         'answers': [{'text': '10th and 11th centuries', 'answer_start': 94}, {'text': 'in the 10th and 11th centuries', 'answer_start': 87}, {'text': '10th and 11th centuries', 'answer_start': 94}, {'text': '10th and 11th centuries', 'answer_start': 94}],
         'is_impossible': False},
        {'question': 'From which countries did the Norse originate?',
         'id': '56ddde6b9a695914005b962a',
         'answers': [{'text': 'Denmark, Iceland and Norway', 'answer_start': 256}, {'text': 'Denmark, Iceland and Norway', 'answer_start': 256}, {'text': 'Denmark, Iceland and Norway', 'answer_start': 256}, {'text': 'Denmark, Iceland and Norway', 'answer_start': 256}],
         'is_impossible': False},
        {'question': 'Who was the Norse leader?',
         'id': '56ddde6b9a695914005b962b',
         'answers': [{'text': 'Rollo', 'answer_start': 308}, {'text': 'Rollo', 'answer_start': 308}, {'text': 'Rollo', 'answer_start': 308}, {'text': 'Rollo', 'answer_start': 308}],
         'is_impossible': False},
        {'question': 'What century did the Normans first gain their separate identity?',
         'id': '56ddde6b9a695914005b962c',
         'answers': [{'text': '10th century', 'answer_start': 671}, {'text': 'the first half of the 10th century', 'answer_start': 649}, {'text': '10th', 'answer_start': 671}, {'text': '10th', 'answer_start': 671}],
         'is_impossible': False},
        {'plausible_answers': [{'text': 'Normans', 'answer_start': 4}],
         'question': "Who gave their name to Normandy in the 1000's and 1100's",
         'id': '5ad39d53604f3c001a3fe8d1',
         'answers': [],
         'is_impossible': True},
        {'plausible_answers': [{'text': 'Normandy', 'answer_start': 137}],
         'question': 'What is France a region of?',
         'id': '5ad39d53604f3c001a3fe8d2',
         'answers': [],
         'is_impossible': True},
        {'plausible_answers': [{'text': 'Rollo', 'answer_start': 308}],
         'question': 'Who did King Charles III swear fealty to?',
         'id': '5ad39d53604f3c001a3fe8d3',
         'answers': [],
         'is_impossible': True},
        {'plausible_answers': [{'text': '10th century', 'answer_start': 671}],
         'question': 'When did the Frankish identity emerge?',
         'id': '5ad39d53604f3c001a3fe8d4',
         'answers': [],
         'is_impossible': True}],
       'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'}]}

Create a new marking project

In this project, performers will try to find answers to the questions. If the text contains an answer, they should paste it; otherwise, they should mark the question as unanswerable.
    # How performers will see the task
    radio_group_field = tb.fields.RadioGroupFieldV1(
        data=tb.data.OutputData(path='is_possible'),
        label='Does the text contain an answer?',
        validation=tb.conditions.RequiredConditionV1(),
        options=[
            tb.fields.GroupFieldOption(label='Yes', value='yes'),
            tb.fields.GroupFieldOption(label='No', value='no')
        ]
    )

    # The answer field is shown only when the performer selects "Yes"
    helper = tb.helpers.IfHelperV1(
        condition=tb.conditions.EqualsConditionV1(
            to='yes',
            data=tb.data.OutputData(path='is_possible')
        ),
        then=tb.fields.TextareaFieldV1(
            data=tb.data.OutputData(path='answer'),
            label='Paste an answer',
            validation=tb.conditions.RequiredConditionV1()
        )
    )

    project_interface = toloka.project.view_spec.TemplateBuilderViewSpec(
        config=tb.TemplateBuilder(
            view=tb.view.ListViewV1(
                items=[
                    tb.view.TextViewV1(label='Text', content=tb.data.InputData(path='text')),
                    tb.view.TextViewV1(label='Question', content=tb.data.InputData(path='question')),
                    tb.view.ListViewV1(items=[radio_group_field, helper])
                ]
            )
        )
    )

    public_instruction = open('marking_public_instruction.html').read().strip()

    # Set up the project
    marking_project = toloka.project.Project(
        assignments_issuing_type=toloka.project.Project.AssignmentsIssuingType.AUTOMATED,
        public_name='Find the answer in the text',
        public_description='Read the text and find the text fragment that answers the question',
        public_instructions=public_instruction,
        # Set up the task: view, input, and output parameters
        task_spec=toloka.project.task_spec.TaskSpec(
            input_spec={
                'text': toloka.project.field_spec.StringSpec(),
                'question': toloka.project.field_spec.StringSpec(),
                'question_id': toloka.project.field_spec.StringSpec(required=False)
            },
            output_spec={
                'answer': toloka.project.field_spec.StringSpec(required=False),
                'is_possible': toloka.project.field_spec.StringSpec(allowed_values=['yes', 'no'])
            },
            view_spec=project_interface,
        ),
    )

    # Call the API to create a new project
    # If you have already created all pools and projects you can just get it using toloka_client.get_project('your marking project id')
    marking_project = toloka_client.create_project(marking_project)
    print(f'Created marking project with id {marking_project.id}')
    print(f'To view the project, go to: https://toloka.yandex.com/requester/project/{marking_project.id}')

[Screenshot: how performers will see the tasks]

[Screenshot: how performers see the instructions]

Marking training

Next, we create a training to help performers do the tasks better. We will add several training tasks and require performers to complete all of them before moving on to the real tasks.

    # Set up the training pool
    marking_training = toloka.training.Training(
        project_id=marking_project.id,
        private_name='SQUAD2.0 training',
        may_contain_adult_content=True,
        assignment_max_duration_seconds=10000,
        mix_tasks_in_creation_order=True,
        shuffle_tasks_in_task_suite=True,
        training_tasks_in_task_suite_count=3,
        task_suites_required_to_pass=1,
        retry_training_after_days=1,
        inherited_instructions=True,
        public_instructions='',
    )

    marking_training = toloka_client.create_training(marking_training)
    print(f'Created training with id {marking_training.id}')
    print(f'To view the training, go to: https://toloka.yandex.com/requester/project/{marking_project.id}/training/{marking_training.id}')

We need to upload training tasks with hints that help performers find the correct answers.
    training_tasks = [
        toloka.task.Task(
            input_values={
                'question_id': '56be85543aeaaa14008c9063',
                'question': 'When did Beyonce start becoming popular?',
                'text': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'
            },
            known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_possible': 'yes', 'answer': 'in the late 1990s'})],
            message_on_unknown_solution='the answer can be found after "and rose to fame..."',
            infinite_overlap=True,
            pool_id=marking_training.id
        ),
        toloka.task.Task(
            input_values={
                'question_id': '56be86cf3aeaaa14008c9076',
                'question': 'After her second solo album, what other entertainment venture did Beyonce explore?',
                'text': 'Following the disbandment of Destiny\'s Child in June 2005, she released her second solo album, B\'Day (2006), which contained hits "Déjà Vu", "Irreplaceable", and "Beautiful Liar". Beyoncé also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for "Single Ladies (Put a Ring on It)". Beyoncé took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyoncé (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.'
            },
            known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_possible': 'yes', 'answer': 'acting'})],
            message_on_unknown_solution='the answer can be found before "... with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009)"',
            infinite_overlap=True,
            pool_id=marking_training.id
        ),
        toloka.task.Task(
            input_values={
                'question_id': '5a8d7bf7df8bba001a0f9ab1',
                'question': 'What category of game is Legend of Zelda: Australia Twilight?',
                'text': 'The Legend of Zelda: Twilight Princess (Japanese: ゼルダの伝説 トワイライトプリンセス, Hepburn: Zeruda no Densetsu: Towairaito Purinsesu?) is an action-adventure game developed and published by Nintendo for the GameCube and Wii home video game consoles. It is the thirteenth installment in the The Legend of Zelda series. Originally planned for release on the GameCube in November 2005, Twilight Princess was delayed by Nintendo to allow its developers to refine the game, add more content, and port it to the Wii. The Wii version was released alongside the console in North America in November 2006, and in Japan, Europe, and Australia the following month. The GameCube version was released worldwide in December 2006.[b]'
            },
            known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_possible': 'no'})],
            message_on_unknown_solution='There is no game called Legend of Zelda: Australia Twilight',
            infinite_overlap=True,
            pool_id=marking_training.id
        )
    ]

    tasks_op = toloka_client.create_tasks_async(training_tasks)
    toloka_client.wait_operation(tasks_op)

Marking pool

Now we need to create a pool with real tasks. We want manual acceptance of solutions (based on the results of the verification project) and some overlap, so that every question gets several candidate answers. We want to filter performers by their knowledge of English and by their training result. We also want to set up quality control:

- Ban performers who answer too fast
- Ban performers who show low quality on the golden set tasks
- Increase the overlap for a task if the assignment was rejected

    marking_pool = toloka.pool.Pool(
        project_id=marking_project.id,
        private_name='Pool 1',
        may_contain_adult_content=True,
        will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
        reward_per_assignment=0.02,
        auto_accept_solutions=False,
        auto_accept_period_day=3,
        assignment_max_duration_seconds=60 * 20,
        defaults=toloka.pool.Pool.Defaults(
            default_overlap_for_new_task_suites=3
        ),
        filter=toloka.filter.Languages.in_('EN'),
    )

    marking_pool.set_mixer_config(real_tasks_count=4, golden_tasks_count=1, training_tasks_count=0)  # 5 tasks per page

    # We require at least 1 training task to be completed on the first attempt
    marking_pool.quality_control.training_requirement = toloka.quality_control.QualityControl.TrainingRequirement(
        training_pool_id=marking_training.id,
        training_passing_skill_value=30
    )

    # Increase overlap for the task if the assignment was rejected
    marking_pool.quality_control.add_action(
        collector=toloka.collectors.AssignmentsAssessment(),
        conditions=[toloka.conditions.AssessmentEvent == toloka.conditions.AssessmentEvent.REJECT],
        action=toloka.actions.ChangeOverlap(delta=1, open_pool=True)
    )

    # Ban performer if its quality in the binary classification of the existence of the answer is lower than for a random choice
    # ASSUMPTION: the source shows only the literals 50.0 and 4 here; the "<" comparison
    # and the GoldenSetAnswersCount condition are inferred from them
    marking_pool.quality_control.add_action(
        collector=toloka.collectors.GoldenSet(),
        conditions=[
            toloka.conditions.GoldenSetCorrectAnswersRate < 50.0,
            toloka.conditions.GoldenSetAnswersCount > 4
        ],
        action=toloka.actions.RestrictionV2(
            scope=toloka.user_restriction.UserRestriction.PROJECT,
            duration=1,
            duration_unit=toloka.user_restriction.DurationUnit.DAYS,
            private_comment='Golden set'
        )
    )

    # Ban performer who answers too fast
    marking_pool.quality_control.add_action(
        collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=120),
        conditions=[toloka.conditions.FastSubmittedCount > 2],
        action=toloka.actions.RestrictionV2(
            scope=toloka.user_restriction.UserRestriction.PROJECT,
            duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
            private_comment='Fast responses'
        )
    )

    # Another criterion to ban performer who answers too fast
    marking_pool.quality_control.add_action(
        collector=toloka.collectors.AssignmentSubmitTime(fast_submit_threshold_seconds=60),
        conditions=[toloka.conditions.FastSubmittedCount > 0],
        action=toloka.actions.RestrictionV2(
            scope=toloka.user_restriction.UserRestriction.PROJECT,
            duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
            private_comment='Fast responses'
        )
    )

    marking_pool = toloka_client.create_pool(marking_pool)
    print(f'Created pool with id {marking_pool.id}')
    print(f'To view the pool, go to: https://toloka.yandex.com/requester/project/{marking_project.id}/pool/{marking_pool.id}')

Let’s generate tasks from the test dataset and golden tasks from the training dataset. In the golden set we will compare only the binary yes/no classification of the answer: a question can have several different correct answers, so we can’t directly compare them with the performer’s answer.
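Before building the tasks, the golden-set comparison described above can be illustrated with a small standalone sketch. It assumes only the SQuAD2.0 record layout shown earlier (`is_impossible`, `answers`); the helper names `golden_label` and `matches_golden` are hypothetical and are not part of Toloka-Kit, where the comparison happens on the platform side through known solutions.

```python
def golden_label(question):
    """Map a SQuAD2.0 question record to the expected 'is_possible' value."""
    return "no" if question["is_impossible"] else "yes"


def matches_golden(question, response):
    """Compare only the yes/no field; the free-text answer is ignored on purpose."""
    return response["is_possible"] == golden_label(question)


# A record shaped like the dataset sample above (hypothetical subset of its fields)
question = {
    "question": "When were the Normans in Normandy?",
    "is_impossible": False,
    "answers": [
        {"text": "10th and 11th centuries", "answer_start": 94},
        {"text": "in the 10th and 11th centuries", "answer_start": 87},
    ],
}

# Both spans are valid answers, so an exact string match against one of them would be too strict
response = {"is_possible": "yes", "answer": "in the 10th and 11th centuries"}
print(matches_golden(question, response))  # True
```

Checking only the binary field keeps the golden set usable even when performers pick a longer or shorter span than the reference answer.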
    # Golden tasks come from the training dataset; only the yes/no label is known
    golden_tasks = []
    for d in train_data['data']:
        if len(golden_tasks) >= tasks_num / 2:
            break
        for paragraph in d['paragraphs']:
            if len(golden_tasks) >= tasks_num / 2:
                break
            for question in paragraph['qas']:
                if len(golden_tasks) >= tasks_num / 2:
                    break
                golden_tasks.append(
                    toloka.task.Task(
                        input_values={
                            'text': paragraph['context'],
                            'question': question['question'],
                            'question_id': question['id']
                        },
                        known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={
                            'is_possible': 'no' if question['is_impossible'] else 'yes'
                        })],
                        pool_id=marking_pool.id
                    )
                )

    # Real tasks come from the test (dev) dataset
    tasks = []
    for d in data['data']:
        if len(tasks) >= tasks_num:
            break
        for paragraph in d['paragraphs']:
            if len(tasks) >= tasks_num:
                break
            for question in paragraph['qas']:
                if len(tasks) >= tasks_num:
                    break
                tasks.append(
                    toloka.task.Task(
                        input_values={
                            'text': paragraph['context'],
                            'question': question['question'],
                            'question_id': question['id']
                        },
                        pool_id=marking_pool.id,
                    )
                )

    # Restrict size of the golden set and create tasks
    tasks_op = toloka_client.create_tasks_async(golden_tasks + tasks, allow_defaults=True)
    toloka_client.wait_operation(tasks_op)

Verification project

Our second project verifies the answers. The performer should read the text and the question and check whether the suggested answer is correct.
    # How performers will see the task
    helper = tb.helpers.IfHelperV1(
        condition=tb.conditions.EqualsConditionV1(to='yes', data=tb.data.InputData(path='is_possible')),
        then=tb.view.TextViewV1(label='Answer', content=tb.data.InputData(path='answer')),
        else_=tb.view.TextViewV1(label='Answer', content='No answer in the text')
    )

    radio_group_field = tb.fields.RadioGroupFieldV1(
        data=tb.data.OutputData(path='is_correct'),
        label='Is the answer correct?',
        validation=tb.conditions.RequiredConditionV1(),
        options=[
            tb.fields.GroupFieldOption(label='Yes', value='yes'),
            tb.fields.GroupFieldOption(label='No', value='no')
        ]
    )

    verification_project_interface = toloka.project.view_spec.TemplateBuilderViewSpec(
        config=tb.TemplateBuilder(
            view=tb.view.ListViewV1(
                items=[
                    tb.view.TextViewV1(label='Text', content=tb.data.InputData(path='text')),
                    tb.view.TextViewV1(label='Question', content=tb.data.InputData(path='question')),
                    helper,
                    radio_group_field
                ]
            )
        )
    )

    public_instruction = open('verification_public_instruction.html').read().strip()

    # Set up the project
    verification_project = toloka.project.Project(
        assignments_issuing_type=toloka.project.Project.AssignmentsIssuingType.AUTOMATED,
        public_name='Check if the answer is correct',
        public_description='Read the text, the question, and the answer. Check if the answer is correct',
        public_instructions=public_instruction,
        # Set up the task: view, input, and output parameters
        task_spec=toloka.project.task_spec.TaskSpec(
            input_spec={
                'text': toloka.project.field_spec.StringSpec(),
                'question': toloka.project.field_spec.StringSpec(),
                'question_id': toloka.project.field_spec.StringSpec(required=False),
                'assignment_id': toloka.project.field_spec.StringSpec(required=False),
                'answer': toloka.project.field_spec.StringSpec(required=False),
                'is_possible': toloka.project.field_spec.StringSpec(allowed_values=['yes', 'no'])
            },
            output_spec={'is_correct': toloka.project.field_spec.StringSpec(allowed_values=['yes', 'no'])},
            view_spec=verification_project_interface,
        ),
    )

    verification_project = toloka_client.create_project(verification_project)
    print(f'Created verification project with id {verification_project.id}')
    print(f'To view the project, go to: https://toloka.yandex.com/requester/project/{verification_project.id}')

[Screenshot: how performers see the tasks]

[Screenshot: how performers see the instructions]

Verification training

Training is necessary for this project because it is hard to get a golden set: there is no source of ready-made examples of correct and incorrect answers. So we create a training with different types of answers to prepare performers for the variety of possible tasks and to filter out performers who complete it poorly.

    verification_training = toloka.training.Training(
        project_id=verification_project.id,
        private_name='SQUAD2.0 training',
        may_contain_adult_content=True,
        assignment_max_duration_seconds=10000,
        mix_tasks_in_creation_order=True,
        shuffle_tasks_in_task_suite=True,
        training_tasks_in_task_suite_count=5,
        task_suites_required_to_pass=1,
        retry_training_after_days=1,
        inherited_instructions=True,
        public_instructions='',
    )

    verification_training = toloka_client.create_training(verification_training)
    print(f'Created training with id {verification_training.id}')
    print(f'To view the training, go to: https://toloka.yandex.com/requester/project/{verification_project.id}/training/{verification_training.id}')

Let’s create a range of tasks that covers as many kinds of correct and incorrect answers as possible.
    training_tasks = [
        toloka.task.Task(
            input_values={
                'question_id': '',
                'question': 'Who wrote later papers studying problems solvable by Turning machines?',
                'answer': 'Hisao Yamada',
                'is_possible': 'yes',
                'text': 'Earlier papers studying problems solvable by Turing machines with specific bounded resources include John Myhill\'s definition of linear bounded automata (Myhill 1960), Raymond Smullyan\'s study of rudimentary sets (1961), as well as Hisao Yamada\'s paper on real-time computations (1962). Somewhat earlier, Boris Trakhtenbrot (1956), a pioneer in the field from the USSR, studied another specific complexity measure. As he remembers:'
            },
            known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_correct': 'no'})],
            message_on_unknown_solution='The text is about earlier papers not later ones',
            infinite_overlap=True,
            pool_id=verification_training.id
        ),
        toloka.task.Task(
            input_values={
                'question_id': '',
                'question': 'Who wrote the paper "Reductibility Among Combinatorial Problems" in 1974?',
                'answer': 'Richard Karp',
                'is_possible': 'yes',
                'text': 'In 1967, Manuel Blum developed an axiomatic complexity theory based on his axioms and proved an important result, the so-called, speed-up theorem. The field really began to flourish in 1971 when the US researcher Stephen Cook and, working independently, Leonid Levin in the USSR, proved that there exist practically relevant problems that are NP-complete. In 1972, Richard Karp took this idea a leap forward with his landmark paper, "Reducibility Among Combinatorial Problems", in which he showed that 21 diverse combinatorial and graph theoretical problems, each infamous for its computational intractability, are NP-complete.'
            },
            known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_correct': 'no'})],
            message_on_unknown_solution='"Reductibility Among Combinatorial Problems" was written in 1972',
            infinite_overlap=True,
            pool_id=verification_training.id
        ),
        toloka.task.Task(
            input_values={
                'question_id': '',
                'question': 'What category of game is Legend of Zelda: Australia Twilight?',
                'answer': '',
                'is_possible': 'no',
                'text': 'The Legend of Zelda: Twilight Princess (Japanese: ゼルダの伝説 トワイライトプリンセス, Hepburn: Zeruda no Densetsu: Towairaito Purinsesu?) is an action-adventure game developed and published by Nintendo for the GameCube and Wii home video game consoles. It is the thirteenth installment in the The Legend of Zelda series. Originally planned for release on the GameCube in November 2005, Twilight Princess was delayed by Nintendo to allow its developers to refine the game, add more content, and port it to the Wii. The Wii version was released alongside the console in North America in November 2006, and in Japan, Europe, and Australia the following month. The GameCube version was released worldwide in December 2006.[b]'
            },
            known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_correct': 'yes'})],
            message_on_unknown_solution='There is no game called Legend of Zelda: Australia Twilight',
            infinite_overlap=True,
            pool_id=verification_training.id
        ),
        toloka.task.Task(
            input_values={
                'question_id': '',
                'question': 'What is the name of the state that the megaregion expands to in the east?',
                'answer': 'Las Vegas',
                'is_possible': 'yes',
                'text': 'The 8- and 10-county definitions are not used for the greater Southern California Megaregion, one of the 11 megaregions of the United States. The megaregion\'s area is more expansive, extending east into Las Vegas, Nevada, and south across the Mexican border into Tijuana.'
            },
            known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_correct': 'no'})],
            message_on_unknown_solution='The state is actually called Nevada',
            infinite_overlap=True,
            pool_id=verification_training.id
        ),
        toloka.task.Task(
            input_values={
                'question_id': '',
                'question': 'Which city is the most populous in California?',
                'answer': 'Los Angeles',
                'is_possible': 'yes',
                'text': 'Within southern California are two major cities, Los Angeles and San Diego, as well as three of the country\'s largest metropolitan areas. With a population of 3,792,621, Los Angeles is the most populous city in California and the second most populous in the United States. To the south and with a population of 1,307,402 is San Diego, the second most populous city in the state and the eighth most populous in the nation.'
            },
            known_solutions=[toloka.task.BaseTask.KnownSolution(output_values={'is_correct': 'yes'})],
            message_on_unknown_solution='"With a population of 3,792,621, Los Angeles is the most populous city in California"',
            infinite_overlap=True,
            pool_id=verification_training.id
        )
    ]

    tasks_op = toloka_client.create_tasks_async(training_tasks)
    toloka_client.wait_operation(tasks_op)

Verification pool

Now we need to create a pool with real tasks. We want to have big enough overlap to aggregate verdicts about every answer. We want to filter performers by their knowledge of English and the result on the training. Also, we want to ban performers who answer too fast and inaccurately solve captchas.
    verification_pool = toloka.pool.Pool(
        project_id=verification_project.id,
        private_name='Pool 1',
        may_contain_adult_content=True,
        will_expire=datetime.datetime.utcnow() + datetime.timedelta(days=365),
        reward_per_assignment=0.01,
        auto_accept_solutions=True,
        assignment_max_duration_seconds=60 * 20,
        defaults=toloka.pool.Pool.Defaults(
            default_overlap_for_new_task_suites=5
        ),
        filter=toloka.filter.Languages.in_('EN'),
    )
    verification_pool.set_mixer_config(real_tasks_count=5, golden_tasks_count=0, training_tasks_count=0)
    verification_pool.set_captcha_frequency('MEDIUM')

    # Ban a performer who submits more than 2 of their last 5 assignments in under 100 seconds
    verification_pool.quality_control.add_action(
        collector=toloka.collectors.AssignmentSubmitTime(history_size=5, fast_submit_threshold_seconds=100),
        conditions=[toloka.conditions.FastSubmittedCount > 2],
        action=toloka.actions.RestrictionV2(
            scope=toloka.user_restriction.UserRestriction.PROJECT,
            duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
            private_comment='Fast responses'
        )
    )

    # Ban a performer who submits any assignment in under 45 seconds
    verification_pool.quality_control.add_action(
        collector=toloka.collectors.AssignmentSubmitTime(fast_submit_threshold_seconds=45),
        conditions=[toloka.conditions.FastSubmittedCount > 0],
        action=toloka.actions.RestrictionV2(
            scope=toloka.user_restriction.UserRestriction.PROJECT,
            duration_unit=toloka.user_restriction.DurationUnit.PERMANENT,
            private_comment='Fast responses'
        )
    )

    # Ban a performer for 3 days if they fail 60% or more of their last 5 captchas
    verification_pool.quality_control.add_action(
        collector=toloka.collectors.Captcha(history_size=5),
        conditions=[toloka.conditions.FailRate >= 60],
        action=toloka.actions.RestrictionV2(
            scope=toloka.user_restriction.UserRestriction.PROJECT,
            duration=3,
            duration_unit=toloka.user_restriction.DurationUnit.DAYS,
            private_comment='Captcha'
        )
    )

    verification_pool = toloka_client.create_pool(verification_pool)
    print(f'Created pool with id {verification_pool.id}')
    print(f'To view the pool, go to: https://toloka.yandex.com/requester/project/{verification_project.id}/pool/{verification_pool.id}')

Running the pipeline

Let’s run a pipeline that verifies the answers and accepts or rejects assignments based on the verification results.

    def wait_pool_for_close(pool):
        # Poll the pool once a minute until it is closed
        sleep_time = 60
        pool = toloka_client.get_pool(pool.id)
        while not pool.is_closed():
            print(
                f'\t{datetime.datetime.now().strftime("%H:%M:%S")}\t'
                f'Pool {pool.id} has status {pool.status}.'
            )
            time.sleep(sleep_time)
            pool = toloka_client.get_pool(pool.id)


    def prepare_verification_tasks():
        verification_tasks = []  # Tasks that we will send for verification
        # Only take completed tasks that haven't been accepted or rejected
        request = toloka.search_requests.AssignmentSearchRequest(
            status=toloka.assignment.Assignment.SUBMITTED,
            pool_id=marking_pool.id,
        )

        # Create and store new tasks
        for assignment in toloka_client.get_assignments(request):
            for task, solution in zip(assignment.tasks, assignment.solutions):
                verification_tasks.append(
                    toloka.task.Task(
                        input_values={
                            'text': task.input_values['text'],
                            'question': task.input_values['question'],
                            'question_id': task.input_values['question_id'],
                            'is_possible': solution.output_values['is_possible'],
                            'answer': solution.output_values.get('answer', '').strip(),
                            'assignment_id': assignment.id,
                        },
                        pool_id=verification_pool.id,
                    )
                )
        print(f'Generate {len(verification_tasks)} new verification tasks')
        return verification_tasks


    def run_verification_pool(verification_tasks):
        verification_tasks_op = toloka_client.create_tasks_async(
            verification_tasks,
            toloka.task.CreateTasksParameters(allow_defaults=True)
        )
        toloka_client.wait_operation(verification_tasks_op)

        # Skip control tasks (tasks with known solutions)
        verification_tasks_result = [
            task for task in toloka_client.get_tasks(pool_id=verification_pool.id)
            if not task.known_solutions
        ]
        # Map every verification task back to the marking assignment it came from
        task_to_assignment = {}
        for task in verification_tasks_result:
            task_to_assignment[task.id] = task.input_values['assignment_id']

        # Open the verification pool
        run_pool2_operation = toloka_client.open_pool(verification_pool.id)
        run_pool2_operation = toloka_client.wait_operation(run_pool2_operation)
        print(f'Verification pool status - {run_pool2_operation.status}')
        return task_to_assignment


    def get_aggregation_results(pool_id):
        print('Start aggregation in the verification pool')
        aggregation_operation = toloka_client.aggregate_solutions_by_pool(
            type='DAWID_SKENE',
            pool_id=pool_id,
            fields=[toloka.aggregation.PoolAggregatedSolutionRequest.Field(name='is_correct')]
        )
        aggregation_operation = toloka_client.wait_operation(aggregation_operation)
        print('Results aggregated')
        return list(toloka_client.get_aggregated_solutions(aggregation_operation.id))


    def set_answers_status(verification_results):
        # Relies on the module-level task_to_assignment set in the pipeline loop
        print('Started adding results to marking tasks')
        assignment_results = dict()
        for r in verification_results:
            if r.task_id not in task_to_assignment:
                continue
            assignment_id = task_to_assignment[r.task_id]
            assignment_result = assignment_results.get(assignment_id, 0)
            # Increase the number of correct tasks in the assignment
            if r.output_values['is_correct'] == 'yes':
                assignment_result += 1
            assignment_results[assignment_id] = assignment_result

        for assignment_id, correct_num in assignment_results.items():
            assignment = toloka_client.get_assignment(assignment_id)
            if assignment.status.value == 'SUBMITTED':
                # Accept the assignment if 4 or 5 of its tasks were marked as correct
                if correct_num >= 4:
                    toloka_client.accept_assignment(assignment_id, 'Well done!')
                else:
                    toloka_client.reject_assignment(assignment_id, 'Incorrect answers')
        print('Finished adding results to marking tasks')


    toloka_client.open_pool(marking_training.id)
    toloka_client.open_pool(verification_training.id)
    toloka_client.open_pool(marking_pool.id)

    # Run the pipeline
    while True:
        print('\nWaiting for marking pool to close')
        wait_pool_for_close(marking_pool)
        print(f'Marking pool {marking_pool.id} is finally closed!')

        # Prepare the verification tasks
        verification_tasks = prepare_verification_tasks()

        # Stop when there is nothing left to verify
        if not verification_tasks:
            print('All the tasks in our project are done')
            break

        # Add the tasks to the verification pool and run it
        task_to_assignment = run_verification_pool(verification_tasks)
        print('\nWaiting for verification pool to close')
        wait_pool_for_close(verification_pool)
        print(f'Verification pool {verification_pool.id} is finally closed!')

        # Aggregate the verification results
        verification_results = get_aggregation_results(verification_pool.id)
        # Accept or reject assignments in the marking pool
        set_answers_status(verification_results)

    print(f'Results received at {datetime.datetime.now()}')

Evaluate the results

Now, let’s evaluate the results. We have several answers for every question, so we need to aggregate them. First, a majority vote on the yes/no classification decides whether the question is answerable at all. If it is, we pick the shortest of the submitted answers.
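To make the rule concrete, here is a minimal, self-contained sketch on toy data. The function name `aggregate_answers` and the sample answers are hypothetical and not part of Toloka-Kit; an empty string stands for "this question has no answer".

```python
def aggregate_answers(candidates):
    """Pick one answer from several performers' answers to the same question.

    An empty string means the performer marked the question as unanswerable.
    """
    if not candidates:
        return ''
    # Sorting by length puts all empty answers first.
    ordered = sorted(candidates, key=len)
    # If the middle element is empty, a strict majority voted "no answer".
    if ordered[len(ordered) // 2] == '':
        return ''
    # Otherwise take the shortest non-empty answer.
    return next(a for a in ordered if a != '')


# Hypothetical answers from five performers per question
votes = {
    'q1': ['in the late 1990s', 'late 1990s', '', 'the late 1990s', 'late 1990s'],
    'q2': ['', '', 'Houston', '', 'Texas'],
}
final = {qid: aggregate_answers(answers) for qid, answers in votes.items()}
print(final)
```

Preferring the shortest non-empty answer is a heuristic: it favors tight spans, but it can also pick an answer that drops a word the reference span includes.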
    request_for_result = toloka.search_requests.AssignmentSearchRequest(
        status=toloka.assignment.Assignment.ACCEPTED,
        pool_id=marking_pool.id,
    )

    # Collect every accepted answer per question; '' means "no answer"
    answers = dict()
    for assignment in toloka_client.get_assignments(request_for_result):
        for i, sol in enumerate(assignment.solutions):
            answer = sol.output_values['answer'].strip() if sol.output_values['is_possible'] == 'yes' else ''
            current_list = answers.get(assignment.tasks[i].input_values['question_id'], [])
            current_list.append(answer)
            answers[assignment.tasks[i].input_values['question_id']] = current_list

    final_answers = dict()
    for key, value in answers.items():
        # Sorting by length puts the empty ("no answer") votes first
        sorted_value = sorted(value, key=lambda x: len(x))
        n = len(sorted_value)
        if sorted_value[n // 2] == '':
            # The majority says the question has no answer
            final_answers[key] = ''
        else:
            # Take the shortest non-empty answer
            final_answers[key] = next(filter(lambda x: x != '', sorted_value))

    # Download the evaluation script (the leading ! runs a shell command in a Jupyter notebook)
    !curl --output evaluate.py 'https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/'

    import json

    from evaluate import make_qid_to_has_ans, get_raw_scores, apply_no_ans_threshold, make_eval_dict, merge_eval

    # Implement the `score` method using the methods from the evaluation script
    # downloaded from the official SQuAD2.0 website
    def score(dataset, preds):
        na_probs = {k: 0.0 for k in preds}
        # Maps qid to True/False, restricted to the questions we have predictions for
        qid_to_has_ans = {k: v for k, v in make_qid_to_has_ans(dataset).items() if k in preds}
        has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
        no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
        exact_raw, f1_raw = get_raw_scores(dataset, preds)
        exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans, 1)
        f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans, 1)
        out_eval = make_eval_dict(exact_thresh, f1_thresh)
        if has_ans_qids:
            has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
            merge_eval(out_eval, has_ans_eval, 'HasAns')
        if no_ans_qids:
            no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
            merge_eval(out_eval, no_ans_eval, 'NoAns')
        print(json.dumps(out_eval, indent=2))

    score(data['data'], final_answers)

Conclusion

Even though this project is still a work in progress, we’re already seeing promising results, and we’re confident that with incremental improvements we can even beat SOTA models. So, if you have any ideas on how to improve this labeling project’s architecture, settings, instructions, or result aggregation methods, or if you have any other suggestions, feel free to leave a comment.

References

Solving Q&A tasks with Toloka’s Python library and SQuAD2.0