Authors:
(1) Silei Xu, Computer Science Department, Stanford University, Stanford, CA, with equal contribution {[email protected]};
(2) Shicheng Liu, Computer Science Department, Stanford University, Stanford, CA, with equal contribution {[email protected]};
(3) Theo Culhane, Computer Science Department, Stanford University, Stanford, CA {[email protected]};
(4) Elizaveta Pertseva, Computer Science Department, Stanford University, Stanford, CA {[email protected]};
(5) Meng-Hsi Wu, Computer Science Department, Stanford University, Stanford, CA, and Ailly.ai {[email protected]};
(6) Sina J. Semnani, Computer Science Department, Stanford University, Stanford, CA {[email protected]};
(7) Monica S. Lam, Computer Science Department, Stanford University, Stanford, CA {[email protected]}.
Wikidata is the largest public knowledge base, with over 12 billion facts represented by subject-predicate-object triples using 100+ million entities and 10,000 properties. Of these properties, 3,000 are useful for answering natural language questions, whereas the rest are used to link data in Wikidata with external library catalogs and database IDs.
Entities and properties are given unique identifiers, QIDs and PIDs, respectively. For example, the fact that Joe Biden is the president of the US can be represented as the triple (Q6279, P39, Q11696), where P39 is the PID for the property “position held”, and Q6279 and Q11696 are the QIDs for Joe Biden and the president of the United States, respectively.
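As a minimal illustration of this representation, the triple from the example can be held in a small Python structure. The `Triple` type and the `describe` helper are names introduced here for illustration only; the identifiers and labels are the ones given in the text.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One Wikidata fact: (subject QID, property PID, object QID)."""
    subject: str
    predicate: str
    obj: str


# Labels for the identifiers used in the example above.
LABELS = {
    "Q6279": "Joe Biden",
    "P39": "position held",
    "Q11696": "president of the United States",
}


def describe(triple: Triple, labels: dict) -> str:
    """Render a triple with readable labels, falling back to the raw ID."""
    return " | ".join(labels.get(part, part) for part in triple)


fact = Triple("Q6279", "P39", "Q11696")
print(describe(fact, LABELS))
```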
Unlike relational databases and Freebase, Wikidata has no predefined domains or types. Any entity can have an arbitrary set of properties. However, even though Wikidata is property-based, every named entity has one or more “instance of” properties pointing to some domain entity, and domain entities are organized into a hierarchy through the “subclass of” property.
Note that the names of domain entities and properties are unique. Non-domain entities, on the other hand, can be ambiguous. For example, “Lincoln” can refer to the president, a car brand, a sparrow, an aircraft, and many different cities.
We posit that it is impossible for LLMs to memorize the QIDs and PIDs for domains and properties. We modify the format of SPARQL queries to use the more mnemonic property name, instead of its PID. Similarly, we use entity names for domains. For example, the original SPARQL for the query “What car models does GM make?” is:
SELECT DISTINCT ?x WHERE {
  ?x wdt:P31/wdt:P279* wd:Q3231690.
  ?x wdt:P176 wd:Q81965.
}
This query seeks x such that x is an instance of (wdt:P31) an automobile model (wd:Q3231690) or of any of its transitive subclasses (wdt:P279), and x has General Motors (wd:Q81965) as its manufacturer (wdt:P176). Note that wdt is the prefix for Wikidata properties and wd is the prefix for Wikidata entities.
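To make the meaning of the property path P31/P279* concrete, here is a small sketch that evaluates the same two conditions over an in-memory list of triples. The triples are toy data: every identifier starting with `EX:` is made up for illustration and is not a real Wikidata ID, and the function names are assumptions of this sketch, not part of any Wikidata tooling.

```python
from collections import defaultdict

# Toy triples. Identifiers prefixed with "EX:" are hypothetical.
TRIPLES = [
    ("EX:sedan_x", "P31", "EX:sedan_model"),   # instance of
    ("EX:sedan_model", "P279", "Q3231690"),    # subclass of automobile model
    ("EX:sedan_x", "P176", "Q81965"),          # manufacturer: General Motors
    ("EX:truck_y", "P31", "Q3231690"),         # direct instance of automobile model
    ("EX:truck_y", "P176", "EX:other_maker"),  # made by someone else
]


def build_index(triples):
    """Map (subject, property) to the set of objects."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(s, p)].add(o)
    return index


def matches_domain(entity, domain, index):
    """True if entity -P31-> c and c reaches domain via zero or more P279 hops."""
    frontier = list(index.get((entity, "P31"), ()))
    seen = set()
    while frontier:
        cls = frontier.pop()
        if cls == domain:
            return True
        if cls in seen:
            continue
        seen.add(cls)
        frontier.extend(index.get((cls, "P279"), ()))
    return False


def car_models_made_by(maker, triples):
    """Toy equivalent of the SPARQL query above."""
    index = build_index(triples)
    subjects = {s for s, _, _ in triples}
    return sorted(
        s for s in subjects
        if matches_domain(s, "Q3231690", index)
        and maker in index.get((s, "P176"), ())
    )


print(car_models_made_by("Q81965", TRIPLES))
```

The `seen` set guards against cycles in the subclass hierarchy, which a naive recursive walk would not handle.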
With our modification, the query becomes:
SELECT DISTINCT ?x WHERE {
  ?x wdt:instance_of/wdt:subclass_of* wd:automobile_model.
  ?x wdt:manufacturer wd:Q81965.
}
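The rewrite from PIDs and domain QIDs to names can be sketched with two regular-expression substitutions. This is only an illustration under stated assumptions: the lookup tables below cover just the identifiers in the example, the underscore-joined names follow the query shown above, and `to_mnemonic` is a hypothetical helper name rather than the authors' implementation.

```python
import re

# Lookup tables restricted to the identifiers used in the example query.
PROPERTY_NAMES = {
    "P31": "instance_of",
    "P279": "subclass_of",
    "P176": "manufacturer",
}
DOMAIN_NAMES = {"Q3231690": "automobile_model"}


def to_mnemonic(sparql: str) -> str:
    """Replace known PIDs and domain QIDs with names; leave other IDs as-is."""

    def rename_property(match):
        pid = match.group(1)
        return "wdt:" + PROPERTY_NAMES.get(pid, pid)

    def rename_entity(match):
        qid = match.group(1)
        # Non-domain entities (e.g. General Motors) keep their QID.
        return "wd:" + DOMAIN_NAMES.get(qid, qid)

    sparql = re.sub(r"\bwdt:(P\d+)\b", rename_property, sparql)
    return re.sub(r"\bwd:(Q\d+)\b", rename_entity, sparql)


original = (
    "SELECT DISTINCT ?x WHERE { "
    "?x wdt:P31/wdt:P279* wd:Q3231690. "
    "?x wdt:P176 wd:Q81965. }"
)
print(to_mnemonic(original))
```

Running the inverse mapping before execution would turn the modified query back into standard SPARQL that a Wikidata endpoint accepts.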
For non-domain entity QIDs, we also accept a string in lieu of a QID in case of entity linking errors. At inference time, we use simple heuristics to resolve the string to a QID before applying the query. For example, “wd:Q81965” in the query may be replaced with “wd:GM”. See Section 3.2.2 for more details.
Normally, we refrain from changing standard query notations since LLMs have been pretrained on them. However, we posit that learning this new syntax is much easier than learning the PIDs and QIDs. Our experimentation with few-shot prompting suggests that LLMs can easily adjust to this format.
Linking entities for WikiWebQuestions is particularly difficult. First, since the dataset is collected from real-world questions without prompting the users for more information, users tend to refer to their entities of interest without using their full names. Second, the questions are generally short with very limited context, making it harder to disambiguate among entities with similar names. Lastly, many QIDs in Wikidata are used to represent terms not generally known as “named entities”.
For example, domain entities are often ignored by entity linker models. In “What is the biggest country in Europe by population?”, both “country” (Q6256) and “Europe” (Q46) are required to construct the correct SPARQL, but entity linkers provide only “Europe” and ignore “country”.
To handle ambiguous entities, we use an entity linker to first find the domain names and QIDs of the entities mentioned in the text. We train a semantic parser that accepts users’ input along with the results produced by the entity linker.
Formally, given a user input T, and a set of entity linker results ⟨e, q⟩, where e is the name (default label) Wikidata gives to an entity and q is its QID, the semantic parser produces the semantic parse of T in our modified SPARQL format.
For the General Motors example above, the SOTA ReFinED entity linker (Ayoola et al., 2022) returns {⟨General Motors, Q81965⟩}. Unfortunately, it misses the entity automobile model (Q3231690), a term not usually considered to be an entity.
We want our semantic parser to be able to recover from mistakes by an entity linker. That is, the semantic parser should use entity linking when it is helpful, but it should still try to predict the right logical form when the linker fails.
The semantic parser is trained to accept, along with the user query, an optional set of potentially useful QIDs from the entity linker. We include samples where some of the supplied linked entities are not used in the gold answer, as well as samples where there are missing linked entities. For the latter, we use mentions in the original query in lieu of the QIDs.
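One way to picture the second kind of training sample is as a transformation of the gold query: QIDs that the entity linker supplied stay as they are, and QIDs it missed are replaced by the mention from the user query. The sketch below assumes a mapping from each gold QID to its mention is available and joins multi-word mentions with underscores; both are assumptions of this illustration, and `build_training_target` is a hypothetical name.

```python
import re


def build_training_target(gold_sparql, linked_qids, mention_for_qid):
    """Keep QIDs the linker supplied; replace missed ones with their mention."""

    def swap(match):
        qid = match.group(1)
        if qid in linked_qids or qid not in mention_for_qid:
            return match.group(0)
        # Assumption: multi-word mentions are joined with underscores.
        return "wd:" + mention_for_qid[qid].replace(" ", "_")

    return re.sub(r"\bwd:(Q\d+)\b", swap, gold_sparql)


gold = (
    "SELECT DISTINCT ?x WHERE { "
    "?x wdt:instance_of/wdt:subclass_of* wd:automobile_model. "
    "?x wdt:manufacturer wd:Q81965. }"
)
mentions = {"Q81965": "GM"}  # mention as it appears in the user query

# Linker found General Motors: the QID is kept.
print(build_training_target(gold, {"Q81965"}, mentions))
# Linker missed it: the mention replaces the QID.
print(build_training_target(gold, set(), mentions))
```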
At inference time, we use the mentions to look up the QIDs in Wikidata. If multiple matches exist, the most popular entity is returned. An example is shown in Appendix A.
With the above example where the entity linker misses “automobile model”, the semantic parser is likely to predict “car model” by copying from the user query. We then search for “car model” among the aliases of domain entities to find the correct QID. This design allows the model to potentially recover from entity-linking failures.
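The inference-time lookup described above, matching a predicted mention against labels and aliases and preferring the most popular entity when several match, might look like the following sketch. The candidate list is toy data: the popularity scores are invented, the `EX:` entry is not a real Wikidata ID, and the aliases are assumed for illustration. How popularity is actually measured is not specified here.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    qid: str
    label: str
    aliases: tuple
    popularity: int  # hypothetical score; higher means more popular


def normalize(text: str) -> str:
    """Lowercase, treat underscores as spaces, collapse whitespace."""
    return " ".join(text.replace("_", " ").lower().split())


def resolve_mention(mention: str, candidates) -> Optional[str]:
    """Return the QID of the most popular candidate whose label or alias matches."""
    key = normalize(mention)
    matches = [
        c for c in candidates
        if key == normalize(c.label) or key in {normalize(a) for a in c.aliases}
    ]
    if not matches:
        return None
    return max(matches, key=lambda c: c.popularity).qid


# Toy candidates; scores and the EX: entry are made up.
CANDIDATES = [
    Candidate("Q3231690", "automobile model", ("car model",), 10),
    Candidate("Q81965", "General Motors", ("GM",), 50),
    Candidate("EX:gm_other", "GM", (), 5),
]

print(resolve_mention("car_model", CANDIDATES))
print(resolve_mention("GM", CANDIDATES))
```

Returning `None` for an unmatched mention leaves it to the caller to decide whether to drop the query or fall back to another strategy.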
This paper is available on arXiv under a CC 4.0 license.