Table of Links

Abstract and 1. Introduction

2. Related Work
3. Proposed Dataset
4. SymTax Model

4.1 Prefetcher

4.2 Enricher

4.3 Reranker
5. Experiments and Results
6. Analysis

6.1 Ablation Study

6.2 Quantitative Analysis and 6.3 Qualitative Analysis
7. Conclusion
Limitations
Ethics Statement and References

Appendix

A. Appendix

We conduct another quantitative analysis using the section heading as an additional signal in our reranking module.

A.1 Additional Experiment

We concatenate the section heading with the query context in the reranker and run our two SymTax variants. Table 6 shows that adding the section heading leads to a significant performance drop for SciBERT_vector on all metrics, whereas for SPECTER_graph the overall performance remains nearly the same. Neither variant benefits, which suggests that the section heading acts as noise when used as a feature and that the citation contexts are already rich enough on their own. Since our proposed dataset contains this additional feature, it is also suitable for two further tasks: context-specific citation generation (Wang et al., 2022) and section heading prediction for a given citation context.
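The concatenation step can be illustrated with a minimal sketch. The function name `build_reranker_query`, the heading-first ordering, and the `[SEP]` separator string are assumptions made for illustration; the text only states that the section heading is concatenated with the query context.

```python
from typing import Optional


def build_reranker_query(
    context: str,
    section_heading: Optional[str] = None,
    sep: str = " [SEP] ",  # assumed separator, not specified in the text
) -> str:
    """Build the query text for a reranker (hypothetical helper).

    Without a heading, the query is just the citation context.
    With a heading, the heading is prepended to the context.
    """
    # Collapse runs of whitespace so both inputs are normalized the same way.
    context = " ".join(context.split())
    if not section_heading:
        return context
    heading = " ".join(section_heading.split())
    return f"{heading}{sep}{context}"


baseline = build_reranker_query("prior work uses BERT for retrieval")
with_heading = build_reranker_query(
    "prior work uses BERT for retrieval", "Related Work"
)
```

Running both query variants through the same reranker on the same samples is what allows a comparison like the one reported in Table 6.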

A.2 Implementation Details

A.3 Datasets

ACL-200. This dataset contains papers published at ACL venues. It is a processed version of the ACL-ARC dataset created using ParsCit[12], a string parsing package based on conditional random fields. It contains citation contexts formed by taking ±200 characters around the citation placeholder.

Table 6: Analysis of the inclusion of section heading as a feature on 10k random samples from ArSyTa data. The results indicate that using section heading as a feature acts as noise, as the citation contexts are already rich.

FullTextPeerRead. It is an expansion of the PeerRead dataset, which contains the peer reviews of papers submitted to top venues in the Artificial Intelligence domain. FullTextPeerRead contains the citation contexts from the papers present in the PeerRead dataset.

RefSeer. This dataset is curated by extracting scientific articles belonging to various engineering domains. A citation excerpt is the text within ±200 characters of the citation marker. It is a large dataset containing 3.7 million citation contexts.
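The ±200-character window used by ACL-200 and RefSeer can be sketched as follows. This is an illustration under stated assumptions, not the datasets' actual preprocessing code: the function name `citation_window` is hypothetical, the marker position is assumed to be known as a `(start, end)` character span, and the marker itself is assumed to be excluded from the excerpt.

```python
from typing import Tuple


def citation_window(
    text: str, start: int, end: int, window: int = 200
) -> Tuple[str, str]:
    """Return the text to the left and right of a citation marker.

    `start` and `end` delimit the marker (end exclusive). Up to `window`
    characters are taken on each side; fewer if the marker sits close to
    the beginning or end of the text.
    """
    left = text[max(0, start - window):start]
    right = text[end:end + window]
    return left, right


# A marker "[1]" preceded by 300 characters and followed by 50.
doc = "a" * 300 + "[1]" + "b" * 50
left, right = citation_window(doc, start=300, end=303)
```

Because the window is measured in characters rather than words or sentences, an excerpt can begin or end in the middle of a word.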

arXiv (HAtten). It is created using arXiv papers from S2ORC[13], a large and diverse corpus of scientific articles. For every paper whose full text is available, a citation excerpt is considered only if the cited paper is also present in the arXiv database. Following the convention set by ACL-200 and RefSeer, this dataset is also curated by taking the words within a ±200-character window around the citation marker.
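The inclusion rule for citation excerpts can be pictured as a simple membership filter. The record layout (`citing_id`, `cited_id`, `context`), the function name `keep_arxiv_citations`, and the example identifiers below are all hypothetical; they only illustrate the rule that an excerpt is kept when the cited paper is itself in the arXiv collection.

```python
from typing import Dict, Iterable, List, Set


def keep_arxiv_citations(
    citations: Iterable[Dict[str, str]], arxiv_ids: Set[str]
) -> List[Dict[str, str]]:
    """Keep only citation records whose cited paper is in `arxiv_ids`."""
    return [c for c in citations if c["cited_id"] in arxiv_ids]


# Made-up identifiers for illustration only.
arxiv_ids = {"paper-A", "paper-B"}
citations = [
    {"citing_id": "paper-A", "cited_id": "paper-B", "context": "as shown in ..."},
    {"citing_id": "paper-A", "cited_id": "book-X", "context": "see also ..."},
]
kept = keep_arxiv_citations(citations, arxiv_ids)
```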

Figure 3: Distribution of the major category classes of the flat-level arXiv taxonomy for ArSyTa. The largest numbers of research papers belong to Machine Learning (cs.LG), Computer Vision (cs.CV), and Artificial Intelligence (cs.AI).

Authors:

(1) Karan Goyal, IIIT Delhi, India (karang@iiitd.ac.in);

(2) Mayank Goel, NSUT Delhi, India (mayank.co19@nsut.ac.in);

(3) Vikram Goyal, IIIT Delhi, India (vikram@iiitd.ac.in);

(4) Mukesh Mohania, IIIT Delhi, India (mukesh@iiitd.ac.in).

This paper is available on arXiv under the CC BY-SA 4.0 Deed (Attribution-ShareAlike 4.0 International) license.

[12] https://github.com/knmnyn/ParsCit

[13] https://github.com/allenai/s2orc
