
Appendix


Table of Links

Abstract and 1. Introduction

2. Related Work

3. Proposed Dataset

4. SymTax Model

    4.1 Prefetcher

    4.2 Enricher

    4.3 Reranker

5. Experiments and Results

6. Analysis

    6.1 Ablation Study

    6.2 Quantitative Analysis and 6.3 Qualitative Analysis

7. Conclusion

8. Limitations

9. Ethics Statement and References

Appendix

A. Appendix

We conduct another quantitative analysis using the section heading as an additional signal in our reranking module.

A.1 Additional Experiment

We concatenate the section heading with the query context in the reranker and run our two SymTax variants. Table 6 shows that using the section heading leads to a significant performance drop for SciBERT_vector on all metrics, whereas for SPECTER_graph the overall performance remains nearly the same. Both patterns indicate that the section heading acts as noise rather than as a useful feature, which suggests that the citation contexts are already rich. Since our proposed dataset contains this additional feature, it is suitable for two additional tasks: context-specific citation generation (Wang et al., 2022) and section heading prediction for a given citation context.
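The concatenation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_reranker_query`, the `[SEP]` separator, and the choice to place the heading before the context are all assumptions, since the appendix does not specify them.

```python
from typing import Optional

# Assumed separator; the appendix does not say how the two parts are joined.
SEP = " [SEP] "


def build_reranker_query(context: str, section_heading: Optional[str] = None) -> str:
    """Return the query text passed to a reranker (hypothetical helper).

    Without a heading, the citation context is used unchanged. With a
    heading, the heading is prepended so both signals share one sequence.
    """
    context = context.strip()
    if section_heading is None or not section_heading.strip():
        return context
    return section_heading.strip() + SEP + context


example_context = "We build on the prefetcher of CITATION for retrieval."
plain_query = build_reranker_query(example_context)
heading_query = build_reranker_query(example_context, "Related Work")
```

Running both query forms through the same reranker on the same samples is what makes a with/without comparison such as the one reported in Table 6 possible.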

A.2 Implementation Details

A.3 Datasets

ACL-200. This dataset contains papers published at ACL venues. It is a processed version of the ACL-ARC dataset created using ParsCit[12], a string parsing package based on conditional random fields. It contains citation contexts obtained by considering ±200 characters around the citation placeholder.


Table 6: Analysis of the inclusion of section heading as a feature on 10k random samples from ArSyTa data. The results indicate that using section heading as a feature acts as noise, as the citation contexts are already rich.


FullTextPeerRead. It is an extension of the PeerRead dataset, which contains the peer reviews of papers submitted to top venues in the Artificial Intelligence domain. FullTextPeerRead thus contains the citation contexts from the papers present in the PeerRead dataset.


RefSeer. This dataset is curated by extracting scientific articles belonging to various engineering domains. A citation excerpt is taken as the text of ±200 characters around the citation marker. It is a large dataset that contains 3.7 million citation contexts.
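A character-window excerpt of the kind used by ACL-200 and RefSeer can be sketched as below. This is only an illustration under stated assumptions: the function name `citation_context` and the idea that the marker position is already known as a character span are hypothetical, and the datasets' actual preprocessing may differ (for example in how the marker itself is handled).

```python
def citation_context(text: str, marker_start: int, marker_end: int, window: int = 200) -> str:
    """Return the marker plus up to `window` characters on each side.

    `text[marker_start:marker_end]` is assumed to be the citation marker.
    The window is clipped at the document boundaries.
    """
    if not (0 <= marker_start <= marker_end <= len(text)):
        raise ValueError("marker span lies outside the text")
    begin = max(0, marker_start - window)
    end = min(len(text), marker_end + window)
    return text[begin:end]


# Synthetic document: 300 characters, a marker, then 300 more characters.
document = "a" * 300 + "[1]" + "b" * 300
excerpt = citation_context(document, 300, 303)
```

Clipping at the boundaries means a citation near the start or end of a paper yields a shorter excerpt instead of an error.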


arXiv (HAtten). It is created using arXiv papers from S2ORC[13], a large and diverse corpus of scientific articles. For every paper that has its full text available, a citation excerpt is considered only if the cited paper is also present in the arXiv database. Following the convention set by ACL-200 and RefSeer, this dataset is also curated by considering the words in the ±200-character window around the citation marker.
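The in-corpus filter described for arXiv (HAtten) can be sketched as a simple membership check. The data layout (pairs of excerpt text and cited-paper identifier), the function name, and the identifiers below are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Set, Tuple

Excerpt = Tuple[str, str]  # (citation context, identifier of the cited paper)


def keep_in_corpus_citations(excerpts: Iterable[Excerpt], corpus_ids: Set[str]) -> List[Excerpt]:
    """Keep only excerpts whose cited paper is itself part of the corpus."""
    return [(context, cited_id) for context, cited_id in excerpts if cited_id in corpus_ids]


# Hypothetical identifiers; a real pipeline would use the corpus's own paper IDs.
corpus = {"paper-a", "paper-b"}
candidates = [
    ("... as shown by [1] ...", "paper-a"),
    ("... following [2] ...", "paper-z"),  # cited paper is not in the corpus
    ("... we extend [3] ...", "paper-b"),
]
kept = keep_in_corpus_citations(candidates, corpus)
```

Filtering this way guarantees that every retained citation context has a target the recommender can actually retrieve.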


Figure 3: Statistics show the distribution of major category classes of flat-level arXiv taxonomy corresponding to ArSyTa. The highest number of research papers belong to Machine Learning (cs.LG), Computer Vision (cs.CV), and Artificial Intelligence (cs.AI).


Authors:

(1) Karan Goyal, IIIT Delhi, India;

(2) Mayank Goel, NSUT Delhi, India;

(3) Vikram Goyal, IIIT Delhi, India;

(4) Mukesh Mohania, IIIT Delhi, India.


This paper is available on arXiv under the CC BY-SA 4.0 Deed (Attribution-ShareAlike 4.0 International) license.

[12] https://github.com/knmnyn/ParsCit


[13] https://github.com/allenai/s2orc
