The file citations.xml contains citation data annotated for coreference among several different entity types: titles, authors, venues, and institutions. The data was harvested from REXA (http://rexa.info), a digital library and search engine covering the computer science research literature and the people who create it.
Most of the XML format should be self-explanatory. The attributes 'id'
and 'clusterid' have been added to some elements, indicating their
mention id and cluster id, respectively; mentions that share a cluster
id refer to the same entity. Additionally, fileid refers to the file in
which the reference was found, and refID identifies the reference
within that paper. The concatenation of fileid and refID uniquely
identifies a reference.
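
As a rough illustration of how these attributes might be used, the sketch below groups mentions by cluster id and builds a key for each reference. The element names (citations, reference, author, title), the placement of fileid and refID as attributes on an enclosing element, and the sample values are all assumptions made for the example; check them against the actual contents of citations.xml before relying on this.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: tag names, nesting, and values are invented for
# illustration and may not match the real citations.xml.
SAMPLE = """<citations>
  <reference fileid="fileA" refID="3">
    <author id="1" clusterid="10">J. Smith</author>
    <title id="2" clusterid="20">An Example Title</title>
  </reference>
  <reference fileid="fileB" refID="1">
    <author id="3" clusterid="10">John Smith</author>
  </reference>
</citations>"""


def reference_keys(root):
    """Return (fileid, refID) pairs for every element carrying both attributes.

    A tuple is used instead of plain string concatenation so that, for
    example, ("f1", "23") and ("f12", "3") stay distinct.
    """
    return [
        (elem.get("fileid"), elem.get("refID"))
        for elem in root.iter()
        if elem.get("fileid") is not None and elem.get("refID") is not None
    ]


def group_mentions(root):
    """Map (tag, clusterid) to the list of mention ids in that cluster.

    Keying on the tag as well is a cautious assumption: it keeps clusters
    apart if cluster ids happen to be reused across entity types.
    """
    clusters = defaultdict(list)
    for elem in root.iter():
        cluster_id = elem.get("clusterid")
        if cluster_id is not None:
            clusters[(elem.tag, cluster_id)].append(elem.get("id"))
    return dict(clusters)


# For the real file, ET.parse("citations.xml").getroot() would replace this.
root = ET.fromstring(SAMPLE)
clusters = group_mentions(root)
keys = reference_keys(root)
```

In the sample, the two author mentions with clusterid 10 end up in one cluster, which is how coreferent mentions would be recovered under these assumptions.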
Some statistics about the dataset:
15524 entity mentions (author, title, venue, institution)
203 clusters
avg cluster size: 19.8669950738916
max cluster size: 301
min cluster size: 1
2454 citation mentions
243 header mentions
9411 authors
2811 titles
2031 venues
1271 institutions
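
Cluster statistics of the kind listed above (number of clusters and average, maximum, and minimum cluster size) can be recomputed from a flat list of cluster ids, one per mention. The following is a minimal sketch of that arithmetic, not the procedure used to produce the figures in this file; the function name and the toy input are hypothetical.

```python
from collections import Counter


def cluster_size_stats(cluster_ids):
    """Summarize cluster sizes given one cluster id per mention."""
    sizes = list(Counter(cluster_ids).values())
    return {
        "clusters": len(sizes),
        "avg": sum(sizes) / len(sizes),
        "max": max(sizes),
        "min": min(sizes),
    }


# Toy input: six mentions spread over three clusters of sizes 3, 1, and 2.
example_ids = ["a", "a", "a", "b", "c", "c"]
stats = cluster_size_stats(example_ids)
```

The function assumes a non-empty input; an empty list would raise a ZeroDivisionError when the average is computed.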