Contact Record Extraction Data v1.0
10/5/06 The data is available here: contact_data_v1.0.tgz
(md5sum = c14ae66af6c4df0a287e5d14cb6a6d49)
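One way to check a downloaded copy against the md5sum quoted above is sketched below. It assumes the archive has been saved locally as contact_data_v1.0.tgz; the function names md5_of_file and archive_is_intact are illustrative and not part of the release.

```python
import hashlib

# Checksum quoted in this README for contact_data_v1.0.tgz.
EXPECTED_MD5 = "c14ae66af6c4df0a287e5d14cb6a6d49"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Compare the file's digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected.lower()
```

Calling archive_is_intact("contact_data_v1.0.tgz") should return True for an undamaged download, assuming the file sits in the current directory.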
The Contact Record Extraction Data was labeled at the University of Massachusetts to train and evaluate sequence models. The dataset contains text blocks annotated with contact record tags (e.g., phone number, city, state). The format is a mash-up of SGML and HTML, and should be parsable by a savvy script.
For more details, contact Aron Culotta (aronwc at gmail dot com).
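As a rough illustration of the kind of script meant above, the sketch below pulls labeled fields out of a text block. It assumes each field is wrapped in a matching pair of tags named after its class, such as <FirstName>...</FirstName>; that layout is a guess based on the description here, and the actual markup in the archive may differ, so inspect a few files before relying on it. The names FIELD_TAGS, FIELD_PATTERN, and extract_fields are hypothetical.

```python
import re

# Class names as listed in this README.
FIELD_TAGS = (
    "FirstName", "MiddleName", "Nickname", "Suffix", "LastName", "Title",
    "JobTitle", "CompanyName", "Department", "AddressLine", "City1", "City2",
    "State", "Country", "PostalCode", "HomePhoneNumber", "FaxNumber",
    "CompanyPhoneNumber", "DirectPhoneNumber", "MobilePhoneNumber",
    "PagerNumber", "WebPageURL", "Email", "InstantMessagingAddress",
    "VoiceMail",
)

# ASSUMPTION: fields look like <Tag>text</Tag>. Only known class names are
# matched, so surrounding HTML tags such as <b> or <p> are ignored.
FIELD_PATTERN = re.compile(
    r"<(" + "|".join(FIELD_TAGS) + r")>(.*?)</\1>", re.DOTALL
)


def extract_fields(block):
    """Return (tag, text) pairs for every labeled field found in `block`."""
    return [
        (match.group(1), match.group(2).strip())
        for match in FIELD_PATTERN.finditer(block)
    ]


# Made-up sample block, not taken from the dataset.
sample = (
    "<b><FirstName>Jane</FirstName></b> <LastName>Doe</LastName>\n"
    "<Email>jane@example.com</Email>"
)
fields = extract_fields(sample)
```

A regular expression is used here rather than an XML parser because a mix of SGML and HTML is unlikely to be well-formed XML.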
The data was obtained from three sources:
enron: labeled emails from the Enron email corpus
privacyrights: labeled contact blocks from privacyrights.org
webkb: labeled Web pages from the WebKB data.
Some statistics about the data:
numTokens: 22054
numFields: 11188
numBlocks: 646
avgFieldsPerBlock: 17.3188854489164
avgTokensPerBlock: 34.1393188854489
avgTokensPerField: 1.97121916338935
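The three averages follow directly from the counts above. The short sketch below reproduces them; the variable names are illustrative only.

```python
# Counts quoted in this README.
num_tokens = 22054
num_fields = 11188
num_blocks = 646

# Each average is a plain ratio of two of the counts.
avg_fields_per_block = num_fields / num_blocks   # about 17.32
avg_tokens_per_block = num_tokens / num_blocks   # about 34.14
avg_tokens_per_field = num_tokens / num_fields   # about 1.97

print(avg_fields_per_block, avg_tokens_per_block, avg_tokens_per_field)
```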
The entire list of classes is:
FirstName
MiddleName
Nickname
Suffix
LastName
Title
JobTitle
CompanyName
Department
AddressLine
City1
City2
State
Country
PostalCode
HomePhoneNumber
FaxNumber
CompanyPhoneNumber
DirectPhoneNumber
MobilePhoneNumber
PagerNumber
WebPageURL
Email
InstantMessagingAddress
VoiceMail