Contact Record Extraction Data v1.0

10/5/06: The data is available here: contact_data_v1.0.tgz (md5sum = c14ae66af6c4df0a287e5d14cb6a6d49)
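
To confirm that a download is intact, the published md5sum can be compared against a checksum computed locally. The following is a minimal sketch using Python's standard hashlib module; the function names md5_of_file and archive_is_intact are illustrative and not part of the dataset distribution, and the local file path is whatever you saved the archive as.

```python
import hashlib

# Checksum published alongside the download link above.
EXPECTED_MD5 = "c14ae66af6c4df0a287e5d14cb6a6d49"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling archive_is_intact("contact_data_v1.0.tgz") on the downloaded file should return True if the archive matches the checksum listed above.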

The Contact Record Extraction Data was labeled at the University of Massachusetts to train and evaluate sequence models. The dataset contains text blocks annotated with contact record tags (e.g. phone number, city, state). The format is a mash-up of SGML and HTML and should be parsable by a savvy script.
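
As a rough illustration of what such a script might look like, here is a sketch that pulls (tag, text) pairs out of a block. It ASSUMES each labeled field is wrapped as <TagName>text</TagName> with no nesting; the actual markup in the archive may differ, so check the files before relying on it. The sample string is made up for the example and is not taken from the dataset.

```python
import re

# ASSUMPTION: a labeled field looks like <FirstName>Jane</FirstName>.
# The real files may use a different convention; adjust the pattern to match.
FIELD_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)>(.*?)</\1>", re.DOTALL)


def extract_fields(block):
    """Return (tag, text) pairs for each <Tag>...</Tag> span in `block`.

    Whitespace inside a field is collapsed to single spaces.
    Nested tags are not handled by this pattern.
    """
    return [(tag, " ".join(text.split())) for tag, text in FIELD_RE.findall(block)]


# Hypothetical sample block, invented for illustration only.
sample = "<FirstName>Jane</FirstName> <LastName>Doe</LastName>\n<State>MA</State>"
fields = extract_fields(sample)
```

A regular expression is used here because the format is described as a mix of SGML and HTML, which a strict XML parser might reject.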

For more details, contact Aron Culotta (aronwc at gmail dot com).

The data was obtained from three sources:

enron: labeled emails from the Enron email corpus

privacyrights: labeled contact blocks from privacyrights.org

webkb: labeled Web pages from the WebKB data.

Some statistics about the data:

numTokens: 22054

numFields: 11188

numBlocks: 646

avgFieldsPerBlock: 17.3188854489164

avgTokensPerBlock: 34.1393188854489

avgTokensPerField: 1.97121916338935
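
The three averages follow directly from the three counts, which gives a quick consistency check on the figures above. A small sketch (the variable names are chosen here for illustration):

```python
# Counts as reported in the statistics above.
num_tokens = 22054
num_fields = 11188
num_blocks = 646

# Each average is a simple ratio of two of the counts.
avg_fields_per_block = num_fields / num_blocks
avg_tokens_per_block = num_tokens / num_blocks
avg_tokens_per_field = num_tokens / num_fields

print(avg_fields_per_block, avg_tokens_per_block, avg_tokens_per_field)
```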

The entire list of classes is:

FirstName

MiddleName

Nickname

Suffix

LastName

Title

JobTitle

CompanyName

Department

AddressLine

City1

City2

State

Country

PostalCode

HomePhoneNumber

FaxNumber

CompanyPhoneNumber

DirectPhoneNumber

MobilePhoneNumber

PagerNumber

WebPageURL

InstantMessagingAddress

VoiceMail