Contact Record Extraction Data v1.0

10/5/06 The data is available here: contact_data_v1.0.tgz (md5sum = c14ae66af6c4df0a287e5d14cb6a6d49)
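
One way to confirm that the download is intact is to recompute the MD5 checksum and compare it with the value quoted above. The following is a minimal Python sketch; the function names md5_of_file and archive_is_intact are made up for this illustration, and the archive is assumed to have been saved locally under its original filename.

```python
import hashlib

# Checksum quoted in the release note above.
EXPECTED_MD5 = "c14ae66af6c4df0a287e5d14cb6a6d49"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

For example, archive_is_intact("contact_data_v1.0.tgz") should return True for an uncorrupted copy of the archive.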

The Contact Record Extraction Data was labeled at the University of Massachusetts to train and evaluate sequence models. The dataset contains text blocks annotated with contact record tags (e.g. phone number, city, state). The format is a mash-up of SGML and HTML, and should be parsable with a short script.
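
As a rough illustration of such a script, the Python sketch below pulls tagged fields out of a text block with a regular expression. It rests on an assumption that is not confirmed here: that each field is wrapped as <ClassName>text</ClassName>, with no attributes and no nesting. The sample block is invented for the example and is not taken from the data; check the actual files and adjust the pattern before relying on it.

```python
import re

# ASSUMPTION: fields look like <ClassName>text</ClassName>, with no attributes
# and no nested tags. The real files may differ; adapt the pattern as needed.
FIELD_PATTERN = re.compile(r"<([A-Za-z0-9]+)>(.*?)</\1>", re.DOTALL)


def extract_fields(block):
    """Return (class_name, text) pairs in the order they appear in `block`."""
    fields = []
    for match in FIELD_PATTERN.finditer(block):
        # Collapse internal whitespace so multi-line fields become one line.
        text = " ".join(match.group(2).split())
        fields.append((match.group(1), text))
    return fields


# Invented sample block, for illustration only.
sample = (
    "<FirstName>Jane</FirstName> <LastName>Doe</LastName>\n"
    "<State>MA</State> <PostalCode>01003</PostalCode>"
)

for class_name, text in extract_fields(sample):
    print(class_name, text)
```

A regular expression is enough only under the no-nesting assumption; if the files turn out to contain nested or malformed markup, a tolerant HTML parser would be a safer choice.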

For more details, contact Aron Culotta (aronwc at gmail dot com).

The data was obtained from three sources:

  • enron: labeled emails from the Enron email corpus
  • privacyrights: labeled contact blocks from privacyrights.org
  • webkb: labeled Web pages from the WebKB data

    Some statistics about the data:

  • numTokens: 22054
  • numFields: 11188
  • numBlocks: 646
  • avgFieldsPerBlock: 17.3188854489164
  • avgTokensPerBlock: 34.1393188854489
  • avgTokensPerField: 1.97121916338935
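
The three averages are plain ratios of the three counts, which the short Python check below reproduces. The variable names are chosen for this sketch only.

```python
# Counts reported in the statistics above.
num_tokens = 22054
num_fields = 11188
num_blocks = 646

# Each average is a ratio of two of the counts.
avg_fields_per_block = num_fields / num_blocks
avg_tokens_per_block = num_tokens / num_blocks
avg_tokens_per_field = num_tokens / num_fields

print(avg_fields_per_block, avg_tokens_per_block, avg_tokens_per_field)
```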

    The complete list of classes is:

  • FirstName
  • MiddleName
  • Nickname
  • Suffix
  • LastName
  • Title
  • JobTitle
  • CompanyName
  • Department
  • AddressLine
  • City1
  • City2
  • State
  • Country
  • PostalCode
  • HomePhoneNumber
  • FaxNumber
  • CompanyPhoneNumber
  • DirectPhoneNumber
  • MobilePhoneNumber
  • PagerNumber
  • WebPageURL
  • Email
  • InstantMessagingAddress
  • VoiceMail