Text and Language Technology Group
Text-Tech

An Illustration of Text Encoding

The primary task of text-encoding, specifically text-encoding with XML (Extensible Markup Language), is to create a system for tagging important information in a document, which in turn divides that document into semantically meaningful parts. Once encoded, a document can be used for a variety of purposes, most notably rapid information retrieval and subsequent processing. An example of a massive text-encoding initiative is the NIH-NCI Tobacco-Documents Project at The University of Georgia, available on the Web at http://www.uga.edu/tobaccodocs. As a result of the class action lawsuits against Big Tobacco, the internal documents of several major tobacco companies were made a matter of public record. The principal undertaking of the Tobacco-Documents Project was to amass a large corpus of these internal documents in order to research linguistic patterns that could indicate public deception on the part of the tobacco industry. Each document was encoded with a specific set of XML tags created to facilitate the research objectives of this particular project. The following illustration uses one industry-internal document that was made public; for the purposes of this illustration, however, the text-encoding protocol has been modified.

Format

This letter is written in a typical style. It is a reply to a general inquiry from an individual interested in an industry-supported law banning the sale of cigarettes to minors. While the first paragraph of the letter touches on the original inquiry, a law prohibiting the sale of cigarettes to minors, it largely attempts to manipulate the subject by defending an individual’s personal decision to smoke and by calling into question the legitimacy of scientific claims regarding various smoking-related health hazards. The second paragraph of the letter digresses into a calculated defense of the tobacco industry as a whole. Finally, the third paragraph addresses this particular tobacco company’s ad campaign to curtail youth smoking.

October 29, 1985

Ms. Lori Coleman
P.O. Box 42
Apex, NC 27502

Dear Ms Coleman:

Thank you for your letter of September 17. We apologize for the delay in responding.

Our company does not approve of young people smoking. We believe that adults, however, should be permitted to make their own decision whether or not to smoke. We also believe that until scientific research can establish what really causes the diseases with which smoking has been statistically associated, it would be unfair to advocate any law prohibiting the sales of cigarettes.

We in tobacco regard ours as an honorable trade, bringing to people everywhere the simple pleasure of a product that has a long and respectable history behind it. Tobacco was the first business enterprise in America—begun soon after the settlement of Jamestown in 1607—and has contributed materially to the nation's growth and welfare since.

Enclosed are copies of ads we recently ran as part of our public issues campaign on youth smoking. These ads were run in several youth publications such as Teen and Seventeen magazines, and we have received many favorable responses from young people.

We appreciate your taking the time to write.

Sincerely


MGA:kde/Enclosures

P.S. Any questions concerning the warning labels should be directed to:

The Tobacco Institute
1875 I Street, Northwest, Suite 800
Washington, DC 20006

Tagged features

  1. The letter’s distinguishing information, such as the date, the name of the recipient, and the address.
  2. The different structural parts of the letter, including the head, body, and postscript.
  3. The features that may be indicative of deceptive language.
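
The three kinds of tagged features can be sketched with a small example. The project's actual tag set is not reproduced here, so every element and attribute name below (`letter`, `head`, `date`, `when`, `recipient`, `body`, `postscript`, `deception`, `type`) is an assumption made purely for illustration. The sketch encodes a shortened version of the letter and reads each kind of feature back with Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding: all element and attribute names are illustrative
# assumptions, not the tag set used by the Tobacco-Documents Project.
ENCODED_LETTER = """\
<letter>
  <head>
    <date when="1985-10-29">October 29, 1985</date>
    <recipient>
      <name>Ms. Lori Coleman</name>
      <address>P.O. Box 42, Apex, NC 27502</address>
    </recipient>
  </head>
  <body>
    <p>Thank you for your letter of September 17.</p>
    <p><deception type="subject-manipulation">We believe that adults, however, should be permitted to make their own decision whether or not to smoke.</deception></p>
    <p><deception type="digression">We in tobacco regard ours as an honorable trade.</deception></p>
  </body>
  <postscript>
    <p>Any questions concerning the warning labels should be directed to: The Tobacco Institute</p>
  </postscript>
</letter>
"""

letter = ET.fromstring(ENCODED_LETTER)

# 1. Distinguishing information: the date and the recipient's name.
date = letter.find("head/date").get("when")
recipient = letter.find("head/recipient/name").text

# 2. Structural parts: the direct children of the root element.
parts = [child.tag for child in letter]

# 3. Features that may indicate deceptive language, by assumed type label.
features = [d.get("type") for d in letter.iter("deception")]

print(date, recipient, parts, features)
```

Storing the date in a machine-readable attribute alongside the human-readable text is one possible design choice; it keeps the original wording intact while making the value easy to compare and sort.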

Discussion

There are obvious reasons for tagging information such as the letter’s date and the recipient’s name and address. Tagging this identifying information makes it possible, for example, to search for and retrieve all letters written in 1985, or in that entire decade.
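
As a rough sketch of this kind of retrieval, the function below selects documents by year range from a tiny in-memory collection. The tag names (`letter`, `head`, `date`, `when`), the document identifiers, and the helper `letters_in_years` are all hypothetical. Only the 1985 date comes from the letter above; the other dates are invented placeholders so that the filter has something to exclude.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-corpus keyed by made-up document identifiers.
# Only "letter-1985" reflects the letter shown above; the other two
# entries are invented placeholders for illustration.
CORPUS = {
    "letter-1985": '<letter><head><date when="1985-10-29"/></head></letter>',
    "placeholder-1982": '<letter><head><date when="1982-03-02"/></head></letter>',
    "placeholder-1992": '<letter><head><date when="1992-06-15"/></head></letter>',
}


def letters_in_years(corpus, first_year, last_year):
    """Return the ids of documents dated within the inclusive year range."""
    matches = []
    for doc_id, xml_text in corpus.items():
        root = ET.fromstring(xml_text)
        when = root.find("head/date").get("when")  # e.g. "1985-10-29"
        year = int(when[:4])
        if first_year <= year <= last_year:
            matches.append(doc_id)
    return sorted(matches)


print(letters_in_years(CORPUS, 1985, 1985))  # letters written in 1985
print(letters_in_years(CORPUS, 1980, 1989))  # letters written in the 1980s
```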

The different parts of the letter are tagged in order to break it into logical units. For example, instances of deception would most likely be found in the body of the letter, as opposed to the head or the postscript. Again, the advantage of text-encoding is that it transforms basic documents into semantic components, describing both the form and the content of the document in a systematic way.
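
Structural tags make it possible to confine a search to one part of a document. The sketch below assumes the same hypothetical tag names as before (`head`, `body`, `postscript`, `deception`) and a hypothetical helper, `tagged_passages`, that returns the tagged passages found inside a single structural part.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical encoding of the letter; tag names are assumptions.
LETTER = """\
<letter>
  <head><date when="1985-10-29">October 29, 1985</date></head>
  <body>
    <p><deception type="digression">We in tobacco regard ours as an honorable trade.</deception></p>
  </body>
  <postscript>
    <p>Any questions concerning the warning labels should be directed to: The Tobacco Institute</p>
  </postscript>
</letter>
"""


def tagged_passages(root, part):
    """Return the text of every <deception> element inside one structural part."""
    section = root.find(part)
    if section is None:
        return []
    return ["".join(d.itertext()) for d in section.iter("deception")]


root = ET.fromstring(LETTER)
print(tagged_passages(root, "body"))
print(tagged_passages(root, "postscript"))
```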

Finally, the letter is tagged for interesting linguistic features that may indicate some deception on the tobacco company’s part. By tagging the use of subject manipulation and digression, researchers could examine these types of formulaic linguistic maneuvers in the context of other internal documents, such as company memos and advertisements. In a larger context, such research could contribute to an algorithm for detecting deceptive linguistic practices in general.
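
One simple way to compare such features across document types is to tally them. The sketch below is only an illustration: the root element names (`letter`, `memo`, `advertisement`), the `deception` tag with its `type` labels, and the helper `tally_features` are assumptions, and the three documents are empty placeholders rather than real corpus data.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Placeholder documents: the structure and the tags are invented for
# illustration and do not reproduce any real memo or advertisement.
CORPUS = [
    '<letter><body><deception type="subject-manipulation"/>'
    '<deception type="digression"/></body></letter>',
    '<memo><body><deception type="digression"/></body></memo>',
    '<advertisement><body/></advertisement>',
]


def tally_features(corpus):
    """Count tagged features per (document type, feature type) pair."""
    counts = Counter()
    for xml_text in corpus:
        root = ET.fromstring(xml_text)
        for feature in root.iter("deception"):
            counts[(root.tag, feature.get("type"))] += 1
    return counts


tally = tally_features(CORPUS)
for (doc_type, feature_type), n in sorted(tally.items()):
    print(doc_type, feature_type, n)
```

A tally like this only counts what human encoders have already tagged; it does not detect deception on its own.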

There are several motives for encoding information in documents, parts of documents, or large collections of documents. First, text-encoding transforms documents into a simple data format written in plain ASCII text, which makes them less susceptible to the kind of corruption that could render valuable files unreadable. Second, text-encoding allows you to create a task-specific vocabulary that makes the important information in the document self-describing. In particular, the above example demonstrates a vocabulary for encoding letters and their linguistic characteristics, which converts a simple file into an information-rich document whose semantic tags mark identifiable parts. Finally, and most importantly, text-encoding allows you to automate data retrieval and processing in large collections of documents.


700 Oglethorpe Ave. • Athens, Georgia 30606 • Phone: 706-549-5519 • Fax: 706-549-1228