An Illustration of Text Encoding

The primary task of text-encoding, specifically text-encoding with XML (Extensible Markup Language), is to create a system for tagging important information in a document, which in turn divides that document into semantically meaningful parts. Once encoded, a document can be used for a variety of purposes, chiefly rapid information retrieval and subsequent processing.

An example of a massive text-encoding initiative is the NIH-NCI Tobacco-Documents Project at The University of Georgia, available on the Web at http://www.uga.edu/tobaccodocs. As a result of the class action lawsuits against Big Tobacco, the internal documents of several major tobacco companies were made a matter of public record. The principal undertaking of the Tobacco-Documents Project was to amass a large corpus of these internal documents in order to research linguistic patterns that could indicate public deception on the part of the tobacco industry. Each document was encoded with a specific set of XML tags created to facilitate the research objectives of this particular project.

The following illustration involves one industry-internal document made public; for the purposes of this illustration, however, the text-encoding protocol has been modified.

Format

This letter is written in a typical style. It is a reply to a general inquiry from an individual interested in an industry-supported law banning the sale of cigarettes to youths. While the first paragraph of the letter touches on the original inquiry, a law prohibiting the sale of cigarettes to minors, it largely attempts to manipulate the subject by defending an individual’s personal decision to smoke and by calling into question the legitimacy of scientific claims regarding various smoking-related health hazards. The second paragraph of the letter digresses into a calculated defense of the tobacco industry as a whole. Finally, the third paragraph addresses this particular tobacco company’s ad campaign to curtail youth smoking.
Tagged features
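As a rough sketch of how a letter with this format might be encoded, the Python snippet below embeds a small XML document and lists the linguistic features tagged in its body. The tag and attribute names (`letter`, `head`, `body`, `postscript`, `feature`, `type`, `n`, `year`) are assumptions made for this sketch, not the project's actual tag set, and the bracketed text is placeholder content rather than the wording of the real letter.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag set for illustration; not the project's actual XML vocabulary.
# Bracketed text is placeholder content, and year="1985" is an illustrative value.
LETTER = """\
<letter>
  <head>
    <date year="1985">[date of letter]</date>
    <recipient>
      <name>[recipient name]</name>
      <address>[recipient address]</address>
    </recipient>
  </head>
  <body>
    <p n="1"><feature type="subject-manipulation">[defense of the personal decision to smoke]</feature></p>
    <p n="2"><feature type="digression">[defense of the tobacco industry]</feature></p>
    <p n="3">[description of the youth-smoking ad campaign]</p>
  </body>
  <postscript>[postscript]</postscript>
</letter>
"""


def tagged_features(xml_text):
    """Return (paragraph number, feature type) pairs found in the letter body."""
    root = ET.fromstring(xml_text)
    pairs = []
    for para in root.findall("./body/p"):
        # Only features that are direct children of a body paragraph are listed.
        for feat in para.findall("feature"):
            pairs.append((para.get("n"), feat.get("type")))
    return pairs


print(tagged_features(LETTER))
```

Keeping the feature tags inside numbered paragraphs means each tagged maneuver can be traced back to its position in the letter.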
Discussion

There are obvious reasons for tagging information such as the letter’s date and the recipient’s name and address. Tagging this identifying information makes it possible, for example, to search for and retrieve all letters written in 1985, or in that decade as a whole. The different parts of the letter are tagged separately in order to divide it into logical units. For example, any instances of deception would most likely be found in the body of the letter, as opposed to the head or the postscript. Again, the advantage of text-encoding is that it transforms basic documents into semantic components, describing both the form and the content of the document in a systematic way.

Finally, the letter is tagged for linguistic features that may indicate deception on the tobacco company’s part. By tagging subject manipulation and digression, it becomes possible to examine these formulaic linguistic maneuvers in the context of other internal documents, such as company memos and advertisements. In a larger context, such research could lend itself to discovering an algorithm for detecting deceptive linguistic practices in general.

There are several motives for encoding information in documents, parts of documents, or large collections of documents. Text-encoding transforms documents into a simple data format, written in ASCII text, which makes them less susceptible to the kind of corruption that could render valuable files unreadable. Text-encoding allows you to create a task-specific vocabulary that makes the important information in the document self-describing. In particular, the above example demonstrates a vocabulary for encoding letters and their linguistic characteristics, which converts a simple file into an information-rich document whose semantic tags mark identifiable parts. Finally and most importantly, text-encoding allows you to automate data retrieval and processing in large collections of documents.
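The kind of automated retrieval described here can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the three-letter corpus, the tag names (`letter`, `head`, `date`, `body`, `feature`), the `year` and `type` attributes, and the helper functions `letters_in_decade` and `feature_counts` are all invented for this sketch and do not reflect the project's actual encoding or data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical mini-corpus: tag names, years, and feature types are invented
# for illustration and are not the project's actual encoding or data.
CORPUS = [
    '<letter><head><date year="1985"/></head>'
    '<body><feature type="digression">[text]</feature></body></letter>',
    '<letter><head><date year="1985"/></head><body>[text]</body></letter>',
    '<letter><head><date year="1992"/></head>'
    '<body><feature type="subject-manipulation">[text]</feature></body></letter>',
]


def letters_in_decade(roots, decade_start):
    """Return the letters whose tagged year falls in the ten years from decade_start."""
    selected = []
    for root in roots:
        year = int(root.find("./head/date").get("year"))
        if decade_start <= year < decade_start + 10:
            selected.append(root)
    return selected


def feature_counts(roots):
    """Count tagged linguistic features, looking only inside each letter's body."""
    counts = Counter()
    for root in roots:
        for feat in root.iterfind("./body//feature"):
            counts[feat.get("type")] += 1
    return counts


roots = [ET.fromstring(doc) for doc in CORPUS]
eighties = letters_in_decade(roots, 1980)
print(len(eighties), dict(feature_counts(eighties)))
```

Because the date and the body are tagged as separate parts, the same two helpers can restrict a search by time period and by location within the letter without inspecting the raw text.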