Text and Language Technology Group
Text-Tech

A Definition of Text Encoding

Text encoding is the process of (1) converting documents to a computer-readable text format and (2) marking significant portions of the text with codes so that they can be easily identified and manipulated by computer. The goal is to examine and process a set of documents only once, relying on the tags for later retrieval and analysis. Although the initial encoding requires a substantial investment of time, the cost is offset by the ease of subsequent document management and analysis.

In most cases, English documents are converted into a plain ASCII text format, which is then marked or 'tagged' using XML (Extensible Markup Language) codes. This procedure allows text items of particular interest, whatever they are, to be explicitly identified for later electronic access. When correctly planned and implemented, text encoding enables the automatic retrieval of information buried within a document set, which may contain thousands or millions of documents, without the cost of manually reinspecting each document. Text encoding is frequently the first step in preparing a document set for any type of large-scale analysis.

Although the concept and implementation of text encoding are straightforward, the initial design of XML markup protocols requires experience and technical expertise, because the ultimate usefulness of the encoded text is determined by the initial markup. A feature that is technically, linguistically, or materially significant but left unmarked may not be retrievable at a reliable rate without reexamining the original documents, which defeats the purpose of text encoding. In designing markup protocols, an understanding of the logical structure of language data and of the technical aspects of processing XML documents is equal in importance to an understanding of the subject matter.
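As a rough illustration of the tag-once, retrieve-later idea described above, the following Python sketch encodes a short invented letter in XML and then pulls out every tagged item of a given kind. The tag names (letter, date, person, place), the attributes, the sample text, and the helper function find_tagged are all assumptions made for this example, not a markup protocol taken from this page; a real protocol would be designed around the document set and the questions to be asked of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoded document. The tag names and attributes below are
# illustrative assumptions, not an actual markup scheme from the source.
ENCODED = """\
<letter id="doc-001">
  <date when="1998-03-14">March 14, 1998</date>
  <p>Dear <person role="recipient">Mr. Hale</person>, I met
  <person role="mentioned">Ms. Ortiz</person> in
  <place>Athens</place> last week.</p>
</letter>
"""


def find_tagged(xml_text, tag):
    """Return the text of every element with the given tag, in document order."""
    root = ET.fromstring(xml_text)
    # itertext() gathers the text inside each element, including nested text.
    return ["".join(el.itertext()).strip() for el in root.iter(tag)]


# Retrieval relies only on the tags; the prose itself is never re-read by hand.
print(find_tagged(ENCODED, "person"))
print(find_tagged(ENCODED, "place"))
```

The sketch also shows the limitation noted above: a request for a feature that was never marked, such as find_tagged(ENCODED, "organization"), returns an empty list, and the only remedy is to go back to the original documents.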




700 Oglethorpe Ave. • Athens, Georgia 30606 • Phone: 706-549-5519 • Fax: 706-549-1228