Text and Language Technology Group

Definition: Text encoding is the process of 1) converting documents to a computer-readable text format and 2) marking significant portions of the text with codes so that they can be easily identified and manipulated by computer. The goal of text encoding is to examine and process a set of documents only once, relying on the tags for later retrieval and analysis. Although the initial encoding requires a substantial investment of time, the cost is offset by the ease of subsequent document management and analysis. In most cases, English documents are converted into a plain ASCII text format, which is then marked or 'tagged' using XML (Extensible Markup Language) codes. This procedure allows text items of particular interest, whatever they are, to be explicitly identified for later electronic access. When correctly planned and implemented, text encoding enables the automatic retrieval of information buried within a document set, which may contain thousands or millions of documents, without the cost of manually reinspecting each document. Text encoding is frequently the first step in preparing a document set for any type of large-scale analysis.

Although the concept and implementation of text encoding are straightforward, the initial design of XML mark-up protocols requires experience and technical expertise, because the ultimate usefulness of the encoded text is determined by the initial mark-up. If a feature that is technically, linguistically, or materially significant is left unmarked, it may not be reliably retrievable without reexamining the original documents, which defeats the purpose of text encoding. In designing mark-up protocols, an understanding of the logical structure of language data and of the technical aspects of processing XML documents is as important as an understanding of the subject matter.
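The two steps in the definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the memo text and the tag names (`memo`, `date`, `person`, `product`) are hypothetical, invented for this example, and do not represent an actual TLTG mark-up protocol. The sketch uses Python's standard `xml.etree.ElementTree` module to retrieve tagged items without rereading the prose.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoded memo; the tag and attribute names are illustrative only.
ENCODED_MEMO = """\
<memo id="m001">
  <date when="1999-03-15">March 15</date>:
  <person role="sender">J. Smith</person> asked
  <person role="recipient">A. Jones</person> to reorder
  <product>toner cartridges</product>.
</memo>"""


def tagged_items(xml_text, tag):
    """Return the text of every element named `tag`, in document order."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(tag)]


people = tagged_items(ENCODED_MEMO, "person")
products = tagged_items(ENCODED_MEMO, "product")
print(people)
print(products)
```

Note that only what was tagged can be retrieved this way. If the protocol had not marked `product`, the second query would return nothing, and the memo would have to be read again by hand.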

Examples: A company may have a large collection of documents, anything from purchase orders to memos to company publications, that are already on computer or that need to be entered into a computer and processed. Marking these with a text encoding scheme makes the documents easier to process later and lets computer programs extract relevant information from them without human assistance.
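As a rough sketch of such unassisted extraction, the Python snippet below totals the tagged amounts per vendor across a small mixed document set. Everything in it is assumed for illustration: the three documents, the vendor names, and the `doc`, `vendor`, and `amount` tags are hypothetical and would be replaced by whatever the actual encoding scheme defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical document set: encoded documents of mixed types.
DOCUMENTS = [
    '<doc type="purchase-order"><vendor>Acme Supply</vendor>'
    '<amount currency="USD">120.50</amount></doc>',
    '<doc type="memo"><person>A. Jones</person> approved the order from '
    '<vendor>Acme Supply</vendor>.</doc>',
    '<doc type="purchase-order"><vendor>Peach State Paper</vendor>'
    '<amount currency="USD">75.00</amount></doc>',
]


def purchase_totals(documents):
    """Sum the tagged amounts per vendor across all purchase orders."""
    totals = {}
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        # Skip memos and any other document type.
        if root.get("type") != "purchase-order":
            continue
        vendor = root.findtext("vendor")
        amount = float(root.findtext("amount"))
        totals[vendor] = totals.get(vendor, 0.0) + amount
    return totals


totals = purchase_totals(DOCUMENTS)
print(totals)
```

The same loop would work unchanged on thousands of documents, since it relies only on the tags and never on the wording of any individual document.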

Services Offered: TLTG is prepared to design and implement text encoding protocols to suit a variety of industrial and academic needs. Our services include the following:



700 Oglethorpe Ave. • Athens, Georgia 30606 • Phone: 706-549-5519 • Fax: 706-549-1228