A Definition of Text Encoding
Text encoding is the process of 1) converting documents to a
computer-readable text format and 2) marking significant portions of the text
document with codes so that they can be easily identified and manipulated by
computer. The goal of text encoding is to examine and process a set of
documents only a single time, relying on the tags for later retrieval and
analysis. Although this involves a substantial investment of time for the
initial encoding, the cost is offset by the ease of subsequent document
management and analysis. In most cases, English documents are converted into a
plain ASCII text format, which is then marked or 'tagged' using XML (Extensible
Markup Language) codes. This procedure allows text items of particular
interest, whatever they are, to be explicitly identified for later electronic
access. When correctly planned and implemented, text encoding enables the
automatic retrieval of information buried within a document set, which may
contain thousands or millions of documents, without the cost of manually
reinspecting each document. Text encoding is frequently the first step in
preparing a document set for any type of large-scale analysis. Although the
concept and implementation of text encoding are straightforward, the initial
design of XML mark-up protocols requires experience and technical expertise.
The reason for this is that the ultimate usefulness of the encoded text is
determined by the initial mark-up. Not marking a feature that is technically,
linguistically, or materially significant means that it may not be retrievable
at a reliable rate without reexamining the original documents, which defeats
the purpose of text encoding. In designing mark-up protocols, an understanding
of the logical structure of language data and the technical aspects of processing
XML documents is equal in importance to an understanding of the subject
matter.
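
As a rough illustration of the retrieval step described above, the following Python sketch parses a small encoded document and pulls out every tagged item of a given kind without rereading the prose. The sample letter, the tag names (letter, date, person, place), and the helper function extract are all hypothetical and chosen only for this example; an actual project would define its own mark-up protocol.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoded document; the tag names are illustrative only
# and do not come from any particular mark-up protocol.
ENCODED = """<letter>
  <date when="1862-03-04">March 4, 1862</date>
  <p>I met <person>Mr. Hale</person> in <place>Boston</place> and
  later wrote to <person>Mrs. Ward</person>.</p>
</letter>"""


def extract(xml_text, tag):
    """Return the text of every element with the given tag, in document order."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(tag)]


# Retrieval relies only on the tags, not on rereading the document.
print(extract(ENCODED, "person"))
print(extract(ENCODED, "place"))
```

Note that a feature left untagged in this sketch, such as an unmarked organization name, could not be retrieved this way, which is why the initial mark-up design matters.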