head 1.2; access; symbols; locks; strict; comment @# @; 1.2 date 2007.03.06.17.30.00; author TWikiGuest; state Exp; branches; next 1.1; 1.1 date 2005.03.12.19.29.53; author GregorHagedorn; state Exp; branches; next ; desc @none @ 1.2 log @Added topic name via script @ text @---+!! %TOPIC%
%META:TOPICINFO{author="GregorHagedorn" date="1110655793" format="1.0" version="1.1"}%
%META:TOPICPARENT{name="FormattedText"}%
In addition to the inline formatting of text (formatting characters as bold, italic, or subscript within a block-level element; see FormattedText), a major problem when marking up legacy text (digitized books) is handling high-level publishing artifacts such as page numbering and header and footer text. The block-level structure is not nested within pages, nor are pages nested within the block-level structure: a paragraph may run across a page break, and a page may contain several paragraphs. The text-syntactical view (paragraphs, inline formatting) and the publishing-syntactical view (pages) are therefore difficult to express in a single xml tree.

The following publishing artifacts are generally the most problematic:
   * page breaks (CSS 2: "page-break-before:always"; xhtml example: <br style="page-break-before:always" />)
   * header or footer text, including page numbers. (CSS 2 has no support for changing header or footer information for divisions of a document. CSS 3 might add this functionality; can someone research this?)
   * changes between portrait and landscape layout (CSS 2: "@@page{size:landscape}" and "portrait")
   * preservation of line break structure within paragraphs
   * hyphenation (often not considered worth preserving; the standard hyphen is a normal character, and CSS 2 has no support for optional hyphenation)

A simple solution is to insert appropriate empty xml elements (e.g. xhtml using CSS 2 or proprietary methods) into the text at the position of a page break.
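The simple solution can be illustrated with a minimal Python sketch. It assumes the digitized text is available as one string per page; the names =join_pages= and =PAGE_BREAK= are hypothetical and not part of any proposal on this page.

```python
# Hypothetical marker: the empty xhtml element given as the page-break example above.
PAGE_BREAK = '<br style="page-break-before:always" />'


def join_pages(pages):
    """Concatenate per-page text, inserting one empty element at each page break."""
    return PAGE_BREAK.join(pages)


# Two pages of a digitized book; the first page ends in a hyphenated word.
pages = ["The specimens were col-", "lected in spring."]
marked_up = join_pages(pages)
print(marked_up)
```

The marker lands wherever the printed page happened to end, here inside the word "collected".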
This solution has several disadvantages:
   * It is likely to create mixed content xml (a mixture of text and markup, as in xhtml), which is often difficult to process and creates major problems when interacting with most databases.
   * The inserted element may fall in the middle of a hyphenated word, removing the word from indexes that cannot recognize the situation. Unfortunately, in Unicode the same character is used for a hyphen between parts of a word and for other uses, such as abbreviating prefixes ("poly-" indicating multiplicity).

The first problem may be addressed by a solution similar to that in FormattedText, i.e. using xml-like markup but treating it as text by escaping (or encoding) < and > as the entities &lt; and &gt;. The text formatting proposal (FormattedText) already contains a method to support line breaks through escaped xhtml <br/> tags. The intended use case in that proposal was not to preserve publishing artifacts, but to increase the "fuzzy semantic expressiveness" where authors believe that a new line is necessary for appropriate separation of statements or arguments.

The second problem may be addressed by always placing the publishing artifact information behind the word (in front of the next whitespace character) and indicating its position relative to the original break (see below).

-- Main.GregorHagedorn - 12 Mar 2005

---