wiki-archive/twiki/data/UBIF/FormattedText.txt

%META:TOPICINFO{author="GarryJolleyRogers" date="1259118882" format="1.1" version="1.13"}%
%META:TOPICPARENT{name="BDI.SDD_.ClosedTopicSchemaDiscussionSDD09"}%
---+!! %TOPIC%

<h2>Text formatting (italics, bold, super/subscript) and other publishing artifacts</h2>

<h3>Problem</h3>

First an apology: I (Gregor Hagedorn) was the main proponent of insisting on some minimal text formatting facility in BDI.SDD_. Bob Morris was always very critical of our current solution since basing it on a mimized xhtml inline-formatting structure meant we had mixed content in BDI.SDD_. (A related case of possible mixed content in the natural language markup was carefully avoided, requiring complete markup with Text elements!) I do agree now that "xml mixed content" poses a significant burden. For example, a product like Altova Mapforce that supports graphical xslt creation to map to and from databases and xml instance documents can not handle elements with mixed content. 

Furthermore, using xhtml-style mixed content markup to format label and other text elements, creates an impedance problem between database and xml: The database most likely uses ANSI or Unicode to encode &amp;, &lt;, &gt;, rather than natively storing character entities (&amp;amp;, &amp;lt;, &amp;gt;). It would thus have to distinguish between passing these through unencoded if used in combination of the few recognized markup tags, and encoding them where used to express greater-than, or even xml-tag examples that may be present in the text.

Example: the Database content "A&lt;sub&gt;1&lt;/sub&gt; &gt; A&lt;sub&gt;2&lt;/sub&gt;" would have to be recoded into: "A&lt;sub&gt;1&lt;/sub&gt; &amp;gt; A&lt;sub&gt;2&lt;/sub&gt;" when creating BDI.SDD_ xml content. This is especially problematic, since some validation may have to occur in the conversion process to avoid ill-formed xml or non-valid BDI.SDD_. For example, unbalanced markup like: "A&lt;sub&gt;1&lt;/sup&gt;" should be converted to "A&amp;lt;sub&amp;gt;1&amp;lt;/sup&amp;gt;"!

<h3>New Proposal</h3>

Trevor Patterson proposed to also encode the markup parts. I think this is a workable solution, provided some schema-external conventions are agreed upon which elements should be allowed to have some markup. Basically it means that the task of the database-to-BDI.SDD_ interface is minimized (usually the automatic xml recoding will work) and that the task of recognizing certain text as non-verbatim and having formatting semantics is shifted to those processes that process the data.

In principle the tags should be based on xhtml, and implicit in the proposal is that any xhtml encountered in a database is to be escaped (by encoding &lt;/&gt; to entities &amp;lt;/&amp;gt;). However, for the purpose of interoperability I believe it is necessary to have a standard or recommendation which tags should be recognized and reverted into formatting. The current list is limited to: "em", "strong", "sub", "sup", "i", and "br". All these tags are inline formatting and do not affect any block-level (e.g., paragraph, list, table) layout that may be used in reports. (See also TextPublishingArtifacts for a later addition of recommended tags for page- and other publishing artifacts.)

I have written some xslt to recover a defined set of encoded formatting markup (similar to the previously proposed <nop>StringWithFormatting type, see Appendix below). Please see the following attachment, containing xslt code, example files, expected output and a html document containing the proposal:

   * [[%ATTACHURL%/EncodedFormattingProposal.zip][EncodedFormattingProposal.zip]]

-- [[Main.GregorHagedorn][Gregor Hagedorn]] - 18 August 2004, revised 12. March 2005 
---

I think this makes good sense, my only comment would be that I don't know if it is necessary to define the list of supported tags - this can be up to individual users/applications. Obviously it would be better if everyone at least stuck to xhtml - but is a subset of tags necessary?  I'm not sure any rules are necessary beyond explicitly requiring all invalid xml characters to be converted to entities when creating XML, and converting all entities back when converting from xml to whatever.

BTW - I do think that adding formatting will greatly impair any ability to compare and index strings etc, i.e. make it very difficult for any applications wanting to do this. This might be really difficult when multiple nested tags are used. I guess any application doing this will just have to strip away all the tags - which does suggest you might want to have an unformatted version of any formatted string also stored in the schema.

-- Main.TrevorPaterson - 18 Aug 2004
---

I believe the replace-any-entity-routine will not work, since a) this may create invalid and even ill-formed xml - see the example above which also contains an entity truly meaning greater, and not markup; b) I believe some agreement between content authors and processing applications is necessary as to where (in which elements) and which formatting is supported; this is addressed in the documentation in the zip file); c) allowing complete xhtml would most likely potentially break any report or form generator. It would become extremely difficult to write a report that should be presented as a valid xhtml document, but is able to incorporate text blocks from a datastructure that may include arbitrary xhtml tags. This is the reason behind recommending only a few inline (= character formatting) tags plus the break. It avoids any block-level xhtml. Note that the xslt example code (although not correctly dealing with same-tag nesting), always creates well-formed xml and a valid xhtml fragment. - I encourage reviewers to find an example text that does not perform to this criterion!

However, the UBIF text formatting rules are certainly only recommendations. Any collaboration may wish to extend this, knowing that the additional formatting will only be retrieved among themselves. Other consuming applications would then present the additional tags simply as plain text (like in xhtml source code view), visible to the human consumer and not generating the added functionality. This is still as desirable solution, because it does not break the validity of the 

The comment about problems with indexing is very valid. The example file already contains a case where the feature is "abused" to format an entire field and comments on the difficulty of preventing this. I think this is catch-22: either we want to validate where and how much formatting is possible - this is what the previous BDI.SDD_ versions did and which resulted in mixed content, or we do not validate, you can be agnostic about formatting, but you also have no longer validating control over what happens.

My hope is, given the relative simplicity of the example xslt code, it will be common knowledge to strip exactly the same encoded tags before comparing or indexing. To make this clearer, I have updated the attached zip file: It now includes a second example xslt for such a "StripEncodedFormatting" template.

I believe the comparison/indexing problem cannot be avoided, even if there is no convention on which formatting shall be accepted. I believe having a "public" agreement reduces the urge to create your own private convention (i.e. between your own database and your own processing tool). The public agreement can be dealt with by all consumers, the multiple private subcoding schemes would be very difficult to deal with.

-- [[Main.GregorHagedorn][Gregor Hagedorn]] - 18 August 2004, revised 12. March 2005 
---

<h2>Appendix: Original xml Schema definition of the <nop>StringWithFormatting mixed content type</h2>

This type is now *obsolete* and placed here for reference and comparison. It may be used if post-<nop>RecoverEncodedFormatting shall be validated. 

Note: On August 18, I removed most of the previous discussion that was addressing issues now considered solved with the proposal above (since BDI.SDD_ 1.0 beta 2). Please look at Revision 1.4 of this topic to see the earlier discussion.

<verbatim>
<xs:complexType name="StringWithFormatting" mixed="true">
  <xs:annotation><xs:documentation xml:lang="en-us">Allows basic limited text formatting using a few xhtml elements. The list of formatting options is purposely limited as strongly as possible. No block level formatting (p, ul, hr, table) is supported here). Furthermore, deprecated text formatting (u, b, etc.) is not supported. The list can be extended if arguments are made. Types derived from this could support img and a, or block level formats.</xs:documentation></xs:annotation>
  <xs:choice minOccurs="0" maxOccurs="unbounded">
    <xs:annotation><xs:documentation xml:lang="en-us">(Note that this is a mixed content model, allowing text between elements!)</xs:documentation></xs:annotation>
    <xs:element name="em" type="StringWithFormatting">
       <xs:annotation><xs:documentation xml:lang="en-us">'Emphasis' logical markup (phrase): usually rendered italic.</xs:documentation></xs:annotation>
    </xs:element>
    <xs:element name="strong" type="StringWithFormatting">
      <xs:annotation><xs:documentation xml:lang="en-us">'Strong' logical markup (phrase): usually rendered bold.</xs:documentation></xs:annotation>
    </xs:element>
    <xs:element name="sub" type="StringWithFormatting">
      <xs:annotation><xs:documentation xml:lang="en-us">Logical markup: subscript</xs:documentation></xs:annotation>
    </xs:element>
    <xs:element name="sup" type="StringWithFormatting">
      <xs:annotation><xs:documentation xml:lang="en-us">Logical markup: superscript</xs:documentation></xs:annotation>
    </xs:element>
    <xs:element name="i" type="StringWithFormatting">
      <xs:annotation><xs:documentation xml:lang="en-us">Font style markup: italic markup that could not be interpreted as (preferred) emphasis, esp. for taxon markup.</xs:documentation>
    </xs:annotation>
    </xs:element>
    <xs:element name="br">
      <xs:annotation><xs:documentation xml:lang="en-us">line break (empty element)</xs:documentation></xs:annotation>
    </xs:element>
  </xs:choice>
</xs:complexType>
</verbatim>

And an xml-spy schema diagram for the above:<br />
   <img src="%ATTACHURLPATH%/OldStringWithFormatting.png" alt="OldStringWithFormatting.png"  />

Again, the above illustrated the old BDI.SDD_ solution, which by this discussion has become *obsolete*. It may still be useful to illustrate which xhtml tags are recommended.

-- [[Main.GregorHagedorn][Gregor Hagedorn]] - 18 August 2004, revised 12. March 2005 
---

%META:FILEATTACHMENT{name="OldStringWithFormatting.png" attr="h" comment="" date="1092827030" path="C:\Data\Desktop\DESCR\TDWG-SDD\Schema\1.0\OldStringWithFormatting.png" size="5650" user="GregorHagedorn" version="1.1"}%
%META:FILEATTACHMENT{name="EncodedFormattingProposal.zip" attr="h" comment="" date="1092842270" path="C:\Data\Desktop\DESCR\TDWG-SDD\Schema\1.0\EncodedFormattingProposal.zip" size="16885" user="GregorHagedorn" version="1.2"}%
%META:TOPICMOVED{by="GregorHagedorn" date="1097054110" from="SDD.FormattedText" to="UBIF.FormattedText"}%
Archive entire twiki setup from owl as of today 2016-02-24 10:46:01 +00:00			`%META:TOPICINFO{author="GarryJolleyRogers" date="1259118882" format="1.1" version="1.13"}%`
			`%META:TOPICPARENT{name="BDI.SDD_.ClosedTopicSchemaDiscussionSDD09"}%`
			`---+!! %TOPIC%`

			`<h2>Text formatting (italics, bold, super/subscript) and other publishing artifacts</h2>`

			`<h3>Problem</h3>`

			First an apology: I (Gregor Hagedorn) was the main proponent of insisting on some minimal text formatting facility in BDI.SDD_. Bob Morris was always very critical of our current solution since basing it on a mimized xhtml inline-formatting structure meant we had mixed content in BDI.SDD_. (A related case of possible mixed content in the natural language markup was carefully avoided, requiring complete markup with Text elements!) I do agree now that "xml mixed content" poses a significant burden. For example, a product like Altova Mapforce that supports graphical xslt creation to map to and from databases and xml instance documents can not handle elements with mixed content.

			Furthermore, using xhtml-style mixed content markup to format label and other text elements, creates an impedance problem between database and xml: The database most likely uses ANSI or Unicode to encode &, <, >, rather than natively storing character entities (&amp;, &lt;, &gt;). It would thus have to distinguish between passing these through unencoded if used in combination of the few recognized markup tags, and encoding them where used to express greater-than, or even xml-tag examples that may be present in the text.

			`Example: the Database content "A<sub>1</sub> > A<sub>2</sub>" would have to be recoded into: "A<sub>1</sub> &gt; A<sub>2</sub>" when creating BDI.SDD_ xml content. This is especially problematic, since some validation may have to occur in the conversion process to avoid ill-formed xml or non-valid BDI.SDD_. For example, unbalanced markup like: "A<sub>1</sup>" should be converted to "A&lt;sub&gt;1&lt;/sup&gt;"!`

			`<h3>New Proposal</h3>`

			`Trevor Patterson proposed to also encode the markup parts. I think this is a workable solution, provided some schema-external conventions are agreed upon which elements should be allowed to have some markup. Basically it means that the task of the database-to-BDI.SDD_ interface is minimized (usually the automatic xml recoding will work) and that the task of recognizing certain text as non-verbatim and having formatting semantics is shifted to those processes that process the data.`

			In principle the tags should be based on xhtml, and implicit in the proposal is that any xhtml encountered in a database is to be escaped (by encoding </> to entities &lt;/&gt;). However, for the purpose of interoperability I believe it is necessary to have a standard or recommendation which tags should be recognized and reverted into formatting. The current list is limited to: "em", "strong", "sub", "sup", "i", and "br". All these tags are inline formatting and do not affect any block-level (e.g., paragraph, list, table) layout that may be used in reports. (See also TextPublishingArtifacts for a later addition of recommended tags for page- and other publishing artifacts.)

			`I have written some xslt to recover a defined set of encoded formatting markup (similar to the previously proposed <nop>StringWithFormatting type, see Appendix below). Please see the following attachment, containing xslt code, example files, expected output and a html document containing the proposal:`

			`* [[%ATTACHURL%/EncodedFormattingProposal.zip][EncodedFormattingProposal.zip]]`

			`-- [[Main.GregorHagedorn][Gregor Hagedorn]] - 18 August 2004, revised 12. March 2005`
			`---`

			`I think this makes good sense, my only comment would be that I don't know if it is necessary to define the list of supported tags - this can be up to individual users/applications. Obviously it would be better if everyone at least stuck to xhtml - but is a subset of tags necessary? I'm not sure any rules are necessary beyond explicitly requiring all invalid xml characters to be converted to entities when creating XML, and converting all entities back when converting from xml to whatever.`

			`BTW - I do think that adding formatting will greatly impair any ability to compare and index strings etc, i.e. make it very difficult for any applications wanting to do this. This might be really difficult when multiple nested tags are used. I guess any application doing this will just have to strip away all the tags - which does suggest you might want to have an unformatted version of any formatted string also stored in the schema.`

			`-- Main.TrevorPaterson - 18 Aug 2004`
			`---`

			I believe the replace-any-entity-routine will not work, since a) this may create invalid and even ill-formed xml - see the example above which also contains an entity truly meaning greater, and not markup; b) I believe some agreement between content authors and processing applications is necessary as to where (in which elements) and which formatting is supported; this is addressed in the documentation in the zip file); c) allowing complete xhtml would most likely potentially break any report or form generator. It would become extremely difficult to write a report that should be presented as a valid xhtml document, but is able to incorporate text blocks from a datastructure that may include arbitrary xhtml tags. This is the reason behind recommending only a few inline (= character formatting) tags plus the break. It avoids any block-level xhtml. Note that the xslt example code (although not correctly dealing with same-tag nesting), always creates well-formed xml and a valid xhtml fragment. - I encourage reviewers to find an example text that does not perform to this criterion!

			`However, the UBIF text formatting rules are certainly only recommendations. Any collaboration may wish to extend this, knowing that the additional formatting will only be retrieved among themselves. Other consuming applications would then present the additional tags simply as plain text (like in xhtml source code view), visible to the human consumer and not generating the added functionality. This is still as desirable solution, because it does not break the validity of the`

			The comment about problems with indexing is very valid. The example file already contains a case where the feature is "abused" to format an entire field and comments on the difficulty of preventing this. I think this is catch-22: either we want to validate where and how much formatting is possible - this is what the previous BDI.SDD_ versions did and which resulted in mixed content, or we do not validate, you can be agnostic about formatting, but you also have no longer validating control over what happens.

			`My hope is, given the relative simplicity of the example xslt code, it will be common knowledge to strip exactly the same encoded tags before comparing or indexing. To make this clearer, I have updated the attached zip file: It now includes a second example xslt for such a "StripEncodedFormatting" template.`

			`I believe the comparison/indexing problem cannot be avoided, even if there is no convention on which formatting shall be accepted. I believe having a "public" agreement reduces the urge to create your own private convention (i.e. between your own database and your own processing tool). The public agreement can be dealt with by all consumers, the multiple private subcoding schemes would be very difficult to deal with.`

			`-- [[Main.GregorHagedorn][Gregor Hagedorn]] - 18 August 2004, revised 12. March 2005`
			`---`

			`<h2>Appendix: Original xml Schema definition of the <nop>StringWithFormatting mixed content type</h2>`

			`This type is now obsolete and placed here for reference and comparison. It may be used if post-<nop>RecoverEncodedFormatting shall be validated.`

			`Note: On August 18, I removed most of the previous discussion that was addressing issues now considered solved with the proposal above (since BDI.SDD_ 1.0 beta 2). Please look at Revision 1.4 of this topic to see the earlier discussion.`

			`<verbatim>`
			`<xs:complexType name="StringWithFormatting" mixed="true">`
			`<xs:annotation><xs:documentation xml:lang="en-us">Allows basic limited text formatting using a few xhtml elements. The list of formatting options is purposely limited as strongly as possible. No block level formatting (p, ul, hr, table) is supported here). Furthermore, deprecated text formatting (u, b, etc.) is not supported. The list can be extended if arguments are made. Types derived from this could support img and a, or block level formats.</xs:documentation></xs:annotation>`
			`<xs:choice minOccurs="0" maxOccurs="unbounded">`
			`<xs:annotation><xs:documentation xml:lang="en-us">(Note that this is a mixed content model, allowing text between elements!)</xs:documentation></xs:annotation>`
			`<xs:element name="em" type="StringWithFormatting">`
			`<xs:annotation><xs:documentation xml:lang="en-us">'Emphasis' logical markup (phrase): usually rendered italic.</xs:documentation></xs:annotation>`
			`</xs:element>`
			`<xs:element name="strong" type="StringWithFormatting">`
			`<xs:annotation><xs:documentation xml:lang="en-us">'Strong' logical markup (phrase): usually rendered bold.</xs:documentation></xs:annotation>`
			`</xs:element>`
			`<xs:element name="sub" type="StringWithFormatting">`
			`<xs:annotation><xs:documentation xml:lang="en-us">Logical markup: subscript</xs:documentation></xs:annotation>`
			`</xs:element>`
			`<xs:element name="sup" type="StringWithFormatting">`
			`<xs:annotation><xs:documentation xml:lang="en-us">Logical markup: superscript</xs:documentation></xs:annotation>`
			`</xs:element>`
			`<xs:element name="i" type="StringWithFormatting">`
			`<xs:annotation><xs:documentation xml:lang="en-us">Font style markup: italic markup that could not be interpreted as (preferred) emphasis, esp. for taxon markup.</xs:documentation>`
			`</xs:annotation>`
			`</xs:element>`
			`<xs:element name="br">`
			`<xs:annotation><xs:documentation xml:lang="en-us">line break (empty element)</xs:documentation></xs:annotation>`
			`</xs:element>`
			`</xs:choice>`
			`</xs:complexType>`
			`</verbatim>`

			`And an xml-spy schema diagram for the above:<br />`
			`<img src="%ATTACHURLPATH%/OldStringWithFormatting.png" alt="OldStringWithFormatting.png" />`

			`Again, the above illustrated the old BDI.SDD_ solution, which by this discussion has become obsolete. It may still be useful to illustrate which xhtml tags are recommended.`

			`-- [[Main.GregorHagedorn][Gregor Hagedorn]] - 18 August 2004, revised 12. March 2005`
			`---`

			`%META:FILEATTACHMENT{name="OldStringWithFormatting.png" attr="h" comment="" date="1092827030" path="C:\Data\Desktop\DESCR\TDWG-SDD\Schema\1.0\OldStringWithFormatting.png" size="5650" user="GregorHagedorn" version="1.1"}%`
			`%META:FILEATTACHMENT{name="EncodedFormattingProposal.zip" attr="h" comment="" date="1092842270" path="C:\Data\Desktop\DESCR\TDWG-SDD\Schema\1.0\EncodedFormattingProposal.zip" size="16885" user="GregorHagedorn" version="1.2"}%`
			`%META:TOPICMOVED{by="GregorHagedorn" date="1097054110" from="SDD.FormattedText" to="UBIF.FormattedText"}%`