Abstraction levels in Web document formats

Håkon Wium Lie
Opera Software, Oslo, Norway
<howcome@opera.com>
http://people.opera.com/people/howcome

Abstract

The paper gives an overview of current and emerging document formats on the Web, and discusses the abstraction level of the different formats. The "ladder of abstraction" is introduced as a measuring stick for Web formats, and various levels from presentation (at the bottom of the ladder) to semantics (at the top) are described. The importance of reaching certain abstraction levels in order to support device-independent formats, universal accessibility, and scalable presentations is stressed. The document formats discussed in the paper are HTML, PDF, GIF, PNG, MathML, and XSL-FO. The effect of style and transformation languages (namely CSS and XSLT) is also described.

Introduction

Over the last decade, the Web has established itself as an important medium for publishing documents. The simplicity of HTML [1], the major document format on the Web, is an important reason why the Web has achieved this position in such a short time. Authors without much experience in electronic publishing have quickly learned HTML by looking at the source code of other documents and using simple text editors to author documents [2]. HTML, in its simplest form, is also easy to implement, and a number of HTML browsers appeared in the early days of the Web. These browsers supported a wide range of devices, from text terminals to high-resolution graphics screens as well as aural renderings. Today, millions of browsers around the world understand HTML and the number is quickly increasing.

Document authoring and formatting systems have a longer history than the Web. A survey paper on document formatting systems written in 1982 [3] characterizes formatting as "mapping from abstract objects to concrete objects". The Web has not changed this definition, but forces one to rethink what an abstract object is, what a concrete object is, and where the formatting process should take place -- on the client or server side. Also, this article claims that there exist more than those two types of objects and that there is a continuous "ladder of abstraction" between them.

A more recent article on "document species" [4] discusses document formats in the context of the Web. The article observes a preferential adoption of declarative markup over presentational markup and style sheets over inline formatting. This development would, if true, lead to document formats at higher levels of abstraction. However, recent working drafts issued by W3C show that the movement is not consistently going upwards on the ladder of abstraction. The section on "Style vs. transformation" discusses this further.

Current Web document formats

HTML lets authors mark the structural role of textual content. For example, headlines, paragraphs and lists can be identified as such, and HTML in its original form does not describe how the various elements are to be presented [5]. The separation of structure and presentation stems from HTML's ancestor SGML [6] and puts HTML on a higher level of abstraction than presentation-oriented formats, e.g., PDF [7]. PDF has no concept of paragraphs, and many users have discovered this when trying to copy content from PDF documents laid out in several columns. When selecting text, the selection will span across both columns and thereby mix text from several parts of the document into the same selection.

Bitmap image formats, e.g. GIF [8] and PNG [9], are at an even lower level of abstraction. Normally, these are not considered document formats since they have no notion of text, but on the Web images are often used to convey textual content [10]. Treating text as images gives the author full control of fonts, layout and colors in the presentation. For users, however, images have several disadvantages: they take longer to download and only allow visual access to the text. Other uses of the text -- e.g., indexing of pages, cut/paste operations, and rendering through speech synthesizers -- become impossible.

The ladder of abstraction

It is the view of the author that a document format's abstraction level is important when assessing its usefulness on the Web. This section introduces the "ladder of abstraction" as a tool for evaluating document formats. The vertical nature of a ladder corresponds to how one describes abstraction levels as "high" or "low".

Typical characteristics of document formats that are high on the ladder of abstraction are:

the information needs processing in order to be presented. For example, in order to render an HTML document visually, the words must be broken into lines, fonts must be selected, and the characters must be rasterized.
the information can be processed and presented in many different ways. Presenting a document visually is only one of several possibilities; others include aural renderings and braille embossing.
the information is represented in a compact manner. Representing a letter with an 8-bit code is more compact than representing an image of the same character.

Conversely, documents written in formats that are low on the ladder of abstraction need less processing in order to be presented, have less flexibility of presentation, and are less compact.
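The compactness claim can be made concrete with a small calculation. The sketch below compares one 8-bit code per character with an uncompressed monochrome bitmap of each glyph. The 16x16 pixel glyph cell and the 1 bit per pixel are assumptions chosen for illustration only; they are not taken from any of the formats discussed, and real image formats apply compression.

```python
# Hypothetical figures for illustration only: one 8-bit character code
# versus an uncompressed 1-bit-per-pixel bitmap of a single glyph.
GLYPH_WIDTH_PX = 16   # assumed glyph cell width
GLYPH_HEIGHT_PX = 16  # assumed glyph cell height
BITS_PER_PIXEL = 1    # assumed monochrome bitmap


def char_code_bytes(text):
    """Size of the text when each character is stored as one 8-bit code."""
    return len(text.encode("latin-1"))


def bitmap_bytes(text):
    """Size of the same text stored as uncompressed glyph bitmaps."""
    bits = len(text) * GLYPH_WIDTH_PX * GLYPH_HEIGHT_PX * BITS_PER_PIXEL
    return bits // 8


headline = "The headline"
print(char_code_bytes(headline))  # 12
print(bitmap_bytes(headline))     # 384
```

Under these assumptions the image representation is 32 times larger, and it no longer contains the characters themselves.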

Another important observation is that it is generally possible to transform documents downwards on the ladder, but much harder to move the other way [11]. For example, graphical Web browsers - in collaboration with the windowing system - rasterize HTML documents into pixels and thereby move information downwards on the ladder of abstraction. Optical Character Recognition (OCR) software attempts to climb the ladder by turning images into text, but OCR systems only work under optimal conditions, e.g. with certain font families and text sizes.

The ladder of abstraction is a simplified, one-dimensional scale. People with a background in mathematics, programming languages or structural linguistics will be familiar with the concept of abstraction. In the context of Web document formats, the author of this paper believes that the following criteria are good measuring sticks for assessing the level of abstraction:

Is the text available? That is, does the format have a notion of characters that can later be mapped into glyphs, or does it represent text as images, in which case the text is not available?
Is the logical order of text preserved? That is, do documents written in the format have a notion of the logical reading order of the content?
Is the document scalable? For example, can text be laid out with different line lengths? Can the aspect ratio of the pages be changed?
Can the roles of the various text elements be represented? For example, can the author mark part of the text as a headline? As a paragraph? As the name of a variable in a computer program? Being able to distinguish between these roles is important e.g. when making documents available in braille since some text should be contracted (e.g. headlines), while other text should not (e.g. variable names) [12].
Is the format device-independent? That is, can documents written in the format be rendered on many different devices (e.g. printers, screens, braille printers, and speech synthesizers) or are documents intended for a single type of device? [13]
Does the format contain application-specific semantics? HTML is a general document format that does not attempt to describe semantics from more specialized fields, e.g. mathematics, and therefore does not contain application-specific semantics.

Table 1 rates several existing and emerging Web document formats with regard to these criteria. HTML, PDF, GIF and PNG have been discussed above while the emerging XSL-FO and MathML are discussed below. XML, in which several of the emerging formats are written, is also included in the table and refers to documents published using private tag sets.

Table 1. Existing and emerging Web document formats placed on the ladder of abstraction
	GIF, PNG	PDF	XML	XSL-FO	HTML	MathML presentation elements	MathML content elements
application-specific semantics?	no	no	no	no	no	yes	yes
device-independent?	no	no	no	no	yes	yes	yes
roles known?	no	no	no	no	yes	yes	yes
scalable?	no	no	unknown	yes	yes	yes	yes
text in logical order?	-	no	unknown	yes	yes	yes	yes
text available?	no	yes	yes	yes	yes	yes	yes

The rating in Table 1 assumes that the document format is used in the best possible way. On the Web today, this is often not the case. For example, many HTML documents use tables to enforce a certain layout. Often, due to the nature of table markup, these documents do not have text in logical order. Also, since the tables have been set to have certain widths (e.g. 600 pixels), the document is no longer device-independent.
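The ratings in Table 1 can be transcribed as data and turned into a position on the ladder. The sketch below is one possible reading, not a method proposed in this paper: the names CRITERIA, RATINGS and rungs_climbed are hypothetical, and the scoring rule (count consecutive "yes" answers from the bottom criterion upwards, stopping at the first "no", "unknown" or "-") is an assumption made for illustration.

```python
# Table 1 transcribed as data. Criteria are listed from the bottom of the
# ladder to the top; "unknown" and "-" are kept as they appear in the table.
CRITERIA = [
    "text available",
    "text in logical order",
    "scalable",
    "roles known",
    "device-independent",
    "application-specific semantics",
]

RATINGS = {
    "GIF, PNG": ["no", "-", "no", "no", "no", "no"],
    "PDF": ["yes", "no", "no", "no", "no", "no"],
    "XML": ["yes", "unknown", "unknown", "no", "no", "no"],
    "XSL-FO": ["yes", "yes", "yes", "no", "no", "no"],
    "HTML": ["yes", "yes", "yes", "yes", "yes", "no"],
    "MathML presentation elements": ["yes"] * 6,
    "MathML content elements": ["yes"] * 6,
}


def rungs_climbed(ratings):
    """Count consecutive 'yes' answers starting from the bottom criterion.

    This scoring rule is an assumption for illustration, not part of Table 1.
    """
    count = 0
    for answer in ratings:
        if answer != "yes":
            break
        count += 1
    return count


ladder = {fmt: rungs_climbed(r) for fmt, r in RATINGS.items()}
for fmt, rungs in sorted(ladder.items(), key=lambda item: item[1]):
    print(f"{rungs} {fmt}")
```

With this rule, XSL-FO reaches three rungs and HTML five, while both kinds of MathML elements reach all six and cannot be told apart by these criteria.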

Style vs. transformation

Style sheets describe how documents are to be presented to users by attaching style (e.g., fonts, spacing, and aural cues) to structured documents. Style sheets are not documents themselves, but serve a supporting role by allowing the presentation of a document to be separated from the content of the document.

Cascading Style Sheets (CSS) [14][15] allow authors to attach presentational properties to semantic elements. Before CSS, the Web experienced a strong push from authors to extend HTML into a presentational language rather than a semantically oriented language. Instead of adding new elements in order to achieve certain presentations, W3C has added new CSS properties to address requests from authors.

XSL (Extensible Stylesheet Language) [16] takes a different approach. Instead of attaching formatting properties to semantic elements, XSL suggests that authors transform the document into a set of formatting objects which can be expressed in XML. The difference between these two approaches is the topic of this section.

The XSL effort within W3C has produced two specifications. The first is a transformation language called XSLT [17], and the second is an XML vocabulary for formatting objects (called "XSL-FO" in this document) [16].

The most common use of XSLT is to transform XML data and documents into HTML on the Web server. Several implementations support XSLT and they allow content providers to use their favorite DTDs internally while serving HTML to the huge installed base of Web browsers. XSLT provides a declarative way of specifying simple transformations.

XSLT can also be used to generate XSL-FO. Formatting objects describe how chunks of information are formatted before being presented to a human user. The push for XSL-FO comes from vendors with the goal of improving the quality of printed Web content. Unfortunately, when transforming documents into XSL-FO, the documents move downwards on the ladder of abstraction and only the human presentation is left. Moreover, the resulting documents are tied to a certain output medium. Thus, accessibility and device-independence are threatened by the use of XSL-FO.

The transformation from semantic markup to formatting objects can also take place on the client side. Given the XML source and the XSLT transformation sheet, the client can convert the semantic markup into formatting objects. This preserves semantics, and the number of bytes sent over the Web will generally be smaller. In this scenario, however, there is no need for an XML vocabulary to express formatting objects since the formatting objects only exist within the client application. This highlights an important point: it is not formatting objects per se that are harmful (any system that does formatting uses some kind of formatting objects). The harm is done when formatting objects are stored and shipped over the Web.

Code examples

This section will give three examples of how XSLT can be used. The first example transforms from XML to HTML, the second transforms from XML to XSL-FO and the third transforms from XML to HTML/CSS. All examples use this simple XML element as input:

<Heading1>The headline</Heading1>

Example 1: XML to HTML

The first XSLT sheet transforms the XML element into HTML:

<xsl:template match="Heading1">
  <H1>
    <xsl:apply-templates/>
  </H1>
</xsl:template>

The result is:

<H1>The headline</H1>

The resulting HTML is at a high enough level of abstraction that device-independence and accessibility are preserved. What is lacking is information about how to present it.

Example 2: XML to XSL-FO

In this example, the XSLT sheet transforms the XML element into a formatting object:

<xsl:template match="Heading1">
  <fo:block font-size="1.3em" margin-top="1.5em" margin-bottom="0.4em">
    <xsl:apply-templates/>
  </fo:block>
</xsl:template>

The result is:

<fo:block font-size="1.3em" margin-top="1.5em" margin-bottom="0.4em">
  The headline
</fo:block>

The difference between example 1 and example 2 is one of semantics vs. presentation. When transformed into HTML, the semantics of the XML is preserved since the H1 element is globally recognized as being a headline of level 1. When transformed into XSL-FO, semantics is removed and replaced by presentational properties.

Example 3: XML to HTML/CSS

The last example transforms XML into an HTML element with associated CSS stylistic properties:

<xsl:template match="Heading1">
  <H1 STYLE="font-size:1.3em; margin-top:1.5em; margin-bottom:0.4em">
    <xsl:apply-templates/>
  </H1>
</xsl:template>

The result is:

<H1 STYLE="font-size:1.3em; margin-top:1.5em; margin-bottom:0.4em">
   The headline
</H1>

The result preserves the semantics in the form of HTML elements, while presentational information is encoded in the CSS notation.

(When authoring with CSS, one would normally move the stylistic properties into a separate style sheet and not into an attribute as in the above example. Having separate style sheets eases maintenance and makes documents smaller. However, both forms are valid and one can programmatically convert between the two.)
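The conversion from the attribute form to a separate style sheet can be sketched in a few lines. The function below is a minimal illustration, not a general-purpose tool: the name extract_inline_style is hypothetical, the input is assumed to be well-formed enough to parse as XML (which holds for the result of Example 3, written here on one line), and the generated rule simply uses the element name as its selector.

```python
import xml.etree.ElementTree as ET


def extract_inline_style(markup):
    """Move a STYLE attribute into a CSS rule keyed on the element name.

    Returns (css_rule, markup_without_style). Assumes the markup can be
    parsed as XML, which holds for the example below.
    """
    element = ET.fromstring(markup)
    declarations = element.attrib.pop("STYLE", "")
    # Split "a:b; c:d" into individual declarations and normalize spacing.
    parts = []
    for declaration in declarations.split(";"):
        if ":" not in declaration:
            continue
        name, value = declaration.split(":", 1)
        parts.append(f"{name.strip()}: {value.strip()}")
    body = "; ".join(parts)
    css_rule = f"{element.tag} {{ {body} }}"
    return css_rule, ET.tostring(element, encoding="unicode")


html = ('<H1 STYLE="font-size:1.3em; margin-top:1.5em; '
        'margin-bottom:0.4em">The headline</H1>')
rule, stripped = extract_inline_style(html)
print(rule)      # H1 { font-size: 1.3em; margin-top: 1.5em; margin-bottom: 0.4em }
print(stripped)  # <H1>The headline</H1>
```

The rule applies to every H1 element in a document that links to the style sheet, which is why the separate form is easier to maintain.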

MathML

Mathematical Markup Language (MathML) [18] is an emerging format for describing mathematical notation. As expected from an application-specific language, it contains more semantics than general-purpose document formats like HTML.

The developers of MathML recognized that different abstraction levels are needed to encode mathematics for the Web. From [18]:

A fundamental challenge in defining a mathematics markup language for the Web is reconciling the need to encode both the presentation of a mathematical notation and the content of the mathematical idea or object which it represents.

MathML therefore contains both "presentation elements" and "content elements". This allows authors to encode semantics when available and presentation when semantics is either unavailable or not covered by the content elements.

As an example, consider this mathematical expression:

a-b

Using MathML's presentation elements, the expression can be written:

<mrow>
  <mi>a</mi>
  <mo>-</mo>
  <mi>b</mi>
</mrow>

Using MathML's content elements, the expression can be written:

<apply>
  <minus/>
  <ci>a</ci>
  <ci>b</ci>
</apply>

Although referred to as "presentation markup", it should be noted that the presentation elements are at a higher level of abstraction than e.g. HTML. In Table 1, both MathML presentation elements and MathML content elements are given the same rating. This shows that the criteria used are not well suited for assessing the level of abstraction above a certain level. Further criteria should be developed as new formats better describe application-specific semantics.
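One difference that the criteria do not capture is that content elements name the operation, so a program can compute with them. The sketch below illustrates this for the expression a-b encoded both ways. It handles only the elements used in that example; the function names evaluate and linearize and the variable bindings are hypothetical, and this is not a MathML implementation.

```python
import xml.etree.ElementTree as ET

CONTENT = "<apply><minus/><ci>a</ci><ci>b</ci></apply>"
PRESENTATION = "<mrow><mi>a</mi><mo>-</mo><mi>b</mi></mrow>"


def evaluate(node, bindings):
    """Evaluate a tiny subset of content markup: <ci> and binary <minus/>."""
    if node.tag == "ci":
        return bindings[node.text.strip()]
    if node.tag == "apply":
        operator, *operands = list(node)
        if operator.tag == "minus" and len(operands) == 2:
            left, right = (evaluate(op, bindings) for op in operands)
            return left - right
    raise ValueError(f"unsupported element: {node.tag}")


def linearize(node):
    """Read presentation markup as the sequence of tokens a renderer shows."""
    return "".join(child.text.strip() for child in node)


print(evaluate(ET.fromstring(CONTENT), {"a": 7, "b": 2}))  # 5
print(linearize(ET.fromstring(PRESENTATION)))               # a-b
```

The presentation form records which symbols appear and in what order; deciding that the "-" token denotes subtraction is left to the reader of the markup.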

Conclusions

Documents on the Web should strive to retain information at a high enough level of abstraction to preserve device-independence and accessibility. By performing the down-translation of text in the browser rather than on the server, the user's preferences and needs can be taken into account.

Style sheet languages like CSS can augment structured document formats by attaching presentational information to semantic elements.

Transformation languages like XSLT can move information downwards on the ladder of abstraction, but will not be able to increase the level of abstraction.

The use of XSL-FO as a document format on the Web is a step downwards on the ladder of abstraction and the move threatens device-independence and accessibility.

As a general rule, documents at higher levels of abstraction are better for the Web than documents at lower levels of abstraction. Document formats below HTML on the ladder of abstraction should not be used. The use of application-specific formats should be encouraged, while also keeping in mind that simplicity is a major reason for the success of the Web.

References

[1] Raggett, D; Lam, J; Alexander, I: HTML 3 - Electronic Publishing on the World Wide Web, p. 27, Addison Wesley, 1996

[2] Ibid, p. 62

[3] Furuta, R; Scofield, J; Shaw, A: Document Formatting Systems: Survey, Concepts, and Issues, ACM Computing Surveys, Vol 14 No 3, 1982

[4] Khare, R; Rifkin, A: The origin of the (document) species, Proceedings of WWW7, Brisbane, April 1998

[5] Berners-Lee, T: Weaving the Web, p. 41, Harper San Francisco, 1999

[6] ISO 8879:1986. Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), International Organization for Standardization (ISO), Geneva, 1986

[7] Portable Document Format Reference Manual, Adobe Systems Incorporated, Addison Wesley, 1993

[8] Graphics Interchange Format, Compuserve, 1989

[9] PNG (Portable Network Graphics) Specification, Version 1.0, W3C Recommendation, October 1996

[10] Nielsen, H F; Gettys, J; Baird-Smith, A; Prud'hommeaux, E; Lie, H W; Lilley, C: Network Performance Effects of HTTP/1.1, CSS1, and PNG; Proceedings of SIGCOMM, Cannes, 1997

[11] Lie, H W; Saarela, J: Multipurpose Web Publishing using HTML, XML and CSS; Communications of the ACM, October 1999

[12] Lorimer, P: A critical evaluation of the historical development of the tactile modes of reading and an analysis and evaluation of researches carried out in endeavours to make the braille code easier to read and to write, Ph.D. Thesis, University of Birmingham, December 1996

[13] User Agent Accessibility Guidelines 1.0, W3C Proposed Recommendation, World Wide Web Consortium, March 2000

[14] Lie, H W; Bos, B: Cascading Style Sheets, level 1, W3C Recommendation, World Wide Web Consortium, December 1996

[15] Bos, B; Lie, H W; Lilley, C; Jacobs, I: Cascading Style Sheets, level 2, W3C Recommendation, World Wide Web Consortium, May 1998

[16] Extensible Stylesheet Language (XSL), W3C Working Draft, World Wide Web Consortium, March 2000

[17] Clark, J: XSL Transformations (XSLT) Version 1.0, W3C Recommendation, World Wide Web Consortium, December 1999

[18] MathML W3C Recommendation, World Wide Web Consortium, July 1999