Electronic Publishing: Examining a New Paradigm

It seems that we, in the field of Library and Information Science, are destined to live in interesting times. The printed word, that sedate building block of the modern world, is undergoing a revolution whose effects may prove as far-reaching as anything since the time of Gutenberg himself. The new world of electronic publishing, producing and distributing materials to be viewed on a personal computer, whether over a network or on a CD-ROM, is taking shape before our very eyes. This paper examines three major areas of this revolution in progress:

* The forces driving the move to electronic publishing.
  o Demographics.
  o Economics.
  o Technology.

* Competing document formats for electronic publication.
  o Content-only formats.
  o Layout-oriented formats.
  o Mark-up formats.

* Problems with wide scale use of electronic publishing.
  o Network Constraints.
  o Copyright.
  o Standardization.

* Conclusion.

----------------------------------------------------------------------------

Three Forces Driving the Move to Electronic Publishing


Demographics

The wellspring of electronic publishing is the deceptively simple fact that that's where the people are. The twin trends of the massive growth of the World Wide Web over the last three years and, to a lesser extent, the growing number of personal computers that Americans now own, most of them equipped with a CD-ROM drive, have created a whole new market of people with access to the technology that makes electronic publishing possible.

And it's not just the fact that these people have access that matters to potential publishers and their advertisers; who these people are matters a great deal as well. Really accurate numbers have proven elusive, especially for CD-ROM use, but several solid attempts have been made to collect demographics on web users, including the Graphics, Visualization, & Usability Center's (GVU) 4th WWW User Survey of over 23,000 users, 76.2% of whom were from the US. The figures indicate that the average web user is 32.7 years old, with 29.3% of users female and 70.7% male. The GVU also estimates the average income of web users to be US$63,000.

If these demographics seem familiar, it's because this is the same general group that television advertisers have consistently been willing to pay huge sums of money to reach. The web population has the added bonus of being affluent enough to afford to get on-line, whether from home, college, or work, and curious enough to try something new. This market has the potential to be extremely lucrative, and the appearance of ads on a number of commercial sites, including all the major search engines, indicates that advertisers are definitely paying attention. In fact, the web offers advertisers new potential to focus their message only on the group of users most likely to purchase their goods.

----------------------------------------------------------------------------

Economics

Another major advantage electronic publishing has over traditional printing is its relative lack of expense. In fact, compared to traditional print media, electronic publishing is downright cheap. The price of just about everything in the traditional print world has been going up recently. Adobe Magazine reports in its May/June 1996 issue (p. 36) that in the last 8 years postage has climbed almost 66%, while paper costs rose 44% in 1995 and are expected to jump another 20% this year. This is enough to put even the most successful print publications under financial stress.

The lack of these kinds of expenses makes publishing electronically all the more attractive. A really first-class website might cost US$1 million to set up, including hardware and an Internet connection; this is a relatively small figure in the world of publishing giants. Another bonus is that the support staff can be much smaller: no printers or post-production people, and only a few staff members to maintain the servers.

----------------------------------------------------------------------------

Technology

The last, but certainly not least, trend driving the shift to electronic publishing is the technology itself. The World Wide Web is a great example of this. Three or four years ago the web was populated by little more than scientists with high-bandwidth, permanent Internet connections. Today, millions of people, perhaps as many as 18 million, use the web regularly. And as more things have become possible on the web, and on CD-ROM, technology makers, in both hardware and software, have stepped forward to give content creators and users more control and leverage.

A primary example of this is the concept of repositioning: the process of publishers taking material, "content" as it has come to be known, created for print or other media, and migrating it to the web. This has allowed traditional publishers to test the electronic waters without making a massive commitment of capital and resources, and the process has been made much easier by traditional software companies creating modules for their applications that output HTML- or PDF-ready documents. In effect, this process allows publishers to make a slow, gradual shift into the electronic world using software with which they are already familiar.

----------------------------------------------------------------------------

Competing Document Formats for Electronic Publication

A key concern for anyone involved in electronic publishing is which format to use. There are a number of competing formats, both proprietary and open-standard, and methods of production that can almost overwhelm someone new to the field. For purposes of analysis they can be grouped into three basic categories: content-only formats, layout-oriented formats, and mark-up formats.

Two key concepts inform the discussion of differences in format types for electronic publishing: Document layout and document content. Each of these formats can be evaluated in these terms, and their relative abilities in these two areas are key to understanding their usefulness as publishing media. The relationship of these two abilities in each format is represented in figure 1.

Document layout refers to the original design of a document, including typographical attributes, use of white space, etc. Layout is very important because of its ability to create personality, and personality is the defining attribute that separates one publication from another. Major print publications have spent millions of dollars to achieve "the look" by which they wish to be defined, and understandably they don't want to lose it just by going on-line. Nor do smaller publishers want to be constrained in designing their documents.

Document content, on the other hand, refers to the actual information contained in the document. This content is largely a function of the ASCII standard, meaning that it is searchable, easy to transmit, and easy to reproduce. These three abilities represent the major advantages that electronic media hold over print, and thus are key to any discussion for that reason alone.

As for the question of which format is best: it depends entirely on the use of the documents. For quick and dirty electronic publishing, ASCII might be the easiest way to get your message out, especially if the data doesn't refer to, or require, graphics. For publication on the Internet, HTML is the only really viable way to go, simply because it's easy to create, quick to download, and widely supported. On the other hand, for CD-ROM distribution, PDF might be more appropriate: the publisher can maintain more control over the appearance of the document, downloading speed is not an issue, and the viewer can be loaded onto the CD itself. If, however, long-term reusability of the document is the most important goal, then SGML may be more appropriate, as it can be re-implemented in a different form many times over the lifetime of the document. It seems that the best overall format for electronic publishing is the one most suited to your needs.

----------------------------------------------------------------------------

Content Only Formats

The primary example of a content-only format is ASCII, the American Standard Code for Information Interchange, standardized as ANSI X3.4 and ISO 646. It is designed solely to represent the alphanumeric content of a document, without indicating in any way how the document should be formatted.

Most people would not even consider ASCII a "real" option for electronic publishing, although the minority who favor it can be quite persuasive, but rather treat it as the base coding for something else like HTML, SGML, or PDF. And if the format of a document is really key (i.e. an advertisement or a commercial entertainment work), then ASCII probably isn't your best bet; but for other uses it actually has quite a bit going for it.

First and foremost, precisely because ASCII acts as the underpinning for so many other formats, and has been standardized by ANSI and the ISO, ASCII is universally readable. Every operating system comes equipped with a simple text editor, and every word processor can read ASCII text. Thus, if a publication is intended to reach the maximum number of readers, no other format comes close to delivering the number of potential viewers that ASCII can.

ASCII also offers users a degree of power not common with some of the other data publishing types, most notably in its flexibility and malleability. A vast array of tools exist for almost every platform for easily parsing and searching single ASCII documents as well as whole groups of texts. Also, because of its level of standardization, ASCII is one of the few formats that we can be relatively certain will still be readable in 25 or 50 years. The same can't be said of HTML or PDF. And to top it all off ASCII is extremely economical in terms of bandwidth usage, because there is no extra code for formatting information.
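The kind of tooling described above is easy to sketch. The following minimal example, with invented document names and contents purely for illustration, shows how trivially plain ASCII text can be searched with nothing but a few lines of general-purpose code, no special parser or viewer required:

```python
def search_documents(documents, term):
    """Return (name, line_number, line) for every line containing term."""
    hits = []
    for name, text in documents.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if term.lower() in line.lower():
                hits.append((name, number, line))
    return hits

# Invented sample documents standing in for a collection of ASCII files.
documents = {
    "notes.txt": "ASCII is universally readable.\nEvery editor opens it.",
    "draft.txt": "Formatting is lost, but the content survives.",
}

for name, number, line in search_documents(documents, "readable"):
    print(f"{name}:{number}: {line}")
```

The same approach scales to whole directories of files, which is exactly the flexibility a binary or layout-heavy format gives up.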

Of course, ASCII's big drawback is also the source of its strength: the simplicity that makes it so widely and easily used also keeps it from displaying typography, formatting, and graphics. While this lack of flash has never kept anyone from reading a paper novel or a newspaper, on the screen, where people are less likely to be comfortable spending a great deal of time reading, graphics and formatting can help maintain interest and speed the depiction of ideas. Sometimes, as the old saying goes, a picture really is worth a thousand words.

----------------------------------------------------------------------------

Layout-Oriented Formats

The second major type of format is layout-oriented. This format type is designed specifically to maintain the 'look and feel' of a document as it was created by the author. Within this category are a number of competing proprietary formats, including Envoy from Novell and Common Ground from Common Ground Software, although by far the most successful of these is the Portable Document Format (PDF) from Adobe, embodied in the Acrobat electronic publishing system. As figure 1 shows, PDF is almost equally concerned with content and layout. While it can preserve the original layout of a document almost perfectly, it also maintains the content in a searchable, machine-readable format.

PDF is roughly based on Adobe's older PostScript printer language and is easiest to understand in that context. Much like PostScript, PDF is designed to maintain the layout of a document from the format in which it was created to its final viewable form. This is a key concern for professional publishers who need to preserve the 'look and feel' of their documents, and helps explain PDF's popularity with this group.

PDF essentially works as a printer device once the Acrobat Writer is installed on a machine. All a user needs to do is select the Adobe module as their printer, and any document printed from any application will be turned into a reasonably exact replica of the original, one that remains machine-searchable. This is extremely powerful, and coupled with Adobe's free distribution of the viewer software it makes Adobe a formidable contender in the world of electronic publishing.

The downside of PDF is that only those willing to shell out the money for the software can create content. The complete package of all the Acrobat software modules now costs over $3,000, although individual parts can be had for significantly less; even so, the price remains prohibitive for non-commercial uses. The other downside of PDF is the large file size needed to record all the formatting information, which can make a document of any length extremely slow to transfer over a modem, and even slightly annoying over a faster LAN connection.

----------------------------------------------------------------------------

Mark-up Formats

The third, and now most common, type of format is the mark-up format, the two main examples of which are SGML and HTML. Mark-up formats have the benefit of being created entirely from ASCII, using standard text to form the mark-up tags embedded in the document, and so they produce small files that are easy to send over a network. Another big size advantage mark-up formats hold over layout formats is that they rely on each implementation, whether a web browser or a printed page, to supply the physical layout information, and so don't need to code that information into the actual file: this can significantly reduce file size. The other big plus for mark-up formats, at least for SGML and HTML, is that they are international standards, free for anyone willing to take the time to figure them out.

----------------------------------------------------------------------------

Standard Generalized Mark-up Language (SGML) is an international standard (ISO 8879) established primarily for the use of large scale governmental printing operations, technical documentation, and commercial ventures. Many of its features are adopted from earlier conventions in those areas.

SGML is actually a 'metalanguage' for describing the content and structure of a document logically, independent of its layout, and as such it is much more concerned with the content of a document than its 'look and feel.' (Figure 1.) The method employed by SGML and its subtypes is that of the mark-up tag. For example, instead of the usual typesetter's practice of identifying a heading within a document through the use of bold or perhaps underlined text, in SGML the heading text would simply be surrounded by a tag, i.e. <heading>Heading Text</heading>.

This allows a great deal of flexibility when a document is finally printed or otherwise displayed from SGML, in that the headings can all be displayed in the manner best suited to the audience. Thus for one printing all the headings might be in 12 point Helvetica Bold, on another occasion or for a different audience they might be 14 point Times Roman underlined, and later the same document viewed on-line might be displayed in Arial or some other screen-friendly font. The point is that the text isn't locked into the format that seems most convenient, for whatever reason, at its initial publication.
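The separation this paragraph describes can be sketched in a few lines: the document stores only the logical role of each piece of text, while each rendering supplies its own typography. The tag names and style tables below are invented for illustration and do not represent any real SGML application:

```python
# Each entry pairs a logical role (the mark-up) with the text content.
document = [("heading", "Electronic Publishing"),
            ("para", "The printed word is undergoing a revolution...")]

# Two hypothetical style sheets: one for print, one for the screen.
print_styles  = {"heading": "12pt Helvetica Bold", "para": "10pt Times Roman"}
screen_styles = {"heading": "14pt Arial",          "para": "12pt Arial"}

def render(doc, styles):
    """Apply a style table to the logical document; the text never changes."""
    return [f"[{styles[role]}] {text}" for role, text in doc]

for line in render(document, print_styles):
    print(line)
for line in render(document, screen_styles):
    print(line)
```

The same logical document yields two different presentations, which is precisely why SGML content outlives any one layout.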

A key concept in the area of SGML publishing is the Document Type Definition or DTD. A DTD is the set of rules that an author must follow in the creation of a particular type of document. Each document must therefore conform to a DTD, which acts in effect as a list of allowed mark-up types. Common examples of document types are things like memos, quarterly sales reports, or journal articles.
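One small piece of what a DTD enforces, the list of allowed mark-up types, can be illustrated with a simple check. The "memo" element set below is invented for illustration; a real DTD also governs nesting, ordering, and attributes, which this sketch ignores:

```python
# Hypothetical element list a "memo" DTD might declare.
MEMO_DTD_ELEMENTS = {"memo", "to", "from", "subject", "body"}

def validate_elements(used_elements, allowed):
    """Return the set of elements a document uses that the DTD does not allow."""
    return set(used_elements) - allowed

# A document using one element ("sidebar") the memo DTD never declared.
used = ["memo", "to", "from", "subject", "body", "sidebar"]
invalid = validate_elements(used, MEMO_DTD_ELEMENTS)
print(invalid)  # the undeclared element(s)
```

A validating SGML parser performs this kind of check, and much more, automatically against the DTD.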

Conceptually, at least, SGML sounds like a pretty good solution for electronic publishing. Unfortunately, it has one tremendous drawback: it is complicated to use. Unless a potential publisher had a lot of a particular type of document to create, they wouldn't want to go through the months of effort required to design and create a DTD. And every new document type would require the same effort for a new DTD. This is a serious hindrance to the widespread adoption of SGML.


Hypertext Mark-up Language (HTML), the lingua franca of the World Wide Web, is often considered a sub-implementation of SGML, in that it is a particular type of DTD, although for historical reasons it offers the rare example of a body of documents existing before its DTD was created. HTML was, at its conception in 1990, simply an SGML-like language rather than a true sub-implementation of SGML. It wasn't until 1994 that a DTD was formulated, for what was then HTML+. Of course, that was soon outdated by HTML 2.0, and now HTML 3.0.

This situation highlights one of HTML's most serious problems as an electronic publishing method: It lacks the standardization necessary to make it a dependable long-term solution. With Microsoft and Netscape, as well as other Web browser vendors, adding their own proprietary tags, some of which may or may not be accepted into later standardization specifications, no implementation of a document in HTML can be assured to function on every browser or remain viable for any extended period of time into the future, unless it is composed of only the simplest tags. This limits HTML as an electronic publishing medium.

On the positive side, HTML naturally shares many of the same advantages as SGML. It is ASCII-based and thus makes small, network-friendly files, and it is free for anyone to use; it diverges from SGML in that it has the overwhelming advantage of being extremely easy to use. Also, the World Wide Web Consortium has been promising style sheet tags for the web for quite some time; if these actually come to pass, they could alleviate some of HTML's worst problems and help it become an even better format for electronic publishing.

----------------------------------------------------------------------------

Problems with Wide Scale Use of Electronic Publishing

While electronic publishing will no doubt continue to expand as a mass medium, it still has its share of obstacles to overcome, including the following three:


Network Constraints

Network constraints may act to limit the growth of electronic publishing, precisely because the most interesting things electronic publishing can offer, the things that help it compete with print media and even TV, are graphics and interactivity. The situation is perpetually one of more people trying to access a web that continually contains more large graphics and other high-bandwidth files. This is a roundabout way of saying that the network is going to have to become much faster in order to transfer all of these larger, more interesting files rapidly.

The current situation is that 28.8 kbps seems to be the ceiling for the transfer rates of existing modem technology. While ISDN prices are dropping, ISDN remains expensive in most parts of the US and involves various installation hassles. And T1 lines remain wildly outside the range of the typical end user.

As long as these conditions remain in place, perhaps the best place to look for speed increases is in the network itself. The web specifically, like the Internet in general, uses network resources in a fairly inefficient manner. This has a lot to do with the Internet's original purpose: to allow communication to continue between geographically separate military installations in the event that a nuclear attack knocked out portions of the communications system. In short, the Internet, and the web on top of it, were made to be extremely reliable. Survival is key, not speed.

A typical HTTP transaction currently consists of making a separate connection to the server for each file involved. This means that every graphic on a page involves a separate connection to the server. After the initial contact is made, each file is broken into packets and turned loose on the network to make their way, reliably but not necessarily quickly, to the client that made the request, where each file is recombined and each connection is terminated. Every one of these connections takes time and network resources. And of course, increasing the number of large graphics, movies, and sound files moving over the network will only exacerbate this situation.
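The cost described above grows with every embedded file on a page. The sketch below, using an invented HTML fragment, simply counts the connections an early-HTTP client would open: one for the page itself plus one per embedded image:

```python
import re

def count_connections(html):
    """1 connection for the page itself + 1 per embedded image file."""
    embedded = re.findall(r'<img[^>]+src="([^"]+)"', html, re.IGNORECASE)
    return 1 + len(embedded)

# Invented page fragment: one HTML file referencing three graphics.
page = '''<html><body>
<img src="masthead.gif"><img src="photo1.jpg"><img src="ad.gif">
</body></html>'''

print(count_connections(page))  # 4 separate connections for one page
```

A page with a dozen graphics would need thirteen connections, each with its own setup and teardown overhead, which is the inefficiency HTTP 1.1's persistent connections aim to remove.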

The IETF, as well as other organizations, has been working to improve this situation with the development of HTTP 1.1, but given the phenomenal growth of the Internet and the web, it remains to be seen whether even these valiant efforts will keep the network viable.

One long-term solution would be to install large caches for great portions of the web at strategic locations on the network, in order to shorten the time it takes to get a file and to reduce the network resources used by reducing the distance files must travel. Even this solution is controversial, however, mostly because certain groups fear the caches could be used to filter 'unacceptable' content, and also because storing these portions of the web might violate copyright laws. Another good criticism of the idea is simply to ask who should, or would, pay to maintain these caching nodes.

----------------------------------------------------------------------------


Copyright

Another area that has troubled, and will continue to trouble, the development of electronic publishing in the United States is current copyright law. Copyright is a concept originally founded in the Constitution, whose framers granted Congress the power, in Article I, Section 8, "To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries."

What this means in real terms has been established by statute and case law over the last 206 years and is officially codified in the Copyright Act of 1976, amended in 1994. The Copyright Act (sec. 102) identifies eight areas of work that may be copyright protected: (1) literary works; (2) musical works, including any accompanying words; (3) dramatic works, including any accompanying music; (4) pantomimes and choreographic works; (5) pictorial, graphic, and sculptural works; (6) motion pictures and other audiovisual works; (7) sound recordings; and (8) architectural works. Works falling outside these eight categories are not protected by copyright, but may be protected by patent law, trade secret law, or trademark law.

What copyright tries to do (whether it actually does this, or whether it just fills corporate bank accounts, is another open question) is foster the creation of socially useful works by offering artists, authors, and other creators the incentive that they alone, or whoever they sell their copyright to, will benefit financially from the work for a certain number of years. This term is now (sec. 302) set at the lifetime of the creator plus 50 years.

The Copyright Act (sec. 106) also specifically defines the rights granted exclusively to the copyright holder for the duration of its existence. These include (1) the right to reproduce the copyrighted work, (2) to modify or produce derivative works based upon the copyrighted work, (3) to distribute the copyrighted work, (4) where applicable to perform the copyrighted work publicly, and (5) where applicable, to display the copyrighted work publicly.

An astute reader will note little mention of digital technology above, or anything indicating that electronic publishing is any different from any other kind of publishing. But the simple fact is that copyright law was formulated long before the rise of computers and distributed networks, and was designed to conform to the world as it was in the first half of this (20th) century. At the time it was generally difficult to make a copy other than by hand, and it could often prove difficult to contact an author to legitimately secure his or her permission to use a part of a book. This is certainly not true with electronic publishing, where an entire work can be distributed across the network as a simple e-mail attachment.

The nature of the Internet, the web, and the culture that has grown up around them is far different from the old paper world. It is often assumed that anything put on the Internet, a public utility, is in the public domain, and that the author, by placing it in such an environment, has tacitly given permission to use it in any way. This is partly because there is now simply no feasible way to keep track of who is using a document and how. But this is a dangerous illusion of freedom. A work in any of the eight copyrightable categories created since 1978 doesn't even need a copyright symbol on it to be copyrighted. The simple act of committing pen to paper, figuratively speaking, is enough to secure copyright. This means that unless it specifically says otherwise, everything on the Internet not subject to the usual exemptions is copyrighted by the person who created it.

But electronic and computer media are so vastly different from paper-based media that it is extremely difficult to make the old laws fit. A strict interpretation of intellectual property laws would severely reduce many of the advantages of having documents presented on a network in the first place, although this hasn't stopped misguided legislation from trying to do just that. In the end, the form that copyright will take for electronic media will be decided in the same place it has been for other new media over the last 206 years: the courts. We can all just hope that a rational balance is struck between the capabilities of the technology, the needs of the users, and the rights of the creators.

----------------------------------------------------------------------------

Standards

Standards are a very important concept in the area of electronic publishing, especially when the intended method of distribution is the Internet. In fact, the Internet is an object lesson in how open standards are supposed to work. It is an extremely heterogeneous system composed of a great number of different operating systems and hardware configurations that can all communicate because the complete specifications for TCP/IP, HTTP, and other internetworking protocols are freely available. In this type of environment, basic HTML written and posted on any machine can be expected to be read by any web browser running on any system. This is extremely powerful in terms of connectivity.

Of course, this massive open system was only made possible by being entirely subsidized by government grants. No free-market software company could afford to be so generous with its code, and so free-market systems generally consist of a number of more-or-less de facto standards, where one proprietary system achieves such a degree of market saturation that it becomes an effective standard, with less successful companies writing their applications to read the dominant application's files. The problem with this is that one year's de facto standard may be completely replaced within a few years, forcing documents stored in that format to be migrated, often a time-consuming process, into a new format. As an example, files from VisiCalc, the first hugely successful spreadsheet, probably wouldn't look so good in Excel 7.0.

This offers an important lesson as we contemplate a future world where electronic publishing has taken a place of equal importance beside television, magazines, books, and newspapers as a major mass medium. Imagine millions of publications all put into an industry-dominating proprietary format and released to the public over the course of the next 10 years. What happens when that industry leader is inevitably overtaken by the next big thing? Must all the documents be reformatted? Will the world be forced to pay licensing fees for a now-inferior technology?

Even with a supposed standard, HTML, this problem remains. It is to the advantage of every web browser producer to make its browser better than its competitors', and one easy way to do this is to invent proprietary mark-up tags that only your browser can read. Users gain new functionality from your tags, and other users are then required to use your browser to read documents that employ them. This is the tactic adopted by both Netscape and Microsoft, and it has led to a certain amount of chaos, but also better tags, on the web of today. Clearly, there are no easy answers on this issue, but it is something every user and creator of electronic documents should be aware of.

----------------------------------------------------------------------------

Conclusion

Electronic publishing will continue to become more common in the future, whether or not it ever reaches as many viewers as television, or even as many readers as newspapers. What this paper has revealed is the profound state of flux in which this new medium still exists. No matter how these issues are resolved, one thing seems certain: electronic publication ten years from now, just like everything else related to computers, will probably bear little resemblance to anything that exists today.

---------------------------------------------------------------------------- Written by Alec Miller, May 1996. ----------------------------------------------------------------------------
