
Indo-European Studies Bulletin: A Case Study for Publishing with XML and Unicode

Deborah Anderson
Visiting Scholar, Department of Linguistics
University of California at Berkeley
Berkeley, California 94720
dwanders@socrates.berkeley.edu

Background to the Problem

Ancient Greek and Latin are two members of the large group of ancient Indo-European languages, whose geographical spread stretched from Europe to Asia in antiquity. Indo-European studies incorporate the study of this group of the ancient related tongues, their linguistic prehistory (and later developments), their cultures, and the archaeological remains that might be associated with Indo-European-speaking peoples.

Because the field of Indo-European spans so many different languages and cultures, doing research requires access to a wide variety of primary and secondary materials, including many small European publications. However, the current "crisis in the libraries" has meant that escalating prices for books and journals have restricted libraries' ability to keep up with serial and monograph purchases. How, then, can one do adequate research, short of relying on interlibrary loan, if libraries are on fixed budgets and cannot purchase all the needed materials? The problem is hardly restricted to Indo-European studies: Classicists and those in other fields of the humanities (and even in the sciences) face it as well. For those in poorer countries, the problem can be even more dire.

If the Internet can make materials accessible for free or at a reduced cost to a wider audience, then it could be an effective means of scholarly communication. This depends, of course, on users having access to a computer hooked up to the Internet, with the appropriate software and hardware installed.

Outline of Project

As a means of making Indo-European materials more accessible to scholars in the US and abroad, I have devised a project in conjunction with the Electronic Text Unit of the UC Berkeley Library to test Web publication for Indo-European studies. Initial seed-funding was provided by Bryn Mawr Reviews. The project is intended to examine whether electronic journal publication is a viable alternative to print for Indo-European and those smaller fields, especially where publications are being produced by small societies or university departments. The single most challenging aspect of the project is how to display (and print) a wide variety of scripts.

The publication being put online is the Indo-European Studies Bulletin, a small journal affiliated with the UCLA Indo-European Studies Program. The publication now has a small but ever-growing base of subscribers. It is formally affiliated with a support group at UCLA, so any money remaining after the publishing and mailing costs are met goes to support an annual IE conference and to bring speakers to UCLA. Those scholars in Eastern Europe, mainland China, and the former Soviet Union can receive the Bulletin gratis, in the belief that promoting scholarly communication to scholars in these areas is of primary importance and the expense to the group is justified.

The project is intended to look at the following questions:

  • How difficult is it to produce an electronic version? What is the best method?
  • Are there easy-to-use affordable tools for creating an electronic journal in this field?
  • What costs are involved?
  • Can the costs be covered and still offer the publication at a low price (or free)?
  • On the more practical side: Can the electronic version be accessed by those using older equipment?

Other Electronic Journals

Examining how other multilingual electronic journals have been set up can be instructive, since they too have had to grapple with the pesky problem of getting the various languages to appear and print correctly. A review of existing electronic journals shows a current dearth of titles in Indo-European, particularly in Indo-European linguistics. Classics, on the other hand, has a more sizable number of electronic journals. (Classics is defined here in the broad sense of classical literature, linguistics, history, and archaeology.) Of the titles found, a growing number are free publications, often put out by a university department or society (e.g., Classics Ireland, affiliated with the Classical Association of Ireland and put out by University College Dublin). Of the free titles listed on various e-journal lists (e.g., https://gort.ucsd.edu/newjour/), some ceased publication after a few issues (e.g., Arachnion).

Perhaps the most rapidly growing sector of electronic journals is that of electronic versions of established print titles. While a few large publishers produce the electronic version themselves (such as Johns Hopkins), many others use "service providers" to produce and distribute them.

Three formats are predominantly used in electronic journal publication.

  1. The most common format for the electronic version is Adobe's Portable Document Format (PDF). This proprietary format is easy to use and embeds the fonts, so that documents are readable from any computer platform as long as the free Adobe Acrobat Reader has been downloaded. The online version mirrors the print publication. There are two drawbacks to PDF:
    • you can't search for some of the "odd" (non-ASCII) characters, and
    • since it isn't an international "standard," longevity is not necessarily guaranteed.
  2. The other format of choice for online journals is HTML, or Hypertext Markup Language, which has become the standard for the Web. It is appearing more frequently in electronic journal publication, probably because it allows greater capabilities, such as active linking to the bibliography, footnotes, other articles, etc. The drawback to HTML is that it provides a restrictive set of tags and is limited to a fixed, simpler type of document. HTML does, however, allow the possibility of including multiple scripts and searching for characters in those scripts.
  3. A third type of publication should be mentioned: text-only or "plain text" (=ASCII). This format appears in the email versions of the Bryn Mawr Classical Review and the Medieval Review. By using plain-text Beta Code to encode Greek, these publications are accessible to all and searchable, although Beta Code doesn't render the original Greek script (and one needs to know the transliteration scheme).
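The Beta Code approach can be sketched in a few lines. The mapping below is a minimal, illustrative subset (unaccented letters only) of the full Beta Code scheme used by the TLG and BMCR; the function name and the tiny letter table are my own, not part of any published converter.

```python
# Minimal sketch: converting a tiny subset of Beta Code (ASCII) to
# Unicode Greek. Real Beta Code also covers accents, breathings, and
# punctuation; this toy table handles only the letters in the demo.
BETA_TO_GREEK = {
    "A": "α", "B": "β", "G": "γ", "L": "λ", "O": "ο", "S": "σ",
}

def beta_to_unicode(text: str) -> str:
    """Transliterate one Beta Code word, applying the final-sigma rule."""
    out = "".join(BETA_TO_GREEK.get(ch, ch) for ch in text.upper())
    # Greek writes sigma at the end of a word as final sigma (ς, U+03C2).
    if out.endswith("σ"):
        out = out[:-1] + "ς"
    return out

print(beta_to_unicode("LOGOS"))  # λογος (ends in final sigma, U+03C2)
```

The word-final sigma rule is exactly the kind of detail a reader must know to search a plain-text journal encoded this way.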

Format Choices for IES Bulletin Project

A number of important developments have taken place on the Web, notably the creation of a new standard designed to succeed HTML: XML, the Extensible Markup Language. Because XML provides the ability to include markup for content (and is less complex than its parent, SGML), allows the tagset to be extended (hence "extensible"), and is now supported by the latest versions of the popular browsers (Internet Explorer 5+, Netscape 6/Mozilla), we decided to use XML for the IES Bulletin. A second motivating factor was to see what kinds of problems would arise when working with Indo-European material in XML. For guidelines on markup (and the DTD), we used the Text Encoding Initiative's TEI-Lite. XML also relies on Unicode, the universal character encoding standard. In Unicode, every character receives its own unique number, irrespective of computer system, software program, etc. Such a standard is a necessity for working in multiple languages, because in the past the variety of encoding schemes made it difficult to transmit multilingual documents across the Web without problems. As an example, consider what happens when converting a final sigma from GreekKeys to WinGreek (drawn from the example by Sean Redmond, "Greek Font to Unicode Converter", at https://www.jiffycomp.com/smr/unicode/convert.php3): the final sigma in GreekKeys becomes an omega in WinGreek, because the two fonts encode the letters differently. In Unicode, no matter what computer platform or software program, the final sigma should appear as such when transmitted to others (as long as the fonts on the sending and receiving ends are Unicode-compliant and both sides have Unicode-enabled operating systems and browsers).
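The point about stable character identity can be demonstrated directly: in Unicode the final sigma has one code point (U+03C2) and one UTF-8 byte sequence on every platform, whereas legacy Greek fonts assigned the same byte positions to different letters.

```python
# Under Unicode, GREEK SMALL LETTER FINAL SIGMA is always U+03C2 and
# always serializes to the same UTF-8 bytes, regardless of platform.
# Legacy font encodings (GreekKeys, WinGreek) reused the same byte
# values for different letters, which is why conversions garbled text.
import unicodedata

final_sigma = "\u03c2"
print(unicodedata.name(final_sigma))   # GREEK SMALL LETTER FINAL SIGMA
print(hex(ord(final_sigma)))           # 0x3c2
print(final_sigma.encode("utf-8"))     # b'\xcf\x82' -- same bytes everywhere
```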

Methodology

The following is a synopsis of the methodology used for creating the print version. First, we convert all articles from authors into Microsoft Word (if they are not already in that format) and then do the editing. The text is then imported into PageMaker for formatting and final proofing before the final print version is created and sent to the printer. The online version is created from a plain-text export of the PageMaker file. It is inserted into the XML skeleton and edited with Emacs. After the file is parsed against the TEI-Lite Document Type Definition, or DTD, we turn to a product which Berkeley currently owns, DynaWeb, to "make book," which allows one to create HTML on the fly. It also offers a number of nice features, such as searching. However, at $75,000, DynaWeb is unaffordable for all but a few large institutions.
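The parsing step above can be sketched with Python's standard library. Note the hedge: the stdlib parser checks only well-formedness, not validity against the TEI-Lite DTD (the project used a validating parser for that); the element names below follow TEI Lite, but the snippet itself is illustrative, not an actual Bulletin file.

```python
# Sketch of the parsing step: before DTD validation proper, an article
# inserted into the XML skeleton must at least be well-formed. Python's
# stdlib parser checks well-formedness only; full validation against
# the TEI-Lite DTD requires a validating parser.
import xml.etree.ElementTree as ET

article = """<TEI.2>
  <teiHeader><fileDesc><titleStmt>
    <title>Indo-European Studies Bulletin</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>Sabellian is a group of languages of ancient Italy.</p>
  </body></text>
</TEI.2>"""

root = ET.fromstring(article)            # raises ParseError if malformed
print(root.tag)                          # TEI.2
print(root.find(".//title").text)        # Indo-European Studies Bulletin
```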

Problems

We have at present converted a number of Indo-European Studies Bulletin issues for online publication. Some significant problems have turned up. Since the goal of the project was to see what kinds of problems arose - and not necessarily to bypass them in an effort to publish quickly - it was deemed important to examine them.

  1. Some scripts appearing in our issues are not included in the Unicode Standard. For example, one article discusses Sabellian, a language group of the ancient Italian peninsula. The characters of its native script are not yet in the Unicode Standard. Because the time from a first proposal to the Unicode Technical Committee until acceptance is about three to five years, and publication deadlines are more immediate, a short-term solution is required. There are a variety of approaches available:
    • One of the simplest is to use an image, such as a GIF, in the character's place, though it can't be searched. This method has been used in our publication.
    • A second approach is to use a transliteration/transcription scheme (with ASCII), which works as long as a note is appended defining the scheme. (This is commonly done in e-mail and in a few Web publications. It would, of course, fail to show the original script.)
    • A third solution is to create a font with the characters (placing the non-Unicode characters in the "Private Use Area" set aside by the Unicode Standard), and let users download and install the font before viewing the document. (This is done for the TITUS materials, as well as for another big project, the Leiden Etymological Database.) Perhaps this offers the best solution, since it allows searching. However, good cross-platform fonts that are free have not yet been created.
    • Using PDF is another possibility, but, as mentioned above, it doesn't allow searching for many non-ASCII characters and is a proprietary format.
    • The final option is that of "dynamic fonts," whereby a font is embedded into a webpage and delivered to the user with it. (I mention this only in passing, since Netscape reports that it is no longer supporting dynamic fonts.) Microsoft has an equivalent called WEFT (Web Embedding Fonts Tool).
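The Private Use Area approach in the third option can be made concrete. The character names and code point assignments below are entirely hypothetical, which is precisely the point: PUA assignments are a local convention agreed between the publisher and the font, not part of any standard.

```python
# Sketch of the Private Use Area approach: characters not (yet) encoded
# in Unicode are assigned code points in the PUA (U+E000..U+F8FF) by
# local convention. The assignments below are hypothetical examples,
# meaningful only to readers who install the matching font.
PUA_START, PUA_END = 0xE000, 0xF8FF

local_repertoire = {
    "SABELLIAN LETTER ONE": chr(0xE000),  # hypothetical assignment
    "SABELLIAN LETTER TWO": chr(0xE001),  # hypothetical assignment
}

for name, ch in local_repertoire.items():
    assert PUA_START <= ord(ch) <= PUA_END  # stays inside the PUA
    print(f"{name}: U+{ord(ch):04X}")
```

Because the text is stored as ordinary code points, it remains searchable, which is the advantage over GIF images noted above.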

Although I have outlined the short-term solutions for displaying missing characters, I would like to add that these do not solve the long-range problem, namely, that such scripts should be included in Unicode, since Unicode ultimately aims to cover all the dead (and living) languages of the world. Unfortunately, I have discovered that there has been a lack of participation from scholars in reviewing Unicode proposals, which is regrettable, since it is scholars who are most likely to use and benefit from the inclusion of various dead languages in the Unicode Standard. Also, the Technical Director of the Unicode Technical Committee has stated (e-mail communication, 10/28/00) that "[t]here are powerful, countervailing forces within the [computer] industry and among standardization circles that would prefer that the Unicode Standard stop changing and expanding, since adaptation to such change is expensive and unsettling." In other words, the time may soon come when no additional historic scripts are accepted, because of such sentiment among Unicode Consortium members. Hence it is important to continue to pursue the inclusion of these scripts now, and to provide feedback on any errors or missing characters in the repertoire. (In this regard, I have been directly involved in getting an Old Italic proposal into Unicode, as well as those for Linear B and cuneiform.) Since ancient scripts are needed ultimately for Indo-European online publication and research, working on Unicode proposals has become a new focus of the project.

However, in order to get Unicode to work, a user needs three things:

  • the latest browser (Internet Explorer 5+ or Netscape 6/Mozilla), set to "Unicode" (or "UTF-8") encoding;
  • a Unicode-compliant font installed on the machine;
  • an operating system that supports Unicode, e.g., Apple Mac OS 9.0, Mac OS X Server, Mac OS X, Microsoft Windows CE, Windows NT, or Windows 2000. (For a list of Unicode-enabled products, see https://unicode.org/unicode/onlinedat/products.html.)

Because our project aims at providing access to a wide audience, requiring these three components of all users seems like a tall order. Since the main browsers (IE, Netscape) are available for free download, the first requirement is not too troublesome. The font problem could be resolved if the needed fonts were created and made downloadable for free (or for a nominal amount). Requiring a new or recent operating system is more problematic, particularly for scholars with restricted funding, since operating systems are still adding more Unicode support with each new version. A PDF version should probably be offered in the interim as an option, since it allows easy access and printing for those with older machines.

  2. In order to mark up the languages in XML, you need a language tag. While the majority of basic languages are covered (such as Ancient Greek, Old English, and Middle English), a number are not. I have been in contact with members of the Unicode committee and Peter Constable of SIL, who recommended that those working with dead languages maintain a single list of language codes to be agreed upon and used amongst scholars. Eventually these can be submitted for registration to the relevant international body. Having the requisite language tags matters because, for example, some ancient Greek dialects use different letter shapes; the language tag can be used to select the correct font, so that the correct shape will appear.
  3. Another problem surfaced in a recent issue: how to mark the underdot for an unsure reading, a common sign in epigraphy and papyrology. A group at Oxford is working on a proposal for such symbols to be included in the Unicode Standard, and the TLG at Irvine is also assembling a list. Originally we had used the Unicode "combining underdot" for the unsure reading. However, Unicode Technical Committee members have pointed out that it would be better handled differently, since the underdot has a specific phonetic meaning in many languages (e.g., a vocalic r in Sanskrit) but here indicates something different. This particular problem needs to be discussed more widely amongst those working with old documents and the character encoding standards committees; otherwise there will be inconsistency.
  4. The cost of DynaWeb, at $75,000, is prohibitive for all but a few large institutions. One very likely possibility is for us to convert to Cocoon, an open-source program from apache.org. (The Perseus text-hopper might be another option.)
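The language-tag mechanism in problem 2 can be sketched as follows. "grc" is the registered code for Ancient Greek; "x-sabellian" is a private-use placeholder standing in for a language with no registered code yet, and the font names are hypothetical.

```python
# Sketch: using xml:lang in the markup to drive font selection. "grc"
# is the registered tag for Ancient Greek; "x-sabellian" is a
# private-use placeholder for an unregistered language. The font names
# are hypothetical.
import xml.etree.ElementTree as ET

# ElementTree expands the built-in xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

FONT_FOR_LANG = {
    "grc": "AncientGreekFont",      # hypothetical font name
    "x-sabellian": "SabellianFont",  # hypothetical font name
}

doc = ET.fromstring(
    '<p>The form <foreign xml:lang="grc">λόγος</foreign> is discussed.</p>'
)
for el in doc.iter("foreign"):
    lang = el.get(XML_LANG)
    print(lang, "->", FONT_FOR_LANG.get(lang, "default"))
```

A stylesheet would perform the same lookup to render, say, dialect-specific Greek letter shapes with the right font.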
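The ambiguity behind problem 3 is visible at the code-point level: the same COMBINING DOT BELOW (U+0323) serves as a phonetic diacritic in one context and as an epigraphic "unsure reading" mark in another, and nothing in the encoded text distinguishes the two.

```python
# One code point, two unrelated meanings: COMBINING DOT BELOW (U+0323)
# marks a phonetic value in Sanskrit transliteration (vocalic r) but
# was being reused for the epigraphic "unsure reading" dot.
import unicodedata

vocalic_r = "r\u0323"          # phonetic: Sanskrit r with dot below
unsure_alpha = "\u03b1\u0323"  # epigraphic: alpha read with uncertainty

print(unicodedata.name("\u0323"))  # COMBINING DOT BELOW

# NFC folds r + dot into the precomposed LATIN SMALL LETTER R WITH DOT
# BELOW, while the Greek sequence has no precomposed form and remains
# two code points -- the two "underdots" even normalize differently.
print(len(unicodedata.normalize("NFC", vocalic_r)))     # 1
print(len(unicodedata.normalize("NFC", unsure_alpha)))  # 2
```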

Other Issues

A number of issues have yet to be examined:

  • In what way can peer review be easily managed?
  • Will offering an online version mean a significant loss of revenue? Will the print publication cease to exist, as happened with BMCR? This raises the tension between "free access for all" and the need to cover costs while providing funds for the UCLA support group's efforts. Potential models include charging for the current issue only and making past issues free. Should the entire issue be free? Or should only "highlights" or abstracts be made available at no cost? These questions still need to be discussed and resolved.

The Future

Future work will entail converting the remaining back issues of the IES Bulletin to an online format, noting any further problems that arise with XML and Unicode. Eventually, finding an inexpensive alternative to DynaWeb will be necessary. Continued discussion of how to handle character encoding/text markup issues (e.g., the underdot for unsure reading) and the creation of a "Best Practices" guide would be a desideratum, as would a guidebook on how to produce an online publication for small departments and societies in the humanities.

Conclusion

Will our efforts in online journal publishing ultimately help make more IE materials accessible? I hope so. Our work to this point has concentrated on making the Internet infrastructure more amenable to handling various ancient languages. The standards and open-source programs do provide a means for smaller departments and societies to publish online, although we cannot yet recommend how such efforts should be paid for. Still, we are poised to improve scholarly access and communication greatly via the Internet, and I think this opportunity should be taken up. However, without input from scholars (especially regarding standards), the ability to do research and publish over the Internet could be seriously compromised.


This file was posted on 15 June 2001.
Please send your comments to Michael DiMaio, jr.
