Academic Referencing of Internet-based Resources

Gordon Fletcher & Anita Greenhill
This proposal originally appeared in the November 1995 issue of the Australian Library Journal
An early version also appeared in the November 1995 issue of ASLIB Proceedings

The rapid growth of the Internet has outstripped conventions for citing material from that source. Distinguishing material as a [computer file] does not provide sufficient information about the platform necessary for reading it. The URL provides useful information, but augmenting it with other details such as author and date not only provides a meaningful citation but also, through its similarity to conventional bibliographic notation, lends a greater degree of legitimacy in academic discourse. The article considers information derivable from the URL and from HTML documents (including non-displayed source text) in order to derive bibliography and in-line text citations for various kinds of material. The conventions proposed are applicable to Gopher, FTP, Usenet News, journals distributed by listservers, and email.

Despite the rapid growth of the Internet during 1994 and 1995, no adequate or consistent method of referencing material from this source has developed. Failure to address this issue will result in Internet resources not being awarded full recognition within academic discourse. Unless corrected, the significance of this oversight will be exacerbated as more academic journals become available on-line and more computer-literate students enter tertiary study. Furthermore, the status of researchers who have published in this medium will be affected and universities may deprive themselves of the staff best equipped to meet the challenges of the electronic age.

The methods used to reference material gained from the Internet should echo existing referencing styles. This consistency would improve the readability of references to Internet-based resources and would not distinguish the material solely because of its contemporary distribution medium. The work of Li and Crane (1993) currently represents the only consistent system for referencing electronic resources. Its publication prior to the wide adoption of HTML prohibited the development of as concise an approach to Internet resources as might otherwise have been possible.

Although there is a large variety of referencing systems available, the solution being proposed is consistent with the Guide to Referencing (Dow 1995) and the Australian Government Publishing Service's Style Manual (1994).

This article proposes the development of a consistent bibliographical referencing method that emerges from available information in Internet-based file formats including the Hyper-text Mark-up Language (HTML). If adopted, it will avoid the necessity for inclusion of the [computer file] label that has become a de facto and inadequate solution to a complex problem. As an early contribution to the field of Internet citations, some of the proposals offer options for consideration rather than being prescriptive.

The Problem

Distinguishing material as [computer file] has a limited utility in its acknowledgment of the need for additional tools to display the material. This generic label, however, ignores the variety of different computer platforms and file formats that currently exist. Few of these formats are interchangeable with other formats or the specific readers that are used to decode the files. Distinguishing material as a [computer file] is not a sufficiently informative pointer for referencing computer-based material. An effort should be made to provide information that will enable the same material to be retrieved by a different researcher on the basis of the information provided by the author's bibliography. This can only be achieved with a more thorough incorporation of additional details into the body of the bibliographic entry. Among the considerations for later retrieval of a file are its physical and symbolic location (these two aspects are roughly equivalent to the city and publisher entries in a conventional reference), the file name (the title) and the file format.

Currently the conventions established by the World Wide Web (WWW) - to access a variety of file formats globally - provide the most concise and informative system of incorporating Internet-based resources into referenced works. This system is based upon the Uniform Resource Locator (URL). In graphical WWW browsers activating the Show Location option reveals the URL of each document being accessed. Taking http://www.gu.edu.au/gwis/cinemedia/CineMedia.home.html as an example, the generic format of a URL identifies:

a Hyper-text document (http)
based at the Griffith University WWW server (www.gu)
which is an educational institution (edu)
physically located in Australia (au).
The file itself is nested within two directories (/gwis/cinemedia)
and is identified by the name CineMedia.home.html
the end html also indicates a Hyper-text document format.

The file name is case sensitive, thus cinemedia.HOME.HTML does not point to the same file as CineMedia.home.html. Researchers using this system for their computer-based material must be aware of this when referencing files.
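The decomposition described above can be reproduced mechanically. The following is a minimal sketch in Python using the standard library's urllib.parse module; the URL is assembled from the components listed above, and the helper name describe_url is hypothetical rather than part of any proposal in this article.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the parts discussed in the text."""
    parts = urlsplit(url)
    # Everything before the last slash is the directory nesting,
    # everything after it is the file name.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "scheme": parts.scheme,      # e.g. http
        "server": parts.hostname,    # e.g. www.gu.edu.au
        "directory": directory,      # e.g. /gwis/cinemedia
        "filename": filename,        # e.g. CineMedia.home.html
    }


info = describe_url("http://www.gu.edu.au/gwis/cinemedia/CineMedia.home.html")
print(info)

# File names are case sensitive, so a differently cased name is a different file.
print(info["filename"] == "cinemedia.HOME.HTML")
```

A researcher's note-taking tool could store these parts separately, but the full URL should still be quoted verbatim in the reference.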

HTML file format

The World Wide Web's Hyper-text Markup Language (HTML) is becoming the file format most commonly used for on-line academic journals. This is a result of the rapid growth of the World Wide Web and the ease-of-use of its graphical user interface. The basis of the WWW is the Uniform Resource Locator. The uniformity of this file identifier commends it as the basis for bibliographic referencing of WWW documents. An advantageous consequence of complex machine names and URLs is that documents written in HTML format also contain a simplified 'real world' title field. This information appears in the title bar of the WWW browser's viewing window when the site is accessed. All HTML documents can be assumed to have a file format, a title and a URL by virtue of their existence as a WWW document. The 'voluntary' or added aspects of the document may include the author's name, an institution (in place of publisher details) and a form of date.

Documents with authors present no problem and fortunately, as with most publishing and public expression, the anonymous HTML author is relatively rare. Documents with no personal author will often have an institutional body referred to within the document itself. This may take the form of a link to another WWW site or a direct reference to the institution. Sites often contain a Hyper-text link at the base of the files that allows the reader to email the site's author. Using the email address of the author in place of their actual name provides a unique identifier that can be reused meaningfully. Failing these possibilities - and this situation would be relatively rare - the HTML file contains information that is not necessarily displayed directly by the HTML browser. Most browsers allow the user to read the document's source code. The source code is usually available through a series of commands such as View + Source. As HTML is straight text with readable, attached layout tags, it is possible to read the information in the <HEAD> section of the document to obtain an author's name or institution. Unfortunately the <HEAD> tag in HTML is now optional, reducing the value of this last option for the future.
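Reading the source for a title and an author's email address can also be done with a small program. The sketch below uses Python's standard html.parser module; the class name HeadReader and the sample page source (including its address at example.edu) are invented for illustration and are not taken from any real site.

```python
from html.parser import HTMLParser


class HeadReader(HTMLParser):
    """Collect the title and any mailto: links from an HTML source."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.mailto = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lower case.
        if tag == "title":
            self.in_title = True
        elif tag in ("a", "link"):
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("mailto:"):
                self.mailto.append(href[len("mailto:"):])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Hypothetical page source, invented for illustration only.
source = (
    "<HTML><HEAD><TITLE>Humanities HUB</TITLE>\n"
    '<LINK REV="made" HREF="mailto:author@example.edu">\n'
    "</HEAD><BODY>\n"
    '<A HREF="mailto:author@example.edu">Mail the author</A>\n'
    "</BODY></HTML>"
)

reader = HeadReader()
reader.feed(source)
print(reader.title.strip())
print(reader.mailto)
```

Whether a given page carries either kind of mailto link is entirely up to its author, so an empty result is common and simply means the remaining options in the text must be tried.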

Consideration must be given to the fact that there are at least two types of HTML documents on the Internet. The most common type are collections. These sites have no content in the sense of readable or research material, as they are simply a collection of pointers to other sites compiled by the document's author. The value of the better collections is their systematic grouping of sites for perusal. These collections are the most likely to be anonymous but the least likely to be referenced academically. The less common content-provider sites contain electronic journals or on-line data. As a result there is a high probability that an author or an institution will be referred to in the text.

Identifying the publisher of HTML material may become increasingly difficult as a result of the commercialisation of the Internet. This process has allowed research groups and units to maintain a distinct server, often with a commercial (.com) suffix, while still maintaining an association with their original institution. The first preference when referencing a publisher would be to include the name of the institution most acknowledged for its assistance in the HTML document.

An educational (.edu) site is, however, more likely to be the source of the HTML file. The site referred to by the following URL contains institutional information:

www.gu.edu.au

This is, however, rather cryptic. There are Internet resources available that allow this site name to be searched for, with a real world name being the returned result. This form of retrieval is dependent on the site being registered with one of these indexes. This additional research requires detailed knowledge of the Internet's available resources - a condition that should, perhaps, preface any situation in which Internet resources will be referenced.

In attempting to make references to Internet material retrievable at a later date, the difference between citing hub.home.html and Humanities HUB (the title of the above example) is significant. The full URL allows a re-connection if it was correctly noted in the bibliography. The title of the resource can be used in a network search using one of the better search engines, such as Lycos, Yahoo, AltaVista or Webcrawler. Attempting this form of retrieval assumes that the site is registered with one of these indexes. A potential advantage in using the document's title rather than its URL location is its accessibility after a file is moved. Although most sites place a pointer to a file's new location when it is moved, these are usually only maintained for a short period. The moved file could, however, be retrieved by a network search using the title as the keyword(s). The problem of providing ongoing retrievability can be hedged with the inclusion of both sets of details in a reference.

Alteration of files by the author could be likened to creation of a new edition at the expense of an earlier one. This situation would hopefully be relatively rare for academically orientated material. However referencing on-line daily newspapers cannot avoid this loss of referred material. In these situations the researchers would be advised to maintain a personal archive of Internet material. Students submitting coursework, who choose to reference from the Internet, should maintain an archive of the material they refer to.

Internet resources should be used judiciously. Referencing to an on-line article with its additional complexities should not be attempted if a printed version can be obtained. This helps reduce the complexity that occurs in bibliographies that refer to non-traditional mediums. Encouraging the referencing of paper versions does not devalue the importance of their electronic versions. On-line journals can be considered as working papers that allow the author to rapidly identify articles of interest and relevance. The scarcer paper copy could then be obtained and used as a referencing copy, thus allowing the researcher to provide full referencing and page numbering details. However, this will become less viable with strained library resources and an increasing number of journals becoming solely available on-line.

The best methods for referencing HTML files can be derived from the referencing styles used for monographs. The author remains as the basis of a bibliography's order. The year of the original uploading could be included if it is acknowledged in the document itself. The title of the reference equates with the title that appears at the top of the document's window. The publisher details would be the name of the institution where the file was maintained - if it is ascertainable from the document. The place of publication would be entirely replaced by the document's URL. Providing the full URL provides a level of redundancy in the entry that allows the cross-checking of provided references. This method also enables references to be made to those sites with no apparent publisher. Using an article from the C-Theory site as an example, the complete bibliography entry would be:

The Murder Trial: Genre or Event-Scene?

It would be associated with in-line references that appear as (Brenner, 1995). This example is unusual as it relates to an ejournal that does not currently maintain volume or numbering details. For that reason we suggest treating the work as a separate entity with the title in italics and the 'journal title' as added information. If C-Theory changed this policy the new reference would place the article in quotation marks, italicise the title of the ejournal, and include volume and number details.

A minimal bibliographic entry would appear as:

The Murder Trial: Genre or Event-Scene?

The corresponding in-line references would be (The Murder Trial: Genre or Event-Scene? ~1995) which still provides some information for other researchers. This type of reference would occur when only the title and the URL were available to the researcher.

As a slight deviation from conventional referencing the final full stop in the bibliographic entry should be omitted to avoid URL addressing confusion. To avoid additional confusion the URL should only be split with a space or soft return after the forward slashes and not between words of the location.
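The assembly rules above can be sketched in a few lines of Python. The field order (author and year, title, publisher, URL) follows the description in the text; the comma punctuation is an assumption modelled on the conventional entries in this article's own reference list, and the function names, author, title and URL below are hypothetical placeholders rather than real bibliographic data.

```python
import re


def format_entry(title, url, author=None, year=None, publisher=None):
    """Build a monograph-style entry that ends with the URL and no full stop."""
    # With no author the title leads the entry, as in the minimal form.
    head = author if author else title
    if year:
        head = f"{head} {year}"
    fields = [head]
    if author:
        fields.append(title)
    if publisher:
        fields.append(publisher)
    fields.append(url)  # the URL replaces the place of publication
    return ", ".join(fields)


def split_url(url, width=40):
    """Break a long URL into lines, splitting only after forward slashes."""
    lines, current = [], ""
    for piece in re.findall(r"[^/]*/|[^/]+", url):
        if current and len(current) + len(piece) > width:
            lines.append(current)
            current = ""
        current += piece
    if current:
        lines.append(current)
    return lines


# Hypothetical values, for illustration only.
url = "http://www.example.edu/journal/article.html"
full = format_entry("A Sample Article", url, author="Author, Ann",
                    year="1995", publisher="Example University")
minimal = format_entry("A Sample Article", url, year="~1995")
print(full)
print(minimal)
print(split_url(url, width=20))
```

Splitting on slashes alone can still leave a long final segment on one line; the sketch accepts that rather than break inside a name.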

This system of referencing does not recognise those documents with an ejournal affiliation. As files can be reflected (duplicated) at different sites with different URLs, recognising the institution as the publisher may not always be useful. However where periodical-style information is evident or available through Hyper-text links it should be acknowledged conventionally. As an example:

Critical Inquiry

The institution/publisher acknowledgment is replaced by an ejournal affiliation. For those ejournals that maintain conventional volume numbering, this information could be included after the journal's name. The URL should still be required regardless of the amount of other bibliographic details available. This insistence upon a URL provides a recognisable label that replaces the [computer file] tag and provides a consistent format for researchers to use.

There are a number of possibilities for dealing with updated sites. We suggest avoiding the no date (n.d.) or circa (c.) tags used for hardcopy items: although there may be uncertainty as to their exact date, hardcopy texts are not subject to change as may be the case with electronic documents. There are a number of mathematical symbols that could be used to indicate the year (or more precise date) the file was accessed. These are the 'less than', 'less than or equal to', and the 'approximately equal to' symbols, which would be used thus: <1995, ≤1995, or ≈1995. However we suggest the most appropriate symbol for undated Internet sites and particularly WWW documents is the tilde (~). This symbol is already in common use as part of many URLs to indicate the directory it immediately precedes is a personal directory and could be extended to recognise that, used against a date, the tilde recognises the referenced material may be changeable.
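As a small illustration of the tilde convention, the following Python sketch prefixes the year of access with a tilde whenever the document itself carries no date. The function names are hypothetical, and the in-line format simply follows the title-based example given earlier in this article.

```python
def cited_year(year, dated_in_document):
    """Return the year as stated, or ~year when only the access year is known."""
    return str(year) if dated_in_document else f"~{year}"


def inline_reference(name, year, dated_in_document=False):
    """Build an in-line reference from an author (or title) and a year."""
    return f"({name} {cited_year(year, dated_in_document)})"


undated = inline_reference("The Murder Trial: Genre or Event-Scene?", 1995)
print(undated)
```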

We hope that the requirement for these conventions to deal with undated sites will be short-lived. Authors of academic documents on the Internet should become more aware of the need for researchers to adequately reference their material and should begin providing full bibliographic data as part of the document's header. The format of referencing applied to monographs allows these welcome additions to be included without the need for a new approach to referencing HTML documents. Similarly, extensions to the basic referencing details of a monograph such as different editions or different publisher (WWW locations) can be readily accommodated.

Page numbers and pages themselves are non-existent in HTML files. Documents contained within a single file are often referred to in Internet jargon as a page. Longer documents are generally broken down by their chapters into a series of individual HTML files or pages. This current and established Internet practice allows some level of pointers to be developed with the use of a monograph-derived referencing system. As a fictitious example:

My life as an AOLer - Introduction

My life as an AOLer - Hackerhood

In text this could be referenced as (Collins ~1995) and (Collins ~1995a).

Although this is messy as it replaces page numbers with references to multiple, same-year publications, no simple solution appears to exist. A more sophisticated approach, requiring a greater awareness of how HTML files are constructed, would be to use the filename in place of the page number. For example (Collins ~1995: aoler1.html) and (Collins ~1995: aoler2.html) would refer to the same bibliographic references as above. There are additional sophistications that could be developed with a greater awareness of how HTML files are coded. The <a name> tags in HTML subdivide a single document into many smaller pieces. The purpose of these tags is to allow a user to click on a word or symbol at the top of the document and be immediately moved to where the <a name> tag is situated. A reference to a sentence preceded by the <a name=#modem> tag in the introduction document of the reference above could be referenced in this manner:
(Collins ~1995: #modem) or (Collins ~1995: aoler1.html#modem)

This level of detail approximates to individual page references in a conventional book. The major problem with this method is the complexity in ascertaining if these codes exist within the document. Not all HTML documents use the <a name> tag.
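Checking whether a document contains named anchors need not be done by eye. The sketch below, again using Python's standard html.parser module, lists the anchor names in a page and builds the corresponding in-line reference. The class and function names are hypothetical, and the page fragment is invented to match the fictitious Collins example.

```python
from html.parser import HTMLParser


class AnchorLister(HTMLParser):
    """Collect the values of name attributes on <a> tags."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            name = dict(attrs).get("name")
            if name:
                # Tolerate a leading # written inside the attribute value.
                self.names.append(name.lstrip("#"))


def locator_reference(author, year, filename, anchor=None):
    """Use the file name, and optionally an anchor, in place of a page number."""
    locator = f"{filename}#{anchor}" if anchor else filename
    return f"({author} {year}: {locator})"


# Invented fragment standing in for the fictitious introduction file.
page = '<H1>Introduction</H1><A NAME="modem"></A><P>My first modem</P>'

lister = AnchorLister()
lister.feed(page)
print(lister.names)
print(locator_reference("Collins", "~1995", "aoler1.html", lister.names[0]))
```

An empty list of names means the document offers no anchors, in which case the file name alone is the finest locator available.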

A solution to the lack of page numbers for Internet-based resources might be to establish a style sheet for printing HTML documents. A preset top, bottom, left and right margin coupled with a defined font and point size presented as a publicly available style guide would allow researchers to match the printer setup for the printers on their WWW browsers to these parameters - always with the understanding that the source code has not altered in any way. The resultant pages could be referenced as (Collins ~1995:[31]) with the square brackets acknowledging the variable nature of the numbering. Other solutions that could be adopted include numbering paragraphs; counting fixed numbers of lines, say 25, as a single page; or having authors include a number as an integral part of the document between each 'page'.
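The option of counting a fixed number of lines as a single page reduces to simple arithmetic. A minimal Python sketch follows, assuming 25 lines per nominal page as suggested above and line numbers counted from 1; the function names are hypothetical.

```python
def pseudo_page(line_number, lines_per_page=25):
    """Return the nominal page that holds a line (lines are numbered from 1)."""
    if line_number < 1:
        raise ValueError("line numbers start at 1")
    return (line_number - 1) // lines_per_page + 1


def page_reference(author, year, line_number, lines_per_page=25):
    """Square brackets acknowledge that the page number is only nominal."""
    page = pseudo_page(line_number, lines_per_page)
    return f"({author} {year}:[{page}])"


print(pseudo_page(25), pseudo_page(26))
print(page_reference("Collins", "~1995", 760))
```

Two researchers obtain the same bracketed number only if they count lines in the same unaltered source text, which is the same caveat the style-sheet approach carries.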

Another problem in using the Internet for academic referencing is that researchers must be aware of the location of the document they are viewing at any given time. There is a growing tendency for WWW browsers to have a default configuration in which the Show Location command is turned off. While it is just a matter of clicking it on, a degree of awareness and training is necessary.

There are a number of requirements for consistent referencing of HTML files. The adherence to existing conventions is important. The use of a style guide for printed HTML files provides a means for referencing larger documents. Authors of HTML files should be encouraged to include bibliographic material within their files, with a minimum request that this material is contained within comment tags or the <HEAD>. This information could be coded as an actual bibliographic reference for ease of use and access. As a non-displayed comment the additions would simply be of the form:

A similar line could be included in the displayed document. This would allow the printed document to be easily re-accessed on-line. The inclusion of a document's URL on the printed hardcopy of an HTML file is not something that is currently done automatically by the printing command of WWW browsers; however, this is likely to become an available option. Among the proposed features of HTML+, the next version of the HTML language, is the tag <PRINTOUT>, which, if it were supported by WWW browsers, would provide exactly this solution.

Gopher

The referencing scheme outlined for HTML files is equally applicable to the other Internet resources that can be accessed via a graphical WWW browser. Gopher services, as the text-based predecessors to the WWW, represent substantial investments of time that are not readily transferred to WWW-based HTML files. Fortunately this is not an obstacle to accessing the large number of resources available through GopherSpace. Many universities still maintain gopher servers that use software other than WWW browsers. This software hides the server, directory and filename information from the user. It can be retrieved but may require a level of skill beyond that necessary for day-to-day use of the Internet. Researchers and students are strongly advised to use a consistent interface to the Internet for both ease of use and regular referencing methods.

The URL used to access gopher servers via a WWW browser is similar to a Hyper-text URL but is prefaced by the gopher:// tag. For example, the Marx and Engels archive can be accessed through the URL: gopher://csf.colorada.edu/11/psn/Marx

These URLs are usually less self-explanatory and longer than those used by the WWW but remain as sensitive to misspelling, upper and lower case conflicts, and misplaced punctuation. However the format used to reference HTML files is equally applicable. Essentially the gopher URL should be used instead of the place of publication, with the remainder of the bibliographic entry treated as a reference to a monograph with as much detail being provided as possible.

The relative age of GopherSpace does, however, present problems in accessing full bibliographic data. Gopher sites operate at a more institutional level than the WWW. While WWW pages have readily identifiable individual authors within the overarching framework of the institutional server, the gopher site and the provider institution have a more closely integrated relationship. The WWW could be said to encourage page Authors while GopherSpace harbours anonymous programmers.

A solution could be to ascribe authorship of apparently anonymous gopher sites to the smallest identifiable institutional unit - often a computer science department. Thus a reference for a gopher site may appear as:

Internet User Glossary

The utility of this author ascription is debatable. But with the decline in gopher services, there should be less need to provide references for these materials. The majority of new users on the Internet prefer the more graphical WWW user interface. This filtering effect ensures that some degree of experience and skill is developed in referencing and accessing Internet resources before the student or researcher finds it necessary to reference gopher sites. Time may also reduce the need for gopher references as the information contained on gopher servers is transferred to WWW sites.

The example of the World Factbook also raises questions in relation to the distortion between actual academic authorship and the digitising and preparation of material for electronic distribution. Although there are skills involved in both processes, academic works require the acknowledgement of the academic author of the work. Acknowledgment of an individual responsible for the digitising could be included after the title of the work. This echoes the style for acknowledging editors and translators where the original author remains of paramount importance. There would appear, however, to be little utility in acknowledging an institutional body in this role when it is recognised as the electronic publisher in the reference and usually implied as such in the URL given for the document.

File Transfer Protocol (FTP)

File Transfer Protocol (FTP) is used to download software or text from a site remote to the user. If the user is accessing FTP through a WWW browser the text is displayed 'raw' with little or no formatting. These files can normally be attributed to individual people with all the appropriate referencing details. FTP is the earliest type of Internet publishing and, when it was (and occasionally still is) used, the material was a digitised version of conventionally published material. If the material cannot be accessed in the printed edition the URL is, once again, recognisable, eg
ftp://nysernet.org/pub/resources/guides/bigdummy.txt

The resultant reference is also recognisable:

EFF's Guide to the Internet v2.3

Those who access FTP via software other than a WWW browser can easily convert their reference to a standard URL by adding the tag ftp:// to the front of the server, directory and filename details that are needed to access the material.
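The conversion described here is mechanical enough to sketch in Python. The helper name ftp_url is hypothetical; the server, directory and file name are those of the example URL given above.

```python
from urllib.parse import urlsplit


def ftp_url(server, directory, filename):
    """Join server, directory and file name into a standard ftp:// URL."""
    # Strip stray slashes so the parts can be joined with exactly one each.
    path = "/".join(part.strip("/") for part in (directory, filename) if part)
    return f"ftp://{server.strip('/')}/{path}"


url = ftp_url("nysernet.org", "/pub/resources/guides", "bigdummy.txt")
print(url)

# Reading the result back confirms the parts survive the round trip.
print(urlsplit(url).scheme, urlsplit(url).hostname, urlsplit(url).path)
```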

FTP documents again reinforce the need for a printout style sheet that specifies a series of standard margins, fonts and point sizes while acknowledging that no formatting changes are conducted on the document. This would allow a square bracketed page number to give a general page guide for in-line referencing.

Usenet News

Usenet News can be accessed in a number of different ways. Currently there seems to be no clear preference for one particular software reader. Some graphical WWW browsers can access news. The hierarchical nature of the news systems, and the distributed nature of the material, prohibits a conventional URL system from being used. News is physically held on each subscribed server. This results in totally meaningless URLs for later reference. A URL that points to the Griffith University server is not useable by someone accessing news from another university.

This different distribution method lends itself towards a more periodical-orientated style of referencing. There is usually some form of authorship acknowledged, although newsgroups do not exclude the possibility of pseudonyms being used.

News items usually have a header that approximates a title. The newsgroup, itself, takes the role of the journal. The full date of the original posting is available and can be used in the way that volume and number are used in conventional journals. This provides a potentially useful reference. As an example:

Graham, Adrian 1995, 'Fishing in Mauritius', alt.fishing, 29 July.
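An entry in this style can be derived from a posting's header. The sketch below uses Python's standard email package to read the From, Subject, Newsgroups and Date lines; the sample posting, including its address and time of day, is invented for illustration, and the rule that a two-word display name is given name followed by family name is an assumption that will not hold for every poster.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def news_reference(article_text):
    """Build a periodical-style reference from a news posting's header."""
    msg = message_from_string(article_text)
    display_name, address = parseaddr(msg["From"])
    name = display_name or address
    parts = name.split()
    # Assumption: "Given Family" is inverted; anything else is left as posted.
    if len(parts) == 2:
        name = f"{parts[1]}, {parts[0]}"
    posted = parsedate_to_datetime(msg["Date"])
    group = msg["Newsgroups"].split(",")[0].strip()
    subject = msg["Subject"]
    month = MONTHS[posted.month - 1]
    return f"{name} {posted.year}, '{subject}', {group}, {posted.day} {month}."


# Hypothetical posting, invented for illustration only.
article = (
    "From: Adrian Graham <agraham@example.edu>\n"
    "Newsgroups: alt.fishing\n"
    "Subject: Fishing in Mauritius\n"
    "Date: Sat, 29 Jul 1995 10:15:00 +1000\n"
    "\n"
    "Body of the posting.\n"
)

reference = news_reference(article)
print(reference)
```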

Unless research was being conducted specifically on computer-mediated communication, the stranger postings with unusual names and titles - such as the following - would simply not appear as references:

AOhell 1995, 'Hi all...', alt.slack, 20 Feb.

Another problem with references to Usenet News is the temporary nature of the postings. Not all newsgroup postings are archived, and the references of today become aether tomorrow. Although a number of the major newsgroups are archived, finding them and determining specific references may require far more effort than the final piece of information is really worth. These considerations are really only a major concern to a researcher wishing to consult items in an existing bibliography. To access this material it might be easier to contact the author of the article or research directly. This emphasises the importance of students, authors and researchers maintaining a private archive if they have to use these transient materials.

Journals distributed by listservers

Listservers provide the closest the Internet has to a hand-delivered journal. The listserver of a specific journal posts a full copy of the journal to each subscriber's email address every time a new issue is completed. Fortunately, the header of an email contains most of the referencing information needed for constructing a journal-like bibliographic entry. The order and type of information will vary from listserver to listserver but generally the author of the specific article will be acknowledged as a proper name or sometimes as an email address. The title and journal name are covered by the subject and originator sections of the header. The year will always be the same year as the year of receipt at the email address. The date will be contained within the header - when this does not occur the month of receipt can be used in the reference. Many listservers issue material more than once a month, but the month used in combination with the author and title will provide a unique identifier. Some listservers and newsgroups are quite closely linked and knowing these combinations will assist the researcher in providing full bibliographic material.

Email

Email is the personal communication of the Internet. Referencing to email should be undertaken with the same judiciousness that is used with all personal communication. Personal communications are acknowledged only by in-line references and not in bibliographies. In-line references simply acknowledge the interlocutor and the date with an annotation. Email provides similar information. For example:

(Bloodaxe, Eric 1995, email, 24 July)

If the person's name is unclear, the section of the email address in front of the @ symbol could be used. In that case [email] would become:

(E.Bloodaxe 1995, email, 24 July)

However, some institutions and commercial service providers prefer to use a numeric system when allocating personal email identities. A minimal email communication conducted with someone connected to one of these sites verges upon incomprehensibility, eg:

(S450062 1995, email, 24 July)
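The fallback from a personal name to the part of the address in front of the @ symbol can be sketched as follows. The function name is hypothetical and the addresses at example.edu are invented; only the identifiers in front of the @ symbol come from the examples above.

```python
def email_reference(address, day_month, year, name=None):
    """Build an in-line email reference, using the local part if no name is known."""
    who = name if name else address.split("@", 1)[0]
    return f"({who} {year}, email, {day_month})"


# Hypothetical addresses, for illustration only.
named = email_reference("E.Bloodaxe@example.edu", "24 July", 1995,
                        name="Bloodaxe, Eric")
by_address = email_reference("E.Bloodaxe@example.edu", "24 July", 1995)
numeric = email_reference("S450062@example.edu", "24 July", 1995)
print(named)
print(by_address)
print(numeric)
```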

Conclusion

The methods we have outlined rely heavily on the phenomenon of the World Wide Web and its method of accessing Internet resources. This provides backward compatibility with earlier resources, as well as a standard on which to base future developments. Currently, papers and articles published on the Internet are not recognised as having a legitimate place in academic publishing resumes. Drawing documents on the Internet more closely in line with conventional material - from a referencing point of view - may assist in bringing about a change in policy. This more progressive position will be assisted by an increasingly computer-literate student body desirous of incorporating Internet material into their work. Encouraging this, through the provision of referencing guidelines and tolerance towards the use of these documents, will provide a richer base of material to draw upon than is sometimes available through conventional library resources. When funding for items such as paper periodicals is reduced, the ability to access and reference hypertext versions of journals and other materials becomes important. This academic advantage is increased by the nature of the medium, in that it provides an infinite number of copies in contrast to a single paper copy in a library. Refusal to recognise electronic publishing has the potential to lower the public profile of a university and impede its staff development.

References

Dow, Lesley 1995, Guide to Referencing, Faculty of Humanities, Griffith University, Brisbane.

Li, Xia & Nancy Crane 1993, Electronic Style - a guide to citing electronic information, Meckler, Westport CT.

Style Manual - for authors, editors and printers 1994, 5th edition, Australian Government Publishing Service, Canberra.