Example 4: Books from an On-Line Catalogue

Sampling from an on-line catalogue depends largely on the structure of the records and the search options available in the catalogue system.

The first question to be asked is: Does each record have a unique identifier and is the identifier a searchable field? The easiest case to deal with is an accession number type of identifier. In systems using accession numbers each catalogue record is serially numbered from some starting point such as 000001. New items are assigned the next available accession number. In these systems the only numbers that do not refer to actual item records are those which were assigned to items which have since been withdrawn.

For example, if the accession numbers in a system started at 00001 and have currently reached 11091, then the sample design could simply be:

	Category 1  Accession #  numbered from 1 to 11091
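
As a rough illustration of what such a design produces, the following Python sketch draws a simple random sample of accession numbers without replacement. It is not the sampling program itself. The sample size of 400, the seed, and the function name are assumptions made for the example, and the five-digit zero padding simply follows the 00001 form used above.

```python
import random

FIRST_ACCESSION = 1       # accession numbers started at 00001
LAST_ACCESSION = 11091    # highest number assigned so far
SAMPLE_SIZE = 400         # hypothetical sample size, not from the text


def draw_accession_sample(first, last, size, seed=None):
    """Draw `size` distinct accession numbers between first and last inclusive."""
    rng = random.Random(seed)
    # range() excludes its end point, so add 1 to include the last number.
    return sorted(rng.sample(range(first, last + 1), size))


sample = draw_accession_sample(FIRST_ACCESSION, LAST_ACCESSION, SAMPLE_SIZE, seed=1)

# Print the first few numbers, zero padded for typing into the catalogue.
for number in sample[:10]:
    print(f"{number:05d}")
```

Numbers that belonged to withdrawn items will return no record when searched; in a sketch like this they would simply be skipped.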

The resulting list of numbers can then be searched in the online catalogue. On catalogues that allow multiple searches, you can enter several accession numbers at a time.

It may be tempting to use the online catalogue to print the records as they are selected and to use the resulting output when searching for the items on the shelves. The disadvantage of this is that you will have a list that is not sorted in shelf order. You will have to decide whether working with an unsorted list, or cutting it up and sorting it manually, is too much trouble. The alternative is manually entering the author, title, and call number into a spreadsheet as shown in the appendix.

Some online catalogues allow you to save the results of searches to disk. In this case it is possible to enter these records directly into a spreadsheet without retyping. Unfortunately, while this direct entry can be done in principle, doing it may be more work than manual entry.

A variation on accession numbering is found in systems that use numbers of the form "92 -- 317", indicating the three hundred and seventeenth book purchased in 1992. To design a sample for this case you need to know the maximum number of books ever ordered in a single year. If you must estimate this value, be sure your estimate is higher than the actual maximum. Otherwise some books acquired late in busy years will not have a chance of being selected. A design for this could be:

	Category 1  Year  numbered from 40 to 92
	Category 2  Book  numbered from 1 to 3800
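
A two-category design of this kind can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the sample size of 20, and the seed are invented for the example, and the year and book ranges are taken from the design above.

```python
import random


def draw_year_book_sample(first_year, last_year, max_books, size, seed=None):
    """Draw `size` distinct (year, book) pairs, each pair equally likely."""
    rng = random.Random(seed)
    picks = set()
    while len(picks) < size:
        year = rng.randint(first_year, last_year)   # both ends inclusive
        book = rng.randint(1, max_books)            # both ends inclusive
        picks.add((year, book))                     # duplicates are ignored
    return sorted(picks)


# Hypothetical sample of 20 candidate numbers for the design above.
candidates = draw_year_book_sample(40, 92, 3800, size=20, seed=1)

for year, book in candidates:
    print(f"{year} -- {book}")
```

Many of the generated numbers will match no record, because most years have fewer books than the maximum. Dropping those and keeping the rest leaves every real book with the same chance of selection, provided the maximum is not set too low.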

One serious problem with accession numbers is that only the records of books acquired since the installation of the automated system may have accession numbers. In this case accession numbers are not an appropriate selection key.

In the example above we said, "Otherwise some books acquired late in busy years will not have a chance of being selected." If our sample design had 3,800 as the largest number of books acquired in a year then, in a year that 4,000 books were acquired, the last 200 books would not have a chance to be sampled. In this case we can see a distinct possibility for biasing the sample.
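
The arithmetic can be checked with a few lines of Python. The function name is hypothetical and the figures are the ones quoted above.

```python
def unreachable_books(books_in_year, design_maximum):
    """Count the books in one year whose numbers exceed the design maximum."""
    return max(0, books_in_year - design_maximum)


# A year with 4,000 acquisitions under a design that stops at 3,800.
missed = unreachable_books(4000, 3800)
share = missed / 4000

print(missed, "books cannot be selected")
print(f"{share:.1%} of that year's acquisitions")
```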

Most acquisition departments keep "wish lists" of books which might augment the collection but which are not critical to the mission of the library. At the end of the year remaining funds are used to purchase from this marginal list. Similarly, and perhaps more the case in hard economic times, libraries keep "must have" lists of things to buy as soon as new funding becomes available. In both cases, books with low annual accession numbers may have quite different selection criteria from those purchased late in the year. Failure to sample from all accessions could bias the sample toward a particular set of selection criteria.

The searching possibilities of an online catalogue offer the chance for other non-random sampling designs. Suppose that a researcher were to use the capability of the online system to browse from a particular starting point. He might use the sampling program to generate a sample in which each entry was six letters chosen randomly. He would then enter the six letters, which in most cases would not be a name or a word, and then browse forward to the first author name entry. The problem with this design is that authors with similarly spelt last names would have less of a chance of selection than those with uncommon letter combinations. For example, suppose that the authors in the collection, listed alphabetically, were:


The letter differences between the first two names, Joar and Jonah, are greater than the differences between the middle three names. Similarly, the last entry, Jones, is more distant again. If random letters were selected, the third and fourth names would have less chance of being picked. This bias might be important because the spelling of authors' names is often indicative of their ethnic heritage.
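
The size of this effect can be seen in a small simulation. The Python sketch below is an illustration only: the author list is hypothetical and is not the list discussed above, and the number of draws, the seed, and the choice to wrap around to the first entry when the random letters fall after the last name are all assumptions.

```python
import bisect
import random
import string
from collections import Counter

# Hypothetical author list in alphabetical order. These names are
# illustrative only and are not the list from the example above.
AUTHORS = ["Adams", "Smith", "Smyth", "Smythe", "Young"]


def browse_forward(keys, probe):
    """Index of the first key at or after probe, wrapping to the start."""
    index = bisect.bisect_left(keys, probe)
    return index % len(keys)


def simulate(authors, draws, seed=None):
    """Count how often each author is reached from random six-letter probes."""
    rng = random.Random(seed)
    keys = [name.lower() for name in authors]   # compare in lower case
    counts = Counter({name: 0 for name in authors})
    for _ in range(draws):
        probe = "".join(rng.choice(string.ascii_lowercase) for _ in range(6))
        counts[authors[browse_forward(keys, probe)]] += 1
    return counts


counts = simulate(AUTHORS, draws=50000, seed=1)

for name in AUTHORS:
    print(f"{name:8s} {counts[name]}")
```

Each name is reached in proportion to the alphabetical gap that precedes it, so a name that closely follows a similar spelling is chosen far less often than one that follows a wide gap.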


this page is at http://testbed.cis.drexel.edu/sample/example4.html