Using Web Server Logs to Improve Site Design


M. Carl Drott

Associate Professor

College of Information Science and Technology

Drexel University

32nd and Chestnut St., Philadelphia PA, 19081

(215) 895-2487



Many web page designers may be unaware that web servers record transaction information each time they send a file to a browser. Others may know that a server log exists but see it only as a source of general statistics, such as site use distributed over time or counts of how many times each page was served. This paper describes how server logs can give designers a much more detailed view of how users access a site, and how those usage patterns can be employed to improve the site's design and functionality. Web log data has been used to analyze and redesign a wide range of web-based material, including online tutorials, databases, fact sheets, and reference material.




To begin at the beginning, let us consider the interaction between a web browser and a web server. The user who wants to see your web page gives the page address (the URL) to their browser. The browser software sends a message to your web server program (on whatever machine serves your web access) asking for your page. For the time being we can ignore the intermediate stages of going through phone lines to an Internet Service Provider (ISP) or through a local area network to a machine that controls a firewall. All of these intermediate steps can add to the difficulty of understanding the transaction, but for the moment, consider them invisible.


Your web server program then sends the page back to the user's browser. There is a minor point of nomenclature concerning what the web server does: the web server serves files. Every page that is viewed on a browser consists of one or more files. For example, a web page that includes graphics (or other non-text resources) will consist of one html file plus one graphics file for each separate picture. From the user's point of view, they are requesting one page. From the server's point of view, it is sending a sequence of files.


After sending each file, the web server program records information about this transaction. At least it can record the information if the person who set up the web server software turned on the logging function. The information that is recorded about the transaction is pretty simple -- in fact you can probably guess most of the information just working from common sense. The server records the time and date of the transaction. It records the name of the file that was sent and how big that file was. It records the internet address to which the file was sent. If the user got to your page by clicking a link on some other page, the server records the address of the page with that link. It also records some details about how the file was sent and any errors that may have occurred as well as information about the browser that the user is using.


This is pretty much all there is to a web log. Of course, if your server is doing more complicated things like running CGI scripts, the log may include information about what script was run and if it worked. But there isn't much more information available in a transaction. In the simplest case, the log information for each file sent is recorded as one line in a very ordinary text file, one that can be read by any word processor or spreadsheet that can open large files.


2.1 Server Log Files

But there is a fly in the ointment. Web logs are pretty much written by computer people for computer people. As a result, we have things such as some time and date values being recorded as seconds (often carrying a microsecond fraction) elapsed since January 1, 1970. This gives us reports of such delightful time-date values as 793224627.766481. To add to the complexity, every web server records its information in a slightly different format, so it is nearly impossible to make any general statements about what a web log looks like. Just in case this isn't enough, the person running the web server can change settings which control what information is actually recorded and what format it is written in. You may have figured out by now that you will have to spend a good deal of time with your local web guru if you are going to analyze web logs.
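Converting such a raw value by hand is unnecessary; any scripting language can do it. Here is a minimal Python sketch, assuming the value is seconds since January 1, 1970 UTC, as in the example above:

```python
from datetime import datetime, timezone

# Convert a raw epoch timestamp (seconds since Jan 1, 1970 UTC,
# with a fractional microsecond part) into a readable date and time.
raw = 793224627.766481
readable = datetime.fromtimestamp(raw, tz=timezone.utc)
print(readable.isoformat())
```

A server that logs local rather than UTC time would need the appropriate time zone substituted.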


If you do go and ask to analyze log files, you will probably be told that your web people already have a program that does this and they will be happy to give you a copy of the reports. Don't get your hopes up. Most web log report generators are written for people interested in general statistics, rather than for people who want to track the use of intellectual content. There are a small number of programs (some quite expensive) which can run reports that provide information of use in tracking the effects of content, but they are far outnumbered by programs (some quite expensive) that produce elegant graphs showing the average number of bytes sent per minute throughout an average day.


For our analysis purposes we will generally begin with the raw log files rather than the output of statistics-generating programs. Unfortunately, for popular sites the log files can be very large. For example, the web server running on my office Macintosh, presenting only my own material, serves about 20,000 files per month and produces just under 2 megabytes of log file entries. A modestly busy corporate site may do more than ten times that much in a single day. Later in this paper I'll describe cutting these huge files down to a manageable size.


2.2 A Simple Log Entry

So far we have been swimming through the warm oatmeal of generality; let's get specific.

Here is a single entry from my web server log. I have selected a case in which the web server is recording only part of the available information.


03/05/98 <tab> 00:53:39 <tab> OK <tab> <tab> /MinuteSeven.html <tab> <tab> <tab>


Note: I have shown the tab character as <tab>. In the original log, the character is actually a tab. For other web servers the separating character (delimiter) may be a blank space; blank spaces that are part of the information will then be shown as percent signs (%). In the original, the entire entry is a single line, but it has been word-wrapped to fit the format of this paper.
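Splitting such a tab-delimited entry into its parts is simple in any scripting language. The Python sketch below assumes the field order of my server (date, time, result, host, URL, then some unused fields); other servers will differ:

```python
# Split one tab-delimited log entry into its fields.
# Two tabs in a row produce an empty string, i.e. a field
# for which no value was reported.
entry = "03/05/98\t00:53:39\tOK\t\t/MinuteSeven.html\t\t\t"
fields = entry.split("\t")
date, time_, result, host, url = fields[:5]
print(url)   # the name of the file that was sent
```

Note how the empty host field comes through as an empty string rather than being silently dropped.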


Let's take this one part at a time:


03/05/98 (the date)

The date on which this transaction took place. (As noted above, time and date are sometimes recorded as a single field.)

00:53:39 (the time)

The time of the transaction. Both time and date in this particular file are the local time and date at my web server.

OK (this field is called "result")

The error status of the transaction. This is not always useful.

(This field is called "host")

The web address of the requesting browser. In this case we see a dummy address set up by America Online.

/MinuteSeven.html (this field is called "URL")

This is the name of the file that was sent. Since this file is in the main folder for the web server, there is no folder name.

<tab> <tab>

Two tabs in a row show that there is a log file field for which no value was reported.

(This field is called "referrer")

The URL of the page that provided the link to the file sent. In this case it is also one of my own pages.


You may be wondering how to tell which data element is which. The answer varies, but in general there are three possibilities: 1) Read the web server documentation and ask the person who set up the server what settings were used. 2) Look through the log file; some servers list the fields being recorded at the beginning, and even better, some list the field names as part of every record. 3) Study the log file and figure it out -- this gets easier as you spend more time with log files.


In point 2 above I said that some servers write the list of fields in the log file. For example, my server starts the log file with the line:


If you look back at the table above, you can see that these names match the fields as they are presented.


2.3 A More Complex Log Entry

Let's look at a more complex -- and more interesting -- section of the log. Again, to fit the page, lines have been wrapped, and the entries are numbered for clarity.


1) 03/04/98 <tab> 21:09:27 <tab> OK <tab> <tab> /MinuteOne.html <tab> <tab> <tab>

2) 03/04/98 <tab> 21:09:27 <tab> OK <tab> <tab> /nextlessonbutton.GIF <tab> <tab> <tab>

3) 03/04/98 <tab> 21:09:27 <tab> OK <tab> <tab> /8minheader.GIF <tab> <tab> <tab>

4) 03/04/98 <tab> 21:09:40 <tab> OK <tab> <tab> /MinuteTwo.html <tab> <tab> <tab>

5) 03/04/98 <tab> 21:09:41 <tab> OK <tab> <tab> /prevlessonbutton.GIF <tab> <tab> <tab>


What do you see here? As you look, remember that each numbered paragraph represents one file sent to the user.







2.4 Discussion of the More Complex Log Entry

The first file served (see line 1) was MinuteOne.html. This page has two graphics on it and so the server sent nextlessonbutton.GIF (line 2) and 8minheader.GIF (line 3) in addition to the html page. The referrer part of lines 2 and 3 show that it is actually the MinuteOne.html page that was asking for the graphics. Next (line 4) the server sent MinuteTwo.html and (line 5) the graphic prevlessonbutton.GIF which was requested by MinuteTwo.html.

At this point we can use our specific knowledge of how these pages are constructed to make some additional observations. All three of the graphics are referenced in the html pages using the image tag (<IMG SRC="8minheader.GIF">). The user's browser requests these files automatically unless the user has set the browser not to load graphics. On the other hand, the command to load MinuteTwo.html is part of an anchor tag (<A HREF="MinuteTwo.html"><IMG SRC="nextlessonbutton.GIF" ALT="Next Lesson"></A>) and is issued only when the user pushes the button nextlessonbutton.GIF. If we look at the time information, we see that MinuteTwo.html was requested about 13 seconds after MinuteOne.html. Again, based on specific knowledge of the page, we recognize that the user did not have time even to scan the content of MinuteOne, let alone read it. Although not included in the excerpt above, the full log tells us that MinuteThree was requested 12 seconds later and the request for MinuteFour followed in 17 seconds. The full log also shows that in the next hour no more files were requested by this computer. This suggests that the user was not particularly interested in my pages, but alternate explanations are discussed in the section on time analysis below.
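Gaps like the 13 seconds between MinuteOne and MinuteTwo can be computed mechanically once the timestamps are parsed. A minimal Python sketch, using the date and time format of the sample entries above:

```python
from datetime import datetime

# Parse the timestamps of two html requests from the log excerpt
# and compute the gap between them. The format string (MM/DD/YY
# HH:MM:SS) matches this server; adjust it for yours.
t1 = datetime.strptime("03/04/98 21:09:27", "%m/%d/%y %H:%M:%S")
t2 = datetime.strptime("03/04/98 21:09:40", "%m/%d/%y %H:%M:%S")
print((t2 - t1).total_seconds())  # 13.0
```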

I mentioned above that the referrer field in line 1 gives us the exact search that was done on Infoseek. A full explanation of how the search is encoded is beyond the scope of this paper, but there is one way to use this information easily: simply copy the entire referrer URL, paste it into the location window near the top of your web browser, and press the return key. In this particular case I repeated the search and found that Infoseek retrieved 5,672,802 hits, of which MinuteOne was number three.


2.5 Other Log Fields

The web log can also be set to tell you the name and version of the user's browser. Unfortunately this is not as clear as you might expect, as the examples below show. The first entry is for Netscape on the Mac, the second for Netscape on a Sun. The third browser is Microsoft Internet Explorer, while the fourth and fifth are unknown to me. Even more confusing is that there are many versions of the major browsers, including at least a dozen versions of Microsoft Internet Explorer for which there appears to be no documentation.

Mozilla/4.03 (Macintosh; I; PPC)

Mozilla/4.03 [en] (X11; I; SunOS 5.6 sun4u; Nav) via NetCache version 3.2X4-

Mozilla/2.0 (compatible; MSIE 3.02; Update a; AK; Windows 95)


ProxyAnon/1.0 (UNIX; 64-bit)

Clearly those who want to make detailed use of browser information will need to do considerable research into the features of each version.


2.6 About the Host Address

The host address is the address to which the server sent the requested file. It is not always easy to tell exactly what this address means. In the ideal case, an address would correspond to a specific machine assigned to a single user. That way, every time the address appeared we would know that we were in contact with the same user. For example the address "" belongs to the Macintosh on the south wall of my office. If you see this address in the host field, you can be very certain that I am the user -- unless it happens to be one of my doctoral students who is using my office for their own work. In the paragraphs below I will describe some of the ways in which a host address may be misleading. As a rule of thumb, I assume that a host address which reappears after twenty minutes or more of inactivity should be counted as a different user.
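The twenty-minute rule of thumb is easy to apply mechanically once the log is parsed. The Python sketch below counts "visits" per host, starting a new visit whenever a host reappears after twenty minutes or more of inactivity; the host names and times are invented, and the (host, time) pairs stand in for parsed log entries:

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=20)

def count_visits(entries):
    """entries: (host, datetime) pairs in chronological order."""
    last_seen = {}   # host -> time of that host's previous request
    visits = 0
    for host, when in entries:
        # A host never seen before, or one returning after the
        # timeout, counts as a new visit (possibly a new user).
        if host not in last_seen or when - last_seen[host] >= TIMEOUT:
            visits += 1
        last_seen[host] = when
    return visits

log = [
    ("aol-dummy-1", datetime(1998, 3, 4, 21, 9, 27)),
    ("aol-dummy-1", datetime(1998, 3, 4, 21, 9, 40)),
    ("aol-dummy-1", datetime(1998, 3, 4, 22, 15, 0)),  # over an hour later
]
print(count_visits(log))  # 2
```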


As I noted above, many Internet Service Providers like AOL assign each user a dummy address. Such a process helps the ISP maximize resource use, but it prevents us from knowing anything about the user's behavior. If the same user contacts our site the next day, they are almost certain to have a different dummy address. Even on the same day, the ISP may retire the dummy address if it has not been used for some period of time. Here are some sample dummy addresses:


It is not always possible to be certain if an address is a dummy. In the examples above, I relied in part on my ability to recognize some of the ISPs involved. Here is an address that I am not so sure about. It may be a dummy, assigned just for a short period of time, or it may be the permanent address of a specific computer.


Even when we get the address of a real machine it is not possible to tell if that machine is used by a single user or if it is in a public spot, available to anyone permitted in the room. Here are two sample addresses:

The first address above is a specific computer available for public access. Most public machines are not so clearly identified. I have no idea about the second address. I would guess that it is an individual's personal machine, but that is only a guess.


And then there are addresses that give us very little information at all: the all-numeric addresses which we call IP addresses. As you look at IP addresses, remember that their parts run backwards compared to URLs. To clarify, look at the addresses of my web server:

URL: Machine = drott LAN = cis Owner =

IP: Machine = 64 LAN = 28 Owner = 144.118


The most specific part of the address is to the left in the URL but to the right in the IP. The numeric IP addresses usually are assigned to a specific machine, but some ISPs assign them in much the same way as dummies. The log records an IP address when a URL for the location cannot be found. This may be because there is no URL assigned to the IP, the machine that stores the URL for that IP is not available at the moment, or even because your web server has been set so that it saves time by not looking up URLs.



2.7 A Brief Essay on Time

As I said above, the time that a document is sent is recorded in the server log. Some servers even give the start time and the end time for sending files, useful for evaluating server and network performance but little else. In an example above we saw that, since the user was retrieving a new page every ten or fifteen seconds, they could not be reading the page contents at that time. But here our certainty stops.


From your own experience you probably know that browsers store recently visited pages, and that you can get to them by using the forward or back buttons. You also know that you can save a page to your local disk at any time. In general, the use of the Next or Back buttons, or the saving of a document results in no retrieval from the server and thus no entry in the server log file. As a user moves around from site to site, the recently visited pages are kept on the user's computer in memory set aside as a cache. A web browser actually keeps several caches but for our purposes this does not matter. When the user returns to a previously visited site the browser checks the cache to see if the page is still there. If the page is not in a cache then the browser connects to the web server for the page just as it did when the site was first visited. If the page is in the cache, then the browser's action depends on how other browser preferences are set. Some users set their browsers to always get a new copy, most set the browser to replace copies of a certain age, other users set their browsers always to use cache files if they are available. Whether a cached file is available depends on how much space the user has allocated for the cache and the size of the pages (and graphics) loaded. But -- most users don't think much about browser settings, so browser behavior with respect to reloading files from the server tends to be a matter of chance.


We may want to believe that a person who, fifteen minutes after downloading page one of a tutorial, downloads page two has spent that time in careful reading and contemplation. But they could just as well have spent the intervening time visiting half a dozen more exciting sites. The moral: web browsers never say goodbye to web servers.


For absolute completeness I must note that my undergraduate students have pointed out several ways to generate automatic reload commands from either the browser or the server. This can tell the server if one of the browser windows is still set to your site. This seems to me to be an abuse of bandwidth, but opinions may differ.


2.8 A Hit Without a User

The second of the following log entries appears to be a hit on page 4 of my tutorial, tempting us to wonder why someone started with the fourth page.

03/04/98 14:38:06 OK /robots.txt

03/04/98 14:38:07 OK /MinuteFour.html

But notice that in the first line the same "user" retrieved a file named "robots.txt." This is our clue that the retrieval was made by a web crawler out indexing the page, rather than by a human user. Any well-behaved web crawler is supposed to request the specific file "robots.txt" before it visits any page on the site. The robots.txt file allows the person running the server to declare certain files, groups of files, or even the entire site out of bounds and not to be indexed. Here the crawler was easy to recognize because the call to robots.txt came just before the request for our tutorial page. Some robots look at robots.txt before every page they access, but others do so only once per visit, so you may have to look back in the log to see if a particular "user" is a web crawler.
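That look-back can be automated: collect every host that ever asked for /robots.txt, then treat all of that host's hits as crawler traffic. A Python sketch with invented host names standing in for parsed log entries:

```python
# Entries are (host, url) pairs; a real log needs a parsing step first.
entries = [
    ("crawler.example", "/robots.txt"),
    ("crawler.example", "/MinuteFour.html"),
    ("reader.example", "/MinuteOne.html"),
]

# Any host that requested /robots.txt is presumed to be a crawler.
crawler_hosts = {host for host, url in entries if url == "/robots.txt"}

# Keep only hits from hosts never seen asking for /robots.txt.
human_hits = [(h, u) for h, u in entries if h not in crawler_hosts]
print(human_hits)
```

Note that this only catches well-behaved crawlers; one that never asks for robots.txt will slip through.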


I. How did the user get here?

1. Ways that show in the log

a. Clicking a link on any other page

b. Using a search engine

2. Ways that do not show in the log

a. Typing the URL

b. Opening a bookmark

II. What did the user do here?

1. Things that show in the log

a. Reload the page

2. Things that do not show in the log

a. Save all or part of the page

b. Move from place to place in the page

c. Move to a page they saw before

III. Where did the user go?

1. Places that show in the log

a. Click on a link to another page on this server

2. Places that do not show in the log

a. Click on a link to another page on some other server


b. Use Next, Back, or a bookmark

c. Quit

Figure 1: What Can be Known




Whenever we create a document we have some expectation of how the user will navigate through it. The most frequent expectation, so common that it usually goes unnoticed, is that of linear progress, either from the beginning or from a starting point chosen from the table of contents or the index. Many authors of web-based materials plan for a more complex navigational structure to take advantage of the added power of hypertext. Web log analysis gives us a way of partially testing our navigational design against actual user behavior. In the following sections I discuss some of the features of user behavior which we can examine.


3.1 Where Did the User Start?

One of the things that surprised me when I began doing web log analysis was that users frequently did not start on the introductory page of a document. (To simplify the discussion, I will use the term "document" for a set of web pages about a single topic, designed as a complete unit.)


The cause of beginning in the "middle" of a document which surprised me most was the effect of search engines. Most of the people who get to my site through searches are using the simple forms of searching that leave the ordering of the documents up to the search engine. It also appears that people have a strong bias toward selecting retrievals early in the results list. Because the various search services use different and often complex rules for ordering outputs, it is nearly impossible to predict exactly how the different pages in your document will be ordered, or even which of them will be retrieved.


For example, in one document I had a two page sequence. The first page described a concept in calculating sample size for a research study and the second was intended to be an optional page for those who wanted greater depth in the treatment of the topic. As I watched the web server logs it became clear to me that the second of the two pages was much more likely to be highly ranked by a search service than the first. It further appeared that many users were not using the Back or Contents buttons at the bottom of this second page. Thus, even though there were navigational aids that would have allowed the users to locate themselves in the document, it appeared that many users did not make use of them. As a patch I added a button at the top of the second page directing the user to the first page. Later log analysis showed that this button did attract more use than the original buttons had.


In another case, I saw that some users were starting with the first content page rather than with the introductory page. This appeared to come either from a typed in address or (my guess) from a bookmark. This is not a particular problem except that the introductory page includes some options that might be of interest to a frequent user. For example, one of the choices is to download the entire document as an Adobe Acrobat file. I added a note and a link to the bottom of the first content page reminding users of this option. So far, I know that this button is being used, but I am not sure that more users are finding this option.


Some users may start from a link on some other web site. You probably know that you can use some search engines to identify pages with links to your own (e.g., Alta Vista Advanced Search), but these services are often many months behind in updating their records. Web log analysis, on the other hand, lets you identify a link from an external site as soon as it is used. In addition, the page owner may have excluded the pages that link to you from indexing, so that they will never appear in the search services. One example from my own pages was a link from a K-12 school: the link was apparently part of an assignment, but the directory (folder) containing it was off limits to outside access.


3.2 What Links Did the User Follow?

As I noted above, there is no log record of users who follow links that lead to another server. It is easy, however, to trace the links users follow to reach pages on your own site. Remember that one of the fields in the log is the "referrer." All you need to do is look for referrer fields that do not have your own URL as part of them. For example, I would look for referrers that did not include
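That selection is a one-line filter once the referrer field has been extracted. A Python sketch; "ourserver.example" is a made-up stand-in for your own server's name, and the entries are invented (URL, referrer) pairs:

```python
OUR_HOST = "ourserver.example"  # hypothetical; substitute your server

# (url sent, referrer) pairs as they might come out of a parsed log.
entries = [
    ("/MinuteOne.html", "http://ourserver.example/index.html"),
    ("/MinuteOne.html", "http://search.example/Titles?q=html+commands"),
]

# Keep only hits whose referrer is present and not one of our own pages.
external = [(url, ref) for url, ref in entries
            if ref and OUR_HOST not in ref]
print(external)
```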



While I have not expressed much confidence in commercial log analysis programs, they should not be dismissed altogether. Starting with some summary statistics is often a good way to find more detailed questions to ask. In most cases, however, the available programs give too little specific detail to help us understand documents as communication tools. More detailed analysis requires reducing the file to only those records involving the documents of specific interest to you. Don't forget that a spreadsheet, especially if you use macros, can be a powerful analysis tool, and some of the final manipulation can even be done by hand.


1. Reduce the File Size

a. Eliminate .gif, .jpg and others

b. Select only a single group of pages or a single folder

c. Eliminate groups of users (e.g. internal)

2. Focus your question

a. Find links

b. Examine searches

c. Track paths

d. Look at initial contact

Figure 2: Steps in File Analysis

4.1 Reduce the File Size

Some log analysis software can create a complete analysis in a single run, and then let you select the information that is of particular interest after the fact. But many times the analysis software may not separate out the particular part of your site which you want to analyze. You may therefore have to first reduce the log file to just the pages of interest and then run the analysis software, or you may have to forgo the analysis software and use some more generic analysis tool such as a spreadsheet. The program available on my web site is designed to create subsets of web logs so that you can go on with further analysis.


The first step in making the web logs usable is to reduce their massive volume. A good way to begin is to eliminate entries for graphics files -- that is, to remove all log entries where the name of the file being sent ends in .gif or .jpg. Don't forget that file names may be inconsistent, so you may also have to eliminate .GIF, .JPG, .JPEG, .jpeg, and so on. You may also remove other file types, such as sound and video, if they are a significant part of your site. For many sites this step alone will eliminate 75% or more of the log entries.
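A case-insensitive extension test handles the .gif/.GIF inconsistency in one pass. A Python sketch, using the tab-delimited sample format from earlier (the field position of the URL is an assumption about that format):

```python
# Extensions to discard, compared case-insensitively.
IMAGE_EXTS = (".gif", ".jpg", ".jpeg")

lines = [
    "03/04/98\t21:09:27\tOK\t\t/MinuteOne.html\t\t\t",
    "03/04/98\t21:09:27\tOK\t\t/nextlessonbutton.GIF\t\t\t",
    "03/04/98\t21:09:27\tOK\t\t/8minheader.GIF\t\t\t",
]

def is_page(line):
    url = line.split("\t")[4]          # URL field position assumed
    return not url.lower().endswith(IMAGE_EXTS)

pages = [ln for ln in lines if is_page(ln)]
print(len(pages))  # 1
```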


The second step is to limit the remaining entries to the specific set of pages you wish to analyze. The people who run your server may provide you with a log extract that includes only the folders (directories) over which you have control, so that this process has been started for you. In the best of all possible worlds you will have arranged each set of your pages in a separate directory (folder). For example, all of my pages on random sampling are in a folder called "sampling," so I can simply include all sent pages whose URL includes /sampling/. On the other hand, my HTML tutorial happens to be at the root level of my web server. Even worse, some of the pages in this document don't have similar names. Thus while some of the pages can be specified with truncation (Minute*), others will have to be enumerated. We'll just call this one more advantage of web log analysis -- it teaches you how best to organize web files.
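Both selection styles -- a folder prefix and a truncated name -- reduce to simple string tests. A Python sketch using the folder and file names from the examples above:

```python
# Keep only the pages belonging to the documents of interest:
# everything under /sampling/, plus root-level tutorial pages
# whose names begin with "Minute" (the truncation Minute*).
def in_document(url):
    return url.startswith("/sampling/") or url.startswith("/Minute")

urls = ["/sampling/intro.html", "/MinuteOne.html", "/other.html"]
selected = [u for u in urls if in_document(u)]
print(selected)
```

Pages with dissimilar names, as noted above, would still have to be enumerated in a list of their own.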


In many cases we expect internal users, members of our own organization, to use our web site differently than outsiders do. For some pages we may expect them to be experienced navigators, for others, only occasional visitors. In any case it is easy to include or exclude entries based on the requesting URL. In my case, I would use "", since all Drexel University sites include this in their URLs. Some organizations may use several different URLs with no common segments so that each must be included or excluded separately.



The kind of file analysis that I have been describing is best understood as a sequence of different studies. That is, for each aspect of file use you want to study, you may have to create a new subset of the log and perform a new set of manipulations and analyses. Always remember the unfortunate fact of life that just because you want to ask a question does not mean that web log analysis can answer it.


4.2.1 Find links

In section 2.2 I introduced the "referrer" field as one of the data elements in a web log. The real value in tracing links is finding out who is linking to you and why. Expect many disappointments. Many links are simply unannotated URLs on a page called "cool stuff" or "my favorites." While being "cool" may be positive, it doesn't help much in deciphering your page's appeal to users. Even worse, you are likely to find your site listed with others that seem to you decidedly less "cool." But if we can get beyond the uninformative and the ego-deflating, there is a chance of learning about our audiences.


For example, my web document on survey sampling was written for people doing survey research studies, particularly research in libraries. I discovered that a high proportion of the links are from pages introducing basic statistics. Looking at the descriptions of the links, I saw that my material was being used as a good discussion of what would be an ancillary topic in most basic statistics courses. Right now I am working to combine this material with some other general statistics material that I have written so that I can provide an alternate pathway through the material. I think I can make it more useful to the basic statistics audience by focusing more on the definition of terms and calculations and making the process of survey research less obtrusive.

4.2.2 Examine Searches

In section 2.3 above I mentioned that search engine referrers show you what search was done. The file entry that I showed was:

Each search service codes its searches in a different way, but in general the user's search terms are preceded by a "q" and an equal sign ("="). There is also a great deal more encoded information, but you will either need to wait for my forthcoming paper on the subject or figure it out for yourself. In this case the search string was "html+commands." Be careful: the plus here encodes a blank space, not a Boolean "AND."
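Extracting the terms after "q=" (and decoding the "+" back into a space) is standard query-string handling. A Python sketch; the referrer URL here is invented for illustration, since each engine encodes searches differently:

```python
from urllib.parse import urlparse, parse_qs

# A hypothetical search-engine referrer from a log entry.
referrer = "http://www.search.example/Titles?q=html+commands"

# parse_qs decodes '+' to a space and '%xx' escapes automatically.
query = parse_qs(urlparse(referrer).query)
terms = query.get("q", [""])[0]
print(terms)  # html commands
```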

Searches may give you ideas about the audience that your pages might appeal to. Or they might suggest how the words in your pages mislead searchers.


You may not be able to do much to control which web searches will find (or ignore) your site, but here are a few suggestions:


4.2.3 Track Paths

In section 2.2 I briefly mentioned the role of the "referrer" field in letting us track the path of the user through our document. You need to couple this link information with your navigational expectations to know what to look for. A few web log analysis programs will actually draw charts of the average link patterns of a web site, but our interest in users as individuals often requires that we track the links ourselves. You need an extract of the log arranged in rough chronological order, preferably with all hits from the same host grouped together (thus making the file not quite chronological). One possible way of identifying paths through a document is to start with what you expect to be common paths and then identify departures from the expectation. For example, in a document where I expect a linear path, many individual paths may be characterized as following part of the path and then disappearing or jumping to my home page. If you like graphic information, you might draw the expected paths with a program such as Visio. As you look at user paths, remember that many browser actions (for example, Back) will not show in the log.
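Producing that "grouped by host, chronological within each host" ordering is a single sort on a compound key. A Python sketch with invented host names; the timestamp strings are assumed to be in a sortable form (YYYY-MM-DD HH:MM:SS), so converting the raw log format may be needed first:

```python
# (host, timestamp, url) triples as they might come out of a parsed log.
entries = [
    ("host-b", "1998-03-04 21:09:40", "/MinuteTwo.html"),
    ("host-a", "1998-03-04 21:09:27", "/MinuteOne.html"),
    ("host-b", "1998-03-04 21:09:27", "/MinuteOne.html"),
]

# Sort by host first, then time, so each host's path reads top to bottom.
entries.sort(key=lambda e: (e[0], e[1]))
for host, when, url in entries:
    print(host, when, url)
```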


My own experience is that a majority of users are not very adventurous link followers. That is, they do not use the side paths that add extra information. Much of what seems to me to be the power of hypertext goes untapped by my readers. I leave it to you to confirm this and, if it is true, to explain it. Is it that readers seek only the simplest information route? Or is it just that they have been too often burned by useless links (not at all the sort of thing we put in our own pages)?



4.2.4 Look at Initial Contact

In the sections above I noted that we can recognize initial contacts that come from links on other pages or from search engines. That still leaves contacts from typed URLs, bookmarks, and links in other programs such as email.

There is no way to distinguish the case in which a user typed your URL from those in which a bookmark or email link was used. But typing the URL is more likely to introduce errors. Errors up to the third slash (that is, in the host part of the address) will not show in your log because the request will never be routed to your server. But errors in the directory (folder) path or file name will show up marked "ERR" in the server log. While this seems a promising observation, I know of no reliable way of estimating what percent of typed URLs include errors, so it is hard to derive any quantitative understanding from them. By the way, I expect that everyone at this presentation will type in my URL just as soon as they get home. That way I can refine my error analysis.
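Even without an absolute estimate, the share of "ERR" results is trivial to track over time, and it is the change in that share that matters in the next paragraph. A minimal Python sketch over the result field of parsed entries:

```python
# Result codes pulled from one day's parsed log entries (invented data).
results = ["OK", "OK", "ERR", "OK", "ERR"]

# Fraction of requests that ended in an error; compare day to day.
err_rate = results.count("ERR") / len(results)
print(err_rate)  # 0.4
```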


We can, however, use a change in the rate of hits with errors to suggest that our URL may have been published in some source. Publication would induce more users to type our address and hence produce more errors. If you suspect that your URL has been published, try both a web search and a search in those commercial databases which include the full text of current publications. While the major search engines may be slow in indexing sites, they often have lists of popular sites which are reindexed every day, so a search may lead you to published articles which cite your URL.



Web log analysis to improve web page content and design is not an easy task. The information available is often incomplete or subject to multiple interpretations. On the other hand, to authors who are conscious of their intended audience and of their communications goals, small clues may yield significant insights. You can start the process with a few simple steps:

Even as an incomplete source, web logs offer us insights far beyond anything we can obtain from printed documents -- use them!