PEDISTAL

A Personal Document Imaging System

Gary K. Starkweather

Architect - Microsoft Research

Hardware Devices Group

June 2000

Introduction –

The office of the past, and much of the present, is still governed by the use of paper for transactions and especially for information storage and recovery. Will this ever change in any significant way? I believe it will, and currently available technology can change the way business is done. There are already quite a few document image management products available for medium to large businesses. To function properly, these solutions are often expensive and require a fair amount of operator training as well as database experience. The purpose of Personal Document Imaging, or PDI, is to provide a simple, cost-effective solution for individuals and perhaps very small businesses consisting of just a few people. This paper describes what such a system does, could do and might look like. The author has built a prototype system called PEDISTAL that is used constantly and has at least survived the test of actual utility. Most of us have to decide which papers to keep and which to throw out for file space reasons. The PDI system renders such decisions increasingly unnecessary because of the low cost of the solution.

Paper as a Solution –

Paper has been the serious office and personal information solution for over 100 years. We write and receive letters, receive bills and notices, pay bills with paper checks, etc. From the invention of the typewriter in the late 1800s until the invention of the laser printer and other electronic printer technologies, typewritten documents reigned supreme in the office and in business in general. Typist labor costs were low and hence the cost to generate a document was not, in general, prohibitive. Rewrites, of course, were annoying since often the whole document had to be regenerated by hand if significant formatting changes were required. For many years, duplicates were also painful since one had to use carbon paper, which only copied text as impressed by the type mechanism and not graphics, pictures, etc. The advent of the copier by Xerox in 1959 changed the world again by making fast, easy and inexpensive copies available. Most importantly, copies could be made at the point of need. They did not have to be made at the point of origin of the information. Thus paper got another boost in popularity and its use continued to mushroom.

In 1861, people in the United States used about 60,000 tons of paper. By 1995, 60,000,000 tons of paper were consumed. This represents a compound growth of about 5.3% per year for the intervening 134 years. Often, prognosticators have indicated that the paperless office was at hand. The development and availability of the personal computer and personal printer have only served to push the envelope harder. Supercalendered paper, which is what office bond paper is, has been growing at about 14% per year. Thus it appears as though management of paper, not necessarily its complete elimination, is the key to paper process success. We can control how we emit information, and most of us at Microsoft utilize the excellent email and Internet services available to greatly reduce our dependence on paper. However, we cannot control how we receive information. Thus, a casual walk past a printer room in almost any building will illustrate the current utility of paper by the busy hum of the printers and copiers for most of the day. The coming of the digital copier/printer/scanner can help manage paper but it will not likely eliminate it anytime soon. Reversal of paper growth will certainly occur but this is not likely for a few years.
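As a quick check on the growth figure, the compound annual rate implied by the two consumption numbers above can be computed directly. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Compound annual growth implied by the two consumption figures in the text.
tons_1861 = 60_000
tons_1995 = 60_000_000
years = 1995 - 1861  # 134 intervening years

# Solve tons_1995 = tons_1861 * (1 + g) ** years for g.
annual_growth = (tons_1995 / tons_1861) ** (1 / years) - 1
print(f"{years} years, about {annual_growth:.1%} per year")
```

A thousandfold increase over 134 years therefore corresponds to roughly 5.3% growth per year.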

Thus, the purpose of the PDI/PEDISTAL system described is to manage what paper the user has created, stored or received. The model is to convert the paper to digital form, store the result and then dispense with the paper. This is the first step toward driving paper out of our lives as something to be managed, stored, etc. Even after 25 years of electronic printer technology and millions of laser and inkjet printers, only about 10% of all printed material was electronically printed in 1999. It is estimated that not until about 2005 will approximately 50% of all printed documents be generated on an electronic printer. Therefore, it is felt that the solution described in this paper has value now and for some reasonable time into the future.

Equipment and Its Cost –

We first need to look at the cost disparity between storage and management of conventional paper versus electronic storage of the images of the information on the paper. File cabinets, the old stalwart storage mechanism of the past, are low tech and often assumed to be inexpensive. Such expectations are not realistic, however. Hardware and labor costs are both increasingly prohibitive for the file cabinet model of paper management. It costs, on average, about 10¢ to store a sheet of paper in a file cabinet. This cost does not include any costs for labor, which may not be insignificant. While 10¢/sheet may be thought of as minuscule, a simple calculation shows that this is not the case. Assuming burdened labor costs of $6.00 per hour and 10 seconds of labor to place the paper properly in the file, the labor cost to store a sheet of paper is about 1.7¢, which adds an additional 17% to the cost of storage. Retrieval is likely to be even more costly. However, for purposes of discussion, let us keep the cost of file storage at 10¢ per sheet of paper. Storage of 10,000 sheets of paper would thus incur a cost of about $1,000. How much would it cost if this were done electronically?
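The filing cost figures above can be reproduced with a few lines of arithmetic. This is a sketch using only the numbers quoted in the text; the variable names are illustrative.

```python
# Paper filing costs, using the figures quoted in the text.
storage_cents_per_sheet = 10.0   # file cabinet cost per sheet, excluding labor
labor_dollars_per_hour = 6.00    # burdened labor rate
seconds_per_sheet = 10           # time to file one sheet

labor_cents_per_sheet = labor_dollars_per_hour * 100 / 3600 * seconds_per_sheet
labor_share = labor_cents_per_sheet / storage_cents_per_sheet
cost_10k_dollars = 10_000 * storage_cents_per_sheet / 100

print(f"labor: {labor_cents_per_sheet:.1f} cents/sheet (+{labor_share:.0%})")
print(f"10,000 sheets at 10 cents each: ${cost_10k_dollars:,.0f}")
```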

First, we need to decide on the resolution, in pixels/inch, at which to scan and how well compression will work on the average document. Much testing has been done and it has been determined that 400 pixels/inch is an excellent scanning resolution. Above this, the OCR packages hardly improve their character recognition rate. Separate tests were conducted to find this out and indeed little was gained at 600 dpi over 400 dpi scanning. Furthermore, 400 dpi scans can be conveniently downsampled to 200 dpi for conventional fax transmittal. So far, more than 50,000 documents have been scanned and stored in the author’s experimental system. The average file size is about 200 KB. Obviously, photographs take up more space than typewritten material since compression is not as efficient with little or no whitespace. The scanned image is saved as a CCITT compressed TIFF file and most images are monochrome at the present time. Color images would increase storage requirements by about a factor of 2. However, let us assume that a monochrome 8 ½ x 11 inch document, scanned at 400 dpi and saved as a CCITT compressed TIFF, requires about 250 KB of storage. What does it cost for this much storage?
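To put the 250 KB figure in context, the uncompressed size of a 1-bit letter-size scan at 400 dpi can be worked out as follows. This is a rough sketch that assumes 1 KB = 1,000 bytes; the resulting compression ratio is derived from the assumed 250 KB page size, not measured.

```python
# Raw size of a monochrome (1 bit per pixel) letter-size page at 400 dpi.
width_in, height_in = 8.5, 11.0
dpi = 400

width_px = round(width_in * dpi)        # 3400 pixels
height_px = round(height_in * dpi)      # 4400 pixels
raw_bytes = width_px * height_px // 8   # 8 one-bit pixels per byte

compressed_bytes = 250 * 1000           # assumed page size; 1 KB taken as 1,000 bytes
ratio = raw_bytes / compressed_bytes

# Halving the resolution to 200 dpi for fax leaves a quarter of the pixels.
fax_fraction = (200 / dpi) ** 2

print(f"raw: {raw_bytes:,} bytes, compression about {ratio:.1f}:1")
print(f"200 dpi copy keeps {fax_fraction:.0%} of the pixels")
```

Under these assumptions the raw bitmap is about 1.87 MB, so a 250 KB file corresponds to a compression ratio of roughly 7.5:1.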

Recent disk prices (late 1999) were such that one could purchase an external SCSI disk drive of 50 GB capacity for about $750 or less. Considering that a 50 GB disk actually has about 46 GB of storage space after formatting and other amenities are taken care of, let us use 45 GB as the available capacity. Using the data above, about 180,000 documents/pages could therefore be stored on such a device. Since the recognized text of such documents consumes only about 2 to 4 KB per file, this is considered as included. Doing the arithmetic shows that the cost to store each page is about 0.42¢. If one could realize this low cost for conventional storage, a quality file cabinet would cost only $42 to store 10,000 pages as opposed to the $1,000 it really costs. As disk prices erode further, the electronic storage costs of the document images will certainly improve while the cost of file cabinets is not likely to improve much. Additionally, for a personal file system, 180,000 pages is a considerable amount of material, viz. 36 boxes of paper stuffed to the maximum. Most of us will personally never acquire this many documents in our lifetime. In summary, the required costs and capabilities for storing the data are clearly within reach of the average user. The proliferation of personal computers with multi-gigabyte disk drives means that the required storage for thousands of documents could be readily available to the user.
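The disk-based figures can be checked the same way. The sketch below uses the numbers from the text; the 5,000-sheet box size is an assumption inferred from the "36 boxes" figure, and 1 GB is taken as 1,000,000 KB.

```python
# Electronic storage cost per page, using the figures quoted in the text.
disk_price_dollars = 750
usable_gb = 45
page_kb = 250
sheets_per_box = 5_000   # assumption: implied by "36 boxes" for 180,000 pages

pages = usable_gb * 1_000_000 // page_kb           # pages that fit on the disk
cents_per_page = disk_price_dollars * 100 / pages  # cost of disk space per page
cost_10k = 10_000 * cents_per_page / 100           # dollars for 10,000 pages
boxes = pages / sheets_per_box

print(f"{pages:,} pages at {cents_per_page:.2f} cents each")
print(f"10,000 pages: ${cost_10k:.2f}; equivalent paper: {boxes:.0f} boxes")
```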

Scanning –

Scanning has been the subject of debate for some time. How would an average user get their documents conveniently scanned in? Usually, users would go out and buy a commercial scanner for a couple of hundred dollars and scan documents with that device. There are a number of complex issues that make such an arrangement problematic, however. First, the low cost scanners tend to be slow devices. If one wanted to scan a 50 page document, it could take over an hour on such scanners. This would make the scanning problem become a practical issue just from a productivity standpoint. Secondly, such scanners do not usually possess document feeders. Therefore, each page would have to be individually positioned for scanning and thus complicate the productivity issue even further.

The best choice today is to buy a fast scanner with an automatic document feeder. The Fujitsu M3097DG is such a unit. I have used this scanner with great success for some time. It can scan a duplex printed page (imaging on both sides) in one pass, and it scans single-sided pages at 400 dpi at the rate of about 45 pages per minute and duplex pages at about 25 pages per minute. The automatic document feeder holds about 100 typical bond sheets and thus, even in simplex mode, one can do something else for about 2 minutes before the scanner needs to be loaded with more pages to be scanned. Such a scanner is SCSI based and costs in the neighborhood of $6,000. This is quite high for an average user and is not likely to be the needed mass-market solution. However, this device and its attendant software are an excellent package on which I have scanned over 30,000 documents with virtually no problems.

The best solution, and one that is quickly coming, is the multifunction (MF) copier in the copy room and eventually at service bureaus. Just as one can go to Kinko’s or Zebra Copy, for example, to get copies, one would be able to go to such places to get scanning done. The images could either be stored on a network disk accessible on the Internet from home or be placed on a CD for the customer. Currently, most, if not all, MF devices do not let you capture the scanned images to anything but the MF printer or internal storage for later printing. This is unfortunate because the network aspect of these devices is limited to printing only. Such restrictions are due to a number of factors, some logical and some not so logical. From the logical perspective, circuit costs in the MF device can often be reduced if the scanner prepares the data for printing only rather than for grayscale or TIFF file storage offline. Thus, the cost of the device is lowered and market penetration is conceivably higher. The not so logical issue, however, is one of standards. One manufacturer may feel that if they export their images, then another manufacturer can use their scanner to feed a competitive printer and vice versa. This rather parochial model limits the availability of standards in which universal products can participate and militates against the Digital Nervous System. The basic value of current facsimile products is not outstanding resolution or image quality but the fact that they all work on POTS (Plain Old Telephone System). MF device vendors need to get serious about this issue, and progress is taking place. When newer devices become available, one will be able to scan documents and store them on network attached disks etc. The user can then conveniently scan their documents and not have to buy a high speed unit for home use any more than they have to buy a high speed copier for the occasional large copy job.
Scanning costs should be very reasonable when batched by the user as their needs require. The scanning problem is going to go away quickly and this vital aspect of PDI is not going to be a problem in a couple of years.

The Prototype System – PEDISTAL

PEDISTAL, or PErsonal Document Imaging System And Library, is a package that the author has put together to provide a research prototype of such a “paper-to-bits” vision. We will review how the relatively simple application works and its current performance metrics on a particular hardware configuration. The PEDISTAL application was written in Visual Basic 6 to enable easier programming and a minimum of coding time. The application is not large and the source code is only about 50 KB in size. The compiled executable is about 100 KB. The UI was created for functionality and there are likely better designs, but this one works sufficiently well to have great utility, at least to the author. Furthermore, with VB, the UI is easily altered or customized and this will be done as the need arises.

The general process flow and operation of the PEDISTAL system is as follows:

(1) Collect any paper documents to be scanned and remove them from their binding so that single sheets can be scanned. The scanner will perform duplex scans if necessary although the operator must make the decision.

(2) Scan the documents (usually in black and white) at 400 pixels/inch. As discussed earlier, a resolution of 400 pixels/inch is good for OCR work as well as reading and re-printing the scanned images should that be desired.

(3) The resulting images are renamed and placed in a folder, which then is targeted by the OCR package.

(4) The recognized text is given a simple filename to be associated with the OCR’d document. The method chosen has been to name the image files PageXXXXX.tif and the text files PageXXXXX.txt, where XXXXX is a number from 00001 to 99999. Thus, Page11234.txt is the text recognized from Page11234.tif, etc. This gives the user a direct reference to the text from the scanned image for up to 99,999 files. The available size of this number represents about 10 file cabinets’ worth of data and could be expanded if necessary.

(5) The image files, which are CCITT Group IV compressed, are placed in a folder called plainly enough, OCR Images and the text is placed in a folder called OCR Text. Should there be more than 32,767 files in a folder, it is necessary to have multiple folders such as OCR Images A, OCR Images B etc. Windows 2000 and NT can properly handle more, but apparently VB cannot load more than 32,767 files in a list box and this requires the multiple folder use for numbers of images exceeding this value or multiples thereof.

(6) When searching, the PEDISTAL application then looks into the pre-designated folders for the images and text data.

(7) Typically, the user inserts a word to search for and the PEDISTAL application then searches all the PageXXXXX.txt text files to find those in which this word is present. It is efficient enough to quit looking when it has found one instance in a file. The path and name of that file are then placed in a special file that is created for the search. For example, if the word searched for is Lexus, a file called LexusFoundFiles.txt is created. Appendix I illustrates the contents of such a file. The searches are not case sensitive. All files that contain the word Lexus have their paths and names listed in this file. Thus, when going back at a later time, one need not re-search the files for the required data unless a new file or set of files has been added that the user might deem to have value in a new search.

(8) An additional file is also created if necessary and its contents expanded as needed which contains words previously searched for. This file, Searched TIFF Items.txt is a list of all previous searches. Words contained in Searched TIFF Items.txt, have files called ****FoundFiles.txt from previous searches. Appendix II illustrates the simple contents of Searched TIFF Items.txt.

(9) When the file list is presented, the user can click on the file of interest and the image is obtained and presented via the Windows Imaging application that ships with every copy of Windows.
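Steps (6) through (8) can be sketched in a few lines of Python. The prototype itself was written in Visual Basic 6 and its code is not reproduced here, so the function names below are hypothetical, and treating a "word" as a plain substring match is an assumption.

```python
from pathlib import Path

def file_contains(path: Path, word: str) -> bool:
    """Case-insensitive check that stops at the first line containing the word."""
    needle = word.lower()
    with path.open("r", errors="ignore") as handle:
        for line in handle:
            if needle in line.lower():
                return True   # one instance is enough; quit looking
    return False

def is_page_text(path: Path) -> bool:
    """True for PageXXXXX.txt files, ignoring case (e.g. PAGE11434.TXT)."""
    name = path.name.lower()
    return name.startswith("page") and name.endswith(".txt")

def search(word: str, text_folders: list[Path], work_dir: Path) -> list[Path]:
    """Return the text files containing the word, reusing earlier results."""
    found_file = work_dir / f"{word}FoundFiles.txt"
    history_file = work_dir / "Searched TIFF Items.txt"

    # A previous search for this word already produced a found-files list.
    if found_file.exists():
        return [Path(p) for p in found_file.read_text().splitlines() if p]

    hits = [
        path
        for folder in text_folders
        for path in sorted(folder.iterdir())
        if is_page_text(path) and file_contains(path, word)
    ]
    found_file.write_text("".join(f"{path}\n" for path in hits))
    with history_file.open("a") as handle:   # remember that this word was searched
        handle.write(word + "\n")
    return hits
```

Calling search("Lexus", folders, work_dir) a second time returns the cached list from LexusFoundFiles.txt without rescanning, which mirrors the behavior described in step (7). Newly added pages are only picked up if the found-files list is rebuilt.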

This process will now be viewed through the PEDISTAL UI so that the general process can be seen.

FIGURE 1

Pedistal Application UI Presentation

Currently, only monochrome documents have been used and, to date, more than 50,000 are in the repository on a 50 GB disk. The images comprise a storage requirement of 10.88 GB. The average file size is 207 KB and the compression used is CCITT Group IV, which is lossless. JPEG would generate more compact files but, since the compression would not be lossless, this was deemed undesirable for the prototype. The resulting text files comprise 118.3 MB, or about 2.25 KB per file. The overall size then is about 210 KB per page for both the image and the resulting OCR’d text. The 50 GB disk, bought some months ago, cost about $750. With about 45 GB of useful storage space after formatting etc. (NTFS), the cost for the storage of all the current 52,000+ pages calculates to about $183, or 0.35¢ per page.
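The storage cost quoted for the repository follows from prorating the disk price by the space actually used. The sketch below assumes exactly 52,000 pages, since the text gives only "52,000+", and treats 118.3 MB as 0.1183 GB.

```python
# Prorated disk cost for the current repository, using figures from the text.
disk_price_dollars = 750.0
usable_gb = 45.0
image_gb = 10.88
text_gb = 0.1183     # 118.3 MB of recognized text
pages = 52_000       # assumption: the text says "52,000+"

used_gb = image_gb + text_gb
cost_dollars = disk_price_dollars * used_gb / usable_gb
cents_per_page = cost_dollars * 100 / pages

print(f"{used_gb:.2f} GB used, ${cost_dollars:.0f} total, {cents_per_page:.2f} cents/page")
```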

As shown in Figure 1, the PEDISTAL application is a reasonably simple one. A detailed UI description is given later in this paper. Upon starting the application, the user selects the path for the text data, in this case K:\. Once the user single clicks on the path or directory box, the application looks in the required folders for all text files starting with “Page”; this match is not case sensitive. Two columns, if necessary, are filled as shown. The first column contains up to 32K files and the second contains any files up to the next 32K. When 64K files are exceeded, a third column will be added but for now that is not required. The user can then do one of two things. They can click on the button “Load Previous Searches”, which accesses the file “Searched TIFF Items.txt” and presents the previously searched for words in the listbox below, or they can insert a new word in the textbox adjacent to the #1 box in the center of the display. The user can, if they desire, click on one of the previously searched words and the listbox associated with #1 will load with the paths of relevant documents. If a new word is chosen, the user merely types it in and then clicks the “Search for Data” button, which causes the application to look through all the available text files for the required word.
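The column arrangement amounts to chunking the file list into groups no larger than the list box limit. A minimal sketch, assuming a limit of 32,767 entries per list and a hypothetical helper name, follows.

```python
LISTBOX_LIMIT = 32_767   # per-list limit reported in the text for VB list boxes

def split_into_columns(names, limit=LISTBOX_LIMIT):
    """Chunk a list of file names into consecutive groups of at most `limit`."""
    return [names[i:i + limit] for i in range(0, len(names), limit)]

# Illustrative repository of 52,000 sequentially numbered text files.
names = [f"Page{n:05d}.txt" for n in range(1, 52_001)]
columns = split_into_columns(names)
print([len(column) for column in columns])   # [32767, 19233]
```

A third column appears automatically once the list grows past 65,534 entries, which matches the behavior described above.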

Searching on a 450 MHz PC with 128 MB of memory proceeds at a rate of about 123 pages per second. Thus a search through all 52,000 pages takes about 7 minutes. This is not stunningly fast but it is not prohibitive either. A commercial text search application was purchased and used to search the files; it worked about 3X faster than the author’s ad hoc code. Those bent on maximum performance could likely interface with such a package but it was not worth the effort here. As the search for a new word occurs, the number of files remaining to be searched shows in the adjacent box (Text3) and updates every 100 pages. When the files are loaded or the search is complete, the user can click on the button “Start Windows Imager” and the imaging app will start. The user then goes back to the PEDISTAL application and selects one of the found files by clicking on it, and the file path is transferred to the textbox below the button “Show Me This Image.” When this button is clicked, the Windows Imaging app is pointed at this TIFF file and the image is shown to the user. The Show Previous Image and Show Next Image buttons merely decrement or increment the page number and show the user the resulting image file. This is particularly handy for documents with many pages through which the user wants to move. It is always recommended that documents be scanned in page order to provide this convenient capability.
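Because image and text files share the PageXXXXX numbering, moving to the previous or next page, or from a found text file to its image, is simple name arithmetic. The helper names in this sketch are hypothetical and folder handling is omitted.

```python
import re

PAGE_NAME = re.compile(r"^page(\d{5})\.(tif|txt)$", re.IGNORECASE)
MAX_PAGE = 99_999

def page_number(filename: str) -> int:
    """Extract the five-digit page number from a PageXXXXX.tif/.txt name."""
    match = PAGE_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a PageXXXXX file name: {filename!r}")
    return int(match.group(1))

def image_name(number: int) -> str:
    """Build the image file name for a page number in 00001..99999."""
    if number not in range(1, MAX_PAGE + 1):
        raise ValueError(f"page number out of range: {number}")
    return f"Page{number:05d}.tif"

def neighbor_image(filename: str, step: int) -> str:
    """Name of the image `step` pages away (+1 for next, -1 for previous)."""
    return image_name(page_number(filename) + step)

print(neighbor_image("Page11234.tif", +1))        # Page11235.tif
print(image_name(page_number("PAGE11434.TXT")))   # Page11434.tif
```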

If the user wants to search for a number of words and build a set of ***FoundFiles.txt files, they can enter each word to be searched in the Add Me box and then click the Add Items button. Should an item be unwanted, misspelled, etc., the user can click the Remove Last Item button. Once the words to be searched are entered, the Build New Files button can be clicked and the search will proceed, one word at a time. That way, the search can run overnight or during a spare hour or so. Should the user have already searched for a word and wish to narrow the file list using an additional word or two, the lower boxes can be used. Let us say that the repository has been searched for the word “Microsoft” and the user has gotten 250 files from the search. They may wish to narrow the search by requiring that the word “color” also be in any relevant files. Entering the word “color” in the area adjacent to the #2 and then clicking the “Look for this also” button causes a second level search to occur. This way, only the 250 files need to be searched, and any positive results will have their paths shown in the box below as before. Should a third word be desired to narrow the selection even further, a similar search can be done via the buttons and fields in the #3 box area. Currently three words are set as the search limit.
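The narrowing step is a Boolean AND applied in stages: each level searches only the files that survived the previous level. The sketch below uses a small in-memory stand-in for the text files; the sample contents and the third word are invented for illustration.

```python
def contains(text: str, word: str) -> bool:
    """Case-insensitive substring test, as in the first-level search."""
    return word.lower() in text.lower()

def narrow(candidates, word, texts):
    """Keep only the candidate files whose text also contains `word`."""
    return [name for name in candidates if contains(texts[name], word)]

# Hypothetical page texts standing in for the OCR output files.
texts = {
    "Page00001.txt": "Microsoft announced a color laser printer.",
    "Page00002.txt": "Microsoft Research hardware devices group.",
    "Page00003.txt": "A COLOR display from another vendor.",
    "Page00004.txt": "microsoft color inkjet notes.",
}

level1 = narrow(list(texts), "Microsoft", texts)   # word #1 over all files
level2 = narrow(level1, "color", texts)            # word #2 over level 1 hits only
level3 = narrow(level2, "laser", texts)            # word #3 over level 2 hits only
print(level1, level2, level3, sep="\n")
```

Each level touches fewer files than the one before, which is why a second or third word is much quicker to apply than a fresh search of the whole repository.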

Figure 2

Pedistal Application with UI features Numbered

The various UI features are discussed in the next section where their operation is detailed. While the overall UI may be considerably simplified in the future, the current application was intended to give operational experience with the whole paper image storage and retrieval concept.

Summary Explanation of Individual UI Features –

Looking at Figure 2, one can see the UI features are numbered 1 through 39. These features are listed below with short explanations of their functions. Figures 1 and 2 are identical except for the callouts on the various features.

Feature Number Feature Explanation

1 Decrements the file # in item 38 and displays that file’s image.

2 Increments the file # in item 38 and displays that file’s image.

3 Displays the image associated with the file in item 38.

4 Clicking on this box opens the associated folder and loads items 34 and 35 with the text files in those folders.

5 Clicking on item 5 puts the total number of files in item 6.

6 Total number of files in items 34 and 35.

7 Clicking on this button puts the maximum file number in item 8.

8 Contains the maximum file number when item 7 is clicked.

9 Removes a file from the displayed list in item 14.

10 Loads file containing previous searches (Searched TIFF Items.txt)

11 Builds new searched files with the list of words in item 18.

12 A word in this field is searched for when item 13 is clicked.

13 Clicking this button searches for the word in item 12

14 Contains the list of files found from searching for item 12. Clicking on a file in this list loads the file’s path into item 38 for viewing by the Windows Imaging application.

15 Contains the list number of the file clicked on in item 14.

16 Contains the total number of files for which the search succeeded.

17 Contains the list of words previously searched for (Searched TIFF Items.txt).

18 Contains the list of new words to search for (batch process).

19 Contains the count of all the words in the previous searches file.

20 Contains a word to be added to the list in item 18.

21 Copies the list in item 17 to item 18 to search all files again. This is often used when a substantial number of files have been added to the repository and the user wants to build a new Searched Tiff etc.

22 Clicking on this item adds the word in item 20 to the list in 18.

23 Removes the last word in item 18 if it is not wanted, was misspelled, etc.

24 Clicking this button alphabetizes the list in item 17 for more convenient location of previously found words.

25 This item lists the number of files with word #2 in them.

26 Clicking this item causes all the files in item 14 to be searched for the word in item 37.

27 Contains the word (word #3) to be searched for in the files in item 30.

28 Item 28 contains the number of files found containing word #3.

29 Clicking this button causes a search of the files in item 30 for the word in item 27 (word #3).

30 This item lists the files that contain word #1 (item 12) AND word #2 (item 37).

31 Lists the files containing word #1 (item 12), word #2 (item 37) AND word #3 (item 27). This is a Boolean AND of all three words.

32 Clicking this button causes the Windows Imaging app to start. Once running the app need not be restarted unless the user intentionally or accidentally closes it.

33 This box also contains the total number of files in the repository.

34 This list contains the text files in folder 1 of the repository.

35 This list contains the text files in folder 2 of the repository.

36 This label changes to a green background and says “Running” if the Windows Imaging app has been started.

37 This item contains the second word to be searched for and is used to search the files found in item 14.

38 This box contains the file that is ready to be viewed if item 3 is clicked. It can be hand edited if the user wants to view a tiff file with a known filename and path.

39 This button or item causes the app to exit. All previous state is lost if the program exits, except for files such as Searched TIFF Items.txt and the ****FoundFiles.txt files.

While this is a rather extensive list, it is only the first UI created and tuned for the user’s requirements. Surely the UI can be substantially improved but for the goals of the research project, it contains the functions needed to perform the required tasks. There are often updates to the functionality but getting more documents into the repository is the most critical requirement at this point.

OCR Applications –

The number of available OCR applications is quite large. For this project, four packages were tested, viz. OmniPage Pro by Caere Inc., Xerox TextBridge and Pagis Pro by Xerox Corporation and TypeReader 5.0 by Expervision. Both OmniPage versions 7 and 9 were tested. Some general comments on the various packages are as follows:

(1) OmniPage Pro – In general this is a good OCR package but often it has too many bells and whistles and tends to forget its reason for being. It tries to capture the font and face as well as some page formatting issues. This is often more than is needed or wanted. It is reasonably fast but is limited to handling only 255 pages at a time at least in the versions used. In earlier versions (pre-version 9) the application would lock up if an image with no recognizable text in the image was presented to the application. This may not be as much of a problem in version 9 but the user should be cautious. Speed was quite good and text recognition is fair but peaks at about 400 dpi scanning. Scans at 600 dpi did not improve recognition accuracy and in some cases degraded it.

(2) TextBridge by Xerox – This is another good OCR package but it seems to try to recover the font, the face and the font size as well. This is overkill for most cases. All that was needed for this research was pure character text recovery with no formatting. Furthermore, TextBridge would only handle 255 files at a time, which is too small. Speed was reasonable but nowhere near the fastest in the group. However, accuracy was quite good but the UI is complex and processing a lot of pages at a time was cumbersome.

(3) Pagis Pro – Another good package but with nothing particularly distinguishing about its performance or accuracy. It is clearly a good package to use but there is nothing outstanding to recommend it to anyone.

(4) TypeReader® 5.0 – This package was the best tested. It did not fail on scans with no text, was more accurate at higher scan densities and was as good as any at 400 dpi. The only place it got confused was in telling the difference between a Times Roman “1” and a lower case “l” (L). This is not unexpected since these are so close in shape. TypeReader®, unlike any of its competitors, has three input quality modes. The first is draft, the second is letter quality and the third is best quality. It is recommended that any pages with pictures be submitted to the software as draft quality. The reason is that most pictures that will be scanned are not continuous tone but are halftoned since they have been printed. When one scans magazines or articles at 400 dpi, the halftones are resolved and, in the best quality mode, TypeReader® thinks these are periods and will often overflow its buffer with 60,000+ periods. Telling the application to use draft quality effectively causes it to ignore halftone dots, and no overflows ever occurred. Furthermore, one can submit up to 32,000 pages to the app to be recognized via a batch process. The most pages I ever submitted at one time was 8,000. At about 2 - 4 seconds average per page for OCR generation on a 450 MHz machine, this gave a continuous runtime of about 9 hours. No failures occurred in either of two major trials. TypeReader is just an OCR package with no fancy formatting requirements or provisions and it does its job very well. This is not intended as any recommendation but TypeReader was the best of the OCR applications that were tested.
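The batch runtime quoted for the 8,000-page run can be bracketed from the per-page timing. This is only a sketch of the arithmetic using the figures above.

```python
# Batch OCR runtime range for the largest run described in the text.
pages = 8_000
seconds_low, seconds_high = 2, 4   # average OCR time per page on a 450 MHz machine

hours_low = pages * seconds_low / 3600
hours_high = pages * seconds_high / 3600
print(f"{hours_low:.1f} to {hours_high:.1f} hours")
```

The observed runtime of about 9 hours sits at the upper end of this range, that is, close to 4 seconds per page.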

Summary –

This brief paper is intended to provide some insight into a simple “paper-to-bits” data retrieval system. It does not as yet handle pure photographs, videos or sound files, but there are others working on such systems. The PEDISTAL system will eventually expand to handle photos but will mainly be intended to deal with paper-borne text information such as technical articles, etc. It is also being used to capture paper-based information for personal home use.

This system is in use every day and continues to perform well. With further improvements in DVD disk density, it would soon be possible to store the 11 GB of data on only 2 DVDs. A DVD capacity of 11 GB, which probably reduces to 9.5 GB when formatted, yields a page store of over 45,000 pages per disk. Two DVDs meeting the aforementioned capabilities could contain everything I ever scanned in. These or copies could be readily carried with a portable computer, thus providing a portable document store of everything I have ever wanted to keep (some 53,000 pages at this point). In the future, such a data repository would be available via the Internet and not even the transport of the DVD storage disks would be necessary.
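The DVD page capacity follows from the formatted capacity and the roughly 210 KB per page measured for the repository. The sketch below takes 1 GB as 1,000,000 KB, which is an assumption.

```python
# Pages per disk and disks needed, using the figures quoted in the text.
formatted_gb = 9.5      # estimated usable capacity of an 11 GB disk
kb_per_page = 210       # image plus recognized text
pages_to_store = 53_000

pages_per_disk = int(formatted_gb * 1_000_000 / kb_per_page)
disks_needed = -(-pages_to_store // pages_per_disk)   # ceiling division

print(f"{pages_per_disk:,} pages per disk, {disks_needed} disks for {pages_to_store:,} pages")
```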

As multi-function digital copiers from Ricoh, Savin, Canon, Minolta, Hewlett-Packard and Xerox become available, scanners will be just around the corner or at a local copy shop. Now paper documents do not have to be kept, but the information on them need not be lost either. This system is being transported for use at home so that personal documents can also be included. If the bank no longer sends copies of checks, users can just scan them in and store the images. Any personal papers can be placed in the file store and retrieved whenever it is necessary. Such a system has to be the wave of the future since information will be available where it is needed not just where it is stored. While paper will go away at some point in the future, there appears to be technology, rationale, and profit potential in embracing its strengths while replacing its weaknesses.

Appendix I

Contents of a typical ****FoundFiles.txt file

Filename: LexusFoundFiles.txt

K:\OCR Text Files A\PAGE11434.TXT

K:\OCR Text Files A\PAGE21671.TXT

K:\OCR Text Files A\PAGE42402.TXT

K:\OCR Text Files A\PAGE42404.TXT

K:\OCR Text Files A\PAGE42405.TXT

K:\OCR Text Files A\PAGE42408.TXT

K:\OCR Text Files A\PAGE43239.TXT

K:\OCR Text Files A\PAGE43252.TXT

K:\OCR Text Files A\PAGE43274.TXT

K:\OCR Text Files A\PAGE43283.TXT

K:\OCR Text Files A\PAGE43294.TXT

K:\OCR Text Files A\PAGE43433.TXT

K:\OCR Text Files A\PAGE43439.TXT

K:\OCR Text Files A\PAGE43440.TXT

K:\OCR Text Files A\PAGE43441.TXT

K:\OCR Text Files A\PAGE43442.TXT

K:\OCR Text Files A\PAGE43443.TXT

K:\OCR Text Files A\PAGE43445.TXT

K:\OCR Text Files B\PAGE50001.TXT

K:\OCR Text Files B\PAGE50008.TXT

K:\OCR Text Files B\PAGE50011.TXT

K:\OCR Text Files B\PAGE50013.TXT

K:\OCR Text Files B\PAGE50015.TXT

K:\OCR Text Files B\PAGE50043.TXT

K:\OCR Text Files B\PAGE50072.TXT

K:\OCR Text Files B\PAGE50101.TXT

K:\OCR Text Files B\PAGE50109.TXT

K:\OCR Text Files B\PAGE50142.TXT

K:\OCR Text Files B\PAGE50200.TXT

K:\OCR Text Files B\PAGE50207.TXT

K:\OCR Text Files B\PAGE50209.TXT

K:\OCR Text Files B\PAGE50212.TXT

K:\OCR Text Files B\PAGE50224.TXT

K:\OCR Text Files B\PAGE50225.TXT

K:\OCR Text Files B\PAGE50227.TXT

K:\OCR Text Files B\PAGE50228.TXT
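A found-files list in this format is easy to process further. The sketch below groups the page numbers by folder; the helper name is hypothetical, and only a few of the paths above are used as sample input.

```python
from collections import defaultdict
from pathlib import PureWindowsPath

def parse_found_files(content: str) -> dict:
    """Group page numbers by folder name, skipping blank lines."""
    groups = defaultdict(list)
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        path = PureWindowsPath(line)
        # The stem looks like PAGE11434; drop the four-letter prefix.
        groups[path.parent.name].append(int(path.stem[4:]))
    return dict(groups)

sample = r"""K:\OCR Text Files A\PAGE11434.TXT

K:\OCR Text Files A\PAGE21671.TXT

K:\OCR Text Files B\PAGE50001.TXT

K:\OCR Text Files B\PAGE50008.TXT
"""
print(parse_found_files(sample))
```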

Appendix II

Contents of a typical Searched TIFF Items.txt file

Lexus

Corvette

display

electrostatic

gyricon

electrophoresis

charging

corotron

RX300

Curt