Text Manipulation Programs

The download button is at the bottom of this page.

Make Freq


This program makes a word frequency list for a text document.

I have recently found a program that does a very nice job of creating frequency lists on the Macintosh. It is called conc 1.76, and it also generates concordances. Look for it in all of the usual Mac software archives.


For input you must specify:
* An input file -- a text only document.
* An output file (it will be of type text).
* A stop word (common words) file; text only, one word per line.

The output will be two lists, one immediately after the other.
The first list gives the words from the document in alphabetical order. Each tab-delimited line contains:
* The word
* The word frequency
* The word stem (for non-stop words only)
Note 1: The words "the", "and", and "a" do not appear in their alphabetical positions. They are placed at the end of the alphabetical list.
Note 2: The program attempts to guess which words should always be capitalized -- this feature has not been evaluated.

The second list gives the words in descending frequency order. Each tab-delimited line contains:
* The word
* The word frequency
* An indication of whether the word is on the stop list.
The output file is designed to be opened by any spreadsheet. It can also be opened with any word processor.
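The two lists described above can be sketched in Python. This is only an illustration, not the program's Pascal code. The function names (`tokenize`, `make_freq`, `format_lines`) and the rule for what counts as a word are assumptions made for the example. The sketch leaves out the word stem column, the special placement of "the", "and", and "a", and the capitalization guessing.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a word is a run of letters or apostrophes, compared in lower case.
    return re.findall(r"[a-z']+", text.lower())

def make_freq(text, stop_words):
    """Return (alphabetical list, descending-frequency list) for a text."""
    counts = Counter(tokenize(text))
    # First list: (word, frequency) in alphabetical order.
    alphabetical = sorted(counts.items())
    # Second list: (word, frequency, on stop list?) in descending frequency
    # order. Ties are broken alphabetically, which is a choice made here.
    by_frequency = [
        (word, n, word in stop_words)
        for word, n in sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    ]
    return alphabetical, by_frequency

def format_lines(rows):
    # One tab-delimited line per row, so a spreadsheet can open the output.
    return ["\t".join(str(field) for field in row) for row in rows]
```

Writing `format_lines(alphabetical)` followed by `format_lines(by_frequency)` to one text file would give two lists, one immediately after the other, in the same general layout as the output file.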

The word stem algorithm uses the Porter method as described by Frakes (1992). The Pascal code for this algorithm was developed by Steve Quirlogico and is available separately:
Stemmer Code (Pascal)
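To give a feel for what a word stem is, here is a Python sketch of just the first rule group (step 1a, plural endings) of Porter's published algorithm. It is not the full stemmer and it is not the Pascal code mentioned above; the function name `porter_step_1a` is made up for this example.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter stemmer to a lower-case word."""
    # The rules are tried longest suffix first; only the first match applies.
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word
```

The remaining steps of the Porter method strip further suffixes, so the stems of longer words will usually be shorter than what this single step produces.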

The program should run on any modern Macintosh (probably system 6.4 or higher). As sent, the program asks for 3.5 MB of memory to run. If you have a smaller machine, you can reduce this by using the Get Info command in the File menu of the Finder. You will have to use trial and error to figure out how much memory you need. I regularly run the program on files in the 20K to 30K range, but there is nothing in the code that should prevent its use on larger files.

When it runs, the program counts the words it processes in steps of 50. You will see that the program is not very fast and, because it alphabetizes the words as it goes, it runs slower and slower on large files. (It is an O(n*n) sort.)
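The slowdown can be illustrated with a Python sketch. The first function keeps the list alphabetized as each word arrives, scanning from the front to find the right place, which is one plausible reading of "alphabetizes the words as it goes"; the actual Pascal code may differ. The second function is an alternative, not something the program does: count first, then sort once. Both function names are made up for this example.

```python
from collections import Counter

def count_with_sorted_insert(words):
    """Keep [word, count] entries in alphabetical order while reading."""
    entries = []
    for word in words:
        # Linear scan for the word's position: the longer the list,
        # the longer each new word takes, hence roughly n*n work overall.
        i = 0
        while i < len(entries) and entries[i][0] < word:
            i += 1
        if i < len(entries) and entries[i][0] == word:
            entries[i][1] += 1
        else:
            entries.insert(i, [word, 1])
    return [tuple(entry) for entry in entries]

def count_then_sort(words):
    """Alternative: count with a hash table, then sort once at the end."""
    return sorted(Counter(words).items())
```

Both functions return the same alphabetical list; only the amount of work on large files differs.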

Find Pairs


This program identifies frequently occurring adjacent word pairs.

For input you must specify:
* An input file -- a text only document.
* An output file (it will be of type text).
* A stop word (common words) file; text only, one word per line.

Each output line will contain the following tab-delimited information:
* Pair frequency
* First word
* First word frequency
* Second word
* Second word frequency
As an experimental feature, the program then gives a list of possible acronyms in the text. This feature has yet to be evaluated.
The output file is designed to be opened by any spreadsheet. It can also be opened with any word processor.
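The pair output described above can be sketched in Python. This is an illustration only, not the program's code. Several things here are assumptions: the function name `find_pairs`, the `min_count` cutoff for what counts as "frequently occurring", the rule for what counts as a word, and the decision to ignore any pair in which either word is on the stop list (the page does not say how the stop word file is used). The acronym list is not attempted.

```python
import re
from collections import Counter

def find_pairs(text, stop_words, min_count=2):
    """Return tab-delimited lines for frequently occurring adjacent word pairs."""
    # Assumption: a word is a run of letters or apostrophes, in lower case.
    words = re.findall(r"[a-z']+", text.lower())
    word_counts = Counter(words)
    pair_counts = Counter(
        (first, second)
        for first, second in zip(words, words[1:])
        # Assumption: skip a pair if either word is on the stop list.
        if first not in stop_words and second not in stop_words
    )
    rows = [
        (n, first, word_counts[first], second, word_counts[second])
        for (first, second), n in pair_counts.items()
        if n >= min_count  # hypothetical cutoff for "frequently occurring"
    ]
    # Most frequent pairs first; ties in alphabetical order.
    rows.sort(key=lambda row: (-row[0], row[1], row[3]))
    return ["\t".join(str(field) for field in row) for row in rows]
```

Each returned line has the five fields in the order listed above: pair frequency, first word, first word frequency, second word, second word frequency.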

The program should run on any modern Macintosh (probably system 6.4 or higher). As sent, the program asks for 3.5 MB of memory to run. If you have a smaller machine, you can reduce this by using the Get Info command in the File menu of the Finder. I regularly run the program on files in the 20K to 30K range, but there is nothing in the code that should prevent its use on larger files.

When it runs, the program counts the words it processes in steps of 50. You will see that the program is not very fast and, because it alphabetizes the words as it goes, it runs slower and slower. (It is an O(n*n) sort.)

Both programs are in a single StuffIt archive:
Text Manipulation Programs (Macintosh) (50K)
A DOS version will be along real soon now.


