D.8 Text Analysis Using Clusters

SheerPower has extensive support for files. A file can be text or binary, and can also refer to a device, a network resource, an Internet resource, and so on. Like everything in SheerPower, file processing is very fast.

The example code below reads the entire text of the Bible, finds the words that occur only once (excluding proper names), and shows each of those words in context. It uses a text file that contains the text of the Bible. Each text line is prefixed with the chapter and verse that it came from, followed by a tab character (ASCII 9). This all takes less than 1/100th of a second.

For added fun, you can set the variable a$ to a comma-separated list of first letters to include. The code uses the match() function for this filter. Have fun playing with the code. You can always click on Reset to start over.
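The first-letter filter can be pictured as a membership test against a comma-separated list. The sketch below is a Python analogue, not SheerPower code. The helper name keep_word is hypothetical, the comparison is assumed to be case-sensitive, and an empty list is assumed to mean "keep everything", as the len(a$) > 0 check in the program suggests.

```python
def keep_word(word: str, letters: str) -> bool:
    """Return True if `word` passes the first-letter filter.

    `letters` is a comma-separated list such as "a,z". An empty string
    disables the filter (assumption: mirrors the len(a$) > 0 check).
    """
    if not letters:
        return True
    # Build the set of allowed first letters, ignoring stray spaces.
    allowed = {item.strip() for item in letters.split(",") if item.strip()}
    return word[:1] in allowed


if __name__ == "__main__":
    for candidate in ["ark", "boat", "zeal"]:
        print(candidate, keep_word(candidate, "a,z"))
```

Whether SheerPower's match() treats case or surrounding spaces the same way is not stated here, so treat those details as assumptions of the sketch.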

Note #1: The Bible text as formatted is from http://berean.bible/
Note #2: Because this is really a TSV (tab-separated values) file, we could have used cluster input in this example.
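To make the file layout concrete, here is a minimal Python sketch (an analogue, not SheerPower code) that skips three header lines and splits each remaining line into a location and a text field at the tab. The sample lines, the read_verses name, and the header_lines parameter are made up for illustration; they are not taken from the real file.

```python
import csv
import io

# Made-up sample in the same shape as the file: three header lines,
# then "location<TAB>text" records.
SAMPLE = (
    "Header line 1\n"
    "Header line 2\n"
    "Header line 3\n"
    "Sample 1:1\tthe quick brown fox\n"
    "Sample 1:2\tjumps over the lazy dog\n"
)


def read_verses(handle, header_lines=3):
    """Return a list of (location, text) tuples from a tab-separated stream."""
    for _ in range(header_lines):
        next(handle, None)  # discard a header line, tolerate short files
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Keep only rows that have both fields.
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


verses = read_verses(io.StringIO(SAMPLE))

if __name__ == "__main__":
    print("Text lines:", len(verses))
    for location, text in verses:
        print(location, "|", text)
```

Quoting is switched off so that quotation marks inside the text are kept as ordinary characters.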

The Code with Explanations

open file in_ch: name '@..\safe\safe_bible.txt'
cluster raw_bible: location$, text$
cluster words: location$, word$

// skip copyright headers, etc., there are three of them
for idx = 1 to 3
  line input #in_ch, eof eof?: temp$
next idx

do
  line input #in_ch, eof eof?: rec$
  if eof? then exit do
  add cluster raw_bible
  raw_bible->location$ = piece$(rec$, 1, chr$(9))
  raw_bible->text$ = piece$(rec$, 2, chr$(9))
loop
close #in_ch

print 'Bible text lines: '; size(raw_bible)

collect cluster raw_bible
  text$ = raw_bible->text$
  // Within each text line, get each word
  // skip those which are numbers or are proper names
  for idx = 1
    word$ = getword$(text$, idx)
    if word$ = '' then exit for
    first$ = word$[1:1]
    if (first$ >= '0' and first$ <= '9') or (first$ >= 'A' and first$ <= 'Z') then iterate for
    // subset the list to certain letters (separated by commas)
    if len(a$) > 0 and match(a$, first$) = 0 then iterate for
    add cluster words
    words->location$ = raw_bible->location$
    words->word$ = word$
  next idx
end collect

print 'First letters to include: '; a$
print 'Total words: '; size(words)
print

// Using the unique words, get the ones that occur just once
collect cluster words: unique words->word$
  include _extracted = 1 // just one occurrence
  sort by words->word$
end collect

print 'Words selected: '; _extracted
print

for each words
  row = findrow(raw_bible->location$, words->location$)
  assert row > 0, 'We should always find the location'
  text$ = replace$(raw_bible->text$, words->word$+'='+'[['+words->word$+']]')
  print raw_bible->location$
  print change$(wrap$(text$, 5, 70), chr$(13), '') // pretty it up
  print
next words

Program Explanation:

1. Open the File:

The program opens the file safe_bible.txt for reading. The @..\safe\ specifies the path relative to the current directory.

2. Define Clusters:

Two clusters are defined:

raw_bible: Stores the location and text of each verse.
words: Stores individual words extracted from the verses, along with their location.

3. Skip Headers:

The first three lines of the file, which are assumed to be copyright headers, are skipped using a for loop.

4. Read and Store Verses:

The program reads each line of the file and stores the verse location and text in the raw_bible cluster.

5. Process Each Verse:

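The program iterates over each verse in the raw_bible cluster. For each verse, it extracts the words one at a time, skipping numbers and proper names. A word counts as a number if it starts with a digit, and as a proper name if it starts with an uppercase letter, so capitalized words at the start of a sentence are skipped as well.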

6. Filter and Store Words:

Words that meet the criteria (not numbers or proper names, and matching the first-letter list in a$ if one was given) are stored in the words cluster, along with their location.
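The extraction and filtering in steps 5 and 6 can be sketched in Python as follows. This is an analogue, not SheerPower code: the regular expression only approximates whatever getword$() treats as a word, and extract_words is a hypothetical name.

```python
import re

# Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
# getword$() in SheerPower may split text differently.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def extract_words(location, text):
    """Return (location, word) pairs, skipping numbers and capitalized words."""
    kept = []
    for word in WORD_RE.findall(text):
        first = word[0]
        # Same heuristic as the program: a leading digit means "number",
        # a leading uppercase letter means "proper name".
        if first.isdigit() or first.isupper():
            continue
        kept.append((location, word))
    return kept


if __name__ == "__main__":
    for pair in extract_words("Sample 1:1", "The ark of Noah held 2 doves"):
        print(pair)
```

Storing the location next to each word is what later lets the program jump back to the verse in which the word appeared.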

7. Print Total Words:

The program prints the total number of words stored in the words cluster.

8. Identify Unique Words:

The program groups the words cluster by unique word, keeps only the words that occur exactly once, and sorts them by word. It then prints the number of words selected.
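Finding the words that occur exactly once amounts to counting occurrences and keeping the entries whose count is 1. A minimal Python analogue (not SheerPower code, with the hypothetical name words_used_once and made-up sample data) might look like this:

```python
from collections import Counter


def words_used_once(pairs):
    """Given (location, word) pairs, return sorted (word, location) pairs
    for the words that occur exactly once across all pairs."""
    counts = Counter(word for _, word in pairs)
    return sorted((word, loc) for loc, word in pairs if counts[word] == 1)


if __name__ == "__main__":
    sample = [
        ("Sample 1:1", "ark"),
        ("Sample 1:2", "dove"),
        ("Sample 1:3", "ark"),
        ("Sample 1:4", "anchor"),
    ]
    for word, location in words_used_once(sample):
        print(word, "at", location)
```

Because a selected word occurs only once, its single stored location is enough to find the verse that contains it.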

9. Highlight Unique Words:

For each selected word, the program finds the corresponding verse in the raw_bible cluster by its location, highlights the word by wrapping it in [[ and ]], and prints the location followed by the word-wrapped verse text.
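The highlight-and-wrap step can be sketched in Python as below. This is an analogue, not SheerPower code. It assumes that the 5 and 70 passed to wrap$() are left and right margins, which is a guess, and it deliberately matches whole words only, which may differ from how replace$() behaves. The highlight name and its parameters are hypothetical.

```python
import re
import textwrap


def highlight(text, word, width=70, indent=4):
    """Wrap `word` in [[ ]] wherever it appears as a whole word,
    then word-wrap the result with a left indent."""
    pattern = rf"\b{re.escape(word)}\b"
    # A function replacement avoids any escape processing of the word.
    marked = re.sub(pattern, lambda m: f"[[{m.group(0)}]]", text)
    pad = " " * indent
    return textwrap.fill(
        marked, width=width, initial_indent=pad, subsequent_indent=pad
    )


if __name__ == "__main__":
    print("Sample 1:1")
    print(highlight("the ark rested near the market", "ark"))
```

Matching on word boundaries keeps a short word such as "ark" from also being marked inside a longer word such as "market".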

Summary: The program reads a file containing Bible verses, processes the text to extract and filter words, identifies unique words that appear only once, and highlights those words in the original text.
