D.6 Text Analysis Using Clusters
sheerpower has extensive support for files.
Files can be text or binary, and can live on devices, networks, the Internet, and so on.
Like everything in sheerpower, file processing is very fast.
The example code below reads the entire text of the Bible, finds the
words that occur only once (excluding proper names), and shows those words
in context. It uses a text file that contains the text of the Bible. Each
text line is prefixed with the chapter and verse that it came from, followed
by a tab character (ASCII 9).
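To make the record layout concrete, here is a minimal sketch that splits one such line with piece$(), the same call the program below uses. The sample record is made up for illustration; the exact form of the location field in safe_bible.txt is an assumption.

```sheerpower
// Hypothetical sample record: location, a tab (ASCII 9), then the verse text
rec$ = 'Genesis 1:1' + chr$(9) + 'In the beginning God created the heavens and the earth.'

// piece 1 is everything before the tab, piece 2 is everything after it
location$ = piece$(rec$, 1, chr$(9))
text$ = piece$(rec$, 2, chr$(9))

print 'Location: '; location$
print 'Text: '; text$
```

Because the tab is the only separator, the verse text can contain commas and spaces without affecting the split.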
For added fun, you can enter in the variable a$ a comma-separated list
of first letters that you want to include. In the code, the match()
function is used to do this part. Have fun playing with the code. You can
always click on Reset to start over.
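The first-letter filter can be tried on its own. The sketch below reuses the filtering line from the main program on a short sentence; the value of a$ and the sample sentence are hypothetical and contain only lowercase words, so the proper-name check is left out.

```sheerpower
a$ = 'b,g,z'   // hypothetical list of first letters to include
text$ = 'the good zebra ate bread and apples'

for idx = 1
  word$ = getword$(text$, idx)
  if word$ = '' then exit for
  first$ = word$[1:1]
  // skip the word when a list was given and its first letter is not in it
  if len(a$) > 0 and match(a$, first$) = 0 then iterate for
  print word$
next idx
```

If match() behaves here as it does in the main program, only good, zebra, and bread should be printed. Leaving a$ empty disables the filter, so every word is kept.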
Note #1: The Bible text as formatted is from http://berean.bible/
Note #2: We could have used cluster input in this example, since this is
really a TSV (tab-separated values) file.
The Code with Explanations
open file in_ch: name '@..\safe\safe_bible.txt'
cluster raw_bible: location$, text$
cluster words: location$, word$
// skip copyright headers, etc., there are three of them
for idx = 1 to 3
  line input #in_ch, eof eof?: temp$
next idx
do
  line input #in_ch, eof eof?: rec$
  if eof? then exit do
  add cluster raw_bible
  raw_bible->location$ = piece$(rec$, 1, chr$(9))
  raw_bible->text$ = piece$(rec$, 2, chr$(9))
loop
close #in_ch
print 'Bible text lines: '; size(raw_bible)
collect cluster raw_bible
  text$ = raw_bible->text$
  // Within each text line, get each word
  // skip those which are numbers or are proper names
  for idx = 1
    word$ = getword$(text$, idx)
    if word$ = '' then exit for
    first$ = word$[1:1]
    if (first$ >= '0' and first$ <= '9') or
       (first$ >= 'A' and first$ <= 'Z') then iterate for
    // subset the list to certain letters (separated by commas)
    if len(a$) > 0 and match(a$, first$) = 0 then iterate for
    add cluster words
    words->location$ = raw_bible->location$
    words->word$ = word$
  next idx
end collect
print 'First letters to include: '; a$
print 'Total words: '; size(words)
print
// Using the unique words, get the ones that occur just once
collect cluster words: unique words->word$
  include _extracted = 1 // just one occurrence
  sort by words->word$
end collect
print 'Words selected: '; _extracted
print
for each words
  row = findrow(raw_bible->location$, words->location$)
  assert row > 0, 'We should always find the location'
  text$ = replace$(raw_bible->text$, words->word$+'='+'[['+words->word$+']]')
  print raw_bible->location$
  print change$(wrap$(text$, 5, 70), chr$(13), '') // pretty it up
  print
next words
Program Explanation:
1. Open the File:
The program opens the file safe_bible.txt for reading. The @..\safe\ specifies the path relative to the current directory.
2. Define Clusters:
Two clusters are defined:
- raw_bible: Stores the location and text of each verse.
- words: Stores individual words extracted from the verses, along with their location.
3. Skip Headers:
The first three lines of the file, which are assumed to be copyright headers, are skipped using a for loop.
4. Read and Store Verses:
The program reads each remaining line of the file, splits it at the tab character, and stores the verse location and text in the raw_bible cluster.
5. Process Each Verse:
The program iterates over each verse in the raw_bible cluster. For each verse, it extracts the words one at a time, skipping any word that begins with a digit (numbers) or an uppercase letter (treated as proper names).
6. Filter and Store Words:
If a$ is not empty, words whose first letter is not in the a$ list are also skipped. The remaining words are stored in the words cluster, along with their location.
7. Print Total Words:
The program prints the total number of words stored in the words cluster.
8. Identify Unique Words:
The program collects the distinct words, keeps those that occur only once, and sorts them. It then prints the number of words selected.
9. Highlight Unique Words:
For each selected word, the program finds the corresponding verse in the raw_bible cluster, marks the word in the verse text with [[ and ]], and prints the verse location followed by the wrapped verse text.
10. Summary:
The program reads a file containing Bible verses, extracts and filters the words, identifies the words that appear only once, and highlights those words in the original text.
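The lookup and highlight in step 9 can be sketched in isolation. The example below builds a two-row raw_bible cluster by hand and then uses the same findrow() and replace$() calls as the main program. The locations, verse texts, and the word being highlighted are all made up for illustration.

```sheerpower
cluster raw_bible: location$, text$

// Two hypothetical rows standing in for the real file contents
add cluster raw_bible
raw_bible->location$ = 'Sample 1:1'
raw_bible->text$ = 'a quiet brook ran by'
add cluster raw_bible
raw_bible->location$ = 'Sample 1:2'
raw_bible->text$ = 'the wind was still'

word$ = 'brook'   // hypothetical word that occurs only once

// Move to the row whose location matches, as the main program does
row = findrow(raw_bible->location$, 'Sample 1:1')
assert row > 0, 'We should always find the location'

// 'brook=[[brook]]' asks replace$() to swap the word for its bracketed form
print raw_bible->location$
print replace$(raw_bible->text$, word$ + '=' + '[[' + word$ + ']]')
```

Assuming findrow() makes the matching row current, as the main program relies on, this should print Sample 1:1 followed by the first verse with brook wrapped in [[ and ]].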