|
Often when programming, there is a need to combine multiple variables into a
single named object. Doing so makes it easier to keep track of your variables
and adds clarity to your code. In sheerpower, we call this object a CLUSTER.
This section focuses on
finding data in cluster arrays using the FINDROW()
function.
cluster student: name$ city$ add cluster student student->name$ = "Jason Nordahl" student->city$ = "New York City" add cluster student student->name$ = "Joan Ark" student->city$ = "Helena" add cluster student student->name$ = "Frank Abbott" student->city$ = "San Diego" print findrow(student->name$, "Joan Ark") end
The FINDROW()
function creates an internal hash-table and uses
it to quickly find a given value in a cluster. Hashing is the process of converting data
such as a name into one or more numbers. Instead of finding the name, you can
find the hashed number of the name. The hashed numbers can be organized in ways
that make finding data extremely fast. sheerpower's FINDROW()
function can find information at a rate of over 10 million rows per second.
If the value being searched for does not exist, then the search is at a rate
of over 15 million rows per second.
FINDROW()
uses a method called "hashing" to perform high-speed
lookups on cluster data. Internally, hashing uses what is called a
"hash-table." As new rows are added to the cluster, optimizations are done
to maintain the hash-table. If values within the cluster change, more
optimizations are done to maintain the changed hash-table. Specific
optimizations will be explored elsewhere.
Assuming there are 10 rows in your cluster, sheerpower starts by building a hash-table with 16 slots. This over-allocation of 6 slots increases the efficiency of hash searches at the expense of some memory. The over-allocation also guarantees that there are always unused hash-table slots. The hash-table is accessed in three stages:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
For example, say that you are storing the value of "Boston" into this cluster (city$). Using three different prime numbers, sheerpower creates three hash values (HVALUES) for this cluster variable. Each HVALUE is mathematically computed using the byte values of the city name and one of the prime numbers. In addition, sheerpower keeps track of how many times each city has been hashed and factors that occurrence number into the computation.
sheerpower then takes the first HVALUE and reduces it to a number between 1 and 8. In this example, the reduced number is "4." This tells sheerpower to look in the 4th slot of the hash-table. If that slot is available, sheerpower stores the first HVALUE (4) and the cluster row number for the city name "Boston." We also store the occurrence number for the city.
1 2 3 #Boston 5 6 7 8 9 10 11 12 13 14 15 16
If the slot was in use, we take the second HVALUE and reduce it to a number between 9 and 16. Let's say the reduced number was 15. We look in the 15th slot of the hash-table. If that slot is not in use, we store the second HVALUE and the cluster row number for this city. We also store the occurrence number for the city.
If the slot was in use, we take the third and final HVALUE and reduce it to a number between 1 and 16 – the full range of hash-table slots. Let's say the reduced number was 5. We look in the 5th slot of the hash-table. If that slot is not in use, we store the third HVALUE and the cluster row number for this city. We also store the occurrence number for the city.
If the slot was in use, starting at the next slot (6 in this example), we sequentially search the hash-table for an unused slot. If we get to the end of the hash-table, we start over at the beginning. Because the hash-table is over-allocated, we are guaranteed to find an unused slot – usually within two tries. When we find the unused slot, we store the third HVALUE and the cluster row number for this city. We also store the occurrence number for the city.
We will assume the same 10 row cluster and 16 slot hash-table. Let's find the first occurrence of Boston. Using three sets of prime numbers, we create three hash-values (HVALUE) for Boston. The hash-values are mathematically computed just as we did when building the hash-table.
We start by taking the first HVALUE and reducing it to a number between 1 and 8. Let's say the reduced number was 4. We look in the 4th slot of the hash-table. If the slot is unused, then our search is over. Boston cannot be in our cluster. If the slot is used, we compare our first HVALUE with the HVALUE in the 4th slot. If they match, we then use the row number stored in the slot to look at the cluster’s city. If indeed it is Boston, then our search is over.
1 2 3 #Boston 5 6 7 8 9 10 11 12 13 14 15 16
If we did not find Boston using the first HVALUE, now we use the second HVALUE and repeat the process that we did for the first HVALUE. If still not found, we use the third HVALUE and repeat the process.
If Boston is still not found, we sequentially scan the hash-table. If we find an unused slot, our search is over. Boston cannot be in our cluster. If we find a match for the third number and the row in the cluster contains Boston, our search is over. Otherwise, we keep sequentially scanning. Because we over-allocated the hash-table, the scanning will eventually stop. on average the scan stops after just two attempts.
When inserting a duplicate key, the process begins by locating the hash-table
slot that holds the first occurrence of the key. This slot contains, among
other details, the current count of duplicates for that key
(duplicate_key_count
). We increment this count and then re-hash
the duplicate key, incorporating the updated
duplicate_key_count
. This approach allows us to instantly
determine the number of occurrences of any given key. To retrieve all
occurrences of a duplicate key, we provide the key value and iterate through
the occurrence numbers. When a key's value is updated, we also update the
first occurrence's slot with the new
duplicate_key_count
. If the first occurrence is removed, another
occurrence is promoted to maintain the correct
duplicate_key_count
.
Our studies involving hundreds of millions of keys reveal that the first
hash value (HVALUE
) will locate an unused slot about 85% of the
time. If it doesn’t, the second HVALUE
will find an unused slot
approximately 95% of the time. If this also fails, the third
HVALUE
will find an available slot. In the final step,
sequentially scanning for an unused slot typically succeeds within two
attempts. On average, new keys can be added to a cluster at a rate exceeding
five million rows per second. The FINDROW()
function can locate
over 10 million key values per second, regardless of cluster size. If a key
is not present in the cluster, this can be determined at a rate exceeding 15
million keys per second.
Hide Description
|
|
Enter or modify the code below, and then click on RUN |
Looking for the full power of Sheerpower?
Check out the Sheerpower website. Free to download. Free to use. |