S.3 Advanced Hashing: A Deeper Dive into Cluster Key Lookups

Hashing - A Deeper Dive into FINDROW()

Often when programming, there is a need to combine multiple variables into a single named object. Doing so makes it easier to keep track of your variables and adds clarity to your code. In SheerPower, we call this object a CLUSTER (see CLUSTERS in SheerPower). This section focuses on finding data in cluster arrays using the FINDROW() function (see also FINDROW Function).

Example: FINDROW() Function

cluster student: name$ city$ add cluster student student->name$ = "Jason Nordahl" student->city$ = "New York City" add cluster student student->name$ = "Joan Ark" student->city$ = "Helena" add cluster student student->name$ = "Frank Abbott" student->city$ = "San Diego" print findrow(student->name$, "Joan Ark") end

The FINDROW() function creates an internal hash-table and uses it to quickly find a given value. Hashing is the process of converting data such as a name into one or more numbers. Instead of having to find the name, you can find the hashed number of the name. The hashed numbers can be organized in ways that make finding data fast. SheerPower’s FINDROW() function can find information at a rate of over 10 million rows per second. If the value being searched for does not exist, then the search is at a rate of over 15 million rows per second.

The Internals of FINDROW() Hashing

FINDROW() uses a method called “hashing” to perform high-speed lookups on cluster data. Internally, hashing uses what is called a hash-table. As new rows are added to the cluster, optimizations are done to maintain the hash-table. If values within the cluster change, more optimizations are done to maintain the changed hash-table.

Building the Hash-Table

Assuming there are 10 rows in your cluster, SheerPower starts by building a hash-table with 16 slots. This over-allocation of 6 slots increases the efficiency of hash searches at the expense of some memory. The over-allocation also guarantees that there are always unused hash-table slots. The hash-table is accessed in three stages:

First stage uses slots 1 through 8.
Second stage uses slots 9 through 16.
Third stage uses all 16 slots.

For example, if you are storing the value of “Boston” into this cluster (city$), SheerPower creates three hash values (HVALUES) for this cluster variable...

Searching the Hash-Table

Let’s find the first occurrence of Boston. Using three sets of prime numbers, we create three hash-values for Boston. The process of searching the hash-table is similar to when we built it. If all else fails, we do a linear probe. This final step is rarely needed.

Handling Duplicate Keys

When we want to insert a duplicate key, we find the hash-table slot for the first occurrence of the key. That slot contains, among other things, the current duplicate_key_count for the key. We increment the counter and then re-hash the duplicate key by factoring in the duplicate_key_count. As a side effect, this method also gives us virtually instant access to the number of occurrences of any key value.

Statistical Results for Searching

Our empirical studies using hundreds of millions of keys show the performance of FINDROW():

The first hvalue finds an unused slot over 85% of the time.
The second and third hvalues succeed the rest of the time.

Statistical Results for FINDROW Performance

Hide Description

a$ b$ x y

Enter or modify the code below, and then click on RUN

Looking for the full power of Sheerpower?
Check out the Sheerpower website. Free to download. Free to use.