Sheerpower Logo
Q.8  Advanced Hashing: A Deeper Dive into Cluster Key Lookups

Hashing: A Deeper Dive into Cluster Key Lookups

Often when programming, there is a need to combine multiple variables into a single named object. Doing so makes it easier to keep track of your variables and adds clarity to your code. In sheerpower, we call this object a CLUSTER. This section focuses on finding data in cluster arrays using the FINDROW() function.

Example: FINDROW() Function

    cluster student: name$ city$
    add cluster student
    student->name$ = "Jason Nordahl"
    student->city$ = "New York City"
    add cluster student
    student->name$ = "Joan Ark"
    student->city$ = "Helena"
    add cluster student
    student->name$ = "Frank Abbott"
    student->city$ = "San Diego"
    print findrow(student->name$, "Joan Ark")
    end
  

The FINDROW() function creates an internal hash-table and uses it to quickly find a given value in a cluster. Hashing is the process of converting data such as a name into one or more numbers. Instead of finding the name, you can find the hashed number of the name. The hashed numbers can be organized in ways that make finding data extremely fast. sheerpower's FINDROW() function can find information at a rate of over 10 million rows per second. If the value being searched for does not exist, then the search is at a rate of over 15 million rows per second.

The Internals of FINDROW() Hashing

FINDROW() uses a method called "hashing" to perform high-speed lookups on cluster data. Internally, hashing uses what is called a "hash-table." As new rows are added to the cluster, optimizations are done to maintain the hash-table. If values within the cluster change, more optimizations are done to maintain the changed hash-table. Specific optimizations will be explored elsewhere.

Building the Hash-table

Assuming there are 10 rows in your cluster, sheerpower starts by building a hash-table with 16 slots. This over-allocation of 6 slots increases the efficiency of hash searches at the expense of some memory. The over-allocation also guarantees that there are always unused hash-table slots. The hash-table is accessed in three stages:

  1. The first stage uses slots 1 through 8.
  2. The second stage uses slots 9 through 16.
  3. The third stage uses all of the slots 1 through 16.

Hash-Table with Slots

    1  2  3  4  5  6  7  8
    9  10 11 12 13 14 15 16
  

For example, say that you are storing the value of "Boston" into this cluster (city$). Using three different prime numbers, sheerpower creates three hash values (HVALUES) for this cluster variable. Each HVALUE is mathematically computed using the byte values of the city name and one of the prime numbers. In addition, sheerpower keeps track of how many times each city has been hashed and factors that occurrence number into the computation.

Using the Hash Values

sheerpower then takes the first HVALUE and reduces it to a number between 1 and 8. In this example, the reduced number is "4." This tells sheerpower to look in the 4th slot of the hash-table. If that slot is available, sheerpower stores the first HVALUE (4) and the cluster row number for the city name "Boston." We also store the occurrence number for the city.

Assigning Hash-table Boston to Slot #4

    1  2  3 #Boston 5  6  7  8
    9  10 11 12 13 14 15 16
  

If the slot was in use, we take the second HVALUE and reduce it to a number between 9 and 16. Let's say the reduced number was 15. We look in the 15th slot of the hash-table. If that slot is not in use, we store the second HVALUE and the cluster row number for this city. We also store the occurrence number for the city.

If the slot was in use, we take the third and final HVALUE and reduce it to a number between 1 and 16 – the full range of hash-table slots. Let's say the reduced number was 5. We look in the 5th slot of the hash-table. If that slot is not in use, we store the third HVALUE and the cluster row number for this city. We also store the occurrence number for the city.

If the slot was in use, starting at the next slot (6 in this example), we sequentially search the hash-table for an unused slot. If we get to the end of the hash-table, we start over at the beginning. Because the hash-table is over-allocated, we are guaranteed to find an unused slot – usually within two tries. When we find the unused slot, we store the third HVALUE and the cluster row number for this city. We also store the occurrence number for the city.

Searching the Hash-table

We will assume the same 10 row cluster and 16 slot hash-table. Let's find the first occurrence of Boston. Using three sets of prime numbers, we create three hash-values (HVALUE) for Boston. The hash-values are mathematically computed just as we did when building the hash-table.

We start by taking the first HVALUE and reducing it to a number between 1 and 8. Let's say the reduced number was 4. We look in the 4th slot of the hash-table. If the slot is unused, then our search is over. Boston cannot be in our cluster. If the slot is used, we compare our first HVALUE with the HVALUE in the 4th slot. If they match, we then use the row number stored in the slot to look at the cluster’s city. If indeed it is Boston, then our search is over.

Hash-table

    1  2  3 #Boston 5  6  7  8
    9  10 11 12 13 14 15 16
  

If we did not find Boston using the first HVALUE, now we use the second HVALUE and repeat the process that we did for the first HVALUE. If still not found, we use the third HVALUE and repeat the process.

If Boston is still not found, we sequentially scan the hash-table. If we find an unused slot, our search is over. Boston cannot be in our cluster. If we find a match for the third number and the row in the cluster contains Boston, our search is over. Otherwise, we keep sequentially scanning. Because we over-allocated the hash-table, the scanning will eventually stop. on average the scan stops after just two attempts.

Handling Duplicate Keys

When inserting a duplicate key, the process begins by locating the hash-table slot that holds the first occurrence of the key. This slot contains, among other details, the current count of duplicates for that key (duplicate_key_count). We increment this count and then re-hash the duplicate key, incorporating the updated duplicate_key_count. This approach allows us to instantly determine the number of occurrences of any given key. To retrieve all occurrences of a duplicate key, we provide the key value and iterate through the occurrence numbers. When a key's value is updated, we also update the first occurrence's slot with the new duplicate_key_count. If the first occurrence is removed, another occurrence is promoted to maintain the correct duplicate_key_count.

Statistical Results for Searching

Our studies involving hundreds of millions of keys reveal that the first hash value (HVALUE) will locate an unused slot about 85% of the time. If it doesn’t, the second HVALUE will find an unused slot approximately 95% of the time. If this also fails, the third HVALUE will find an available slot. In the final step, sequentially scanning for an unused slot typically succeeds within two attempts. On average, new keys can be added to a cluster at a rate exceeding five million rows per second. The FINDROW() function can locate over 10 million key values per second, regardless of cluster size. If a key is not present in the cluster, this can be determined at a rate exceeding 15 million keys per second.

Hide Description

    

       


      

Enter or modify the code below, and then click on RUN

Looking for the full power of Sheerpower?
Check out the Sheerpower website. Free to download. Free to use.
Wide screen