Andrew,
It does look like there is room for improvement.
Nevertheless some comments:
1) Note that the primary use of resourceLib.h in base is hash tables that have integer keys. String keys are used with resourceLib.h I think only in limited ways in base (probably only for the PCAS example servers and for event name lookups). It would be good to optimize hash distributions for integer keys also.
2) Note that the resourceLib.h hash table dynamically, and deterministically, expands the table in response to changing load at runtime. This ability to use the same compiled library with both small and large tables while maintaining bounded deterministic lookup times was, at least at the time that I wrote this, a bit different from most of the other commonly used approaches.
When I have looked at this in the past most of the hash libraries are locked to a fixed maximum number of buckets. So if you can know how many the maximum number of entries is ahead of time you can have a faster and more evenly distributed approach. However, having a perfect hash distribution algorithm doesn’t result in a faster lookup in situations where you have 10 entries in every bucket because you are locked to a fixed table size which is too small.
As I recall, the problem was that if you just truncate a large width hash to index a small table you are effectively throwing away part of the higher order bits in the index which can result in unpredictable distributions. My string hash is pretty much identical to Pearson (aka mrk) except that I generate initially a 32 bit arch width hash index using Pearson's approach, treat this result as an integer key, and subsequently process this integer key through an integer hash to generate a final index for the table. For hashing of integer keys, I took a conservative approach of using iterative shifting and exclusive-or to make certain that all bits contribute when a smaller table is indexed. I didn’t spend a lot of time (any time) optimizing this integer hash algorithm so there is no-doubt lots of room for polishing. This post string hash, integer hashing step, is probably what is causing the sub-par distribution that you are observing. If we plug in a better string hash it mi!
ght not help if we don’t deal properly with the truncation issue (aka the hash algorithm for integer keys). Note also again that currently resourceLib.h is being used more often with integer keys so that might be where we get the best gains.
By conservative above I mean that, when I wrote this, I was more concerned about reasonable bounded distribution behavior, and less concerned about perfect distribution behavior when indexing both large and also small tables.
So, in summary, I don’t doubt that there can be useful improvements made, and I am enthusiastic about installing them, as long as we are mindful of the above constraints.
Jeff
> -----Original Message-----
> From: Andrew Johnson [mailto:[email protected]]
> Sent: Monday, March 23, 2009 3:14 PM
> To: Jeff Hill
> Cc: Core-Talk
> Subject: String hash functions and resourceLib.h
>
> Hi Jeff,
>
> I've been looking at string hash functions, and have recently replaced
> Marty's (as used in both gpHash and dbPvd) with one that is faster and
> performs as well on most IOCs and somewhat better on many. The function
> is now available from epicsString.h, and I think we should use it in
> resourceLib.h as well.
> Unfortunately your hash function generates rather a lot of collisions with
> some loads, and as with Marty's function it does require a table look-up
> for every character, which slows it down compared to the pure CPU
> methodologies.
>
> Here are the results of my tests, showing for each algorithm a histogram
> showing how many buckets contain a specific number of items. Of the
> algorithms, "mrk" is Marty's original, and 'joh' is the stringId::hash
> algorithm from resourceLib.h. The algorithm I picked for epicsString.c is
> "ap" by Arash Partow, which along with some of the others is described at
> http://www.partow.net/programming/hashfunctions/
>
>
> tux% bin/linux-x86_64/hash /usr/local/iocapps/iocinfo/pvdata/iocpss05bm 6
> Read 14896 items into 16384 buckets, averaging 0.909 items per bucket
>
> Algorithm: mrk joh ap sdbm djb rs bkdr1 bkdr2 bkdr3 bkdr4
> bkdr5
> 0 6824 8871 6219 6705 7245 6339 6694 6358 7238 6380
> 6447
> 1 5729 4486 6471 5858 5140 6316 5869 6292 5296 6242
> 6157
> 2 2661 1501 2839 2727 2674 2796 2725 2813 2541 2835
> 2837
> 3 905 542 699 846 979 768 850 737 877 756
> 745
> 4 206 274 131 203 273 141 207 156 299 141
> 165
> 5 51 231 24 37 61 24 35 25 107 28
> 28
> 6 5 178 1 7 10 0 4 3 26 2
> 5
> 7 3 133 0 1 2 0 0 0 0 0
> 0
> 8 0 69 0 0 0 0 0 0 0 0
> 0
> 9 0 52 0 0 0 0 0 0 0 0
> 0
> 10 0 26 0 0 0 0 0 0 0 0
> 0
> 11 0 8 0 0 0 0 0 0 0 0
> 0
> 12 0 9 0 0 0 0 0 0 0 0
> 0
> 13 0 2 0 0 0 0 0 0 0 0
> 0
> 14 0 0 0 0 0 0 0 0 0 0
> 0
> 15 0 2 0 0 0 0 0 0 0 0
> 0
>
> Std.Dev. 0.987 1.594 0.896 0.968 1.040 0.911 0.964 0.916 1.064 0.917
> 0.928
> FigOfMerit 0.984 1.547 0.901 0.967 1.032 0.915 0.963 0.920 1.054 0.920
> 0.931
>
> FoM Rank 7 10 0 6 8 1 5 2 9 3
> 4
>
>
> The input data for the above test is the record list from our IOC with the
> largest number of records. I have run this against many other IOC record
> lists, adjusting the table size appropriately, and the "ap" algorithm
> usually appears at or near the top. The source for my test program is
> attached; the second argument sets the table size and should be between 0
> and 8.
>
> I would be happy to commit the necessary changes to resourceLib.h if you
> wish me to.
>
> - Andrew
> --
> The best FOSS code is written to be read by other humans -- Harold Welte
- References:
- String hash functions and resourceLib.h Andrew Johnson
- Navigate by Date:
- Prev:
String hash functions and resourceLib.h Andrew Johnson
- Next:
EPICS @ RTEMS builds: separate BASE and TOOLS locations Ralph Lange
- Index:
2002
2003
2004
2005
2006
2007
2008
<2009>
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
- Navigate by Thread:
- Prev:
String hash functions and resourceLib.h Andrew Johnson
- Next:
EPICS @ RTEMS builds: separate BASE and TOOLS locations Ralph Lange
- Index:
2002
2003
2004
2005
2006
2007
2008
<2009>
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
|