Talk:Perfect hash function

Unnamed section

Bob Jenkins' "Minimal Perfect Hashing" looks good, but I have not tried it yet.

Regards! Alan

I'm not sure whether the "perfect" and "ordered" qualifiers shouldn't each go to separate pages, possibly even Wiktionary pages. Also, I see a definition on the web that says ordered means i<=j implies F(i)<=F(j) (note the equal signs), but that sounds wrong to me: too weak, since even a constant function would satisfy it. If only there were an expert somewhere to consult with...
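To see why the non-strict version is too weak, consider this two-line check (plain Python, purely illustrative):

    # A constant map satisfies the weak condition i <= j implies F(i) <= F(j),
    # yet it is not injective, so it cannot be any kind of perfect hash.
    F = lambda i: 0
    assert all(F(i) <= F(j) for i in range(10) for j in range(i, 10))  # weakly "ordered"
    assert len({F(i) for i in range(10)}) < 10                         # not perfect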

JMCorey 18:30, 23 Mar 2005 (UTC)

"distinct elements in S"?[edit]

A set by definition contains distinct elements. So, why the statement "maps distinct elements in S to distinct integers"? Should be "maps each element in S to a distinct integer". —Preceding unsigned comment added by Tbtietc (talkcontribs) 19:53, 3 August 2010 (UTC)

"Order-preserving minimal perfect hash functions require necessarily Ω(n log n) bits to be represented."[edit]

It is unclear what 'n' is measuring in this statement. Also, a reference would be nice. —Preceding unsigned comment added by 207.81.2.177 (talk) 14:49, 15 April 2011 (UTC)

A reference in which the lower bound was proved: Fox, Chen, Daoud, Heath, "Order Preserving Minimal Perfect Hashing", SIGIR 1990, and the subsequent TOIS paper. I included both in Further Reading along with their URLs in the ACM Digital Library.
Here, n stands for the number of keys.
— Preceding unsigned comment added by Daoudamjad (talkcontribs) 11:03, 5 March 2013 (UTC)

Cichelli ref

I added the reference below because it predates all the ones listed and is, we believe, the first article about perfect hash functions published with useful results.

Richard J. Cichelli. "Minimal Perfect Hash Functions Made Simple", Communications of the ACM, Vol. 23, No. 1, January 1980. Mcichelli (talk) 19:40, 27 June 2011 (UTC)

Bits per key

'It has been proved that any minimal perfect hash function requires at least 1.44 bits/key.'

I've rephrased this to 'minimal perfect hash scheme', since a strict interpretation of the above is actually false. For example, if you happen to have the keys 0..99, they can be minimally perfect hashed with the function f(x)=x, which doesn't require 144 bits (1.44 bits × 100 keys). —Preceding unsigned comment added by G121 (talkcontribs) 21:15, 15 October 2009 (UTC)

I have taken the below bits-per-key information out of the page, because it is uncited and unclear. If someone understands what it is trying to say and can cite sources, feel free to add it back. It seems like it is describing the number of storage bits/key to implement any minimal perfect hash for any set of keys, but that can't be right: for example, if the key set is the numbers 0...n-1 the hash function can simply return the number itself and requires no storage. Otus (talk) 12:26, 13 July 2011 (UTC)

"It has been proved that any minimal perfect hash scheme requires at least 1.44 bits/key.[citation needed] However the smallest currently use around 2.5 bits/key. Some of these space-efficient perfect hash functions have been implemented in cmph library and Sux4J."

The specific case you are talking about is not the general one. General use can basically be assumed to mean random data: when an MPHF's keys are strings, they are first passed through a "random number generator", i.e. a hash function (CRC32, Jenkins, MD5, etc.). The 1.44 bits/key lower bound (more precisely 1/ln(2) bits/key) is derived in section 1.1 of this paper: http://cmph.sourceforge.net/papers/esa09.pdf. According to that paper, they achieved 2.07 bits/key while generating 74,000 keys/sec; a more reasonable 2.27 bits/key generates at 400,000 keys/sec (the computer was a 1.86 GHz Core 2). This algorithm, CHD, is in cmph. BPZ, the algorithm that CHD is compared to in that paper, is also in cmph under its other name, BDZ. The 2.5 bits/key figure refers to BDZ/BPZ, which takes 2.46 bits/key plus a rank table of 32 * 1.23 / 2^b bits/key; using b = 10 you get about 2.50 bits/key. Granted, the rank table only needs ceil(log2(1.23 * n)) * 1.23 / 2^b bits/key. — Preceding unsigned comment added by SteveT84 (talkcontribs) 03:02, 12 August 2011
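A quick sanity check of that arithmetic (plain Python; the constants come from the ESA'09 paper above, and the function name is mine, not part of cmph's API):

    import math

    def bdz_bits_per_key(b=10, n=None):
        # Approximate space of a BDZ/BPZ minimal perfect hash function in bits/key:
        # ~2.46 bits/key for the core structure plus a rank table with one entry per
        # 2**b positions. Entries are 32 bits, or ceil(log2(1.23*n)) when n is known.
        width = 32 if n is None else math.ceil(math.log2(1.23 * n))
        return 2.46 + width * 1.23 / 2 ** b

    print(bdz_bits_per_key(b=10))           # ~2.50 bits/key, the figure quoted above
    print(bdz_bits_per_key(b=10, n=10**6))  # ~2.49 bits/key with the tighter rank table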

Order preserving and monotone

It's not clear to me what the difference is between "order preserving" and "monotone". They sound like the same thing. Am I missing something? If so, it might help to clarify the distinction.

Secondly, do "order preserving" and "monotone" only apply to minimal perfect hash functions, or can the terms also be used for perfect hash functions?

Cmcqueen1975 (talk) 01:41, 3 August 2011 (UTC)

Order preservation can be used with non-minimal perfect hash functions as well. The term was coined by Fox, Chen, Heath, and Daoud in their seminal SIGIR 1990 paper (later published in ACM Transactions on Information Systems; I included the reference in Further Reading). I introduced the term in my Ph.D. thesis; it stands for "any a priori order", and lexicographic key order is a special case. Many examples are given for both. Applications include efficient hash joins, efficient deduplication, and efficient access to variable-length keys, since records can keep their current locations while the OPMPHF provides O(1) access. Used extensively to reference seismic lines and vector data that were stored on online tapes. — Preceding unsigned comment added by Daoudamjad (talkcontribs) 11:03, 5 March 2013 (UTC)
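A minimal illustration of the "any a priori order" point (a sketch in plain Python, not the construction from the SIGIR 1990 paper): in the general case an OPMPHF must send the i-th key of the given order to position i, so it encodes a permutation of the n keys, which is where the Ω(n log n)-bit lower bound comes from.

    # Keys in an arbitrary a priori order; lexicographic order is just a special case.
    keys_in_apriori_order = ["delta", "alpha", "charlie", "bravo"]
    rank = {k: i for i, k in enumerate(keys_in_apriori_order)}  # the stored permutation

    opmphf = rank.__getitem__   # O(1) lookup: key -> its position in the a priori order
    assert [opmphf(k) for k in keys_in_apriori_order] == [0, 1, 2, 3]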

ACM Digital Library relevant publications

I added four ACM Digital Library publications relevant to perfect hash functions in section Further Reading. — Preceding unsigned comment added by Daoudamjad (talkcontribs) 11:03, 5 March 2013 (UTC)

And thankfully that WP:SPAM has been removed. Please do not promote your own work ever again (✉→BWilkins←✎) 01:17, 6 March 2013 (UTC)

I am not promoting anything. This is not spam. The paper you cite contains "a copy" of the paper that appeared 25 years ago in the Communications of the ACM; the four links are to papers in the ACM Digital Library. All are copyrighted, and citations on Wikipedia to papers that are not "peer reviewed" often go unnoticed. Certainly editors who label every edit SPAM and revert without understanding the subject are not acting in the Wikipedian spirit. Again, I am not promoting my work: these papers collectively have been cited more than 350 times, have been around for the past 25 years, and were published so that people can use them. However, claiming results that are widely used, and having that publicized on Wikipedia, is not GOOD, as it shows a lack of knowledge and a lack of proper review. AGAIN, THIS IS NOT SPAM, and it is wonderful that a record is being maintained of dos and undos!!

I find it ironic that there is a "citation needed" for the 1.44 bits figure, which was proved and derived in the aforementioned CACM 1992 paper that was deleted about 10 times without being properly investigated and LABELLED SPAM. It reflects a lack of desire to improve content!!

( Daoudamjad (talk) 17:35, 7 March 2013 (UTC) )

First, I think you've been treated a bit roughly for an editor who appears to be a newbie here. I don't expect you to know much about how en.Wikipedia operates.
That said, your efforts in Wikipedia's article space appear to have the purpose of promoting yourself rather than serving the encyclopedia. Your edit history shows you created an article about yourself.[1] The references that you added to this article were entirely your works. You are an expert in the field, and you know the publications, so it is possible the actions were good faith additions with a little too much zeal. The problem is editors who add references to their own work have a WP:COI. Editors may add such references, but they need to be very careful. See, for example, MCiura adding material to Shell sort. If COI editors are not careful, then they get into problems such as we have here. Adding half a dozen self-authored references (and no independently-authored references) does not suggest the appropriate level of circumspection.[2] Maybe the contributions are not spam (I didn't characterize them that way in my reverts), but the number of self-references raises a strong conflict of interest issue. For some editors, the mere insertion of a reference to one's own thesis carries an overwhelming suggestion of self-promotion. It is odd that you used initials for the first three authors of the CACM 1992 paper, but spelled out the first name of the fourth author (which happens to be you). Same problem with TOIS 1991. Yet you claim that you are not promoting yourself. I can easily see how seasoned editors would label your insertions spam.
Wikipedia prefers secondary sources to journal articles. A textbook reference offers perspective that most primary sources do not have. Secondary references help show that a result is significant. Primary sources are not forbidden, but they should be used carefully. If a paper was published twenty years ago and cited hundreds of times, then it should be easy to find secondary sources that cover its important findings and put them in perspective.
Further reading sections are not favored. They should contain secondary references that can better explain the topic. Wikipedia does not have a goal of publishing a research bibliography for its topics. Wikipedia is not a scholarly journal or a scholarly index; it is an encyclopedia. WP is not interested in publishing every little fact or citing every important paper. To me, many of the remaining items in the Further reading section do not belong there. However, that inappropriate articles are there is not a reason to add other inappropriate articles. That's a distinction that many new Wikipedia editors do not understand.
Your complaint above is inaccurate. The 1.44 bit claim in the article carries a reference (ref 2, Belazzougui). The 2.5 bit claim still has a citation needed tag. See #Bits per key discussion above. I do not see your article-space edits satisfying the 2.5 bit citation needed tag. You did add a citation for the cn-tagged definition of order preserving MPHF. When I last reverted you, I restored that reference,[3] but then I replaced it with a NIST reference (which happens, by the way, to reference one of your papers). I also used your reference for the Ω(n log n) space bound in the following sentence due to some comments given by you on this talk page. I am, however, skeptical of the importance of that bound; it suggests an order preserving minimal perfect hash function is a minimal perfect hash function composed with a permutation table. Certainly the article does not explain the importance or significance of the result.
I do not see any merit in the claim "deleted about 10 times". Inaccuracy and exaggeration do not serve your purpose.
The suggestion that Bbb23, BWilkins, and Drmies lack Wikipedia's spirit or are uninterested in improving WP's content is absurd. I often come across their good efforts. I think you could have been handled with softer gloves, but you hold a Ph.D. and should have a thick skin.
Please contribute to Wikipedia – just be mindful of what WP is and be careful when there is a conflict of interest.
Glrx (talk) 23:05, 7 March 2013 (UTC)

Ironically, your NIST reference is the wrong reference: Bob Jenkins never wrote that ACM article. Had you taken the time to investigate this further, I would have acknowledged your sincere desire to improve content; but the status quo proves otherwise! It is taking a lot more writing to convince you that my additions are not self-promoting. I added the reference to my thesis because I contacted the authors of [2], and they claimed that the CACM 1992 paper only offered an outline; I even posted their answer on your talk page. I am really new to the concept of reverting valid modifications (a reference to a seminal paper) and labelling those modifications SPAM:

Again, the NIST reference you chose is to another seminal paper on perfect hashing, and it references our work; but it was not written by Bob Jenkins!! In fact, Bob Jenkins' page on perfect hashing that is cited in Further Reading has a link to the ACM resources page on perfect hashing, which I have been maintaining for years.

My complaint is that the paper in ref 2 uses 90% of the CACM 1992 Alg II paper and simply "skips" the reference! Our papers took years of work, grants, and many painful revisions before they were published. It is amazing how things are twisted and how easy it would be to create a Wiki"stu"pedia instead of the Wikipedia we all love, adore, and fund!! Without further ado, please review the NIST reference; it was not written by Bob Jenkins, and it cites: Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud, "Practical minimal perfect hash functions for large databases", CACM, 35(1):105-121, January 1992.

Towards better content and higher integrity, we publish and serve; please pardon my "soft" skin. I prefer "Amjad M Daoud" because Daoud is a generic name!! I hope you understand; that is the way it is written on the paper! Modifying it to "Daoud A." is horrific!! And this is not SELF-PROMOTING, and I do not deserve to be a "nanometer" from being blocked for it!!!

Sincerely, (Daoudamjad (talk) 14:25, 8 March 2013 (UTC))

Dynamic perfect hashing

It seems patently obvious that deletions from a set of dynamic keys do not cause a new hash function to be needed. Though the source doesn't explicitly say this, it's implicit in some of the formulations. Can I add this without having to quote or point to some part of the source? Sbalfour (talk) 16:27, 6 October 2019 (UTC)

Missing info, emphasis?

When I consider (or have actually attempted) finding a perfect hash function, the immediate impression is that it is daunting: the necessary conditions are stringent, the computational effort is factorially steep, and there is no deterministic procedure for finding one. Most attempts end in failure after a prolonged period of enumerative search, trial and error, and guesswork about what to try next. Sometimes it costs days of CPU time with no end in sight. Success usually yields a computationally complex function, or a multi-tiered array of hash functions, and/or a substantially sparser table than I'd need with a standard hash function. For ~1000 keys, I found one with three modulos (divides) and two multiplies. Compare the alternative, a binary search: one may consider its 'hash function' to be ten integer comparisons, and since the table for a binary search is contiguous, those ten comparisons amount to a minimal perfect hash function. How often do we need a perfect hash function for more than 1000 keys?

There is a reasonably deterministic way of guaranteeing a perfect hash function will be found: if the number of slots in the table is at least n², where n is the number of keys mapped, more than one out of every two random universal hash functions is collision-free. The ONLY reason finding such a function is factorially hard is that we insist the table size remain O(n). It's a classic time-space tradeoff on a factorial scale.
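A sketch of that procedure (illustrative Python; the "universal" family here is a simple multiply-mod-prime scheme, not any particular library's):

    import random

    def find_perfect_hash(keys, m=None, max_tries=1000):
        # With m >= n**2 slots, a randomly drawn member of a universal family is
        # collision-free with probability > 1/2, so ~2 draws suffice on average.
        n = len(keys)
        m = m if m is not None else n * n
        p = 2 ** 61 - 1                          # Mersenne prime, larger than any m here
        for seed in range(max_tries):
            a = random.Random(seed).randrange(1, p)
            h = lambda k, a=a: (a * hash(k)) % p % m
            if len({h(k) for k in keys}) == n:   # injective on the keys => perfect
                return h, seed
        raise RuntimeError("no perfect hash found")

    keys = [f"key{i}" for i in range(32)]        # 32 keys into 32**2 = 1024 slots
    h, seed = find_perfect_hash(keys)            # usually succeeds within a few seeds
    print(seed)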

Here's a table of the tradeoff for 32 keys and power-of-two table sizes:

table size: odds

  • 32: 1 / 5.6e12 (minimal perfect hash function)
  • 64: 1 / 1.3e4
  • 128: 1 / 6.9e1
  • 256: 1 / 7.6e0
  • 512: 1 / 2.7e0
  • 1024: 1 / 1.6e0

where the odds are the chance of picking a perfect hash function uniformly at random from all possible mappings. A standard hash table for 32 items might have about 41 slots, and the odds of picking a perfect hash function for it are roughly 1 in 45 million. If we're singularly concerned about worst-case performance, the problem is trivial: here a 1024-slot table is negligible, and if the data items are huge, we don't put them in the table, but pointers to them instead. But I don't think worst-case performance is the driving factor behind selecting a perfect hash function. The expected (statistical) worst case for a universal hash function, as opposed to the theoretical or absolute worst case of all keys mapping to the same index, is already reasonably small.
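The odds column can be reproduced directly: a uniformly random function from n keys into m slots is injective with probability m!/((m-n)! * m^n), a falling factorial over m^n. A quick check in plain Python:

    import math

    def perfect_odds(n, m):
        p = math.perm(m, n) / m ** n   # P(random function from n keys to m slots is injective)
        return 1 / p                   # expressed as "1 in X" odds

    for m in (32, 64, 128, 256, 512, 1024):
        print(m, f"1 / {perfect_odds(32, m):.1e}")
    # Prints 1 / 5.6e12 for m=32 down to 1 / 1.6e0 for m=1024, matching the table,
    # and perfect_odds(32, 41) is roughly 4.5e7 -- the "1 in 45 million" above.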

None of this is mentioned in the article, which reads as if to say "Aren't PHFs great! Let's use them all the time." Sbalfour (talk) 21:54, 7 October 2019 (UTC)

logarithm

Throughout the article (as well as in reference [1] and its reference [2]), the short form log is used for the binary logarithm (logarithm to base 2), which is generally abbreviated either log2 or lb. Shouldn't this be corrected, or at least clarified?

[1] https://cmph.sourceforge.net/papers/esa09.pdf

[2] K. Mehlhorn. Data Structures and Algorithms 1: Sorting and Searching. Springer-Verlag, 1984. 37.138.90.29 (talk) 17:23, 28 February 2024 (UTC)