Seeding a Pseudo-Random Number Generator (PRNG) from Sources that Aren't Random

We have an embedded product that needs to assign random radio network addresses near startup.

We plan to use a PRNG with a period of around 2^32 (with a 32-bit seed).

To get the initial seed for the PRNG, we have some 10-bit A/D converter readings available, as well as a free-running counter with a value 0

Reply to
David T. Ashley

On Wed, 24 Sep 2008 00:56:36 -0400, David T. Ashley wrote:

One way I've seen in Unix is "ls -lR /proc |md5sum" so you could try generating a CRC or MD5 sum from your available data if you have enough programme space. No idea if that is "mathematically valid", though, especially for so few bits.

--
Ivan Reid, School of Engineering & Design, _____________  CMS Collaboration,
Brunel University.    Ivan.Reid@[brunel.ac.uk|cern.ch]    Room 40-1-B12, CERN
Reply to
Dr Ivan D. Reid

The LSB from an ADC is usually mostly noise. How about accumulating the LSB from the ADC over some time? In your case, I'd take the LSB from each of the 8 channels, XOR them together to get one bit that I rotate into a 32- or 64-bit register. After a few laps (a few hundred samples) that should be random enough for most, unless you have very noise-free hardware. I've used that method in a few designs, and as long as I allow time to accumulate some bits, it sure looks random to me...
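
Roughly like this in C (a sketch only - read_adc() is a hypothetical placeholder for the board's per-channel ADC read, and the lap count is arbitrary):

    #include <stdint.h>

    /* Hypothetical board-support call: returns one 10-bit sample from 'channel'. */
    extern uint16_t read_adc(uint8_t channel);

    uint32_t gather_seed(void)
    {
        uint32_t seed = 0;
        /* 40 laps x 8 channels = 320 samples; adjust to taste. */
        for (unsigned int lap = 0; lap < 40; lap++) {
            uint32_t bit = 0;
            for (uint8_t ch = 0; ch < 8; ch++)
                bit ^= (uint32_t)(read_adc(ch) & 1u);   /* XOR the 8 channel LSBs into one bit */
            seed = (seed << 1) | (seed >> 31);          /* rotate the 32-bit register left     */
            seed ^= bit;                                /* fold the new bit in                 */
        }
        return seed;
    }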
Reply to
Andrew

Take the bits that are fairly random and push them into a CRC engine (very small in C or assembler) or anything which generates a hash code (MD5, for example).
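
For reference, a bitwise CRC-32 really is only a few lines of C. A hedged sketch (the reflected polynomial 0xEDB88320 is the common Ethernet/zip one; the caller supplies whatever raw sample bytes are available):

    #include <stdint.h>
    #include <stddef.h>

    /* Bitwise CRC-32, reflected polynomial 0xEDB88320 - small but slow. */
    static uint32_t crc32_update(uint32_t crc, uint8_t byte)
    {
        crc ^= byte;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
        return crc;
    }

    /* Fold an arbitrary pile of not-quite-random bytes into a 32-bit value. */
    uint32_t crc32_seed(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        while (len--)
            crc = crc32_update(crc, *data++);
        return ~crc;
    }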

CryptCreateHash(), CryptHashData(), CryptGetHashParam() do the trick in MSVC. You can specify the hash length you want (it's in bytes, so 4).

Incidentally, be wary of PRNGs. One I came across generated alternately odd and even numbers!

Regards,

Bill

Reply to
Bill Davy

If there is an otherwise unused ADC channel, you could always stick a reverse-biased diode on it as a noise source.

--
ArarghMail809 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html
Reply to
ArarghMail809NOSPAM

Plan a way to handle conflicts. It doesn't matter if the resolution is slow (for example, forcibly rebooting both nodes when a conflict is detected), because conflicts will be very rare, and it takes the pressure off trying to generate a perfect RNG. If your networks are not too large, it will also let you use a smaller number of bits while still having very low probabilities of conflicts.

If you have a source of random data (the LSB of an ADC is a good choice - high-resolution timers measuring external events such as button presses are another), then concatenating the bits is the *best* way to generate a larger random number. Any function that combines these bits (running them through a CRC algorithm, XOR'ing them, or any other suggestion given here) will not increase the entropy - it is more likely to decrease it. For an easy way to see this, think about generating a number between 1 and 12 using two dice. If you "concatenate" the rolls by rolling one die, then adding 0 or 6 depending on whether the second roll is 1-3 or 4-6, you get a nice even distribution. If you just add the two rolls directly, you do not.
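
In C, the concatenation is nothing more than shifting one sampled bit in at a time (a sketch; read_adc() is again a hypothetical placeholder for one raw ADC read):

    #include <stdint.h>

    extern uint16_t read_adc(uint8_t channel);   /* hypothetical: one raw 10-bit sample */

    /* Concatenate 32 independent LSBs into one 32-bit number - no mixing
     * needed if each LSB really is uniform and uncorrelated. */
    uint32_t concat_seed(void)
    {
        uint32_t seed = 0;
        for (int i = 0; i < 32; i++)
            seed = (seed << 1) | (uint32_t)(read_adc(0) & 1u);
        return seed;
    }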

Hash functions (such as CRCs) can be used to spread the random values around; they are convenient if the random input varies in size, and useful if the entropy of the random input is smaller than its size and is spread throughout the input. But that's not the case here - a hash function will be of absolutely no benefit (though not much harm, if it is a good hash function).

Reply to
David Brown

Read this:

formatting link

The code as a whole is total overkill for your application, but it'll get you up to speed on how to stir a pool of entropy and various recipes for leveraging your ingredients.

--
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
Reply to
Boudewijn Dijkstra

If the distribution of the LSB of your ADC is even slightly skewed (and it is quite likely to be), concatenating 32 of them to get a 32-bit number will not be as good as taking 1024, pushing them through a CRC/hash code and taking the 32-bit remainder. Imagine that the LSB is 10% 0 and 90% 1 (an extreme case, I grant you), and you can see what happens.

Bill

Reply to
Bill Davy

What you are looking at here is the case I described above - the entropy of the random input value is smaller than its size. In your 90/10 case, you get about 0.15 "bits" of randomness (log(0.9)/log(0.5)) per sample from the ADC. So to get 32 bits' worth of randomness, you need about 210 samples (any more than that are completely redundant for your 90/10 skew). A 32-bit hash function will map these 210 bits onto 32 bits while retaining *most* of the entropy (I don't think a hash function can retain *all* the entropy, but I could be wrong). But the hash function does not make the data any more random.

For a more realistic skewing of say 55%/45%, you get about 0.86 bits of entropy per sample. Concatenating 32 of these gives 27.6 bits of entropy (this assumes that samples are uncorrelated noise, of course). While this is less than 32 bits, it is still more than the OP would need (if my maths is correct, this corresponds to a 1 in a million chance of a collision if you have about 200 nodes).
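
Those per-sample figures are easy to reproduce; the sketch below just evaluates -log2(p) of the more likely bit value (the same log(0.9)/log(0.5) expression above, i.e. a worst-case per-sample figure) for the two skews mentioned:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double skew[] = { 0.90, 0.55 };   /* probability of the more likely bit value */
        for (int i = 0; i < 2; i++) {
            double bits_per_sample = -log2(skew[i]);          /* = log(p)/log(0.5) */
            printf("p=%.2f: %.3f bits/sample; ~%.1f samples for 32 bits; "
                   "32 samples carry %.1f bits\n",
                   skew[i], bits_per_sample,
                   32.0 / bits_per_sample,
                   32.0 * bits_per_sample);
        }
        return 0;
    }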

Reply to
David Brown

Good advice. Even with a perfect RNG the system should be able to handle conflicts; two perfect RNGs can still produce the same number.

Reply to
Dombo

I believe you are wrong.

The problem is that the input distribution is Gaussian, and he wants it uniform.

My immediate thought was to find the inverse mapping using the erf(x) function, but the other posters are correct - just sticking it through any reasonably designed hash function should make the distribution uniform, which is what (I think) he requested.

Reply to
Peter Webb

Yes, the original poster said he had Gaussian-distributed inputs. Quite frankly, however, if you want uniformly distributed random data, you start with uniformly distributed data - not Gaussian data. Turning Gaussian data back to uniformly distributed data is hard to do well, and it is impossible to get the full randomness back again. The best you can do is, as suggested, concatenate a number of samples and run it through a hash function. I think it would be very difficult to characterise this procedure, so you would want a large number of samples to feel confident of the entropy. All in all, it's very inefficient.

I also see no reason to suppose that sampling the ADC is going to give Gaussian data unless you are adding or averaging multiple samples.

Most of the responders here have suggested other sources of approximately uniformly distributed random data - I mentioned a couple, and have assumed that the OP will use one of these or one of the other suggestions from this group. My posts were therefore about how to turn a number of small approximately uniformly-distributed samples into a large approximately uniformly-distributed sample (concatenate the samples, and avoid any sort of adding or averaging).

Using a hash function as you suggest is perhaps the best answer to the OP's question - but concatenating multiple small uniformly random values is the best answer to the question he should have asked!

mvh.,

David

Reply to
David Brown


I don't know if it's possible, but you could tune your radio to an arbitrary station (or between stations, for some noise) and sample that signal a couple of times.

=D

Reply to
md

Many folks already suggested hashing of the different sources. That might work.

However I think there is a conceptual problem here: there is no guarantee that network addresses will be unique. This has to be resolved somehow at the protocol level.

But, if the protocol has to resolve the address collisions, then the same protocol can assign the unique addresses from the very beginning. So there is no need for the black magic with the random numbers.

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant

formatting link

Reply to
Vladimir Vassilevsky

On 24/09/2008 17.12, Vladimir Vassilevsky wrote:

I think Vladimir got the point. Make it work by protocol design with an explicit address resolution protocol and drop the whole random-from-adc issue.

Or stick in a Dallas 1-Wire unique 64-bit serial number chip and move on.

Getting the random trick robust enough would require significant time; I'd only do that if the quantity is > 50K.

Maybe he can just automatically "patch" a unique MAC address into the bitstream of the OTP/FLASH/EEPROM/whatever micro he's programming during production testing, and that's all.

Reply to
Antonio Pasini

In reality, the whole question of the distribution is completely moot, because of the nature of his problem - he wants a seed for a Rand function. Once the seed has gone through the RNG once, the output will be uniformly distributed. Just don't try to use the seed itself as a random number.

The distribution doesn't matter much because two seeds that differ only in their LSB will still be mapped to completely different random sequences. A difference in even a single bit will cause the rand function to produce completely uncorrelated output for the two different seeds. The only place that the distribution comes into play is that it very slightly increases the chances of an exact match.

Because of this, all you want to do is minimise/eliminate the chance of two absolutely identical seeds, so your simple strategy of concatenating LSBs should work fine. A slightly better/more rigorous approach would be to take every single bit of data from every input, and then use a hashing function to map it into 32 bits. This will use all the available data and generate a uniformly distributed seed. The chance of two identical seeds being created is 1 in 2^32, which is all you can get with a 32-bit key anyway. This won't be materially better than concatenation of the LSBs, but if the code is audited then it's easier to prove it's optimal than to resort to (quite reasonable) hand-waving arguments.
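
As a sketch of that "hash everything" variant, any small 32-bit hash will do; FNV-1a (my choice here, not something named in this thread) is about as short as they come - the constants are the published FNV ones, and the sample buffer is left to the reader:

    #include <stdint.h>
    #include <stddef.h>

    /* FNV-1a, 32-bit: offset basis and prime are the published constants. */
    uint32_t fnv1a_32(const uint8_t *data, size_t len)
    {
        uint32_t h = 2166136261u;
        while (len--) {
            h ^= *data++;
            h *= 16777619u;
        }
        return h;
    }

    /* e.g. feed it every byte of every raw reading plus the free-running counter:
     *   uint32_t seed = fnv1a_32((const uint8_t *)samples, sizeof samples);
     */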

Reply to
Peter Webb

You are wrong regarding the proofs, code auditing and "hand-waving". It is a mathematical fact that concatenating 32 truly random 0-1 bits will give you a 32-bit number with maximal entropy - there is no hand-waving involved. Anything else you do to these bits, including using other bits from inputs, hash functions, RNGs or anything else cannot possibly increase the entropy - and most likely will decrease it. It is also very difficult to calculate the effect of such manipulation on the entropy.

If the assumption that the 1-bit (or larger) samples are truly random (i.e., uniformly distributed) does not hold, then you will not have a full 32 bits of entropy in the concatenated result - but you *still* have more than you could get by manipulating that 32-bit concatenation in any way. If your samples are not uniformly distributed in the first place, then concatenating more than 32 bits' worth and passing it through a hash function can get you more entropy than simply taking 32 sample bits - but not as much as the full 32-bit entropy. And any calculations or proof of the entropy would likely involve a great deal of guestimates and hand-waving.

Of course, it is unlikely that the OP really needs a full 32 bits of entropy. And the distribution will not make any difference to the chances of a collision (except in how it decreases the entropy) - running the 32-bit value through a hash or an RNG will also not affect the chances of a collision.

The way to handle systems like this is simple. First, make sure you have a method of detecting and resolving conflicts if and when they occur. Then calculate how many bits of random data you need so that the chance of a collision is low enough to be acceptable, based on the cost of conflicts. If you have a reasonable source of uniform random data, sample that number of bits. If your source is not very random, sample more bits based on hand-waving guestimates of how "not very" random your samples are. If this means sampling more bits than you'd like in your id string (32 bits in the OP's example), you can either hash the larger sample string down to size, or re-think your design.
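
The "how many bits do I need" step is just the birthday approximation; a small sketch for n nodes drawing ids uniformly from 2^bits possibilities (the node and bit counts below are arbitrary examples):

    #include <stdio.h>
    #include <math.h>

    /* Birthday approximation: probability that any two of n nodes pick the
     * same id when ids are drawn uniformly from 2^bits possibilities. */
    static double collision_probability(double n, double bits)
    {
        return 1.0 - exp(-n * (n - 1.0) / (2.0 * pow(2.0, bits)));
    }

    int main(void)
    {
        printf("200 nodes, 32 bits: p ~ %.1e\n", collision_probability(200.0, 32.0));
        printf("200 nodes, 24 bits: p ~ %.1e\n", collision_probability(200.0, 24.0));
        return 0;
    }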

Reply to
David Brown

There is an assumption in my analysis which I believe to be true, but don't absolutely know to be so in every case. My assumption is that the hash function is completely uncorrelated with the input data, and doesn't affect the entropy. This is a property of a *well designed* hash function, but not its central purpose.

I will, for the rest of this post, assume that the outputs of the hash function are uncorrelated with its inputs.

To simplify the argument, say we have 8 independent inputs, each of 8 bits, so 64 bits of data that we want to map down into a 32-bit seed.

My suggestion to use the full 64 bits and hash them down to 32 bits has one characteristic that simply concatenating the 4 LSBs of each of the 8 inputs doesn't have. You can state that unless all of the data from all of the inputs is identical (in which case you are stuffed anyway) you will only get a collision 1 in 2^32 times, which is the mathematical limit.

By considering only the LSBs, you have chucked away information (the more significant bits) and increased entropy.

In reality, this won't matter, because the 4 LSBs are far more "random" than the more significant bits, and all we are trying to do is avoid a direct collision. (But there, that's a hand-waving argument, isn't it?).

I agree that if we have 8 inputs of 8 bits each, and hence 64 bits available, and we want to compress it down to 32 bits, then whether we use concatenation or hashing there are 2^32 possible antecedents for each identical seed.

The problem with concatenating the bottom 4 bits only is that the pattern appears to be regular - 0000|0110 is the same as 1111|0110 is the same as xxxx|0110. Now in reality, these are major fluctuations because they affect the most significant bits (assuming they came off an A/D converter or something), and we aren't throwing away much information when we discard them. Most of the randomness is carried in the LSBs in terms of short-term fluctuations, unless it's a uniform distribution (and we know it's Gaussian). But nevertheless concatenation does have the philosophical problem that lots of inputs differing by only a single bit (not in the LSBs) produce the same seed. With a hashing algorithm, the inputs that produce the same seed look completely different. There are still 2^32 possible antecedents, but they all look completely different.

Reply to
Peter Webb

It is certainly possible to make hash functions that don't affect the entropy, at least when the size of the input and the output is the same - all you need to do is ensure that it is a one-to-one mapping. But I do not think this is the case for commonly used hash functions designed for mapping larger inputs to a smaller output (such as CRCs or MD5). I don't think they will reduce the entropy much, but they might do so a little. And such functions can effectively move (most of) the entropy into a smaller number of output bits. However, what is certainly the case is that a hash function cannot *increase* the entropy, which is what many people think ("we'll put the samples through a CRC function to make it a bit more random").

OK

That is certainly a logically true statement, but it is totally irrelevant, assuming that the 8 inputs are all uniformly random. You simply have a 1 in 2^32 chance of a collision whatever 32 bits you take from your 64 bits of total samples. It makes no difference whether you get these 32 bits from concatenating specific bit numbers in the samples, or from using your ideal hash function.

If our assumptions are not valid, however, things change somewhat. If the hash function is not quite ideal, then you will lose some entropy when using it.

If the samples are not uniformly random, then we have a more complex situation. In this case, the bits of the samples do not have a full bit of entropy each. A good hash function will concentrate much of the total entropy of the samples in its output (though never the full 32 bits).

However, in many real life sample sources, the LSB will have much higher entropy than the MSB. If the LSB's entropy is close to 1 (i.e., it's just noise), then you can get close to the full 32 bits of entropy by simply concatenating 32 LSBs. Taking more samples (either multiple bits, or just more LSBs) and hashing them *may* give you marginally more entropy, but you have great difficulty getting much improvement, and even more difficulty *proving* that you have more entropy. It's simply not worth the effort unless you have a very poor source of samples in the first place. Thus /dev/random on Linux uses hashes because the entropy source is poor (it's based on things like timings on the network). But in an embedded system where you can measure a noisy signal with a high resolution ADC, you have an obvious source of very high entropy samples.

You mean "decreased entropy" - "entropy" is the randomness of the data.

Yes, in throwing out the higher bits, I've thrown away some information - but I've thrown out the bits with the least entropy, and kept those with the most entropy. You have to throw out some of the bits to reduce the 64-bit data to a 32-bit number. And if the bits you keep are truly random, you've got perfectly good samples - it does not matter if you've thrown out other bits.

No, it's not hand-waving - it's an English-language explanation of a provable mathematical argument. Theoretical hash functions are hand-waving - concatenating known random bits is not.

Again, I must emphasise that you only get good random data from concatenation if you use good random data in the first place. The 4 LSBs of a single ADC sample will not have as much entropy as the LSBs of 4 separate samples.

So what? Unless your source is such that single-bit differences are common (in which case, it is not uniformly random data), it does not matter. Concatenating the low bits will produce collisions if the low bits are the same and the high bits are different - using a hash will produce collisions for other combinations that are extremely hard to predict. But if the source is truly random, the chances of collisions are identical (assuming the hash function is ideal), so there is absolutely nothing to be gained by using more sample bits and a hash function. Concatenating 32 truly random bits is a *perfect* way to get a 32-bit random number - there is no possible way to beat it (there are other equally perfect ways, but nothing better).

The *only* matter to consider is whether we can get 32 bits that are truly random (or at least random enough for the job). The LSB of an ADC sample of a noisy signal is going to give you that randomness. Take 32 such samples and you've got your random id in a simple, efficient manner and you can easily *prove* it to be random. Problem solved.

If you *don't* have such a source of random data, then you need to use hash functions to compress the entropy of a larger sample into a smaller output.

Reply to
David Brown

Thanks for the discussion; very interesting.

I had a related, very-difficult-to-find bug once. In a graphics program, two objects were supposed to have different predetermined trajectories, but actually had the same one. Whenever I breakpointed the program, they took different trajectories. I had instantiated a separate Random object for each; each seeds off the system clock; when the program ran normally they got the same seed (and hence the same trajectory), but when stepping through, the clock values differed and so did the trajectories. Duh ...
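
For what it's worth, the same trap is easy to reproduce in plain C whenever two generators are seeded from a one-second clock (the little LCG below is just an illustrative stand-in, not the Random class above):

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    /* Tiny LCG purely for illustration (Numerical Recipes constants). */
    static uint32_t lcg_next(uint32_t *state)
    {
        *state = *state * 1664525u + 1013904223u;
        return *state;
    }

    int main(void)
    {
        uint32_t rng_a = (uint32_t)time(NULL);   /* two "independent" generators...  */
        uint32_t rng_b = (uint32_t)time(NULL);   /* ...seeded within the same second */

        for (int i = 0; i < 3; i++)              /* almost always prints identical pairs */
            printf("%lu %lu\n",
                   (unsigned long)lcg_next(&rng_a),
                   (unsigned long)lcg_next(&rng_b));
        return 0;
    }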

Reply to
Peter Webb
