I do love that script. (Although I keep wondering, is it safe to take the remainder of the random number? I thought that was risky. Although admittedly, even if it were, I don't expect Arthur Dent is cryptographically consistent.)
I actually did almost replace $((RANDOM % 3)) with $((RANDOM / 10923)). I decided not to, in the end, because the former is more readable, and ultimately the aim of this script is to be read and understood by humans.
However, thinking about it further ... (Warning: extreme pedantry ahead.)
You're certainly right in general that with several types of commonly used low-end RNG it's better to take the most significant part of an output value rather than the least significant part, since state only propagates up the word and never down, and hence (e.g.) taking the k low bits of each output from an n-bit linear-congruential generator will give you a period of at most 2k instead of the 2n you'd get by taking the k high bits. Other types of RNG don't suffer from this at all: a linear feedback shift register, for instance, generates a stream of bits with no bit any worse than any other, so if you had one of those then you could safely reduce its output by whatever means was most convenient for you. However, after a quick source dive it looks as if the RNG behind bash's $RANDOM shell variable is indeed a linear congruential one; so yes, this would impair its quality, at least in principle.
However, that analysis applies to the situation where you're extracting many values from the RNG in succession without re-seeding it. Every fresh instance of bash will re-seed its RNG – and since this script only generates one random number during its entire run, it will generate only one random number per re-seed. So in fact the output values will be the seed values fed through one or other arbitrary transformation, and neither of those transformations looks to me as if it'll be likely to turn a crappy stream of seeds (bash mostly seems to seed off its PID) into particularly random-looking output.
To verify this, I've just tried out both versions of the script on a 1000-long test run. Both, in fact, seem pretty poor but in different ways: the mod-3 version as shown above seems to generate a lot of contiguous sequences of four or more of the same phrase, whereas the division version often seems to generate the same phrase every other run for even longer stretches. Short of using /dev/urandom, I suspect the only way to really do better would be to compute some sort of cryptographic hash (e.g. MD5) of the current high-resolution time and the pid. Which would be feasible in a shell script, admittedly, if even less portable. (I'm already unhappy that I had to make the script bash-specific in order to have it be concise enough to still be funny.)
Incidentally, it's also a mistake to consider this sort of thing to be a cryptographic problem: any such RNG is one you certainly should not have been using for crypto purposes in the first place, and any crypto-quality RNG should be equally (or, at least, sufficiently) strong at both ends of its output.
Oh, and finally: Zaphod's original specification of the Arthur Dent script never mentioned good randomness. All you have to do is to program it to say three things. My script definitely says three things. :-þ
See what I mean? Both the ones based on bash's $RANDOM are utter rubbish, but the MD5 version has actually managed to produce all 27 possible triples at plausibly similar frequencies. I think if anything my original mod-3 version did slightly better than the dividing one: it at least managed to produce 25 of the 27 possible triples to the dividing version's 17, even if its distribution still sucked.
eta: indeed, a chi-squared test confirms that the mod-3 version is the better of the two simple ones. Chi-squared values for the three output distributions are respectively 849, 2728 and 16.5. (With 26 degrees of freedom, you expect a properly random output to be somewhere below 30-50.)
Oh, thank you. You're quite right, I'd hoped someone would do the analysis. The appropriate model should either be a program in a loop, or a program run less than every second (or however long it takes to say "what"). I just had an uneasy feeling that something was wrong.
Zaphod's original specification of the Arthur Dent script never mentioned good randomness.
Yeah, that's what I was thinking, but it didn't manage to prevent me saying anything. Of course, if you put all this in a footnote on the T-shirt, the T-shirt would be really geeky :)
I do love that script. (Although I keep wondering, is it safe to take the remainder of the random number? I thought that was risky. Although admittedly, even if it were, I don't expect Arthur Dent is cryptographically consistent.)
$((RANDOM % 3))with$((RANDOM / 10923)). I decided not to, in the end, because the former is more readable, and ultimately the aim of this script is to be read and understood by humans.However, thinking about it further ... (Warning: extreme pedantry ahead.)
You're certainly right in general that with several types of commonly used low-end RNG it's better to take the most significant part of an output value rather than the least significant part, since state only propagates up the word and never down, and hence (e.g.) taking the k low bits of each output from an n-bit linear-congruential generator will give you a period of at most 2k instead of the 2n you'd get by taking the k high bits. Other types of RNG don't suffer from this at all: a linear feedback shift register, for instance, generates a stream of bits with no bit any worse than any other, so if you had one of those then you could safely reduce its output by whatever means was most convenient for you. However, after a quick source dive it looks as if the RNG behind
bash's$RANDOMshell variable is indeed a linear congruential one; so yes, this would impair its quality, at least in principle.However, that analysis applies to the situation where you're extracting many values from the RNG in succession without re-seeding it. Every fresh instance of
bashwill re-seed its RNG – and since this script only generates one random number during its entire run, it will generate only one random number per re-seed. So in fact the output values will be the seed values fed through one or other arbitrary transformation, and neither of those transformations looks to me as if it'll be likely to turn a crappy stream of seeds (bashmostly seems to seed off its PID) into particularly random-looking output.To verify this, I've just tried out both versions of the script on a 1000-long test run. Both, in fact, seem pretty poor but in different ways: the mod-3 version as shown above seems to generate a lot of contiguous sequences of four or more of the same phrase, whereas the division version often seems to generate the same phrase every other run for even longer stretches. Short of using
/dev/urandom, I suspect the only way to really do better would be to compute some sort of cryptographic hash (e.g. MD5) of the current high-resolution time and the pid. Which would be feasible in a shell script, admittedly, if even less portable. (I'm already unhappy that I had to make the scriptbash-specific in order to have it be concise enough to still be funny.)Incidentally, it's also a mistake to consider this sort of thing to be a cryptographic problem: any such RNG is one you certainly should not have been using for crypto purposes in the first place, and any crypto-quality RNG should be equally (or, at least, sufficiently) strong at both ends of its output.
Oh, and finally: Zaphod's original specification of the Arthur Dent script never mentioned good randomness. All you have to do is to program it to say three things. My script definitely says three things. :-þ
In the red corner, we have my original script:
In the blue corner, the version which uses the most rather than least significant part of each output word:And in the green corner, the MD5-based version:(The last one depends not only onbashbut also on the GNU versions ofdateandhead, plus a workingmd5sum.)We now generate a file of test output from each one by means of
and then analyse each output file by correlating triples of successive output values:And the results are, respectively:See what I mean? Both the ones based onbash's$RANDOMare utter rubbish, but the MD5 version has actually managed to produce all 27 possible triples at plausibly similar frequencies. I think if anything my original mod-3 version did slightly better than the dividing one: it at least managed to produce 25 of the 27 possible triples to the dividing version's 17, even if its distribution still sucked.eta: indeed, a chi-squared test confirms that the mod-3 version is the better of the two simple ones. Chi-squared values for the three output distributions are respectively 849, 2728 and 16.5. (With 26 degrees of freedom, you expect a properly random output to be somewhere below 30-50.)
Zaphod's original specification of the Arthur Dent script never mentioned good randomness.
Yeah, that's what I was thinking, but it didn't manage to prevent me saying anything. Of course, if you put all this in a footnote on the T-shirt, the T-shirt would be really geeky :)