A very small shell script [entries|reading|network|archive]
simont

[ userinfo | dreamwidth userinfo ]
[ archive | journal archive ]

Wed 2009-02-04 10:54
A very small shell script
LinkReply
[personal profile] simontWed 2009-02-04 14:18
I actually did almost replace $((RANDOM % 3)) with $((RANDOM / 10923)). I decided not to, in the end, because the former is more readable, and ultimately the aim of this script is to be read and understood by humans.

However, thinking about it further ... (Warning: extreme pedantry ahead.)

You're certainly right in general that with several types of commonly used low-end RNG it's better to take the most significant part of an output value rather than the least significant part, since state only propagates up the word and never down, and hence (e.g.) taking the k low bits of each output from an n-bit linear-congruential generator will give you a period of at most 2k instead of the 2n you'd get by taking the k high bits. Other types of RNG don't suffer from this at all: a linear feedback shift register, for instance, generates a stream of bits with no bit any worse than any other, so if you had one of those then you could safely reduce its output by whatever means was most convenient for you. However, after a quick source dive it looks as if the RNG behind bash's $RANDOM shell variable is indeed a linear congruential one; so yes, this would impair its quality, at least in principle.

However, that analysis applies to the situation where you're extracting many values from the RNG in succession without re-seeding it. Every fresh instance of bash will re-seed its RNG – and since this script only generates one random number during its entire run, it will generate only one random number per re-seed. So in fact the output values will be the seed values fed through one or other arbitrary transformation, and neither of those transformations looks to me as if it'll be likely to turn a crappy stream of seeds (bash mostly seems to seed off its PID) into particularly random-looking output.

To verify this, I've just tried out both versions of the script on a 1000-long test run. Both, in fact, seem pretty poor but in different ways: the mod-3 version as shown above seems to generate a lot of contiguous sequences of four or more of the same phrase, whereas the division version often seems to generate the same phrase every other run for even longer stretches. Short of using /dev/urandom, I suspect the only way to really do better would be to compute some sort of cryptographic hash (e.g. MD5) of the current high-resolution time and the pid. Which would be feasible in a shell script, admittedly, if even less portable. (I'm already unhappy that I had to make the script bash-specific in order to have it be concise enough to still be funny.)

Incidentally, it's also a mistake to consider this sort of thing to be a cryptographic problem: any such RNG is one you certainly should not have been using for crypto purposes in the first place, and any crypto-quality RNG should be equally (or, at least, sufficiently) strong at both ends of its output.

Oh, and finally: Zaphod's original specification of the Arthur Dent script never mentioned good randomness. All you have to do is to program it to say three things. My script definitely says three things. :-þ
Link Reply to this | Parent | Thread
[personal profile] simontWed 2009-02-04 14:47
Here's some slightly more rigorous analysis.

In the red corner, we have my original script:

#!/bin/bash
case $((RANDOM % 3)) in
  0) echo "What?" ;;
  1) echo "I don't understand." ;;
  2) echo "Where's the tea?" ;;
esac
In the blue corner, the version which uses the most rather than least significant part of each output word:
#!/bin/bash
case $((RANDOM / 10923)) in
  0) echo "What?" ;;
  1) echo "I don't understand." ;;
  2) echo "Where's the tea?" ;;
esac
And in the green corner, the MD5-based version:
#!/bin/bash
x=`date +$$.%F%T%N | md5sum | head -c7`
case $((0x$x % 3)) in
  0) echo "What?" ;;
  1) echo "I don't understand." ;;
  2) echo "Where's the tea?" ;;
esac
(The last one depends not only on bash but also on the GNU versions of date and head, plus a working md5sum.)

We now generate a file of test output from each one by means of

for i in {1..1000}; do ./vsss; done > output
and then analyse each output file by correlating triples of successive output values:
perl -ne 'chomp; $a=$b;$b=$c;$c=$_;
          $x{"$a:$b:$c"}++ if defined $a;
          END{printf "%s\n", join ",",values %x}' output
And the results are, respectively:
1,52,1,2,4,63,3,62,3,62,96,63,3,53,52,51,1,1,63,102,63,1,51,96,49
27,149,28,1,1,2,28,4,1,3,146,148,148,1,1,155,155
39,45,42,36,41,33,41,36,37,39,39,36,22,37,45,38,34,40,38,38,36,28,34,38,33,41,32
See what I mean? Both the ones based on bash's $RANDOM are utter rubbish, but the MD5 version has actually managed to produce all 27 possible triples at plausibly similar frequencies. I think if anything my original mod-3 version did slightly better than the dividing one: it at least managed to produce 25 of the 27 possible triples to the dividing version's 17, even if its distribution still sucked.

eta: indeed, a chi-squared test confirms that the mod-3 version is the better of the two simple ones. Chi-squared values for the three output distributions are respectively 849, 2728 and 16.5. (With 26 degrees of freedom, you expect a properly random output to be somewhere below 30-50.)

Link Reply to this | Parent
[identity profile] cartesiandaemon.livejournal.comThu 2009-02-05 15:51
Oh, thank you. You're quite right, I'd hoped someone would do the analysis. The appropriate model should either be a program in a loop, or a program run less than every second (or however long it takes to say "what"). I just had an uneasy feeling that something was wrong.

Zaphod's original specification of the Arthur Dent script never mentioned good randomness.

Yeah, that's what I was thinking, but it didn't manage to prevent me saying anything. Of course, if you put all this in a footnote on the T-shirt, the T-shirt would be really geeky :)
Link Reply to this | Parent
navigation
[ go | Previous Entry | Next Entry ]
[ add | to Memories ]