🎲 Computers & Randomness

Published on 31st August 2021
10 minute read

Introduction

I think the best way to start would be with an XKCD comic strip on the same topic:

XKCD : Random Number

Computers were designed to be deterministic in nature. If I ask what's 1 + 1, a computer always returns 2 and nothing else. This is an important and non-negotiable prerequisite for us to rely on computers. This sheer predictability of behavior makes computers such a powerful tool that today's generation can't even imagine a world without them.

On the other hand, random sequences are a starting point for a lot of statistical analyses. Even for developing machine learning models, we generally start with random numbers and iteratively build from there. Personally, I have been asking computers to generate random numbers for quite some time now. And as far as my tiny brain can comprehend, they appear to be doing a decent job of it, because every time I ask for a random number... well, I get a different number, and it appears random to me.

The words deterministic and random are pure antonyms of each other, yet somehow deterministic computers are able to generate random numbers. This weird combination made me curious about the inner workings of this seemingly banal but widespread problem of today's computing world. With this newfound curiosity, I started researching the problem a bit more, and it turns out that a lot of people have worked very hard over the years so that we can make computers churn out random numbers ( or seemingly random numbers ... we'll delve into the details shortly ).

What exactly is a random number ?

Let's start by understanding the concept of randomness. Randomness loosely means unpredictability. The more certain we are about something, the less random it appears to be. As an example, let's say someone I don't know asks me to guess his/her name. I can take a guess, but I can never be certain about the answer. I can start with a list of the most common names in my head, but I can never be sure unless someone / something tells me the answer. It's a situation with a large degree of randomness and almost zero certainty: anything can be the correct answer and I have absolutely no idea about it. On the other hand, if I am in a park and have a sudden urge to find which direction north is, then I'd be much more comfortable. Based on the approximate time of day ( whether it's morning or evening ) and the knowledge that the sun rises in the east and sets in the west, I can point to a fixed direction and say that's north. There would still be some randomness in it, because north may lie a bit to the left ( or to the right ), but in this situation there is a lot more certainty and significantly less randomness.

Coming to the case of numbers, a number is truly random only when all the numbers have an equal probability of turning up ( i.e. we have no idea what exactly the next number will be ). Also, the concept of random numbers is meaningful only when we have to generate numbers repeatedly. For example, if we need to generate only one number, then 4 ( or any other number ) is a perfectly good answer. It is only when we have to repeat the same task again and again that the concept of randomness comes into the picture. If our random number generator is programmed to always give 4 as an answer, then it's not random anymore: I can always predict with fair certainty what the next number will be. Experiments that use this random number generator would end up being biased, because the results would tell me nothing about what happens when a 6 ( or any other number ) comes up.
For a random number generator to be perfect :

  1. All numbers should have an equal probability of being the output each and every time we use the generator, and
  2. The next number output should be independent of the previous one.
The first statement holds true only for a class of RNGs ( Random Number Generators ) called uniform random number generators. They are of importance because other classes of RNGs can be derived from a uniform RNG.
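To make point 1 concrete, here's a quick empirical check in Python ( a minimal sketch using the standard library's PRNG; we'll get to how such generators actually work later in the post ). Each digit from 0 to 9 should turn up roughly 10% of the time:

```python
import random
from collections import Counter

rng = random.Random(0)  # seeded so that this little check is itself reproducible
draws = [rng.randint(0, 9) for _ in range(100_000)]

# Point 1 : every digit should appear roughly 10% of the time.
counts = Counter(draws)
for digit in range(10):
    print(digit, round(counts[digit] / len(draws), 3))
```

Point 2 ( independence ) is much harder to verify empirically; entire statistical test suites like Diehard exist for exactly this purpose.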

How good are human beings at the job 🤷‍♀️?

Like on a lot of topics, the Internet seems to be divided on this one. This Reddit post argues that humans don't do a very good job of generating random numbers. According to the post, if a human being is asked to generate a random number between 0 and 10, there is a very high probability that the answer will be 7 and a very low probability that it will be 0. So even though the numbers would be random, they wouldn't be uniformly distributed. An answer on Quora, meanwhile, suggests otherwise: the author explains that if we apply some post-processing on top of human-generated numbers, the results are fairly acceptable.

How did the world do it before computers ?

PS: I'll start my clock around the time of WW-2 because that's when computers were invented and we are primarily interested in how computers do the job.

While researching for this article, I came across mention of a book from the 1950s called A Million Random Digits with 100,000 Normal Deviates that has 400 pages filled with random numbers. If you are interested in flipping through it, a .pdf version is freely available on the publisher's website ( linked with the name ). That was a time unlike today: electronic computers were just being created and the internet hadn't been invented yet, so acquiring information took a considerable amount of effort. If someone had to run statistical experiments, this book was a lifesaver, because it was one less thing on their list: they'd already have an acceptable list of random numbers. With availability of the information solved, speed still remained a bottleneck. I can only imagine how slow those experiments must have been if someone had to manually copy huge lists of these random numbers before starting an experiment.

An electronic computer appears 👾

A team of scientists working on the hydrogen bomb in the late 1940s at Los Alamos National Laboratory came across an interesting problem. In scientific terms, it was related to neutron diffusion in fissionable materials. In simple terms, they were trying to estimate: once a neutron enters a fissionable material, what will the subsequent chain reaction look like? A self-sustaining chain reaction would mean an explosive release of energy in a short span of time, so understanding it was critical to the research. The calculation involved the following components:

  1. Given a neutron with a certain position and velocity, estimate how far the neutron will travel before colliding with an atomic nucleus.
  2. Once a collision has taken place, determine how much energy will be released.
  3. Sometimes the collisions are accompanied by a fission ( the nucleus breaks and releases more neutrons ). So for the collision that just happened, determine whether a nuclear fission took place or not.
  4. If a fission took place, determine how many neutrons were released in the process.
  5. For each of the resulting neutrons, determine the momentum.
  6. Repeat the process again from step 1 using the updated positions and velocities of all the neutrons.
Originally the research relied on large groups of people doing huge numbers of calculations by hand, but during the course of the research, John von Neumann and Stanislaw Ulam realized that they could use ENIAC ( Electronic Numerical Integrator and Computer ) to do these calculations much more quickly using a statistical approach ( as compared to actually solving the differential equations through human calculation ).

PS: The following paragraph is my imagining of how the statistical calculation would have been performed. If you already know what Monte Carlo simulations are, feel free to skip ahead.

Researchers experimentally figured out the probabilities of various possible outcomes and created a computer model of the whole process. A model is just a fancy term for a set of calculations, that's it. In this case, the model performed all the calculations from steps 1-6: it took the initial position and momentum of a neutron and provided the details of the resulting chain reaction as an output. For example, to achieve step 3 above, scientists first figured out the probability of a fission happening on collision using real-world experiments. Then they performed multiple computer simulations of the chain reaction with slightly varying initial speeds and positions of neutrons. Whenever the program said that there was a collision, a random number between 0 and 1 was used to determine whether a fission took place or not. Based on the outcomes of multiple varying inputs, statistical analyses were performed. A statistical analysis could have been as simple as measuring the fraction of outcomes in which an acceptable number of fissions took place, or measuring the average length of the chain reaction, or something entirely different. This experiment popularized what we nowadays call Monte Carlo simulations.
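To make this concrete, here's a toy sketch of such a simulation in Python. Every probability and neutron yield below is a made-up placeholder ( not real fission data ); the point is only to show the shape of the calculation, with a random number between 0 and 1 deciding the fate of each collision:

```python
import random

# All of the numbers below are made-up placeholders, not real fission data.
P_FISSION = 0.4        # probability that a collision results in a fission ( step 3 )
MAX_GENERATIONS = 20   # hard stop so runaway chains don't loop forever

def chain_reaction_size(rng):
    """Simulate one toy chain reaction; return the total number of fissions."""
    neutrons, fissions = 1, 0
    for _ in range(MAX_GENERATIONS):
        next_neutrons = 0
        for _ in range(neutrons):
            # Step 3 : a random number in [0, 1) decides whether this
            # collision results in a fission.
            if rng.random() < P_FISSION:
                fissions += 1
                # Step 4 : each fission releases 2 or 3 new neutrons ( made up ).
                next_neutrons += rng.choice([2, 3])
        neutrons = next_neutrons
        if neutrons == 0:
            break
    return fissions

rng = random.Random(42)  # fixed seed so the experiment is reproducible
runs = [chain_reaction_size(rng) for _ in range(10_000)]
print("average fissions per chain:", sum(runs) / len(runs))
```

Running the model thousands of times with varying inputs and then summarizing the outcomes is, in essence, all a Monte Carlo simulation is.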

Why the name "Monte Carlo" ?

Being a secret, this work required a code name. A colleague of von Neumann and Ulam, Nicholas Metropolis, suggested using the name Monte Carlo, which refers to the Monte Carlo Casino in Monaco where Ulam's uncle would borrow money from relatives to gamble.

It was a nice story and all ... but didn't you skip over the part describing how the random numbers were calculated ?

Ok, coming back to the topic at hand. Using a precompiled list of random numbers was extremely slow: a person had to manually feed the list to the computer, and at that point in time people used punch cards to feed data, so this was a slow and complicated process. It was then that von Neumann developed a way to calculate pseudorandom numbers ( I'll talk about them in the next section; loosely speaking, a pseudorandom number is not random in reality but gives the appearance of randomness ). He called the algorithm the middle-square method ( again, coming up below ). Though this method has been criticized as crude, von Neumann was aware of this: he justified it as being faster than any other method at his disposal, and also noted that when it went wrong it did so in an obvious fashion, unlike methods that could be subtly incorrect. Later on, a paper was published describing the algorithm. In that paper, von Neumann wrote the famous words: "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin".

The above story has been pieced together from the sources mentioned below. Some phrases have been copied over verbatim, while others come from my own understanding of the story.

  1. ENIAC
  2. Monte Carlo Method
  3. The Beginning of Monte Carlo Method

Wtf is a pseudorandom number (PRN)?

I think by this time you will have realized that true random numbers cannot be generated by arithmetic algorithms: once we know the algorithm, we know exactly what the results will look like. Using tables of random numbers was effective but slow. Additionally, the available tables at the time were of limited size and thus became a bottleneck in scaling the experiments. So von Neumann did the next best thing: he created an algorithm which satisfied two important requirements.

  1. It was random enough, i.e. the numbers generated were not random in the true sense of randomness, but were random enough for him to use in his experiments without sacrificing the quality of results.
  2. When performing experiments, it is important to have reproducibility, i.e. one should be able to reproduce the results by providing the same set of inputs to the experimental setup. If von Neumann had used truly random generators, he wouldn't have had the ability to reproduce results at all. PRNs provided a deterministic way to regenerate the sequences of random numbers when the need arose.
The sequences of random numbers generated using such algorithms are called PRNs ( Pseudo Random Numbers ). von Neumann named his algorithm the middle-square method.
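The reproducibility point is easy to demonstrate with any modern PRNG, for example the one in Python's standard library: two generators created with the same seed produce identical "random" sequences.

```python
import random

a = random.Random(1234)
b = random.Random(1234)

# Same seed in, same "random" sequence out : reproducibility for free.
assert [a.random() for _ in range(5)] == [b.random() for _ in range(5)]
```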

So how did this algorithm work ?

The algorithm used a simple mathematical operation to generate n-digit random numbers.

  1. In the first step, a number with n digits was provided to the algorithm as an input. The algorithm squared it, zero-padded the result to 2n digits, and returned the middle n digits.
  2. In subsequent steps, when the algorithm was asked for the next random number, it used the previous output as the new input and returned the next number in the sequence.
This created a chain of random numbers which could be deterministically calculated with only the knowledge of the starting point. This starting point is called the "seed" in PRNG terminology. The next challenge is figuring out this seed, but once we have one, we have the needed sequence of random numbers.
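Here's a minimal sketch of the method in Python ( note the zero-padding of the square to 2n digits, a detail that's easy to miss ):

```python
def middle_square(seed, n_digits=4):
    """von Neumann's middle-square method as an infinite generator."""
    x = seed
    while True:
        # Square the current value and zero-pad it to 2 * n_digits digits.
        squared = str(x * x).zfill(2 * n_digits)
        # The middle n_digits digits become the next number in the chain.
        start = (len(squared) - n_digits) // 2
        x = int(squared[start:start + n_digits])
        yield x

gen = middle_square(1234)
print([next(gen) for _ in range(5)])  # [5227, 3215, 3362, 3030, 1809]
```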
A common drawback of PRNGs is getting stuck in cycles. It is entirely possible that the PRNG outputs a number which we have already seen before, and once we feed it back to the algorithm, the cycle keeps on repeating. Most of today's commonly used algorithms have very long cycles ( the widely used Mersenne Twister, for instance, has a period of 2^19937 - 1 ), but cycles do exist.
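A few lines on top of the generator above are enough to measure how long a chain runs before it repeats:

```python
def chain_length(seed, n_digits=4):
    """Count the distinct values ( including the seed ) before the first repeat."""
    seen = {seed}
    for value in middle_square(seed, n_digits):
        if value in seen:
            return len(seen)
        seen.add(value)
```

The loop is guaranteed to terminate: with only 10^n_digits possible values, a repeat must eventually occur.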

How is the seed determined ?

I couldn't figure out what numbers von Neumann used as seeds in his experiments, but based on my experience, nowadays we generally use either a fixed number of our choosing ( for reproducibility ) or something fluctuating ( if reproducibility is not necessary ). The most common fluctuating quantity is time ( more specifically, the number of milliseconds elapsed since 1970 ).
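In Python, for example, both conventions are a one-liner ( calling random.seed() with no argument already pulls entropy from the operating system; the explicit millisecond version below just mirrors the convention described above ):

```python
import random
import time

random.seed(42)                       # fixed seed : every run is reproducible
random.seed(time.time_ns() // 10**6)  # milliseconds since 1970 : every run differs
```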

Middle Square Method

As mentioned above, the algorithm was not too good: it got stuck in cycles too quickly, and more often than not the sequences decayed to 0. Consider a toy PRNG based on this algorithm that generates random numbers of 4 digits. Trying out different seeds shows how short the cycles are. For example, if you start with 3792 as the seed, the sequence repeats immediately, i.e. the chain has length 1 ( with just the seed as an element ), whereas if you pick 6239 as your seed, the cycle only starts after 111 numbers.
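You can verify both of these claims with the chain_length helper from the previous section:

```python
print(chain_length(3792))  # 1 : 3792² = 14379264, whose middle four digits are 3792 again
print(chain_length(6239))  # 111, matching the maximum chain length in the table below
```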

Statistics

Some statistics on how the chain lengths vary as the number of digits increases:

| Sl No. | # digits | Minimum chain length | Maximum chain length | Average chain length |
| --- | --- | --- | --- | --- |
| 1 | 2 | 1 ( sample seeds : 10, 60 ) | 15 ( sample seeds : 42, 69 ) | 6 |
| 2 | 4 | 1 ( sample seeds : 2500, 3792 ) | 111 ( sample seed : 6239 ) | 44 |
| 3 | 6 | 1 ( sample seed : 495475 ) | 943 ( sample seed : 229339 ) | 327 |

What Next ?

The middle-square method is now a relic of the past; many new algorithms have been developed since. Wikipedia has a list with a short description of each. Interestingly, in 2017 a modification was suggested to the middle-square algorithm which rids it of its follies. Here's a link to the paper describing the modification.
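The paper proposes what its author calls the Middle Square Weyl Sequence RNG: the square is perturbed by a Weyl sequence ( a counter that grows by a fixed odd constant ), which prevents the collapse to zero and the short cycles. Below is a rough Python sketch of the idea based on my reading of the paper; the constant and the output width are quoted from memory, so treat it as illustrative rather than a faithful reference implementation:

```python
MASK64 = (1 << 64) - 1

def msws(s=0xB5AD4ECEDA1CE2A9):  # s is the Weyl increment; it must be odd
    """Middle Square Weyl Sequence RNG ( sketch ) : yields 32-bit outputs."""
    x = w = 0
    while True:
        x = (x * x) & MASK64                   # square, as in the original method
        w = (w + s) & MASK64                   # Weyl sequence : doesn't repeat for 2^64 steps
        x = (x + w) & MASK64                   # the Weyl term breaks the old short cycles
        x = ((x >> 32) | (x << 32)) & MASK64   # swapping the halves ≈ taking the middle
        yield x & 0xFFFFFFFF
```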

tl;dr

Computers cannot generate true randomness; however, they have access to signals which appear random to us. Using these signals as seeds, a pseudorandom number generator can produce numbers which look completely random. Numerous algorithms have been developed over the years to perform this activity. The first one, called the Middle Square Method, was developed while researching the hydrogen bomb. Coincidentally, that project also popularized Monte Carlo simulations ( which, in general, rely on random number generators to function ). The algorithm was not very good because it kept getting stuck in cycles, i.e. it kept generating the same sequence of numbers after a while, but it ultimately paved the way for research on computational PRNGs.