Last time: more slicing of numpy arrays, messing around with images (M, N, 3), histograms.
Today: Randomness
Recall one little thing from last time:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
xs = [0,1,2,3]
ys = xs[0:2]
ys[0] = 999
print(xs, ys)
On the other hand, with numpy arrays:
X = np.array([0,1,2,3])
Y = X[0:2]
Y[0] = 999
print(X, Y)
In lists, slicing always makes a fresh copy, whereas for numpy arrays, slicing makes a reference to that part of the array. This makes sense because if you have some huge dataset, you don't want to copy the whole dataset when you want to look at a slice of it.
Randomness is used a lot both in mathematics and the real world.
Generally, a random number comes from a probability distribution.
The distribution might be discrete: i.e. it comes from a set
$$ \{ (x_1, p_1), ..., (x_n, p_n) \},$$where you get outcome $x_i$ with probability $p_i$. It is assumed that $\sum_i p_i = 1$ (if not you normalize the p's so their sum is 1). The function that takes $x_i \mapsto p_i$ is called the probability mass function.
For continuous random numbers, one normally uses a probability density function. E.g. the normal distribution comes from the function.
$$ p(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp({-\frac{1}{2}(\frac{x}{\sigma})^2})$$The probability of a random number from this distribution being in the interval $[a,b]$ is then:
$$\int_a^b p(x)dx$$Most well known distributions are the uniform distribution and the normal distribution.
# let's graph the density function of the normal distribution.
from math import pi, sqrt, e
X = np.linspace(-5,5,300)
Y = (lambda x: 1/sqrt(2*pi)*e**(-0.5*x*x))(X)
plt.plot(X, Y)
The easiest distribution is the uniform distribution (all numbers in a given range are equally likely).
In python it's the function random.random() that will produce a random number in $(0,1)$.
import random
random.seed(42) #optional: the seed will initialize the random number generator
for i in range(15):
r = random.random()
print(r)
How would we turn this into random numbers from a to b?
def rnum(a,b):
return a + (b-a)*random.random()
for i in range(10):
print(rnum(-1,1))
You can do random arrays in numpy
N = 300
X = np.random.uniform(low=0,high=1,size=N)
Y = np.random.uniform(low=0,high=1,size=N)
plt.scatter(X,Y)
X = np.linspace(0,1,100)
Y = 3 * X + 1
# let's add some noise
Z = 3 * X + 1 + np.random.normal(0,1,X.shape[0]) # we'll see much more about randomness later
plt.scatter(X,Z) # we use a scatter plot
plt.plot(X,Y, "r")
Let's add random noise to a linear function.
N = 20000
X = np.zeros(N)
for i in range(N):
X[i] = random.randint(1, 6) # from 1 to 6 inclusive
print(np.mean(X))
N = 50
X = np.zeros(N)
mu = 0.0
sigma = 1.0
for i in range(N):
X[i] = random.gauss(mu, sigma)
plt.scatter(X, np.zeros(N) + 0.5)
# numpy version:
N = 50
mu = 0.0
sigma = 1.0
X = np.random.normal(loc=mu, scale=sigma, size=N)
plt.scatter(X, np.zeros(N) + 0.5)
Looks like it's more concentrated around zero. How can we see it better? Histogram.
N = 500000
mu = 0.0
sigma = 1.0
X = np.random.normal(loc=mu, scale=sigma, size=N)
plt.axis([-6, 6, 0, 0.45])
_,_,_ = plt.hist(X, 50, normed=True)
Sigma is the standard deviation, which measures how wide the normal distribution is. For example:
N = 500000
mu = 0.0
sigma = 2.0 # highers standard dev
X = np.random.normal(loc=mu, scale=sigma, size=N)
plt.axis([-6, 6, 0, 0.45])
_,_,_ = plt.hist(X, 50, normed=True)
# looks the same but look at the numbers above and below
You can compute the mean and standard deviation of any data:
np.mean(X)
np.std(X)
In general, if X is my data-set, then the normal distribution withm mu = np.mean(X)
, and sigma = np.std(X)
will fit the data best.
If a data distribution normal then about 68 percent of the data values are within one standard deviation of the mean.
Histogram of uniform distribution.
N = 50000
X = np.random.uniform(low=0, high=1, size=N)
_, _, _ = plt.hist(X, 50)
The bus from my street to work is scheduled at 8:00 A.M. I measured that the bus' arrival time has a Gaussian (normal) distribution with mean 8:05 and a standard deviation of 4 minutes.
If I miss the bus, the bus after that will come 5 minutes after the first bus (no delays).
When should I get to the bus stop to minimize my total expected waiting time? (I don't care how late I am. I just don't want to wait for the bus at the stop. I'm weird like that.)
Let's simulate.
import random
time_till_next_bus = 50
mu = 0 # 8:05 is time 0
sigma = 4
N = 10000 # number of simulations
arrs = []
for my_arrival_at_stop in range(-20,20):
sum_of_waiting_times = 0.0
for i in range(N):
bus_arrival = random.gauss(mu, sigma)
if my_arrival_at_stop < bus_arrival:
sum_of_waiting_times += bus_arrival - my_arrival_at_stop
elif bus_arrival + time_till_next_bus - my_arrival_at_stop > 0:
sum_of_waiting_times += bus_arrival + time_till_next_bus - my_arrival_at_stop
# if i missed the next bus too, don't count that one.
print("my arrival:", my_arrival_at_stop, "simulated total waiting time", sum_of_waiting_times / N)
arrs.append(sum_of_waiting_times / N)
plt.plot(range(-20,20), arrs)