Many machine learning papers report impressive results, 99% accuracy or more. Some of them provide source code or binaries so that people can run the application on their own computers. The problem is that those applications are notorious for yielding different results on every run. You might get 99% accuracy once, then run it again and see 97%. This post explains how to make sure your machine learning models produce the same result when they are given the same data. The examples below use Python and PyTorch, but the concepts may apply to other programming languages and libraries.
Setting Seeds
In deep learning, weights and biases are often initialised randomly, and so is the order in which input rows are fed to the model. NumPy has the numpy.random module for this kind of work. Python's built-in random module is also often used to generate random numbers or to shuffle lists. PyTorch, Keras, and Scikit-learn also have some non-deterministic functions, which make them yield different output on every run.
All of these rely on pseudorandom number generators (PRNGs). As the name suggests, the numbers generated by a PRNG are not actually random. They look arbitrary, but they are computed from an initial value. As long as we give the PRNG the same initial value, it always outputs the exact same sequence. This initial value is called a seed. Setting seeds is therefore essential for reproducible results. Most importantly, you have to set the seed for every library that uses a PRNG, because each library usually has its own PRNG with its own seed.
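To make the idea of "calculations with an initial value" concrete, here is a minimal sketch of a toy PRNG, a linear congruential generator. The function name lcg and its constants are my own illustrative choices; this is not the algorithm that Python, NumPy, or PyTorch actually use.

```python
def lcg(seed, n, a=1664525, c=1013904223, m=2**32):
    """Return n pseudorandom floats in [0, 1) from a toy linear congruential generator."""
    state = seed
    numbers = []
    for _ in range(n):
        # The next state depends only on the previous state, so the seed fixes everything
        state = (a * state + c) % m
        numbers.append(state / m)
    return numbers


first = lcg(666, 5)
second = lcg(666, 5)
other = lcg(667, 5)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```

Because the whole sequence follows from the seed, fixing the seed fixes every "random" number the generator will ever produce.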
To set seeds for the numpy and random modules, put these lines at the beginning of your Python scripts or Jupyter notebooks:
import random
import numpy

numpy.random.seed(666)
random.seed(666)
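You can use any number as the argument. As long as this number stays the same, numpy and random will always generate the same random numbers. However, I once read a post on StackOverflow saying that random sometimes doesn't generate the same set of numbers even after the seed is set. Unfortunately, I can't find that post again, so treat it as unverified. What the Python documentation does say is that some functions in random may produce different sequences on different Python versions, so keep your Python version fixed as well. I'd still recommend using numpy.random where you can.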
As for PyTorch, the documentation provides some guidelines[1] for making your results reproducible, or at least close enough. Add these lines at the beginning of your code:
import torch

torch.manual_seed(666)

# If you're using cuDNN
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
Again, the seed can be any number, and it doesn't have to be the same number you use for numpy.
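The seeding calls in this section can be gathered into a single helper so that none of them is forgotten. This is only a sketch: the name seed_everything is hypothetical, and the optional imports are my assumption so that the function still works when NumPy or PyTorch is not installed.

```python
import random


def seed_everything(seed=666):
    """Seed every PRNG that is importable here and return the names that were seeded."""
    seeded = ["random"]
    random.seed(seed)

    try:
        import numpy
        numpy.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass  # NumPy is not installed, nothing to seed

    try:
        import torch
        torch.manual_seed(seed)
        # cuDNN settings from the PyTorch reproducibility guidelines
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        seeded.append("torch")
    except ImportError:
        pass  # PyTorch is not installed, nothing to seed

    return seeded


print(seed_everything(666))
```

Calling it once at the top of a script or notebook replaces the separate seeding lines, and the returned list shows which libraries were actually seeded.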
Avoid Iterating over Dictionaries
This point is often forgotten because we usually pay more attention to setting seeds.
A dictionary is a data structure that maps keys to values. We often use dictionaries to store a collection of items and then iterate over them. The problem is that when we iterate over a dictionary on older Python versions (before 3.7), the interpreter may hand us the elements in arbitrary order. Consider running the following code:
items = {"books": 2, "computers": 3, "phones": 5, "bags": 0}
for key, value in items.items():
    print(key, value)
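On older interpreters (roughly Python 3.3 to 3.5, where string hashing is randomised per process), the output can be different each time it runs. Sometimes “books” appears first, other times “phones” does. Since Python 3.7, a plain dict preserves insertion order, but making the order explicit keeps the code safe on any interpreter. Thus, I would recommend using a list or an OrderedDict instead. The preceding code could be modified to something like this: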
ix_items = ["books", "computers", "phones", "bags"]
items = {"books": 2, "computers": 3, "phones": 5, "bags": 0}
# The list fixes the iteration order; the dictionary is only used for lookups
for key in ix_items:
    print(key, items[key])
or like this:
from collections import OrderedDict

# Sort the (key, value) pairs first; sorting the dict itself would return only its keys
items = OrderedDict(sorted({"books": 2, "computers": 3, "phones": 5, "bags": 0}.items()))
for key, value in items.items():
    print(key, value)
In my opinion, the second solution is neater because we won't have to keep two variables in sync. An OrderedDict remembers the order in which its elements were inserted, and because the items are sorted by key before insertion, that order is the same on every run.
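If you only need a stable order while looping, a lighter variant is to sort the keys at iteration time. This is a sketch, not part of the two solutions above, and the helper name iterate_sorted is hypothetical.

```python
def iterate_sorted(mapping):
    """Return (key, value) pairs in sorted key order, independent of insertion order."""
    return [(key, mapping[key]) for key in sorted(mapping)]


a = {"books": 2, "computers": 3, "phones": 5, "bags": 0}
# Same content as `a`, but inserted in a different order
b = {"phones": 5, "bags": 0, "books": 2, "computers": 3}

for key, value in iterate_sorted(a):
    print(key, value)

print(iterate_sorted(a) == iterate_sorted(b))
```

Both dictionaries produce the same sequence of pairs, so the loop does not depend on how or where the dictionary was built.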
That being said, I hope this information is useful for everyone who works on machine learning. If you have any suggestions for making results reproducible, please write them in the comments below.
References