GSoC 2021: end-of-summer report

Posted on Aug 16, 2021
tl;dr: Final report for the Google Summer of Code (GSoC) 2021 program on converting Matlab examples to Jax for the book Probabilistic Machine Learning.

Introduction

In this post I report on my work during the Google Summer of Code (GSoC) 2021 program. During the summer I contributed to Volume 2 of the Probabilistic Machine Learning book by Kevin P. Murphy.

The GSoC 2021 TF/Pyprobml program was split across two GitHub repositories: a public repo, probml/pyprobml, where the final code was pushed, and a private repo, pyprobml/hermes, where discussions and code reviews took place.

Highlights

  • I spent 394.87 hours working on the project over roughly 83 days of work;
  • 75% of the time was spent writing code, the other 25% was spent on code reviews, meetings, and other non-code work;
  • I spent around 32 hours per week on the project;
  • I tackled 41 issues on the private repo (see The Summary for details);
  • I created a total of 39 pull requests in the probml/pyprobml repo (see The Summary for details);
  • I wrote a total of 6,165 lines of code;
  • I learned about the Jax library for high-performance machine learning.

Main contributions

During GSoC 2021 I mostly worked on dynamical systems and their applications to Machine Learning. One of my initial tasks was to implement the Kalman Filter, the classical algorithm for estimating the latent state of a linear dynamical system from its observations. What began as a relatively straightforward task grew over the summer: by the end of GSoC I had implemented a state-space model for the Kalman Filter using the Jax library (see PR 591). Along the way I learned that there is (almost always) a faster way to write a loop using Jax.
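To illustrate the point about loops, here is a minimal sketch of a Kalman Filter whose time loop is expressed with `jax.lax.scan` instead of a Python `for` loop. It is not the code from PR 591: the function name `kalman_filter`, the argument names, and the toy data (noisy observations of a constant level) are assumptions made for this example.

```python
import jax
import jax.numpy as jnp


def kalman_filter(ys, A, C, Q, R, mu0, Sigma0):
    """Filter observations ys of shape (T, d_obs) for the model
    x_t = A x_{t-1} + q_t,  y_t = C x_t + r_t,
    with q_t ~ N(0, Q) and r_t ~ N(0, R).
    (mu0, Sigma0) is the state belief before the first predict step."""

    def step(carry, y):
        mu, Sigma = carry
        # Predict: push the previous belief through the dynamics.
        mu_pred = A @ mu
        Sigma_pred = A @ Sigma @ A.T + Q
        # Update: correct the prediction with the new observation.
        S = C @ Sigma_pred @ C.T + R
        K = jnp.linalg.solve(S, C @ Sigma_pred).T  # Sigma_pred C^T S^{-1}
        mu_new = mu_pred + K @ (y - C @ mu_pred)
        Sigma_new = Sigma_pred - K @ S @ K.T
        return (mu_new, Sigma_new), (mu_new, Sigma_new)

    # scan threads the carry through time and stacks the per-step outputs.
    _, (mus, Sigmas) = jax.lax.scan(step, (mu0, Sigma0), ys)
    return mus, Sigmas


# Toy data (assumption): 50 noisy observations of a constant level.
key = jax.random.PRNGKey(0)
ys = 2.0 + jax.random.normal(key, (50, 1))
A, C, R = jnp.eye(1), jnp.eye(1), jnp.eye(1)
Q = jnp.zeros((1, 1))  # no process noise: the latent level is static
mus, Sigmas = kalman_filter(ys, A, C, Q, R, jnp.zeros(1), 100.0 * jnp.eye(1))
```

Because the whole recursion lives inside `scan`, it is traced once and can be wrapped in `jax.jit` without unrolling the time loop.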

Furthermore, I learned how linear dynamical systems are extended to nonlinear systems through approximations such as the Extended Kalman Filter (EKF), the Unscented Kalman Filter (UKF), the Rao-Blackwellised Particle Filter (RBPF), the Bootstrap Particle Filter (BPF), and the Exponential-family EKF (EEKF). In the following images, I present examples of these algorithms.

An interesting application of dynamical systems in Machine Learning, and my most challenging issue this summer, was the implementation of the training loop of a Multilayer Perceptron (MLP) using the EKF and UKF algorithms. The goal of this task was to train an MLP in a single pass over the data. I learned that, from a dynamical system’s point of view, we can treat the weights of the MLP as the latent (state) space and the output of the MLP as the observation (measurement) space. If we assume that the weights are static over time, we can use the EKF and UKF algorithms to estimate the weights of the MLP. The following is a video that shows the training loop of an MLP using the EKF algorithm.
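The idea of treating the weights as a static latent state can be sketched as follows. This is a simplified illustration, not the code from the pull requests: the architecture (`N_IN`, `N_HIDDEN`), the noise level `OBS_VAR`, the prior covariance, and the toy regression data are all assumptions. Each EKF step linearises the MLP around the current weight estimate and applies a Kalman update.

```python
import jax
import jax.numpy as jnp

N_IN, N_HIDDEN = 1, 6                  # hypothetical architecture
N_PARAMS = N_HIDDEN * N_IN + 2 * N_HIDDEN + 1
OBS_VAR = 0.1                          # assumed observation-noise variance


def mlp(w, x):
    """One-hidden-layer MLP with all weights packed into the flat vector w."""
    i = N_HIDDEN * N_IN
    W1 = w[:i].reshape(N_HIDDEN, N_IN)
    b1 = w[i:i + N_HIDDEN]
    W2 = w[i + N_HIDDEN:i + 2 * N_HIDDEN]
    b2 = w[-1]
    return W2 @ jnp.tanh(W1 @ x + b1) + b2   # scalar output


def ekf_step(carry, xy):
    w, P = carry                       # mean and covariance of the weights
    x, y = xy
    # Static weights: the predict step leaves (w, P) unchanged.
    H = jax.grad(mlp)(w, x)[None, :]   # Jacobian of the output w.r.t. w
    S = H @ P @ H.T + OBS_VAR          # innovation variance, shape (1, 1)
    K = P @ H.T / S                    # Kalman gain, shape (N_PARAMS, 1)
    w_new = w + K[:, 0] * (y - mlp(w, x))
    P_new = P - K @ S @ K.T
    return (w_new, P_new), w_new


# Toy 1d regression problem (assumption).
key_x, key_noise, key_w = jax.random.split(jax.random.PRNGKey(0), 3)
xs = jax.random.uniform(key_x, (200, N_IN), minval=-2.0, maxval=2.0)
noise = jnp.sqrt(OBS_VAR) * jax.random.normal(key_noise, (200,))
ys = 2.0 + jnp.sin(2.0 * xs[:, 0]) + noise

w0 = 0.1 * jax.random.normal(key_w, (N_PARAMS,))
P0 = jnp.eye(N_PARAMS)
# A single pass over the data: one EKF update per observation.
(w_final, P_final), w_hist = jax.lax.scan(ekf_step, (w0, P0), (xs, ys))

predict = jax.vmap(mlp, in_axes=(None, 0))
mse_before = jnp.mean((predict(w0, xs) - ys) ** 2)
mse_after = jnp.mean((predict(w_final, xs) - ys) ** 2)
```

Comparing `mse_before` with `mse_after` gives a rough check that the single pass has moved the weights towards the data; `w_hist` holds the weight estimate after every observation, which is the kind of trajectory a training animation can be built from.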

Finally, one of my favourite visual examples was the creation of a graph that illustrates the Markov Chain Monte Carlo (MCMC) sampling process using the Metropolis-Hastings algorithm. It is a succinct and elegant way to visualise the MCMC sampling process using different step sizes (see PR 483).
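The effect of the step size can be reproduced with a small random-walk Metropolis-Hastings sampler. The sketch below is not the code from PR 483: the target (an equal-weight mixture of two unit-variance Gaussians), the function names, and the step sizes are assumptions chosen for illustration.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm


def log_target(x):
    # Hypothetical target: 0.5 * N(-2, 1) + 0.5 * N(2, 1).
    return jnp.logaddexp(norm.logpdf(x, -2.0, 1.0),
                         norm.logpdf(x, 2.0, 1.0)) + jnp.log(0.5)


def metropolis_hastings(key, x0, step_size, n_samples):
    """Random-walk Metropolis-Hastings with a Gaussian proposal."""

    def step(x, key):
        key_prop, key_acc = jax.random.split(key)
        x_prop = x + step_size * jax.random.normal(key_prop)
        # The proposal is symmetric, so the ratio only involves the target.
        log_alpha = log_target(x_prop) - log_target(x)
        accept = jnp.log(jax.random.uniform(key_acc)) < log_alpha
        x_new = jnp.where(accept, x_prop, x)
        return x_new, (x_new, accept)

    keys = jax.random.split(key, n_samples)
    x0 = jnp.asarray(x0, dtype=jnp.float32)
    _, (samples, accepted) = jax.lax.scan(step, x0, keys)
    return samples, accepted.mean()


key = jax.random.PRNGKey(0)
# Map each step size to (samples, acceptance rate).
results = {s: metropolis_hastings(key, 0.0, s, 2000) for s in (0.1, 1.0, 50.0)}
```

A very small step accepts almost every proposal but explores slowly, while a very large step is rejected most of the time; plotting the three chains in `results` side by side makes this trade-off visible.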

Temporal analysis

To measure my progress throughout the summer, I used the Toggl app to track the time spent on different tasks. The following is a summary of the time spent during the summer.

I worked an average of 6 hours per day on the project. Saturdays and Sundays were spent reading and studying papers and the relevant sections of the PML Vol 2 book.

My weekly commitment was around 32 hours on average. The following table shows the total time spent on the project, broken down by week. In week 5 I took a short break from the project, and in week 6 I resumed work.

     end of week  clocked_hours
week                           
0     2021-05-29      23.183056
1     2021-06-05      43.937778
2     2021-06-12      39.890833
3     2021-06-19      28.637222
4     2021-06-26      34.411111
5     2021-07-03       8.668333
6     2021-07-10      39.891111
7     2021-07-17      26.115556
8     2021-07-24      36.937222
9     2021-07-31      35.834722
10    2021-08-07      42.378611
11    2021-08-14      34.980391
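
As a quick consistency check, the weekly figures in the table can be summed in a few lines of Python. The variable names are made up for this example; the numbers are copied from the table above, and the total they produce matches the 394.87 hours quoted in the highlights.

```python
# Clocked hours per week, copied from the table above (weeks 0 to 11).
weekly_hours = [
    23.183056, 43.937778, 39.890833, 28.637222, 34.411111, 8.668333,
    39.891111, 26.115556, 36.937222, 35.834722, 42.378611, 34.980391,
]

total_hours = sum(weekly_hours)
mean_hours = total_hours / len(weekly_hours)
print(f"total: {total_hours:.2f} h, weekly mean: {mean_hours:.2f} h")
```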

At the start of the summer, I set myself the goal of working between 30 and 40 hours per week. The following graph shows the 7-day rolling sum of hours worked. I am happy to say that I was (mostly) able to achieve this goal.

During the summer I was able to tackle a total of 41 issues in the private issue tracker. The issues I worked on had different levels of difficulty. At the beginning of the summer, most issues were tagged as Figures, but it was then decided to catalogue them according to an estimated level of difficulty. The following table shows the level of difficulty of the issues I worked on this summer and the time spent on each issue (see The Summary section for more details).

Finally, I was able to complete 39 pull requests in the probml/pyprobml repository. I wrote a total of 6,165 lines of code. The following figure shows the number of lines of code per PR (see The summary section for more details).

Conclusion

GSoC 2021 was an incredible experience. I was able to work on a topic I am passionate about, and to contribute to the open source community and to a project bigger than myself. I learned a lot from incredible people and I am looking forward to working with them in the future. I am grateful to my GSoC mentors Kevin P. Murphy and Mahmoud Soliman for their support throughout the whole process.

The Summary

In the following table, I present the pull requests (PRs) I created during the GSoC program. The PRs are ordered by the date they were closed.

url     title                                                    additions
PR 78   add gmm and data for old faithful                        423
PR 81   add gmms singularities diagram                           33
PR 82   add parzen-window2 example                               77
PR 83   add geom_ridge diagram                                   58
PR 476  add gibbs_gauss_demo.py                                  47
PR 481  Modify .gitignore, remove pycache                        112
PR 482  add kalman_tracking_demo.py                              347
PR 483  Add mcmc_gmm_demo.py                                     163
PR 486  refactor plot_ellipse, refactor kalman_tracking_demo.py  8
PR 487  add hmm_casino_demo.py, hmm_lib.py                       306
PR 488  Fixes extra space in mcmc_gmm_demo                       5
PR 503  Parallel Kalman Filtering and refactor                   301
PR 502  feat: create kalman_filter_spiral_demo.py                68
PR 501  creates Extended Kalman Filter demo                      226
PR 512  Variational Mixture of Gaussians                         420
PR 521  Refactor kf/ukf/ekf and demos                            690
PR 528  Fix linear_dynamical_systems_lib.py                      4
PR 541  Rename dynamical systems libraries                       6
PR 540  create linreg_online_kalman_demo.py                      140
PR 546  Create mix_gauss_ml_vs_map.py                            191
PR 547  Refactor mix_gauss_demo_faithful.py                      93
PR 548  Create gpr_demo_marlik.py                                144
PR 549  Refactor nlds_lib.py                                     65
PR 552  create EKF/UKF + MLP                                     153
PR 554  Refactor nlds_lib and ekf_vs_ukf_mlp_demo                46
PR 553  Create hmm_lillypad_demo                                 132
PR 575  Fix UKF prior                                            17
PR 577  EKF + MLP animation                                      73
PR 579  Create condensation algorithm / bootstrap filter demo    102
PR 584  Create rbpf_maneuver_demo.py                             351
PR 587  Create unigauss_vb_demo.py                               194
PR 588  refactor: linreg_eb_modelsel_vs_n.py                     71
PR 590  refactor lds_lib to use jax.lax.scan                     229
PR 591  Create LinearGaussianStateSpaceModel class               382
PR 592  Fix lgssm_demo.py                                        8
PR 593  fix nlds_lib.py                                          49
PR 596  Create replica of 1d-pendulum from Särkkä’s              138
PR 597  Create adf_logistic_regression_demo.py                   240
PR 610  Refactor: Move 3d-plots functions to pyprobml_utils      53

In the following table, I present the issues I tackled in the probml/hermes repository, along with the time I spent on each issue. The issues are ordered by creation date.

issue_number  clocked_hours  title
13            7.45583        Convert unigaussVbDemo to python
18            9.04167        Convert mcmcGmmDemo to python
20            2.04833        Convert gibbsGaussDemo to python
24            8.22806        Convert linregOnlineDemoKalman to python
31            6.67556        Convert gprDemoMarglik to python
37            7.98583        Convert casinoDemo to python
39            11.9569        Convert kalmanTrackingDemo to python
40            27.6747        Convert rbpfManeuverDemo to python
52            13.0525        extended kalman filtering
53            8.85444        unscented kalman filtering
54            38.6331        EKF/UKF for online learning of an MLP
61            3.96056        Convert mixGaussMLvsMAP to python
76            0.583889       Add plot_ellipse function to pyprobml_utils.py
77            2.86472        fix kalman_tracking_demo
79            12.5386        make demo of parallel kalman filtering
81            0.269444       Create .gitignore file
109           0.705          Replace greek symbols in Kalman Filter Demo
119           1.30833        Make kalman_filter_spiral_demo
146           1.33667        fig small bug in dynamical_systems_lib.py
148           0.378056       Make kalman_filter_spiral_demo with limit cycle
163           4.64083        Organise files and classes of dynamical systems
172           0.298333       Refactor linear_dynamical_system_lib.py
183           2.43583        Refactor ExtendedKalmanFilter class (continuous)
195           0.573889       rename dynamical systems libraries
215           10.8703        apply particle filtering to a nonlinear 2d tracking problem
216           0.881667       refactor nlds_lib
239           2.21444        make HMM lilly pad example
249           0.343056       refactor GMM demo code
251           3.22611        debug bayesian model selection for linear regression
254           1.34111        Refactor nlds_lib and ekf_vs_ukf_mlp_demo (Stage 1)
255           0.755          Add demo that shows the training animation for UKF/EKF + MLP
270           1.48861        Refactor UKF in nlds_lib.py
279           30.8075        ADF for binary logistic regression
299           18.1928        online inference for the nonlinear 1d pendulum problem
310           3.07194        refactor lds_lib to use jax.scan
311           15.6478        refactor lds_lib to be compatible tfp.distributions
318           3.80622        rbpf_maneuver_demo example using vanilla PF
326           1.31528        Move 3d-plots functions to pyprobml_utils
332           7.86444        TFP tutorial to pyprobml-notebooks
358           7.59111        EEKF for logistic regression
366           2.06111        End of summer final report