Alexandre Andorra: Thomas Pinder, welcome to Learning Bayesian Statistics.

Thomas: Thank you. Thank you ever so much for having me on.

Alexandre Andorra: Thanks for taking the time. It's late for you where you are, so I definitely appreciate you being a night owl for the show, and I'm sure the listeners will appreciate it too. I was super excited to come across your profile on LinkedIn when you posted about the GPJax integration into NumPyro. So thanks for doing that and for making it public, because I don't know when I would otherwise have come across your profile. When I saw what you were working on, I was super excited because I was like, damn, we're working on very similar topics. I need to talk to this guy, we're going to have a lot in common, and I'm very probably going to learn a lot from you. So let's do that. You have an interesting background for sure, because you transitioned from a PhD in statistics at Lancaster in the UK, which is where you're from, and you've had senior scientist roles at, if I understood correctly, mostly big US tech companies: Amazon, Uber, Netflix. But you're still based in Europe, which is interesting in itself. So I'm curious about your origin story: how did your focus shift toward experimentation and causal inference in such large-scale industrial settings?

Thomas: Yeah, so as you say, I did my PhD in statistics. I focused pretty heavily on Gaussian processes during my PhD. I found them fascinating: the type of thing where for each new thing you learn, you stumble across four or five things you know nothing about. So it was quite easy to fill a whole PhD learning about them. I think I got into Bayes and Gaussian processes during my master's, though, actually. This was around the time when neural networks were getting pretty popular. We'd just had "Attention Is All You Need" come out. I got very interested in the idea of adversarial detection in computer vision and simultaneously read a wonderful paper by Yarin Gal and Zoubin Ghahramani on Monte Carlo dropout. I was fascinated as to whether we could combine these two things and detect adversaries using Bayesian deep learning. That was really my pathway into the world of Bayes. It was a wonderful introduction, and I think it was really the thing that made me appreciate Bayes, this idea of using the uncertainty to make decisions. I did work for a few UK-based companies. I spent a little bit of time working for some startups in Lancaster on NLP, and also down in Oxford working on cost-aware Bayesian optimization. But as you say, I have spent most of my career working in large US tech companies. My first time was actually during my PhD, when I was fortunate enough to work in Amazon's supply chain team. I got to intern with someone called James Hensman, and it was a super fun six months. We were basically using Gaussian processes...

Alexandre Andorra: Mm-hmm.

Thomas: to accelerate Horvitz-Thompson estimators, so approximating sums using Gaussian processes and control variates. And that for me was really what made me appreciate that you can do really fun, complicated, impactful machine learning and statistics in industry. Until then, I think I had this impression that you had to stay in a university to do this type of complex problem solving. So this was really exciting to me. However, I finished my PhD having spent the best part of four years pretty focused on Gaussian processes, and I just fancied a change. There was an opportunity in Amazon's Prime and marketing team to go and work on some experimentation projects. So I took that opportunity and spent almost three years at Amazon working on experimentation and quasi-observational models, mainly synthetic controls. I was lucky enough to spend most of my time at Amazon working with Alberto Abadie, who came up with the idea of synthetic controls. So that was really educational and informative for me. And I had a great time fusing together Bayesian methods with some of these more quasi-observational methods. So that's what landed me in the world I sit in today.

Alexandre Andorra: Okay, yeah, this is a really fascinating background and context, and it does sound like a lot of really cool and interesting jobs. I think listeners will know my passion for Gaussian processes, so of course I resonate with that. We already have a few shows about Gaussian processes that I'll put into the related shows on the website. And by the way, folks, we have a completely revamped website now, where each episode comes with an accompanying blog post that I write, with all the key takeaways and related notes. That's been a huge effort from the team here at Learn Bayes Stats. I've only been leading that from the background, and now it looks really, really good. So when you go there, you'll see all the related shows that I'll put in there. We'll have Gaussian processes, and probably some of the episodes we've done about state space models, because I know you also work a lot on these kinds of models. So we'll have that in there if you want it as background for the conversation we're going to have today, because we're not going to dive too much into these technical details. But yeah, it sounds like super fun work, Thomas, and I definitely understand where you're coming from. I'm also really happy, as you were saying, that these kinds of jobs are now not only in academia but also in industry settings, where you get to have an impact on how things are actually done in a massive-scale company.
And I think it's really amazing that we can live and contribute at such a time. So you also lead something called GPJax, an open source package, which is why I know you. For those unfamiliar, what is the overarching, big-picture goal of this library, and why was JAX the necessary foundation for it? Some info about my audience: you can assume that most people will know what a Gaussian process is and what JAX is, so feel free to build on that.

Thomas: Yeah, I might start with the second part, because JAX was not a necessity for GPJax. It actually came about because during my PhD I used to work on a Julia library for Gaussian processes, GaussianProcesses.jl. Julia was really where I learned to code, and through my PhD supervisor I contributed to this library. I really liked this functional way of programming, but then I realized that to get your code out there and used by people, Python is, for better or worse, the dominant language in our world. And up until JAX, I think PyTorch and TensorFlow both encouraged this object-oriented way of programming. So when JAX came along, I was naturally very attracted to the framework it provided. Honestly, GPJax started because I wanted an excuse to play around with JAX during my PhD. A nice thing about being in a university is that you're allowed a large amount of academic freedom, and I took that as an opportunity to take some time to learn how JAX worked. At the time I knew about GPs, and I thought, well, there's no library in JAX for GPs, so why don't we make one? So it really did just start out as a piece of research code. I think today it's a little bit more than that. It's a little bit more reliable. But I think the overarching principle of GPJax is to essentially provide the verbs and nouns for people wanting to fit GPs in maybe a non-standard way. GPyTorch and scikit-learn have Gaussian process implementations which allow you to fit a GP out of the box and get the estimates out at the end in a reliable way. With GPJax, I wasn't trying to recreate that framework in JAX. What I really wanted was something with very few guardrails and the ability for researchers to really hack around in a GP, and to maybe do some of the more unorthodox things which some of these larger frameworks would prevent you from doing.
It allows people to have all of the components of a Gaussian process and choose how they compose those components with one another, to maybe develop novel GP methodologies. I think it's still true to that principle today, and it still sits in that space. But that's really where GPJax came from.

Alexandre Andorra: Hmm, okay. Yes, I can definitely resonate with the idea of being curious about a method or framework and trying to find a use case for learning it. Well done on doing that. I also think the package itself is really helpful and useful for researchers, and you did a great job on it. So, people, if you want to check that out, it's in the show notes. Definitely give GPJax a try if that's something you're curious about or need right now. I will also put in the link to GPyTorch, which you mentioned. Something you also work on frequently is bridging, and I'm very interested in talking to you especially on the

Thomas: Yeah, I think Bayesian methods lend themselves very naturally to the world of causal modelling. When you work on these large causal questions inside industry, it's very rarely a binary outcome. It's very rarely the case that you run an experiment, be it a randomised controlled trial or some type of observational method, and get a very clear picture at the end which says you should definitely do this or you should definitely not do that. These types of business questions are often layered in nuance and complexity. And I think our role as scientists, when we're building models to address these questions, is to provide our stakeholders and business partners with the context that allows them to make what they believe to be the right decision. That should be guided by their own expertise and by what the data is saying to them. Bayesian methods give you a very natural way of reasoning about the uncertainty. We no longer have to settle for just a treatment effect. We can get a treatment effect and a credible interval around it, and we can work with the full posterior distribution if we like. We can then start to talk to our stakeholders about the risk of a false positive. In my experience, when you frame the outcome of these causal questions through the lens of probabilities, the probability of the effect being positive or being greater than some value, versus compressing things down to a p-value being significant or not, you provide your colleagues and stakeholders with far more context and information, which should allow them to make the best-informed decision they can off the back of the data you've modeled.
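
A minimal sketch of the kind of summary described here, assuming you already hold posterior draws of a treatment effect. The draws below are simulated stand-ins rather than output from a fitted model, and `summarize_effect`, the threshold, and the 94% interval level are illustrative choices, not anything taken from the conversation.

```python
import numpy as np

def summarize_effect(draws, threshold=0.0, level=0.94):
    """Summarize posterior draws of a treatment effect for decision making."""
    draws = np.asarray(draws, dtype=float)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(draws, [tail, 1.0 - tail])
    return {
        "mean": float(draws.mean()),
        "interval": (float(lower), float(upper)),
        # Probability statements that replace a significant / not significant verdict.
        "prob_positive": float((draws > 0.0).mean()),
        "prob_above_threshold": float((draws > threshold).mean()),
    }

# Stand-in posterior draws; in practice these come from your sampler.
rng = np.random.default_rng(0)
effect_draws = rng.normal(loc=1.5, scale=1.0, size=4000)

summary = summarize_effect(effect_draws, threshold=1.0)
print(summary)
```

The same draws answer several stakeholder questions at once: how large the effect is likely to be, how sure we are that it is positive, and how likely it is to clear a business-relevant bar.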

Alexandre Andorra: Yeah, I mean, you're preaching to the choir here, and I'm pretty sure everybody in the audience will appreciate what you're saying. Something I'm also...

Thomas: Yeah

Alexandre Andorra: obviously very curious about in your work is Bayesian synthetic control, which you've done some work on, or are at least very curious about. Here we're talking about quasi-experimental methods. Weirdly, we haven't talked about those too much on the show. I think it's scattered across a lot of episodes, but we have at least that episode with Ben Vincent, the author of CausalPy. So I will link to that episode and, of course, put CausalPy in the show notes. But can you tell listeners what Bayesian synthetic control is about and how it works?

Thomas: Hmm. Yeah, so I think Ben has some excellent stuff on this in CausalPy. I would definitely recommend people go and look at that. It's truly excellent. Synthetic control, though: in its original formulation, it's essentially posed as an optimization problem where you have a collection of control units and one or many treated units. Let's just say one treated unit for now. Your goal is to form some linear combination of the control units that best matches the treated unit before the intervention was applied. In the original paper, Alberto does this by constraining the weights to live on the probability simplex, and frames it as a constrained optimization problem. However, when I first saw this, I just thought of it as linear regression, in a way, where my response variable is my treated unit and my design matrix is just a collection of control units observed over time. And my Bayesian mind said, well, if I'm performing optimization on a simplex, that's somewhat analogous to putting a Dirichlet prior on the coefficients of my linear model. So I implemented that using NumPyro and PyMC. These models are exceptionally easy and fast to fit nowadays. And I think what you get out of this is a little bit more than just taking a frequentist method and saying let's make it Bayesian because we can. By recasting synthetic control from this constrained optimization problem to this Bayesian regression problem, you give yourself a series of tools which allow you to perform quite a rich mode of inference. Optimizing the weights on the simplex is somewhat fragile, whereas when you set your Dirichlet prior, you can be selective in how you set the concentration parameter of the Dirichlet distribution.
You can then start to say whether you think only a few units would be very informative of the counterfactual, in which case you set a very low value for the concentration parameter. Or you might believe that many units are explanatory of the treated unit, and you can put that concentration parameter a little bit higher. Or, as I frequently did, you can put something like a gamma hyperprior on that concentration parameter and let the data drive this type of balancing. And I will return to the original point I made about performing inference and giving stakeholders context around the confidence and what happened on the way to getting to your treatment effect. When you perform an ordinary synthetic control and do this optimization on the simplex, and you then want some uncertainty around your treatment effect, the idea in the original synthetic control method was to do a permutation test. You would take each of your control units, pretend that it was a treated unit, estimate the treatment effect, and cycle through your pool of control units. Then you would compare your treatment effect against all of these placebo effects, and that gives you the uncertainty. That's okay, but all you end up with is a boundary, a permutation test interval around your treatment effect. You don't know what happened between the intervention and the weeks leading up to the end of your treatment period. Whereas when you have a Bayesian linear model, you get the full trajectory from treatment point until final time point, and you can see how the treatment effect evolved over time. You also have fewer concerns when you have very few units, which is where the permutation test becomes a little bit unstable.
However, with the Dirichlet prior approach, I think you end up with much more robust inference, and in practice this is very meaningful. When your estimates are more reliable, you save yourself from ever being in that awful position where you've given some guidance to a stakeholder about how they should interpret some causal effect, and then it turns out that your model was very fragile and you need to change that narrative. I think the Bayesian approach gives you a very robust mode of inference here.
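
For reference, the classical starting point can be sketched as least squares with the weights constrained to the simplex. This is a simplified illustration on simulated data, solved with projected gradient descent. It matches pre-treatment outcomes only, ignores the covariate matching in the original method, and is not the NumPyro or PyMC model mentioned above, where a Dirichlet prior on the weights would take the place of the hard constraint. The function names and all data are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def fit_simplex_weights(X_pre, y_pre, n_iter=2000):
    """Minimise 0.5 * ||y_pre - X_pre @ w||^2 over the simplex by projected gradient descent."""
    n_units = X_pre.shape[1]
    w = np.full(n_units, 1.0 / n_units)
    step = 1.0 / np.linalg.norm(X_pre, 2) ** 2  # inverse Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X_pre.T @ (X_pre @ w - y_pre)
        w = project_to_simplex(w - step * grad)
    return w

# Simulated panel: rows are time points, columns are control units.
rng = np.random.default_rng(0)
n_pre, n_post, n_controls = 40, 10, 5
trend = np.linspace(0.0, 2.0, n_pre + n_post)
X = trend[:, None] + rng.normal(size=(n_pre + n_post, n_controls))
true_w = np.array([0.6, 0.4, 0.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=n_pre + n_post)
y[n_pre:] += -2.0  # hypothetical treatment effect after the intervention

w = fit_simplex_weights(X[:n_pre], y[:n_pre])          # weights from pre-period only
effect = float(np.mean(y[n_pre:] - X[n_pre:] @ w))     # observed minus synthetic control
```

The output here is a single weight vector and a single effect estimate. The Bayesian recasting described above would instead return a distribution over both.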

Alexandre Andorra: Yeah, for sure, and thanks. I think that was a great explanation of what synthetic control is. Basically, the idea is to find some combination of units that represents the treated unit, in effect building another copy of the treated unit that wasn't treated. That gives you the counterfactual: if that unit had not been treated, it would have looked like this, whereas we have observed what it looks like when it's treated. So the difference is the causal effect. If you want an example, let's say some multinational company changes the way it operates in Argentina, because of a law or something like that, and only in Argentina. They don't change anything in the rest of South America. Then you could use synthetic control to create a synthetic Argentina from a weighted combination of the other South American countries, so that when they are combined you get a sort of stand-in Argentina, and then the difference is your causal effect. So the difficulty in this method is in computing the weights and making sure that your synthetic unit is a good proxy for what the treated unit would have been without the treatment. Is that correct?

Thomas: It's correct as far as I understand it. Yeah, that's a great explanation of what it is.

Alexandre Andorra: Okay, yeah. Me too, me too. It's always a bit weird to explain these things, because there's a lot of: this is what the unit would have been like if it wasn't treated, but it's actually treated. So you have to be extremely careful with the definitions and the words you're using. And yeah, good.

Thomas: Yeah, I agree. I think that counterfactual unit is a super rich object to have in a causal model, right? It's definitely the path by which you get to your treatment effect, by comparing what happened versus what would have happened. But that counterfactual unit also gives you so much information about scenarios that could have happened. If your control units were a different set of control units, what would that mean for your counterfactual? Like, having this...

Alexandre Andorra: Yes.

Thomas: Bayesian synthetic control giving you this full counterfactual distribution gives you a super rich object with which you can start to reason about different scenarios. So yeah, I think your explanation is great. And that counterfactual is really the key part, I think.

Alexandre Andorra: Yeah, completely. And what's also fascinating to me in all the quasi-experimental methods I know and have used is that computing the causal effect is super simple: it's always a difference between the counterfactual unit and the treated unit. What's hard is getting to an estimate of the counterfactual unit. Getting there is hard, but once you're there, it's super easy, you just compute the difference. And if you're in the Bayesian framework, you have a full posterior of differences, and you get all the bells and whistles you were talking about.
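
The "posterior of differences" can be sketched in a few lines once counterfactual draws exist. The arrays below are simulated placeholders, with one row per posterior draw and one column per post-treatment time point. The shapes, values, and the 94% interval are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws, n_post = 2000, 8

# Hypothetical observed post-treatment outcomes for the treated unit.
observed = np.linspace(10.0, 8.0, n_post)

# Stand-in posterior draws of the untreated (counterfactual) trajectory.
counterfactual_draws = 10.0 + rng.normal(scale=0.5, size=(n_draws, n_post))

# One effect trajectory per posterior draw: observed minus counterfactual.
effect_draws = observed - counterfactual_draws

pointwise_mean = effect_draws.mean(axis=0)                             # effect at each time point
pointwise_interval = np.quantile(effect_draws, [0.03, 0.97], axis=0)   # 94% band over time
cumulative_draws = effect_draws.sum(axis=1)                            # total effect per draw
prob_negative_total = float((cumulative_draws < 0).mean())
```

Because every draw is a whole trajectory, the same array yields the effect over time, the cumulative effect, and any probability statement about either.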

Thomas: Yeah. Yeah, it's really like an iceberg type of situation. The tip of the iceberg, that treatment effect, is really easy to get to. Everything beneath, all the assumptions you've had to make and the modeling choices you've then had to translate, that's where the work goes in building these models.

Alexandre Andorra: Yeah. Yeah, exactly. And I like that, in a way, because it's a bit like Bayes, you know. In Bayes you have one estimator, the posterior distribution, and you're done. You don't need to know about all these tests and so on. And here I like that as well, because you have one estimator, it's just the difference. That's your causal effect. That's what you care about. You need to get there, but it's always the same thing. And I think it's

Thomas: Yep.

Alexandre Andorra: So actually there is a thread I could pick up here. You talked about permutation tests and the frequentist way of doing that. I think there is also a big limitation there when you talk about p-values: you're limited by the number of units, and your significance threshold cannot go below that. If you have 20 units, you literally, mathematically, cannot have a p-value that's lower than 5%. And that can be a problem. But I think that's more of a nitpick, or maybe something for a more classical statistical debate, and I have so many other topics I want to talk about with you, so let's go on. Before that, though, and before we talk about synthetic difference-in-differences, can you quickly remind me why using a Dirichlet distribution for the weights is more robust? I think I missed that, and I'd love you to repeat it.
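
The p-value floor mentioned here can be checked with a small sketch. It assumes 20 units in total (1 treated plus 19 placebo controls) and a rank-based placebo p-value. The function name and the placebo values are made up for illustration.

```python
def placebo_p_value(treated_stat, placebo_stats):
    """Share of units (treated included) whose statistic is at least as extreme as the treated one."""
    stats = [abs(treated_stat)] + [abs(s) for s in placebo_stats]
    at_least_as_extreme = sum(s >= abs(treated_stat) for s in stats)
    return at_least_as_extreme / len(stats)

# 19 control units plus 1 treated unit = 20 units in total.
placebo_stats = [0.1 * k for k in range(1, 20)]  # hypothetical placebo effects

# Even an enormous treated effect cannot push the p-value below 1 / 20.
p = placebo_p_value(treated_stat=100.0, placebo_stats=placebo_stats)
print(p)  # 0.05
```

However extreme the treated unit's statistic, it always counts itself, so the smallest attainable value is one over the number of units.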

Thomas: Yeah, maybe let's take a step back and reframe it. The original synthetic control idea of optimizing on the simplex and the idea of specifying a Dirichlet distribution are both essentially supplying the same geometric restriction on the coefficients of your units. And you're doing this because, if you just treat it as an ordinary least squares type of problem...

Alexandre Andorra: Mm-hmm.

Thomas: All of your control units are really highly correlated, both with each other and with the treated unit in the pre-treatment window. So you have this multicollinearity problem and you really need to regularize. What happens in practice when you're optimizing on the simplex is that you end up with very sparse weights. Most of the control units are assigned zero mass, and a few control units carry the weight of the counterfactual distribution. That's quite a heavy task, and it makes for a pretty brittle estimator. For example, if you introduce an additional unit into your control set, it can completely shift the allocation of mass across your units, and that can dramatically change your counterfactual distribution. It's not to say that the Dirichlet distribution is necessarily better, but I think it's a little bit more configurable, and you can be more flexible in how you want that weight to be allocated. As I said earlier, as the concentration parameter gets closer to zero, it will give you much sparser weights, whereas when it gets larger, you start to allocate the mass uniformly. And as a practitioner, if you have no idea whether you need very few units or quite uniform mass spread across all of your units, you're totally free to put a prior on that value and allow the sampling routine you pick to guide you to the right value. But there's also another framing here. I've worked on some problems in the past where you're running experiments in which putting units into either treatment or control is somewhat costly, and you want the smallest design possible.
In these cases, you almost want to specify by definition: I want the smallest number of control units which will give me a well-explained counterfactual. It's quite hard to reason about that when I'm optimizing on the simplex, whereas when I'm putting that into my prior distribution, I can purposefully set the concentration parameter to be quite small. So I don't think it's necessarily better. I just think it enables richer inference, and it gives you as the practitioner more hooks into the model to impart the information you think is helpful in getting to your final causal estimate. It's the classic case of there being no one right way to do this. But having more options in these cases, and less of a black-box model, is often better.
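
The role of the concentration parameter can be seen by sampling weight vectors from a symmetric Dirichlet prior. This is a prior-only illustration with NumPy, not a fitted synthetic control model. The number of units and the two concentration values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n_units, n_draws = 20, 5000

def mean_largest_weight(concentration):
    """Average size of the largest weight under a symmetric Dirichlet prior."""
    draws = rng.dirichlet(np.full(n_units, concentration), size=n_draws)
    return float(draws.max(axis=1).mean())

sparse = mean_largest_weight(0.05)   # small concentration: a few units carry most of the mass
diffuse = mean_largest_weight(50.0)  # large concentration: mass spread almost evenly (about 1 / 20 each)
print(round(sparse, 2), round(diffuse, 2))
```

Placing a hyperprior on the concentration, as described above, amounts to letting the data choose a point between these two regimes.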

Alexandre Andorra: So I encourage people to read your blog post on this topic, which I'll put in the show notes. Now, let's turn briefly towards synthetic difference-in-differences. I think most people will be somewhat familiar with the concept, but can you tell us a bit more about that?

Thomas: Yes.

Alexandre Andorra: What it is, when to use it, and how. I know you've done a bit of work on that lately, mostly as something you're curious about. It's also great to pick your brain on how you're thinking about that right now, while your thinking is still evolving. So not at all prescribing things here, but tell us how you're thinking about that topic and how it influences your work. And of course, I will put the blog post you wrote about that in the show notes as well.

Thomas: Yeah, so just thinking about diff-in-diff: I gave an explanation of how one can think about synthetic control as a form of linear regression. The same is somewhat true of diff-in-diff. In difference-in-differences, you have a treatment and a control unit, and you have this pre- versus post-intervention window. You then end up with two indicator variables: was the unit in control or treatment, and was the observation in the pre- or the post-intervention window? The coefficient on the interaction between these two variables ends up being your treatment effect. So diff-in-diff and synthetic control can both be framed in the light of linear regression. There are some really nice papers by Guido Imbens, and there might even be a talk on YouTube now from Guido Imbens, framing pretty much all of this line of literature as different variants of linear regression. It's a really great talk. So to me, if we can frame diff-in-diff through the lens of linear regression, then much like with synthetic control, I have the option to cast this in the light of Bayesian regression, where the prior distribution on that interaction term is the prior on my treatment effect. Synthetic diff-in-diff was a paper from maybe five or six years ago now. I think it was 2019, actually. It essentially said: what if we combine diff-in-diff with synthetic control? Synthetic control gives me a way of weighting a collection of control units to explain the counterfactual, and diff-in-diff allows me to weight my pre- and post-intervention windows.
Synthetic diff-in-diff says: what if I could have weights on time and on the control units? It goes about estimating them in a three-step process. I first learn the weights of the control units using all of the pre-intervention data, then I learn the time weights using only the control units, and then, as you put it earlier, we get the treatment effect out at the end by taking the difference between two terms. However, the challenge in trying to cast this into the light of Bayesian inference is that I now have this information restriction. I have all of my data, pre and post, control and treatment, and I have two quantities I need to estimate, the unit weights and the time weights. But the unit weights can only take data from the pre-intervention period, and the time weights can only take data from the controls. In a Bayesian framework, your likelihood would normally just factorize over the data. I can't do that here, otherwise I'll have information about the wrong quantity flowing into the wrong component of my model. I actually stumbled across this idea of the cut posterior, or modular posterior, just in passing. It essentially says that in networks you may quite commonly have this problem, where you want information to flow only to particular components of your posterior and then compose these modules into one full posterior distribution. I saw that and thought maybe we can apply it to synthetic DiD, and it turns out you can, relatively simply. I have a very short write-up of the idea on my blog about how you could do this. It's very recent and could very well be wrong, so I would encourage people to read it with a sceptical eye.
However, it turns out that when you frame Bayesian synthetic diff-in-diff in this cut-posterior way, you end up with an estimate which is very comparable to the estimate reported in the synthetic diff-in-diff paper, which essentially said that synthetic controls on their own overestimate the magnitude of the treatment effect in the original California smoking example. This was the data set used in the original synthetic control paper, which measured the effect of a cigarette ban in California, where California was the treated unit and all the other states became the control units. I think the effect in that paper was around 26, in sales of cigarettes, and synthetic diff-in-diff estimates this to be closer to minus 15 packs per capita. Bayesian synthetic diff-in-diff recovers almost exactly the same estimate: it gets around minus 15.6, so a 0.6 difference, but you get the full posterior out. As we've said multiple times, this gives you that richer form of inference. I'm still pushing the idea around. There are open questions, like how you represent the uncertainty on the treated unit itself, because you now have time factoring in, and the different ways you can parameterize the model such that it becomes closer to synthetic control or closer to diff-in-diff, or in certain cases equal to those models. I'm still mulling over what that means. But it's quite interesting, and it's something I'm quite excited to continue developing and seeing how the idea evolves.
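
To make the regression framing concrete, here is a small sketch of a double difference with weights on control units and on pre-treatment periods. With uniform weights it reduces to plain diff-in-diff, which the code checks against the interaction coefficient of an ordinary least squares fit. The weights are supplied by hand here. Estimating them (the first two steps described above) and the cut-posterior treatment are not shown, and all data and names are simulated or hypothetical.

```python
import numpy as np

def weighted_did(Y_control, y_treated, n_pre, unit_weights=None, time_weights=None):
    """Double difference with weights on control units and pre-treatment periods.

    Y_control has shape (n_control, n_periods); y_treated has shape (n_periods,).
    """
    n_control = Y_control.shape[0]
    if unit_weights is None:
        unit_weights = np.full(n_control, 1.0 / n_control)
    if time_weights is None:
        time_weights = np.full(n_pre, 1.0 / n_pre)
    synthetic = unit_weights @ Y_control  # weighted control trajectory
    treated_change = y_treated[n_pre:].mean() - time_weights @ y_treated[:n_pre]
    control_change = synthetic[n_pre:].mean() - time_weights @ synthetic[:n_pre]
    return treated_change - control_change

rng = np.random.default_rng(3)
n_control, n_pre, n_post = 6, 8, 4
n_periods = n_pre + n_post
time_effect = np.linspace(0.0, 1.0, n_periods)
Y_control = (rng.normal(size=(n_control, 1)) + time_effect
             + 0.05 * rng.normal(size=(n_control, n_periods)))
y_treated = 2.0 + time_effect + 0.05 * rng.normal(size=n_periods)
y_treated[n_pre:] += -3.0  # hypothetical treatment effect

did = weighted_did(Y_control, y_treated, n_pre)  # uniform weights: plain diff-in-diff

# Same quantity as the interaction coefficient in y ~ 1 + treated + post + treated:post.
Y_all = np.vstack([Y_control, y_treated])
treated = np.repeat(np.r_[np.zeros(n_control), 1.0], n_periods)
post = np.tile(np.r_[np.zeros(n_pre), np.ones(n_post)], n_control + 1)
design = np.column_stack([np.ones_like(post), treated, post, treated * post])
coef, *_ = np.linalg.lstsq(design, Y_all.ravel(), rcond=None)
```

Passing non-uniform `unit_weights` moves the estimate toward a synthetic-control-style comparison, and non-uniform `time_weights` emphasises the pre-periods that best resemble the post-period.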

Alexandre Andorra: Yeah, this is a very intriguing and, I think, powerful idea, merging synthetic control and diff-in-diff together. I had not heard about that before preparing the show, so thanks for putting it in front of me. I will definitely look into that, study it as you're doing right now, and see how we can use it in a professional setting. Actually, can you add the YouTube video of the talk you mentioned to the show notes? Because I think...

Thomas: Yeah.

Alexandre Andorra: it's going to be very helpful, especially if it's exposing this idea of synthetic diff-in-diff.

Thomas: Yeah, yeah, let me try and find the talk. And yeah, if I can find it, I'll add it to the show notes.

Alexandre Andorra: Yeah, you can do that after the show for sure. Awesome. So that was the quasi-experimental causal inference part of today's episode. We can get back to that at some point if you want. But I also want to ask you about GPJax because, well, that's another common passion of ours. You already gave us the elevator pitch for it at the beginning. But more precisely, when would you recommend listeners use it, and how?

Thomas: I lost the last part of that question and you said, when would you recommend listeners to use it? And then I lost you.

Alexandre Andorra: Yes. And how to use it. Like, what's your usual workflow when you're using GPJax?

Thomas: Gotcha, yeah, how. So GPJax is definitely not the Gaussian process library you should use all the time. GPJax will let you do some things which are a bad idea, and other frameworks have guardrails in place which will just stop you doing that. For example, if you want to put a Gaussian process into a production system and it's a very high-stakes scenario, I think GPyTorch is perhaps a safer choice. It simply has more guardrails in place. Equally, if you really don't care too much about the model's construction, you just want to fit a Gaussian process, and you don't have very big data, then something like scikit-learn will let you fit a correct Gaussian process in far fewer lines than GPJax would require. So those are the two cases where you maybe wouldn't want to use GPJax. However, if you're a researcher and you want to fit unusual Gaussian processes, GPJax is a good fit. One example would be a different type of kernel that gives your covariance matrix some unique structure which you wish to exploit when doing the matrix inverse, which is a common bottleneck in all Gaussian processes. GPJax gives you super easy ways to hook into that. Or if you have a new variational approximation which you want to test out to accelerate a variational inference workflow, then again GPJax makes that really easy. So in these types of cases, using GPJax is a great idea. I also think that if you want to fit a model and then retrospectively decompose it, dig into it, take it apart, or compose it with other modules, like a linear model or something else, then GPJax enables that very nicely.
As you mentioned in the introduction, we recently integrated it very tightly with NumPyro, which allows you to build much bigger Bayesian models with Gaussian process components therein. In these types of cases, I think GPJax is somewhat unique in the sense that it really does allow you to build quite complicated models with relatively little code. As to exactly how I use GPJax, I actually use it less these days. I find myself using Gaussian processes less and less, and I work on GPJax as a hobby. I tend to use GPJax in a marimo notebook, and most of the time these days I find myself fitting Gaussian processes for case studies and for trying to understand some data. I was never the best methodological or theoretical researcher during my PhD. That was never my strong suit. Where I was able to be most useful was in building tools like GPJax and in working on causal problems, which are much more closely coupled with applications. So I don't find myself developing much novel methodology. I prefer to let other people develop clever methodology, and then I work out a way to provide a nice abstraction for it that allows people to use these methodologies within the GPJax framework.

Alexandre Andorra: Okay, yeah, that makes sense. Everything you've talked about here also resonates with my experience of Gaussian processes, especially the fact that they're very modular and that you can combine them with other kinds of models. In PyMC, you can have a Gaussian process component, or even several Gaussian process components, in a model that also has linear regression components. That's one of the big powers of Gaussian processes, and one of the big reasons I love them so much. One thing I'm curious about, though: when scaling Gaussian processes, what were the primary software bottlenecks you encountered in the Python ecosystem that led you to favor the JAX stack? As we said at the beginning, it started more as a curiosity, but now you also have the NumPyro stack integrated into GPJax.

Thomas: Yeah, so although I opted to use JAX at the start, before that I had actually implemented a very, very small Gaussian process library in Julia. And really, JAX solves so many problems because you don't have to worry about calculating the gradients. In the Julia package, we had to calculate a lot of the gradients from scratch,

Alexandre Andorra: Mm-hmm.

Thomas: which means things are super fast, but for each new kernel you want to introduce, you have to calculate the gradient with respect to each of the parameters of that kernel. And it's very difficult. So I think software like JAX really unlocks the ability to build out new code very quickly, because I can take a derivative with respect to any leaf of my pytree. That's a super powerful paradigm: through the chain rule, I can differentiate through my whole loss function in each of the parameters and get a gradient out at the end. And within JAX, and now also PyTorch and TensorFlow, I can JIT compile my whole graph. For something like a Gaussian process, where computation can be quite heavy, JAX makes a highly computationally challenging problem slightly less challenging through the ability to JIT compile your code. On the NumPyro side, NumPyro really solves the problem of, as you said, integrating a Gaussian process into a wider framework, because NumPyro is a lot more general than GPJax. NumPyro simply provides a way for me to build any type of probabilistic model. There's nothing to say we couldn't support that in GPJax, but we're not a probabilistic programming library. We're a Gaussian process library. This has been the paradigm I've had from the start: GPJax will fit Gaussian processes, and when there are good libraries for us to offload functionality into, we'll always take that. So for example, we don't implement any optimizers inside GPJax, because there's an excellent library called Optax that handles optimization. We write software within GPJax that hooks into Optax and lets Optax maintain an excellent optimization library. The same is true for NumPyro and sampling. There would be nothing stopping us integrating a Hamiltonian Monte Carlo sampler into GPJax. In fact, we did that in the Julia library before.
However, you then have to start maintaining an MCMC sampler as well as your Gaussian process code, and that becomes challenging. And to be totally honest, I can write a good piece of code to implement a Gaussian process, but I am not necessarily very well skilled in writing the best sampling library, whereas NumPyro does provide a really efficient one. So it's really about identifying dependencies and codependent libraries where you can offload challenging functionality. GPJax has always been designed to let us connect with other libraries as easily as possible and share the strengths of several libraries.
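
To make the point about pytree gradients and JIT compilation concrete, here is a minimal sketch in plain JAX. It is not GPJax's API: the kernel, the parameter names, and the toy data are all assumptions chosen for illustration. It differentiates an exact GP negative log marginal likelihood with respect to every leaf of a parameter dictionary and compiles the whole computation.

```python
import jax
import jax.numpy as jnp
from jax.scipy.linalg import cho_solve


def rbf_kernel(params, x1, x2):
    # Squared-exponential kernel; parameters live on the log scale so that
    # unconstrained gradient steps keep them positive.
    lengthscale = jnp.exp(params["log_lengthscale"])
    variance = jnp.exp(params["log_variance"])
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * jnp.exp(-0.5 * sq_dist / lengthscale**2)


def negative_log_marginal_likelihood(params, x, y):
    # Exact GP regression objective with a zero mean function.
    n = x.shape[0]
    noise = jnp.exp(params["log_noise"])
    gram = rbf_kernel(params, x, x) + noise * jnp.eye(n)
    chol = jnp.linalg.cholesky(gram)  # lower-triangular factor
    alpha = cho_solve((chol, True), y)
    return (
        0.5 * y @ alpha
        + jnp.sum(jnp.log(jnp.diag(chol)))
        + 0.5 * n * jnp.log(2.0 * jnp.pi)
    )


# The parameters form a pytree (here a plain dict); each leaf gets a gradient.
params = {
    "log_lengthscale": jnp.array(0.0),
    "log_variance": jnp.array(0.0),
    "log_noise": jnp.array(-2.0),
}

# Toy data, purely illustrative.
x = jnp.linspace(-3.0, 3.0, 20)
y = jnp.sin(x)

# value_and_grad differentiates with respect to the first argument (the pytree);
# jit compiles the loss and its gradient together.
loss_and_grad = jax.jit(jax.value_and_grad(negative_log_marginal_likelihood))
loss, grads = loss_and_grad(params, x, y)

# One plain gradient-descent step applied leaf by leaf.
new_params = jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)
```

The gradient comes back with the same structure as `params`, so adding a new kernel parameter only means adding a new leaf. In practice the hand-written update step would be replaced by an optimizer library such as Optax, as described above.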

Alexandre Andorra: Yeah, that makes a ton of sense. Much better to stand on the shoulders of giants than to re-implement everything on your own, and to play to your strengths, as you were saying: you're a very good tool builder. So basically, you have that ability to use different tools, blend them, and make life easier for people who need these kinds of models. I think this is amazing. And that's also really the spirit of this show, because again, I resonate a lot with your profile and experience, so I completely understand where you're coming from. Something you've already talked about a bit, but I'm curious to hear if you have more to say: what are practical ways you use GPs, and maybe even GPJax, in your work in relation to causal inference?

Thomas: Yeah, so I typically use Gaussian processes if I need to fit some type of regression model, I just want something that's going to work, I don't want to have to think too much about it, and I really care about the uncertainty. Then I'll often reach for a Gaussian process. There's an element of personal bias. It's a model I simply know how to fit relatively well. When linear regression doesn't solve a problem, I know there are tree-based methods and several other approaches I could reach for. But Gaussian processes are simply the tool I have, so they are my default regression model. I've used them quite a lot. I had an unsuccessful attempt at using Gaussian processes for Bayesian synthetic diff-in-diff whilst at Amazon. There I had this idea of saying that, in synthetic control, your weights follow a Dirichlet distribution, and I want to correlate my weights in time. So I tried to place Gaussian process priors on the concentration parameter and evolve that parameter through time, and I also tried evolving the weights themselves through time. The inference just became a bit of a nightmare. It was super unstable, the R-hats were all over the place, the model was really not a great model, and I was never able to get it to work very well. But although the model was unsuccessful, I think it does speak to where Gaussian processes can be useful in causal inference workflows, and in industrial workflows generally: when you have a latent property which you wish to propagate over time or over some spatial domain, they can be really useful. I worked a little bit on some spatial data before, where we needed to model a spatial residual between two satellites.
In that instance, the satellites measured on different resolutions: one of them was on nine by nine kilometer grid cells and the other was on 500 by 500 meter grid cells. You need to propagate this residual and spatially smooth it, and in these types of instances a Gaussian process is a really natural type of model, I think. So it's like anything. I don't think they are the silver bullet. I don't think a Gaussian process is the model to be used every single time. But in cases where you really care about propagating uncertainty through a model, and you want to carry information through the model and never compress it down, Gaussian processes, in my experience, often allow you to achieve that relatively easily.

Alexandre Andorra: Yeah, same experience here. For these kinds of models where you need nonlinear relationships, you don't really know what they look like mathematically, you don't have the time and expertise to work that out, and you want a method that's going to work, GPs are going to be a great bet. As we were saying, that's true for spatial data for sure. I've also used them a lot for time series data and recommend them a lot for that. So yeah, they are extremely versatile, and now pretty easy to fit, honestly, to sample with the stacks we have, unless you have a huge amount of data. It's so much easier to sample GPs now than it used to be.

Thomas: Yeah. I even think that nowadays, unless your data set is truly huge and high dimensional, you're fine. I actually think high dimensionality hurts you more nowadays than a high number of observations in a Gaussian process. With a lot of these stochastic inducing point ideas and the Hilbert space approximations that we have for GPs, I think we've moved away a little bit from this idea that GPs don't scale. They do scale in the number of observations. I think the challenge is still scaling in a high number of covariates, and this is still where I would be inclined to reach for some type of tree-based model. However, yeah, they are pretty easy to fit nowadays, I agree.
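
As an illustration of why Hilbert space approximations help with the number of observations, here is a minimal one-dimensional sketch in plain Python. The basis size, boundary, and hyperparameters are illustrative assumptions, and this is not the implementation of any particular library. The squared-exponential kernel is replaced by a sum over a fixed set of sine basis functions weighted by the kernel's spectral density, so a model only needs an n-by-m feature matrix rather than an n-by-n covariance matrix.

```python
import math


def se_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Exact squared-exponential covariance between two scalar inputs.
    return variance * math.exp(-0.5 * (x1 - x2) ** 2 / lengthscale**2)


def se_spectral_density(omega, lengthscale=1.0, variance=1.0):
    # Spectral density of the 1D squared-exponential kernel.
    return (
        variance
        * math.sqrt(2.0 * math.pi)
        * lengthscale
        * math.exp(-0.5 * (lengthscale * omega) ** 2)
    )


def basis(x, num_basis, boundary):
    # Laplacian eigenfunctions on [-boundary, boundary] with Dirichlet boundaries.
    return [
        math.sin(j * math.pi * (x + boundary) / (2.0 * boundary)) / math.sqrt(boundary)
        for j in range(1, num_basis + 1)
    ]


def hilbert_space_kernel(x1, x2, num_basis=40, boundary=5.0, lengthscale=1.0, variance=1.0):
    # Low-rank approximation: sum over j of S(sqrt(lambda_j)) * phi_j(x1) * phi_j(x2),
    # where sqrt(lambda_j) = j * pi / (2 * boundary).
    phi1 = basis(x1, num_basis, boundary)
    phi2 = basis(x2, num_basis, boundary)
    total = 0.0
    for j in range(1, num_basis + 1):
        omega_j = j * math.pi / (2.0 * boundary)
        weight = se_spectral_density(omega_j, lengthscale, variance)
        total += weight * phi1[j - 1] * phi2[j - 1]
    return total


exact = se_kernel(0.3, -0.5)
approx = hilbert_space_kernel(0.3, -0.5)
```

The approximation is accurate for inputs well inside the boundary and degrades near it. The number of basis functions needed grows quickly with input dimension, which is consistent with the point that covariates, not observations, are the harder axis to scale.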

Alexandre Andorra: Yeah, exactly. If people are interested in seeing what you can get from Gaussian processes, I will put in the show notes the project I still have ongoing with Max Goebel, who was on the show. I'll also link to his episode in the related episodes of your episode's show notes, Thomas. For that project, which is called the Soccer Factor Model, we used a bunch of hierarchical GPs on different timescales for soccer strikers, to try and predict the number of goals they would score in their next game. If you go to the dashboard, you'll see the depth of analysis you can get from these kinds of models, and this is all thanks to such powerful models. It's an open source project, so that's great because we can show everything. There is a paper, and the code and the data are available, everything is in there if you want to learn about GPs.

Thomas: Yeah, I wasn't aware of this work, that sounds super interesting. We actually did something not too different, I think. We developed a new hierarchical sparse Gaussian process. We weren't applying it to anything as fun as football players and strikers' goal rates, but we were trying to model different climate models. In the climate science world, there are these projections of surface-level temperature out to 2100.

Alexandre Andorra: Mm-hmm.

Thomas: And different models, of course, produce different predictions, and under different constraints you end up with different predictions. We were trying to do something pretty similar to what you described: we were putting a hierarchical Gaussian process on this to understand the latent underlying trajectory which underpinned all of these models, as a way to ensemble them while also allowing each model to vary from the others. Yeah, maybe we should compare approaches. Maybe there's something interesting happening at the intersection.

Alexandre Andorra: Yeah, for sure. Happy to share notes. That sounds a lot like what we've done. I think in your case, as in ours, it was very clear that the hierarchical structure of the data was helpful. We have players, but they are part of teams, and the players themselves are part of a larger population. So it was really important to us to have that higher-level population GP, which trickles down towards each player, who has his own GP that varies from the population-level one but still carries the information from the population. What is great with that is that you can then make better predictions for out-of-sample players, like young players, or players who come from another league and are completely new to, let's say, the Premier League. You can still make decent predictions, whereas before you were more or less blind. So this is very important.
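
One generic way to express this "population GP plus per-unit deviation" idea is through the covariance function. The sketch below is an assumption for illustration, not the structure of the Soccer Factor Model or of the climate model work: each unit's function is the sum of a shared GP and an independent unit-specific GP, so every pair of units is correlated through the shared term, and the deviation term only contributes within a unit. The kernel choices and hyperparameters are hypothetical.

```python
import math


def se_kernel(t1, t2, lengthscale, variance):
    # Squared-exponential covariance between two time points.
    return variance * math.exp(-0.5 * (t1 - t2) ** 2 / lengthscale**2)


def hierarchical_cov(unit1, t1, unit2, t2):
    # Shared population-level trend: correlates every pair of units.
    shared = se_kernel(t1, t2, lengthscale=5.0, variance=1.0)
    # Unit-specific deviation: only contributes when both points
    # belong to the same unit.
    own = se_kernel(t1, t2, lengthscale=1.0, variance=0.3) if unit1 == unit2 else 0.0
    return shared + own


same_player = hierarchical_cov("player_a", 3.0, "player_a", 3.0)
different_players = hierarchical_cov("player_a", 3.0, "player_b", 3.0)
```

Because a unit with no data is still correlated with every other unit through the shared term, its predictions fall back to the population-level trend, which is the mechanism behind the out-of-sample behaviour described above.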

Thomas: Yeah, do you also have incomplete data? This is something we had to grapple with: not every satellite provided reliable predictions at every single point in the world, and some of the models were better at different points in the world. Do you have the same thing with your soccer players? I guess not every player will play every game in a season, so you end up with incomplete data. Is that also the case?

Alexandre Andorra: Yeah, exactly. Max knows that data set by heart, because he's the one who literally built it from nothing. But I am very certain that yes, we have players like that: they get injured, or you have a lot of replacement-level or even average players who just come off the bench. They don't start all the time, and they don't enter the game all the time. That's also where the hierarchical structure helps a lot. It's a combination of having predictors that give you information about who the player is, and having a hierarchical structure that then fills in those missing data points.

Thomas: Yeah, super interesting.

Alexandre Andorra: Yeah, this is a very fun project that we've actually been doing for almost two years now, and it keeps on giving. It's a big project and also a fun playground where you can always find something new to...

Thomas: Yeah.

Alexandre Andorra: So, to close up on GPJax, and then I want to touch on another project of yours, because you do a lot of things, Thomas: I'm just curious about the roadmap for GPJax in the coming months. Do you already have priorities, things you want to touch on?

Thomas: Yeah, it's possible that I follow the literature less closely these days, now that I'm not in an academic setting, but my observation is that the literature around GPs has quietened a little from where it was several years ago. So when I look forward, at least for the remainder of this year, I'm not imagining implementing any new methodologies in GPJax. We have a pretty wide range of kernels, at least the major kernels, and our goal is not to implement every single kernel. Our goal is to provide the infrastructure which allows someone to implement their own kernel very quickly. So we have that, we have a couple of different variational approximations which one could use depending on the use case, and we also have a few different inducing point schemes and multi-output Gaussian processes implemented. So I think we have a pretty complete set of functionality within GPJax. When I look forward to the remainder of this year, I want to do two things with GPJax. First, I would quite like to bridge the gap between GPJax and something like scikit-learn or GPyTorch, essentially providing a higher-level abstraction of GPJax which would allow someone to fit a Gaussian process in a few lines of code, versus the 20 or so lines of code we make you write today. Something like providing just a YAML file or a JSON config, and then building your model and fitting it. Because whilst the original use case of GPJax was for researchers, I acknowledge that many people, myself included, sometimes just want to fit a GP. I don't want to define all of the boilerplate code every time. And when I fit a GP, I want things like logging to be implemented under the hood. I want to be able to connect to AWS or some cloud-based system. All of these things can be handled behind the scenes fairly easily, and I would like to provide this abstraction within GPJax, or maybe in a secondary library.
I'm not entirely sure at this point, but essentially giving people a more production-ready way of using GPJax. The other thing is that I would really like to improve the quality of our documentation. I think we've done a pretty good job with it. We've always made a conscious effort to explain the underlying maths as well as what the code is doing. However, most of our examples use simulated or synthetic data, and I think we could make them far more interesting by using real data. We had two collaborators a few years ago now who contributed a really cool notebook to GPJax on modeling vectors. So rather than predicting a scalar value, you're actually predicting a vector. They used this to model ocean currents with GPJax, modeling the spatial vector field, and you can do this using a certain type of kernel. I think this is a really cool notebook. It uses real data, and it's a real problem which people actually work on. So I would like to migrate some of the GPJax examples away from simulated data and bring in some new data sets which are interesting and may just catch the interest of people browsing the documentation. They might look at one and think, I have a very similar problem to that. That's versus today, where we build a data set by composing a sinusoidal function and a bit of white noise. I'd like to go a little bit beyond that. And to be honest, those are the bits of documentation which I enjoy writing as well. Claude Code and Cursor are so good now that they can implement most of the backend-type code far quicker than I ever could, with fewer bugs. But I don't think they're very good yet at writing documentation and case studies. I think a handcrafted narrative is still better than what an LLM can generate for us.
And I still find enjoyment in writing these types of things. So this year, those are the two angles I would like to steer GPJax in.

Alexandre Andorra: Yeah, makes sense. I have much the same experience, where I can delegate a lot of the coding to the agent and just supervise its work, basically doing what I've done a lot on the PyMC side, where you review pull requests. Here you have your agent writing the code and then pinging you with a pull request. You review that, and it goes much faster. Then I get to go to the most enjoyable part, which is: how do I actually use that code, how do I show people how to use it, how do I teach people where, when, and why this is useful? That means writing this kind of pedagogical content, where I think we as humans still have a much more interesting viewpoint, because we're more aware of the difficulties we have when we learn this. This is something we enjoy more, and it's also what's really useful to people, because most users are introduced to a piece of software not through the code on GitHub but through the example notebooks on the documentation website. So it makes sense to invest a lot of time there. And actually, if you already have this notebook ready to share with people, please feel free to add it to the show notes. I think it's going to be very helpful.

Thomas: Yeah, yeah. We'll do.

Alexandre Andorra: Awesome. So now I'm going to start to close us up, because it's going to get too late for you, Thomas. But I do want to talk about Impulso, which is one of your new projects. Again, the resemblance is uncanny, because this is one of our common passions too: it's a package to do vector autoregression in Python, in PyMC even. I of course looked into it before the show, but can you give us the elevator pitch? What is this about, why did you do it, and what is the state of this project?

Thomas: Yeah, so Impulso is a play on the impulse response function, which is an incredibly useful property you get out of structural VAR models. A VAR model is essentially a way of modeling a time series with a vector of outputs, acknowledging that there may be correlation between those outputs. To be totally honest, I'm not an expert in structural VARs. I worked on them with a colleague whilst at Amazon and found them and their use cases fascinating. Since then I have just taken an active interest in reading about them and trying to learn more. I feel like I have a moderate understanding of the principles, and really it was a case of trying to use them and not being able to find a reliable, good implementation in Python which would let me fit them in a Bayesian manner. I think there are some excellent R packages, but I'm not the person who can write good R code for you. So I thought, let's try and implement something in Python to solve this problem. It also came about because of what I just touched on with the use of LLMs. I use LLMs, namely Claude Code, within GPJax nowadays. It's pretty good, but I have a huge bias in the sense that I wrote or reviewed every single line of GPJax, so I know when the LLM is doing something weird or is in the wrong area. I was really curious how we could build software libraries from the ground up using LLMs. So Impulso was a great opportunity to try that out and to document my learnings as we go about how we can use LLMs to build software libraries. We will always need new software libraries, and I think it's super important that we continue to fill gaps in the ecosystem with new ones. So I was curious how we could leverage LLMs to achieve that most effectively. I guess the summary is that Impulso is an entirely selfish project, where I've used it to learn more about VAR models and structural VAR models, and also about how we can use LLMs to build code libraries.
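
For listeners new to impulse response functions, here is a minimal sketch for a VAR(1) in plain Python. It is not Impulso's API, and the coefficient matrix is hypothetical. For y_t = A y_{t-1} + e_t, the response at horizon h to a one-unit reduced-form shock is the matrix power A^h. A structural VAR would additionally need an identification step (for example a Cholesky factor of the error covariance) to turn these into responses to structural shocks, which this sketch leaves out.

```python
def matmul(a, b):
    # Multiply two square matrices stored as nested lists.
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def impulse_responses(coef, horizons):
    # For a VAR(1) y_t = A y_{t-1} + e_t, the response at horizon h to a
    # one-unit reduced-form shock is A**h (the identity matrix at h = 0).
    n = len(coef)
    response = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    out = [response]
    for _ in range(horizons):
        response = matmul(coef, response)
        out.append(response)
    return out


# Hypothetical coefficient matrix for two series.
A = [
    [0.5, 0.1],
    [0.0, 0.8],
]

# irf[h][i][j] is the response of series i at horizon h to a shock in series j.
irf = impulse_responses(A, horizons=3)
```

In a Bayesian fit, the same calculation would be repeated for each posterior draw of the coefficient matrix, which gives a full posterior over the response at every horizon.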

Alexandre Andorra: Yeah, okay, that makes a ton of sense to me. Basically, you were curious about these models and you were looking for toy projects. Again, this is what I do all the time, so I'm really impressed by the resemblance here. I have to say I looked around the project and also sent it to one of my friends, Jesse Grabowski, who is the person who introduced me the most to state space models. I owe all the math I know about state space models to Jesse, who is always extremely patient with me and my misunderstandings of matrix algebra. Jesse was on the show in episode 124 to talk exactly about state space models, especially his brainchild, which is the sub-package in PyMC Extras for doing state space models with PyMC. You can do VAR with that, so vector autoregressions, although we do it a bit differently in PyMC Extras, because we do it from the state space equation point of view, with Kalman filters. If I understood correctly, in Impulso you do that from the linear regression equation point of view. A good thing about that is that your way is going to be much faster to sample, because it's a linear regression. To give listeners an idea, that's going to be a bit more like something like Prophet, if you're used to these kinds of packages. It's going to be more of that structure, whereas the Kalman filter approach is what we implemented in PyMC Extras, which takes more time to sample, but you can do all the other state space model stuff. So, you know, it's always a trade-off. For people curious about these models, I really encourage you to take a look at Impulso. I think the documentation is really well done.
And of course it's still a developing project, so if you want to give Thomas a hand, feel free to do so: open issues, open pull requests. As an open source developer, I always appreciate those, so I'm sure Thomas will too. You can also check out PyMC Extras. I'll put that in the show notes, along with the related episode with Jesse. And finally, Jesse and I taught a tutorial last September at PyData Berlin, introducing the PyMC Extras state space submodule. So if you want an introduction to that, I encourage you to check out the show notes. I'll put it in there. It's called the beginner's guide to state space modeling. On that note, Thomas, is there anything else you want to add about Impulso? Or did we summarize it well?
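
For reference, the two views of a VAR contrasted here can be written side by side. This is the standard textbook formulation, not a statement of the exact parameterization either library uses.

```latex
% VAR(p) for a k-dimensional series y_t
y_t = c + \sum_{i=1}^{p} A_i \, y_{t-i} + \varepsilon_t,
\qquad \varepsilon_t \sim \mathcal{N}(0, \Sigma)

% Regression view: stack the lags into one design vector
x_t = \begin{pmatrix} 1 \\ y_{t-1} \\ \vdots \\ y_{t-p} \end{pmatrix},
\qquad
y_t = B^{\top} x_t + \varepsilon_t,
\qquad
B^{\top} = \begin{pmatrix} c & A_1 & \cdots & A_p \end{pmatrix}

% State space (companion) view: the state holds the last p observations
s_t = \begin{pmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{pmatrix},
\qquad
s_t =
\begin{pmatrix} c \\ 0 \\ \vdots \\ 0 \end{pmatrix}
+
\begin{pmatrix}
A_1 & A_2 & \cdots & A_{p-1} & A_p \\
I   & 0   & \cdots & 0       & 0 \\
\vdots & \ddots &  & \vdots  & \vdots \\
0   & 0   & \cdots & I       & 0
\end{pmatrix}
s_{t-1}
+
\begin{pmatrix} I \\ 0 \\ \vdots \\ 0 \end{pmatrix} \varepsilon_t,
\qquad
y_t = \begin{pmatrix} I & 0 & \cdots & 0 \end{pmatrix} s_t
```

Both describe the same model. The regression view evaluates the likelihood directly from lagged observations, while the state space view runs a filter over the latent state, which costs more but extends naturally to missing observations and extra latent components.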

Thomas: No, I think you gave an excellent summary. I would maybe add that, obviously, there is the backend difference you mentioned between Impulso and PyMC Extras. I think the PyMC Extras state space library also definitely tries to give you more flexibility. Much like I described with GPJax, it allows you to compose things together and gives you a lot of control. That's actually not the problem I'm trying to solve in Impulso.

Alexandre Andorra: Mm-hmm.

Thomas: I'm building software here maybe for economists who just want to fit these models and get access to the inference from the model, and worry less about the mechanics of the model itself. It's more of a high-level abstraction. So yeah, they might do different things on the backend, but I think they will also end up trying to do different things on the front end.

Alexandre Andorra: Right, yeah. They probably optimize for a different kind of population. I think the state space submodule is made to be self-contained in a way if you're using PyMC, but it's still, like you said, a probabilistic programming language environment, so you can go and combine elements together and have structural time series and things like that. I think Impulso, from what I saw, is more opinionated, and I think that's what you were going for, because that way users have to make fewer choices. Some kinds of users will prefer that and some others won't, so it's great to have this diversity in there.

Thomas: Yes. Yes.

Alexandre Andorra: Do you still have, let's say, five or ten more minutes so that I can ask you two other questions before asking you the last two questions? Or should I get to the last two questions already?

Thomas: I can hang around.

Alexandre Andorra: Okay, so thanks for taking the time. You've worked a lot in these industrial settings, so something I really want to ask you is: how do you balance the computational overhead of Bayesian sampling with the latency requirements of production-level decision making?

Thomas: Yeah. So my work has never really been too close to the edge in terms of having to give model inferences in milliseconds. Often for me, the limiting factor has not been the inference speed of a model, but more the decision-making speed of an individual human stakeholder and their ability to digest information. So oftentimes we would run a campaign, we would get the results, and then we would have a few days to analyze them. There the models are never taking more than a few hours to fit, and then you have a couple of days to digest the outcomes and work out how to report them. So rarely have I been in that situation, but even within that there are SLAs, or service level agreements, that your model should meet when it sits in a production system. And I think Bayesian models nowadays, with PyMC and NumPyro and the different samplers we have, are often slower than their frequentist counterparts, but honestly, they're not so slow nowadays. We're not always talking about several hours or days to fit. If your model is correctly framed, the priors are reasonable, and the likelihood is a good choice, they'll often fit quite quickly. And when they're fitting very slowly, sometimes that's actually an artifact of your model: you've got an ill-posed model, and it's slow to sample simply because the sampling routine itself is difficult, maybe because you've specified a bad likelihood. However, they are still slower, and in the past, when I've had really strict service level agreements and we've had to give inferences as quickly as possible, I've actually ended up using things like conformal prediction. So taking frequentist models and then trying to get some type of prediction interval on top using conformal-prediction-type approaches, which are super fast.
In their simplest form, they're just a case of taking some predefined quantiles and applying them to your predictions. So those can be an option when you're really pinched for time. I would also say, though, that sometimes working backwards from the constraint you have is a sensible approach. When I used to work in the supply chain team at Amazon and we had to deliver predictions within 12 hours, we actually framed that as: what can we do within 12 hours to guarantee the best possible model? That means you maybe can't evaluate every single data point in your training set. So we were building secondary models which select the most informative data points, kind of like in active learning, and then fitting the main model on those most informative data points. And then you can get under 12 hours. So sometimes there are just constraints you have to work around. In this particular model we were fitting in the supply chain, we really needed a posterior distribution. There was no way around it. So there it was a case of working backwards from that and working out the best you can do within the time you have. I guess what I'm saying is there's no universal answer, but Bayesian methods are not as slow as they once were, and often there are ways to accelerate them when you really need to.
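[Editor's note: a minimal sketch of the split conformal idea Thomas describes, where a quantile of held-out absolute errors is applied around a fast point prediction. The straight-line model, the simulated data, and the helper names `fit_line` and `conformal_radius` are illustrative assumptions, not anything used on the projects discussed.]

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (stand-in for any fast point model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def conformal_radius(residuals, alpha):
    """Split-conformal half-width: the ceil((n + 1) * (1 - alpha))-th smallest residual."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to guarantee the requested coverage.
        return math.inf
    return sorted(residuals)[k - 1]


# Simulated data (hypothetical): a noisy linear relationship.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(600)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

# Three disjoint splits: fit the model, calibrate the interval, check coverage.
x_train, y_train = xs[:300], ys[:300]
x_cal, y_cal = xs[300:500], ys[300:500]
x_test, y_test = xs[500:], ys[500:]

intercept, slope = fit_line(x_train, y_train)
cal_residuals = [abs(y - (intercept + slope * x)) for x, y in zip(x_cal, y_cal)]
radius = conformal_radius(cal_residuals, alpha=0.1)  # target: 90% intervals


def predict_interval(x):
    """Point prediction plus/minus the calibrated radius."""
    centre = intercept + slope * x
    return centre - radius, centre + radius


covered = [lo <= y <= hi for (lo, hi), y in zip(map(predict_interval, x_test), y_test)]
coverage = sum(covered) / len(covered)
print(f"radius={radius:.2f}, empirical coverage={coverage:.2f}")
```

[Once the radius is calibrated, each new interval costs one prediction and one addition, which is why this kind of approach suits tight latency budgets. Unlike a posterior, it gives an interval with a coverage guarantee rather than a full distribution.]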

Alexandre Andorra: Yeah, I mean... I couldn't have said it better. The thing is, I wanted you to say it, because otherwise I sound like a preacher and that's less objective. But yeah, it's also been my experience: I am now very rarely convinced that you cannot use a Bayesian model if you want one. I've actually never really seen that in all the different models

Thomas: Hahaha.

Alexandre Andorra: I've shipped in industry. It's always been: well, no, you know, we can make that model better. And as you were saying, most of the time it's that there is an overparameterization, or the priors are too wide or too narrow. Or we can use the latest nutpie sampler, or we can use an approximation of a GP here, so let's use an HSGP instead of a plain vanilla GP, you know, all that stuff. And then you end up sampling huge hierarchical Gaussian processes with a lot of data points on your laptop in like 15 minutes. That's very, very fine, especially if you don't need to run that model every day. And even if you needed to, that's 15 minutes, so it's still super fast.

Thomas: Yes. Yes, I agree. I totally agree.

Alexandre Andorra: So, yeah, this is already such a great point, right? Because I remember when I was starting this podcast in 2019, that was a huge point of contention, not necessarily among academics and practitioners, but among people who did know about Bayesian stats and who would still have this preconceived notion that it's most of the time the best theoretically but impractical to apply, so, you know, let's not even try. And now I feel like this battle has pretty much been won when it comes to education, most of the time.

Thomas: I agree. And I think with the scaling of GPUs and the quality of CPUs available to us... data sets might get bigger, and GPs in their vanilla form are a particularly bad case, with cubic scaling in the number of data points, but often throwing more compute at a problem can solve a lot of your problems very quickly, if you have the resources to do so. Yeah. Yeah, that's a good point.

Alexandre Andorra: That's an easy one, but yeah. And actually, given your experience and your work on these scalable Bayesian tools, I'm wondering how optimistic you are that these methods will soon become the default choice for causal inference in the industry.

Thomas: Yeah, that's a good question. I'm not sure they will ever be the default choice. And maybe that's okay. It's my belief that when we report treatment effects, we should be talking about the uncertainty that comes through a posterior distribution. But there are other fine ways of doing this. I actually often don't think they lead to the wrong outcome. I think they're often just more work to get to that same outcome. I think there are pretty popular frameworks which can achieve pretty good results. The double ML framework, in certain instances, can be quite an effective tool. I think it's often used as the default choice when maybe we could look to other models to achieve what double ML can do for us. But I think there'll always be other tools. As we evolve as an industry to continue using causal inference to really provide data points around high-value or high-risk decision making, because it's still not the norm, I don't think. I think we are still working towards a world where everyone is doing that. But as we continue to do that, I think the need for Bayesian methods becomes more pronounced, not because we as Bayesian practitioners believe them to be the right model to use, but because of the guidance we're getting from our stakeholders, which means the only model that is really reasonable to use is a Bayesian model. For example, when I've worked on some geo testing and video marketing projects in the past, our stakeholders really wanted to know questions like: well, what's the probability, what's the risk of us making the wrong decision here? And can you put dollar values on that? That's really quite easy to do when you have access to the posterior distribution. But if you're trying to do that through the lens of p-values or confidence intervals, that's actually quite challenging.
And so I think it may end up being more commonly used because we end up being more risk-minded when we start making these decisions, as we as an industry evolve to using these models more for our decision making.
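[Editor's note: a minimal sketch of how the stakeholder questions Thomas mentions (the probability of a wrong decision, and its dollar value) can be read off posterior draws. The function name `decision_risk`, the launch cost, and the simulated draws are hypothetical; in practice the draws would come from a fitted model in a library such as PyMC or NumPyro.]

```python
import random


def decision_risk(effect_draws, cost):
    """Summarise a launch / don't-launch decision from posterior draws.

    effect_draws: posterior samples of incremental revenue, in dollars.
    cost: dollars spent if we launch.
    """
    net = [draw - cost for draw in effect_draws]
    n = len(net)
    expected_net = sum(net) / n
    launch = expected_net > 0  # act on the posterior mean net gain

    if launch:
        # Launching is wrong whenever the true net gain is negative.
        prob_wrong = sum(v < 0 for v in net) / n
        expected_loss = sum(max(0.0, -v) for v in net) / n
    else:
        # Holding back is wrong whenever launching would have paid off.
        prob_wrong = sum(v > 0 for v in net) / n
        expected_loss = sum(max(0.0, v) for v in net) / n

    return {
        "launch": launch,
        "expected_net": expected_net,
        "prob_wrong": prob_wrong,
        "expected_loss": expected_loss,
    }


# Stand-in for real posterior samples (hypothetical numbers).
rng = random.Random(1)
draws = [rng.gauss(120_000, 80_000) for _ in range(4000)]
report = decision_risk(draws, cost=50_000)

print(f"launch: {report['launch']}")
print(f"P(wrong decision): {report['prob_wrong']:.1%}")
print(f"expected dollars lost if wrong: {report['expected_loss']:,.0f}")
```

[Both numbers are plain averages over the posterior draws, which is why they are easy to produce once a posterior is available.]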

Alexandre Andorra: Yeah, yeah, I agree. And I think that was kind of a provocative question on my end. I completely agree with you that the assumption behind the question was that Bayesian stats would somehow need to become the default. I don't think they do. But I do think that they need to be more accepted out of the box as something in the toolbox. Basically, just as you're using scikit-learn or PyTorch or any other framework which looks classic to any member of the team, then when it's needed, you're using a Bayesian causal inference method because that's what solves the problem best, and you don't have to overly justify why. You just justify it as you would using a neural network or an LLM for some other project. And I think that's definitely what we're trying to build towards, which is where people say: yeah, here it definitely makes sense, let's just use that, and we're good, and then we're going to have all the bells and whistles. I'm also very sure that when people start to see that you don't need to know what a p-value is, and that you can actually interpret the confidence intervals the way you want to interpret them, this will become one of the beloved methods of statistical folks, for sure.

Thomas: Okay. Yes, I totally agree on that point. I also think, just picking up on something you said there, that there's an element of us needing to own our own shortcomings here. Scikit-learn and other tools have done an amazing job of making frequentist-type methods super accessible. To be completely honest, we as the Bayesian community have not done such a good job. A lot of our packages and a lot of our frameworks

Alexandre Andorra: I know.

Thomas: are very much geared towards researchers or people wishing to fit new Bayesian models. And I think there are libraries like PyMC and NumPyro, and what I'm trying to do with Impulso, that try to bridge that gap. But we really haven't designed an ecosystem which makes it particularly friendly for people who don't already understand a lot about Bayesian methods to just go and fit a Bayesian method off the shelf.

Alexandre Andorra: Mm-hmm.

Thomas: So I think there's an element of us also needing to continue to build the right tools, which will allow Bayesian methods to become more of a default option within a practitioner's toolbox.

Alexandre Andorra: Yeah, yeah, 100% agree. Awesome. Well, Thomas, I could continue this conversation, but you're going to have to go to bed at some point, so I'll let you go. But before letting you go, I'm going to ask you the last two questions I ask every guest at the end of the show. First one: if you had unlimited time and resources, which problem would you try to solve?

Thomas: So yeah, I mean, there are huge problems in the world, and most of them I'm grossly incapable of solving. But I think a big problem which I would love to try and solve, if I had unlimited time and resources, would be the supply chain issue that we have in the world. The brittleness and the fragility of supply chains feels like a problem which I could imagine working on. But it's incredibly complex. Getting data around where the weak points in supply chains sit and trying to model them using Bayesian methods would be incredibly meaningful and incredibly impactful, but incredibly costly in terms of time, money and resources. But if you're telling me I have unlimited time and resources, I think that would be the problem I would spend a lot of time thinking about. How do we build more resilient supply chains, and how do we understand the risk of supply chain problems?

Alexandre Andorra: Yeah, that makes sense for sure. And then the second question is: if you could have dinner with any great scientific mind, dead, alive or fictional, who would it be?

Thomas: It would be amazing to have dinner with Radford Neal, actually. I always find Radford Neal's papers incredibly conversationally written whilst also being highly academically stimulating and informative. I could imagine he'd be an incredibly interesting person to have dinner with. And really for me, when I was first getting into the world of Bayesian statistics, some of his papers on connecting Gaussian processes to infinitely wide neural networks, and on Hamiltonian Monte Carlo, and a lot of these types of papers, honestly were so accessible whilst also being so dense in information that I would keep reading them and coming back to them. It'd be amazing to sit down with Radford Neal and try and learn some of the things which he has in his head.

Alexandre Andorra: Yeah, yeah, I love that. And I think you're the first one to answer that on the show. That's not the goal, the goal is to sample the distribution, but you're definitely in the tail for now. Awesome. Well, Thomas, let's call it a show. That was an absolute pleasure to have you on. Thank you for staying up for us. I'm sure everybody appreciates it.

Thomas: Okay. Okay.

Alexandre Andorra: And well, listeners, if you want to give Thomas a token of appreciation for his work, definitely drop him a word wherever you go, LinkedIn or GitHub, if you're using any of his projects, or if you have ideas on how to contribute, or just want to start in open source and think that one of Thomas's projects is something you want to contribute to. Then just chime in on the GitHub repo and I'm sure they'll be very happy. On that note, Thomas, thank you so much for taking the time and being on this show.

Thomas: Thank you ever so much for having me. It's been a really fun conversation. I couldn't have thought of a better way to spend my evening.

Alexandre Andorra: Well, I am very glad to hear that.