10/10. Would hire. (I run the R&D wing of a data science company).
Did you find a difference between home and away games?
I hope it is okay if I play the role of skeptic. It's not that I don't think you did good work - this is a fantastic write-up - but there are a few things that make me hesitant.
* Was your final sample size ~450 with only ~30 runs without timeouts? Given that there are over 5000 games in a given season, that seems small to me. Do fewer than 10% of all games contain a run that might merit a timeout? If it is the right sample size, I am also surprised that a timeout was called in about 93% (~420/450) of scenarios where a team went on a run of 6+ points in a short timeframe.
* I cannot tell from your code whether you limited the analysis to only those games where the final overall scoring margin during the game was greater than 5 points. You mentioned it in your documentation, but I could not verify whether it was implemented. I agree with your inclination to control for 'guarantee game' blowouts, where teams are likely to go on 6+ point runs, the opponent is less likely to call a timeout, and the run is likely to continue. However, I don't think the solution is to restrict the dataset to only those games that ended up close. You lose some relevant timeout scenarios (e.g., team that is up large/has opponent go on 6pt run/calls timeout/proceeds to blow them out) and potentially bias the control scenarios.
* I must admit that I am not 100% clear on your methodology. I think I understand how you assess the stoppage of runs for settings with a timeout: you start the time-window at the timeout and track performance over the next 10 possessions. I am less clear on how you handle non-timeouts. Do you start the clock once the run becomes official (i.e., 6+ points) or when it hits its peak? If either of these is true, I believe you might be giving an advantage to the non-timeout group. As a quick check, it would be interesting to compare summary statistics of run size between the timeout and non-timeout groups.
* Have you considered a matching study design? That is, for every scenario where a timeout was called, find a similar scenario where a timeout was not called. You could match on the current scoring differential, the time left in the game, and even on how the run progressed (e.g., made 2, miss, made 3, steal, made 3).
* Have you considered working with expected win probabilities? Given that you scraped play-by-plays for every game, you could likely create empirical win probability tables for every second of a 40-minute game. I see two key advantages here. I personally think this better quantifies a run (is a game 'slipping' away?) compared to a simple point differential. I am skeptical of 6+ points being classified as a run as a blanket statement. Also, it may help eliminate the 'blowout' scenarios.
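To make the matching suggestion above a bit more concrete, here is a minimal sketch of greedy nearest-neighbor matching without replacement. Everything in it is an assumption for illustration: the `Scenario` fields, the `distance` weighting (one point of margin treated like 60 seconds of clock), the `max_distance` caliper, and the toy records. It is not taken from the posted code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    score_diff: int    # margin for the team giving up the run (hypothetical field)
    seconds_left: int  # game clock remaining when the run is flagged
    timeout: bool      # was a timeout called?
    run_ended: bool    # did the run stop afterwards?

def distance(a, b, time_scale=60.0):
    """Crude dissimilarity: points of margin plus minutes of clock (assumed weighting)."""
    return abs(a.score_diff - b.score_diff) + abs(a.seconds_left - b.seconds_left) / time_scale

def match_pairs(scenarios, max_distance=3.0):
    """Pair each timeout scenario with its closest unused no-timeout scenario."""
    treated = [s for s in scenarios if s.timeout]
    pool = [s for s in scenarios if not s.timeout]
    pairs = []
    for t in treated:
        if not pool:
            break
        best = min(pool, key=lambda c: distance(t, c))
        if distance(t, best) <= max_distance:  # caliper: skip poor matches
            pairs.append((t, best))
            pool.remove(best)                  # match without replacement
    return pairs

# Made-up records, only to show the mechanics.
scenarios = [
    Scenario(-4, 600, True, True),
    Scenario(-5, 630, False, False),
    Scenario(10, 120, True, False),
    Scenario(-20, 1000, False, True),
]
pairs = match_pairs(scenarios)
```

With only ~30 no-timeout runs available, most timeout scenarios would go unmatched under a scheme like this, so the matched comparison would rest on a small subset.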
Again, I think this was a great investigation. I especially want to thank you for sharing your code. *Edits for formatting*
So I guess what you're saying is, Roy knew?
Good work. Someone should pay you for this.
I think a 5 point average score margin is probably too restrictive a cutoff for identifying competitive games. You could increase it to 10 and be comfortable. I think we can all agree that a game that oscillates between a 5 and 15 point lead for one team is still a competitive game.
A few comments.
A couple of preliminaries, first on notation: p(x|y) is not the probability that x AND y occur. It is the probability that x occurs GIVEN that y has already occurred. Bayes rule is all about this: p(x and y) [often written p(x ∩ y)] = p(x|y)*p(y) = p(y|x)*p(x). Secondly, keep in mind that this is an observational study, and so inferring causality is tricky. In particular, p(run ended|timeout) is NOT the "probability that calling a timeout is responsible for ending a run". It is the probability that the run ended given that a timeout was called, and says nothing directly about causal links.
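Written out for this problem, with RE for "run ended" and T for "timeout called", the identity above rearranges to:

```latex
\begin{align}
p(RE \cap T) &= p(RE \mid T)\,p(T) = p(T \mid RE)\,p(RE) \\
p(RE \mid T) &= \frac{p(T \mid RE)\,p(RE)}{p(T)}
\end{align}
```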
I think your calculations of p(RE|T) are correct, but two things: (a) the denominator in your Bayes equation is actually just p(timeout called), which you could calculate directly and simplify your code. But also (b), do you need to do the Bayes bit at all? Can't you calculate p(RE|T) directly from the data, as you are doing now for p(T|RE)? [i.e. find all situations where a timeout was called, and tabulate the proportion of those where the run ended = p(RE|T)].
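A small sketch of point (b): tabulating p(RE|T) directly and checking that it agrees with the Bayes route. The records are made up for illustration and the names (`runs`, `prob`) are my own, not from the posted code.

```python
from fractions import Fraction

# Hypothetical toy records: (timeout_called, run_ended). Not real game data.
runs = [
    (True, True), (True, False), (True, False), (True, True),
    (True, False), (False, True), (False, False), (False, True),
]

def prob(events, predicate):
    """Share of events satisfying predicate, as an exact fraction."""
    events = list(events)
    return Fraction(sum(1 for e in events if predicate(e)), len(events))

# Direct route: keep only runs with a timeout, then tabulate how many ended.
with_timeout = [r for r in runs if r[0]]
p_re_given_t_direct = prob(with_timeout, lambda r: r[1])

# Bayes route: p(RE|T) = p(T|RE) * p(RE) / p(T)
ended = [r for r in runs if r[1]]
p_t_given_re = prob(ended, lambda r: r[0])
p_re = prob(runs, lambda r: r[1])
p_t = prob(runs, lambda r: r[0])
p_re_given_t_bayes = p_t_given_re * p_re / p_t

print(p_re_given_t_direct, p_re_given_t_bayes)  # the two routes agree
```

Both routes use the same counts, so the direct tabulation is the shorter path to the same number.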
Either way, the absolute value of p(RE|T) is not particularly informative (so your statement "as low as a 22% chance that timeouts are responsible for ending runs" is a little misleading). Imagine a situation where the probability of a run ending (without timeout) is 0.1, but it's 0.2 when a timeout is called. That would seem to be pretty good evidence that calling a timeout is associated with an increased probability of the run ending, even though the probability of the run ending is still quite low in absolute terms.
Which brings us back to causality. Ideally you want to find situations where timeouts were called, and comparable situations where timeouts were not called, and look for evidence of a difference in the two sets of outcomes. I realize that you already know this, I'm just writing it down. Having found that evidence, it's down to interpretation and judgement as to whether or not the timeout was the causal mechanism. So the more relevant interpretation of p(RE|T) would be to look at p(RE|T)-p(RE|not T). A positive value would indicate a potential positive effect of timeout. This is what you're doing in the last figure. But your differences are negative (p(RE|not T) > p(RE|T) ?), with timeouts being associated with a slightly lower probability of the run ending. Which might be genuine (perhaps timeouts tend to be called when coaches get desperate and the run is basically unsalvageable, so the fact that the timeout has been called is really an indicator of the calling team being outplayed, rather than any indication of the timeout itself having a negative effect on run ending).
But also I wonder if the way you are viewing runs is slightly problematic: runs of up to 10 events seem quite long, and the chances of a timeout being called in that span are quite high (e.g. look at your score ratio=1.0 numbers: you have 456 runs, in 430 of which a timeout was called). So (a) you have very little data about what happens when timeouts are NOT called, which makes it difficult to make inferences about the effect of timeout. And (b) there is no distinction between a timeout called early in the run vs a timeout called late in the run.
Maybe there is some value in shortening your 10-event window, and/or comparing runs of given length (e.g. find all runs that extended for at least 3 events. Calculate the proportion of runs that ended after timeout was called AT 3 events, and the proportion that ended at 3 events without timeout. Repeat for different run lengths).
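Here is one way the run-length comparison could be sketched. The record layout is my assumption: `length` is the number of events the run lasted and `timeout_at` is the event count at which a timeout was called (`None` if never). Treating "ended" as `length == k`, and dropping runs whose timeout came before event k from both groups, is one reading of the idea rather than the only one. The data are invented.

```python
from collections import namedtuple
from fractions import Fraction

Run = namedtuple("Run", ["length", "timeout_at"])  # hypothetical record layout

def ended_share(group, k):
    """Fraction of runs in group that stopped at exactly k events."""
    if not group:
        return None
    return Fraction(sum(1 for r in group if r.length == k), len(group))

def compare_at(runs, k):
    """Return (p(RE|T), p(RE|not T), difference) among runs that reached k events."""
    alive = [r for r in runs if r.length >= k]
    treated = [r for r in alive if r.timeout_at == k]
    # Controls: no timeout at or before event k.
    control = [r for r in alive if r.timeout_at is None or r.timeout_at > k]
    p_t, p_c = ended_share(treated, k), ended_share(control, k)
    diff = None if p_t is None or p_c is None else p_t - p_c
    return p_t, p_c, diff

# Made-up runs, only to show the mechanics.
runs = [
    Run(3, 3), Run(5, 3), Run(3, None), Run(4, None),
    Run(6, None), Run(6, 5), Run(2, 2), Run(4, 4),
]
for k in (3, 4):
    print(k, compare_at(runs, k))
```

A positive difference at a given k would point toward runs ending more often when a timeout is called at that stage, subject to all the caveats about causality above.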
Final comment, it might be worth repeating this in a statistical modelling framework and comparing results. It would be fairly straightforward: fit a binomial generalized linear model where the outcome (run ended) is a function of whether timeout was called (true or false), and look at the coefficient of the timeout term (and test its significance, if you like).
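As a rough illustration without a modelling library: when timeout (true/false) is the only predictor, the fitted coefficient of a binomial GLM with logit link equals the log odds ratio of the 2x2 table, and its Wald standard error has a closed form. The sketch below computes those directly instead of fitting a model. The function name and the counts are hypothetical; the counts are chosen to mirror the 0.2 vs 0.1 illustration earlier in this comment.

```python
import math

def timeout_coefficient(a, b, c, d):
    """Log odds ratio for timeout vs no timeout, with a Wald test.

    a: timeout, run ended        b: timeout, run continued
    c: no timeout, run ended     d: no timeout, run continued
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("every cell must be positive for the log odds ratio")
    beta = math.log((a * d) / (b * c))           # coefficient of the timeout term
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Wald standard error
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return beta, se, z, p_value

# Hypothetical counts: 20/100 runs ended with a timeout, 10/100 without.
beta, se, z, p_value = timeout_coefficient(20, 80, 10, 90)
print(f"coef={beta:.3f} se={se:.3f} z={z:.2f} p={p_value:.3f}")
```

Once covariates such as score margin or time remaining are added, the closed form no longer applies and a proper GLM fit is needed.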
Sorry, all of that sounds rather negative; it's not meant to be. There are definitely some interesting results in there, they just need some refining to bring them out.
You might be interested in this study, which looks at a closely-analogous situation of timeouts in volleyball matches: [detail](http://untan.gl/articles/2016/07/16_timeouts-in-the-polish-volleyball-league.html) and [shorter summary](https://markleb1.wordpress.com/2016/07/11/the-truth-about-timeouts-part-two/). Basically - there's not a lot of evidence that timeouts help in volleyball, either.