A cautionary tale of reference class forecasting

Feb 15, 2017

In an earlier post, I espoused the virtues of “reference class forecasting”. At TrikeApps, we changed our estimation process from “choose a number” to “choose a story of similar complexity”. The literature suggested this was a superior way to avoid “inside view” bias; in our case, it was not. I will explain why the technique was not effective for us, and propose a new one to try in future. If you are considering this estimation method, this should serve as a cautionary tale.

User stories get out-of-date and systems are a moving target. TrikeApps has 15-20 developers working on client systems. For the most part, developers work in a specific area most of the time, developing expertise. Reducing task switching costs in this way has improved our team velocity. We allow developers to switch teams when they feel they are getting stale, but we don’t encourage them to. Even so, trying to look at a six-month-old story and do a complexity comparison is difficult. What state were the systems in at the time? Was this before or after that big refactoring that simplified the whole system? Was the subject matter expert on the client side on holiday when we did that?

When team structures aren’t aligned with sprint objectives, domain knowledge gets fragmented. Our clients are growing fast, as are the systems they need to support that growth. We used to be able to split the teams along client lines. Today, our biggest client has five systems performing very different functions. We persisted with a single team serving that client for some time. During this period, Developer A might develop deep expertise in System A, and then tended to complete any stories that affected System A. When Developer B encountered a System A story during a complexity comparison, they either had to conduct a pile of research to understand the context or just take a wild guess. Splitting teams across sprint objectives, rather than systems, has improved the situation.

We could solve this last problem by finding comparison stories in the same functional area as the story being estimated. This would mean we’d need to organise or tag our stories by functional area. We’ve tried this before. TrikeApps completes something in the range of 45-60 stories per sprint, and the administrative overhead of agreeing on canonical tags and tagging every story is enormous.

We’ve retracted the experiment, and gone back to “choose a number”. In the time the experiment ran, the average story[1] went from blowing out by 5% to blowing out by 12%. Since retracting, we’ve gotten back down to an average of 8%. Sometimes stories are estimated months ahead of time, so these experiments have a long tail.
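For the curious, the footnote below spells out that metric: actual effort divided by estimated effort, rolling-averaged over four sprints. Here is a minimal sketch of the bookkeeping, assuming each story carries hypothetical sprint, estimated and actual fields rather than anything our tracker actually exports:

```python
from collections import defaultdict

def blowout_by_sprint(stories):
    """Average blowout per sprint, where blowout = actual / estimated.

    Each story is assumed to look like
    {"sprint": 42, "estimated": 8.0, "actual": 9.5} (hypothetical fields).
    Returns sprint -> blowout expressed as a percentage over estimate.
    """
    ratios = defaultdict(list)
    for story in stories:
        ratios[story["sprint"]].append(story["actual"] / story["estimated"])
    return {
        sprint: (sum(rs) / len(rs) - 1.0) * 100
        for sprint, rs in sorted(ratios.items())
    }

def rolling_blowout(stories, window=4):
    """Rolling average of the per-sprint blowout over the last `window` sprints."""
    per_sprint = list(blowout_by_sprint(stories).values())
    return [
        sum(per_sprint[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(per_sprint))
    ]
```

With this convention, a per-sprint ratio of 1.05 reads as the 5% blowout above, and 1.12 as the 12% one.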

Our next experiment will find the dimensions that correlate with estimate values. We will ask our developers a series of questions before they provide an estimate. We’ll track the coefficient of determination for each question and answer, and retire or promote questions and answers based on predictive ability. We’ll then compare the outcomes of following the model’s predictions with those of our human estimators. It’s going to be fascinating.
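To make that concrete, here is a minimal sketch of the scoring step, assuming each answer is encoded as a number and scored against the story’s eventual actual effort (the target could just as easily be the blowout ratio); the question names and record fields are hypothetical:

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple least-squares fit of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    ss_yy = sum((y - mean_y) ** 2 for y in ys)
    if ss_xx == 0 or ss_yy == 0:
        return 0.0
    return (ss_xy ** 2) / (ss_xx * ss_yy)

def rank_questions(records, questions):
    """Rank pre-estimate questions by how well their answers predict actual effort.

    Each record is assumed to look like
    {"answers": {"new_integration": 1, "touches_billing": 0}, "actual": 9.5}.
    Low scorers are candidates for retirement; high scorers get promoted.
    """
    scores = {
        q: r_squared([r["answers"][q] for r in records],
                     [r["actual"] for r in records])
        for q in questions
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Comparing the model against our humans would then be a matter of running both sets of estimates through the same rolling blowout calculation.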

Are you smashing your estimates without shouldering a massive administrative burden? Tell us your secrets.

[1] actual / estimated, rolling average over four sprints
