Following on the topic of run-time changeability from the previous post, it is now time to talk about new features and the process of deploying them.
Scenario: your development team has been hacking non-stop for the last few months on a highly anticipated feature. This feature has been developed on a branch all along and has never seen the light of production. Until now: the upcoming release of the software has the branch merged in and the shiny new code built into the binary, ready to roll out to production. You start deploying it and… guess what? Everything works just fine and users are grateful for the new functionality! Except… the new code triggers a serious data-corruption bug that you do not discover until you have rolled out the new release to 80% of your fleet.
Oops. What now? Do you ignore the problem? I’d suggest not: sounds pretty bad. So: do you roll the whole fleet back just because this new feature has a serious bug? Maybe… but how much will that cost you in time, wasted morale and the risk of hitting yet another bug during rollback (which of course never happens)?
Wouldn’t it be nice if you could just reconfigure that 80% of the fleet to explicitly disable the new feature so that the new code path does not get exercised at all? Such a reconfiguration is most likely easier to do, faster to deploy and safer overall.
The answer is of course yes. Unfortunately, such an option is not always available because, as a developer, you may not think of making the shiny new feature optional. “Why would something new that nobody uses yet have to be optional at all? It is not going to regress existing functionality!” is probably the mentality (and I know because I have had it).
So what does this mean? Make sure that any new feature added to the software is protected by a configuration knob that you can turn at run-time. I suggest that the knob default to false so that the operator deploying the new release must be aware of the new feature and explicitly enable it. With this in place, if the feature ever causes a problem during or soon after the rollout, the operator will easily know how to disable it, because they enabled it in the first place!
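To make the idea concrete, here is a minimal sketch in Python of what such a knob could look like. Everything in it is an assumption made for illustration: the `FeatureKnobs` class, the JSON file format, the `new_index` knob name and the `save_record` function are all hypothetical, and a real system would likely use whatever configuration mechanism it already has.

```python
import json


class FeatureKnobs:
    """Boolean knobs stored in a JSON file (hypothetical format: {"name": true})."""

    def __init__(self, path):
        self._path = path

    def enabled(self, name):
        # Re-read the file on every check so that an operator's edit takes
        # effect at run-time, without restarting or redeploying the binary.
        try:
            with open(self._path) as f:
                knobs = json.load(f)
        except (OSError, ValueError):
            # Missing or malformed configuration: fall back to "disabled".
            return False
        # Any knob that is absent or not literally true counts as disabled,
        # so a new feature stays off until someone explicitly enables it.
        return isinstance(knobs, dict) and knobs.get(name) is True


def save_record(record, knobs, store):
    """Hypothetical write path with the new feature behind a knob."""
    if knobs.enabled("new_index"):
        store.append(("indexed", record))  # new, riskier code path
    else:
        store.append(("plain", record))  # existing, proven code path
```

With this shape, disabling the feature on the already-upgraded part of the fleet amounts to flipping `new_index` back to `false` in the configuration file. Re-reading the file on every check is the simplest way to get run-time behavior; caching with a periodic reload would be the obvious refinement if the check sits on a hot path.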
“Unfortunately,” this brings us, again, to the problem of knob creep (which I have not really discussed yet but which was brought up by a coworker in the previous post). Do you want to keep the knob for a core feature that was rolled out 2 years ago and that has not shown problems since? Probably not. And for a feature rolled out 1 month ago? Probably yes. Where to draw the line is hard to tell and depends on your environment and the risks you want to take, but you should draw the line somewhere and be proactive about turning new features into part of the core package and removing their knobs. The code will be cleaner and, potentially, faster.
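One way to stay proactive is to record when each knob was introduced and periodically list the ones that have outlived their usefulness. The sketch below is only an illustration: the `stale_knobs` helper, the dictionary of introduction dates and the 180-day default are assumptions of mine, and the right threshold is exactly the line you have to draw for your own environment.

```python
from datetime import date, timedelta


def stale_knobs(introduced, today, max_age_days=180):
    """Return the names of knobs older than max_age_days, oldest first.

    `introduced` maps a knob name to the date its feature was rolled out.
    The 180-day default is an arbitrary placeholder, not a recommendation.
    """
    cutoff = today - timedelta(days=max_age_days)
    old = [(when, name) for name, when in introduced.items() if when <= cutoff]
    return [name for when, name in sorted(old)]
```

Running something like this as part of a periodic review gives you a list of candidates for folding into the core package, while recently added knobs stay in place.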