This is a transparent diary of a stressful month in the life of Nordnet’s Head of Core Development and what goes on behind the screens. IT is a department that works 24/7.
Not another framework success story
You’ll find a million blog posts about which frameworks someone managed to use successfully and what new tech there is to fiddle with. So let’s not pour more generic nonsense into that category. Instead, it’s probably much more interesting to know what’s really going on (or going wrong) in the real world, aka production.
First of all, no one spills these sorts of dark secrets without a disclaimer. Here’s mine: every mistake can be sugar-coated with all kinds of excuses or promises for the future, but sod it – we messed up, a lot. We’ve sort of stopped doing that, but not completely. The thing is that stuff goes wrong everywhere; the only difference is the amount of makeup and hot air that follows to cover it up. I guess it’s proportional to company size.
You might have new code running for the first time or you might have reconfigured something, but it’s all the same, and everybody dealing with critical systems will recognise the feeling of instant maximum angst, cold sweat and panic when you realise things have blown up. It’s not pretty, and sometimes you see an IT veteran’s eyes glaze over into a thousand-yard stare, ending with a small shudder. That’s them remembering old mistakes.
Hopefully you learn and improve from such mistakes, and if you don’t, someone should fire you! In the name of transparency – here’s what we learned some 2000 days ago (thank god those days are over!).
One of the Oracle database nodes starts acting up and complaints are coming in about the site being slow. The node is restarted and all is well. Apparently a backup was running right in the middle of trading hours.
One server fills up the disks and the site goes down. Everybody’s screaming.
The database pools are tuned to handle more load, which blows up the Oracle cache and in turn exhausts the database memory. Site down.
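With hindsight the arithmetic was simple. Here’s a minimal back-of-envelope sketch with hypothetical numbers (not our actual configuration): every pooled connection is a dedicated database session that holds its own memory on the database host, so scaling up the pools scales the memory bill with it.

```java
// Hypothetical numbers, for illustration only: the point is that pool
// size times server count times per-session memory has to fit on the
// database host, or "tuned for more load" becomes "out of memory".
public class PoolSizingSketch {
    public static void main(String[] args) {
        int appServers = 10;          // assumed fleet size
        int maxPoolPerServer = 100;   // the pool size after "tuning"
        long memPerSessionMb = 10;    // assumed per-session cost on the DB

        long sessions = (long) appServers * maxPoolPerServer;
        long memNeededMb = sessions * memPerSessionMb;
        System.out.printf("%d sessions -> ~%d MB on the database host%n",
                sessions, memNeededMb);
    }
}
```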
Price feeds are delayed and it seems that the German price feed is silent. The full disk killed them four days ago. The feeds are manually restarted and all is well.
Wintrade users are experiencing delays. US stock markets take a dive in the evening.
Wintrade delays again. It’s optimized in pure panic, but it doesn’t help. The disk gets full again and the site takes a dive.
A lot of servers are reallocated in the server room and one database master node gets trashed. A new one is built.
Trading in Germany is offline for some time due to an OS configuration error.
An application server has a faulty clock which just keeps drifting.
In the evening all application servers lose their database connections and the site is stuttering. Restarts are needed.
Someone is upgrading the BIOS on some Solaris machines and accidentally reboots the whole trading system – in the middle of trading. Not brilliant. A few moments later the German trading breaks down. Apparently our connection provider keeled over because we sent too many orders at once.
One Oracle database node blows up but reboots in a couple of minutes. During the night a timestamp conversion results in an overflow and a bunch of trades are not visible to customers.
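I can’t show the actual bug, but here’s a minimal sketch of the failure mode, assuming a hypothetical conversion (our code and column types were different): squeeze a 64-bit timestamp into a 32-bit value and it silently wraps, and suddenly a night’s worth of trades has timestamps no query will ever match.

```java
// Hypothetical illustration of a timestamp conversion overflow (not our
// actual code): millisecond epoch times need 64 bits, so a cast to a
// 32-bit int silently wraps, possibly to a negative value.
public class TimestampOverflowSketch {
    public static void main(String[] args) {
        long epochMillis = System.currentTimeMillis(); // ~1.7e12, needs 64 bits
        int truncated = (int) epochMillis;             // silent wrap-around
        System.out.println("64-bit millis: " + epochMillis);
        System.out.println("after int cast: " + truncated); // garbage, maybe negative
    }
}
```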
Just before opening call one database cluster stops accepting new connections and the whole site stalls in seconds. An application server reboot revives the site. The funny thing, in a sad way, is that another cluster also had problems but the monitoring was configured to check the wrong cluster. Luckily both were having problems, so the alarms worked anyway. Jesus.
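For the curious, the failure mode is depressingly mundane. A hypothetical sketch (not our monitoring stack, and with made-up hostnames): the check meant for one cluster is pointed at the other, so it only ever alarms on the wrong box.

```java
// Hypothetical sketch of a monitoring check wired to the wrong target.
// Hostnames and port are made up; the bug is the single misconfigured line.
import java.net.InetSocketAddress;
import java.net.Socket;

public class ClusterCheck {
    static boolean accepting(String host, int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 2000); // 2 s timeout
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Meant to watch cluster A; actually watches cluster B.
        String target = "db-cluster-b.internal"; // should have been db-cluster-a
        if (!accepting(target, 1521)) {
            System.out.println("ALARM: " + target + " refusing connections");
        }
    }
}
```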
Another database master node goes down and everything stops working. This time because someone accidentally shut it down in the server room. Later a bug in the trading system halts order handling and a reset is needed.
Wintrade delays again, this time severe. A desperate restart of the price feeds is done, but that only overloads the trading system and everything needs a restart. Downtime. People work through the night and find big performance hogs, which are fixed.
My god… All this in one month. Every incident is a gut-wrenching moment for the people involved, and I can’t believe we got any development done with all that chaos, but we did! And quite a lot of it. You live and you learn!
//Tommi Lahdenperä, Head of Core Development