Nordnetbloggen
Nordnet Tech
Nordnet’s IT and Innovation departments

2000 days ago we messed up – a diary

This is a transparent diary of a stressful month in the life of Nordnet’s Head of Core Development and what goes on behind the screens. IT is a department working 24/7.

Not another framework success story

You’ll find a million blog posts about which frameworks someone managed to use successfully and which new tech to fiddle with. So let’s not pour more generic nonsense into that category. Instead, it’s probably much more interesting to know what’s really going on (or going wrong) in the real world, aka production.

Disclaimer

First of all, no one spills these sorts of dark secrets without a disclaimer. Here’s mine: every mistake can be sugar-coated with all kinds of excuses or promises for the future, but sod it – we messed up, a lot. We’ve sort of stopped doing that, but not completely. The thing is that stuff goes wrong everywhere and the only difference is the amount of makeup and hot air that follows to cover it up. I guess it’s proportional to company size.

Learning hell

You might have new code running for the first time or you might have reconfigured something, but it’s all the same, and everybody dealing with critical systems will recognise the feeling of instant maximum angst, cold sweat and panic when you realise things blew up. It’s not pretty, and sometimes you see an IT veteran’s eyes glaze over into a 1000-yard stare, ending with a small shudder. That’s them remembering old mistakes.

Hopefully you learn and improve from such mistakes, and if you don’t, someone should fire you! In the name of transparency – here’s what we learned some 2000 days ago (thank god those days are over!).

The month

Day 1

One of the Oracle database nodes starts acting up and complaints are coming in about the site being slow. The node is restarted and all is well. Apparently a backup was running right in the middle of trading hours.

Day 3

One server fills up the disks and the site goes down. Everybody’s screaming.
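A full disk is the kind of thing a dumb little check can warn about before everybody starts screaming. Here is a minimal sketch, assuming a plain usage threshold of 90% on one path; the function names and the threshold are made up for illustration and are not what we actually ran:

```python
import shutil


def disk_usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_disk_nearly_full(path: str, threshold: float = 0.9) -> bool:
    """True when usage on `path` is at or above `threshold` (hypothetical default)."""
    return disk_usage_fraction(path) >= threshold


if __name__ == "__main__":
    # "." is only an example; a real check would look at the log and data volumes.
    if is_disk_nearly_full("."):
        print("WARNING: disk is nearly full")
```

Run from a scheduler every few minutes, something like this buys time to clean up before the site goes down.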

Day 6

The database pools are tuned to handle more load, which blows up the Oracle cache, which in turn messes up the memory. Site down.

Day 7

Price feeds are delayed and it seems that the German price feed is silent. The full disk killed them four days ago. The feeds are manually restarted and all is well.

Day 8

Wintrade users are experiencing delays. US stock markets take a dive in the evening.

Day 9

Wintrade delays again. It’s optimized in pure panic, but it doesn’t help. The disk gets full again and the site takes a dive.

Day 12

A lot of servers are relocated in the server room and one database master node gets trashed. A new one is built.

Day 13

Trading in Germany is offline for some time due to an OS configuration error.

Day 14

An application server has a faulty clock which just keeps drifting.

Day 20

In the evening all application servers lose their database connections and the site stutters. Restarts are needed.

Day 21

Someone is upgrading the BIOS on some Solaris machines and by accident reboots the whole trading system – in the middle of trading. Not brilliant. A few moments later German trading breaks down. Apparently our connection provider keeled over because we sent too many orders at once.
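Sending too many orders at once is what outbound throttling is for. A minimal sketch of a token bucket follows, assuming the provider tolerates a steady rate plus a small burst; the class name, the rate and the burst size are invented for illustration, not the provider's real limits or our actual fix:

```python
import time


class TokenBucket:
    """Allow `burst` sends at once, then refill at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=2.0, burst=3)  # hypothetical limits
    for order_id in range(5):
        action = "send" if bucket.try_acquire() else "hold back"
        print(f"order {order_id}: {action}")
```

Orders that are held back would be queued and retried, so a restart that flushes a backlog trickles out instead of hitting the provider all at once.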

Day 23

One Oracle database node blows up but reboots in a couple of minutes. During the night a timestamp conversion results in an overflow and a bunch of trades are not visible to customers.
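I won't pretend to reproduce the exact conversion that bit us, but the general shape of the bug is easy to show. A minimal sketch, assuming a timestamp gets squeezed into a signed 32-bit field; the helper names are made up for illustration:

```python
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def wrap_to_int32(value: int) -> int:
    """Mimic an unchecked store into a signed 32-bit field (two's complement wrap)."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


def to_int32_checked(value: int) -> int:
    """Refuse values that do not fit, instead of silently wrapping."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in a signed 32-bit integer")
    return value


if __name__ == "__main__":
    just_too_big = INT32_MAX + 1
    # The unchecked version turns a large positive timestamp into a negative one.
    print(wrap_to_int32(just_too_big))
    try:
        to_int32_checked(just_too_big)
    except OverflowError as err:
        print("rejected:", err)
```

A wrapped timestamp lands far in the past, which is one plausible way for trades to drop out of any view that filters on time. The checked version fails loudly instead.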

Day 27

Just before the opening call one database cluster stops accepting new connections and the whole site stalls in seconds. An application server reboot revives the site. The funny thing, in a sad way, was that another cluster also had problems but the monitoring was configured to check the wrong cluster. Luckily both were having problems, so the alarms worked anyway. Jesus.

Day 28

Another database master node goes down and everything stops working. This time because someone accidentally shut it down in the server room. Later a bug in the trading systems halts order handling and a reset is needed.

Day 31

Wintrade delays again. This time severe. A desperate restart of price feeds is done but that only overloads the trading system and everything needs a restart. Downtime. People work through the night and find big performance hogs, which are repaired.


My god… All this in one month. Every incident is a gut-wrenching moment for the people involved, and I can’t believe we got any development done with all that chaos, but we did! And quite a lot of it. You live and you learn!

//Tommi Lahdenperä, Head of Core Development


12 comments on "2000 days ago we messed up – a diary"

Karl
Guest

What was the main policy conclusion (“lesson”) from all the problems that month? Some errors seem avoidable with more automation in deployment and maintenance (human factor), but others are clearly the result of external factors that will probably remain hard to control. Introduce more circuit breakers and decoupling?

drd
Guest

“Introduce more circuit breakers and decoupling?”
Amen. Decoupling is the word of god.

Tommi
Guest

Very true. A structured workflow that includes quality control also helps, but there is a risk that you overcompensate for previous chaos and make releases way too complicated. Which of course we did.

Automation is key. It can also help reduce human error.

Max
Guest

I watched large parts of it on the web. I can't be bothered to write a big report, but one of the interesting parts was the questioning of Börje Ekholm and Sven Hagströmer, who in some respects had different views on investment companies. I did get the impression that they have great respect for each other. The most fun was the incomparable GW, who has apparently been saving in shares since he was 15. He spoke warmly of Haldex and Holmen. Avanza is supposed to put it on the web; I don't know if they have done so yet. By the way, IKEA? Can you save there? Or perhaps you mean Ikano Bank… :-)

drd
Guest

lol. sweet memories.

Bijou
Guest

At the risk of shocking people, I rather agree with Marcaggi on this one. X3 did not have the class of the first two, but as a film remade in a hurry (without Singer) that had to conclude a trilogy with a monumental brief, it holds up rather well. The fans are not treated with contempt, as I have read elsewhere. The Phoenix/Xavier fight exists in the comics but does not end the same way. Scott went off to join Superman, which limited his role in this one. Finally, we get the first X-Men fight that looks like an X-Men fight.…
