Working at a 24/7/365 data center and supporting critical production databases has side effects. One of them is the requirement to carry a phone with me almost all the time. I do make a few exceptions, though. On my days off, when eating lunch with my wife, I leave it in my truck. My wife and I have a rule: while we are spending time together, no phones are allowed for either one of us. It is a very good rule and one I would suggest everyone adopt. But that is not what this post is about, so let’s get back on point.
After some time I checked my phone and had 2 missed calls and a voicemail. I also had about 30 emails, all referencing the same subject. There were plans to install Windows patches, and there were concerns that some agent jobs might fail and not do what they are designed to do. First let me provide a little background; I will do my best to be brief.
The project in question has a program manager with “hands on” experience managing and programming with databases. All of it is in Oracle, but I don’t hold that against him. He is also not the problem. The problem is the non-technical build manager and the ripples in the pond she creates when she uses technical terms she thinks are correct, because she has heard them in the past and assumes they apply to the current situation. The other thing that bothers me is that others, System Administrators who should know better, allow the ripples to continue instead of stopping them.
Um, It’s on a Cluster
Moving on, the project database resides on a multi-instance node with a dedicated failover node. This is commonly referred to as a cluster. When the build manager created the change order, she stated in the instructions, and later in an email, that there was a very important replication job (it is not a replication job, but that is what she calls it no matter what I tell her) scheduled to run every Friday evening at 8:00 PM. She went on to say all the work on the cluster needed to be complete before 8 PM so the instance would be on node “A”, because that is where the job runs from. Here is where the SAs, or at least one of the three providing comments on the email thread, should have stopped it right then and there. One even commented, “Maybe we should make sure all of the drives are on node ‘A’.”
I did call back the SA who had called me and confirmed that the job would run regardless of which node the instance was on. The job belongs to the instance, not to a node, so it goes wherever the instance goes. His reply: “I thought so but wasn’t sure.”
Do not get me wrong, it is always a good idea to confirm anything you are unsure of, especially if it relates to a critical production system on a project you are unfamiliar with. But for goodness' sake, do not let non-technical people drive the conversation or make you question something you already know to be correct. It is a never-ending process, but you should do your best to educate the non-technical and technical alike. Then maybe, once in a while on your day off, you will be able to leave your phone in your truck when you eat lunch.