Thou Shalt never IPL, Ever!

While in Dallas, one of my assignments was to monitor the software health of the Sprint, Inc. data center. The company was headquartered in Kansas City, MO, but its main data systems center was in the Las Colinas ("the hills") section of Irving, TX, near Dallas. In a previous story, I described an assignment that Dick VanMeter and I had (see the story "Oh, I get it…!").

For the class on operational stability that Dick was going to teach, the customer had asked that it be given after normal working hours, so he was going to teach from 6:00 PM until midnight on each of the five days of the class. The customer also asked that I be there, along with a representative from the technical side of the IBM sales team. The customer's people had been nervous about using the live system to practice recovery, but had accepted our assurances that this program would not cause any harm.

On the first night, Dick gave a broad overview of what recovery meant and how the operators had many tools at their disposal to keep their systems up no matter what. He explained that he would show them the tools they had and how and when to use them. The nightmare of any running data system is being forced to IPL (Initial Program Load) it. An IPL basically means giving up on correcting a system that has gone haywire and restarting it from ground zero. This can result in lost data (though this was rare, thanks to the checkpoints that almost all major programs used to re-sync themselves), but the main problem is lost time. Computer minutes on a mainframe are expensive, and each one lost can put a customer hopelessly behind for hours, days or even weeks. An IPL can take an hour to complete, and the customer's network is down for all of that time.

So he summed up the lecture portion of the first lesson with the statement he had already repeated at least five or six times during the night: "One thing we do not want to do to fix a problem is IPL. Thou shalt not IPL, ever!"

On the third night, we were doing simulations of console errors. These were a common problem, caused by an unattended console or by a program that monopolized the console's message queue. Since computers work very rapidly, thousands of messages could be sent to a console (usually an actual hardware CRT, a TV-monitor type of device). Recognizing such a condition was essential to keep the system from stalling or to free it from such a stall. While we were doing the exercises, we found that one of the physical consoles was not online, so we asked the operator to vary it online. It immediately issued a software abnormal end code; that is, the software used to connect it to the system had ended. I asked the operator if he had noticed this before, and he said that they had been having problems with it all day, that it had been physically varied offline, and that IBM hardware support had been called. Dick said this would be a good test, since it wasn't possible for this to be a hardware problem and the hardware IBM'er would not be able to repair it. He told them that they could help a system engineer diagnose the problem by gathering data. So the solution was to vary the machine online and restart the software controlling it.

The console immediately tried to report its error, and the controlling software tried to recover it and abnormally ended (we called it an ABEND), over and over. This particular mainframe computer actually had eight engines; that is, eight computations were going on at the same time, and it logically looked like eight computers instead of one. Logical computer one went into a wait state due to some unknown tight loop. We followed the instructions to recover it, which put the work on logical computer two, and when we did, it went into a wait state as well. In fact, soon we had all eight in tight loops, and the system gave the wait state error code for a hard wait. These are always severe, but Dick told them, "Let's look at the message and the diagnostic help for the code being shown to see how to recover the system." He was still confident that there were no problems that we could not recover from.

Reading from the book on what to do in this situation, the operator read: "This is an extremely severe wait state and can not be recovered, the system must be IPL'd!!" Dick was surely red-faced, and it meant that I was going to have to come in very early the next day to figure out what had happened, or else the customer was going to cancel the class. In fact, the customer's software group manager informed me that I was going to have to provide an explanation of what happened and what we did wrong. This would be passed on to headquarters in Kansas City for further review. (Not too much pressure!)

The sales team technical guy and I arrived there about 6 AM, and between us we discovered that, under the covers, another program was trying to capture and record the errors, but it was itself ABENDing and then trying to capture its own error, at which point it ABEND'd again and again. On further inspection, we could see that the code was from a third party and was hooked into our (IBM) code to provide more in-depth detail on system operations. We could also see that the code was branching into an area that contained no code, which caused the error.

The other guy called the company, pretending to be a Sprint representative, and asked them if they knew anything about such a problem. They said they did, and that they had sent an alert to Sprint on the danger of a hardware error causing such a problem. The alert was sitting on the desk of the software group manager and had been for several weeks. I wonder what he told headquarters in Kansas City, since obviously I didn't have to!

I was able to call Dick by 11:00 AM and let him know that the class was still on and that the problem had been the customer's fault. (Of course, if we had let the offending console stay offline, it wouldn't have happened, but the course was designed to fix problems, not ignore them!) As a final word on this, of course, I never kidded Dick about his ill-advised statements about IPL. Not more than several dozen times, that is!

Davdan @ 2008-2018