You're a Software guy, what do you know...


My early career with IBM was as a Program Support Representative (PSR), whose duties were to identify software failures in IBM mainframe operating systems and subsystems. In the early days it also meant finding workarounds to keep the customer's system up and running while final fixes were developed or found.

My first major customer was Insurance Company of America, now absorbed into the Travelers umbrella of companies (pun intended). They were one of the Philadelphia area's largest customers and provided more than half the revenue for our NJ service office. The point is that they got their way, and we were grateful to oblige.

One morning in my first year at the account, I arrived at work to find everyone in a tizzy because the main processing system had gone into a hard wait (the system had stopped because the hardware and/or software could not resolve a problem). I got the documentation and, looking at the traces, saw that one of the main disk drives had issued a condition code 3 (CC3). A CC3 means the drive responded, essentially, with "I'm not here" or "I can't send the data." Since the drive contained essential system data, the system decided to give up.

There was no doubt that this was a hardware error, since software cannot issue CC alerts, only report them. Software cannot put the machine in a wait either, but it can recommend one based on hardware conditions. So I told the hardware team leader that this was a hardware problem. He balked, stating that many disk drives were reporting CC3s off and on and that the errors were intermittent. In his mind it couldn't be hardware; otherwise, why was it occurring on multiple disk drives? I asked if his team could check anyway and was completely rebuffed.

Since the main drive in question was the paging device, I decided to start with the software manager program that controlled paging. A bit of explanation here: the paging disk contained the virtual pages that were rolled in and out of storage as needed. This allowed for the illusion that multiple programs were running at the same time. It is easy to see why the system would shut down if this drive were unavailable.

I called the support system and talked to a guy who specialized in the paging code. His first words were that this had to be a hardware error. I acknowledged that it was, but said that the hardware reps refused to believe me. He gave me some hints as to what part of the code I might use to prove to them that I was right. I could see in the code that the program would try multiple times to get the page from the drive before it gave up and had the system issue a hard wait. Since the problem was intermittent (sometimes the I/O command would be successful), I decided to build an outer loop that would not go to the hard wait, but would retry the inner loop multiple times. I also built in a routine that would record the address of the drive that got a CC3 in a hardware address space (called a control register). I could then look at the register from the console and see which drives were sending the CC3.
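For readers who think in code, the idea behind the patch can be sketched roughly as follows. This is not the original mainframe code, only a small Python illustration under stated assumptions: `start_io`, `read_page`, `flaky_device`, `HardWait`, the retry counts, and the one-element list standing in for the control register are all hypothetical names invented for this sketch.

```python
CC_OK = 0                # I/O started successfully
CC_NOT_OPERATIONAL = 3   # CC3: the device did not respond


class HardWait(Exception):
    """Stands in for the system giving up and entering a hard wait."""


def read_page(start_io, device, inner_retries=3, outer_retries=5, last_cc3=None):
    """Try to read a page, retrying the whole inner loop before giving up.

    start_io(device) is a hypothetical stand-in for the real I/O call and
    returns a condition code. last_cc3 is a one-element list that plays
    the role of the control register used to record the failing device.
    """
    for _ in range(outer_retries):        # the added outer loop
        for _ in range(inner_retries):    # the retry logic that already existed
            cc = start_io(device)
            if cc == CC_OK:
                return True
            if cc == CC_NOT_OPERATIONAL and last_cc3 is not None:
                last_cc3[0] = device      # remember which device sent the CC3
    # Only after every outer retry has failed does the system give up.
    raise HardWait(f"device {device:#05x} unavailable")


def flaky_device(fail_count):
    """Build a fake start_io that returns CC3 fail_count times, then succeeds."""
    remaining = [fail_count]

    def start_io(device):
        if remaining[0] > 0:
            remaining[0] -= 1
            return CC_NOT_OPERATIONAL
        return CC_OK

    return start_io
```

With a device that fails four times in a row, the inner loop alone (three tries) would have given up, while the outer loop gets the page on the next pass and leaves the device address behind for inspection:

```python
register = [None]
read_page(flaky_device(4), 0x191, last_cc3=register)
print(hex(register[0]))
```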

For the next several days, I tried to convince the hardware CEs (Customer Engineers) that the problem had to be hardware, and their team leader told me, "You're a software guy, what do you know about hardware?" I liked Lenny, but he had a strong bias that hardware was the most important thing, and because I had never been a CE, I had little credibility with him. My patch was successfully keeping the system up, and I could see in the register that the problem was continuing. Since other devices were also sending the CC3 code, other programs were failing sporadically, but most issued retries that were successful.

After a week, I had a three-day class in Washington, DC and would not return to the account until the following Thursday. While I was gone, the system finally had so many failures that on Tuesday of that week it crashed hard, and the customer demanded a resolution. My backup was Harry Barth, who was well respected as a good PSR and had been a CE at one time. He was told that I had failed to solve the problem, so he looked into what was going on and what I had done. He checked out my traces and told Lenny that this was surely a hardware problem and that I had been saving their butts for two weeks. Lenny was skeptical; he kept pointing out that multiple devices were failing the same way, which to him was not logical. Because Harry knew how to follow hardware logic manuals, he decided to look at the hardware logic that controlled the condition codes. Lo and behold, on that mainframe all I/O commands came through a single hardware gate, and if that gate was bad, the problem we were having would be the result.

The CEs tested that hardware module, and it was indeed faulty. When the customer asked how the problem had been resolved, Harry made sure they knew that my creative patch had kept them in business for two weeks while we discovered where the hardware error was. We had to present a positive, cohesive face to the customer, so we didn't tell them that the problem was a stubborn CE who had little faith in my ability to solve a problem.

When I got back, Harry told me all that had transpired. Do you think I ever got an apology or a "well done" from Lenny or his manager? Of course not, but all the customer's software engineers were duly impressed, and they were my buddies from then on. I had the account for six years, mainly because they didn't want me assigned anywhere else.

Lenny never learned his lesson. Several years later I had another instance where it was obvious that the problem was hardware, and he refused to look at it. I had even found the memory location that was faulty, and he would not check it. Luckily, this time the customer's software manager overheard him and explained that he would indeed fix it, since it was evident that it was hardware. Of course, I’m still waiting for an apology on that one too.

Davdan @ 2008-2018