Intermittent problems

Intermittent problems are the worst kind (except for total disaster, obviously). I’d almost rather see a system not work at all than fail occasionally and randomly. If it’s completely down, the cause is generally fairly easy to spot, and when you see it start working again, you can be pretty sure it’s working for everybody. If it consistently fails for one person but works for everybody else, you can at least check that single account on our end, or walk the person through their settings on theirs (assuming you can get in touch with them, but that’s a completely separate topic), and most of the time you’ll find the cause. Likewise, if it only fails at a certain time of day, or from a certain building, or on a specific kind of device, then you can test it then, or there, or using that.

It’s when you can’t find a pattern that things get really annoying.  In order to troubleshoot something, you need to be able to see the trouble as it happens, but if you don’t have any way to do that, you’re going to have a hard time finding the answer.  You can’t know how many people are seeing the problem, or how often. Even if you can get it to fail where you can see it, when it starts working again does that mean you really fixed it?  Or did it just start working again randomly?  And did it fix the system for everybody else?

Even though most problems we deal with are pretty straightforward, it’s those rare intermittent problems that really stick in your mind (even if it’s only because you’ve banged your head against the wall for too long). I have to remind myself that they’re rare, because right now I’m working on three of them at once.

  1. When a former student without an account wants to get their transcript online, they can go to Account Lookup and it will tell the system to create a temporary account.  It shows a message explaining what’s happening, and asks the person to try again in ten minutes.  Usually, that second try at Account Lookup tells them the new account name, lets them set a password, and onward they go.
    Except right now, for a few people, it isn’t working. Account Lookup tells them that an account will be created in ten minutes, and then blithely forgets to tell the system to actually go and do that. When they come back in ten minutes, Account Lookup can’t find an account, so it gives them the “Wait ten minutes” message again, and then forgets to inform the system again, and around and around we go.
  2. A few people have reported that they get a “500 Server Error” when they click the Gmail button in the Portal. Until I can get in touch with one of them, I’m stuck, because it works fine for me no matter how I try. For all I know it’s just one of those once-in-a-blue-moon fluke problems that solve themselves. But I can’t afford to ignore it, because on the other hand the people who reported it might just be the tip of the iceberg. Going without email is not just an annoyance anymore.
  3. And finally, about ten people in the online faculty/staff directory are showing up without any contact information, not even email. At the start of the week they weren’t showing up at all, so this is progress of a sort, but a name without contact information is pretty much useless in a contact directory.
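
For what it’s worth, here is roughly how I picture the Account Lookup flow from item 1, as a small Python sketch. Everything in it is made up for illustration: the class, the method names, the `tmp-` account naming, and the idea of comparing "who saw the message" against "who was queued" are my assumptions, not the real system. The point is only that the “wait ten minutes” message should never go out unless the creation request was actually recorded, and that a list of people who saw the message can be checked against the queue.

```python
from dataclasses import dataclass, field


@dataclass
class AccountLookup:
    """Hypothetical stand-in for Account Lookup; not the real system."""
    accounts: dict = field(default_factory=dict)  # person id -> account name
    pending: set = field(default_factory=set)     # person ids queued for creation

    def request_creation(self, person_id):
        # Record the request first, so the message below is never a false promise.
        self.pending.add(person_id)

    def lookup(self, person_id):
        if person_id in self.accounts:
            return f"Your account name is {self.accounts[person_id]}."
        if person_id not in self.pending:
            self.request_creation(person_id)
        return "An account is being created. Please try again in ten minutes."

    def provision_pending(self):
        # Stands in for whatever job creates the temporary accounts.
        for person_id in sorted(self.pending):
            self.accounts[person_id] = f"tmp-{person_id}"
        self.pending.clear()


def never_queued(shown_message, lookup):
    """People who saw the wait message but are neither queued nor provisioned."""
    return sorted(
        p for p in shown_message
        if p not in lookup.pending and p not in lookup.accounts
    )
```

If something like `never_queued` ever returned a non-empty list for the real system, that would at least confirm that the message and the request are getting out of step, even if it wouldn’t say why.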

So yeah, interesting times. I need to stop talking about these things and start digging into them again.

Still waiting on Google Plus

So, it’s been a couple of weeks since we signed up to get Google+ activated on our Google Apps domain, but it’s still not working. We had to give them a bunch of information to prove we were a real, live university instead of, I guess, some sort of front for an evil conspiracy to give underage people access to Google+. Hopefully they just haven’t gotten to our application yet. I’d hate to think they just dropped it into the bit-bucket without telling us anything. Let’s see how much longer it takes…

Air Conditioning FAIL

On Saturday all three air conditioning units in the server room shut down, and the place rapidly turned into an oven. Our servers put out a lot of heat, and have to be kept cool to prevent Bad Things from happening… and so when the air handlers stopped, Bad Things started to happen.

Luckily, only a couple of servers had actual hardware damage, and those didn’t have anything critical on them. Several more servers shut down ungracefully or started behaving erratically. Our two biggest servers, cougar and sundown, never actually crashed, but since our main network infrastructure server did, nobody could reach either of them.

Since I live so close to campus, I got called in, but it was Paul Lambert and Dave Diemer who did most of the heavy lifting. Once the major problems were cleared away, then I could do my thing. Dave was still working on three servers until the next morning, and I was up until really late babysitting the webserver, which seemed to go catatonic every few minutes for no apparent reason. We’ll still be cleaning this up for a while.