After a few days of uptime, errors showed up - to all appearances randomly. The same user, performing the same action, on the same data, just a few seconds apart, would fail or succeed without any seeming difference. The pattern of successes and failures looked to eliminate any single point of commonality. Nearly every page served by the application was vulnerable, but none were consistent.
Worst possible scenario
These 'random' issues are the worst possible scenario - repeatable things can be tracked, fixed and tested with a high degree of confidence. Without reliable processes to introduce the behavior, you can never really be sure you've found the root cause of the problem, never mind fixed it.
So I'm holding this black box, and clients and my girlfriend are asking how long it will take me to complete the puzzle inside. I can't see inside it - I don't have any details on what awaits me. It could be those 5-piece puzzles you give infants. It could be a 500-piece skyline of Paris. It could be a 5000-piece surrealist painting without a picture on the box. And that was my dilemma: debugging a complex system behaving in ways I didn't understand and trying to estimate how long it would take to fix.
It separates good developers from great developers
That experience is hardly unique. Every day, developers, engineers and consultants are brought in to solve difficult problems in myriad environments with differing levels of control, access and information. And it is these situations where an often overlooked skill, debugging, separates a good developer from a great one. Most professional developers can solve the puzzles once they know what is broken, how it is failing, and what the intended behavior is. Gauging effort and implementing a solution may be time-consuming, but it is ultimately finite. Getting to the puzzle is the hard (and to me, fun) part of this whole process.
Back to my dinner-delaying dilemma, I did the only thing I could - buy time, cross my fingers and get to work. Thankfully, after an hour or so of stepping through logs from both our production and testing environments, and multiple iterations of write/run/test on my local machines, I managed to construct a scenario that would consistently cause the application to behave erratically. Getting it into that state was the key to unlocking the black box, however complicated my multi-user, multi-step was. Once done, I quickly surveyed the underlying system and scoped out the issue.
Armed with that knowledge, I picked up the phone, changed my reservations and confirmed the new time with my girlfriend. It was still a few hours away, but that was the point; it wasn't an indeterminate amount of time away. It wasn't forever away. It was three hours. I arrived to dinner on time, exhausted but accomplished.