Debugging Windows 8.1

There is bug deep in the Windows 8.1 touch interface code.

Over the past couple of weeks, I spent quite a lot of time trying to debug an issue that was causing crashes in our programs under Windows 8.1 and which, ultimately, turned out to be a bug in the touch interface code of the operating system.  The problem was not with our code; in fact, our tight programming practices actually outed Microsoft’s failure.

To see how we got to that point, read on…

Symptoms

Late last year, we got a single report of a crash in Pretty Good MahJongg under Windows, occurring only when the touch screen was being used; playing the game with mouse and keyboard worked fine.  Crashes in Pretty Good MahJongg are extremely rare, and this had all the earmarks of a device driver problem, so it was given fairly low priority.  Additionally, we did not have the necessary hardware (at that time) to reproduce the error.  Then we got a second report, and we knew something was amiss.

The first two bug reports were nearly identical, and the obvious commonality (and difference from our systems) was that both machines were running Windows 8.1 and, obviously, had touchscreens.  We had tested our products under Windows 8 when it was released, but did not check Windows 8.1 nor using a touch interface (which, in an ideal world, should not crash programs in any event).  Something appeared to be happening with either Windows 8.1 or the touchscreen interface, or both.

Specifically, the error being reported, in all cases, was “Floating point underflow“, though that actual text for the error comes directly from our exception handler, not the system.

Diagnosis

Our products have a very good crash logging system, so when I got the crash dumps, I discovered that the crashes did not appear to happen in program code at all and, in fact, some of them showed only system routines in the stack dump, while others were essentially the same, but with our message loop (but no other program code) in the stack.

My immediate thought was that the problem could be a message collision, knowing that the touch interface added some new Windows messages.  This could mean that either a touch message was triggering program (e.g., animation) code at an unexpected time, perhaps prior to initialization, or vice versa, with a program message causing driver code to be executed improperly.  This could potentially explain both stack conditions, although I would expect to see our program code elsewhere, but that was never the case.

The bigger problem, at first, was that all of the PGMJ message processing code was identical to that in the Goodsol Solitaire Engine, which drives Goodsol Solitaire 101, Most Popular Solitaire, and FreeCell Plus, as well as the code in Action Solitaire, and that most of the message loop is actually contained in a common library shared by all of these products.  After initial confirmation that all reports were for PGMJ, this concern was finally resolved when crash reports began to escalate, and they expanded to include the whole range of products.  At least the new reports fulfilled my expectation.

Of course, to get to the bottom of the problem, I needed to be able to reproduce the error myself, so I ordered a Windows 8.1 tablet (Dell Venue 8 Pro) for testing.  Fortunately, this tablet displayed the error, and I was able to determine a little bit more about the issue.  The crash happened immediately upon the very first touch within the program, whether clicking a button or simply selecting an edit box, though navigating the program with the virtual keyboard worked…  that is, right up until the first (virtual mouse) touch. 🙁

I built a version of PGMJ that moved custom messages elsewhere in the numbering space, but that made no difference at all.  I tried a couple of other brute force experiments, but nothing altered the crash behavior one bit, so I set up remote debugging on the device and began to debug the program properly.  Unfortunately, the debugger saw the stack in exactly the same way as our exception handler, so every crash was deep in system code and if any program routine was in the stack, it was only the message loop.  Still, our program was definitely and consistently crashing, which meant something was different.  The one major advantage of proper debugging, though, is that I got full symbols, so I was able to determine that the actual crash was happening in ‘ninput.dll‘.  But why?

Here you may imagine days of various attempts at debugging the root cause of the crashes, including “handling” certain messages rather than calling DefWindowProc(), doing the opposite and not processing any messages, and setting breakpoints all over the place and, mostly, being disappointed at how few triggered.  I finally narrowed down the issue to happening from a DialogProc() function within the common library when the (new) WM_GESTURENOTIFY message was posted.  That message is the result of the default processing of a WM_GESTURE message, so presumably handling that in some way would prevent the crash.  No dice.  There is a strange documentation conflict when WM_GESTURENOTIFY is sent to a dialog box, since, “This message should always be bubbled up using the DefWindowProc function.”  However, regarding DialogProc, “Although the dialog box procedure is similar to a window procedure, it must not call the DefWindowProc function to process unwanted messages.”  This gave me a bit of a combinatorial problem, too, but nothing seemed to have any effect on the crash.

Finally, frustrated, I regressed to pure shotgunning of the problem.  I knew that not all programs crashed when touched under Windows 8.1, but (all of) ours consistently did, albeit not in our code.  I added an early message box to demonstrate crashing before any of the dialog boxes or other interface features were shown, and then I began removing pieces of the initialization code.  Voila!  The issue revealed itself!

After removing some of the very first initialization code, executed prior to almost anything else being done, and seemingly entirely unrelated to interface code, the crashes disappeared (though, of course, the program no longer worked).  Methodically reducing the amount of code removed, I was able to determine that the crashes were triggered (but not caused) by three simple lines of code in the exception handler initialization.

Problem

Ultimately, the crash problem was a result of the following C++ code:

    unsigned flags = _controlfp ( 0, 0 );
    flags &= ~( _EM_INVALID | _EM_DENORMAL | _EM_ZERODIVIDE | _EM_OVERFLOW | _EM_UNDERFLOW );
    (void)_controlfp ( flags, MCW_EM );

This, very simply, enables floating point exceptions within the program, including the (now problematic) _EM_UNDERFLOW exception.  The purpose was to provide maximum checking for errors in our code, which is usually so clean that it squeaks.  We never imagined that it would catch errors in a released operating system.  For reference, the above code has been shipping for more than 9 years, to many thousands of customers (and potential customers), and never had any problem before Windows 8.1 arrived.

To be perfectly clear, the actual bug is in Windows 8.1, specifically within ‘ninput.dll’.  There is an error in that module, creating a floating point underflow exception, compounded by reliance on a particular floating point state, namely that the hardware underflow exception is (and remains) disabled.  This is a flaw in the operating system, even though the default floating point state and, therefore, most programs do not display symptoms.

Solution

The actual solution, of course, is to remove the above code, which is a workaround to avoid triggering the crashes.  The tradeoff is that our programs will no longer be quite as robust in detecting floating point errors, but as stated above, this checking has been in place for almost a decade without finding any problems in our code, so it should be fairly safe to remove at this point.

Note that removing these three lines of code is actually more than needs to be done to resolve the immediate problem (i.e., the underflow error exception), but enabling the other exceptions still provides additional places for the operating system to fail, perhaps even further along in the same processing path.  The fundamental problem is that Microsoft counted on the default floating point state (and never tested otherwise) for its latest touch interface code, so it is safest for us to simply revert to using the default state as well.

Verification

It is not enough to simply come up with a solution; that solution must be verified.  We approached this issue in two different ways.

First, I built a new beta version of Pretty Good MahJongg with the above solution applied, and that version was provided to as many of the PGMJ customers who reported a problem as feasible.  Every single one (who reported back) confirmed that the crashes were gone.

Second, we bought a brand new Ultrabook laptop with a touchscreen for testing on a different device.  The laptop shipped with Windows 8 (not 8.1), so it was perfect for conducting our verification tests.

I installed the shipping version of PGMJ (Pretty Good MahJongg 2.41) using nothing but the touch interface, and everything worked fine.  We tested several games and had no problems at all (n.b., under Windows 8).  Then, I upgraded the laptop to Windows 8.1 and confirmed that the crash detailed above happened in exactly the same manner and place when using the touchscreen, but the game was perfectly playable with the mouse and keyboard (until one forgot and touched the screen 🙂 ).  Finally, I installed the beta version of PGMJ with the workaround, and everything worked again; in fact, this is a great way to play the game, especially for a title designed without touchscreens in mind.

Given that we verified the solution using two different and separate processes, we are confident that the issue is resolved.  Indeed, Pretty Good MahJongg 2.5 will be released on March 25, so look for it, still the very best tile matching games available for Windows.

For those who know some of my background, the score now stands as follows:
Gregg Seelhoff 3 – Microsoft 0