Windows 8.1 Repaired

Microsoft fixes its touch interface bug reported previously.

About a month ago, I posted about a bug deep in the Windows 8.1 touch interface code, a problem which triggered an exception in our products.  I detailed the debugging process and verification that it was an error in Microsoft’s latest operating system.

Last week, Microsoft indirectly confirmed the bug by issuing an update, KB2919355, which fixes the reported problem.  With that update installed via Windows Update, our test system now works correctly with all of our unrevised games that had previously failed with a hardware (floating point) exception.  Of course, KB2919355 is a “cumulative update” composed of more than 100 different fixes so, without unwarranted experimentation, I cannot be certain which one addresses our issue, though KB2927066: Ichitaro crashes when you use a touch screen to enter text in Windows 8.1 seems to be the most likely candidate.  (The description is very similar.)

Unfortunately, we still need to update all of our Windows products, since we cannot rely on customers applying the update (and, ironically, it actually fails to install on one of our development systems).  The fix comes too late, after Windows 8.1 was distributed to the public at large and, also, after I spent many hours debugging the problem.  Still, it is better that it was acknowledged and fixed (than denied and ignored).

A series of updates for Goodsol products will commence shortly.

Debugging Windows 8.1

There is a bug deep in the Windows 8.1 touch interface code.

Over the past couple of weeks, I spent quite a lot of time trying to debug an issue that was causing crashes in our programs under Windows 8.1 and which, ultimately, turned out to be a bug in the touch interface code of the operating system.  The problem was not with our code; in fact, our tight programming practices actually outed Microsoft’s failure.

To see how we got to that point, read on…

Symptoms

Late last year, we got a single report of a crash in Pretty Good MahJongg under Windows, occurring only when the touch screen was being used; playing the game with mouse and keyboard worked fine.  Crashes in Pretty Good MahJongg are extremely rare, and this had all the earmarks of a device driver problem, so it was given fairly low priority.  Additionally, we did not have the necessary hardware (at that time) to reproduce the error.  Then we got a second report, and we knew something was amiss.

The first two bug reports were nearly identical, and the obvious commonality (and difference from our systems) was that both machines were running Windows 8.1 and, obviously, had touchscreens.  We had tested our products under Windows 8 when it was released, but had not tested them under Windows 8.1, nor with a touch interface (which, in an ideal world, should not crash programs in any event).  Something appeared to be happening with either Windows 8.1 or the touchscreen interface, or both.

Specifically, the error being reported, in all cases, was “Floating point underflow”, though the actual text for the error comes directly from our exception handler, not the system.

Diagnosis

Our products have a very good crash logging system, so when I got the crash dumps, I discovered that the crashes did not appear to happen in program code at all and, in fact, some of them showed only system routines in the stack dump, while others were essentially the same, but with our message loop (but no other program code) in the stack.

My immediate thought was that the problem could be a message collision, knowing that the touch interface added some new Windows messages.  This could mean that either a touch message was triggering program (e.g., animation) code at an unexpected time, perhaps prior to initialization, or vice versa, with a program message causing driver code to be executed improperly.  This could potentially explain both stack conditions, although I would expect to see our program code elsewhere, but that was never the case.

The bigger problem, at first, was that all of the PGMJ message processing code was identical to that in the Goodsol Solitaire Engine, which drives Goodsol Solitaire 101, Most Popular Solitaire, and FreeCell Plus, as well as the code in Action Solitaire, and that most of the message loop is actually contained in a common library shared by all of these products.  After initial confirmation that all reports were for PGMJ, this concern was finally resolved when crash reports began to escalate, and they expanded to include the whole range of products.  At least the new reports fulfilled my expectation.

Of course, to get to the bottom of the problem, I needed to be able to reproduce the error myself, so I ordered a Windows 8.1 tablet (Dell Venue 8 Pro) for testing.  Fortunately, this tablet displayed the error, and I was able to determine a little bit more about the issue.  The crash happened immediately upon the very first touch within the program, whether clicking a button or simply selecting an edit box, though navigating the program with the virtual keyboard worked…  that is, right up until the first (virtual mouse) touch. 🙁

I built a version of PGMJ that moved custom messages elsewhere in the numbering space, but that made no difference at all.  I tried a couple of other brute force experiments, but nothing altered the crash behavior one bit, so I set up remote debugging on the device and began to debug the program properly.  Unfortunately, the debugger saw the stack in exactly the same way as our exception handler, so every crash was deep in system code and if any program routine was in the stack, it was only the message loop.  Still, our program was definitely and consistently crashing, which meant something was different.  The one major advantage of proper debugging, though, is that I got full symbols, so I was able to determine that the actual crash was happening in ‘ninput.dll’.  But why?
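
As a rough illustration of the "numbering space" mentioned above, the following sketch classifies message numbers by the ranges documented for Windows messages.  The constants mirror the Windows SDK values but are redefined here so the sketch compiles without <windows.h>; the names kMsgAnimateOld and kMsgAnimateNew are hypothetical stand-ins, not the real message identifiers from our products.

```cpp
#include <cstdint>

// Values mirror the Windows SDK header (winuser.h); they are redefined
// here only so that the sketch builds without <windows.h>.
constexpr std::uint32_t kWmGesture       = 0x0119; // WM_GESTURE
constexpr std::uint32_t kWmGestureNotify = 0x011A; // WM_GESTURENOTIFY
constexpr std::uint32_t kWmTouch         = 0x0240; // WM_TOUCH
constexpr std::uint32_t kWmUser          = 0x0400; // WM_USER
constexpr std::uint32_t kWmApp           = 0x8000; // WM_APP

enum class MessageRange { System, WindowClassPrivate, AppPrivate, Registered, Reserved };

// Classify a message number by the documented numbering ranges.
constexpr MessageRange classify_message(std::uint32_t msg)
{
    return msg < kWmUser  ? MessageRange::System              // 0x0000-0x03FF
         : msg < kWmApp   ? MessageRange::WindowClassPrivate  // 0x0400-0x7FFF
         : msg < 0xC000u  ? MessageRange::AppPrivate          // 0x8000-0xBFFF
         : msg <= 0xFFFFu ? MessageRange::Registered          // 0xC000-0xFFFF
                          : MessageRange::Reserved;           // above 0xFFFF
}

// Hypothetical custom message, before and after being moved.
constexpr std::uint32_t kMsgAnimateOld = kWmUser + 1;
constexpr std::uint32_t kMsgAnimateNew = kWmApp + 1;
```

The touch and gesture messages all fall in the system range, while a custom message lives in one of the private ranges wherever it is moved, which is at least consistent with renumbering having made no difference.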

Here you may imagine days of various attempts at debugging the root cause of the crashes, including “handling” certain messages rather than calling DefWindowProc(), doing the opposite and not processing any messages, and setting breakpoints all over the place and, mostly, being disappointed at how few triggered.  I finally narrowed down the issue to happening from a DialogProc() function within the common library when the (new) WM_GESTURENOTIFY message was posted.  That message is the result of the default processing of a WM_GESTURE message, so presumably handling that in some way would prevent the crash.  No dice.  There is a strange documentation conflict when WM_GESTURENOTIFY is sent to a dialog box, since, “This message should always be bubbled up using the DefWindowProc function.”  However, regarding DialogProc, “Although the dialog box procedure is similar to a window procedure, it must not call the DefWindowProc function to process unwanted messages.”  This gave me a bit of a combinatorial problem, too, but nothing seemed to have any effect on the crash.

Finally, frustrated, I regressed to pure shotgunning of the problem.  I knew that not all programs crashed when touched under Windows 8.1, but (all of) ours consistently did, albeit not in our code.  I added an early message box to demonstrate crashing before any of the dialog boxes or other interface features were shown, and then I began removing pieces of the initialization code.  Voila!  The issue revealed itself!

After removing some of the very first initialization code, executed prior to almost anything else being done, and seemingly entirely unrelated to interface code, the crashes disappeared (though, of course, the program no longer worked).  Methodically reducing the amount of code removed, I was able to determine that the crashes were triggered (but not caused) by three simple lines of code in the exception handler initialization.

Problem

Ultimately, the crash problem was a result of the following C++ code:

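    // Read the current floating point control word without changing it.
    unsigned flags = _controlfp ( 0, 0 );
    // Clear the mask bits for these exceptions, which unmasks (enables) them.
    flags &= ~( _EM_INVALID | _EM_DENORMAL | _EM_ZERODIVIDE | _EM_OVERFLOW | _EM_UNDERFLOW );
    // Write back only the exception mask bits.
    (void)_controlfp ( flags, MCW_EM );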

This, very simply, enables floating point exceptions within the program, including the (now problematic) _EM_UNDERFLOW exception.  The purpose was to provide maximum checking for errors in our code, which is usually so clean that it squeaks.  We never imagined that it would catch errors in a released operating system.  For reference, the above code has been shipping for more than 9 years, to many thousands of customers (and potential customers), and never had any problem before Windows 8.1 arrived.
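
To make the underflow condition concrete, here is a minimal, portable sketch using the standard <cfenv> status flags instead of the MSVC-specific _controlfp call.  It is an illustration only, not shipping code: the helper names raises_underflow and smallest_normal are hypothetical, and rather than unmasking the hardware exception, the sketch merely checks whether an operation sets the underflow status flag.  That is the same condition that becomes a trap once _EM_UNDERFLOW is unmasked.

```cpp
#include <cfenv>
#include <limits>

// Hypothetical helper: reports whether multiplying a by b sets the
// floating point underflow status flag.  With exceptions masked (the
// default), the flag is set silently; with underflow unmasked, the same
// operation would raise a hardware exception instead.
bool raises_underflow(double a, double b)
{
    // volatile keeps the compiler from folding the multiply at compile time.
    volatile double x = a;
    volatile double y = b;

    std::feclearexcept(FE_ALL_EXCEPT);   // start from clean status flags
    volatile double product = x * y;     // may underflow at run time
    (void)product;

    return std::fetestexcept(FE_UNDERFLOW) != 0;
}

// Smallest positive normal double; its square is far too small to represent.
inline double smallest_normal()
{
    return std::numeric_limits<double>::min();
}
```

On typical IEEE 754 hardware, raises_underflow(smallest_normal(), smallest_normal()) should report an underflow, while an exact product such as raises_underflow(2.0, 3.0) should not.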

To be perfectly clear, the actual bug is in Windows 8.1, specifically within ‘ninput.dll’.  There is an error in that module, creating a floating point underflow exception, compounded by reliance on a particular floating point state, namely that the hardware underflow exception is (and remains) disabled.  This is a flaw in the operating system, even though most programs, which run with the default floating point state, do not display symptoms.

Solution

Our solution, of course, is to remove the above code, a workaround that avoids triggering the crashes.  The tradeoff is that our programs will no longer be quite as robust in detecting floating point errors, but as stated above, this checking has been in place for almost a decade without finding any problems in our code, so it should be fairly safe to remove at this point.

Note that removing these three lines of code is actually more than needs to be done to resolve the immediate problem (i.e., the underflow error exception), but enabling the other exceptions still provides additional places for the operating system to fail, perhaps even further along in the same processing path.  The fundamental problem is that Microsoft counted on the default floating point state (and never tested otherwise) for its latest touch interface code, so it is safest for us to simply revert to using the default state as well.
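
For comparison, standard C++ also offers a way to return explicitly to the default floating point environment, sketched below.  This is not what our fix does (we simply removed the code above), and the helper name reset_fp_environment_to_default is hypothetical.  It also rests on an assumption: that the implementation's default environment has the exceptions masked, as it does at program startup on the platforms I am aware of.

```cpp
#include <cfenv>

// Hypothetical helper: restore the floating point environment to the
// program-startup default (status flags cleared, round-to-nearest).
// Returns true if the environment was restored successfully.
bool reset_fp_environment_to_default()
{
    return std::fesetenv(FE_DFL_ENV) == 0;
}
```

After a call to this helper, any previously raised status flags are cleared and the rounding mode is back to round-to-nearest.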

Verification

It is not enough to simply come up with a solution; that solution must be verified.  We approached this issue in two different ways.

First, I built a new beta version of Pretty Good MahJongg with the above solution applied, and that version was provided to as many of the PGMJ customers who reported a problem as feasible.  Every single one (who reported back) confirmed that the crashes were gone.

Second, we bought a brand new Ultrabook laptop with a touchscreen for testing on a different device.  The laptop shipped with Windows 8 (not 8.1), so it was perfect for conducting our verification tests.

I installed the shipping version of PGMJ (Pretty Good MahJongg 2.41) using nothing but the touch interface, and everything worked fine.  We tested several games and had no problems at all (n.b., under Windows 8).  Then, I upgraded the laptop to Windows 8.1 and confirmed that the crash detailed above happened in exactly the same manner and place when using the touchscreen, but the game was perfectly playable with the mouse and keyboard (until one forgot and touched the screen 🙂 ).  Finally, I installed the beta version of PGMJ with the workaround, and everything worked again; in fact, this is a great way to play the game, especially for a title designed without touchscreens in mind.

Given that we verified the solution using two different and separate processes, we are confident that the issue is resolved.  Indeed, Pretty Good MahJongg 2.5 will be released on March 25, so look for it, still the very best tile matching games available for Windows.

For those who know some of my background, the score now stands as follows:
Gregg Seelhoff 3 – Microsoft 0

Seeking a few great Beta Testers

We need people to playtest our arcade/puzzle game.

Today, Digital Gamecraft is making an open call for iOS beta testers to help us test Demolish! Pairs in preparation for its upcoming release on the Apple App Store.

Anybody with an iPad, iPhone, or iPod touch is eligible to join our team and get early access to this fun game, while helping us make it as good and solid as possible.  All you have to do is play the game (and then tell us about it 🙂 ).

For more information, and to sign up, see our call for iOS beta testers on the Demolish! Pairs site.

ISVCon 2012: Success!

This conference reboot was the best in years.

You shoulda been there!

We have returned safely from ISVCon 2012, which was presented last week in Reno, Nevada [USA], with a mixture of physical exhaustion and mental exhilaration, as is often the case with great conferences.  ISVCon was a relaunch of the old Software Industry Conference, and the consensus was that this was the most beneficial event in several years.  The content was geared towards microISVs (Independent Software Vendors), software companies with just a few people (often, only one person), and the networking/socializing was with others who are facing the same challenges (as well as those who provide services to help).

The main question: Why were you not there?

Before our departure for Reno, I added the Twitter box [edit: formerly] on the right of this blog, and I was “live tweeting” as much as possible throughout the conference, as well as during our journey (and quasi-vacation).  If you follow my personal account at @GreggSeelhoff, you can still see the updates, as well as more going forward.

In the coming days, I will review the highlights of the conference, and I have it on good authority that the Association of Software Professionals (new conference owners) will be making some or all of the session videos publicly available for viewing.

Prior to all that, however, I must give a HUGE shout out to Susan Pichotta of Alta Web Works, who deserves most of the credit for bringing this fantastic 3.0 version of the long-running conference together, and without whom ISVCon would never have happened.  Plans are already in the works for next year, and I really look forward to being there in 2013.

URGENT: ISVCon 2012 is almost here!

Register NOW and save with our discount code.

ISVCon 2012 takes place July 13-15, which is only a couple weeks (!) away.  ISVCon is the spiritual successor to (or, in entertainment terms, reboot of) SIC, the Software Industry Conference, which I have attended numerous times, and which has always been a great investment.  This conference brings together scores of independent software publishers (or “vendors”, hence ISV) to discuss and learn about the industry.  It is a unique opportunity to meet face-to-face with many other people who share similar business challenges; I now call lots of them “friends”.

ISVCon will be taking place in Reno, Nevada (USA) at the Atlantis Casino Resort.

Here is the catch: Time is running out!

Step 1: Register (at a discount)

First, register for ISVCon before the prices go up.  As an incentive, we at Digital Gamecraft can offer you this 10% discount code: “Gamecraft2012”.  Limited time only; prices increase July 1st.

Step 2: Get your hotel room (at a discount)

Next, make your hotel reservations now (using that link) to receive discount pricing and no resort fee.  Offer ends in only a couple of days!

Step 3: Attend ISVCon 2012

Join us in Reno for the conference.  We will be arriving before the Welcome Reception on Thursday evening, during which we will be able to have a drink or two, socialize with friends and colleagues (both long lost and brand new), and switch from travel mode into conference mode.

The conference sessions take place Friday, July 13, through Sunday, July 15, and specifics can be found on this complete conference schedule.  Note that the Friday sessions are Power Sessions, while the Saturday and Sunday sessions provide a couple of options for each timeslot.  There is so much content at ISVCon that we are sending most of the staff (okay, just two of us) to make sure that we can have full coverage of the relevant topics.  Additionally, the networking value and information exchange between (and sometimes during) sessions is possibly even more valuable than the speakers.

That said, let me draw your attention particularly to Paradise Room A on Saturday from 1:45pm to 2:45pm, for my presentation, Quality Assurance for Small Software Publishers, and on Sunday from 9:00am to 10:00am, where I will serve on a panel of game developers for the session, How Games are Different.  The answer to your question is: I will be there and awake at 9am because, with the time difference, that will be noon back home.  (Also, I never work the B room.)

We will be there at the conference through the After Hours MeetUp on Sunday evening, before beginning our (more) lengthy journey back to the office.  From experience, this will involve an odd mixture of being physically spent, but mentally energized, full of plans and ideas.  Honestly, attending ISVCon 2012 is probably one of the best ways to spend a few days improving your business; I strongly recommend it for any ISV.

Follow me on Twitter @GreggSeelhoff for live conference updates.  See you there!