No Start Menu for You
I tend to launch most programs on my Windows 10 laptop by typing the <Win> key, then a few letters of the program name, and then hitting enter. On my powerful laptop (SSD and 32 GB of RAM) this process usually takes as long as it takes me to type these characters, just a fraction of a second.
Usually.
Sometimes, however, it takes longer. A lot longer. As in, tens of seconds. The slowdowns are unpredictable but recently I was able to record an Event Tracing for Windows (ETW) trace of one of these delays. With a bit of help from folks on twitter I was able to analyze the trace and understand why it took about a minute to launch notepad.
Before I get into the analysis I have two warnings/disclaimers: 1) I have a good understanding of the problem, but I do not have a solution and 2) if you are seeing similar symptoms that doesn’t mean that your root cause is the same as mine, but I’ll give some hints on how to see if it is.
My analysis of the trace (trace is here, installer for the analysis tools is here, analysis tutorials are here, feel free to follow along) started with my looking at the input events and Window in Focus graph in Windows Performance Analyzer (WPA), both shown below (zoomed in a bit for maximum detail):
The first diamond in the Multi-Input row shows when I pressed the Windows key, with subsequent key presses (including pressing enter) clumped together shortly afterwards. The units on the x-axis are seconds so we can see that all of the typing took about 3/4 of a second.
The input events are injected into the trace by my UIforETW trace-recording tool. These input events are one of the reasons I prefer UIforETW over Microsoft’s trace-recording tools. I’ve hidden the table-view but it lists which keys were pressed, while anonymizing letters and numbers to prevent UIforETW from being a key logger. Input events can be a critical tool in helping know where/when to look in traces to understand what is happening.
The input events help establish context, but the Window in Focus events really tell the story. We can see that the SearchApp (start menu?) gains focus as soon as I press the <Win> key but then nothing else happens for more than 13 seconds. That’s the problem, visualized.
But why?
The next step is to see what’s causing the delay. A quick look at the CPU Usage (Precise) and Disk Usage graphs showed that the CPU and disk were almost 100% idle, so the start menu must be waiting on something else:
When a process is idle for a while when you wish that it was doing work then the challenge is to figure out what it was waiting on. I looked at the context-switch events in CPU Usage (Precise). Some of the SearchApp threads were named (yay!) but not all of them were and I couldn’t find the main thread to see what it was waiting on, so I had to poke around and hope something became obvious. I zoomed in on the burst of CPU activity just before notepad launched and I noticed that WerFault.exe and wermgr.exe both started getting busy. Correlation is not causation, but it sure is suspicious.
Note that WER stands for Windows Error Reporting, the system that sends crash dumps back to Microsoft for analysis so that software reliability can be improved.
Looking at the Processes table showed me that the command line for WerFault.exe was “C:\WINDOWS\system32\WerFault.exe -u -p 17804 -s 2124”. That suggests that Windows Error Reporting was being asked to record information for crashed process 17804, and when I looked in the Processes table for that Process ID (PID) I found “RuntimeBroker.exe <Microsoft.Windows.Search> (17804)”. Well now. Doesn’t that name look relevant?
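If you have many process command lines to sift through, pulling the PID out programmatically can save some squinting. Here is a minimal Python sketch, assuming (as the analysis above does) that the number following -p is the ID of the crashed process. The function name is made up for illustration, and the -u and -s arguments are deliberately left uninterpreted.

```python
import re
from typing import Optional


def crashed_pid_from_werfault(command_line: str) -> Optional[int]:
    """Return the number following -p in a WerFault.exe command line, or None.

    Assumption: -p carries the PID of the crashed process. The other
    switches (-u, -s) are ignored because their meaning is not established here.
    """
    match = re.search(r"(?:^|\s)-p\s+(\d+)(?=\s|$)", command_line)
    return int(match.group(1)) if match else None


cmd = r"C:\WINDOWS\system32\WerFault.exe -u -p 17804 -s 2124"
print(crashed_pid_from_werfault(cmd))  # 17804
```

The returned PID can then be matched against the Processes table to find which process WerFault.exe was servicing.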
A look at all of the “Transient” processes (those that started or ended during the trace timeline) was quite revealing:
WerFault.exe and RuntimeBroker.exe (17804) (the top of the two RuntimeBroker.exe processes) were both running when I started recording the trace and both ended at about the same time, and WerFault.exe was handling a crash in RuntimeBroker.exe. Notice also that a new copy of RuntimeBroker.exe starts running when the old copy goes away. Now we’re starting to have an explanation:
- RuntimeBroker.exe crashes
- WerFault.exe deals with the crash, keeping the RuntimeBroker.exe process open
- Then a new RuntimeBroker.exe launches and provides whatever it is that SearchApp.exe needed
Now we have a new question: why is WerFault.exe sitting idle for so long?
I looked at the CPU Usage (Precise) data and saw that WerFault.exe has at least 13 threads, none of them named (come on Microsoft, thread names are really helpful!) but the main thread was easily identifiable as the one using the most CPU time. I then sorted by Time Since Last and noticed that at one point the main thread had been waiting to run for 15.572 s. In fact it was probably waiting even longer, but the start of its wait was before the start of the trace and therefore unknowable. You can find more details on how to do idle-analysis here.
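The "sort by Time Since Last" step amounts to finding the context switch where a thread had been off the CPU the longest. The Python sketch below shows that idea on a handful of rows. The record layout and thread IDs are hypothetical, and every value except the 15.572 s wait is invented for illustration; this is not WPA's export format.

```python
from dataclasses import dataclass


@dataclass
class ContextSwitch:
    thread_id: int            # hypothetical ID of the thread being switched in
    time_since_last_s: float  # how long the thread had been off-CPU, in seconds


def longest_wait(switches):
    """Return the context switch with the largest Time Since Last."""
    return max(switches, key=lambda s: s.time_since_last_s)


# Invented sample rows, apart from the 15.572 s wait discussed above.
rows = [
    ContextSwitch(thread_id=1, time_since_last_s=0.004),
    ContextSwitch(thread_id=1, time_since_last_s=15.572),
    ContextSwitch(thread_id=2, time_since_last_s=0.850),
]

worst = longest_wait(rows)
print(worst.thread_id, worst.time_since_last_s)  # 1 15.572
```

In WPA the same result comes from sorting the Time Since Last column in descending order; the call stack on that row then shows what the thread was waiting on.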
The stack where the main WerFault.exe thread was waiting for 15.572 s is shown below:
The summary would be that it was waiting in UploadReport.
So now we understand the problem. RuntimeBroker.exe crashed (due to heap corruption, according to the call stack in the RuntimeBroker.exe crash dump, shown to the right) and it took more than 15 seconds to upload the crash dump, presumably due to my flaky hotel WiFi. During this time my start menu was inoperable.
This deserves reiterating. My start menu was hung due to the combination of heap corruption and WerFault.exe deciding that it needed to upload the crash dump before releasing the old process so that a new one could be started.
It took two bugs (the heap corruption and the upload-before-restarting) to make this hang happen, but happen it did.
We can even go deeper. The UploadReport function was blocked for 15.567 s and the Readying Process/Readying Thread Id shows us who eventually unblocked the function. That turned out to be another WerFault.exe thread which was blocked in some CHttpRequest functions, as shown above. That doesn’t add significantly to the understanding of the problem, but does demonstrate nicely how you can trace a hang backwards through multiple processes and threads.
Watching for this problem
In general if you want to understand why your computer is performing badly you need to record and analyze a trace. However if you want to see if you are hitting this particular problem then there are easier steps that you can follow.
The first step is to configure the local recording of crash dumps. This is a good idea in general because it lets you monitor the stability of your computer over time.
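One documented way to turn on local dump recording is through the LocalDumps registry values that Windows Error Reporting reads. I am assuming that is the mechanism meant here. The Python sketch below only prints the reg add commands and changes nothing. The helper name is hypothetical, and the commands themselves would need to be run from an elevated prompt. DumpFolder is left unset so that WER uses its default location.

```python
# Registry key that WER consults for local user-mode dump collection.
LOCAL_DUMPS_KEY = r"HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"


def local_dumps_commands(dump_type=1, dump_count=10):
    """Build (but do not run) reg add commands for WER local dumps.

    dump_type: 1 = minidump, 2 = full dump.
    dump_count: how many dumps to keep before the oldest is replaced.
    """
    values = {"DumpType": dump_type, "DumpCount": dump_count}
    return [
        f'reg add "{LOCAL_DUMPS_KEY}" /v {name} /t REG_DWORD /d {value} /f'
        for name, value in values.items()
    ]


for command in local_dumps_commands(dump_type=2):
    print(command)
```

Minidumps (DumpType 1) are small and usually enough to see the crashing call stack, while full dumps (DumpType 2) are larger but carry the whole process memory.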
Then, with crash dumps being recorded, if you see a Start Menu hang you can just look in %localappdata%\crashdumps and see if there is a recent RuntimeBroker.exe crash. If so then you are possibly seeing this bug.
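That check is easy to script. The Python sketch below lists recently modified RuntimeBroker.exe dump files in a folder. It assumes the dumps are .dmp files whose names begin with the process name, which is how locally recorded WER dumps are normally named. The function name and the 24-hour window are arbitrary choices for illustration.

```python
import os
import time
from pathlib import Path


def recent_dumps(dump_dir, process_name="RuntimeBroker.exe",
                 max_age_hours=24.0, now=None):
    """Return dump files for process_name modified within max_age_hours, newest first."""
    now = time.time() if now is None else now
    folder = Path(dump_dir)
    if not folder.is_dir():
        return []  # no dump folder means local dumps are probably not configured
    prefix = process_name.lower()
    hits = []
    for path in folder.iterdir():
        name = path.name.lower()
        if name.startswith(prefix) and name.endswith(".dmp"):
            mtime = path.stat().st_mtime
            if (now - mtime) / 3600 <= max_age_hours:
                hits.append((mtime, path))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [path for _, path in hits]


# Default WER local dump folder for the current user (empty result if absent).
default_dir = Path(os.environ.get("LOCALAPPDATA", "")) / "CrashDumps"
for dump in recent_dumps(default_dir):
    print(dump)
```

A dump with a timestamp close to the moment of the hang is the signal to look for. An empty result only means that no matching dump was recorded in that window.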
Waiting on upload
Raymond Chen gives several reasons for why Windows Error Reporting doesn’t restart crashed processes before uploading the report (circa 2012) but I don’t find these reasons entirely compelling, especially in the start-menu case. As long as you kill the old process before starting the new one (and answer the DLL-version questions from a crash dump) most of the problems he points out are avoidable. Exponential backoff on process restarts can address the rest. And, the consequences of waiting can, as we have seen, be arbitrarily long start-menu hangs, with no indication that a crash is the problem. There is also some confusion about the behavior; maybe the design has changed in the last ten years.
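To make the exponential backoff suggestion concrete, here is a minimal Python sketch of a restart-delay schedule. It is purely illustrative and does not describe how Windows Error Reporting behaves. The function name, the one-second base, the doubling factor and the 60-second cap are all made-up parameters.

```python
def restart_delays(attempts, base_s=1.0, factor=2.0, cap_s=60.0):
    """Return the delay in seconds before each restart attempt.

    The delay grows by `factor` after every consecutive crash and is
    clamped to `cap_s`, so a crash loop cannot restart the process rapidly.
    """
    return [min(cap_s, base_s * factor ** n) for n in range(attempts)]


print(restart_delays(8))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With a schedule like this a process that crashes once comes back almost immediately, while one that crashes repeatedly is throttled instead of being restarted in a tight loop.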
Fixing the crash
Heap corruption bugs can be extremely difficult to find and fix, but this one seems like it might be easy. I turned on pageheap on RuntimeBroker.exe, killed the relevant version to get it to restart and apply the pageheap settings and it started crashing every time I opened the start menu. I configured WER to save full crash dumps and soon had a half-dozen crash dumps with full details of what was happening.
The crashes normally happen on this call stack:
With pageheap enabled the crash happens on a very similar call stack, but slightly earlier. The crash happens earlier (and more reliably) because with pageheap when you free memory it is unmapped, so dereferencing it reliably causes a crash, instead of reading from the freed memory:
The crash happens when dereferencing [rcx] so I ran its value through the !heap command (see my pageheap blogpost for details) and got this call stack:
The only complication is that this doesn’t happen on all Windows 10 machines. There seems to be some required state that makes it happen or not and I don’t know what it is. I will say that I am happy to share the crash dumps with anybody at Microsoft who wants to investigate.
I don’t know the code and don’t understand what is happening but I’ve dealt with enough use-after-free bugs to say that this is probably straightforward to investigate and fix using the crash dumps. Although, I got a couple of different crash call stacks so there might be multiple bugs.
Conclusions
I ended my initial twitter thread by saying that these hangs were making me cranky and had me wondering if Windows 10 was abandonware. Since then I’ve been told that people seem to be seeing start-menu hangs on Windows 11, but that’s not necessarily the same problem.
To be clear, Microsoft has the technology to record traces on start menu hangs on customer machines. These traces would show roughly the same thing as my trace. They also receive crash dumps from customer machines. They might even have a way of correlating them (if not then they should hook that up). And they created pageheap which makes use-after-free crashes easy to investigate.
So, why hasn’t this been addressed? On my laptop I see that RuntimeBroker.exe has crashed, on average, every second day this year. That’s too many start-menu hangs for my tastes. I don’t know how long it has been happening so perhaps a fix is on the way; if so that would be great to hear. If not then I will continue to be cranky and I hope that this serves as a good reminder of the importance of using all that fancy telemetry to address issues like this.
Or, maybe I’m just unlucky and I’m one of the few people (not the only person) who is hitting this crash.
In short, I’m really pleased with the tools that Microsoft has created and released to let me analyze performance issues such as these. However I wish that I didn’t have to use them so often on Windows itself.