Uptime related server crashes – Barry on WordPress
This is a guest post by Iliya Polihronov. Iliya is the newest member of the global infrastructure, systems, and security team at Automattic and the first ever guest blogger here on barry.wordpress.com.
Hi, my name is Iliya and, as a Systems Wrangler at Automattic, I am one of the people handling server-side issues across the 2000 servers running WordPress.com and other Automattic services.
Last week, within two hours of each other, two of our MogileFS storage servers locked up with the following trace:
The next day, a few more servers crashed with similar traces.
We started looking for a common pattern. All hosts were running Debian kernels ranging from 2.6.32-21 to 2.6.32-24; some of them were in different data centers and served different roles in our network.
One thing we noticed was that all of the servers crashed after an uptime of a bit more than 200 days. After some research and investigation, we found that the culprit appears to be a rather interesting kernel bug.
As part of the scheduler load-balancing algorithm, the kernel searches for the busiest group within a given scheduling domain. In order to do that, it has to take into account the average load across all groups. It is calculated in the function find_busiest_group() with:
sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
sds.total_load is the sum of the load on all CPUs in the scheduling domain, based on the run-queue tasks and their priority.
SCHED_LOAD_SCALE is a constant used to increase resolution.
sds.total_pwr is the sum of the power of all CPUs in the scheduling domain. This sum ends up being zero, and that is what causes the crash – a division by zero.
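To make the failure mode concrete, here is a minimal userspace sketch of that calculation. The names mirror the kernel's sd_lb_stats and SCHED_LOAD_SCALE, but the code is simplified and not taken from the kernel source. The integer division has no guard, so when the summed CPU power of a domain is zero it faults; in a userspace program that shows up as a SIGFPE, in the kernel it is an oops.

#include <stdio.h>

#define SCHED_LOAD_SCALE (1UL << 10)  /* resolution constant */

/* simplified stand-in for the kernel's sd_lb_stats */
struct sd_lb_stats {
	unsigned long total_load; /* sum of the load of all CPUs in the domain */
	unsigned long total_pwr;  /* sum of the "CPU power" of all CPUs */
	unsigned long avg_load;   /* average load per unit of CPU power */
};

int main(void)
{
	struct sd_lb_stats sds = { .total_load = 2048, .total_pwr = 0 };

	/* same shape as the expression in find_busiest_group():
	   with total_pwr == 0 this integer division faults */
	sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;

	printf("avg_load = %lu\n", sds.avg_load);
	return 0;
}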
The “CPU power” is used to take into account how much computing capability a CPU has compared to the other CPUs. The main factors for calculating it are listed below, followed by a simplified sketch of how they combine:
1. Whether the CPU is shared, for example by using multithreading.
2. How many real-time tasks the CPU is processing.
3. In newer kernels, how much time the CPU has spent processing IRQs.
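As a rough illustration only (the helper names below are made up and this is not the kernel's actual code), each CPU starts from a baseline of SCHED_LOAD_SCALE and the factors above scale that value down; sds.total_pwr is then simply the sum of these per-CPU values across the domain:

#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/* hypothetical stubs standing in for the kernel's scaling steps;
   each returns a scaling factor expressed relative to SCHED_LOAD_SCALE */
static unsigned long smt_scale(int cpu) { (void)cpu; return SCHED_LOAD_SCALE; }
static unsigned long rt_scale(int cpu)  { (void)cpu; return SCHED_LOAD_SCALE; }

static unsigned long cpu_power(int cpu)
{
	unsigned long power = SCHED_LOAD_SCALE;               /* baseline: 1024 */

	/* factor 1: the CPU shares execution units (multithreading) */
	power = (power * smt_scale(cpu)) >> SCHED_LOAD_SHIFT;

	/* factor 2: time consumed by real-time tasks (scale_rt_power()) */
	power = (power * rt_scale(cpu)) >> SCHED_LOAD_SHIFT;

	/* factor 3, IRQ time, is only accounted for in newer kernels */
	return power;
}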
The current suggested fix for this bug relies on the theory that, while taking into account the real-time tasks (#2 above), scale_rt_power() could return a negative value, and thus the sum of all CPU powers could end up being zero.
This was merged into the 2.6.32.29 vanilla kernel, together with the IRQ accounting in cpu_power (#3 above). It has also been merged into the Debian 2.6.32-31 kernel.
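For reference, the idea behind that fix (paraphrased here with illustrative names, not quoted from the actual patch) is to clamp the time left over for non-real-time tasks at zero, so the computed power can never go negative:

#include <stdint.h>

/* Illustrative only, not the upstream patch: if real-time tasks consumed
   more than the whole measured period, report zero available time instead
   of letting the unsigned subtraction wrap around to a bogus value. */
static uint64_t available_time(uint64_t period, uint64_t rt_avg)
{
	if (period < rt_avg)
		return 0;            /* nothing left for normal tasks */
	return period - rt_avg;      /* time not consumed by real-time tasks */
}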
Alternatively, scheduler load balancing can be turned off, which will effectively skip the related code. This can be done using control groups, however it should be used with caution as it can cause performance issues:
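# mount the cpuset cgroup controller (the /cgroups mount point must already exist)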
mount -t cgroup -o cpuset cpuset /cgroups
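# disable scheduler load balancing in the root cpuset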
echo 0 > /cgroups/cpuset.sched_load_balance
As it is not yet entirely clear whether the suggested fix really solves the problem, we will try to post updates on any new developments as we observe them.