Skylake bug causes Intel chips to freeze under 'complex workloads'

Intel has disclosed that its sixth-generation Core products (known as Skylake) suffer from a CPU bug that can cause a system to hang. The company has publicly identified only one application family that triggers it, Prime95.

The Prime95 thread on Skylake instability dates back to early December, when testers noted that running the 768K test on the latest Intel processors would cause the application to fail, sometimes within minutes, sometimes only after hours. The forum users collectively worked through the usual suspects and double-checked RAM, motherboard vendors, voltage levels, clock speeds, Prime95 software versions, and whether the CPU was overclocked or not.

Disabling Hyper-Threading apparently fixes the problem (based on user reports), but none of the other variables had a measurable impact on the outcome. If you run Prime95 on a Skylake CPU with the maximum number of threads available on the processor, with the "CpuSupportsFMA3=0" setting (which forces the use of AVX), at the 768K FFT size, the system will eventually crash.

Unfortunately, Intel's current disclosure is vague at best. The complete statement reads:

Hello All,
Intel has identified an issue that potentially affects the 6th Gen Intel® Core™ family of products. This issue only occurs under certain complex workload conditions, like those that may be encountered when running applications like Prime95. In those cases, the processor may hang or cause unpredictable system behavior. Intel has identified and released a fix and is working with external business partners to get the fix deployed through BIOS.

It's not clear yet what the fix will be, or if it will require end users to avoid certain code paths or features when testing processors. Niche cases like this can have enormous impacts on companies. In the early 1990s, Intel's Pentium processors suffered from what became known as the FDIV bug. The chips worked perfectly in the vast majority of cases, but would return an incorrect value in specific floating-point divisions. Specifically, the returned values were incorrect by roughly 0.000061.
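To make that figure concrete, here is a minimal Python sketch of the arithmetic. The operands 4195835 and 3145727 and the flawed quotient are the widely cited FDIV test case, not values taken from this article. The flawed result is hard-coded as an assumption, since a modern CPU will not reproduce the bug.

```python
# Widely cited FDIV test case (operands are not from this article).
NUMERATOR = 4195835.0
DENOMINATOR = 3145727.0

# Quotient commonly reported for flawed Pentium chips. It is hard-coded
# because current hardware divides correctly and cannot reproduce it.
FLAWED_QUOTIENT = 1.333739068902037589

# What a correct FPU returns for the same division.
correct_quotient = NUMERATOR / DENOMINATOR

# Relative size of the error in the flawed result.
relative_error = (correct_quotient - FLAWED_QUOTIENT) / correct_quotient

# Classic spreadsheet check: x - (x / y) * y should be 0 on a correct chip.
residual = NUMERATOR - FLAWED_QUOTIENT * DENOMINATOR

print(f"correct quotient: {correct_quotient:.12f}")
print(f"flawed quotient:  {FLAWED_QUOTIENT:.12f}")
print(f"relative error:   {relative_error:.6f}")
print(f"residual check:   {residual:.1f}")
```

Under these assumptions, the relative error rounds to about 0.000061, in line with the figure quoted above, and the residual check comes out near 256 instead of 0.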

Nonetheless, the bug caused serious headaches for Intel. The company took a hammering in the press and a charge of $475 million against earnings to resolve the problem. Since then, we've seen a number of high-profile errors: AMD had its TLB bug with the original Phenom, and Intel's first iteration of TSX (Transactional Synchronization Extensions) was disabled via microcode update. There's also a bug in Intel's VM implementation that can allow a guest VM to error in a way that traps the CPU in an infinite loop.

Intel turned some of the flawed Pentium chips into keychains.

We think of processors as essentially flawless devices that "just work," but reality tells a different story. Check out Intel's list of errata in Haswell: there's a 5-page list of flaws and issues, nearly all of which are labeled as "No fix." The solution, in the majority of cases, is "Don't do it like that." AMD chips aren't immune from these kinds of issues by any means, but there's been less hammering on AMD chips since they don't have the enterprise market share they used to command.

Sometimes bugs are disclosed, sometimes they aren't. Piledriver has a significant problem with 256-bit AVX instructions, for example, that injects an 18-20 cycle delay into executing multiple consecutive instructions. Every original Intel Atom (before Bay Trail) had a floating-point flaw that could insert a NOP (no operation) into every other cycle, effectively doubling FPU compute time. No one bought an Atom for its FPU performance, so the bug didn't get talked about.

We'll have to wait and see what Intel's solution for this problem is. The simplest way to fix it might be to tell the CPU to avoid using AVX in specific instances, but the FDIV bug demonstrated that users often demand 100% compatible CPUs, even if they aren't using the functions that actually trigger a bug. The problem is that as CPUs add more features and capabilities, it takes longer and longer to adequately test those functions.