x86/microcode/intel: Division by zero panic in 4.9.79 and 4.4.114

From: Rolf Neugebauer @ 2018-02-06 14:09 UTC
  To: Borislav Petkov, Jia Zhang, Thomas Gleixner, Tony Luck,
	Ingo Molnar, H. Peter Anvin, x86, LKML, Greg KH, Rolf Neugebauer

The backport of 7e702d17ed1 ("x86/microcode/intel: Extend BDW
late-loading further with LLC size check") to 4.9.79 and 4.4.114 causes
a division by zero panic on some single vCPU machine types on Google
Cloud (e.g. g1-small and n1-standard-1):

[    1.591435] divide error: 0000 [#1] SMP
[    1.591961] Modules linked in:
[    1.592546] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.79-linuxkit #1
[    1.593461] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[    1.595135] task: ffffa01b2b63c040 task.stack: ffffb24f40010000
[    1.596683] RIP: 0010:[<ffffffffb7d88ce3>]  [<ffffffffb7d88ce3>] init_intel_microcode+0x49/0x5a
[    1.598018] RSP: 0000:ffffb24f40013e48  EFLAGS: 00010206
[    1.599016] RAX: 0000000001400000 RBX: 00000000ffffffea RCX: 0000000000000000
[    1.600093] RDX: 0000000000000000 RSI: ffffa01b2fffa9a8 RDI: ffffffffb7d8888b
[    1.601222] RBP: 00000000ffffffff R08: ffffa01b2fffa9a1 R09: 000000000000005f
[    1.602330] R10: 0000000000000067 R11: ffffa01b22164177 R12: 0000000000000000
[    1.603291] R13: ffffffffb7d787b6 R14: 0000000000000000 R15: 0000000000000000
[    1.604406] FS:  0000000000000000(0000) GS:ffffa01b2fc00000(0000) knlGS:0000000000000000
[    1.606154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.607493] CR2: 0000000000000000 CR3: 000000002ac0a000 CR4: 0000000000040630
[    1.609310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    1.611122] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    1.613008] Stack:
[    1.613445]  ffffffffb7d888c1 0000000000000000 ffffffffb7d1ba40 0000000000000000
[    1.615423]  0000000000000000 ffffffffb7d787b6 0000000000000000 0000000000000000
[    1.617049]  ffffffffb74fed89 00000000fffffff4 00000000ffffffff 4eae7bc133547c47
[    1.617904] Call Trace:
[    1.618198]  [<ffffffffb7d888c1>] ? microcode_init+0x36/0x1de
[    1.619264]  [<ffffffffb7d787b6>] ? set_debug_rodata+0xc/0xc
[    1.620143]  [<ffffffffb74fed89>] ? driver_register+0xaf/0xb4
[    1.621841]  [<ffffffffb7d8888b>] ? save_microcode_in_initrd+0x3c/0x3c
[    1.623424]  [<ffffffffb70021b7>] ? do_one_initcall+0x98/0x130
[    1.624659]  [<ffffffffb7d787b6>] ? set_debug_rodata+0xc/0xc
[    1.626303]  [<ffffffffb7d79091>] ? kernel_init_freeable+0x166/0x1e7
[    1.627674]  [<ffffffffb77ce3d7>] ? rest_init+0x6e/0x6e
[    1.628942]  [<ffffffffb77ce3e1>] ? kernel_init+0xa/0xe6
[    1.630243]  [<ffffffffb77d8587>] ? ret_from_fork+0x57/0x70
[    1.631592] Code: 16 0f b6 35 a0 ac fb ff 48 c7 c7 c4 ea 99 b7 e8 82 3e 41 ff 31 c0 c3 8b 05 3b ad fb ff 0f b7 0d 54 ad fb ff 31 d2 c1 e0 0a 48 98 <48> f7 f1 89 05 f4 6a 14 00 48 c7 c0 00 d8 c2 b7 c3 4c 8d 54 24
[    1.638875] RIP  [<ffffffffb7d88ce3>] init_intel_microcode+0x49/0x5a
[    1.640382]  RSP <ffffb24f40013e48>
[    1.641320] ---[ end trace e88ef332e19594b9 ]---
[    1.642418] Kernel panic - not syncing: Fatal exception
[    1.644581] Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    1.647312] ---[ end Kernel panic - not syncing: Fatal exception

On these machines x86_max_cores is zero:

[    0.227977] ftrace: allocating 35563 entries in 139 pages
[    0.273941] smpboot: x86_max_cores == zero !?!?
[    0.274645] smpboot: Max logical packages: 1

4.9.78 (and earlier) as well as 4.4.113 (and earlier) worked fine, so
this looks like a regression in stable.

4.14.16 and 4.15 (to which the above patch was backported as well)
work fine, but I noticed significant code changes for these kernels
around where x86_max_cores is set.

The obvious quick fix/hack is to check whether x86_max_cores is zero in
calc_llc_size_per_core() and return 0 in that case. I'm happy to submit
a patch for this, but a more correct fix might be backporting the
relevant changes around where x86_max_cores is set.
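
For illustration, the quick fix could look roughly like the user-space
sketch below. It is not the kernel code: the struct and function names
are hypothetical stand-ins, and the field types (a signed 32-bit cache
size in KB, a 16-bit core count) are assumptions inferred from the loads
in the Code: line above. The oops is consistent with this shape: RAX
holds 0x1400000 (20480 KB * 1024) and RCX, the divisor of the faulting
"div %rcx" (48 f7 f1), is 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two struct cpuinfo_x86 fields involved. */
struct cpu_info_sketch {
	int      x86_cache_size;  /* LLC size in KB (assumed type) */
	uint16_t x86_max_cores;   /* cores per package; 0 on the affected VMs */
};

/*
 * Sketch of the guarded calculation: LLC bytes per core, or 0 when the
 * core count is unknown (zero) instead of dividing by zero.
 */
static int calc_llc_size_per_core_sketch(const struct cpu_info_sketch *c)
{
	uint64_t llc_size = (uint64_t)c->x86_cache_size * 1024;

	/* The quick fix: bail out before the division. */
	if (!c->x86_max_cores)
		return 0;

	return (int)(llc_size / c->x86_max_cores);
}
```

With x86_cache_size = 20480 and x86_max_cores = 0 this returns 0 rather
than faulting. Whether 0 is an acceptable value for the code that later
consumes the result is something the real patch would need to check.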

Thanks
Rolf
