linux-kernel.vger.kernel.org archive mirror
* Linux 3.2: FPU Issue in execve with Intel E5-2620v3 and E7-4880v2
@ 2017-03-22  1:58 Cai, Jason
  2017-03-22  6:10 ` Greg KH
  0 siblings, 1 reply; 3+ messages in thread
From: Cai, Jason @ 2017-03-22  1:58 UTC (permalink / raw)
  To: stable, kernelnewbies, linux-kernel

Dear Kernel Hackers,

I'm Jason Cai, a kernel developer at Dell EMC. I have hit the same issue that
Lennart Sorensen reported on Dec 19, 2016.

I have now narrowed the issue down. It seems that an unexpected DNA
(Device Not Available) exception may be triggered in the `execve` code path.
Specifically, it occurs between `setup_new_exec()` and `start_thread()` in
the function `load_elf_binary()`.

I've added a BUG_ON() just before `start_thread()` in `load_elf_binary()` to
assert that the FPU state of the current process descriptor is clean
when performing an exec. It gets triggered, and the stack trace is as follows:

-----------------------------------------------------------------------------
(E3)[      1517.089157] current is bad: ffff8812227387c0 (abuse)
(E3)[      1517.089176] prev: fpu=ffff8811d846c100, fpu_src=ffff8817fbab7500, fpu_fork=ffff880bf5513740, fpu_exec=          (null)
(E3)[      1517.089190] has_fpu=1, fpu_counter=1, flags=402000, CR0=80050033
(E0)[      1517.089223] ------------[ cut here ]------------
(E2)[      1517.095250] kernel BUG at linux-3.2/fs/binfmt_elf.c:1064!
(U0)(MSG-KERN-00005):[      1517.106894] invalid opcode: 0000 [#1] SMP
(E4)[      1517.114030] CPU 23
(E4)[      1517.117055] Modules linked in: ...
(E4)[      1517.192079]
(E4)[      1517.194621] Pid: 29746, comm: abuse Tainted: P           O 3.2.33
(E4)[      1517.207783] RIP: 0010:[<ffffffff81129670>]  [<ffffffff81129670>] load_elf_binary+0x1858/0x1983
(E4)[      1517.218284] RSP: 0018:ffff8817fa15fd08  EFLAGS: 00010292
(E4)[      1517.225087] RAX: 0000000000000053 RBX: ffff8812227387c0 RCX: 0000000081000000
(E4)[      1517.233924] RDX: 0000000081000000 RSI: 0000000000000046 RDI: ffffffff81721140
(E4)[      1517.242761] RBP: ffff8817fa15fe18 R08: 0000000000000000 R09: 000000020fc00000
(E4)[      1517.251597] R10: ffff88187a15fc17 R11: 0000000000000000 R12: ffff880622e3ef80
(E4)[      1517.260432] R13: ffff8811c4333400 R14: ffff8812227387c0 R15: ffff8817fa15ff58
(E4)[      1517.269269] FS:  0000000000000000(0000) GS:ffff88183fd60000(0000) knlGS:0000000000000000
(E4)[      1517.279169] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
(E4)[      1517.286455] CR2: 00007fbca10dcba8 CR3: 00000011dd8a7000 CR4: 00000000001407e0
(E4)[      1517.295290] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
(E4)[      1517.304125] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
(E4)[      1517.312960] Process abuse (pid: 29746, threadinfo ffff8817fa15e000, task ffff8812227387c0)
(E0)[      1517.323055] Stack:
(E4)[      1517.326178]  0000000000000001 00007fffd47a98e8 00007fffd47a9988 ffff881200000008
(E4)[      1517.335384]  ffff880627961680 ffff8812227387c0 ffff8817fa15e000 ffff8817fa15e000
(E4)[      1517.344586]  ffff8817fa15e000 ffff8812227387c0 0000000000500988 0000000000500778
(E0)[      1517.353798] Call Trace:
(E4)[      1517.357416]  [<ffffffff810ec3a0>] search_binary_handler+0xd6/0x273
(E4)[      1517.365196]  [<ffffffff810edeed>] do_execve_common.clone.28+0x1e1/0x2e8
(E4)[      1517.373458]  [<ffffffff810ee00f>] do_execve+0x1b/0x1d
(E4)[      1517.379975]  [<ffffffff810092b1>] sys_execve+0x49/0xe1
(E4)[      1517.386589]  [<ffffffff813a4b4c>] stub_execve+0x6c/0xc0
(E0)[      1517.393293] Code: 81 31 c0 e8 c3 27 f1 ff 41 0f 20 c0 48 c7 c7 f0 49 51 81 8b 4b 14 0f b6 93 b8 01 00 00 48 8b b3 d8 04 00 00 31 c0 e8 a0 27 f1 ff <0f> 0b 49 8b 95 98 00 00 00 48 8b 75 b8 4c 89 ff e8 ba 7d ed ff
(U1)(MSG-KERN-00005):[      1517.416621] RIP  [<ffffffff81129670>] load_elf_binary+0x1858/0x1983
(E4)[      1517.426164]  RSP <ffff8817fa15fd08>
(E4)[      1517.430961] ---[ end trace 5dcaec314d0b0edb ]---
(U0)(MSG-KERN-00018):[      1517.436994] Kernel panic - not syncing: Fatal exception
(E4)[      1517.445346] Pid: 29746, comm: abuse Tainted: P      D    O 3.2.33
(E4)[      1517.454276] Call Trace:
(E4)[      1517.457893]  [<ffffffff8139af77>] panic+0xb2/0x1d2
(E4)[      1517.464122]  [<ffffffff8103c75a>] ? kmsg_dump+0x5d/0xdf
(E4)[      1517.470825]  [<ffffffff8139eb8a>] oops_end+0xae/0xbe
(E4)[      1517.477246]  [<ffffffff81004b81>] die+0x5a/0x65
(E4)[      1517.483185]  [<ffffffff8139e6b8>] do_trap+0x121/0x130
(E4)[      1517.489703]  [<ffffffff81002a27>] do_invalid_op+0x96/0x9f
(E4)[      1517.496601]  [<ffffffff81129670>] ? load_elf_binary+0x1858/0x1983
(E4)[      1517.504280]  [<ffffffff813a63f5>] invalid_op+0x15/0x20
(E4)[      1517.510893]  [<ffffffff81129670>] ? load_elf_binary+0x1858/0x1983
(E4)[      1517.518575]  [<ffffffff81129670>] ? load_elf_binary+0x1858/0x1983
(E4)[      1517.526257]  [<ffffffff810ec3a0>] search_binary_handler+0xd6/0x273
(E4)[      1517.534035]  [<ffffffff810edeed>] do_execve_common.clone.28+0x1e1/0x2e8
(E4)[      1517.542289]  [<ffffffff810ee00f>] do_execve+0x1b/0x1d
(E4)[      1517.548810]  [<ffffffff810092b1>] sys_execve+0x49/0xe1
(E4)[      1517.555427]  [<ffffffff813a4b4c>] stub_execve+0x6c/0xc0
--------------------------------------------------------------------------------------------------

The kernel code I'm testing is the same as the stable branch linux-3.2.y.
AFAIK, there are no FPU instructions between `setup_new_exec()` and
`start_thread()` in `load_elf_binary()`.

The BUG_ON() code is as follows:
--------------------------------------------------------------------------------------------------
if (current->thread.has_fpu || current->fpu_counter || tsk_used_math(current)) {
	/* printk some status related to FPU ... */
	BUG_ON(1);
}
--------------------------------------------------------------------------------------------------

Maybe a quick fix would be simply not to free the FPU state in `start_thread_common()`.

Last but not least, so far this issue has only been seen on systems equipped
with Intel E5-2620v3 and E7-4880v2 CPUs.

Thus, I'm still wondering whether this could be a CPU issue or something else.
How can I verify it?

I would greatly appreciate any feedback.

Best regards,
Jason Cai


* Re: Linux 3.2: FPU Issue in execve with Intel E5-2620v3 and E7-4880v2
  2017-03-22  1:58 Linux 3.2: FPU Issue in execve with Intel E5-2620v3 and E7-4880v2 Cai, Jason
@ 2017-03-22  6:10 ` Greg KH
  2017-03-22  6:19   ` Cai, Jason
  0 siblings, 1 reply; 3+ messages in thread
From: Greg KH @ 2017-03-22  6:10 UTC (permalink / raw)
  To: Cai, Jason; +Cc: stable, kernelnewbies, linux-kernel

On Wed, Mar 22, 2017 at 01:58:58AM +0000, Cai, Jason wrote:
> Dear Kernel Hackers,
> 
> I'm Jason Cai, a kernel developer at Dell EMC. I have hit the same issue that
> Lennart Sorensen reported on Dec 19, 2016.
> 
> I have now narrowed the issue down. It seems that an unexpected DNA
> (Device Not Available) exception may be triggered in the `execve` code path.
> Specifically, it occurs between `setup_new_exec()` and `start_thread()` in
> the function `load_elf_binary()`.
> 
> I've added a BUG_ON() just before `start_thread()` in `load_elf_binary()` to
> assert that the FPU state of the current process descriptor is clean
> when performing an exec. It gets triggered, and the stack trace is as follows:

As you have a closed kernel module loaded, it's impossible for us to
actually tell what you are doing, or support you at all, sorry.  Please
work with the group that gave you that code, as they are the only ones
that can do so.

Also, does this happen with 4.10?  3.2 is _really_ old you know.

thanks,

greg k-h


* RE: Linux 3.2: FPU Issue in execve with Intel E5-2620v3 and E7-4880v2
  2017-03-22  6:10 ` Greg KH
@ 2017-03-22  6:19   ` Cai, Jason
  0 siblings, 0 replies; 3+ messages in thread
From: Cai, Jason @ 2017-03-22  6:19 UTC (permalink / raw)
  To: Greg KH; +Cc: stable, kernelnewbies, linux-kernel

Hi Greg K.H.,

Thanks for your reply. Yes, you're right.

I finally found the root cause of my FPU issue. It's a bug in one of our drivers,
which registers a timer and uses the FPU in its timer function. That triggered a
DNA (Device Not Available) exception and unexpectedly changed the FPU state of the
current process. At the end of `load_elf_binary()`, the FPU state is freed by
calling `start_thread_common()`, but the incorrect FPU state remains after
`load_elf_binary()`.

Anyway, thanks for your kindness and your information.

Best regards,
Jason Cai



