Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?

From: Andy Lutomirski <luto@amacapital.net>
To: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Kees Cook <keescook@chromium.org>,
	David Drysdale <drysdale@google.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Will Drewry <wad@chromium.org>, Ingo Molnar <mingo@kernel.org>,
	Alok Kataria <akataria@vmware.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Borislav Petkov <bp@alien8.de>,
	Alexei Starovoitov <ast@plumgrid.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Oleg Nesterov <oleg@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>, X86 ML <x86@kernel.org>
Subject: Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
Date: Thu, 13 Aug 2015 14:47:34 -0700
Message-ID: <CALCETrWj6X+zCX47iHip0Qjp88p7f3p0uWvskL357XCyvPrXVA@mail.gmail.com>
In-Reply-To: <55CD0DAC.9080809@redhat.com>

On Thu, Aug 13, 2015 at 2:35 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 08/13/2015 08:47 PM, Kees Cook wrote:
>> On Thu, Aug 13, 2015 at 10:39 AM, David Drysdale <drysdale@google.com> wrote:
>>> On Thu, Aug 13, 2015 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <drysdale@google.com> wrote:
>>>>> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>>>>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>>>>>> else could reproduce it.
>>>>>>>
>>>>>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>>>>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>>>>>> filter, plus other oddities (selftest output below).
>>>>>>>
>>>>>>> The problem seems to need a combination of circumstances to occur:
>>>>>>>
>>>>>>>  - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>>>>>>    64-bit kernel -- I'm testing with seccomp_bpf from
>>>>>>>    tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>>>>>
>>>>>> Does it work correctly when built as 64-bit program?
>>>>>
>>>>> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>>>>>
>>>>>>>
>>>>>>>  - The kernel needs to be running as a VM guest -- it occurs inside my
>>>>>>>    VMware Fusion host, but not if I run on bare metal.  Kees tells me he
>>>>>>>    cannot repro with a kvm guest though.
>>>>>>>
>>>>>>> Bisecting indicates that the commit that induces the problem is
>>>>>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>>>>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>>>>>
>>>>>>> Apologies if I've just got something odd with my local setup, but the
>>>>>>> bisection was unequivocal enough that I thought it worth reporting...
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>> seccomp_bpf failure outputs:
>>>>>
>>>>> [snip]
>>>>>
>>>>>> End result should be:
>>>>>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>>>>>> pt_regs->orig_ax = -1 ("skip syscall")
>>>>>> and syscall_trace_enter_phase1() usually returns with 0,
>>>>>> meaning "re-execute syscall at once, no phase2 needed".
>>>>>>
>>>>>> This, in turn, is called from .S files, and when it returns there,
>>>>>> execution loops back to syscall dispatch.
>>>>>>
>>>>>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>>>>>> So -E2BIG should survive and be returned...
>>>>>
>>>>> So I was just about to send:
>>>>>
>>>>>  That makes sense, and given that exactly the same 32-bit binary
>>>>>  runs fine on a different machine, there's presumably something up
>>>>>  with my local setup.  The failing machine is a VMware guest, but
>>>>>  maybe that's not the relevant interaction -- particularly if no-one
>>>>>  else can repro.
>>>>>
>>>>> But then I noticed some odd audit entries in the main log:
>>>>>
>>>>> Aug 13 16:52:56 ubuntu kernel: [   20.687249] audit: type=1326
>>>>> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
>>>>> ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
>>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
>>>>> compat=1 ip=0xf773cc90 code=0x0
>>>>> Aug 13 16:52:56 ubuntu kernel: [   20.691157] audit: type=1326
>>>>> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
>>>>> ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
>>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
>>>>> compat=1 ip=0xf773cc90 code=0x10000000
>>>>> ...
>>>>>
>>>>> I didn't think I had any audit stuff turned on, and indeed:
>>>>>   # auditctl -l
>>>>>   No rules
>>>>>
>>>>> But as soon as I'd run that auditctl command, the 32-bit
>>>>> seccomp_bpf binary started running fine!
>>>>>
>>>>> So now I'm confused, and I can no longer reproduce the
>>>>> problem.  Which probably means this was a false alarm, in
>>>>> which case, my apologies.
>>>>
>>>> You might have triggered TIF_AUDIT or whatever it's called, which
>>>> causes a whole different path through the asm tangle, so you might
>>>> really have a problem.
>>>>
>>>> Try auditctl -a task,never.  If that doesn't change anything, try
>>>> rebooting the guest.
>>>
>>> Aha, that seems to re-instate the problem -- with that auditctl setup
>>> I get the 32-bit seccomp failures on two different machines (one VM,
>>> one bare).  So can anyone else repro?
>>>
>>> I guess the relevant steps are thus:
>>>   - sudo auditctl -a task,never
>>>   - cd tools/testing/selftests/seccomp
>>>   - CFLAGS=-m32 make clean run_tests
>>
>> That was it! I can reproduce this now on kvm (after adding the auditctl rule).
>
> I suspect this change:
>
>         .macro auditsys_entry_common
> ...
>         movl %ebx,%esi                  /* 2nd arg: 1st syscall arg */
>         movl %eax,%edi                  /* 1st arg: syscall number */
>         call __audit_syscall_entry
> -       movl RAX(%rsp),%eax     /* reload syscall number */
> -       cmpq $(IA32_NR_syscalls-1),%rax
> -       ja ia32_badsys
> +       movl ORIG_RAX(%rsp),%eax        /* reload syscall number */
>         movl %ebx,%edi                  /* reload 1st syscall arg */
>         movl RCX(%rsp),%esi     /* reload 2nd syscall arg */
>         movl RDX(%rsp),%edx     /* reload 3rd syscall arg */
>
> We were reloading syscall# from pt_regs->ax.

I am so glad that this code is gone in -tip.  Good riddance!

>
> After the patch, pt_regs->ax isn't equal to syscall# on entry,
> instead it contains -ENOSYS. Therefore the change shown above
> was made, to reload it from pt_regs->orig_ax.
>
> Well. This still should work... in fact it is "more correct"
> than it was before...
>
> 64-bit code has no call to __audit_syscall_entry, it uses
> syscall_trace_enter_phase1/phase2 mechanism instead of
> "only audit" shortcut. If the bug is here (though I don't see it),
> it explains why 64-bit binary works.
>
>
> Now, how do we reach this bit of code?
>
> ia32_sysenter_target:
> ...
>         testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz  sysenter_tracesys
> ...
> sysenter_tracesys:
>         testl   $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jz      sysenter_auditsys
> ...
> sysenter_auditsys:
>         auditsys_entry_common     <== OUR MACRO
>         movl %ebp,%r9d                  /* reload 6th syscall arg */
>         jmp sysenter_dispatch
>
>
> ia32_cstar_target:
> ...
>         testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz   cstar_tracesys
> ...
> cstar_tracesys:
>         testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jz cstar_auditsys
> ...
> cstar_auditsys:
>         movl %r9d,R9(%rsp)      /* register to be clobbered by call */
>         auditsys_entry_common  <== OUR MACRO
>         movl R9(%rsp),%r9d      /* reload 6th syscall arg */
>         jmp cstar_dispatch
>

TIF_SECCOMP had better be set, so that code should be unreachable.

syscall_trace_enter_phase1 returns 0 if we hit SECCOMP_RET_ERRNO (i.e.
SECCOMP_PHASE1_SKIP).  syscall_trace_enter sees that and returns
regs->orig_ax, which is -1.

It seems to me that the bug is that sysexit_from_sys_call isn't
reloading RAX from regs->ax.

--Andy