[Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
@ 2015-08-13  8:30 David Drysdale
  2015-08-13 15:17 ` Denys Vlasenko
  0 siblings, 1 reply; 22+ messages in thread
From: David Drysdale @ 2015-08-13  8:30 UTC (permalink / raw)
  To: Kees Cook, Denys Vlasenko, Andy Lutomirski, linux-kernel,
	Will Drewry, Ingo Molnar
  Cc: Alok Kataria, Linus Torvalds, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

Hi folks,

I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
else could reproduce it.

The problem occurs with a seccomp-bpf filter program that's set up to return
an errno value -- an errno of 1 is always returned instead of what's in the
filter, plus other oddities (selftest output below).

The problem seems to need a combination of circumstances to occur:

 - The seccomp-bpf userspace program needs to be 32-bit, running against a
   64-bit kernel -- I'm testing with seccomp_bpf from
   tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.

 - The kernel needs to be running as a VM guest -- it occurs inside my
   VMware Fusion host, but not if I run on bare metal.  Kees tells me he
   cannot repro with a kvm guest though.

Bisecting indicates that the commit that induces the problem is
3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
64-bit logic", included in all the v4.2-rc* candidates.

Apologies if I've just got something odd with my local setup, but the
bisection was unequivocal enough that I thought it worth reporting...

Thanks,
David

seccomp_bpf failure outputs:

seccomp_bpf.c:533:global.ERRNO_valid:Expected 7 (7) ==
(*__errno_location ()) (1)
seccomp_bpf.c:560:global.ERRNO_zero:Expected 0 (0) == read(0, ((void
*)0), 0) (4294967295)
seccomp_bpf.c:587:global.ERRNO_capped:Expected 4095 (4095) ==
(*__errno_location ()) (1)
seccomp_bpf.c:905:precedence.errno_is_third:Expected 0 (0) ==
syscall(20) (4294967295)
seccomp_bpf.c:925:precedence.errno_is_third_in_any_order:Expected 0
(0) == syscall(20) (4294967295)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13  8:30 [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM? David Drysdale
@ 2015-08-13 15:17 ` Denys Vlasenko
  2015-08-13 16:28   ` David Drysdale
  0 siblings, 1 reply; 22+ messages in thread
From: Denys Vlasenko @ 2015-08-13 15:17 UTC (permalink / raw)
  To: David Drysdale, Kees Cook, Andy Lutomirski, linux-kernel,
	Will Drewry, Ingo Molnar
  Cc: Alok Kataria, Linus Torvalds, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On 08/13/2015 10:30 AM, David Drysdale wrote:
> Hi folks,
> 
> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
> else could reproduce it.
> 
> The problem occurs with a seccomp-bpf filter program that's set up to return
> an errno value -- an errno of 1 is always returned instead of what's in the
> filter, plus other oddities (selftest output below).
> 
> The problem seems to need a combination of circumstances to occur:
> 
>  - The seccomp-bpf userspace program needs to be 32-bit, running against a
>    64-bit kernel -- I'm testing with seccomp_bpf from
>    tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.

Does it work correctly when built as 64-bit program?

> 
>  - The kernel needs to be running as a VM guest -- it occurs inside my
>    VMware Fusion host, but not if I run on bare metal.  Kees tells me he
>    cannot repro with a kvm guest though.
> 
> Bisecting indicates that the commit that induces the problem is
> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
> 64-bit logic", included in all the v4.2-rc* candidates.
> 
> Apologies if I've just got something odd with my local setup, but the
> bisection was unequivocal enough that I thought it worth reporting...
> 
> Thanks,
> David
> 
> 
> seccomp_bpf failure outputs:
> 
> seccomp_bpf.c:533:global.ERRNO_valid:Expected 7 (7) ==
> (*__errno_location ()) (1)

Test source code:

TEST(ERRNO_valid)
{
        struct sock_filter filter[] = {
                BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
                        offsetof(struct seccomp_data, nr)),
                BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_read, 0, 1),
                BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | E2BIG),
                BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
                .len = (unsigned short)ARRAY_SIZE(filter),
                .filter = filter,
        };
        long ret;
        pid_t parent = getppid();

        ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        ASSERT_EQ(0, ret);

        ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
        ASSERT_EQ(0, ret);

        EXPECT_EQ(parent, syscall(__NR_getppid));
        EXPECT_EQ(-1, read(0, NULL, 0));
        EXPECT_EQ(E2BIG, errno);
}

The last EXPECT expects 7 (E2BIG) but sees 1.


I'm trying to see how that happens.
SECCOMP_RET_ERRNO action is processed as follows:


static u32 __seccomp_phase1_filter(int this_syscall, struct seccomp_data *sd)
{
...
        case SECCOMP_RET_ERRNO:
                /* Set low-order bits as an errno, capped at MAX_ERRNO. */
                if (data > MAX_ERRNO)
                        data = MAX_ERRNO;
                syscall_set_return_value(current, task_pt_regs(current),
                                         -data, 0);
                goto skip;
...
skip:
        audit_seccomp(this_syscall, 0, action);
        return SECCOMP_PHASE1_SKIP;  // "the syscall should not be invoked"
}

The above is called from:

unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
{
...
        if (work & _TIF_SECCOMP) {
...                ret = seccomp_phase1(&sd);
                if (ret == SECCOMP_PHASE1_SKIP) {
                        regs->orig_ax = -1;
                        ret = 0;
                }
		...
        }
        /* Do our best to finish without phase 2. */
        if (work == 0)
                return ret;  /* seccomp and/or nohz only (ret == 0 here) */
#ifdef CONFIG_AUDITSYSCALL
        if (work == _TIF_SYSCALL_AUDIT) {
                /*
                 * If there is no more work to be done except auditing,
                 * then audit in phase 1.  Phase 2 always audits, so, if
                 * we audit here, then we can't go on to phase 2.
                 */
                do_audit_syscall_entry(regs, arch);
                return 0;
        }
#endif
        return 1;  /* Something is enabled that we can't handle in phase 1 */
}
...
long syscall_trace_enter(struct pt_regs *regs)
{
        u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
        unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);

        if (phase1_result == 0)
                return regs->orig_ax;
        else
                return syscall_trace_enter_phase2(regs, arch, phase1_result);
}


End result should be:
pt_regs->ax = -E2BIG (via syscall_set_return_value())
pt_regs->orig_ax = -1 ("skip syscall")
and syscall_trace_enter_phase1() usually returns with 0,
meaning "re-execute syscall at once, no phase2 needed".

This, in turn, is called from .S files, and when it returns there,
execution loops back to syscall dispatch.

Because of orig_ax = -1, syscall dispatch should skip calling syscall.
So -E2BIG should survive and be returned...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 15:17 ` Denys Vlasenko
@ 2015-08-13 16:28   ` David Drysdale
  2015-08-13 17:15     ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: David Drysdale @ 2015-08-13 16:28 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Kees Cook, Andy Lutomirski, linux-kernel, Will Drewry,
	Ingo Molnar, Alok Kataria, Linus Torvalds, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 08/13/2015 10:30 AM, David Drysdale wrote:
>> Hi folks,
>>
>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>> else could reproduce it.
>>
>> The problem occurs with a seccomp-bpf filter program that's set up to return
>> an errno value -- an errno of 1 is always returned instead of what's in the
>> filter, plus other oddities (selftest output below).
>>
>> The problem seems to need a combination of circumstances to occur:
>>
>>  - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>    64-bit kernel -- I'm testing with seccomp_bpf from
>>    tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>
> Does it work correctly when built as 64-bit program?

Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).

>>
>>  - The kernel needs to be running as a VM guest -- it occurs inside my
>>    VMware Fusion host, but not if I run on bare metal.  Kees tells me he
>>    cannot repro with a kvm guest though.
>>
>> Bisecting indicates that the commit that induces the problem is
>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>> 64-bit logic", included in all the v4.2-rc* candidates.
>>
>> Apologies if I've just got something odd with my local setup, but the
>> bisection was unequivocal enough that I thought it worth reporting...
>>
>> Thanks,
>> David
>>
>>
>> seccomp_bpf failure outputs:

[snip]

> End result should be:
> pt_regs->ax = -E2BIG (via syscall_set_return_value())
> pt_regs->orig_ax = -1 ("skip syscall")
> and syscall_trace_enter_phase1() usually returns with 0,
> meaning "re-execute syscall at once, no phase2 needed".
>
> This, in turn, is called from .S files, and when it returns there,
> execution loops back to syscall dispatch.
>
> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
> So -E2BIG should survive and be returned...

So I was just about to send:

 That makes sense, and given that exactly the same 32-bit binary
 runs fine on a different machine, there's presumably something up
 with my local setup.  The failing machine is a VMware guest, but
 maybe that's not the relevant interaction -- particularly if no-one
 else can repro.

But then I noticed some odd audit entries in the main log:

Aug 13 16:52:56 ubuntu kernel: [   20.687249] audit: type=1326
audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
compat=1 ip=0xf773cc90 code=0x0
Aug 13 16:52:56 ubuntu kernel: [   20.691157] audit: type=1326
audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
compat=1 ip=0xf773cc90 code=0x10000000
...

I didn't think I had any audit stuff turned on, and indeed:
  # auditctl -l
  No rules

But as soon as I'd run that auditctl command, the 32-bit
seccomp_bpf binary started running fine!

So now I'm confused, and I can no longer reproduce the
problem.  Which probably means this was a false alarm, in
which case, my apologies.

D.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 16:28   ` David Drysdale
@ 2015-08-13 17:15     ` Andy Lutomirski
  2015-08-13 17:39       ` David Drysdale
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2015-08-13 17:15 UTC (permalink / raw)
  To: David Drysdale
  Cc: Denys Vlasenko, Kees Cook, linux-kernel, Will Drewry,
	Ingo Molnar, Alok Kataria, Linus Torvalds, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <drysdale@google.com> wrote:
> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>> Hi folks,
>>>
>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>> else could reproduce it.
>>>
>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>> filter, plus other oddities (selftest output below).
>>>
>>> The problem seems to need a combination of circumstances to occur:
>>>
>>>  - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>>    64-bit kernel -- I'm testing with seccomp_bpf from
>>>    tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>
>> Does it work correctly when built as 64-bit program?
>
> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>
>>>
>>>  - The kernel needs to be running as a VM guest -- it occurs inside my
>>>    VMware Fusion host, but not if I run on bare metal.  Kees tells me he
>>>    cannot repro with a kvm guest though.
>>>
>>> Bisecting indicates that the commit that induces the problem is
>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>
>>> Apologies if I've just got something odd with my local setup, but the
>>> bisection was unequivocal enough that I thought it worth reporting...
>>>
>>> Thanks,
>>> David
>>>
>>>
>>> seccomp_bpf failure outputs:
>
> [snip]
>
>> End result should be:
>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>> pt_regs->orig_ax = -1 ("skip syscall")
>> and syscall_trace_enter_phase1() usually returns with 0,
>> meaning "re-execute syscall at once, no phase2 needed".
>>
>> This, in turn, is called from .S files, and when it returns there,
>> execution loops back to syscall dispatch.
>>
>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>> So -E2BIG should survive and be returned...
>
> So I was just about to send:
>
>  That makes sense, and given that exactly the same 32-bit binary
>  runs fine on a different machine, there's presumably something up
>  with my local setup.  The failing machine is a VMware guest, but
>  maybe that's not the relevant interaction -- particularly if no-one
>  else can repro.
>
> But then I noticed some odd audit entries in the main log:
>
> Aug 13 16:52:56 ubuntu kernel: [   20.687249] audit: type=1326
> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
> ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
> compat=1 ip=0xf773cc90 code=0x0
> Aug 13 16:52:56 ubuntu kernel: [   20.691157] audit: type=1326
> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
> ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
> compat=1 ip=0xf773cc90 code=0x10000000
> ...
>
> I didn't think I had any audit stuff turned on, and indeed:
>   # auditctl -l
>   No rules
>
> But as soon as I'd run that auditctl command, the 32-bit
> seccomp_bpf binary started running fine!
>
> So now I'm confused, and I can no longer reproduce the
> problem.  Which probably means this was a false alarm, in
> which case, my apologies.

You might have triggered TIF_AUDIT or whatever it's called, which
causes a whole different path through the asm tangle, so you might
really have a problem.

Try auditctl -a task,never.  If that doesn't change anything, try
rebooting the guest.

--Andy

>
> D.



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 17:15     ` Andy Lutomirski
@ 2015-08-13 17:39       ` David Drysdale
  2015-08-13 18:47         ` Kees Cook
  0 siblings, 1 reply; 22+ messages in thread
From: David Drysdale @ 2015-08-13 17:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Kees Cook, linux-kernel, Will Drewry,
	Ingo Molnar, Alok Kataria, Linus Torvalds, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <drysdale@google.com> wrote:
>> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>>> Hi folks,
>>>>
>>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>>> else could reproduce it.
>>>>
>>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>>> filter, plus other oddities (selftest output below).
>>>>
>>>> The problem seems to need a combination of circumstances to occur:
>>>>
>>>>  - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>>>    64-bit kernel -- I'm testing with seccomp_bpf from
>>>>    tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>>
>>> Does it work correctly when built as 64-bit program?
>>
>> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>>
>>>>
>>>>  - The kernel needs to be running as a VM guest -- it occurs inside my
>>>>    VMware Fusion host, but not if I run on bare metal.  Kees tells me he
>>>>    cannot repro with a kvm guest though.
>>>>
>>>> Bisecting indicates that the commit that induces the problem is
>>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>>
>>>> Apologies if I've just got something odd with my local setup, but the
>>>> bisection was unequivocal enough that I thought it worth reporting...
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>
>>>> seccomp_bpf failure outputs:
>>
>> [snip]
>>
>>> End result should be:
>>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>>> pt_regs->orig_ax = -1 ("skip syscall")
>>> and syscall_trace_enter_phase1() usually returns with 0,
>>> meaning "re-execute syscall at once, no phase2 needed".
>>>
>>> This, in turn, is called from .S files, and when it returns there,
>>> execution loops back to syscall dispatch.
>>>
>>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>>> So -E2BIG should survive and be returned...
>>
>> So I was just about to send:
>>
>>  That makes sense, and given that exactly the same 32-bit binary
>>  runs fine on a different machine, there's presumably something up
>>  with my local setup.  The failing machine is a VMware guest, but
>>  maybe that's not the relevant interaction -- particularly if no-one
>>  else can repro.
>>
>> But then I noticed some odd audit entries in the main log:
>>
>> Aug 13 16:52:56 ubuntu kernel: [   20.687249] audit: type=1326
>> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
>> ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
>> compat=1 ip=0xf773cc90 code=0x0
>> Aug 13 16:52:56 ubuntu kernel: [   20.691157] audit: type=1326
>> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
>> ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
>> compat=1 ip=0xf773cc90 code=0x10000000
>> ...
>>
>> I didn't think I had any audit stuff turned on, and indeed:
>>   # auditctl -l
>>   No rules
>>
>> But as soon as I'd run that auditctl command, the 32-bit
>> seccomp_bpf binary started running fine!
>>
>> So now I'm confused, and I can no longer reproduce the
>> problem.  Which probably means this was a false alarm, in
>> which case, my apologies.
>
> You might have triggered TIF_AUDIT or whatever it's called, which
> causes a whole different path through the asm tangle, so you might
> really have a problem.
>
> Try auditctl -a task,never.  If that doesn't change anything, try
> rebooting the guest.

Aha, that seems to re-instate the problem -- with that auditctl setup
I get the 32-bit seccomp failures on two different machines (one VM,
one bare).  So can anyone else repro?

I guess the relevant steps are thus:
  - sudo auditctl -a task,never
  - cd tools/testing/selftests/seccomp
  - CFLAGS=-m32 make clean run_tests

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 17:39       ` David Drysdale
@ 2015-08-13 18:47         ` Kees Cook
  2015-08-13 21:35           ` Denys Vlasenko
  0 siblings, 1 reply; 22+ messages in thread
From: Kees Cook @ 2015-08-13 18:47 UTC (permalink / raw)
  To: David Drysdale
  Cc: Andy Lutomirski, Denys Vlasenko, linux-kernel, Will Drewry,
	Ingo Molnar, Alok Kataria, Linus Torvalds, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 10:39 AM, David Drysdale <drysdale@google.com> wrote:
> On Thu, Aug 13, 2015 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <drysdale@google.com> wrote:
>>> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>>>> Hi folks,
>>>>>
>>>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>>>> else could reproduce it.
>>>>>
>>>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>>>> filter, plus other oddities (selftest output below).
>>>>>
>>>>> The problem seems to need a combination of circumstances to occur:
>>>>>
>>>>>  - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>>>>    64-bit kernel -- I'm testing with seccomp_bpf from
>>>>>    tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>>>
>>>> Does it work correctly when built as 64-bit program?
>>>
>>> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>>>
>>>>>
>>>>>  - The kernel needs to be running as a VM guest -- it occurs inside my
>>>>>    VMware Fusion host, but not if I run on bare metal.  Kees tells me he
>>>>>    cannot repro with a kvm guest though.
>>>>>
>>>>> Bisecting indicates that the commit that induces the problem is
>>>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>>>
>>>>> Apologies if I've just got something odd with my local setup, but the
>>>>> bisection was unequivocal enough that I thought it worth reporting...
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>
>>>>> seccomp_bpf failure outputs:
>>>
>>> [snip]
>>>
>>>> End result should be:
>>>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>>>> pt_regs->orig_ax = -1 ("skip syscall")
>>>> and syscall_trace_enter_phase1() usually returns with 0,
>>>> meaning "re-execute syscall at once, no phase2 needed".
>>>>
>>>> This, in turn, is called from .S files, and when it returns there,
>>>> execution loops back to syscall dispatch.
>>>>
>>>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>>>> So -E2BIG should survive and be returned...
>>>
>>> So I was just about to send:
>>>
>>>  That makes sense, and given that exactly the same 32-bit binary
>>>  runs fine on a different machine, there's presumably something up
>>>  with my local setup.  The failing machine is a VMware guest, but
>>>  maybe that's not the relevant interaction -- particularly if no-one
>>>  else can repro.
>>>
>>> But then I noticed some odd audit entries in the main log:
>>>
>>> Aug 13 16:52:56 ubuntu kernel: [   20.687249] audit: type=1326
>>> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
>>> ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
>>> compat=1 ip=0xf773cc90 code=0x0
>>> Aug 13 16:52:56 ubuntu kernel: [   20.691157] audit: type=1326
>>> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
>>> ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
>>> compat=1 ip=0xf773cc90 code=0x10000000
>>> ...
>>>
>>> I didn't think I had any audit stuff turned on, and indeed:
>>>   # auditctl -l
>>>   No rules
>>>
>>> But as soon as I'd run that auditctl command, the 32-bit
>>> seccomp_bpf binary started running fine!
>>>
>>> So now I'm confused, and I can no longer reproduce the
>>> problem.  Which probably means this was a false alarm, in
>>> which case, my apologies.
>>
>> You might have triggered TIF_AUDIT or whatever it's called, which
>> causes a whole different path through the asm tangle, so you might
>> really have a problem.
>>
>> Try auditctl -a task,never.  If that doesn't change anything, try
>> rebooting the guest.
>
> Aha, that seems to re-instate the problem -- with that auditctl setup
> I get the 32-bit seccomp failures on two different machines (one VM,
> one bare).  So can anyone else repro?
>
> I guess the relevant steps are thus:
>   - sudo auditctl -a task,never
>   - cd tools/testing/selftests/seccomp
>   - CFLAGS=-m32 make clean run_tests

That was it! I can reproduce this now on kvm (after adding the auditctl rule).

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 18:47         ` Kees Cook
@ 2015-08-13 21:35           ` Denys Vlasenko
  2015-08-13 21:47             ` Andy Lutomirski
  2015-08-13 22:27             ` Linus Torvalds
  0 siblings, 2 replies; 22+ messages in thread
From: Denys Vlasenko @ 2015-08-13 21:35 UTC (permalink / raw)
  To: Kees Cook, David Drysdale
  Cc: Andy Lutomirski, linux-kernel, Will Drewry, Ingo Molnar,
	Alok Kataria, Linus Torvalds, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On 08/13/2015 08:47 PM, Kees Cook wrote:
> On Thu, Aug 13, 2015 at 10:39 AM, David Drysdale <drysdale@google.com> wrote:
>> On Thu, Aug 13, 2015 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <drysdale@google.com> wrote:
>>>> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>>>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>>>>> Hi folks,
>>>>>>
>>>>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>>>>> else could reproduce it.
>>>>>>
>>>>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>>>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>>>>> filter, plus other oddities (selftest output below).
>>>>>>
>>>>>> The problem seems to need a combination of circumstances to occur:
>>>>>>
>>>>>>  - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>>>>>    64-bit kernel -- I'm testing with seccomp_bpf from
>>>>>>    tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>>>>
>>>>> Does it work correctly when built as 64-bit program?
>>>>
>>>> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>>>>
>>>>>>
>>>>>>  - The kernel needs to be running as a VM guest -- it occurs inside my
>>>>>>    VMware Fusion host, but not if I run on bare metal.  Kees tells me he
>>>>>>    cannot repro with a kvm guest though.
>>>>>>
>>>>>> Bisecting indicates that the commit that induces the problem is
>>>>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>>>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>>>>
>>>>>> Apologies if I've just got something odd with my local setup, but the
>>>>>> bisection was unequivocal enough that I thought it worth reporting...
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>
>>>>>> seccomp_bpf failure outputs:
>>>>
>>>> [snip]
>>>>
>>>>> End result should be:
>>>>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>>>>> pt_regs->orig_ax = -1 ("skip syscall")
>>>>> and syscall_trace_enter_phase1() usually returns with 0,
>>>>> meaning "re-execute syscall at once, no phase2 needed".
>>>>>
>>>>> This, in turn, is called from .S files, and when it returns there,
>>>>> execution loops back to syscall dispatch.
>>>>>
>>>>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>>>>> So -E2BIG should survive and be returned...
>>>>
>>>> So I was just about to send:
>>>>
>>>>  That makes sense, and given that exactly the same 32-bit binary
>>>>  runs fine on a different machine, there's presumably something up
>>>>  with my local setup.  The failing machine is a VMware guest, but
>>>>  maybe that's not the relevant interaction -- particularly if no-one
>>>>  else can repro.
>>>>
>>>> But then I noticed some odd audit entries in the main log:
>>>>
>>>> Aug 13 16:52:56 ubuntu kernel: [   20.687249] audit: type=1326
>>>> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
>>>> ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
>>>> compat=1 ip=0xf773cc90 code=0x0
>>>> Aug 13 16:52:56 ubuntu kernel: [   20.691157] audit: type=1326
>>>> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
>>>> ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
>>>> compat=1 ip=0xf773cc90 code=0x10000000
>>>> ...
>>>>
>>>> I didn't think I had any audit stuff turned on, and indeed:
>>>>   # auditctl -l
>>>>   No rules
>>>>
>>>> But as soon as I'd run that auditctl command, the 32-bit
>>>> seccomp_bpf binary started running fine!
>>>>
>>>> So now I'm confused, and I can no longer reproduce the
>>>> problem.  Which probably means this was a false alarm, in
>>>> which case, my apologies.
>>>
>>> You might have triggered TIF_AUDIT or whatever it's called, which
>>> causes a whole different path through the asm tangle, so you might
>>> really have a problem.
>>>
>>> Try auditctl -a task,never.  If that doesn't change anything, try
>>> rebooting the guest.
>>
>> Aha, that seems to re-instate the problem -- with that auditctl setup
>> I get the 32-bit seccomp failures on two different machines (one VM,
>> one bare).  So can anyone else repro?
>>
>> I guess the relevant steps are thus:
>>   - sudo auditctl -a task,never
>>   - cd tools/testing/selftests/seccomp
>>   - CFLAGS=-m32 make clean run_tests
> 
> That was it! I can reproduce this now on kvm (after adding the auditctl rule).

I suspect this change:

        .macro auditsys_entry_common
...
        movl %ebx,%esi                  /* 2nd arg: 1st syscall arg */
        movl %eax,%edi                  /* 1st arg: syscall number */
        call __audit_syscall_entry
-       movl RAX(%rsp),%eax     /* reload syscall number */
-       cmpq $(IA32_NR_syscalls-1),%rax
-       ja ia32_badsys
+       movl ORIG_RAX(%rsp),%eax        /* reload syscall number */
        movl %ebx,%edi                  /* reload 1st syscall arg */
        movl RCX(%rsp),%esi     /* reload 2nd syscall arg */
        movl RDX(%rsp),%edx     /* reload 3rd syscall arg */

We were reloading syscall# from pt_regs->ax.

After the patch, pt_regs->ax isn't equal to syscall# on entry,
instead it contains -ENOSYS. Therefore the change shown above
was made, to reload it from pt_regs->orig_ax.

Well. This still should work... in fact it is "more correct"
than it was before...

64-bit code has no call to __audit_syscall_entry, it uses
syscall_trace_enter_phase1/phase2 mechanism instead of
"only audit" shortcut. If the bug is here (though I don't see it),
it explains why 64-bit binary works.


Now, how do we reach this bit of code?

ia32_sysenter_target:
...
        testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
        jnz  sysenter_tracesys
...
sysenter_tracesys:
        testl   $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
        jz      sysenter_auditsys
...
sysenter_auditsys:
        auditsys_entry_common     <== OUR MACRO
        movl %ebp,%r9d                  /* reload 6th syscall arg */
        jmp sysenter_dispatch


ia32_cstar_target:
...
        testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
        jnz   cstar_tracesys
...
cstar_tracesys:
        testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
        jz cstar_auditsys
...
cstar_auditsys:
        movl %r9d,R9(%rsp)      /* register to be clobbered by call */
        auditsys_entry_common  <== OUR MACRO
        movl R9(%rsp),%r9d      /* reload 6th syscall arg */
        jmp cstar_dispatch


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 21:35           ` Denys Vlasenko
@ 2015-08-13 21:47             ` Andy Lutomirski
  2015-08-13 22:49               ` Linus Torvalds
  2015-08-13 22:27             ` Linus Torvalds
  1 sibling, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2015-08-13 21:47 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Kees Cook, David Drysdale, linux-kernel, Will Drewry,
	Ingo Molnar, Alok Kataria, Linus Torvalds, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 2:35 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 08/13/2015 08:47 PM, Kees Cook wrote:
>> On Thu, Aug 13, 2015 at 10:39 AM, David Drysdale <drysdale@google.com> wrote:
>>> On Thu, Aug 13, 2015 at 6:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <drysdale@google.com> wrote:
>>>>> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>>>>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>>>>>> else could reproduce it.
>>>>>>>
>>>>>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>>>>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>>>>>> filter, plus other oddities (selftest output below).
>>>>>>>
>>>>>>> The problem seems to need a combination of circumstances to occur:
>>>>>>>
>>>>>>>  - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>>>>>>    64-bit kernel -- I'm testing with seccomp_bpf from
>>>>>>>    tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>>>>>
>>>>>> Does it work correctly when built as 64-bit program?
>>>>>
>>>>> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>>>>>
>>>>>>>
>>>>>>>  - The kernel needs to be running as a VM guest -- it occurs inside my
>>>>>>>    VMware Fusion host, but not if I run on bare metal.  Kees tells me he
>>>>>>>    cannot repro with a kvm guest though.
>>>>>>>
>>>>>>> Bisecting indicates that the commit that induces the problem is
>>>>>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>>>>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>>>>>
>>>>>>> Apologies if I've just got something odd with my local setup, but the
>>>>>>> bisection was unequivocal enough that I thought it worth reporting...
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>> seccomp_bpf failure outputs:
>>>>>
>>>>> [snip]
>>>>>
>>>>>> End result should be:
>>>>>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>>>>>> pt_regs->orig_ax = -1 ("skip syscall")
>>>>>> and syscall_trace_enter_phase1() usually returns with 0,
>>>>>> meaning "re-execute syscall at once, no phase2 needed".
>>>>>>
>>>>>> This, in turn, is called from .S files, and when it returns there,
>>>>>> execution loops back to syscall dispatch.
>>>>>>
>>>>>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>>>>>> So -E2BIG should survive and be returned...
>>>>>
>>>>> So I was just about to send:
>>>>>
>>>>>  That makes sense, and given that exactly the same 32-bit binary
>>>>>  runs fine on a different machine, there's presumably something up
>>>>>  with my local setup.  The failing machine is a VMware guest, but
>>>>>  maybe that's not the relevant interaction -- particularly if no-one
>>>>>  else can repro.
>>>>>
>>>>> But then I noticed some odd audit entries in the main log:
>>>>>
>>>>> Aug 13 16:52:56 ubuntu kernel: [   20.687249] audit: type=1326
>>>>> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
>>>>> ses=4294967295 pid=2621 comm="secccomp_bpf.ke"
>>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
>>>>> compat=1 ip=0xf773cc90 code=0x0
>>>>> Aug 13 16:52:56 ubuntu kernel: [   20.691157] audit: type=1326
>>>>> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
>>>>> ses=4294967295 pid=2631 comm="secccomp_bpf.ke"
>>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
>>>>> compat=1 ip=0xf773cc90 code=0x10000000
>>>>> ...
>>>>>
>>>>> I didn't think I had any audit stuff turned on, and indeed:
>>>>>   # auditctl -l
>>>>>   No rules
>>>>>
>>>>> But as soon as I'd run that auditctl command, the 32-bit
>>>>> seccomp_bpf binary started running fine!
>>>>>
>>>>> So now I'm confused, and I can no longer reproduce the
>>>>> problem.  Which probably means this was a false alarm, in
>>>>> which case, my apologies.
>>>>
>>>> You might have triggered TIF_AUDIT or whatever it's called, which
>>>> causes a whole different path through the asm tangle, so you might
>>>> really have a problem.
>>>>
>>>> Try auditctl -a task,never.  If that doesn't change anything, try
>>>> rebooting the guest.
>>>
>>> Aha, that seems to re-instate the problem -- with that auditctl setup
>>> I get the 32-bit seccomp failures on two different machines (one VM,
>>> one bare).  So can anyone else repro?
>>>
>>> I guess the relevant steps are thus:
>>>   - sudo auditctl -a task,never
>>>   - cd tools/testing/selftests/seccomp
>>>   - CFLAGS=-m32 make clean run_tests
>>
>> That was it! I can reproduce this now on kvm (after adding the auditctl rule).
>
> I suspect this change:
>
>         .macro auditsys_entry_common
> ...
>         movl %ebx,%esi                  /* 2nd arg: 1st syscall arg */
>         movl %eax,%edi                  /* 1st arg: syscall number */
>         call __audit_syscall_entry
> -       movl RAX(%rsp),%eax     /* reload syscall number */
> -       cmpq $(IA32_NR_syscalls-1),%rax
> -       ja ia32_badsys
> +       movl ORIG_RAX(%rsp),%eax        /* reload syscall number */
>         movl %ebx,%edi                  /* reload 1st syscall arg */
>         movl RCX(%rsp),%esi     /* reload 2nd syscall arg */
>         movl RDX(%rsp),%edx     /* reload 3rd syscall arg */
>
> We were reloading syscall# from pt_regs->ax.

I am so glad that this code is gone in -tip.  Good riddance!

>
> After the patch, pt_regs->ax isn't equal to syscall# on entry,
> instead it contains -ENOSYS. Therefore the change shown above
> was made, to reload it from pt_regs->orig_ax.
>
> Well. This still should work... in fact it is "more correct"
> than it was before...
>
> 64-bit code has no call to __audit_syscall_entry, it uses
> syscall_trace_enter_phase1/phase2 mechanism instead of
> "only audit" shortcut. If the bug is here (though I don't see it),
> it explains why 64-bit binary works.
>
>
> Now, how do we reach this bit of code?
>
> ia32_sysenter_target:
> ...
>         testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz  sysenter_tracesys
> ...
> sysenter_tracesys:
>         testl   $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jz      sysenter_auditsys
> ...
> sysenter_auditsys:
>         auditsys_entry_common     <== OUR MACRO
>         movl %ebp,%r9d                  /* reload 6th syscall arg */
>         jmp sysenter_dispatch
>
>
> ia32_cstar_target:
> ...
>         testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz   cstar_tracesys
> ...
> cstar_tracesys:
>         testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT), ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jz cstar_auditsys
> ...
> cstar_auditsys:
>         movl %r9d,R9(%rsp)      /* register to be clobbered by call */
>         auditsys_entry_common  <== OUR MACRO
>         movl R9(%rsp),%r9d      /* reload 6th syscall arg */
>         jmp cstar_dispatch
>

TIF_SECCOMP had better be set, so that code should be unreachable.

syscall_trace_enter_phase1 returns 0 if we hit SECCOMP_RET_ERRNO (i.e.
SECCOMP_PHASE1_SKIP).  syscall_trace_enter sees that and returns
regs->orig_ax, which is -1.

It seems to me that the bug is that sysexit_from_sys_call isn't
reloading RAX from regs->ax.

--Andy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 21:47             ` Andy Lutomirski
@ 2015-08-13 22:49               ` Linus Torvalds
  2015-08-13 22:54                 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2015-08-13 22:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Kees Cook, David Drysdale, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

[-- Attachment #1: Type: text/plain, Size: 1787 bytes --]

On Thu, Aug 13, 2015 at 2:47 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> It seems to me that the bug is that sysexit_from_sys_call isn't
> reloading RAX from regs->ax.

Ugh. That code is confusing, and _most_ cases seem to have %rax already loaded.

There seems to be three cases:

 - fallthrough from cstar_dispatch after a successful call to the system call

   This has %rax as the correct return value (which also got saved to
RAX on stack)

 - the 'auditsys_exit' macro 'exit' case.

   This seems to have %rax reloaded inside the macro

 - the out-of-range system call case for cstar_dispatch

   This does *not* seem to load %rax with $ENOSYS, but keeps the bad
system call number in it.

so yeah, there seems to be a bug there, but if I read it right, that
bug seems to happen just for the out-of-range system call case, which
afaik isn't the case reported here.

I guess adding a re-load of %rax is ok, even though in the common
cases it is already loaded.

Oh, and sysexit_from_sys_call seems to have the exact same situation.
The "system call dispatch with %rax out of range" fallthrough case
doesn't set %rax to ENOSYS.

So I guess we could remove the reloading of system call return value
from auditsys_exit, and just do it unconditionally in the common path.

Which is sad, since the *really* common case already has the right
value, but whatever.

Does the attached patch make sense and work? Totally untested, just
looking at the code. But maybe it's right, because it's exactly that
ENOSYS case that the bad patch in question changed.

Btw, the old ENOSYS code also cleared ORIG_EAX. I'm not sure why, but
we used to have

 ia32_badsys:
        movq $0,ORIG_RAX(%rsp)
        movq $-ENOSYS,%rax
        jmp ia32_sysret

for that case.

                   Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1084 bytes --]

 arch/x86/entry/entry_64_compat.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 5a1844765a7a..a7e257d9cb90 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -140,6 +140,7 @@ sysexit_from_sys_call:
 	 */
 	andl	$~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
 	movl	RIP(%rsp), %ecx		/* User %eip */
+	movq    RAX(%rsp), %rax
 	RESTORE_RSI_RDI
 	xorl	%edx, %edx		/* Do not leak kernel information */
 	xorq	%r8, %r8
@@ -219,7 +220,6 @@ sysexit_from_sys_call:
 1:	setbe	%al			/* 1 if error, 0 if not */
 	movzbl	%al, %edi		/* zero-extend that into %edi */
 	call	__audit_syscall_exit
-	movq	RAX(%rsp), %rax		/* reload syscall return value */
 	movl	$(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), %edi
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
@@ -368,6 +368,7 @@ sysretl_from_sys_call:
 	RESTORE_RSI_RDI_RDX
 	movl	RIP(%rsp), %ecx
 	movl	EFLAGS(%rsp), %r11d
+	movq    RAX(%rsp), %rax
 	xorq	%r10, %r10
 	xorq	%r9, %r9
 	xorq	%r8, %r8

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 22:49               ` Linus Torvalds
@ 2015-08-13 22:54                 ` Linus Torvalds
  2015-08-13 22:56                   ` Kees Cook
  2015-08-13 22:58                   ` Andy Lutomirski
  0 siblings, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2015-08-13 22:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Kees Cook, David Drysdale, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Does the attached patch make sense and work?

Btw, I'm not all that happy with it anyway.

I still think Denys' patch also potentially changed what audit and
strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
is a good change.

But maybe that three-liner patch fixes the immediate problem that
David sees. David?

                  Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 22:54                 ` Linus Torvalds
@ 2015-08-13 22:56                   ` Kees Cook
  2015-08-13 22:59                     ` Andy Lutomirski
  2015-08-14  7:33                     ` David Drysdale
  2015-08-13 22:58                   ` Andy Lutomirski
  1 sibling, 2 replies; 22+ messages in thread
From: Kees Cook @ 2015-08-13 22:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Denys Vlasenko, David Drysdale, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Does the attached patch make sense and work?
>
> Btw, I'm not all that happy with it anyway.
>
> I still think Denys' patch also potentially changed what audit and
> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
> is a good change.
>
> But maybe that three-liner patch fixes the immediate problem that
> David sees. David?

Your patch fixes it for me. The seccomp compat selftests pass again
with audit enabled.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 22:56                   ` Kees Cook
@ 2015-08-13 22:59                     ` Andy Lutomirski
  2015-08-13 23:14                       ` Kees Cook
                                         ` (2 more replies)
  2015-08-14  7:33                     ` David Drysdale
  1 sibling, 3 replies; 22+ messages in thread
From: Andy Lutomirski @ 2015-08-13 22:59 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linus Torvalds, Denys Vlasenko, David Drysdale, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 3:56 PM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> Does the attached patch make sense and work?
>>
>> Btw, I'm not all that happy with it anyway.
>>
>> I still think Denys' patch also potentially changed what audit and
>> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
>> is a good change.
>>
>> But maybe that three-liner patch fixes the immediate problem that
>> David sees. David?
>
> Your patch fixes it for me. The seccomp compat selftests pass again
> with audit enabled.

Kees, would it be straightforward to rig up the seccomp tests to
automatically test compat?  The x86 selftests automatically test both
native and compat, and that might be usable as a model.  I did that
because it's extremely easy to regress one and not the other.

--Andy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 22:59                     ` Andy Lutomirski
@ 2015-08-13 23:14                       ` Kees Cook
  2015-08-13 23:30                       ` Linus Torvalds
  2015-08-14 11:58                       ` Denys Vlasenko
  2 siblings, 0 replies; 22+ messages in thread
From: Kees Cook @ 2015-08-13 23:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Denys Vlasenko, David Drysdale, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 3:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Aug 13, 2015 at 3:56 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>> Does the attached patch make sense and work?
>>>
>>> Btw, I'm not all that happy with it anyway.
>>>
>>> I still think Denys' patch also potentially changed what audit and
>>> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
>>> is a good change.
>>>
>>> But maybe that three-liner patch fixes the immediate problem that
>>> David sees. David?
>>
>> Your patch fixes it for me. The seccomp compat selftests pass again
>> with audit enabled.
>
> Kees, would it be straightforward to rig up the seccomp tests to
> automatically test compat?  The x86 selftests automatically test both
> native and compat, and that might be usable as a model.  I did that
> because it's extremely easy to regress one and not the other.

Yeah, I'll figure out how to get this working sanely. There are some
ugly behaviors on arm64 doing compat that seccomp found too, so I'll
need those targets for more than just x86.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 22:59                     ` Andy Lutomirski
  2015-08-13 23:14                       ` Kees Cook
@ 2015-08-13 23:30                       ` Linus Torvalds
  2015-08-14 11:58                       ` Denys Vlasenko
  2 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2015-08-13 23:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Denys Vlasenko, David Drysdale, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 3:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Kees, would it be straightforward to rig up the seccomp tests to
> automatically test compat?  The x86 selftests automatically test both
> native and compat, and that might be usable as a model.  I did that
> because it's extremely easy to regress one and not the other.

Note that in this case, the bug was actually _hidden_ by audit (since
the audit path would end up reloading %rax, and is why doing "auditctl
-a task,never" actually enabled people to see it), so it would also be
good to try to make sure that the tests would try both with and
without audit involved too.

I'm very tired of these bugs, but I guess and hope that your patches
to move as much as possible of this to C will actually end up helping
in the long run. So while I'm not really looking forward to even
_more_ patches to the low-level asm, at least the C rewrite seems more
worthwhile than some of the noise that made this all so painful has
felt.

                 Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 22:59                     ` Andy Lutomirski
  2015-08-13 23:14                       ` Kees Cook
  2015-08-13 23:30                       ` Linus Torvalds
@ 2015-08-14 11:58                       ` Denys Vlasenko
  2015-08-14 14:27                         ` Andy Lutomirski
  2 siblings, 1 reply; 22+ messages in thread
From: Denys Vlasenko @ 2015-08-14 11:58 UTC (permalink / raw)
  To: Andy Lutomirski, Kees Cook
  Cc: Linus Torvalds, David Drysdale, linux-kernel, Will Drewry,
	Ingo Molnar, Alok Kataria, Borislav Petkov, Alexei Starovoitov,
	Frederic Weisbecker, H. Peter Anvin, Oleg Nesterov,
	Steven Rostedt, X86 ML

On 08/14/2015 12:59 AM, Andy Lutomirski wrote:
> On Thu, Aug 13, 2015 at 3:56 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>> Does the attached patch make sense and work?
>>>
>>> Btw, I'm not all that happy with it anyway.
>>>
>>> I still think Denys' patch also potentially changed what audit and
>>> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
>>> is a good change.
>>>
>>> But maybe that three-liner patch fixes the immediate problem that
>>> David sees. David?
>>
>> Your patch fixes it for me. The seccomp compat selftests pass again
>> with audit enabled.
> 
> Kees, would it be straightforward to rig up the seccomp tests to
> automatically test compat?  The x86 selftests automatically test both
> native and compat, and that might be usable as a model.  I did that
> because it's extremely easy to regress one and not the other.

BTW, why 64-bt code doesn't need this RAX read-back?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-14 11:58                       ` Denys Vlasenko
@ 2015-08-14 14:27                         ` Andy Lutomirski
  0 siblings, 0 replies; 22+ messages in thread
From: Andy Lutomirski @ 2015-08-14 14:27 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Kees Cook, Linus Torvalds, David Drysdale, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Fri, Aug 14, 2015 at 4:58 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 08/14/2015 12:59 AM, Andy Lutomirski wrote:
>> On Thu, Aug 13, 2015 at 3:56 PM, Kees Cook <keescook@chromium.org> wrote:
>>> On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
>>>> <torvalds@linux-foundation.org> wrote:
>>>>>
>>>>> Does the attached patch make sense and work?
>>>>
>>>> Btw, I'm not all that happy with it anyway.
>>>>
>>>> I still think Denys' patch also potentially changed what audit and
>>>> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
>>>> is a good change.
>>>>
>>>> But maybe that three-liner patch fixes the immediate problem that
>>>> David sees. David?
>>>
>>> Your patch fixes it for me. The seccomp compat selftests pass again
>>> with audit enabled.
>>
>> Kees, would it be straightforward to rig up the seccomp tests to
>> automatically test compat?  The x86 selftests automatically test both
>> native and compat, and that might be usable as a model.  I did that
>> because it's extremely easy to regress one and not the other.
>
> BTW, why 64-bt code doesn't need this RAX read-back?
>

It's hiding inside of RESTORE_C_REGS_EXCEPT_RCX_R11.

--Andy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 22:56                   ` Kees Cook
  2015-08-13 22:59                     ` Andy Lutomirski
@ 2015-08-14  7:33                     ` David Drysdale
  1 sibling, 0 replies; 22+ messages in thread
From: David Drysdale @ 2015-08-14  7:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linus Torvalds, Andy Lutomirski, Denys Vlasenko, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 11:56 PM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> Does the attached patch make sense and work?
>>
>> Btw, I'm not all that happy with it anyway.
>>
>> I still think Denys' patch also potentially changed what audit and
>> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
>> is a good change.
>>
>> But maybe that three-liner patch fixes the immediate problem that
>> David sees. David?
>
> Your patch fixes it for me. The seccomp compat selftests pass again
> with audit enabled.
>
> -Kees

Yep, that patch fixes it for me too.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 22:54                 ` Linus Torvalds
  2015-08-13 22:56                   ` Kees Cook
@ 2015-08-13 22:58                   ` Andy Lutomirski
  2015-08-13 23:25                     ` Linus Torvalds
  1 sibling, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2015-08-13 22:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Denys Vlasenko, Kees Cook, David Drysdale, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 3:54 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Aug 13, 2015 at 3:49 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Does the attached patch make sense and work?
>
> Btw, I'm not all that happy with it anyway.
>
> I still think Denys' patch also potentially changed what audit and
> strace see for %rax in the pt_regs to -ENOSYS, which I'm not convinced
> is a good change.

For better for for worse, the native 64-bit path changed several
versions agi, and nothing broke that I'm aware of.  The change was:

commit 54eea9957f5763dd1a2555d7e4cb53b4dd389cc6
Author: Andy Lutomirski <luto@amacapital.net>
Date:   Fri Sep 5 15:13:55 2014 -0700

    x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscalls

AFAIK, ptrace has always seen ax == -ENOSYS on syscall entry for
native 64-bit syscalls.  My change just simplified the fast path
(which is invisible by ptrace for obvious reasons, unless someone
traces fork or something along those lines *without*) and made it less
different from the slow path.  (IIRC it also simplified some stuff
down the road.)

Looking at 3.19's ia32entry.S, it has:

sysenter_tracesys:
#ifdef CONFIG_AUDITSYSCALL
        testl   $(_TIF_WORK_SYSCALL_ENTRY &
~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
        jz      sysenter_auditsys
#endif
        SAVE_REST
        CLEAR_RREGS
        movq    $-ENOSYS,RAX(%rsp)/* ptrace can change this for a bad syscall */

So I think it's always been the intent and practice that ptracers
would see ax == -ENOSYS on syscall entry.

IOW, whether this is good or bad, I don't think it's really a change.

--Andy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 22:58                   ` Andy Lutomirski
@ 2015-08-13 23:25                     ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2015-08-13 23:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Kees Cook, David Drysdale, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 3:58 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> For better for for worse, the native 64-bit path changed several
> versions agi, and nothing broke that I'm aware of.

Ok, I'll take that as an ACK for the patch, and apply it, since it
seems to fix this regression.

               Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 21:35           ` Denys Vlasenko
  2015-08-13 21:47             ` Andy Lutomirski
@ 2015-08-13 22:27             ` Linus Torvalds
  2015-08-14 11:20               ` Denys Vlasenko
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2015-08-13 22:27 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Kees Cook, David Drysdale, Andy Lutomirski, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On Thu, Aug 13, 2015 at 2:35 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> I suspect this change:
>
>         .macro auditsys_entry_common
> ...
>         movl %ebx,%esi                  /* 2nd arg: 1st syscall arg */
>         movl %eax,%edi                  /* 1st arg: syscall number */
>         call __audit_syscall_entry
> -       movl RAX(%rsp),%eax     /* reload syscall number */
> -       cmpq $(IA32_NR_syscalls-1),%rax
> -       ja ia32_badsys
> +       movl ORIG_RAX(%rsp),%eax        /* reload syscall number */
>
> We were reloading syscall# from pt_regs->ax.
>
> After the patch, pt_regs->ax isn't equal to syscall# on entry,
> instead it contains -ENOSYS. Therefore the change shown above
> was made, to reload it from pt_regs->orig_ax.
>
> Well. This still should work... in fact it is "more correct"
> than it was before...

Well, since it doesn't work, that's clearly not the case.

Also, you do realize that ORIG_RAX can get changed by signal handling
and ptrace?

In fact, I think that whole "save -ENOSYS to pt_regs->ax" is BS.
Exactly because we use pt_regs->ax for ptrace etc, and you've changed
the register state we expose.

So, considering that we're in -rc6, and this has been bisected, I
would normally just revert the commit with extreme prejudice. However,
in this case it's unnecessarily painful, just because there's been a
ton of pointless churn in this area (cfi annotations, renaming, no end
of crap.

I'm also going to be a *lot* less inclined to take all these idiotic
low-level x86 changes from now on. It's been a total pain, for very
little gain. These "cleanups" have been buggy as hell, and test
coverage for the compat case is clearly lacking.

Denys, this was not the first "obvious cleanup" of yours that broke
things subtly, and where your reaction to the problem report was "that
cannot happen".

                        Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-13 22:27             ` Linus Torvalds
@ 2015-08-14 11:20               ` Denys Vlasenko
  2015-08-22 10:03                 ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: Denys Vlasenko @ 2015-08-14 11:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, David Drysdale, Andy Lutomirski, linux-kernel,
	Will Drewry, Ingo Molnar, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

On 08/14/2015 12:27 AM, Linus Torvalds wrote:
> On Thu, Aug 13, 2015 at 2:35 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>
>> I suspect this change:
>>
>>         .macro auditsys_entry_common
>> ...
>>         movl %ebx,%esi                  /* 2nd arg: 1st syscall arg */
>>         movl %eax,%edi                  /* 1st arg: syscall number */
>>         call __audit_syscall_entry
>> -       movl RAX(%rsp),%eax     /* reload syscall number */
>> -       cmpq $(IA32_NR_syscalls-1),%rax
>> -       ja ia32_badsys
>> +       movl ORIG_RAX(%rsp),%eax        /* reload syscall number */
>>
>> We were reloading syscall# from pt_regs->ax.
>>
>> After the patch, pt_regs->ax isn't equal to syscall# on entry,
>> instead it contains -ENOSYS. Therefore the change shown above
>> was made, to reload it from pt_regs->orig_ax.
>>
>> Well. This still should work... in fact it is "more correct"
>> than it was before...
> 
> Well, since it doesn't work, that's clearly not the case.
> 
> Also, you do realize that ORIG_RAX can get changed by signal handling
> and ptrace?

I am very aware of that, yes. If it changes, we should use *it*.
That's why new code in this part is "more correct" than old one.

> In fact, I think that whole "save -ENOSYS to pt_regs->ax" is BS.
> Exactly because we use pt_regs->ax for ptrace etc, and you've changed
> the register state we expose.

ptrace always sees pt_regs->ax = -ENOSYS on syscall entry.
That's part of the ABI. Syscall# is in pt_regs->orig_ax.

We used to do that _only_ on ptrace code path, the fast path
didn't store -ENOSYS in pt_regs->ax. This optimization ended up being
more pain than gain, and it was changed for 64-bit code by this commit:

commit 54eea9957f5763dd1a2555d7e4cb53b4dd389cc6
Author: Andy Lutomirski <luto@amacapital.net>
Date:   Fri Sep 5 15:13:55 2014 -0700

    x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscalls


I changed 32-bit compat code to do the same thing.


> I'm also going to be a *lot* less inclined to take all these idiotic
> low-level x86 changes from now on. It's been a total pain, for very
> little gain. These "cleanups" have been buggy as hell, and test
> coverage for the compat case is clearly lacking.

The code in question was an unholy mess, with about a half dozen
"clever optimizations" tangled together. Now it is much, much
less ugly.

It was nearly inevitable that something would break during untangling.

I understand the frustration of having things breaking
"because of the stupid cleanup".


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?
  2015-08-14 11:20               ` Denys Vlasenko
@ 2015-08-22 10:03                 ` Ingo Molnar
  0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2015-08-22 10:03 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Kees Cook, David Drysdale, Andy Lutomirski,
	linux-kernel, Will Drewry, Alok Kataria, Borislav Petkov,
	Alexei Starovoitov, Frederic Weisbecker, H. Peter Anvin,
	Oleg Nesterov, Steven Rostedt, X86 ML

* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> It was nearly inevitable that something would break during untangling.

1)

So the 'chronic lack of compat, audit/noaudit and Wine testing' was certainly 
avoidable.

The problem wasn't the fact that something was bound to break, but the latency of 
finding these bugs. If we cannot reduce the latency so that bugs are caught early 
enough (before they reach mainline) then we shouldn't be doing such changes.

We are slowly adding tests for that in the x86 self-tests, but IMHO we should be 
more proactive than that.

2)

Another structural problem I saw occasionally was the attempt to characterise away 
regressions.

That's a 100% no-no: if a change breaks any user-space program, it does not matter 
how 'correct' a change is, how weird the user-space dependence and how rare the 
user-space program: the regression needs to be fixed either by going forward with 
a fix or by going backwards via reverting the change.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-08-22 10:03 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-13  8:30 [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM? David Drysdale
2015-08-13 15:17 ` Denys Vlasenko
2015-08-13 16:28   ` David Drysdale
2015-08-13 17:15     ` Andy Lutomirski
2015-08-13 17:39       ` David Drysdale
2015-08-13 18:47         ` Kees Cook
2015-08-13 21:35           ` Denys Vlasenko
2015-08-13 21:47             ` Andy Lutomirski
2015-08-13 22:49               ` Linus Torvalds
2015-08-13 22:54                 ` Linus Torvalds
2015-08-13 22:56                   ` Kees Cook
2015-08-13 22:59                     ` Andy Lutomirski
2015-08-13 23:14                       ` Kees Cook
2015-08-13 23:30                       ` Linus Torvalds
2015-08-14 11:58                       ` Denys Vlasenko
2015-08-14 14:27                         ` Andy Lutomirski
2015-08-14  7:33                     ` David Drysdale
2015-08-13 22:58                   ` Andy Lutomirski
2015-08-13 23:25                     ` Linus Torvalds
2015-08-13 22:27             ` Linus Torvalds
2015-08-14 11:20               ` Denys Vlasenko
2015-08-22 10:03                 ` Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.