From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kselftest-owner@vger.kernel.org>
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: best guess record for domain of linux-kselftest-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kselftest-owner@vger.kernel.org
Authentication-Results: mx.google.com;
       spf=pass (google.com: best guess record for domain of linux-kselftest-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kselftest-owner@vger.kernel.org
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752422AbeA1TV6 convert rfc822-to-8bit (ORCPT
        <rfc822;gregkh@linuxfoundation.org>); Sun, 28 Jan 2018 14:21:58 -0500
Received: from mail.kernel.org ([198.145.29.99]:47124 "EHLO mail.kernel.org"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1752459AbeA1TV5 (ORCPT <rfc822;linux-kselftest@vger.kernel.org>);
        Sun, 28 Jan 2018 14:21:57 -0500
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C05A4217A7
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.org
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=luto@kernel.org
MIME-Version: 1.0
In-Reply-To: <CALCETrWq04AngUHkSv-bqDYXDSVH8N0bhXS_Nh63SpMmndexSA@mail.gmail.com>
References: <20180126153631.ha7yc33fj5uhitjo@xps> <CALCETrXVFC=6wPNpL0Dc7pkSv4JSaoEZZcu3Rtw0sg7KJsE5Pg@mail.gmail.com>
 <CALCETrUi7Ub2TbFy3Cvj+j4VXZeYULPY+mgL7OX7bz9L8GO9ew@mail.gmail.com>
 <CALCETrWvkd68wPCxGwmzqgsLTr_59+=L9u8obwt+f+oUQwDY=w@mail.gmail.com>
 <CALCETrW5=K5wJqLRZACmJJ+LF8Nb+Yf3OJ1tTNwqiOCxHu3w1Q@mail.gmail.com> <CALCETrWq04AngUHkSv-bqDYXDSVH8N0bhXS_Nh63SpMmndexSA@mail.gmail.com>
From: Andy Lutomirski <luto@kernel.org>
Date: Sun, 28 Jan 2018 11:21:35 -0800
X-Gmail-Original-Message-ID: <CALCETrUcJ2dVkSXGhtRvQeYQZZhPca95VhjQB8j6dFN6hzceSQ@mail.gmail.com>
Message-ID: <CALCETrUcJ2dVkSXGhtRvQeYQZZhPca95VhjQB8j6dFN6hzceSQ@mail.gmail.com>
Subject: Re: selftests/x86/fsgsbase_64 test problem
To: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>, "H. Peter Anvin" <hpa@zytor.com>,
        Dan Rue <dan.rue@linaro.org>, Shuah Khan <shuah@kernel.org>,
        Ingo Molnar <mingo@kernel.org>,
        Dmitry Safonov <dsafonov@virtuozzo.com>,
        "open list:KERNEL SELFTEST FRAMEWORK"
        <linux-kselftest@vger.kernel.org>,
        LKML <linux-kernel@vger.kernel.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8BIT
Sender: linux-kselftest-owner@vger.kernel.org
X-Mailing-List: linux-kselftest@vger.kernel.org
X-getmail-retrieved-from-mailbox: INBOX
X-GMAIL-THRID: =?utf-8?q?1590669876861238825?=
X-GMAIL-MSGID: =?utf-8?q?1590865250456508973?=
X-Mailing-List: linux-kernel@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>

On Fri, Jan 26, 2018 at 2:42 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Jan 26, 2018 at 2:38 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote:
>>>>>>
>>>>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>>>>> following error:
>>>>>>
>>>>>>         [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>>>>>                 Before schedule, set selector to 0x1
>>>>>>                 other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>>>>>         [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>>>>
>>>>>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
>>>>>>
>>>>>>     for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>>>>
>>>>>> This problem isn't new - I've reproduced it on latest mainline and every
>>>>>> release going back to v4.12 (I did not try earlier). This was tested on
>>>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an
>>>>>> i3-5010U.
>>>>>>
>>>>>
>>>>> Hmm, I can reproduce it, too.  I'll look in a bit.
>>>>
>>>> I'm triggering a different error, and I think what's going on is that
>>>> the kernel doesn't currently re-save GSBASE when a task switches out
>>>> and that task has saved gsbase != 0 and in-register GS == 0.  This is
>>>> arguably a bug, but it's not an infoleak, and fixing it could be a wee
>>>> bit expensive.  I'm not sure what, if anything, to do about this.  I
>>>> suppose I could add some gross perf hackery to the test to detect this
>>>> case and suppress the error.
>>>>
>>>> I can also trigger the problem you're seeing, and I don't know what's
>>>> up.  It may be related to an old problem I've seen that causes signal
>>>> delivery to sometimes corrupt %gs.  It's deterministic, but it depends
>>>> in some odd way on register state.  I can currently reproduce that
>>>> issue 100% of the time, and I'm trying to see if I can figure out
>>>> what's happening.
>>>
>>> I think it's a CPU bug, and I'm a bit mystified.  I can trigger the
>>> following, plausibly related issue:
>>>
>>> Write a program that writes %gs = 1.
>>> Run that program under gdb
>>> break at a point where %gs == 1
>>> display/x $gs
>>> si
>>>
>>> Under QEMU TCG, gs stays equal to 1.  On native or KVM, on Skylake, it
>>> changes to 0.
>>>
>>> On KVM or native, I do not observe do_debug getting called with %gs ==
>>> 1.  On TCG, I do.  I don't think that's precisely the problem that's
>>> causing the test to fail, since the test doesn't use TF or ptrace, but
>>> I wouldn't be shocked if it's related.
>>>
>>> hpa, any insight?
>>>
>>> (NB: if you want to play with this as I've described it, you may need
>>> to make invalid_selector() in ptrace.c always return false.  The
>>> current implementation is too strict and causes problems.)
>>
>> Much simpler test.  Run the attached program (gs1).  It more or less
>> just sets %gs to 1 and spins until it stops being 1.  Do it on a
>> kernel with the attached patch applied.  I see stuff like this:
>>
>> # ./gs1
>> PID = 129
>> [   15.703015] pid 129 saved gs = 1
>> [   15.703517] pid 129 loaded gs = 1
>> [   15.703973] pid 129 prepare_exit_to_usermode: gs = 1
>> ax = 0, cx = 0, dx = 0
>>
>> So we're interrupting the program, switching out, switching back in,
>> setting %gs to 1, observing that %gs is *still* 1 in
>> prepare_exit_to_usermode(), returning to usermode, and observing %gs
>> == 0.
>>
>> Presumably what's happening is that the IRET microcode matches the
>> SDM's pseudocode, which says:
>>
>> RETURN-TO-OUTER-PRIVILEGE-LEVEL:
>> ...
>> FOR each SegReg in (ES, FS, GS, and DS)
>>   DO
>>     tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
>>     IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code
>>     THEN (* Segment register invalid *)
>>       SegReg ← NULL;
>>     FI;
>>   OD;
>>
>> But this is very odd.  The actual permission checks (in the docs for MOV) are:
>>
>> IF DS, ES, FS, or GS is loaded with non-NULL selector
>> THEN
>>   IF segment selector index is outside descriptor table limits
>>   or segment is not a data or readable code segment
>>   or ((segment is a data or nonconforming code segment)
>>   or ((RPL > DPL) and (CPL > DPL))
>>     THEN #GP(selector); FI;
>>
>> ^^^^
>> This makes no sense.  This says that the data segments cannot be
>> loaded with MOV.  Empirically, it seems like MOV works if CPL <= DPL
>> and RPL <= DPL, but I haven't checked that hard.
>
> Surely Intel meant:
>
> ... or ((segment is a data segment or nonconforming code segment) and
> ((RPL > DPL) or (CPL > DPL))
>
> This would be consistent with the AMD APM #GP condition of "The DS,
> ES, FS, or GS register was loaded and the segment pointed to was a
> data or non-conforming code segment, but the RPL or CPL was greater
> than the DPL."
>
>>
>>   IF segment not marked present
>>     THEN #NP(selector);
>>   ELSE
>>     SegmentRegister ← segment selector;
>>     SegmentRegister ← segment descriptor; FI;
>>   FI;
>>
>>   IF DS, ES, FS, or GS is loaded with NULL selector
>>   THEN
>>     SegmentRegister ← segment selector;
>>     SegmentRegister ← segment descriptor;
>>     ^^^^
>>     wtf?  There is no "segment descriptor".  Presumably what actually
>> gets written to segment.DPL is nonsense.
>>   FI;
>
> I think the bug is here.  I think that, when writing a NULL selector
> to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas
> they should set DPL to 3.
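
To make that hypothesis concrete, here is a toy model in C.  It is only
a sketch of the reasoning in this thread, not something taken from the
SDM or the kernel: the helper names are made up, and it assumes that the
type check in the IRET pseudocode passes for a NULL selector.

```c
/* Toy model of the hypothesis; all names here are hypothetical. */
#include <stdint.h>

#define USER_CPL 3

/* RPL is the low two bits of a selector. */
static inline unsigned sel_rpl(uint16_t sel)
{
	return sel & 3;
}

/* Hypothesis: loading a NULL selector (0..3) caches DPL == RPL. */
static inline unsigned null_dpl_hypothesis(uint16_t sel)
{
	return sel_rpl(sel);
}

/* What I would have expected instead: cached DPL is always 3. */
static inline unsigned null_dpl_expected(uint16_t sel)
{
	(void)sel;
	return 3;
}

/*
 * IRET to an outer privilege level, per the quoted pseudocode:
 * a data segment register whose cached DPL < CPL is set to NULL.
 * Assumption: the "data or non-conforming code" test is true here.
 */
static inline uint16_t gs_after_iret(uint16_t sel, unsigned cached_dpl,
				     unsigned cpl)
{
	return cached_dpl < cpl ? 0 : sel;
}
```

Under this model, returning to CPL 3 turns %gs == 1 or 2 into 0 but
leaves %gs == 3 alone, whereas with DPL forced to 3 the value 1 would
survive the IRET.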

As an experiment, I did this:

 DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
+       [0] = { .dpl = 3, },
+

This had no apparent effect.  I was hoping that loading a NULL selector
into a segment register would copy DPL from gdt[0], but it seems like
it doesn't.
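
For what it's worth, the two readings of the MOV privilege check quoted
above can be put side by side in C.  This is a sketch of the privilege
clause only (the descriptor-table limit and "readable" checks are left
out), and the struct and function names are invented for illustration.

```c
/* Sketch of the privilege clause only; names are hypothetical. */
#include <stdbool.h>

struct seg_load {
	bool data_or_nonconforming;	/* data or non-conforming code seg */
	unsigned rpl;			/* RPL of the selector being loaded */
	unsigned cpl;			/* current privilege level */
	unsigned dpl;			/* DPL of the target descriptor */
};

/* As printed in the SDM text quoted above: A or (B and C). */
static inline bool gp_as_printed(struct seg_load s)
{
	return s.data_or_nonconforming ||
	       (s.rpl > s.dpl && s.cpl > s.dpl);
}

/* Proposed reading, matching the AMD APM wording: A and (B or C). */
static inline bool gp_as_proposed(struct seg_load s)
{
	return s.data_or_nonconforming &&
	       (s.rpl > s.dpl || s.cpl > s.dpl);
}
```

For an ordinary user data segment (RPL = CPL = DPL = 3),
gp_as_printed() returns true, i.e. the load would always #GP, while
gp_as_proposed() returns false.  The proposed reading only faults when
RPL or CPL exceeds DPL, which agrees with what I see empirically.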