From: Andy Lutomirski <luto@kernel.org>
To: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>,
	Konstantin Khlebnikov <khlebnikov@yandex-team.ru>,
	X86 ML <x86@kernel.org>, Borislav Petkov <bp@alien8.de>,
	Neil Berrington <neil.berrington@datacore.com>,
	LKML <linux-kernel@vger.kernel.org>,
	stable <stable@vger.kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
Date: Thu, 25 Jan 2018 14:00:22 -0800
Message-ID: <CALCETrVwJCL=QTRT70b8u3p8xOXUiC7_Mkz45Bi3M9-vYgXWtg@mail.gmail.com>
In-Reply-To: <bd7c53c9-cec6-2db2-6ee6-5cc03ca6dd39@intel.com>

On Thu, Jan 25, 2018 at 1:49 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 01/25/2018 01:12 PM, Andy Lutomirski wrote:
>> Neil Berrington reported a double-fault on a VM with 768GB of RAM that
>> uses large amounts of vmalloc space with PTI enabled.
>>
>> The cause is that load_new_mm_cr3() was never fixed to take the
>> 5-level pgd folding code into account, so, on a 4-level kernel, the
>> pgd synchronization logic compiles away to exactly nothing.
>
> You don't mention it, but we can normally handle vmalloc() faults in the
> kernel that are due to unsynchronized page tables.  The thing that kills
> us here is that we have an unmapped stack and we try to use that stack
> when entering the page fault handler, which double faults.  The double
> fault handler gets a new stack and saves us enough to get an oops out.
>
> Right?

Exactly.

There are two special code paths that can't use vmalloc_fault(): this
one and switch_to().  The latter avoids explicit page table fiddling
and just touches the new stack before loading it into rsp.

>
>> +static void sync_current_stack_to_mm(struct mm_struct *mm)
>> +{
>> +     unsigned long sp = current_stack_pointer;
>> +     pgd_t *pgd = pgd_offset(mm, sp);
>> +
>> +     if (CONFIG_PGTABLE_LEVELS > 4) {
>> +             if (unlikely(pgd_none(*pgd))) {
>> +                     pgd_t *pgd_ref = pgd_offset_k(sp);
>> +
>> +                     set_pgd(pgd, *pgd_ref);
>> +             }
>> +     } else {
>> +             /*
>> +              * "pgd" is faked.  The top level entries are "p4d"s, so sync
>> +              * the p4d.  This compiles to approximately the same code as
>> +              * the 5-level case.
>> +              */
>> +             p4d_t *p4d = p4d_offset(pgd, sp);
>> +
>> +             if (unlikely(p4d_none(*p4d))) {
>> +                     pgd_t *pgd_ref = pgd_offset_k(sp);
>> +                     p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
>> +
>> +                     set_p4d(p4d, *p4d_ref);
>> +             }
>> +     }
>> +}
>
> We keep having to add these.  It seems like a real deficiency in the
> mechanism that we're using for pgd folding.  Can't we get a warning or
> something when we try to do a set_pgd() that's (silently) not doing
> anything?  This exact same pattern bit me more than once with the
> KPTI/KAISER patches.

Hmm, maybe.
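
To make the failure mode concrete, here is a minimal userspace sketch of why a pgd-level check can vanish when the top level is folded. It is a model, not kernel code: the types and helpers only imitate the generic p4d-folding header, where (as this sketch assumes) pgd_none() is a constant 0 and p4d_offset() simply reinterprets the pgd slot. The names sync_via_pgd() and sync_via_p4d() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of a folded top level; assumed shapes, not kernel headers. */
typedef struct { unsigned long val; } pgd_t;
typedef struct { pgd_t pgd; } p4d_t;

/* Folded level: the pgd never reports "none", so callers' checks fold away. */
static inline int pgd_none(pgd_t pgd) { (void)pgd; return 0; }

/* The real top-level entries are checked at the p4d level. */
static inline int p4d_none(p4d_t p4d) { return p4d.pgd.val == 0; }

/* With folding, the p4d "table" is the pgd slot itself. */
static inline p4d_t *p4d_offset(pgd_t *pgd) { return (p4d_t *)pgd; }

/* Hypothetical: sync written against the pgd level. Returns 1 if it copied. */
static int sync_via_pgd(pgd_t *dst, pgd_t *ref)
{
	assert(dst != NULL && ref != NULL);
	if (pgd_none(*dst)) {	/* constant false: this branch is dead code */
		*dst = *ref;
		return 1;
	}
	return 0;
}

/* Hypothetical: sync written against the p4d level. Returns 1 if it copied. */
static int sync_via_p4d(pgd_t *dst, pgd_t *ref)
{
	p4d_t *p4d = p4d_offset(dst);

	assert(dst != NULL && ref != NULL);
	if (p4d_none(*p4d)) {
		*p4d = *p4d_offset(ref);
		return 1;
	}
	return 0;
}
```

In this model, calling sync_via_pgd() on an empty entry leaves it empty and reports nothing, while sync_via_p4d() copies the reference entry. Nothing in the first function looks wrong at the call site, which is the hazard being described.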

What I'd really like to see is an entirely different API.  Maybe:

typedef struct {
  /* opaque, but probably includes: */
  int depth;  /* 0 is root */
  void *table;
} ptbl_ptr;

ptbl_ptr root_table = mm_root_ptbl(mm);

set_ptbl_entry(root_table, pa, prot);

/* walk tables */
ptbl_ptr pt = ...;
ptentry_ptr entry;
while (ptbl_has_children(pt)) {
  pt = pt_next(pt, addr);
}
entry = pt_entry_at(pt, addr);
/* do something with entry */

etc.

Now someone can add a sixth level without changing every code path in
the kernel that touches page tables.
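
As a rough illustration of the idea above, the following userspace sketch implements a depth-tagged table pointer whose walk loop does not mention any level by name. It is only a toy under stated assumptions: mm_root_ptbl(), ptbl_has_children(), pt_next() and pt_entry_at() take their names from the pseudocode, but their signatures, struct ptbl, struct ptbl_tree, pt_index(), walk(), the 9-bit index width and the allocate-on-descent behaviour of pt_next() are all invented here for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PT_BITS    9			/* index bits per level (assumed) */
#define PT_ENTRIES (1u << PT_BITS)
#define PAGE_SHIFT 12

/* One table: child pointers at interior levels, plain values at the leaf. */
struct ptbl { uintptr_t slot[PT_ENTRIES]; };

/* Hypothetical stand-in for an mm: a root table plus the number of levels. */
struct ptbl_tree { struct ptbl *root; int levels; };

typedef struct {
	const struct ptbl_tree *tree;
	int depth;			/* 0 is root */
	struct ptbl *table;
} ptbl_ptr;

static ptbl_ptr mm_root_ptbl(const struct ptbl_tree *tree)
{
	ptbl_ptr pt = { tree, 0, tree->root };
	return pt;
}

static int ptbl_has_children(ptbl_ptr pt)
{
	return pt.depth < pt.tree->levels - 1;
}

/* Index into the current table, derived from depth rather than a level name. */
static unsigned pt_index(ptbl_ptr pt, uint64_t addr)
{
	int shift = PAGE_SHIFT + PT_BITS * (pt.tree->levels - 1 - pt.depth);
	return (unsigned)((addr >> shift) & (PT_ENTRIES - 1));
}

/* Descend one level; this toy allocates a missing child table on the way. */
static ptbl_ptr pt_next(ptbl_ptr pt, uint64_t addr)
{
	uintptr_t *slot = &pt.table->slot[pt_index(pt, addr)];

	assert(ptbl_has_children(pt));
	if (*slot == 0) {
		*slot = (uintptr_t)calloc(1, sizeof(struct ptbl));
		assert(*slot != 0);
	}
	pt.table = (struct ptbl *)*slot;
	pt.depth++;
	return pt;
}

static uintptr_t *pt_entry_at(ptbl_ptr pt, uint64_t addr)
{
	return &pt.table->slot[pt_index(pt, addr)];
}

/* The same loop serves 4, 5 or 6 levels. */
static uintptr_t *walk(const struct ptbl_tree *tree, uint64_t addr)
{
	ptbl_ptr pt = mm_root_ptbl(tree);

	while (ptbl_has_children(pt))
		pt = pt_next(pt, addr);
	return pt_entry_at(pt, addr);
}
```

Changing the levels field from 4 to 5 is the only difference between a 4-level and a 5-level tree in this toy; walk() and its helpers stay the same. A real design would also need locking, huge-page leaves at interior levels and per-level entry formats, none of which are modelled here.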

--Andy
