From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 26 Jan 2018 21:54:51 +0300
From: "Kirill A. Shutemov"
To: Andy Lutomirski
Cc: Dave Hansen, Konstantin Khlebnikov, X86 ML, Borislav Petkov,
	Neil Berrington, LKML, stable, "Kirill A. Shutemov"
Subject: Re: [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on very-large-memory 4-level systems
Message-ID: <20180126185451.jdjgekr7awwhukml@node.shutemov.name>
References: <346541c56caed61abbe693d7d2742b4a380c5001.1516914529.git.luto@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: 
User-Agent: NeoMutt/20171215
Sender: stable-owner@vger.kernel.org
X-Mailing-List: stable@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jan 25, 2018 at 02:00:22PM -0800, Andy Lutomirski wrote:
> On Thu, Jan 25, 2018 at 1:49 PM, Dave Hansen wrote:
> > On 01/25/2018 01:12 PM, Andy Lutomirski wrote:
> >> Neil Berrington reported a double-fault on a VM with 768GB of RAM that
> >> uses large amounts of vmalloc space with PTI enabled.
> >>
> >> The cause is that load_new_mm_cr3() was never fixed to take the
> >> 5-level pgd folding code into account, so, on a 4-level kernel, the
> >> pgd synchronization logic compiles away to exactly nothing.
> >
> > You don't mention it, but we can normally handle vmalloc() faults in the
> > kernel that are due to unsynchronized page tables. The thing that kills
> > us here is that we have an unmapped stack and we try to use that stack
> > when entering the page fault handler, which double-faults. The double
> > fault handler gets a new stack and saves us enough to get an oops out.
> >
> > Right?
> 
> Exactly.
> 
> There are two special code paths that can't use vmalloc_fault(): this
> one and switch_to(). The latter avoids explicit page table fiddling
> and just touches the new stack before loading it into rsp.
> 
> >
> >> +static void sync_current_stack_to_mm(struct mm_struct *mm)
> >> +{
> >> +	unsigned long sp = current_stack_pointer;
> >> +	pgd_t *pgd = pgd_offset(mm, sp);
> >> +
> >> +	if (CONFIG_PGTABLE_LEVELS > 4) {
> >> +		if (unlikely(pgd_none(*pgd))) {
> >> +			pgd_t *pgd_ref = pgd_offset_k(sp);
> >> +
> >> +			set_pgd(pgd, *pgd_ref);
> >> +		}
> >> +	} else {
> >> +		/*
> >> +		 * "pgd" is faked.  The top level entries are "p4d"s, so sync
> >> +		 * the p4d.  This compiles to approximately the same code as
> >> +		 * the 5-level case.
> >> +		 */
> >> +		p4d_t *p4d = p4d_offset(pgd, sp);
> >> +
> >> +		if (unlikely(p4d_none(*p4d))) {
> >> +			pgd_t *pgd_ref = pgd_offset_k(sp);
> >> +			p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
> >> +
> >> +			set_p4d(p4d, *p4d_ref);
> >> +		}
> >> +	}
> >> +}
> >
> > We keep having to add these.  It seems like a real deficiency in the
> > mechanism that we're using for pgd folding.  Can't we get a warning or
> > something when we try to do a set_pgd() that's (silently) not doing
> > anything?  This exact same pattern bit me more than once with the
> > KPTI/KAISER patches.
> 
> Hmm, maybe.
> 
> What I'd really like to see is an entirely different API.  Maybe:
> 
> typedef struct {
>     opaque, but probably includes:
>     int depth;  /* 0 is root */
>     void *table;
> } ptbl_ptr;
> 
> ptbl_ptr root_table = mm_root_ptbl(mm);
> 
> set_ptbl_entry(root_table, pa, prot);
> 
> /* walk tables */
> ptbl_ptr pt = ...;
> ptentry_ptr entry;
> while (ptbl_has_children(pt)) {
>     pt = pt_next(pt, addr);
> }
> entry = pt_entry_at(pt, addr);
> /* do something with entry */
> 
> etc.

I thought about a very similar design, but never got time to really try
it. It's not a one-weekend type of project :/

-- 
 Kirill A. Shutemov