From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933991AbbIVSAh (ORCPT ); Tue, 22 Sep 2015 14:00:37 -0400
Received: from mail-oi0-f49.google.com ([209.85.218.49]:36113 "EHLO
	mail-oi0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S933974AbbIVSAd (ORCPT ); Tue, 22 Sep 2015 14:00:33 -0400
MIME-Version: 1.0
In-Reply-To:
References: <1442903021-3893-1-git-send-email-mingo@kernel.org>
	<1442903021-3893-6-git-send-email-mingo@kernel.org>
From: Andy Lutomirski
Date: Tue, 22 Sep 2015 11:00:12 -0700
Message-ID:
Subject: Re: [PATCH 05/11] mm: Introduce arch_pgd_init_late()
To: Linus Torvalds
Cc: Ingo Molnar , Linux Kernel Mailing List , linux-mm ,
	Andrew Morton , Denys Vlasenko , Brian Gerst , Peter Zijlstra ,
	Borislav Petkov , "H. Peter Anvin" , Oleg Nesterov , Waiman Long ,
	Thomas Gleixner
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Sep 22, 2015 at 10:55 AM, Linus Torvalds wrote:
> On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar wrote:
>> Add a late PGD init callback to places that allocate a new MM
>> with a new PGD: copy_process() and exec().
>>
>> The purpose of this callback is to allow architectures to implement
>> lockless initialization of task PGDs, to remove the scalability
>> limit of pgd_list/pgd_lock.
>
> Do we really need this?
>
> Can't we just initialize the pgd when we allocate it, knowing that
> it's not in sync, but just depend on the vmalloc fault to add in any
> kernel entries that we might have missed?

I really really hate the vmalloc fault thing.  It seems to work,
rather to my surprise.  It doesn't *deserve* to work, because of
things like the percpu TSS accesses in the entry code that happen
without a valid stack.
For all I know, there's a long history of this hitting on monster
non-SMAP systems that are all buggy and rootable but no one notices
because it's rare.  On SMAP with non-malicious userspace, it's an
instant double fault.  With malicious userspace, it's rootable
regardless of SMAP, but it's much harder with SMAP.

If we start every mm with a fully zeroed pgd (which is what I think
you're suggesting), then this starts affecting small systems in
addition to monster systems.

I'd really rather go in the other direction and completely eliminate
vmalloc faults.  We could do that by eagerly initializing all pgds, or
we could do it by tracking, per-pgd, how up-to-date it is and fixing
it up in switch_mm.  The latter is a bit nasty on SMP.

--Andy
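
To make the "track, per-pgd, how up-to-date it is" option concrete,
here is a rough user-space model in C.  It is only a sketch under
loud assumptions: struct ref_pgd, struct toy_pgd, ref_set_entry() and
toy_switch_mm() are all hypothetical names invented for illustration,
not kernel APIs, and the model is single-threaded, so it sidesteps
exactly the SMP ordering problems that make the real thing nasty.

```c
#include <assert.h>
#include <string.h>

#define KERNEL_PGD_ENTRIES 4	/* toy size, hypothetical */

/* Hypothetical reference table standing in for the master kernel pgd. */
struct ref_pgd {
	unsigned long entries[KERNEL_PGD_ENTRIES];
	unsigned long gen;	/* bumped whenever a kernel entry changes */
};

/* Hypothetical per-mm pgd plus the generation it was last synced at. */
struct toy_pgd {
	unsigned long entries[KERNEL_PGD_ENTRIES];
	unsigned long synced_gen;
};

static struct ref_pgd ref;	/* zero-initialized: gen starts at 0 */

/* Model of a vmalloc-style update to the kernel half of the address space. */
static void ref_set_entry(int idx, unsigned long val)
{
	assert(idx >= 0 && idx < KERNEL_PGD_ENTRIES);
	ref.entries[idx] = val;
	ref.gen++;
}

/*
 * Model of the fixup at context-switch time: if the incoming pgd is
 * stale, copy the kernel entries before it is ever used, so no fault
 * is needed later.  Returns 1 if a copy happened, 0 otherwise.
 */
static int toy_switch_mm(struct toy_pgd *next)
{
	if (next->synced_gen == ref.gen)
		return 0;
	memcpy(next->entries, ref.entries, sizeof(ref.entries));
	next->synced_gen = ref.gen;
	return 1;
}
```

In this model a pgd that was created before ref_set_entry() ran gets
the new entry on its next toy_switch_mm(), and later switches cost one
comparison.  A real version would also have to deal with a pgd that is
already live on another CPU when the generation changes.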