From: Jann Horn
Date: Wed, 7 Oct 2020 10:28:21 +0200
Subject: Re: [PATCH v2 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
To: Johannes Berg
Cc: Andrew Morton, Linux-MM, Michel Lespinasse, Jason Gunthorpe,
	Richard Weinberger, Jeff Dike, linux-um@lists.infradead.org,
	kernel list, "Eric W. Biederman", Sakari Ailus, John Hubbard,
	Mauro Carvalho Chehab, Anton Ivanov
In-Reply-To: <115d17aa221b73a479e26ffee52899ddb18b1f53.camel@sipsolutions.net>
References: <20201006225450.751742-1-jannh@google.com>
	<20201006225450.751742-2-jannh@google.com>
	<115d17aa221b73a479e26ffee52899ddb18b1f53.camel@sipsolutions.net>

On Wed, Oct 7, 2020 at 9:42 AM Johannes Berg wrote:
> On Wed, 2020-10-07 at 00:54 +0200, Jann Horn wrote:
> > Until now, the mmap lock of the nascent mm was ordered inside the mmap
> > lock of the old mm (in dup_mmap() and in UML's activate_mm()).
> >
> > A following patch will change the exec path to very broadly lock the
> > nascent mm, but fine-grained locking should still work at the same time
> > for the old mm.
> >
> > In particular, mmap locking calls are hidden behind the copy_from_user()
> > calls and such that are reached through functions like copy_strings() -
> > when a page fault occurs on a userspace memory access, the mmap lock
> > will be taken.
> >
> > To do this in a way that lockdep is happy about, let's turn around the
> > lock ordering in both places that currently nest the locks. Since
> > SINGLE_DEPTH_NESTING is normally used for the inner nesting layer, make
> > up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that instead.
> >
> > The added locking calls in exec_mmap() are temporary; the following
> > patch will move the locking out of exec_mmap().
> >
> > Signed-off-by: Jann Horn
> > ---
> >  arch/um/include/asm/mmu_context.h |  3 +--
> >  fs/exec.c                         |  4 ++++
> >  include/linux/mmap_lock.h         | 23 +++++++++++++++++++++--
> >  kernel/fork.c                     |  7 ++-----
> >  4 files changed, 28 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
> > index 17ddd4edf875..c13bc5150607 100644
> > --- a/arch/um/include/asm/mmu_context.h
> > +++ b/arch/um/include/asm/mmu_context.h
> > @@ -48,9 +48,8 @@ static inline void activate_mm(struct mm_struct *old, struct mm_struct *new)
> >  	 * when the new ->mm is used for the first time.
> >  	 */
> >  	__switch_mm(&new->context.id);
> > -	mmap_write_lock_nested(new, SINGLE_DEPTH_NESTING);
> > +	mmap_assert_write_locked(new);
> >  	uml_setup_stubs(new);
> > -	mmap_write_unlock(new);
> >  }
>
> FWIW, this was I believe causing lockdep issues.
>
> I think I had previously determined that this was pointless, since it's
> still nascent and cannot be used yet?
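(For context on the reply below: a minimal sketch of what the
include/linux/mmap_lock.h side of the patch boils down to.
MMAP_LOCK_SUBCLASS_NASCENT is taken from the patch description above; the
helper name and the subclass value are illustrative guesses, not copied
from the actual hunk.)

#include <linux/mm_types.h>
#include <linux/rwsem.h>

/*
 * Sketch only: lockdep treats every mmap lock as the same lock class, so
 * holding two of them at once needs a subclass annotation. The nascent
 * mm is now the *outer* lock, which is why reusing SINGLE_DEPTH_NESTING
 * (conventionally the inner subclass) would be misleading; the value
 * below is a guess.
 */
#define MMAP_LOCK_SUBCLASS_NASCENT	1

/* Write-lock the mmap lock of a not-yet-visible ("nascent") mm. */
static inline void mmap_write_lock_nascent(struct mm_struct *mm)
{
	down_write_nested(&mm->mmap_lock, MMAP_LOCK_SUBCLASS_NASCENT);
}

With something like this, page faults on the old mm keep taking plain
mmap_read_lock() (subclass 0) as the inner lock while the nascent mm's
lock is held as the outer one, which is the ordering the description asks
lockdep to accept.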
Well.. the thing is that with patch 2/2, I'm not just protecting the mm
while it hasn't been installed yet, but also after it's been installed,
until setup_arg_pages() is done (which still uses a VMA pointer that we
obtained really early in the nascent phase).

With the recent rework Eric Biederman has done to clean up the locking
around execve, operations like process_vm_writev() and (currently only in
the MM tree, not mainline yet) process_madvise() can act on our new mm
remotely after setup_new_exec(), before we've reached setup_arg_pages().
While AFAIK all those operations *currently* only read the VMA tree, that
would change as soon as someone e.g. changes the list of allowed
operations for process_madvise() to include something like
MADV_MERGEABLE. In that case, we'd get a use-after-free if the madvise
code merges away our VMA while we still hold and use a dangling pointer
to it.

So in summary, I think the code currently is not (visibly) buggy - I
don't think you can make it do something bad - but it's extremely fragile
and probably only safe by chance. This patchset is partly my attempt to
make this a bit more future-proof before someone comes along and turns it
into an actual memory corruption bug with some innocuous little change.
(Because I've had a situation before where I thought "oh, this looks
really fragile and only works by chance, but eh, it's not really worth
changing that code", and then the next time I looked, it had turned into
a security bug that had already made its way into kernel releases people
were using.)

> But I didn't really know for sure, and this patch was never applied:
>
> https://patchwork.ozlabs.org/project/linux-um/patch/20200604133752.397dedea0758.I7a24aaa26794eb3fa432003c1bf55cbb816489e2@changeid/

Eeeh... with all the kernel debugging infrastructure *disabled*,
down_write_nested() is defined as:

# define down_write_nested(sem, subclass)	down_write(sem)

and then down_write() is:

void __sched down_write(struct rw_semaphore *sem)
{
	might_sleep();
	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
}

and that might_sleep() there is not just used for atomic sleep debugging,
but actually also creates an explicit preemption point (independent of
CONFIG_DEBUG_ATOMIC_SLEEP; here's the version with atomic sleep debugging
*disabled*):

# define might_sleep() do { might_resched(); } while (0)

where might_resched() is:

#ifdef CONFIG_PREEMPT_VOLUNTARY
extern int _cond_resched(void);
# define might_resched() _cond_resched()
#else
# define might_resched() do { } while (0)
#endif

_cond_resched() has a check for preempt_count before triggering the
scheduler, but on PREEMPT_VOLUNTARY without debugging, taking a spinlock
currently won't increment that, I think. And even if preempt_count were
active for PREEMPT_VOLUNTARY (which I think the x86 folks were
discussing?), you'd still hit a call into the RCU core, which probably
shouldn't be happening under a spinlock either.

Now, arch/um/ sets ARCH_NO_PREEMPT, so we can't actually be configured
with PREEMPT_VOLUNTARY, and this can't actually happen. But it feels like
we're on pretty thin ice here.

> I guess your patches will also fix the lockdep complaints in UML in this
> area, I hope I'll be able to test it soon.

That would be a nice side effect. :)
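(To make the window discussed above a bit more concrete: a condensed,
hand-written sketch of the ordering that patch 2/2 is meant to enforce.
The real flow is spread across bprm_execve(), begin_new_exec() and
load_elf_binary(), error handling is dropped, and
mmap_write_lock_nascent() is the assumed helper from the sketch further
up, not necessarily what the patch actually names it.)

#include <linux/binfmts.h>
#include <linux/mmap_lock.h>
#include <linux/sched/mm.h>
#include <asm/processor.h>

/* Condensed and illustrative only; see the caveats above. */
static int exec_locking_sketch(struct linux_binprm *bprm)
{
	bprm->mm = mm_alloc();			/* nascent mm, not installed yet */
	mmap_write_lock_nascent(bprm->mm);	/* outer lock, held across exec setup */

	/*
	 * copy_strings() and friends still run on the old mm at this point;
	 * page faults on userspace buffers take mmap_read_lock(current->mm)
	 * as the inner lock, hence the reversed nesting order.
	 */

	exec_mmap(bprm->mm);	/* fs/exec.c-internal; the new mm becomes
				 * current->mm, so process_vm_writev() and
				 * process_madvise() can reach it remotely
				 * from here on */
	setup_new_exec(bprm);

	/*
	 * setup_arg_pages() still dereferences the VMA pointer created back
	 * in the nascent phase, so the write lock has to be held until it
	 * is done.
	 */
	setup_arg_pages(bprm, STACK_TOP, EXSTACK_DEFAULT);

	mmap_write_unlock(bprm->mm);
	return 0;
}

The point of the sketch is just the bracket: once the new mm is reachable
by other processes, nothing between exec_mmap() and setup_arg_pages()
should run with it unlocked.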