Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
To: Andrea Arcangeli <aarcange@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Yu Zhao <yuzhao@google.com>,
 Andy Lutomirski <luto@kernel.org>, Peter Xu <peterx@redhat.com>,
 Pavel Emelyanov <xemul@openvz.org>, Mike Kravetz <mike.kravetz@oracle.com>,
 Mike Rapoport <rppt@linux.vnet.ibm.com>, Minchan Kim <minchan@kernel.org>,
 Will Deacon <will@kernel.org>, Peter Zijlstra <peterz@infradead.org>,
 Linus Torvalds <torvalds@linux-foundation.org>,
 Hugh Dickins <hughd@google.com>, "Kirill A. Shutemov"
 <kirill@shutemov.name>, Matthew Wilcox <willy@infradead.org>,
 Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
 Kees Cook <keescook@chromium.org>, John Hubbard <jhubbard@nvidia.com>,
 Leon Romanovsky <leonro@nvidia.com>, Jason Gunthorpe <jgg@ziepe.ca>,
 Jan Kara <jack@suse.cz>, Kirill Tkhai <ktkhai@virtuozzo.com>,
 Nadav Amit <nadav.amit@gmail.com>, Jens Axboe <axboe@kernel.dk>
References: <20210110004435.26382-1-aarcange@redhat.com>
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat GmbH
Message-ID: <bb071419-bf40-c5ed-4b2d-d5eb03031b0a@redhat.com>
Date: Fri, 15 Jan 2021 09:59:23 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.5.0
MIME-Version: 1.0
In-Reply-To: <20210110004435.26382-1-aarcange@redhat.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
List-ID: <linux-mm.kvack.org>

On 10.01.21 01:44, Andrea Arcangeli wrote:
> Hello Andrew and everyone,
> 
> Once we agree that COW page reuse requires full accuracy, the next
> step is to re-apply 17839856fd588f4ab6b789f482ed3ffd7c403e1f and to
> return going in that direction.


After stumbling over the heated discussion around this, I wanted to
understand the details and the different opinions. I tried to summarize
in my simple words (bear with me) what happened and how I think we can
proceed from here. Maybe that helps.

====

What happened:

1) We simplified handling of faults on write-protected pages (page table
entries): we changed the logic that decides when we can reuse a page
("simply unprotecting it") and when we have to copy it instead (COW).
The essence of the simplification is that we only reuse a page if we
are its sole user, meaning page_count(page) == 1, and the page is mapped
into a single process (page_mapcount(page) == 1); otherwise we copy it.
Simple.
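
A minimal sketch of the rule in 1), written as a user-space model. The
struct, enum and function names are made up for illustration; this is
not the kernel code, only the decision as I described it above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the two counters discussed above.
 * This is NOT the kernel's struct page.
 */
struct model_page {
    int refcount;  /* models page_count(page) */
    int mapcount;  /* models page_mapcount(page) */
};

enum wp_fault_action { WP_REUSE, WP_COPY };

/*
 * Simplified rule: reuse ("simply unprotect") only when we are the
 * sole user of the page; in every other case, copy (COW).
 */
static enum wp_fault_action wp_fault_decide(const struct model_page *page)
{
    assert(page != NULL);

    if (page->refcount == 1 && page->mapcount == 1)
        return WP_REUSE;
    return WP_COPY;
}
```

Any additional reference, no matter who holds it, makes this model
return WP_COPY.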

2) The old code was complicated and there are GUP (e.g., RDMA, VFIO)
cases that were broken in various ways in the old code already: most
prominently fork(). As one example, it would have been possible for
memory protected via mprotect(READ) to still get modified by GUP users
like RDMA. Write protection (AFAIU via any mechanism) after GUP pinned a
page was not effective; the page was not copied.

3) Speculative pagecache references can temporarily bump up
page_count(page), resulting in false positives: we could see
page_count(page) > 1 although we're the only one that actually uses
the page. In the simplified code, we might then copy a page although
that is not necessary (I cannot tell how often that actually happens).
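
To illustrate the false positive in 3), here is a small user-space
model. All names are hypothetical and the "speculative reference" is
just an extra get/put on a plain counter; it only shows why a transient
reference flips the decision, not how the pagecache takes it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two counters; not the kernel's struct page. */
struct spec_page {
    int refcount;  /* models page_count(page) */
    int mapcount;  /* models page_mapcount(page) */
};

/* Models somebody taking a temporary (speculative) reference. */
static void spec_get(struct spec_page *p)
{
    assert(p != NULL);
    p->refcount++;
}

/* Models that temporary reference being dropped again. */
static void spec_put(struct spec_page *p)
{
    assert(p != NULL && p->refcount > 0);
    p->refcount--;
}

/* Simplified rule: copy unless we are the sole user of the page. */
static bool spec_must_copy(const struct spec_page *p)
{
    assert(p != NULL);
    return !(p->refcount == 1 && p->mapcount == 1);
}
```

If the write fault happens between spec_get() and spec_put(), the model
decides to copy although the process is the only real user of the page.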

4) clear_refs(4) ("measure approximately how much memory a process is
using"), uffd-wp (let's call it "lightweight write-protection, handling
the actual fault in user space"), and mprotect(READ) all write-protect
page table entries to generate faults on next write access. With the
simplified code, we will COW whenever we find the page_count(page) > 1.

The simplification seemed to regress the clear_refs and uffd-wp code
(AFAIU in case of uffd-wp, it results in memory corruption). But it
looks like we can mostly fix that by adding more extensive locking.

5) Mechanisms like GUP (AFAIU including Direct I/O) also take
references on pages, increasing page_count(). With the simplification,
we might now end up copying a page, although there is "somewhat" only a
single user/"process" involved.

One example is RDMA: if we read memory using RDMA and mprotect(READ)
such memory, we might end up copying the underlying page on the next
write: suddenly, RDMA is disconnected and will no longer read what is
getting written. Not to mention, we consume more memory. AFAIU, other
examples include direct I/O (e.g., write() with O_DIRECT).

AFAIU, a more extreme case is probably VFIO: A VM with VFIO (e.g.,
passthrough of a PCI device) can essentially be corrupted by "echo 4 >
/proc/[pid]/clear_refs".
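
The "disconnect" in 5) can be sketched with a toy user-space model. All
names here are invented for illustration, the mapcount is left out for
brevity, and the "pin" is just an extra reference plus a saved pointer;
it is not how GUP, RDMA or VFIO are implemented:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy page: a reference count plus a few bytes of content. */
struct toy_page {
    int refcount;
    char data[16];
};

/* Toy "page table entry": which page is mapped, and is it writable. */
struct toy_mapping {
    struct toy_page *page;
    bool writable;
};

/* Models a GUP-style pin: extra reference, caller keeps the pointer. */
static struct toy_page *toy_pin(struct toy_mapping *m)
{
    assert(m != NULL && m->page != NULL);
    m->page->refcount++;
    return m->page;
}

/* Models clear_refs / uffd-wp / mprotect(READ) write-protecting the entry. */
static void toy_write_protect(struct toy_mapping *m)
{
    assert(m != NULL);
    m->writable = false;
}

/* Write through the mapping; a write-protect fault applies the simplified rule. */
static int toy_write(struct toy_mapping *m, const char *s)
{
    assert(m != NULL && m->page != NULL && s != NULL);

    if (!m->writable) {
        if (m->page->refcount > 1) {
            /* Not the sole user: copy, and map the copy instead. */
            struct toy_page *copy = malloc(sizeof(*copy));

            if (copy == NULL)
                return -1;
            memcpy(copy->data, m->page->data, sizeof(copy->data));
            copy->refcount = 1;
            m->page->refcount--;
            m->page = copy;
        }
        m->writable = true;
    }
    strncpy(m->page->data, s, sizeof(m->page->data) - 1);
    m->page->data[sizeof(m->page->data) - 1] = '\0';
    return 0;
}
```

After pin, write-protect and write, the holder of the pinned pointer
still looks at the old page, while the process writes to the copy.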

6) While some people think it is okay to break GUP further, as it is
already broken in various other ways, other people think this is
changing something that used to work (AFAIU a user-visible change) with
little benefit.

7) There is no easy way to detect if a page really was pinned: we might
have false positives. Further, there is no way to distinguish if it was
pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
we most probably would need more counters, which we cannot fit into
struct page. (AFAIU, for huge pages it's easier).

However, AFAIU, even being able to detect if (and how) a page was pinned
would not completely help to solve the puzzle.

8) We have a vmsplice security issue that has to be fixed by touching
the code in question. A forked child process can read memory content of
its parent, which was modified by the parent after fork. AFAIU, the fix
will lock us further into the direction in which the code is heading.

9) The simplification is part of v5.10, which is an LTS release. AFAIU,
that one needs fixing, too.


I see the following possible directions we can head in:

A) Keep the simplification. Try fixing the fallout. Keep the GUP cases
broken or make mprotect() fail when detecting such a scenario;
AFAIU, both are user-visible changes.

B) Keep the simplification. Try fixing the fallout. Fix GUP cases that
used to work; AFAIU fixing this is the hard/impossible part, and is
undesired by some people.

C) Revert the simplification for now. Go back to the drawing board and
use what we learned to come up with a simplification that (all?)
people are happy with.

D) Revert the simplification: it turns out the code could not get
simplified to this extent. We learned a lot, though.

======

Please let me know in case I messed up anything and/or missed important
points.

-- 
Thanks,

David / dhildenb