Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
From: David Hildenbrand
Organization: Red Hat GmbH
To: John Hubbard, Jason Gunthorpe
Cc: Andrea Arcangeli, Andrew Morton, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Yu Zhao, Andy Lutomirski, Peter Xu,
 Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon,
 Peter Zijlstra, Linus Torvalds, Hugh Dickins, "Kirill A. Shutemov",
 Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, Leon Romanovsky,
 Jan Kara, Kirill Tkhai, Nadav Amit, Jens Axboe
References: <20210110004435.26382-1-aarcange@redhat.com>
 <20210115183721.GG4605@ziepe.ca>
Date: Sat, 16 Jan 2021 12:42:10 +0100

On 16.01.21 04:40, John Hubbard wrote:
> On 1/15/21 11:46 AM, David Hildenbrand wrote:
>>>> 7) There is no easy way to detect if a page really was pinned: we might
>>>> have false positives. Further, there is no way to distinguish if it was
>>>> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
>>>> we most probably would need more counters, which we cannot fit into
>>>> struct page. (AFAIU, for huge pages it's easier).
>>>
>>> I think this is the real issue. We can only store so much information,
>>> so we have to decide which things work and which things are broken. So
>>> far someone hasn't presented a way to record everything at least..
>>
>> I do wonder how many (especially long-term) GUP readers/writers we have
>> to expect, and especially, support for a single base page. Do we have a
>> rough estimate?
>>
>> With RDMA, I would assume we only need a single one (e.g., one RDMA
>> device; I'm pretty sure I'm wrong, sounds too easy).
>> With VFIO I guess we need one for each VFIO container (~ in the worst
>> case one for each passthrough device).
>> With direct I/O, vmsplice and other GUP users ?? No idea.
>>
>> If we could somehow put a limit on the #GUP we support, and fail further
>> GUP (e.g., -EAGAIN?)
>> once a limit is reached, we could partition the
>> refcount into something like (assume max #15 GUP READ and #15 GUP R/W,
>> which is most probably a horribly bad choice)
>>
>> [ GUP READ ][ GUP R/W ][ ordinary ]
>>  31 ... 28   27 ... 24  23 ...  0
>>
>> But due to saturate handling in "ordinary", we would lose a further 2
>> bits (AFAIU), leaving us "only" 22 bits for "ordinary". Now, I have no
>> idea how many bits we actually need in practice.
>>
>> Maybe we need less for GUP READ, because most users want GUP R/W? No idea.
>>
>> Just wild ideas. Most probably that has already been discussed, and most
>> probably people figured that it's impossible :)
>
> I proposed this exact idea a few days ago [1]. It's remarkable that we both
> picked nearly identical values for the layout! :)

Heh! Somehow I missed that. But well, there were *a lot* of mails :)

> But as the responses show, security problems prevent pursuing that approach.

It still feels kind of wrong to waste valuable space in the memmap.

In an ideal world (well, one that still only allows for a 64 byte
memmap :) ), we would:

1) Partition the refcount into separate fields that cannot overflow into
each other, similar to my example above, but maybe add even more fields.

2) Reject attempts that would result in an overflow to everything except
the "ordinary" field (e.g., the GUP fields in my example above).

3) Put an upper limit on the "ordinary" field that we ever expect for
sane workloads (e.g., 10 bits). In addition, reserve some bits (like the
saturate logic) that we handle as a "red zone".

4) For the "ordinary" field, as soon as we enter the red zone, we know
we have an attack going on. We continue on paths that we cannot fail
(e.g., get_page()) but eventually try stopping the attacker(s). AFAIU,
we know the attacker(s) are something (e.g., one or multiple processes)
that has direct access to the page in their address space. Of course,
the more paths we can reject, the better.
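To make the idea above concrete, here is a minimal userspace sketch of
such a partitioned refcount. All names, field widths, and the red-zone
threshold are made up for illustration; this is not kernel code, just
the scheme from the layout/list above expressed as bit manipulation:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout, matching the example above:
 *   [ GUP READ ][ GUP R/W ][ ordinary ]
 *    31 ... 28   27 ... 24  23 ...  0
 */
#define GUP_READ_SHIFT    28
#define GUP_RW_SHIFT      24
#define GUP_FIELD_MAX     0xfu             /* at most 15 pins per GUP field */
#define ORDINARY_MASK     0x00ffffffu      /* low 24 bits */
#define ORDINARY_RED_ZONE (ORDINARY_MASK - 0x1000u)  /* arbitrary threshold */

static inline unsigned gup_read_pins(uint32_t rc)
{
	return rc >> GUP_READ_SHIFT;
}

static inline unsigned gup_rw_pins(uint32_t rc)
{
	return (rc >> GUP_RW_SHIFT) & GUP_FIELD_MAX;
}

static inline unsigned ordinary_refs(uint32_t rc)
{
	return rc & ORDINARY_MASK;
}

/*
 * Idea 2): a GUP pin that would overflow its field is rejected
 * (the caller would return -EAGAIN). Returns 1 on success, 0 on failure.
 */
static int try_gup_rw_pin(uint32_t *rc)
{
	if (gup_rw_pins(*rc) == GUP_FIELD_MAX)
		return 0;
	*rc += 1u << GUP_RW_SHIFT;
	return 1;
}

/*
 * Ideas 3)+4): an ordinary reference cannot fail, but entering the
 * red zone signals a suspected attack (return value 1).
 */
static int ordinary_get(uint32_t *rc)
{
	*rc += 1;
	return ordinary_refs(*rc) >= ORDINARY_RED_ZONE;
}
```

Because each field has its own overflow check, a flood of GUP pins can
no longer spill into (or saturate) the ordinary refcount, which is the
whole point of the partitioning.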
Now, we would:

a) Have to know what sane upper limits on the "ordinary" field are. I
have no idea which values we can expect. Attacker vs. sane workload.

b) Need a way to identify the attacker(s). In the simplest case, this is
a single process. In the hard case, this involves many processes.

c) Need a way to stop the attacker(s). Doing that out of random context
is problematic. The last resort is doing this asynchronously from
another thread, which leaves more time for the attacker to do harm.

Of course, the problem gets more involved as soon as we might have a
malicious child process that uses a page from a well-behaving parent
process for the attack. Imagine we kill the relevant processes: we might
end up killing someone who's not responsible. And even if we don't kill,
but instead reject try_get_page(), we might degrade the well-behaving
parent process AFAIKS.

An alternative to killing the process might be unmapping the problematic
page from its address space. Reminds me a little of handling memory
errors for a page, eventually killing all users of that page:
mm/memory-failure.c:kill_procs().

Complicated problem :)

--
Thanks,

David / dhildenb