From: Dave Chinner <david@fromorbit.com>
To: Pankaj Gupta <pagupta@redhat.com>
Cc: Dan Williams, Matthew Wilcox, Linux Kernel Mailing List, KVM list,
	Qemu Developers, linux-nvdimm, linux-fsdevel,
	virtualization@lists.linux-foundation.org, Linux ACPI, linux-ext4,
	linux-xfs, Jan Kara, Stefan Hajnoczi, Rik van Riel,
	Nitesh Narayan Lal, Kevin Wolf, Paolo Bonzini, Ross Zwisler,
	vishal l verma, dave jiang, David Hildenbrand, jmoyer,
	xiaoguangrong eric, Christoph Hellwig, "Michael S. Tsirkin",
	Jason Wang, lcapitulino@redhat.com, Igor Mammedov, Eric Blake,
	Theodore Ts'o, adilger kernel, darrick wong, "Rafael J. Wysocki"
Date: Wed, 16 Jan 2019 07:42:23 +1100
Subject: Re: [PATCH v3 0/5] kvm "virtio pmem" device
Message-ID: <20190115204222.GK4205@dastard>
References: <20190109144736.17452-1-pagupta@redhat.com>
	<1326478078.61913951.1547192704870.JavaMail.zimbra@redhat.com>
	<20190113232902.GD4205@dastard>
	<20190113233820.GX6310@bombadil.infradead.org>
	<942065073.64011540.1547450140670.JavaMail.zimbra@redhat.com>
	<20190114212501.GG4205@dastard>
	<20190114222132.GH4205@dastard>
	<1684638419.64320214.1547530506805.JavaMail.zimbra@redhat.com>
In-Reply-To: <1684638419.64320214.1547530506805.JavaMail.zimbra@redhat.com>

On Tue, Jan 15, 2019 at 12:35:06AM -0500, Pankaj Gupta wrote:
> > On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote:
> > > > > > > Until you have images (and hence host page cache) shared
> > > > > > > between multiple guests. People will want to do this,
> > > > > > > because it means they only need a single set of pages in
> > > > > > > host memory for executable binaries rather than a set of
> > > > > > > pages per guest. Then you have multiple guests being able
> > > > > > > to detect residency of the same set of pages.
> > > > > > > If the guests can then, in any way, control eviction of
> > > > > > > the pages from the host cache, then we have a
> > > > > > > guest-to-guest information leak channel.
> > > > > >
> > > > > > I don't think we should ever be considering something that
> > > > > > would allow a guest to evict pages from the host's pagecache
> > > > > > [1]. The guest should be able to kick its own references to
> > > > > > the host's pagecache out of its own pagecache, but not be
> > > > > > able to influence whether the host or another guest has a
> > > > > > read-only mapping cached.
> > > > > >
> > > > > > [1] Unless the guest is allowed to modify the host's file;
> > > > > > obviously truncation, holepunching, etc are going to evict
> > > > > > pages from the host's page cache.
> > > > >
> > > > > This is correct. The guest does not evict host page cache
> > > > > pages directly.
> > > >
> > > > They don't right now.
> > > >
> > > > But someone is going to end up asking for discard to work so
> > > > that the guest can free unused space in the underlying sparse
> > > > image (i.e. make use of fstrim or mount -o discard) because they
> > > > have workloads that have bursts of space usage and they need to
> > > > trim the image files afterwards to keep their overall space
> > > > usage under control.
> > > >
> > > > And then....
> > >
> > > ...we reject / push back on that patch citing the above concern.
> >
> > So at what point do we draw the line?
> >
> > We're allowing writable DAX mappings, but as I've pointed out that
> > means we are going to be allowing a potential information leak via
> > files with shared extents to be directly mapped and written to.
> >
> > But we won't allow useful admin operations that allow better
> > management of host side storage space similar to how normal image
> > files are used by guests because it's an information leak vector?
>
> First of all, thank you for all the useful discussions. To
> summarize:
>
> - We have to live with the limitation of not supporting the fstrim
>   and mount -o discard options with virtio-pmem, as they would evict
>   host page cache pages. We cannot allow this for virtio-pmem for
>   security reasons. These filesystem commands will just zero out
>   unused pages currently.

Not sure I follow you here - what pages are going to be zeroed, and
when will they be zeroed? If discard is not allowed, filesystems just
don't issue such commands and the underlying device will never see
them.

> - If a lot of space is unused and not freed, the guest can request
>   the host administrator to truncate the host backing image.

You can't use truncate to free space in a disk image file. The only
way to do it safely in a generic, filesystem-agnostic way is to mount
the disk image (e.g. on loopback) and run fstrim on it. The loopback
device will punch holes in the file wherever the filesystem reports
free space via discard requests (sketched below).

Which is kinda my point - this could only be done if the guest is
shut down, which makes it very difficult for admins to manage.

> We are also planning to support the qcow2 sparse image format on the
> host side with virtio-pmem.

So you're going to be remapping a huge number of disjoint regions
into a linear pmem mapping? ISTR discussions about similar things for
virtio+fuse+dax that came up against "large numbers of mapped regions
don't scale", and so it wasn't a practical solution compared to just
using raw sparse files....
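FWIW, the fstrim-over-loopback workflow referred to above would look
something like this - a rough sketch only, with the image path, loop
device and mount point made up for illustration:

    # attach the (shut down) guest's image file to a loop device
    # and mount the filesystem it contains
    losetup /dev/loop0 /images/guest.img
    mount /dev/loop0 /mnt/img

    # fstrim issues discards for all the free space the filesystem
    # reports; the loop driver turns those discards into hole
    # punches in the backing image file, returning the space to
    # the host filesystem
    fstrim -v /mnt/img

    umount /mnt/img
    losetup -d /dev/loop0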
> - There is no existing solution for Qemu persistent memory emulation
>   with write support currently. This solution provides us with a
>   paravirtualized way of emulating persistent memory.

Sure, but the question is: why do you need to create an emulation
that doesn't actually perform like pmem? The whole point of pmem is
performance, and emulating pmem by mmap() of a file on spinning disks
is going to be horrible for performance. Even on SSDs it's going to
be orders of magnitude slower than real pmem. So exactly what problem
are you trying to solve with this driver?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com