From: Jeff Moyer
To: Dan Williams
Cc: linux-nvdimm, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/8] device-dax: sub-division support
References: <148143770485.10950.13227732273892953675.stgit@dwillia2-desk3.amr.corp.intel.com>
Date: Tue, 13 Dec 2016 18:46:09 -0500

Hi, Dan,

In general, I have a couple of concerns with this patchset:

1) You're making a case that subdivision shouldn't be persistent, which
   means that all of the code we already have for subdividing devices
   (partitions, libnvdimm) has to be re-invented in userspace, and
   existing tools can't be used to manage nvdimms.
2) You're pushing file system features into a character device.

I think that using device dax for both volatile and non-volatile
memories is a mistake. For persistent memory, I think users would want
any subdivision to be persistent. I also think that using a familiar
storage model, like block devices and partitions, would make a heck of a
lot more sense than this proposal.

For volatile use cases, I don't have a problem with what you've proposed.
But then, I don't really think too much about those use cases, either,
so maybe I'm not the best person to ask.

So, in my opinion, you should make device dax all about the volatile use
case and we can go back to pushing dax for block devices to support use
cases like big databases and passing NVDIMMs into VMs. Yes, I'm signing
up to help.

More detailed responses are inline below.

Dan Williams writes:

> On Mon, Dec 12, 2016 at 9:15 AM, Jeff Moyer wrote:
>> Hi, Dan,
>>
>> Dan Williams writes:
>>
>>> From [PATCH 6/8] dax: sub-division support:
>>>
>>> Device-DAX is a mechanism to establish mappings of performance / feature
>>> differentiated memory with strict fault behavior guarantees. With
>>> sub-division support a platform owner can provision sub-allocations of a
>>> dax-region into separate devices. The provisioning mechanism follows the
>>> same scheme as the libnvdimm sub-system in that a 'seed' device is
>>> created at initialization time that can be resized from zero to become
>>> enabled.
>>>
>>> Unlike the nvdimm sub-system there is no on media labelling scheme
>>> associated with this partitioning. Provisioning decisions are ephemeral
>>> / not automatically restored after reboot. While the initial use case of
>>> device-dax is persistent memory other use cases may be volatile, so the
>>> device-dax core is unable to assume the underlying memory is pmem. The
>>> task of recalling a partitioning scheme or permissions on the device(s)
>>> is left to userspace.
>>
>> Can you explain this reasoning in a bit more detail, please? If you
>> have specific use cases in mind, that would be helpful.
>
> A few use cases are top of mind:
>
> * userspace persistence support: filesystem-DAX as implemented in XFS
> and EXT4 requires filesystem coordination for persistence, device-dax
> does not. An application may not need a full namespace worth of
> persistent memory, or may want to dynamically resize the amount of
> persistent memory it is consuming.
> This enabling allows online resize
> of device-dax file/instance.

OK, so you've now implemented file extending and truncation (and block
mapping, I guess). Where does this end? How many more file-system
features will you add to this character device?

> * allocation + access mechanism for performance differentiated memory:
> Persistent memory is one example of a reserved memory pool with
> different performance characteristics than typical DRAM in a system,
> and there are examples of other performance differentiated memory
> pools (high bandwidth or low latency) showing up on commonly available
> platforms. This mechanism gives purpose built applications (high
> performance computing, databases, etc...) a way to establish mappings
> with predictable fault-granularities and performance, but also allow
> for different permissions per allocation.

So, how would an application that wishes to use a device-dax subdivision
of performance differentiated memory get access to it?

1) administrator subdivides space and assigns it to a user
2) application gets to use it

Something like that? Or do you expect applications to sub-divide the
device-dax instance programmatically? Why wouldn't you want the mapping
to live beyond a single boot?

> * carving up a PCI-E device memory bar for managing peer-to-peer
> transactions: In the thread about enabling P2P DMA one of the
> concerns that was raised was security separation of different users of
> a device: http://marc.info/?l=linux-kernel&m=148106083913173&w=2

OK, but I wasn't sure that there was consensus in that thread. It
seemed more likely that the block device ioctl path would be pursued.
If this is the preferred method, I think you should document their
requirements and show how the implementation meets them, instead of
leaving that up to reviewers. Or, at the very least, CC the interested
parties?

>>> For persistent allocations, naming, and permissions automatically
>>> recalled by the kernel, use filesystem-DAX.
>>> For a userspace helper
>>
>> I'd agree with that guidance if it wasn't for the fact that device dax
>> was born out of the need to be able to flush dirty data in a safe manner
>> from userspace. At best, we're giving mixed guidance to application
>> developers.
>
> Yes, but at the same time device-DAX is sufficiently painful (no
> read(2)/write(2) support, no builtin metadata support) that it may
> spur application developers to lobby for a filesystem that offers
> userspace dirty-data flushing. Until then we have this vehicle to test
> the difference and dax-support for memory types beyond persistent
> memory.

Let's just work on the PMEM_IMMUTABLE flag that Dave suggested[1] and
make device dax just for volatile memories.

-Jeff

[1] http://lkml.iu.edu/hypermail/linux/kernel/1609.1/05372.html