From: Amir Goldstein <amir73il@gmail.com>
To: John Groves <John@groves.net>, Miklos Szeredi <miklos@szeredi.hu>,
	 Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: lsf-pc@lists.linux-foundation.org,
	Jonathan Corbet <corbet@lwn.net>,
	 Dan Williams <dan.j.williams@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	 Dave Jiang <dave.jiang@intel.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	 Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Matthew Wilcox <willy@infradead.org>,
	 linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	 nvdimm@lists.linux.dev, Randy Dunlap <rdunlap@infradead.org>,
	 Jon Grimm <jon.grimm@amd.com>,
	Dave Chinner <david@fromorbit.com>,
	john@jagalactic.com,  Bharata B Rao <bharata@amd.com>,
	Jerome Glisse <jglisse@google.com>,
	gregory.price@memverge.com,  Ajay Joshi <ajayjoshi@micron.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>,
	 Alistair Popple <apopple@nvidia.com>,
	Christoph Hellwig <hch@infradead.org>, Zi Yan <ziy@nvidia.com>,
	 David Rientjes <rientjes@google.com>,
	Ravi Shankar <venkataravis@micron.com>,
	 dave.hansen@linux.intel.com, John Hubbard <jhubbard@nvidia.com>,
	mykolal@meta.com,  Brian Morris <bsmorris@google.com>,
	Eishan Mirakhur <emirakhur@micron.com>,
	Wei Xu <weixugc@google.com>,  "Theodore Ts'o" <tytso@mit.edu>,
	Srinivasulu Thanneeru <sthanneeru@micron.com>,
	John Groves <jgroves@micron.com>,
	 Christoph Lameter <cl@gentwo.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Aravind Ramesh <arramesh@micron.com>
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND]
Date: Tue, 23 Apr 2024 16:30:59 +0300
Message-ID: <CAOQ4uxi83HUUmMmNs9NeeOOfVVXhpWAdeAEDq8r31p0tK1sA2A@mail.gmail.com>
In-Reply-To: <20240229002020.85535-1-john@groves.net>

On Thu, Feb 29, 2024 at 2:20 AM John Groves <John@groves.net> wrote:
>
> John Groves, Micron
>
> Micron recently released the first RFC for famfs [1]. Although famfs is not
> CXL-specific in any way, it aims to enable hosts to share data sets in shared
> memory (such as CXL) by providing a memory-mappable fs-dax file system
> interface to the memory.
>
> Sharable disaggregated memory already exists in the lab, and will be possible
> in the wild soon. Famfs aims to do the following:
>
> * Provide an access method that provides isolation between files, and does not
>   tempt developers to mmap all the memory writable on every host.
> * Provide an access method that can be used by unmodified apps.
>
> Without something like famfs, enabling the use of sharable memory will involve
> the temptation to do things that may destabilize systems, such as
> mapping large shared, writable global memory ranges and hooking allocators to
> use it (potentially sacrificing isolation), and forcing the same virtual
> address ranges in every host/process (compromising security).
>
> The most obvious candidate app categories are data analytics and data lakes.
> Both make heavy use of "zero-copy" data frames - column-oriented data that
> is laid out for efficient use via (MAP_SHARED) mmap. Moreover, these use case
> categories are generally driven by python code that wrangles data into
> appropriate data frames - making it straightforward to put the data frames
> into famfs. Furthermore, these use cases usually involve the shared data being
> read-only during computation or query jobs - meaning they are often free of
> cache coherency concerns.
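
For illustration, here is a minimal sketch of the consumer-side "zero-copy"
access pattern described above, in Python. The mount point, file name and
dtype are assumptions made up for this example, not anything defined by famfs:

import mmap
import numpy as np

# Hypothetical path: a column of a data frame previously placed in a famfs
# mount. The mount point and file layout are assumptions for illustration.
path = "/mnt/famfs/dataset/prices.f64"

with open(path, "rb") as f:
    # Read-only, MAP_SHARED: every host maps the same backing memory;
    # no extra copy is made on the consumer side.
    buf = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ,
                    flags=mmap.MAP_SHARED)

# Zero-copy view of the column as a numpy array (dtype is an assumption).
col = np.frombuffer(buf, dtype=np.float64)
print(col.sum())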
>
> Workloads such as these often deal with data sets that are too large to fit
> in a single server's memory, so the data gets sharded - requiring movement via
> a network. Sharded apps also sometimes have to do expensive reshuffling -
> moving data to nodes with available compute resources. Avoiding the sharding
> overheads by accessing such data sets in disaggregated shared memory looks
> promising as a way to make better use of memory and compute resources, and to
> effectively de-duplicate data sets in memory.
>
> About sharable memory
>
> * Shared memory is pmem-like, in that hosts will connect in order to access
>   pre-existing contents
> * Onlining sharable memory as system-ram is nonsense; system-ram gets zeroed...
> * CXL 3 provides for optionally-supported hardware-managed cache coherency
> * But "multiple-readers, no writers" use cases don't need hardware support
>   for coherency
> * CXL 3.1 dynamic capacity devices (DCDs) should be thought of as devices with
>   an allocator built in.
> * When sharable capacity is allocated, each host that has access will see a
>   /dev/dax device that can be found by the "tag" of the allocation. The tag is
>   just a uuid.
> * CXL 3.1 also allows the capacity associated with any allocated tag to be
>   provided to each host (or host group) as either writable or read-only.
>
> About famfs
>
> Famfs is an append-only log-structured file system that places many limits
> on what can be done. This allows famfs to tolerate clients with a stale copy
> of metadata. All memory allocation and log maintenance is performed from user
> space, but file extent lists are cached in the kernel for fast fault
> resolution. The current limitations are fairly extreme, but many can be relaxed
> by writing more code, managing Byzantine generals, etc. ;)
>
> A famfs-enabled kernel can be cloned at [3], and the user space repo can be
> cloned at [4]. Even with major functional limitations in its current form
> (e.g. famfs does not currently support deleting files), it is sufficient to
> use in data analytics workloads - in which you 1) create a famfs file system,
> 2) dump data sets into it, 3) run clustered jobs that consume the shared data
> sets, and 4) dismount and deallocate the memory containing the file system.
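
As a concrete (if simplified) illustration of that workflow, a sketch in
Python follows. The mkfs.famfs / famfs command names and the device path are
assumptions based on the famfs user-space tooling; check the repo for the
actual syntax:

import subprocess

dax_dev = "/dev/dax0.0"   # assumed shared-memory dax device
mnt = "/mnt/famfs"        # assumed mount point

# 1) create a famfs file system on the shared device (one "master" host)
subprocess.run(["mkfs.famfs", dax_dev], check=True)

# 2) mount it and dump a data set into it (command names are assumptions)
subprocess.run(["famfs", "mount", dax_dev, mnt], check=True)
subprocess.run(["famfs", "cp", "prices.f64", mnt + "/"], check=True)

# 3) clustered jobs on the other hosts mount the same device and mmap the
#    files read-only, as in the earlier sketch.

# 4) when the jobs are done, dismount and release the memory
subprocess.run(["umount", mnt], check=True)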
>
> Famfs Open Issues
>
> * Volatile CXL memory is exposed as character dax devices; the famfs patch
>   set adds the iomap API, which is required for fs-dax but until now missing
>   from character dax.
> * (/dev/pmem devices are block, and support the iomap api for fs-dax file
>   systems)
> * /dev/pmem devices can be converted to /dev/dax mode, but native /dev/dax
>   devices cannot be converted to pmem mode.
> * /dev/dax devices lack the iomap api that fs-dax uses with pmem, so the famfs
>   patch set adds that.
> * VFS layer hooks for a file system on a character device may be needed.
> * Famfs has uncovered some previously latent bugs in the /dev/dax mmap
>   machinery that probably require attention.
> * Famfs currently works with either pmem or devdax devices, but our
>   inclination is to drop pmem support to reduce the complexity of supporting
>   two different underlying device types - particularly since famfs is not
>   intended for actual pmem.
>
>
> Required :-
> Dan Williams
> Christian Brauner
> Jonathan Cameron
> Dave Hansen
>
> [LSF/MM + BPF ATTEND]
>
> I am the author of the famfs file system. Famfs was first introduced at LPC
> 2023 [2]. I'm also Micron's voting member on the Software and Systems Working
> Group (SSWG) of the CXL Consortium, and a co-author of the CXL 3.1
> specification.
>
>
> References
>
> [1] https://lore.kernel.org/linux-fsdevel/cover.1708709155.git.john@groves.net/#t
> [2] https://lpc.events/event/17/contributions/1455/
> [3] https://www.computeexpresslink.org/download-the-specification
> [4] https://github.com/cxl-micron-reskit/famfs-linux
>

Hi John,

Following our correspondence on your patch set [1], I am not sure that the
details of the famfs file system itself are an interesting topic for the
LSFMM crowd.
What I would like to do is schedule a session on:
"Famfs: new userspace filesystem driver vs. improving FUSE/DAX"

I am hoping that Miklos and Bernd will be able to participate in this
session remotely.

You see, the last time someone tried to introduce a specialized,
faster FUSE replacement [2], the comments from the community were
that the FUSE protocol can and should be improved instead of introducing
another "filesystem in userspace" protocol.

Since 2019, FUSE has gained virtiofs/DAX support; it recently gained
FUSE passthrough support, and Bernd is working on FUSE uring [3].

My hope is that you will be able to list the improvements needed
in /dev/dax iomap and FUSE so that you can use the existing
kernel infrastructure and FUSE libraries to implement famfs.

How does that sound for a discussion?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/3jwluwrqj6rwsxdsksfvdeo5uccgmnkh7rgefaeyxf2gu75344@ybhwncywkftx/
[2] https://lore.kernel.org/linux-fsdevel/8d119597-4543-c6a4-917f-14f4f4a6a855@netapp.com/
[3] https://lore.kernel.org/linux-fsdevel/20230321011047.3425786-1-bschubert@ddn.com/
