* [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND]
@ 2024-02-29  0:20 John Groves
  2024-04-23 13:30 ` [Lsf-pc] " Amir Goldstein
  0 siblings, 1 reply; 3+ messages in thread
From: John Groves @ 2024-02-29  0:20 UTC (permalink / raw)
  To: lsf-pc, Jonathan Corbet, Dan Williams, Vishal Verma, Dave Jiang,
	Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
	linux-cxl, linux-fsdevel, nvdimm
  Cc: John Groves, John Groves, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Randy Dunlap, Jerome Glisse,
	David Rientjes, Johannes Weiner, John Hubbard, Zi Yan,
	Bharata B Rao, Aneesh Kumar K . V, Alistair Popple,
	Christoph Lameter, Andrew Morton, Jon Grimm, Brian Morris,
	Wei Xu, Theodore Ts'o, mykolal, Aravind Ramesh, Ajay Joshi,
	Eishan Mirakhur, Ravi Shankar, Srinivasulu Thanneeru

John Groves, Micron

Micron recently released the first RFC for famfs [1]. Although famfs is not
CXL-specific in any way, it aims to enable hosts to share data sets in shared
memory (such as CXL) by providing a memory-mappable fs-dax file system
interface to the memory.

Sharable disaggregated memory already exists in the lab, and will be possible
in the wild soon. Famfs aims to do the following:

* Provide an access method that provides isolation between files, and does not
  tempt developers to mmap all the memory writable on every host.
* Provide an access method that can be used by unmodified apps.

Without something like famfs, enabling the use of sharable memory will involve
the temptation to do things that may destabilize systems, such as
mapping large shared, writable global memory ranges and hooking allocators to
use it (potentially sacrificing isolation), and forcing the same virtual
address ranges in every host/process (compromising security).

The most obvious candidate app categories are data analytics and data lakes.
Both make heavy use of "zero-copy" data frames - column-oriented data that
is laid out for efficient use via (MAP_SHARED) mmap. Moreover, these use case
categories are generally driven by Python code that wrangles data into
appropriate data frames - making it straightforward to put the data frames
into famfs. Furthermore, these use cases usually involve the shared data being
read-only during computation or query jobs - meaning they are often free of
cache coherency concerns.
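
As a rough illustration of the "zero-copy" access pattern described above, the
sketch below maps a file read-only and sums a column of 64-bit integers in
place. It is not taken from famfs or from any data frame library; the helper
names (write_column, sum_column) and the raw native-endian int64 layout are
assumptions made for the example. On Linux, access=mmap.ACCESS_READ gives a
shared, read-only mapping, and the same code would apply to any mmap-able
file, including one on a dax-backed file system.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Write a column of signed 64-bit integers as raw native-endian bytes."""
    with open(path, "wb") as f:
        array("q", values).tofile(f)


def sum_column(path):
    """Map the file read-only and sum the column without copying it."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ is a read-only mapping.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # cast("q") reinterprets the mapped bytes as int64 values in place.
            with memoryview(m) as raw, raw.cast("q") as column:
                return sum(column)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "column.bin")
        write_column(path, range(1000))
        print(sum_column(path))
```

Because the mapping is read-only, any number of processes (or, with shared
memory, hosts) can run the reader side concurrently without coordinating
with each other.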

Workloads such as these often deal with data sets that are too large to fit
in a single server's memory, so the data gets sharded - requiring movement via
a network. Sharded apps also sometimes have to do expensive reshuffling -
moving data to nodes with available compute resources. Avoiding the sharding
overheads by accessing such data sets in disaggregated shared memory looks
promising, both for making better use of memory and compute resources and for
effectively de-duplicating data sets in memory.

About sharable memory

* Shared memory is pmem-like, in that hosts will connect in order to access
  pre-existing contents
* Onlining sharable memory as system-ram is nonsense; system-ram gets zeroed...
* CXL 3 provides for optionally-supported hardware-managed cache coherency
* But "multiple-readers, no writers" use cases don't need hardware support
  for coherency
* CXL 3.1 dynamic capacity devices (DCDs) should be thought of as devices with
  an allocator built in.
* When sharable capacity is allocated, each host that has access will see a
  /dev/dax device that can be found by the "tag" of the allocation. The tag is
  just a uuid.
* CXL 3.1 also allows the capacity associated with any allocated tag to be
  provided to each host (or host group) as either writable or read-only.

About famfs

Famfs is an append-only log-structured file system that places many limits
on what can be done. This allows famfs to tolerate clients with a stale copy
of metadata. All memory allocation and log maintenance is performed from user
space, but file extent lists are cached in the kernel for fast fault
resolution. The current limitations are fairly extreme, but many can be relaxed
by writing more code, managing Byzantine generals, etc. ;)
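
To make the stale-metadata point concrete, here is a toy model of an
append-only log being replayed into a per-host extent cache. It is only a
sketch under stated assumptions: LogEntry, Client, replay and resolve are
hypothetical names, and the real famfs log format and kernel interface are
not described in this message. The idea it illustrates is that, when entries
are only ever appended and files are never deleted, a client that has
replayed only a prefix of the log still holds valid mappings for every file
it knows about.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogEntry:
    """One hypothetical log record: a file was created with these extents."""
    path: str
    extents: tuple  # ((device_offset, length), ...) in the shared memory


@dataclass
class Client:
    """A host that replays the shared log into a local path -> extents cache."""
    applied: int = 0  # number of log entries replayed so far
    files: dict = field(default_factory=dict)

    def replay(self, log):
        """Apply only the entries appended since the last replay."""
        for entry in log[self.applied:]:
            self.files[entry.path] = entry.extents
        self.applied = len(log)

    def resolve(self, path, file_offset):
        """Translate a file offset to a device offset using cached extents."""
        for dev_offset, length in self.files[path]:
            if file_offset < length:
                return dev_offset + file_offset
            file_offset -= length
        raise ValueError("offset past end of file")


if __name__ == "__main__":
    log = [LogEntry("frames/a", ((0, 4096),))]
    client = Client()
    client.replay(log)

    # Another host appends a new file; this client is now stale.
    log.append(LogEntry("frames/b", ((4096, 8192),)))
    print(client.resolve("frames/a", 100))  # old mapping is still valid

    client.replay(log)  # catch up by replaying only the new tail
    print(client.resolve("frames/b", 10))
```

Supporting delete or truncate would break the prefix property in this model,
which is one way to see why relaxing the limitations takes extra
coordination.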

A famfs-enabled kernel can be cloned at [3], and the user space repo can be
cloned at [4]. Even with major functional limitations in its current form
(e.g. famfs does not currently support deleting files), it is sufficient to
use in data analytics workloads - in which you 1) create a famfs file system,
2) dump data sets into it, 3) run clustered jobs that consume the shared data
sets, and 4) dismount and deallocate the memory containing the file system.

Famfs Open Issues

* Volatile CXL memory is exposed as character dax devices; the famfs patch
  set adds the iomap API, which is required for fs-dax but until now missing
  from character dax.
* (/dev/pmem devices are block devices, and support the iomap API for fs-dax
  file systems)
* /dev/pmem devices can be converted to /dev/dax mode, but native /dev/dax
  devices cannot be converted to pmem mode.
* /dev/dax devices lack the iomap API that fs-dax uses with pmem, so the famfs
  patch set adds that.
* VFS layer hooks for a file system on a character device may be needed.
* Famfs has uncovered some previously latent bugs in the /dev/dax mmap
  machinery that probably require attention.
* Famfs currently works with either pmem or devdax devices, but our
  inclination is to drop pmem support to reduce the complexity of supporting
  two different underlying device types - particularly since famfs is not
  intended for actual pmem.


Required :-
Dan Williams
Christian Brauner
Jonathan Cameron
Dave Hansen

[LSF/MM + BPF ATTEND]

I am the author of the famfs file system. Famfs was first introduced at LPC
2023 [2]. I'm also Micron's voting member on the Software and Systems Working
Group (SSWG) of the CXL Consortium, and a co-author of the CXL 3.1
specification.


References

[1] https://lore.kernel.org/linux-fsdevel/cover.1708709155.git.john@groves.net/#t
[2] https://lpc.events/event/17/contributions/1455/
[3] https://www.computeexpresslink.org/download-the-specification
[4] https://github.com/cxl-micron-reskit/famfs-linux

Best regards,
John Groves
Micron

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND]
  2024-02-29  0:20 [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND] John Groves
@ 2024-04-23 13:30 ` Amir Goldstein
  2024-04-24 12:22   ` John Groves
  0 siblings, 1 reply; 3+ messages in thread
From: Amir Goldstein @ 2024-04-23 13:30 UTC (permalink / raw)
  To: John Groves, Miklos Szeredi, Bernd Schubert
  Cc: lsf-pc, Jonathan Corbet, Dan Williams, Vishal Verma, Dave Jiang,
	Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
	linux-cxl, linux-fsdevel, nvdimm, Randy Dunlap, Jon Grimm,
	Dave Chinner, john, Bharata B Rao, Jerome Glisse, gregory.price,
	Ajay Joshi, Aneesh Kumar K . V, Alistair Popple,
	Christoph Hellwig, Zi Yan, David Rientjes, Ravi Shankar,
	dave.hansen, John Hubbard, mykolal, Brian Morris,
	Eishan Mirakhur, Wei Xu, Theodore Ts'o,
	Srinivasulu Thanneeru, John Groves, Christoph Lameter,
	Johannes Weiner, Andrew Morton, Aravind Ramesh


Hi John,

Following our correspondence on your patch set [1], I am not sure that the
details of the famfs file system itself are an interesting topic for the
LSFMM crowd.
What I would like to do is schedule a session on:
"Famfs: new userspace filesystem driver vs. improving FUSE/DAX"

I am hoping that Miklos and Bernd will be able to participate in this
session remotely.

You see, the last time that someone tried to introduce a specialized
faster FUSE replacement [2], the comments from the community were
that FUSE protocol can and should be improved instead of introducing
another "filesystem in userspace" protocol.

Since 2019, FUSE has gained virtiofs/dax support and, more recently, FUSE
passthrough support, and Bernd is working on FUSE uring [3].

My hope is that you will be able to list the needed improvements
to /dev/dax iomap and FUSE so that you could use the existing
kernel infrastructure and FUSE libraries to implement famfs.

How does that sound for a discussion?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/3jwluwrqj6rwsxdsksfvdeo5uccgmnkh7rgefaeyxf2gu75344@ybhwncywkftx/
[2] https://lore.kernel.org/linux-fsdevel/8d119597-4543-c6a4-917f-14f4f4a6a855@netapp.com/
[3] https://lore.kernel.org/linux-fsdevel/20230321011047.3425786-1-bschubert@ddn.com/

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND]
  2024-04-23 13:30 ` [Lsf-pc] " Amir Goldstein
@ 2024-04-24 12:22   ` John Groves
  0 siblings, 0 replies; 3+ messages in thread
From: John Groves @ 2024-04-24 12:22 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Bernd Schubert, lsf-pc, Jonathan Corbet,
	Dan Williams, Vishal Verma, Dave Jiang, Alexander Viro,
	Christian Brauner, Jan Kara, Matthew Wilcox, linux-cxl,
	linux-fsdevel, nvdimm, Randy Dunlap, Jon Grimm, Dave Chinner,
	john, Bharata B Rao, Jerome Glisse, gregory.price, Ajay Joshi,
	Aneesh Kumar K . V, Alistair Popple, Christoph Hellwig, Zi Yan,
	David Rientjes, Ravi Shankar, dave.hansen, John Hubbard, mykolal,
	Brian Morris, Eishan Mirakhur, Wei Xu, Theodore Ts'o,
	Srinivasulu Thanneeru, John Groves, Christoph Lameter,
	Johannes Weiner, Andrew Morton, Aravind Ramesh


Amir,

That sounds good, thanks! I'll start preparing for it!

Re: [2]: I do think there are important ways that famfs is not "another 
filesystem in user space protocol" - but I'll save it for the LSFMM session!

FYI famfs v2 patches will be going out before LSFMM (and possibly before
next week).

Thanks Amir,
John

