From: Chuck Lever <chuck.lever@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
	lsf-pc@lists.linux-foundation.org,
	Linux RDMA Mailing List <linux-rdma@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Remote access to pmem on storage targets
Date: Wed, 27 Jan 2016 10:55:36 -0500
Message-ID: <D1002C60-C01D-410C-ABD4-7BDDB82E0CC1@oracle.com>
In-Reply-To: <20160127000404.GN6033@dastard>


> On Jan 26, 2016, at 7:04 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jan 26, 2016 at 10:58:44AM -0500, Chuck Lever wrote:
>> It is not going to be like the well-worn paradigm that
>> involves a page cache on the storage target backed by
>> slow I/O operations. The protocol layers on storage
>> targets need a way to discover memory addresses of
>> persistent memory that will be used as source/sink
>> buffers for RDMA operations.
>> 
>> And making data durable after a write is going to need
>> some thought. So I believe some new plumbing will be
>> necessary.
> 
> Haven't we already solved this for the pNFS file driver that XFS
> implements? i.e. these export operations:
> 
>        int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
>        int (*map_blocks)(struct inode *inode, loff_t offset,
>                          u64 len, struct iomap *iomap,
>                          bool write, u32 *device_generation);
>        int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
>                             int nr_iomaps, struct iattr *iattr);
> 
> so mapping/allocation of file offset to sector mappings, which can
> then trivially be used to grab the memory address through the bdev
> ->direct_access method, yes?

Thanks, that makes sense. How would such addresses be
utilized?

I'll speak about the NFS/RDMA server for this example, as
I am more familiar with that than with block targets. When
I say "NFS server" here I mean the software service on the
storage target that speaks the NFS protocol.

In today's RDMA-enabled storage protocols, an initiator
exposes its memory (in small segments) to storage targets,
sends a request, and the target's network transport performs
RDMA Read and Write operations to move the payload data in
that request.

Assuming the NFS server is somehow aware that what it is
getting from ->direct_access is a persistent memory address
and not an LBA, it would then have to pass it down to the
transport layer (svcrdma) so that the address can be used
as a source or sink buffer for RDMA operations.
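
Roughly, I imagine the server-side lookup could look something
like this (untested sketch; map_pmem_for_rdma() and pmem_addr_of()
are names I've made up for illustration, and the real
->direct_access calling convention has been a moving target):

static int map_pmem_for_rdma(struct inode *inode, loff_t offset, u64 len,
			     void **kaddr)
{
	struct iomap iomap;
	u32 device_gen;
	int error;

	/* pNFS-style callout: ask the filesystem to allocate/map the
	 * file range and hand back its block mapping. */
	error = inode->i_sb->s_export_op->map_blocks(inode, offset, len,
						     &iomap, true,
						     &device_gen);
	if (error)
		return error;

	/* Placeholder for the bdev ->direct_access plumbing that turns
	 * the block mapping into a persistent memory kernel address. */
	*kaddr = pmem_addr_of(inode->i_sb->s_bdev, &iomap, len);
	if (!*kaddr)
		return -EIO;

	/* svcrdma would then DMA-map and register *kaddr for use as a
	 * source or sink buffer for RDMA operations. */
	return 0;
}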

For an NFS READ, this should be straightforward. An RPC
request comes in, the NFS server identifies the memory that
is to source the READ reply and passes the address of that
memory to the transport, which then pushes the data in
that memory via an RDMA Write to the client.
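
To make that concrete, the last hop might look roughly like this
(sketch only; the qp, lkey, and chunk values are placeholders for
what svcrdma's existing Write chunk handling already tracks, and
I'm hand-waving the DMA mapping of the pmem buffer):

static int rdma_write_read_reply(struct ib_qp *qp, u32 local_dma_lkey,
				 u64 pmem_dma_addr, u32 payload_len,
				 u64 client_offset, u32 client_handle)
{
	struct ib_send_wr *bad_wr;
	struct ib_sge sge = {
		.addr	= pmem_dma_addr,	/* DMA-mapped pmem source */
		.length	= payload_len,
		.lkey	= local_dma_lkey,
	};
	struct ib_rdma_wr wr = {
		.wr.opcode	= IB_WR_RDMA_WRITE,
		.wr.send_flags	= IB_SEND_SIGNALED,
		.wr.sg_list	= &sge,
		.wr.num_sge	= 1,
		.remote_addr	= client_offset,	/* from the client's Write chunk */
		.rkey		= client_handle,	/* from the client's Write chunk */
	};

	/* One RDMA Write pushes the READ payload straight from pmem
	 * into the client's advertised memory segment. */
	return ib_post_send(qp, &wr.wr, &bad_wr);
}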

NFS WRITEs are more difficult. An RPC request comes in,
and today the transport layer gathers incoming payload data
in anonymous pages before the NFS server even knows there
is an incoming RPC. We'd have to add some kind of hook to
enable the NFS server and the underlying filesystem to
provide appropriate sink buffers to the transport.
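
I'm thinking of something along these lines (entirely invented,
just to show the shape of the hook, not a real svcrdma interface):

struct svc_rdma_sink_ops {
	/* Called before the transport posts RDMA Reads for an incoming
	 * WRITE: the NFS server/filesystem may return a kernel address
	 * covering @len bytes of the target file at @offset, so the
	 * payload lands directly in pmem. Returning NULL means "fall
	 * back to anonymous pages as today". */
	void *(*get_write_sink)(struct svc_rqst *rqstp, struct inode *inode,
				loff_t offset, size_t len);
};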

After the NFS WRITE request has been wholly received, the
NFS server today uses vfs_writev to put that data into the
target file. We'd probably want something more efficient
for pmem-backed filesystems. We want something more
efficient for traditional page cache-based filesystems
anyway.

Would every NFS WRITE larger than a page then be essentially
CoW, since the filesystem would need to provide "anonymous"
blocks to sink incoming WRITE data and then transition those
blocks into the target file? I'm not sure how this works for
pNFS with block devices.

Finally a client needs to perform an NFS COMMIT to ensure
that the written data is at rest on durable storage. We
could insist that all NFS WRITE operations to pmem will
be DATA_SYNC or better (in other words, abandon UNSTABLE
mode). If not, then a separate NFS COMMIT/LAYOUTCOMMIT
is necessary to flush memory caches and ensure data
durability. An extra RPC round trip is likely not a good
idea when the cost structure of an NFS WRITE to pmem is so
different from that of a write to a traditional block device.
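
If we went the DATA_SYNC-or-better route, the server-side commit
might boil down to something like this (sketch only;
flush_pmem_range() is a stand-in for whatever pmem flush primitive
ends up being available, and the iattr handling is elided):

static int nfsd_commit_pmem_write(struct inode *inode, struct iomap *iomaps,
				  int nr_iomaps, void *kaddr, size_t len)
{
	struct iattr iattr = { .ia_valid = 0 };

	/* Push the RDMA'd bytes out of the CPU caches so they are
	 * durable on the pmem media. */
	flush_pmem_range(kaddr, len);

	/* Have the filesystem commit the blocks into the target file,
	 * as the pNFS block server's ->commit_blocks does today. */
	return inode->i_sb->s_export_op->commit_blocks(inode, iomaps,
						       nr_iomaps, &iattr);
}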

NFS WRITE is really the thing we want to go as fast as
possible, btw. NFS READ on RDMA is already faster, for
reasons I won't go into here. Aside from that, NFS READ
results are frequently cached on clients, and some of the
cost of NFS READ is already hidden by read-ahead. Because
read(2) is often satisfied from a local cache, application
progress is more frequently blocked by pending write(2)
calls than by reads.

A fully generic solution would have to provide NFS service
for transports that do not enable direct data placement
(e.g. TCP), and for filesystems that are legacy page
cache-based (anything residing on a traditional block
device).

I imagine that the issues are similar for block targets, if
they assume block devices are fronted by a memory cache.


--
Chuck Lever



