All of lore.kernel.org
 help / color / mirror / Atom feed
* why does nfsd write not use splice
@ 2013-06-09  7:35 Sandeep Joshi
  2013-06-11 19:51 ` J. Bruce Fields
  0 siblings, 1 reply; 15+ messages in thread
From: Sandeep Joshi @ 2013-06-09  7:35 UTC (permalink / raw)
  To: linux-nfs

Is there a reason as to why the nfsd server does not use splice in the
write calls - nfsd_vfs_write() ?  Is there some structural limitation
or is it just something nobody got around to implementing ?

I have looked at the source back to the 2.6.x kernels and it seems
only nfsd_vfs_read() has ever used splice/sendfile.
http://lxr.linux.no/linux+v3.9.5/fs/nfsd/vfs.c

TIA
-Sandeep

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
  2013-06-09  7:35 why does nfsd write not use splice Sandeep Joshi
@ 2013-06-11 19:51 ` J. Bruce Fields
  2013-06-12  2:36   ` Tom Talpey
  0 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2013-06-11 19:51 UTC (permalink / raw)
  To: Sandeep Joshi; +Cc: linux-nfs

On Sun, Jun 09, 2013 at 01:05:16PM +0530, Sandeep Joshi wrote:
> Is there a reason as to why the nfsd server does not use splice in the
> write calls - nfsd_vfs_write() ?  Is there some structural limitation
> or is it just something nobody got around to implementing ?
> 
> I have looked at the source back to the 2.6.x kernels and it seems
> only nfsd_vfs_read() has ever used splice/sendfile.
> http://lxr.linux.no/linux+v3.9.5/fs/nfsd/vfs.c

I don't actually know how splice_write works.... I assume to avoid a
copy we'd have to place the incoming write data into pages already
correctly aligned.  That would be an interesting trick.

--b.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
  2013-06-11 19:51 ` J. Bruce Fields
@ 2013-06-12  2:36   ` Tom Talpey
  2013-06-12 15:39     ` J. Bruce Fields
  0 siblings, 1 reply; 15+ messages in thread
From: Tom Talpey @ 2013-06-12  2:36 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Sandeep Joshi, linux-nfs

On 6/11/2013 3:51 PM, J. Bruce Fields wrote:
> On Sun, Jun 09, 2013 at 01:05:16PM +0530, Sandeep Joshi wrote:
>> Is there a reason as to why the nfsd server does not use splice in the
>> write calls - nfsd_vfs_write() ?  Is there some structural limitation
>> or is it just something nobody got around to implementing ?
>>
>> I have looked at the source back to the 2.6.x kernels and it seems
>> only nfsd_vfs_read() has ever used splice/sendfile.
>> http://lxr.linux.no/linux+v3.9.5/fs/nfsd/vfs.c
>
> I don't actually know how splice_write works.... I assume to avoid a
> copy we'd have to place the incoming write data into pages already
> correctly aligned.  That would be an interesting trick.

NFS-RDMA does that, by design. ;-)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
  2013-06-12  2:36   ` Tom Talpey
@ 2013-06-12 15:39     ` J. Bruce Fields
  2013-06-12 16:22       ` Sandeep Joshi
       [not found]       ` <CAEfL3KkdjB7bzvnfiDh024kHjCH0e64iH6GK6y+A+bpH3kUgJg@mail.gmail.com>
  0 siblings, 2 replies; 15+ messages in thread
From: J. Bruce Fields @ 2013-06-12 15:39 UTC (permalink / raw)
  To: Tom Talpey; +Cc: Sandeep Joshi, linux-nfs

On Tue, Jun 11, 2013 at 10:36:12PM -0400, Tom Talpey wrote:
> On 6/11/2013 3:51 PM, J. Bruce Fields wrote:
> >On Sun, Jun 09, 2013 at 01:05:16PM +0530, Sandeep Joshi wrote:
> >>Is there a reason as to why the nfsd server does not use splice in the
> >>write calls - nfsd_vfs_write() ?  Is there some structural limitation
> >>or is it just something nobody got around to implementing ?
> >>
> >>I have looked at the source back to the 2.6.x kernels and it seems
> >>only nfsd_vfs_read() has ever used splice/sendfile.
> >>http://lxr.linux.no/linux+v3.9.5/fs/nfsd/vfs.c
> >
> >I don't actually know how splice_write works.... I assume to avoid a
> >copy we'd have to place the incoming write data into pages already
> >correctly aligned.  That would be an interesting trick.
> 
> NFS-RDMA does that, by design. ;-)

So teaching nfsd to take advantage of splice in the rdma case could be
a doable project for someone?

I was thinking about NFSv4/tcp, in which case I guess we'd need to start
parsing the xdr in the tcp receive code.

--b.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
  2013-06-12 15:39     ` J. Bruce Fields
@ 2013-06-12 16:22       ` Sandeep Joshi
       [not found]       ` <CAEfL3KkdjB7bzvnfiDh024kHjCH0e64iH6GK6y+A+bpH3kUgJg@mail.gmail.com>
  1 sibling, 0 replies; 15+ messages in thread
From: Sandeep Joshi @ 2013-06-12 16:22 UTC (permalink / raw)
  To: linux-nfs

Bruce,

Splice can be implemented independent of RDMA.  It is supposed to
transfer pages between two file descriptors.  I found some postings on
lkml from 2006 where Linus says it is quite possible to splice from a
socket to a file.

See the paragraph:
" For filesystems, splice support tends to be really easy (both read
and write). For other things, it depends a bit. But unlike sendfile(),
it really is quite possible to splice _from_ a socket too, not just
_to_ a socket. But no, that case hasn't been written yet."
 http://yarchive.net/comp/linux/splice.html

Larry McVoy's 1997 proposal for adding splice support to the  kernel
can be read at ftp.tux.org/pub/sites/ftp.bitmover.com/pub/splice.ps.gz

Perhaps I should have opened this thread on lkml to determine if
splice from socket to file is still feasible..

-Sandeep

On Wed, Jun 12, 2013 at 9:09 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Tue, Jun 11, 2013 at 10:36:12PM -0400, Tom Talpey wrote:
> > On 6/11/2013 3:51 PM, J. Bruce Fields wrote:
> > >On Sun, Jun 09, 2013 at 01:05:16PM +0530, Sandeep Joshi wrote:
> > >>Is there a reason as to why the nfsd server does not use splice in the
> > >>write calls - nfsd_vfs_write() ?  Is there some structural limitation
> > >>or is it just something nobody got around to implementing ?
> > >>
> > >>I have looked at the source back to the 2.6.x kernels and it seems
> > >>only nfsd_vfs_read() has ever used splice/sendfile.
> > >>http://lxr.linux.no/linux+v3.9.5/fs/nfsd/vfs.c
> > >
> > >I don't actually know how splice_write works.... I assume to avoid a
> > >copy we'd have to place the incoming write data into pages already
> > >correctly aligned.  That would be an interesting trick.
> >
> > NFS-RDMA does that, by design. ;-)
>
> So teaching nfsd to take advantage of splice in the rdma case could be
> a doable project for someone?
>
> I was thinking about NFSv4/tcp, in which case I guess we'd need to start
> parsing the xdr in the tcp receive code.
>
> --b.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
       [not found]       ` <CAEfL3KkdjB7bzvnfiDh024kHjCH0e64iH6GK6y+A+bpH3kUgJg@mail.gmail.com>
@ 2013-06-12 16:46         ` J. Bruce Fields
  2013-06-14 12:09           ` Sandeep Joshi
  0 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2013-06-12 16:46 UTC (permalink / raw)
  To: Sandeep Joshi; +Cc: Tom Talpey, linux-nfs

On Wed, Jun 12, 2013 at 09:51:09PM +0530, Sandeep Joshi wrote:
> Splice can be implemented independent of RDMA.  It is supposed to transfer
> pages between two file descriptors.  I found some postings on lkml from
> 2006 where Linus says it is quite possible to splice from a socket to a
> file.
> 
> See the paragraph:
> " For filesystems, splice support tends to be really easy (both read and
> write). For other things, it depends a bit. But unlike sendfile(), it
> really is quite possible to splice _from_ a socket too, not just _to_ a
> socket. But no, that case hasn't been written yet."
>  http://yarchive.net/comp/linux/splice.html
> 
> Larry McVoy's 1997 proposal for adding splice support to the  kernel can be
> read at ftp.tux.org/pub/sites/ftp.bitmover.com/pub/*splice*.*ps*.gz<http://ftp.tux.org/pub/sites/ftp.bitmover.com/pub/splice.ps.gz>
> 
> Perhaps I should have opened this thread on lkml to determine if splice
> from socket to file is still feasible..

Right, the thing is, nfsd reads the rpc request from the socket into its
own buffers before it parses it.  If you want to move the data directly
out of the network buffers into the page cache, then you have to know at
what point the write data starts in the request--which I believe will
mean doing the xdr parsing (and gss decryption if necessary) as the
request comes in off the wire.

That sounds like a lot of work and even if you have someone willing to
do the work they'd also need to justify that it's worth it.

RDMA may have some protocol support that simplifies this, I don't know.

--b.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
  2013-06-12 16:46         ` J. Bruce Fields
@ 2013-06-14 12:09           ` Sandeep Joshi
  2013-06-14 19:22             ` Jeff Layton
  0 siblings, 1 reply; 15+ messages in thread
From: Sandeep Joshi @ 2013-06-14 12:09 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

On Wed, Jun 12, 2013 at 10:16 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Wed, Jun 12, 2013 at 09:51:09PM +0530, Sandeep Joshi wrote:
> > Splice can be implemented independent of RDMA.  It is supposed to
> > transfer
> > pages between two file descriptors.  I found some postings on lkml from
> > 2006 where Linus says it is quite possible to splice from a socket to a
> > file.
> >
> > See the paragraph:
> > " For filesystems, splice support tends to be really easy (both read and
> > write). For other things, it depends a bit. But unlike sendfile(), it
> > really is quite possible to splice _from_ a socket too, not just _to_ a
> > socket. But no, that case hasn't been written yet."
> >  http://yarchive.net/comp/linux/splice.html
> >
> > Larry McVoy's 1997 proposal for adding splice support to the  kernel can
> > be
> > read at
> > ftp.tux.org/pub/sites/ftp.bitmover.com/pub/*splice*.*ps*.gz<http://ftp.tux.org/pub/sites/ftp.bitmover.com/pub/splice.ps.gz>
> >
> > Perhaps I should have opened this thread on lkml to determine if splice
> > from socket to file is still feasible..
>
> Right, the thing is, nfsd reads the rpc request from the socket into its
> own buffers before it parses it.  If you want to move the data directly
> out of the network buffers into the page cache, then you have to know at
> what point the write data starts in the request--which I believe will
> mean doing the xdr parsing (and gss decryption if necessary) as the
> request comes in off the wire.
>
> That sounds like a lot of work and even if you have someone willing to
> do the work they'd also need to justify that it's worth it.
>
> RDMA may have some protocol support that simplifies this, I don't know.
>
> --b.

Hi Bruce,

> nfsd reads the rpc request from the socket into its own buffers before it parses it.

I am not intimate with the gss code but do you think the
svc_rqst->rq_pages[] can be spliced ?

-Sandeep

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
  2013-06-14 12:09           ` Sandeep Joshi
@ 2013-06-14 19:22             ` Jeff Layton
  2013-06-15  5:09               ` Myklebust, Trond
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2013-06-14 19:22 UTC (permalink / raw)
  To: Sandeep Joshi; +Cc: J. Bruce Fields, linux-nfs

On Fri, 14 Jun 2013 17:39:12 +0530
Sandeep Joshi <sanjos100@gmail.com> wrote:

> On Wed, Jun 12, 2013 at 10:16 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >
> > On Wed, Jun 12, 2013 at 09:51:09PM +0530, Sandeep Joshi wrote:
> > > Splice can be implemented independent of RDMA.  It is supposed to
> > > transfer
> > > pages between two file descriptors.  I found some postings on lkml from
> > > 2006 where Linus says it is quite possible to splice from a socket to a
> > > file.
> > >
> > > See the paragraph:
> > > " For filesystems, splice support tends to be really easy (both read and
> > > write). For other things, it depends a bit. But unlike sendfile(), it
> > > really is quite possible to splice _from_ a socket too, not just _to_ a
> > > socket. But no, that case hasn't been written yet."
> > >  http://yarchive.net/comp/linux/splice.html
> > >
> > > Larry McVoy's 1997 proposal for adding splice support to the  kernel can
> > > be
> > > read at
> > > ftp.tux.org/pub/sites/ftp.bitmover.com/pub/*splice*.*ps*.gz<http://ftp.tux.org/pub/sites/ftp.bitmover.com/pub/splice.ps.gz>
> > >
> > > Perhaps I should have opened this thread on lkml to determine if splice
> > > from socket to file is still feasible..
> >
> > Right, the thing is, nfsd reads the rpc request from the socket into its
> > own buffers before it parses it.  If you want to move the data directly
> > out of the network buffers into the page cache, then you have to know at
> > what point the write data starts in the request--which I believe will
> > mean doing the xdr parsing (and gss decryption if necessary) as the
> > request comes in off the wire.
> >
> > That sounds like a lot of work and even if you have someone willing to
> > do the work they'd also need to justify that it's worth it.
> >
> > RDMA may have some protocol support that simplifies this, I don't know.
> >
> > --b.
> 
> Hi Bruce,
> 
> > nfsd reads the rpc request from the socket into its own buffers before it parses it.
> 
> I am not intimate with the gss code but do you think the
> svc_rqst->rq_pages[] can be spliced ?
> 

Probably not in its current form. The problem is one of alignment. You
need to know where the write data actually starts before doing the
receive off the socket, so you can make sure that it ends up in the
correct spot in the pages you're going to splice in.

There's also the problem of what to do about WRITE requests that
contain data that isn't page aligned or that's shorter than a page...

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: why does nfsd write not use splice
  2013-06-14 19:22             ` Jeff Layton
@ 2013-06-15  5:09               ` Myklebust, Trond
  2013-06-17 11:01                 ` Jeff Layton
  0 siblings, 1 reply; 15+ messages in thread
From: Myklebust, Trond @ 2013-06-15  5:09 UTC (permalink / raw)
  To: Jeff Layton, Sandeep Joshi; +Cc: J. Bruce Fields, linux-nfs

> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
> owner@vger.kernel.org] On Behalf Of Jeff Layton
> Sent: Friday, June 14, 2013 3:22 PM
> To: Sandeep Joshi
> Cc: J. Bruce Fields; linux-nfs@vger.kernel.org
> Subject: Re: why does nfsd write not use splice
> 
> On Fri, 14 Jun 2013 17:39:12 +0530
> Sandeep Joshi <sanjos100@gmail.com> wrote:
> 
> > On Wed, Jun 12, 2013 at 10:16 PM, J. Bruce Fields <bfields@fieldses.org>
> wrote:
> > >
> > > On Wed, Jun 12, 2013 at 09:51:09PM +0530, Sandeep Joshi wrote:
> > > > Splice can be implemented independent of RDMA.  It is supposed to
> > > > transfer pages between two file descriptors.  I found some
> > > > postings on lkml from
> > > > 2006 where Linus says it is quite possible to splice from a socket
> > > > to a file.
> > > >
> > > > See the paragraph:
> > > > " For filesystems, splice support tends to be really easy (both
> > > > read and write). For other things, it depends a bit. But unlike
> > > > sendfile(), it really is quite possible to splice _from_ a socket
> > > > too, not just _to_ a socket. But no, that case hasn't been written yet."
> > > >  http://yarchive.net/comp/linux/splice.html
> > > >
> > > > Larry McVoy's 1997 proposal for adding splice support to the
> > > > kernel can be read at
> > > > ftp.tux.org/pub/sites/ftp.bitmover.com/pub/*splice*.*ps*.gz<http:/
> > > > /ftp.tux.org/pub/sites/ftp.bitmover.com/pub/splice.ps.gz>
> > > >
> > > > Perhaps I should have opened this thread on lkml to determine if
> > > > splice from socket to file is still feasible..
> > >
> > > Right, the thing is, nfsd reads the rpc request from the socket into
> > > its own buffers before it parses it.  If you want to move the data
> > > directly out of the network buffers into the page cache, then you
> > > have to know at what point the write data starts in the
> > > request--which I believe will mean doing the xdr parsing (and gss
> > > decryption if necessary) as the request comes in off the wire.
> > >
> > > That sounds like a lot of work and even if you have someone willing
> > > to do the work they'd also need to justify that it's worth it.
> > >
> > > RDMA may have some protocol support that simplifies this, I don't know.
> > >
> > > --b.
> >
> > Hi Bruce,
> >
> > > nfsd reads the rpc request from the socket into its own buffers before it
> parses it.
> >
> > I am not intimate with the gss code but do you think the
> > svc_rqst->rq_pages[] can be spliced ?
> >
> 
> Probably not in its current form. The problem is one of alignment. You need
> to know where the write data actually starts before doing the receive off the
> socket, so you can make sure that it ends up in the correct spot in the pages
> you're going to splice in.
> 
> There's also the problem of what to do about WRITE requests that contain
> data that isn't page aligned or that's shorter than a page...

Finally, there is the minor problem that the data that is actually received by the socket may be encrypted, or may need to be checksummed (krb5i) _before_ you can apply it to the file. That is not a particularly good fit for splice().

Trond

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
  2013-06-15  5:09               ` Myklebust, Trond
@ 2013-06-17 11:01                 ` Jeff Layton
  2013-06-17 11:48                   ` Myklebust, Trond
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2013-06-17 11:01 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: Sandeep Joshi, J. Bruce Fields, linux-nfs

On Sat, 15 Jun 2013 05:09:55 +0000
"Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:

> > -----Original Message-----
> > From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
> > owner@vger.kernel.org] On Behalf Of Jeff Layton
> > Sent: Friday, June 14, 2013 3:22 PM
> > To: Sandeep Joshi
> > Cc: J. Bruce Fields; linux-nfs@vger.kernel.org
> > Subject: Re: why does nfsd write not use splice
> > 
> > On Fri, 14 Jun 2013 17:39:12 +0530
> > Sandeep Joshi <sanjos100@gmail.com> wrote:
> > 
> > > On Wed, Jun 12, 2013 at 10:16 PM, J. Bruce Fields <bfields@fieldses.org>
> > wrote:
> > > >
> > > > On Wed, Jun 12, 2013 at 09:51:09PM +0530, Sandeep Joshi wrote:
> > > > > Splice can be implemented independent of RDMA.  It is supposed to
> > > > > transfer pages between two file descriptors.  I found some
> > > > > postings on lkml from
> > > > > 2006 where Linus says it is quite possible to splice from a socket
> > > > > to a file.
> > > > >
> > > > > See the paragraph:
> > > > > " For filesystems, splice support tends to be really easy (both
> > > > > read and write). For other things, it depends a bit. But unlike
> > > > > sendfile(), it really is quite possible to splice _from_ a socket
> > > > > too, not just _to_ a socket. But no, that case hasn't been written yet."
> > > > >  http://yarchive.net/comp/linux/splice.html
> > > > >
> > > > > Larry McVoy's 1997 proposal for adding splice support to the
> > > > > kernel can be read at
> > > > > ftp.tux.org/pub/sites/ftp.bitmover.com/pub/*splice*.*ps*.gz<http:/
> > > > > /ftp.tux.org/pub/sites/ftp.bitmover.com/pub/splice.ps.gz>
> > > > >
> > > > > Perhaps I should have opened this thread on lkml to determine if
> > > > > splice from socket to file is still feasible..
> > > >
> > > > Right, the thing is, nfsd reads the rpc request from the socket into
> > > > its own buffers before it parses it.  If you want to move the data
> > > > directly out of the network buffers into the page cache, then you
> > > > have to know at what point the write data starts in the
> > > > request--which I believe will mean doing the xdr parsing (and gss
> > > > decryption if necessary) as the request comes in off the wire.
> > > >
> > > > That sounds like a lot of work and even if you have someone willing
> > > > to do the work they'd also need to justify that it's worth it.
> > > >
> > > > RDMA may have some protocol support that simplifies this, I don't know.
> > > >
> > > > --b.
> > >
> > > Hi Bruce,
> > >
> > > > nfsd reads the rpc request from the socket into its own buffers before it
> > parses it.
> > >
> > > I am not intimate with the gss code but do you think the
> > > svc_rqst->rq_pages[] can be spliced ?
> > >
> > 
> > Probably not in its current form. The problem is one of alignment. You need
> > to know where the write data actually starts before doing the receive off the
> > socket, so you can make sure that it ends up in the correct spot in the pages
> > you're going to splice in.
> > 
> > There's also the problem of what to do about WRITE requests that contain
> > data that isn't page aligned or that's shorter than a page...
> 
> Finally, there is the minor problem that the data that is actually received by the socket may be encrypted, or may need to be checksummed (krb5i) _before_ you can apply it to the file. That is not a particularly good fit for splice().
> 

Encryption certainly can be a problem, but integrity isn't necessarily
one.

Basically the idea would be to receive the data off the socket into a
set of pages and then splice those into the correct spot in the local
file. In both the privacy and integrity cases, you just have an extra
step in between. Privacy *may* mean an extra copy too (though some of
the crypto routines can decrypt data in place), but handling integrity
shouldn't.

The tricky parts (I think) are determining how to lay out the received
data into the pages you eventually want to splice into the file before
you receive that data in, and how to deal with it when the WRITE
doesn't cover an entire page.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
  2013-06-17 11:01                 ` Jeff Layton
@ 2013-06-17 11:48                   ` Myklebust, Trond
  2013-06-17 14:34                     ` J. Bruce Fields
  2013-06-18 12:17                     ` Jeff Layton
  0 siblings, 2 replies; 15+ messages in thread
From: Myklebust, Trond @ 2013-06-17 11:48 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Myklebust, Trond, Sandeep Joshi, J. Bruce Fields, linux-nfs


On Jun 17, 2013, at 7:01 AM, Jeff Layton <jlayton@redhat.com>
 wrote:

> On Sat, 15 Jun 2013 05:09:55 +0000
> "Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:
> 
>>> -----Original Message-----
>>> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
>>> owner@vger.kernel.org] On Behalf Of Jeff Layton
>>> Sent: Friday, June 14, 2013 3:22 PM
>>> To: Sandeep Joshi
>>> Cc: J. Bruce Fields; linux-nfs@vger.kernel.org
>>> Subject: Re: why does nfsd write not use splice
>>> 
>>> On Fri, 14 Jun 2013 17:39:12 +0530
>>> Sandeep Joshi <sanjos100@gmail.com> wrote:
>>> 
>>>> On Wed, Jun 12, 2013 at 10:16 PM, J. Bruce Fields <bfields@fieldses.org>
>>> wrote:
>>>>> 
>>>>> On Wed, Jun 12, 2013 at 09:51:09PM +0530, Sandeep Joshi wrote:
>>>>>> Splice can be implemented independent of RDMA.  It is supposed to
>>>>>> transfer pages between two file descriptors.  I found some
>>>>>> postings on lkml from
>>>>>> 2006 where Linus says it is quite possible to splice from a socket
>>>>>> to a file.
>>>>>> 
>>>>>> See the paragraph:
>>>>>> " For filesystems, splice support tends to be really easy (both
>>>>>> read and write). For other things, it depends a bit. But unlike
>>>>>> sendfile(), it really is quite possible to splice _from_ a socket
>>>>>> too, not just _to_ a socket. But no, that case hasn't been written yet."
>>>>>> http://yarchive.net/comp/linux/splice.html
>>>>>> 
>>>>>> Larry McVoy's 1997 proposal for adding splice support to the
>>>>>> kernel can be read at
>>>>>> ftp.tux.org/pub/sites/ftp.bitmover.com/pub/*splice*.*ps*.gz<http:/
>>>>>> /ftp.tux.org/pub/sites/ftp.bitmover.com/pub/splice.ps.gz>
>>>>>> 
>>>>>> Perhaps I should have opened this thread on lkml to determine if
>>>>>> splice from socket to file is still feasible..
>>>>> 
>>>>> Right, the thing is, nfsd reads the rpc request from the socket into
>>>>> its own buffers before it parses it.  If you want to move the data
>>>>> directly out of the network buffers into the page cache, then you
>>>>> have to know at what point the write data starts in the
>>>>> request--which I believe will mean doing the xdr parsing (and gss
>>>>> decryption if necessary) as the request comes in off the wire.
>>>>> 
>>>>> That sounds like a lot of work and even if you have someone willing
>>>>> to do the work they'd also need to justify that it's worth it.
>>>>> 
>>>>> RDMA may have some protocol support that simplifies this, I don't know.
>>>>> 
>>>>> --b.
>>>> 
>>>> Hi Bruce,
>>>> 
>>>>> nfsd reads the rpc request from the socket into its own buffers before it
>>> parses it.
>>>> 
>>>> I am not intimate with the gss code but do you think the
>>>> svc_rqst->rq_pages[] can be spliced ?
>>>> 
>>> 
>>> Probably not in its current form. The problem is one of alignment. You need
>>> to know where the write data actually starts before doing the receive off the
>>> socket, so you can make sure that it ends up in the correct spot in the pages
>>> you're going to splice in.
>>> 
>>> There's also the problem of what to do about WRITE requests that contain
>>> data that isn't page aligned or that's shorter than a page...
>> 
>> Finally, there is the minor problem that the data that is actually received by the socket may be encrypted, or may need to be checksummed (krb5i) _before_ you can apply it to the file. That is not a particularly good fit for splice().
>> 
> 
> Encryption certainly can be a problem, but integrity isn't necessarily
> one.
> 
> Basically the idea would be to receive the data off the socket into a
> set of pages and then splice those into the correct spot in the local
> file. In both the privacy and integrity cases, you just have an extra
> step in between. Privacy *may* mean an extra copy too (though some of
> the crypto routines can decrypt data in place), but handling integrity
> shouldn't.
> 
> The tricky parts (I think) are determining how to lay out the received
> data into the pages you eventually want to splice into the file before
> you receive that data in, and how to deal with it when the WRITE
> doesn't cover an entire page.

Once you've copied the data one time, most of the advantage of splice() is gone, since a copy will then exist in processor cache memory and can be duplicated quickly.

Cheers
  Trond


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
  2013-06-17 11:48                   ` Myklebust, Trond
@ 2013-06-17 14:34                     ` J. Bruce Fields
  2013-06-17 14:36                       ` Myklebust, Trond
  2013-06-18 12:17                     ` Jeff Layton
  1 sibling, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2013-06-17 14:34 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: Jeff Layton, Sandeep Joshi, linux-nfs

On Mon, Jun 17, 2013 at 11:48:18AM +0000, Myklebust, Trond wrote:
> 
> On Jun 17, 2013, at 7:01 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > Encryption certainly can be a problem, but integrity isn't
> > necessarily one.
> > 
> > Basically the idea would be to receive the data off the socket into
> > a set of pages and then splice those into the correct spot in the
> > local file. In both the privacy and integrity cases, you just have
> > an extra step in between. Privacy *may* mean an extra copy too
> > (though some of the crypto routines can decrypt data in place), but
> > handling integrity shouldn't.
> > 
> > The tricky parts (I think) are determining how to lay out the
> > received data into the pages you eventually want to splice into the
> > file before you receive that data in, and how to deal with it when
> > the WRITE doesn't cover an entire page.
> 
> Once you've copied the data one time, most of the advantage of
> splice() is gone, since a copy will then exist in processor cache
> memory and can be duplicated quickly.

Well, worst case you could turn it off in krb5i/krb5p cases and maybe
still get some benefit in the auth_sys case?

I suspect it will be a fair amount of work just to get enough of a
prototype up that you can start to measure the benefit (if any).  So
this isn't going to happen without someone pretty committed to the idea.

(And such a person would be better off starting by describing the actual
probem they're trying to solve before jumping to the conclusion that
splice is the solution.)

--b.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: why does nfsd write not use splice
  2013-06-17 14:34                     ` J. Bruce Fields
@ 2013-06-17 14:36                       ` Myklebust, Trond
  2013-06-17 14:48                         ` J. Bruce Fields
  0 siblings, 1 reply; 15+ messages in thread
From: Myklebust, Trond @ 2013-06-17 14:36 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, Sandeep Joshi, linux-nfs

> -----Original Message-----
> From: J. Bruce Fields [mailto:bfields@fieldses.org]
> Sent: Monday, June 17, 2013 10:34 AM
> To: Myklebust, Trond
> Cc: Jeff Layton; Sandeep Joshi; linux-nfs@vger.kernel.org
> Subject: Re: why does nfsd write not use splice
> 
> On Mon, Jun 17, 2013 at 11:48:18AM +0000, Myklebust, Trond wrote:
> >
> > On Jun 17, 2013, at 7:01 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > > Encryption certainly can be a problem, but integrity isn't
> > > necessarily one.
> > >
> > > Basically the idea would be to receive the data off the socket into
> > > a set of pages and then splice those into the correct spot in the
> > > local file. In both the privacy and integrity cases, you just have
> > > an extra step in between. Privacy *may* mean an extra copy too
> > > (though some of the crypto routines can decrypt data in place), but
> > > handling integrity shouldn't.
> > >
> > > The tricky parts (I think) are determining how to lay out the
> > > received data into the pages you eventually want to splice into the
> > > file before you receive that data in, and how to deal with it when
> > > the WRITE doesn't cover an entire page.
> >
> > Once you've copied the data one time, most of the advantage of
> > splice() is gone, since a copy will then exist in processor cache
> > memory and can be duplicated quickly.
> 
> Well, worst case you could turn it off in krb5i/krb5p cases and maybe still get
> some benefit in the auth_sys case?
> 

Not if you need to copy in order to realign the data anyway...

Cheers
  Trond


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
  2013-06-17 14:36                       ` Myklebust, Trond
@ 2013-06-17 14:48                         ` J. Bruce Fields
  0 siblings, 0 replies; 15+ messages in thread
From: J. Bruce Fields @ 2013-06-17 14:48 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: Jeff Layton, Sandeep Joshi, linux-nfs

On Mon, Jun 17, 2013 at 02:36:20PM +0000, Myklebust, Trond wrote:
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:bfields@fieldses.org]
> > Sent: Monday, June 17, 2013 10:34 AM
> > To: Myklebust, Trond
> > Cc: Jeff Layton; Sandeep Joshi; linux-nfs@vger.kernel.org
> > Subject: Re: why does nfsd write not use splice
> > 
> > On Mon, Jun 17, 2013 at 11:48:18AM +0000, Myklebust, Trond wrote:
> > >
> > > On Jun 17, 2013, at 7:01 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > Encryption certainly can be a problem, but integrity isn't
> > > > necessarily one.
> > > >
> > > > Basically the idea would be to receive the data off the socket into
> > > > a set of pages and then splice those into the correct spot in the
> > > > local file. In both the privacy and integrity cases, you just have
> > > > an extra step in between. Privacy *may* mean an extra copy too
> > > > (though some of the crypto routines can decrypt data in place), but
> > > > handling integrity shouldn't.
> > > >
> > > > The tricky parts (I think) are determining how to lay out the
> > > > received data into the pages you eventually want to splice into the
> > > > file before you receive that data in, and how to deal with it when
> > > > the WRITE doesn't cover an entire page.
> > >
> > > Once you've copied the data one time, most of the advantage of
> > > splice() is gone, since a copy will then exist in processor cache
> > > memory and can be duplicated quickly.
> > 
> > Well, worst case you could turn it off in krb5i/krb5p cases and maybe still get
> > some benefit in the auth_sys case?
> > 
> 
> Not if you need to copy in order to realign the data anyway...

Right, so you get to rewrite the xdr code so that it can process at
least some part of the compound as it comes in.  Fun!

That's why I say it could take a lot of work even to get a prototype
sufficient to measure the effect of splice.  Though maybe you could find
some simple heuristic that would predict the offset of write data when
using your particular test setup....

--b.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: why does nfsd write not use splice
  2013-06-17 11:48                   ` Myklebust, Trond
  2013-06-17 14:34                     ` J. Bruce Fields
@ 2013-06-18 12:17                     ` Jeff Layton
  1 sibling, 0 replies; 15+ messages in thread
From: Jeff Layton @ 2013-06-18 12:17 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: Sandeep Joshi, J. Bruce Fields, linux-nfs

On Mon, 17 Jun 2013 11:48:18 +0000
"Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:

> 
> On Jun 17, 2013, at 7:01 AM, Jeff Layton <jlayton@redhat.com>
>  wrote:
> 
> > On Sat, 15 Jun 2013 05:09:55 +0000
> > "Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:
> > 
> >>> -----Original Message-----
> >>> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
> >>> owner@vger.kernel.org] On Behalf Of Jeff Layton
> >>> Sent: Friday, June 14, 2013 3:22 PM
> >>> To: Sandeep Joshi
> >>> Cc: J. Bruce Fields; linux-nfs@vger.kernel.org
> >>> Subject: Re: why does nfsd write not use splice
> >>> 
> >>> On Fri, 14 Jun 2013 17:39:12 +0530
> >>> Sandeep Joshi <sanjos100@gmail.com> wrote:
> >>> 
> >>>> On Wed, Jun 12, 2013 at 10:16 PM, J. Bruce Fields <bfields@fieldses.org>
> >>> wrote:
> >>>>> 
> >>>>> On Wed, Jun 12, 2013 at 09:51:09PM +0530, Sandeep Joshi wrote:
> >>>>>> Splice can be implemented independent of RDMA.  It is supposed to
> >>>>>> transfer pages between two file descriptors.  I found some
> >>>>>> postings on lkml from
> >>>>>> 2006 where Linus says it is quite possible to splice from a socket
> >>>>>> to a file.
> >>>>>> 
> >>>>>> See the paragraph:
> >>>>>> " For filesystems, splice support tends to be really easy (both
> >>>>>> read and write). For other things, it depends a bit. But unlike
> >>>>>> sendfile(), it really is quite possible to splice _from_ a socket
> >>>>>> too, not just _to_ a socket. But no, that case hasn't been written yet."
> >>>>>> http://yarchive.net/comp/linux/splice.html
> >>>>>> 
> >>>>>> Larry McVoy's 1997 proposal for adding splice support to the
> >>>>>> kernel can be read at
> >>>>>> ftp.tux.org/pub/sites/ftp.bitmover.com/pub/*splice*.*ps*.gz<http:/
> >>>>>> /ftp.tux.org/pub/sites/ftp.bitmover.com/pub/splice.ps.gz>
> >>>>>> 
> >>>>>> Perhaps I should have opened this thread on lkml to determine if
> >>>>>> splice from socket to file is still feasible..
> >>>>> 
> >>>>> Right, the thing is, nfsd reads the rpc request from the socket into
> >>>>> its own buffers before it parses it.  If you want to move the data
> >>>>> directly out of the network buffers into the page cache, then you
> >>>>> have to know at what point the write data starts in the
> >>>>> request--which I believe will mean doing the xdr parsing (and gss
> >>>>> decryption if necessary) as the request comes in off the wire.
> >>>>> 
> >>>>> That sounds like a lot of work and even if you have someone willing
> >>>>> to do the work they'd also need to justify that it's worth it.
> >>>>> 
> >>>>> RDMA may have some protocol support that simplifies this, I don't know.
> >>>>> 
> >>>>> --b.
> >>>> 
> >>>> Hi Bruce,
> >>>> 
> >>>>> nfsd reads the rpc request from the socket into its own buffers before it
> >>> parses it.
> >>>> 
> >>>> I am not intimate with the gss code but do you think the
> >>>> svc_rqst->rq_pages[] can be spliced ?
> >>>> 
> >>> 
> >>> Probably not in its current form. The problem is one of alignment. You need
> >>> to know where the write data actually starts before doing the receive off the
> >>> socket, so you can make sure that it ends up in the correct spot in the pages
> >>> you're going to splice in.
> >>> 
> >>> There's also the problem of what to do about WRITE requests that contain
> >>> data that isn't page aligned or that's shorter than a page...
> >> 
> >> Finally, there is the minor problem that the data that is actually received by the socket may be encrypted, or may need to be checksummed (krb5i) _before_ you can apply it to the file. That is not a particularly good fit for splice().
> >> 
> > 
> > Encryption certainly can be a problem, but integrity isn't necessarily
> > one.
> > 
> > Basically the idea would be to receive the data off the socket into a
> > set of pages and then splice those into the correct spot in the local
> > file. In both the privacy and integrity cases, you just have an extra
> > step in between. Privacy *may* mean an extra copy too (though some of
> > the crypto routines can decrypt data in place), but handling integrity
> > shouldn't.
> > 
> > The tricky parts (I think) are determining how to lay out the received
> > data into the pages you eventually want to splice into the file before
> > you receive that data in, and how to deal with it when the WRITE
> > doesn't cover an entire page.
> 
> Once you've copied the data one time, most of the advantage of splice() is gone, since a copy will then exist in processor cache memory and can be duplicated quickly.
> 

Good point. I'm not sure there's much you can do to avoid at least one
copy. When the data is received into the skbuff, the networking layer
won't have it aligned for an efficient splice.

ISTR that some Intel NICs actually have segmentation hardware that
purports to understand NFS read and write headers and can split the
header and data into different sets of pages.

In principle, you could try to use that capability to ensure that the
writedata got DMA'ed into a page aligned buffer that you could then
splice in. It wouldn't handle all cases so you'd probably still have to
copy sometimes, but it might help when you had page-aligned writes that
covered an entire page.

As you point out though, all of this is deep voodoo, and you'd have to
do a bunch of reengineering before you even had an inkling of how
worthwhile it is.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-06-18 12:17 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-09  7:35 why does nfsd write not use splice Sandeep Joshi
2013-06-11 19:51 ` J. Bruce Fields
2013-06-12  2:36   ` Tom Talpey
2013-06-12 15:39     ` J. Bruce Fields
2013-06-12 16:22       ` Sandeep Joshi
     [not found]       ` <CAEfL3KkdjB7bzvnfiDh024kHjCH0e64iH6GK6y+A+bpH3kUgJg@mail.gmail.com>
2013-06-12 16:46         ` J. Bruce Fields
2013-06-14 12:09           ` Sandeep Joshi
2013-06-14 19:22             ` Jeff Layton
2013-06-15  5:09               ` Myklebust, Trond
2013-06-17 11:01                 ` Jeff Layton
2013-06-17 11:48                   ` Myklebust, Trond
2013-06-17 14:34                     ` J. Bruce Fields
2013-06-17 14:36                       ` Myklebust, Trond
2013-06-17 14:48                         ` J. Bruce Fields
2013-06-18 12:17                     ` Jeff Layton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.