All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: linux-cifs-client Digest, Vol 70, Issue 25
       [not found] <mailman.1290.1254136357.2670.linux-cifs-client@lists.samba.org>
@ 2009-09-28 14:41 ` Steve French (smfltc)
  2009-09-28 15:54   ` Jeff Layton
  0 siblings, 1 reply; 8+ messages in thread
From: Steve French (smfltc) @ 2009-09-28 14:41 UTC (permalink / raw)
  To: linux-cifs-client, Jeff Layton, linux-nfs


>> This patchset is still preliminary and is just an RFC...
>>
>> First, some background. When I was at Connectathon this year, Trond
>> mentioned an interesting idea to me. He said (paraphrasing):
>>
>> "Why doesn't CIFS just use the RPC layer for transport? It's very
>> efficient at shoveling bits out onto the wire. You'd just need to
>> abstract out the XDR/RPC specific bits."
>>
>>     
My first reaction is that if you abstract out XDR/RPC specific parts of 
SunRPC it isn't SunRPC,
just a scheduler on top of tcp (not a bad thing in theory).   Pulling 
out the two key pieces from
SunRPC:
    - asynchronous event handling and scheduling
    - upcall for credentials
could be useful, but does add a lot of complexity.   If there is a way 
to use just the async
scheduling (and perhaps upcall) out of SunRPC, that part sounds fine as 
long as it
can skip the encoding/decoding and just pass in a raw kvec containing 
the SMB
header and data.

>>
>> CIFS in particular is also designed around synchronous ops, which
>> seriously limits throughput. Retrofitting it for asynchronous operation
>> will be adding even more kludges. 
>>     
There are only three operations that we can send asynchronous today, all 
of which require
special case handling in the VFS already:
    - readpages
   - writepages
   - blocking locks
(and also directory change notification which we and nfs don't do).   I 
think the "slow_work"
mechanism is probably sufficient for these cases already.

>>  works in our favor...
>> ------------------------------------------------------------------------
>> Q: can we hook up cifs or smbfs to use this as a transport?
>>
>> A: Not trivially. CIFS in particular is not designed with each call
>> having discrete encode and decode functions. They're sort of mashed
>>     
We certainly don't want to move to an abstract encoding mechanism, 
especially for SMB2
where there is only one encoding of wire operations, and no duplicate 
requests due
to 20 years of dialects.   I can see an argument for abstract encoding 
for requests
like SMB open, vs. SMB OpenX vs. SMB NTCreateX but this would be harder or
to abstract and has to be done case by case anyway due to differences in
field length, missing fields, different compensations.  It is not
like the simpler NFS case where encoding involves endian conversion etc.

>> ------------------------------------------------------------------------
>> Q: could we use this as a transport layer for a smb2fs ?
>>
>> A: Yes, I think so. This particular prototype is build around SMB1, but
>> SMB2 could be supported with only minor modifications. One of the
>> reasons for sending this patchset now before I've built a filesystem on
>> top of it is because I know that SMB2 work is in progress. I'd like to
>> see it based around a more asynchronous transport model, or at least
>> built with cleaner layering so that we can eventually bolt on a different
>> transport layer if we so choose.
>>     
Amost all the ops use "send_receive"  already - so there is no need to 
change the code much above
that if you want to experiment with changing the transport.   I like the 
idea of the
abtraction of async operations, and creating completion routines (and an 
async send
abstraction) for readpages,  writepages and directory change 
notification would make sense.
but in both cifs and smb2, the 95% of the operations that must be 
synchronous in
the VFS (open, lookup, unlink, create etc.) can already be hooked up to 
any transport
as long as it can send a kvec contain fs data and return a response 
(like the "send_receive"
and equivalent).

The idea of doing abstract translation and encoding of SMB protocol frames
does seem overengineered and probably would make it harder to 
read/understand
the setup of certain complex request frames which are quite different from
Samba to Windows.    As another example, generalized, abstract SMB frame
conversion isn't being done in Samba 3 for example, and with only
19 requests in SMB2 it makes even less sense.   On the client, since
we have control over which types of requests we send, our case
is simpler than for the server for sending requests, but in
response processing since we have to work around server bugs, xdr like
decoding of SMB responses could get harder still.

I like the idea of the way SunRPC keeps task information, and it may 
make it easier
to carry credentials around (although I think using Dave Howell's key 
management code
might be ok instead to access Winbind).   I am not sure how easy it 
would be to tie
SunRPC credential mapping to Winbind but that could probably be done.  I 
like the
async scheduling capability of SunRPC although I suspect that it is a 
factor in
a number of the (nfs client) performance problems we have seen so may 
need more work.
I don't like adding (in effect) an extra transport and "encoding layer" 
though to
protocols (cifs and smb2).   NFS since it is built on SunRPC on the 
wire, required
such a layer, and it makes sense for NFS to layer the code, like their 
protocol,
over SunRPC.   CIFS and SMB2 don't require (or even allow) XDR translation,
variable encodings, and SunRPC encapsulation so the idea of abstracting the
encoding of something that has a single defined encoding seems wrong.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: linux-cifs-client Digest, Vol 70, Issue 25
  2009-09-28 14:41 ` linux-cifs-client Digest, Vol 70, Issue 25 Steve French (smfltc)
@ 2009-09-28 15:54   ` Jeff Layton
       [not found]     ` <20090928115427.4103826e-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2009-09-28 15:54 UTC (permalink / raw)
  To: Steve French (smfltc); +Cc: linux-nfs, linux-cifs-client

On Mon, 28 Sep 2009 09:41:08 -0500
"Steve French (smfltc)" <smfltc@us.ibm.com> wrote:

> 
> >> This patchset is still preliminary and is just an RFC...
> >>
> >> First, some background. When I was at Connectathon this year, Trond
> >> mentioned an interesting idea to me. He said (paraphrasing):
> >>
> >> "Why doesn't CIFS just use the RPC layer for transport? It's very
> >> efficient at shoveling bits out onto the wire. You'd just need to
> >> abstract out the XDR/RPC specific bits."
> >>
> >>     
> My first reaction is that if you abstract out XDR/RPC specific parts of 
> SunRPC it isn't SunRPC,
> just a scheduler on top of tcp (not a bad thing in theory).   Pulling 
> out the two key pieces from
> SunRPC:
>     - asynchronous event handling and scheduling
>     - upcall for credentials
> could be useful, but does add a lot of complexity.   If there is a way 
> to use just the async
> scheduling (and perhaps upcall) out of SunRPC, that part sounds fine as 
> long as it
> can skip the encoding/decoding and just pass in a raw kvec containing 
> the SMB
> header and data.
> 

Well, the sunrpc layer currently contains a lot of pieces:

1) client side call/response handling (clnt routines)
2) server side call/response handling (svc routines)
3) XDR encoding and decoding routines (including crypto signatures, etc)

...the idea is to hook up new encoding and decoding routines and to add
a new "transport class" which will make the client-side scheduler handle
SMB/SMB2 properly.

We'll also eventually have to add new authentication/credential
"classes" too. I haven't researched that yet in any real depth, so I
can't state much about how difficult it'll be.
 
> >>
> >> CIFS in particular is also designed around synchronous ops, which
> >> seriously limits throughput. Retrofitting it for asynchronous operation
> >> will be adding even more kludges. 
> >>     
> There are only three operations that we can send asynchronous today, all 
> of which require
> special case handling in the VFS already:
>     - readpages
>    - writepages
>    - blocking locks
> (and also directory change notification which we and nfs don't do).   I 
> think the "slow_work"
> mechanism is probably sufficient for these cases already.
> 

The problem is that rolling a mechanism to handle asynchronous ops is
difficult to get right. I think it makes a lot of sense to reuse a
proven engine here. It also makes a lot of sense to implement
synchronous ops on top of an asynchronous infrastructure. RPC does this
under the hood, and so did smbfs.

What you're proposing, in effect, is to do this in reverse -- implement
an asynchronous transport engine using synchronous ops and offloading
the background parts onto threads. That's possible I suppose, but it
means you have a lot of tasks sleeping in the kernel and waiting for
stuff to happen.

> >>  works in our favor...
> >> ------------------------------------------------------------------------
> >> Q: can we hook up cifs or smbfs to use this as a transport?
> >>
> >> A: Not trivially. CIFS in particular is not designed with each call
> >> having discrete encode and decode functions. They're sort of mashed
> >>     
> We certainly don't want to move to an abstract encoding mechanism, 
> especially for SMB2
> where there is only one encoding of wire operations, and no duplicate 
> requests due
> to 20 years of dialects. I can see an argument for abstract encoding 
> for requests
> like SMB open, vs. SMB OpenX vs. SMB NTCreateX but this would be harder or
> to abstract and has to be done case by case anyway due to differences in
> field length, missing fields, different compensations.  It is not
> like the simpler NFS case where encoding involves endian conversion etc.
> 

I'm not sure what you mean by this. Assembling an SMB header and call
is very similar to assembling an RPC header and call. There are
differences of course, but they aren't that substantial.

SMB does introduce some more interesting wrinkles. For instance, since
state is tied up with the actual socket connection, we'll probably need
callbacks into the fs for socket state changes. That doesn't have much
to do with how you abstract out the encoding and decoding though.

> >> ------------------------------------------------------------------------
> >> Q: could we use this as a transport layer for a smb2fs ?
> >>
> >> A: Yes, I think so. This particular prototype is build around SMB1, but
> >> SMB2 could be supported with only minor modifications. One of the
> >> reasons for sending this patchset now before I've built a filesystem on
> >> top of it is because I know that SMB2 work is in progress. I'd like to
> >> see it based around a more asynchronous transport model, or at least
> >> built with cleaner layering so that we can eventually bolt on a different
> >> transport layer if we so choose.
> >>     
> Amost all the ops use "send_receive"  already - so there is no need to 
> change the code much above
> that if you want to experiment with changing the transport.   I like the 
> idea of the
> abtraction of async operations, and creating completion routines (and an 
> async send
> abstraction) for readpages,  writepages and directory change 
> notification would make sense.
> but in both cifs and smb2, the 95% of the operations that must be 
> synchronous in
> the VFS (open, lookup, unlink, create etc.) can already be hooked up to 
> any transport
> as long as it can send a kvec contain fs data and return a response 
> (like the "send_receive"
> and equivalent).
> 

The problem with the send_receive interface is that it assumes that the
encoding, send and decoding will be done by the same task. I think that
assumption will greatly limit this code later and force you to rely on
workarounds (like slow_work) to get asynchronous behavior.

At the very least, I suggest splitting off the decode portions into
separate functions. That at least should allow you the ability later to
offload that part to other tasks (similar to how async tasks get
offloaded to rpciod).

> The idea of doing abstract translation and encoding of SMB protocol frames
> does seem overengineered and probably would make it harder to 
> read/understand
> the setup of certain complex request frames which are quite different from
> Samba to Windows.    As another example, generalized, abstract SMB frame
> conversion isn't being done in Samba 3 for example, and with only
> 19 requests in SMB2 it makes even less sense.   On the client, since
> we have control over which types of requests we send, our case
> is simpler than for the server for sending requests, but in
> response processing since we have to work around server bugs, xdr like
> decoding of SMB responses could get harder still.
> 

Again, I don't see SMB as being that different from NFS in this regard.
You have a transport header (similar to the fraghdr with NFS TCP
transport), then a protocol header (the SMB/SMB2 header), and then
call-specific information. RPC/NFS works exactly the same way.

The code I've proposed more or less has abstractions along those lines.
There's:

transport class -- (net/sunrpc/xprtsmb.c)
header encoding/decoding -- (net/sunrpc/smb.c)

...the other parts will be implemented by the filesystem (similar to
how fs/nfs/nfs?xdr.c work).

> I like the idea of the way SunRPC keeps task information, and it may 
> make it easier
> to carry credentials around (although I think using Dave Howell's key 
> management code
> might be ok instead to access Winbind).   I am not sure how easy it 
> would be to tie
> SunRPC credential mapping to Winbind but that could probably be done.  I 
> like the
> async scheduling capability of SunRPC although I suspect that it is a 
> factor in
> a number of the (nfs client) performance problems we have seen so may 
> need more work.
> I don't like adding (in effect) an extra transport and "encoding layer" 
> though to
> protocols (cifs and smb2).   NFS since it is built on SunRPC on the 
> wire, required
> such a layer, and it makes sense for NFS to layer the code, like their 
> protocol,
> over SunRPC.   CIFS and SMB2 don't require (or even allow) XDR translation,
> variable encodings, and SunRPC encapsulation so the idea of abstracting the
> encoding of something that has a single defined encoding seems wrong.

I'm not sure I understand this last comment. CIFS/SMB2 and NFS are just
not that different in this regard. Either way you have to marshal up
the buffer correctly before you send it, and decode it properly.

Is it possible to go too far with layering and abstraction? Sure. I
don't think I've done that here though.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 0/9] sunrpc: teach the SUNRPC layer how to speak SMB
       [not found]     ` <20090928115427.4103826e-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2009-09-28 18:40       ` Steve French (smfltc)
  2009-09-28 19:37         ` Jeff Layton
  0 siblings, 1 reply; 8+ messages in thread
From: Steve French (smfltc) @ 2009-09-28 18:40 UTC (permalink / raw)
  To: Jeff Layton, linux-cifs-client, linux-nfs

Jeff Layton wrote:
> On Mon, 28 Sep 2009 09:41:08 -0500
> "Steve French (smfltc)" <smfltc@us.ibm.com> wrote:
>
>   
>>>> This patchset is still preliminary and is just an RFC...
>>>>
>>>> First, some background. When I was at Connectathon this year, Trond
>>>> mentioned an interesting idea to me. He said (paraphrasing):
>>>>
>>>> "Why doesn't CIFS just use the RPC layer for transport? It's very
>>>> efficient at shoveling bits out onto the wire. You'd just need to
>>>> abstract out the XDR/RPC specific bits."
>>>>
>>>>     
>>>>         
>> My first reaction is that if you abstract out XDR/RPC specific parts of 
>> SunRPC it isn't SunRPC,
>> just a scheduler on top of tcp (not a bad thing in theory).   Pulling 
>> out the two key pieces from
>> SunRPC:
>>     - asynchronous event handling and scheduling
>>     - upcall for credentials
>> could be useful, but does add a lot of complexity.   If there is a way 
>> to use just the async
>> scheduling (and perhaps upcall) out of SunRPC, that part sounds fine as 
>> long as it
>> can skip the encoding/decoding and just pass in a raw kvec containing 
>> the SMB
>> header and data.
>>
>>     
>
> Well, the sunrpc layer currently contains a lot of pieces:
>
> 1) client side call/response handling (clnt routines)
> 2) server side call/response handling (svc routines)
> 3) XDR encoding and decoding routines (including crypto signatures, etc)
>
> ...the idea is to hook up new encoding and decoding routines and to add
> a new "transport class" which will make the client-side scheduler handle
> SMB/SMB2 properly.
>
> We'll also eventually have to add new authentication/credential
> "classes" too. I haven't researched that yet in any real depth, so I
> can't state much about how difficult it'll be.
>
>   
>>>> CIFS in particular is also designed around synchronous ops, which
>>>> seriously limits throughput. Retrofitting it for asynchronous operation
>>>> will be adding even more kludges. 
>>>>     
>>>>         
>> There are only three operations that we can send asynchronous today, all 
>> of which require
>> special case handling in the VFS already:
>>     - readpages
>>    - writepages
>>    - blocking locks
>> (and also directory change notification which we and nfs don't do).   I 
>> think the "slow_work"
>> mechanism is probably sufficient for these cases already.
>>
>>     
>
> The problem is that rolling a mechanism to handle asynchronous ops is
> difficult to get right. I think it makes a lot of sense to reuse a
> proven engine here. It also makes a lot of sense to implement
> synchronous ops on top of an asynchronous infrastructure. RPC does this
> under the hood, and so did smbfs.
>
> What you're proposing, in effect, is to do this in reverse -- implement
> an asynchronous transport engine using synchronous ops and offloading
> the background parts onto threads. That's possible I suppose, but it
> means you have a lot of tasks sleeping in the kernel and waiting for
> stuff to happen.
>
>   
I don't think it changes the number of tasks sleeping.   For sync 
operations it doesn't
change how many tasks that are sleeping.   For async operations, you no 
longer have a
task sleeping in cifs_writepages or cifs_readpages, but do have the 
ability to dispatch
a decode routine when the SMB Write or Read response is returned (may or
may not have a pool to do this).   Seems like about the same number of 
task (possibly
less).   Moving to all asynchronous operations under open, unlink, close 
doesn't reduce
the number of sleeping tasks (it may increase it or stay the same).
>>>>  works in our favor...
>>>> ------------------------------------------------------------------------
>>>> Q: can we hook up cifs or smbfs to use this as a transport?
>>>>
>>>> A: Not trivially. CIFS in particular is not designed with each call
>>>> having discrete encode and decode functions. They're sort of mashed
>>>>     
>>>>         
>> We certainly don't want to move to an abstract encoding mechanism, 
>> especially for SMB2
>> where there is only one encoding of wire operations, and no duplicate 
>> requests due
>> to 20 years of dialects. I can see an argument for abstract encoding 
>> for requests
>> like SMB open, vs. SMB OpenX vs. SMB NTCreateX but this would be harder or
>> to abstract and has to be done case by case anyway due to differences in
>> field length, missing fields, different compensations.  It is not
>> like the simpler NFS case where encoding involves endian conversion etc.
>>
>>     
>
> I'm not sure what you mean by this. Assembling an SMB header and call
> is very similar to assembling an RPC header and call. There are
> differences of course, but they aren't that substantial.
>
> SMB does introduce some more interesting wrinkles. For instance, since
> state is tied up with the actual socket connection, we'll probably need
> callbacks into the fs for socket state changes. That doesn't have much
> to do with how you abstract out the encoding and decoding though.
>
>   
>>>> ------------------------------------------------------------------------
>>>> Q: could we use this as a transport layer for a smb2fs ?
>>>>
>>>> A: Yes, I think so. This particular prototype is build around SMB1, but
>>>> SMB2 could be supported with only minor modifications. One of the
>>>> reasons for sending this patchset now before I've built a filesystem on
>>>> top of it is because I know that SMB2 work is in progress. I'd like to
>>>> see it based around a more asynchronous transport model, or at least
>>>> built with cleaner layering so that we can eventually bolt on a different
>>>> transport layer if we so choose.
>>>>     
>>>>         
>> Amost all the ops use "send_receive"  already - so there is no need to 
>> change the code much above
>> that if you want to experiment with changing the transport.   I like the 
>> idea of the
>> abtraction of async operations, and creating completion routines (and an 
>> async send
>> abstraction) for readpages,  writepages and directory change 
>> notification would make sense.
>> but in both cifs and smb2, the 95% of the operations that must be 
>> synchronous in
>> the VFS (open, lookup, unlink, create etc.) can already be hooked up to 
>> any transport
>> as long as it can send a kvec contain fs data and return a response 
>> (like the "send_receive"
>> and equivalent).
>>
>>     
>
> The problem with the send_receive interface is that it assumes that the
> encoding, send and decoding will be done by the same task. I think that
> assumption will greatly limit this code later and force you to rely on
> workarounds (like slow_work) to get asynchronous behavior.
>
> At the very least, I suggest splitting off the decode portions into
> separate functions. That at least should allow you the ability later to
> offload that part to other tasks (similar to how async tasks get
> offloaded to rpciod).
>
>   
That (splitting decode into a distinct helper) makes sense (at least for 
async capable ops,
in particular write, read and directory change notification and byte 
range locks).  The
"smb_decode_write_response" case is a particularly simple and useful one 
to do and would
be an easy one to do as a test.   I think the prototype patches for 
async write that someone did
for cifs a few years ago did that.
>> The idea of doing abstract translation and encoding of SMB protocol frames
>> does seem overengineered and probably would make it harder to 
>> read/understand
>> the setup of certain complex request frames which are quite different from
>> Samba to Windows.    As another example, generalized, abstract SMB frame
>> conversion isn't being done in Samba 3 for example, and with only
>> 19 requests in SMB2 it makes even less sense.   On the client, since
>> we have control over which types of requests we send, our case
>> is simpler than for the server for sending requests, but in
>> response processing since we have to work around server bugs, xdr like
>> decoding of SMB responses could get harder still.
>>
>>     
>
> Again, I don't see SMB as being that different from NFS in this regard.
> You have a transport header (similar to the fraghdr with NFS TCP
> transport), then a protocol header (the SMB/SMB2 header), and then
> call-specific information. RPC/NFS works exactly the same way.
>
>   
NFS and CIFS/SMB2 seem pretty different to me in this regard.   CIFS and 
SMB2
are much simpler - for these once protocols unlike nfs,  after you assemble
the CIFS or SMB2 header (possibly with a data area as in SMB Write)
you simply add a 4 byte length (actually 3 byte length, 0 byte 0) and
you send it - no endian conversions, xdr, no adding 80 bytes or so rpc 
prefix. 
SunRPC adds a whole layer (not just a length field) with credentials.   
SunRPC
typically adds 80 bytes (or more depending on auth flavor) before the 
nfs frame
(cifs and smb2 frame don't need this, so the net/sunrpc code which 
handles the
80 bytes or so of rpc header before the network fs frame
is not used for the cifs code)

In SMB2 the pacing is part of the SMB packet, not the transport packet, 
and in
SMB2 any reconnection code which would be built into the proposed modified
SunRPC transport would probably have to be aware of the new handle types
and locking operations and state that have been added in SMB2.1 which may
break the abstraction between network fs and SunRPC transport

> The code I've proposed more or less has abstractions along those lines.
> There's:
>
> transport class -- (net/sunrpc/xprtsmb.c)
> header encoding/decoding -- (net/sunrpc/smb.c)
>
> ...the other parts will be implemented by the filesystem (similar to
> how fs/nfs/nfs?xdr.c work).
>
>   
>> I like the idea of the way SunRPC keeps task information, and it may 
>> make it easier
>> to carry credentials around (although I think using Dave Howell's key 
>> management code
>> might be ok instead to access Winbind).   I am not sure how easy it 
>> would be to tie
>> SunRPC credential mapping to Winbind but that could probably be done.  I 
>> like the
>> async scheduling capability of SunRPC although I suspect that it is a 
>> factor in
>> a number of the (nfs client) performance problems we have seen so may 
>> need more work.
>> I don't like adding (in effect) an extra transport and "encoding layer" 
>> though to
>> protocols (cifs and smb2).   NFS since it is built on SunRPC on the 
>> wire, required
>> such a layer, and it makes sense for NFS to layer the code, like their 
>> protocol,
>> over SunRPC.   CIFS and SMB2 don't require (or even allow) XDR translation,
>> variable encodings, and SunRPC encapsulation so the idea of abstracting the
>> encoding of something that has a single defined encoding seems wrong.
>>     
>
> I'm not sure I understand this last comment. CIFS/SMB2 and NFS are just
> not that different in this regard. Either way you have to marshal up
> the buffer correctly before you send it, and decode it properly.
>
>   

As an example of one of the more complicated cases for cifs (setting a 
time field).  
The code looks something like this and is very straightforward to see where
the "linux field" is being put into the packet - and to catch errors in 
size or
endianness.   Most cases are simpler for cifs.

struct file_basic_info buf;  /* the wire format of the file basic info 
SMB info level */

buf->LastAccessTime = 0;  /* we can't set access time to Windows */
buf->LastWriteTime = cpu_to_le64(cifs_UnixTimeToNT(attrs->ia_atime));
(followed by a few similar assignment statements for the remaining fields)

then send the SMB header followed by the buffer.   There is no marshalling
or xdr conversion needed.   One thing I like about this approach is that
"sparse" (make modules C=1) immediately catches any mismatch
between the wire format (size or endianness) and the vfs data
structure element being put in the frame.

Looking at nfs4xdr.c as an example for comparison, it is much harder to 
see the actual line
where a bigendian (or littlenedian) 64 bit quantity is being put into 
the request frame.

Splitting encode and decode routines probably would make code more
reasonable - but converting from a "linux file system thing" to an 
abstract structure
and then to one of many wire possibilities for a frame - doesn't make 
much sense
for the case where there is only one wire encoding (SMB2) or for cases
where a single operation maps to more than one frame in some dialects
but not others (as in SMB) and so can't be encoded abstractly.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 0/9] sunrpc: teach the SUNRPC layer how to speak SMB
  2009-09-28 18:40       ` [RFC PATCH 0/9] sunrpc: teach the SUNRPC layer how to speak SMB Steve French (smfltc)
@ 2009-09-28 19:37         ` Jeff Layton
       [not found]           ` <20090928153737.65c1ba3e-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2009-09-28 19:37 UTC (permalink / raw)
  To: Steve French (smfltc); +Cc: linux-cifs-client, linux-nfs

On Mon, 28 Sep 2009 13:40:23 -0500
"Steve French (smfltc)" <smfltc@us.ibm.com> wrote:

> Jeff Layton wrote:
> > On Mon, 28 Sep 2009 09:41:08 -0500
> > "Steve French (smfltc)" <smfltc@us.ibm.com> wrote:
> >
> >   
> >>>> This patchset is still preliminary and is just an RFC...
> >>>>
> >>>> First, some background. When I was at Connectathon this year, Trond
> >>>> mentioned an interesting idea to me. He said (paraphrasing):
> >>>>
> >>>> "Why doesn't CIFS just use the RPC layer for transport? It's very
> >>>> efficient at shoveling bits out onto the wire. You'd just need to
> >>>> abstract out the XDR/RPC specific bits."
> >>>>
> >>>>     
> >>>>         
> >> My first reaction is that if you abstract out XDR/RPC specific parts of 
> >> SunRPC it isn't SunRPC,
> >> just a scheduler on top of tcp (not a bad thing in theory).   Pulling 
> >> out the two key pieces from
> >> SunRPC:
> >>     - asynchronous event handling and scheduling
> >>     - upcall for credentials
> >> could be useful, but does add a lot of complexity.   If there is a way 
> >> to use just the async
> >> scheduling (and perhaps upcall) out of SunRPC, that part sounds fine as 
> >> long as it
> >> can skip the encoding/decoding and just pass in a raw kvec containing 
> >> the SMB
> >> header and data.
> >>
> >>     
> >
> > Well, the sunrpc layer currently contains a lot of pieces:
> >
> > 1) client side call/response handling (clnt routines)
> > 2) server side call/response handling (svc routines)
> > 3) XDR encoding and decoding routines (including crypto signatures, etc)
> >
> > ...the idea is to hook up new encoding and decoding routines and to add
> > a new "transport class" which will make the client-side scheduler handle
> > SMB/SMB2 properly.
> >
> > We'll also eventually have to add new authentication/credential
> > "classes" too. I haven't researched that yet in any real depth, so I
> > can't state much about how difficult it'll be.
> >
> >   
> >>>> CIFS in particular is also designed around synchronous ops, which
> >>>> seriously limits throughput. Retrofitting it for asynchronous operation
> >>>> will be adding even more kludges. 
> >>>>     
> >>>>         
> >> There are only three operations that we can send asynchronous today, all 
> >> of which require
> >> special case handling in the VFS already:
> >>     - readpages
> >>    - writepages
> >>    - blocking locks
> >> (and also directory change notification which we and nfs don't do).   I 
> >> think the "slow_work"
> >> mechanism is probably sufficient for these cases already.
> >>
> >>     
> >
> > The problem is that rolling a mechanism to handle asynchronous ops is
> > difficult to get right. I think it makes a lot of sense to reuse a
> > proven engine here. It also makes a lot of sense to implement
> > synchronous ops on top of an asynchronous infrastructure. RPC does this
> > under the hood, and so did smbfs.
> >
> > What you're proposing, in effect, is to do this in reverse -- implement
> > an asynchronous transport engine using synchronous ops and offloading
> > the background parts onto threads. That's possible I suppose, but it
> > means you have a lot of tasks sleeping in the kernel and waiting for
> > stuff to happen.
> >
> >   
> I don't think it changes the number of tasks sleeping.   For sync 
> operations it doesn't
> change how many tasks that are sleeping.   For async operations, you no 
> longer have a
> task sleeping in cifs_writepages or cifs_readpages, but do have the 
> ability to dispatch
> a decode routine when the SMB Write or Read response is returned (may or
> may not have a pool to do this).   Seems like about the same number of 
> task (possibly
> less).   Moving to all asynchronous operations under open, unlink, close 
> doesn't reduce
> the number of sleeping tasks (it may increase it or stay the same).

I think it does. Take the example of asynchronous writepages requests.
You want to send a bunch of write calls in quick succession and then
process the replies as they come in. If you rely on a SendReceive-type
interface, you have no choice but to spawn a separate slow_work task
for each call on the wire. No guarantee they'll run in parallel though
(depends on the rules for spawning new kslowd threads).

Now, you could hack up a routine that just sends the calls, sprinkle in
a callback routine that's run by cifsd or something when the writes
come back (though you'll need to be wary of deadlock, of course).

...or you could just start with a transport layer that's designed for
this from the beginning. That's what I'm suggesting that we do.

Heck, samba does the exact same thing. It's transport layer is
fundamentally asynchronous too (built around the tevent lib).


> >>>>  works in our favor...
> >>>> ------------------------------------------------------------------------
> >>>> Q: can we hook up cifs or smbfs to use this as a transport?
> >>>>
> >>>> A: Not trivially. CIFS in particular is not designed with each call
> >>>> having discrete encode and decode functions. They're sort of mashed
> >>>>     
> >>>>         
> >> We certainly don't want to move to an abstract encoding mechanism, 
> >> especially for SMB2
> >> where there is only one encoding of wire operations, and no duplicate 
> >> requests due
> >> to 20 years of dialects. I can see an argument for abstract encoding 
> >> for requests
> >> like SMB open, vs. SMB OpenX vs. SMB NTCreateX but this would be harder or
> >> to abstract and has to be done case by case anyway due to differences in
> >> field length, missing fields, different compensations.  It is not
> >> like the simpler NFS case where encoding involves endian conversion etc.
> >>
> >>     
> >
> > I'm not sure what you mean by this. Assembling an SMB header and call
> > is very similar to assembling an RPC header and call. There are
> > differences of course, but they aren't that substantial.
> >
> > SMB does introduce some more interesting wrinkles. For instance, since
> > state is tied up with the actual socket connection, we'll probably need
> > callbacks into the fs for socket state changes. That doesn't have much
> > to do with how you abstract out the encoding and decoding though.
> >
> >   
> >>>> ------------------------------------------------------------------------
> >>>> Q: could we use this as a transport layer for a smb2fs ?
> >>>>
> >>>> A: Yes, I think so. This particular prototype is build around SMB1, but
> >>>> SMB2 could be supported with only minor modifications. One of the
> >>>> reasons for sending this patchset now before I've built a filesystem on
> >>>> top of it is because I know that SMB2 work is in progress. I'd like to
> >>>> see it based around a more asynchronous transport model, or at least
> >>>> built with cleaner layering so that we can eventually bolt on a different
> >>>> transport layer if we so choose.
> >>>>     
> >>>>         
> >> Amost all the ops use "send_receive"  already - so there is no need to 
> >> change the code much above
> >> that if you want to experiment with changing the transport.   I like the 
> >> idea of the
> >> abtraction of async operations, and creating completion routines (and an 
> >> async send
> >> abstraction) for readpages,  writepages and directory change 
> >> notification would make sense.
> >> but in both cifs and smb2, the 95% of the operations that must be 
> >> synchronous in
> >> the VFS (open, lookup, unlink, create etc.) can already be hooked up to 
> >> any transport
> >> as long as it can send a kvec contain fs data and return a response 
> >> (like the "send_receive"
> >> and equivalent).
> >>
> >>     
> >
> > The problem with the send_receive interface is that it assumes that the
> > encoding, send and decoding will be done by the same task. I think that
> > assumption will greatly limit this code later and force you to rely on
> > workarounds (like slow_work) to get asynchronous behavior.
> >
> > At the very least, I suggest splitting off the decode portions into
> > separate functions. That at least should allow you the ability later to
> > offload that part to other tasks (similar to how async tasks get
> > offloaded to rpciod).
> >
> >   
> That (splitting decode into a distinct helper) makes sense (at least for 
> async capable ops,
> in particular write, read and directory change notification and byte 
> range locks).  The
> "smb_decode_write_response" case is a particularly simple and useful one 
> to do and would
> be an easy one to do as a test.   I think the prototype patches for 
> async write that someone did
> for cifs a few years ago did that.

Even for ops that are fundamentally synchronous, it makes sense to
split the encoding and decoding into separate routines. Maybe it's just
my personal bias, but I think it encourages a cleaner, more flexible
design.

> >> The idea of doing abstract translation and encoding of SMB protocol frames
> >> does seem overengineered and probably would make it harder to 
> >> read/understand
> >> the setup of certain complex request frames which are quite different from
> >> Samba to Windows.    As another example, generalized, abstract SMB frame
> >> conversion isn't being done in Samba 3 for example, and with only
> >> 19 requests in SMB2 it makes even less sense.   On the client, since
> >> we have control over which types of requests we send, our case
> >> is simpler than for the server for sending requests, but in
> >> response processing since we have to work around server bugs, xdr like
> >> decoding of SMB responses could get harder still.
> >>
> >>     
> >
> > Again, I don't see SMB as being that different from NFS in this regard.
> > You have a transport header (similar to the fraghdr with NFS TCP
> > transport), then a protocol header (the SMB/SMB2 header), and then
> > call-specific information. RPC/NFS works exactly the same way.
> >
> >   
> NFS and CIFS/SMB2 seem pretty different to me in this regard.   CIFS and 
> SMB2
> are much simpler - for these once protocols unlike nfs,  after you assemble
> the CIFS or SMB2 header (possibly with a data area as in SMB Write)
> you simply add a 4 byte length (actually 3 byte length, 0 byte 0) and
> you send it - no endian conversions, xdr, no adding 80 bytes or so rpc 
> prefix. 
> SunRPC adds a whole layer (not just a length field) with credentials.   
> SunRPC
> typically adds 80 bytes (or more depending on auth flavor) before the 
> nfs frame
> (cifs and smb2 frame don't need this, so the net/sunrpc code which 
> handles the
> 80 bytes or so of rpc header before the network fs frame
> is not used for the cifs code)
> 
> In SMB2 the pacing is part of the SMB packet, not the transport packet, 
> and in
> SMB2 any reconnection code which would be built into the proposed modified
> SunRPC transport would probably have to be aware of the new handle types
> and locking operations and state that have been added in SMB2.1 which may
> break the abstraction between network fs and SunRPC transport
> 

CIFS/SMB most definitely need to do endianness conversions -- BE arches
have to convert nearly everything.

Though honestly, you're splitting hairs here. All of these details are
meaningless. In NFS there are areas that are simpler to handle than SMB
(like the header parsing of the response)...and there are parts that
are more difficult (fragheader handling, authentication).

The fact of that matter is that they are just not that different. There
are distinct parallels to many of the important pieces:

SMB transport header is 32 bits and contains a length with a zeroed out
upper byte. The RPC over TCP transport header is 32 bits and contains a
31 bit length + a "last fragment" flag.

SMB has MID's, RPC has XID's

RPC has the RPC auth info, SMB has the session UID and TID

...the parallels go on...

We'll of course need a different set of arguments to build a SMB header
vs. a RPC one. That information comes from the filesystem anyway and
it'll know what sort of arguments it needs to pass in.

> > The code I've proposed more or less has abstractions along those lines.
> > There's:
> >
> > transport class -- (net/sunrpc/xprtsmb.c)
> > header encoding/decoding -- (net/sunrpc/smb.c)
> >
> > ...the other parts will be implemented by the filesystem (similar to
> > how fs/nfs/nfs?xdr.c work).
> >
> >   
> >> I like the idea of the way SunRPC keeps task information, and it may 
> >> make it easier
> >> to carry credentials around (although I think using Dave Howell's key 
> >> management code
> >> might be ok instead to access Winbind).   I am not sure how easy it 
> >> would be to tie
> >> SunRPC credential mapping to Winbind but that could probably be done.  I 
> >> like the
> >> async scheduling capability of SunRPC although I suspect that it is a 
> >> factor in
> >> a number of the (nfs client) performance problems we have seen so may 
> >> need more work.
> >> I don't like adding (in effect) an extra transport and "encoding layer" 
> >> though to
> >> protocols (cifs and smb2).   NFS since it is built on SunRPC on the 
> >> wire, required
> >> such a layer, and it makes sense for NFS to layer the code, like their 
> >> protocol,
> >> over SunRPC.   CIFS and SMB2 don't require (or even allow) XDR translation,
> >> variable encodings, and SunRPC encapsulation so the idea of abstracting the
> >> encoding of something that has a single defined encoding seems wrong.
> >>     
> >
> > I'm not sure I understand this last comment. CIFS/SMB2 and NFS are just
> > not that different in this regard. Either way you have to marshal up
> > the buffer correctly before you send it, and decode it properly.
> >
> >   
> 
> As an example of one of the more complicated cases for cifs (setting a 
> time field).  
> The code looks something like this and is very straightforward to see where
> the "linux field" is being put into the packet - and to catch errors in 
> size or
> endianness.   Most cases are simpler for cifs.
> 
> struct file_basic_info buf;  /* the wire format of the file basic info 
> SMB info level */
> 
> buf->LastAccessTime = 0;  /* we can't set access time to Windows */
> buf->LastWriteTime = cpu_to_le64(cifs_UnixTimeToNT(attrs->ia_atime));
> (followed by a few similar assignment statements for the remaining fields)
> 
> then send the SMB header followed by the buffer.   There is no marshalling
> or xdr conversion needed.   One thing I like about this approach is that
> "sparse" (make modules C=1) immediately catches any mismatch
> between the wire format (size or endianness) and the vfs data
> structure element being put in the frame.
> 
> Looking at nfs4xdr.c as an example for comparison, it is much harder to 
> see the actual line
> where a bigendian (or littlenedian) 64 bit quantity is being put into 
> the request frame.
> 

Woah...I think we just have a difference in terminology here. When I
say "encoding" the packet, I just mean that we're stuffing the data
into the buffer in the format that the wire expects.

Now, whether you do that with packed structures (like CIFS does and
which I tend to prefer), or via "hand-marshaling" (like the WRITE/READ
macros in the nfs?xdr.c files) is simply an implementation detail.

In fact, if you take a look at the sample "smbtest" module, you'll
notice that the negotiate protocol packets are "encoded" and "decoded"
using packed structures.

In truth, we could probably utilize packed structures in most of the
NFS xdr code too, it just hasn't traditionally been done that way...

> Splitting encode and decode routines probably would make code more
> reasonable - but converting from a "linux file system thing" to an 
> abstract structure
> and then to one of many wire possibilities for a frame - doesn't make 
> much sense
> for the case where there is only one wire encoding (SMB2) or for cases
> where a single operation maps to more than one frame in some dialects
> but not others (as in SMB) and so can't be encoded abstractly.

The transport engine doesn't (and shouldn't) care about any of this.
All it needs to know is how to encode the packet given a set of
arguments, how to send it, and what to do with the reply. The rest is
all the job of the actual filesystem.

The only difference here from what CIFS does today is that this is
handled via a discrete set of interfaces to the transport engine rather
than just calling SendReceive with an already layed-out buffer.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 0/9] sunrpc: teach the SUNRPC layer how to speak SMB
       [not found]           ` <20090928153737.65c1ba3e-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2009-09-28 21:40             ` Steve French (smfltc)
  2009-09-29 11:30               ` Jeff Layton
  0 siblings, 1 reply; 8+ messages in thread
From: Steve French (smfltc) @ 2009-09-28 21:40 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-cifs-client, linux-nfs

Jeff Layton wrote:
> On Mon, 28 Sep 2009 13:40:23 -0500
> "Steve French (smfltc)" <smfltc@us.ibm.com> wrote:
>
>   
>> Jeff Layton wrote:
>>     
>>> On Mon, 28 Sep 2009 09:41:08 -0500
>>> "Steve French (smfltc)" <smfltc@us.ibm.com> wrote:
>>>
>>>     
>>>       
>>>   
>>>       
>> I don't think it changes the number of tasks sleeping.   For sync 
>> operations it doesn't
>> change how many tasks that are sleeping.   For async operations, you no 
>> longer have a
>> task sleeping in cifs_writepages or cifs_readpages, but do have the 
>> ability to dispatch
>> a decode routine when the SMB Write or Read response is returned (may or
>> may not have a pool to do this).   Seems like about the same number of 
>> task (possibly
>> less).   Moving to all asynchronous operations under open, unlink, close 
>> doesn't reduce
>> the number of sleeping tasks (it may increase it or stay the same).
>>     
>
> I think it does. Take the example of asynchronous writepages requests.
> You want to send a bunch of write calls in quick succession and then
> process the replies as they come in. If you rely on a SendReceive-type
> interface, you have no choice but to spawn a separate slow_work task
> for each call on the wire. No guarantee they'll run in parallel though
> (depends on the rules for spawning new kslowd threads).
>
> Now, you could hack up a routine that just sends the calls, sprinkle in
> a callback routine that's run by cifsd or something when the writes
> come back (though you'll need to be wary of deadlock, of course).
>
>   
I was assuming the latter ... ie that cifsd processes the completion (or 
spawns a slow
work task to process the response).   For the case of write (actually 
writepages)
there is not processing to be done (checking the rc, updating the status of
the corresponding pages in the page cache).   For read there are various 
choices
(launching a slow work thread in readpages - but you only need one for what
is potentially a very large number of reads from that readpages call) or 
processing
in cifs or launching a slow work thread to process (my current 
preference since
this is just cache updates/readahead anyway).

>>
>> That (splitting decode into a distinct helper) makes sense (at least for 
>> async capable ops,
>> in particular write, read and directory change notification and byte 
>> range locks).  The
>> "smb_decode_write_response" case is a particularly simple and useful one 
>> to do and would
>> be an easy one to do as a test.   I think the prototype patches for 
>> async write that someone did
>> for cifs a few years ago did that.
>>     
>
> Even for ops that are fundamentally synchronous, it makes sense to
> split the encoding and decoding into separate routines. Maybe it's just
> my personal bias, but I think it encourages a cleaner, more flexible
> design.
>   
I don't have a strong preference either way.   There are a large number 
of SMBs which
are basically "responseless" (we just look at the rc, but they don't 
include parameters)
so they would not have interesting decode routines.    Even for SMB2 
although we
probably have to look at the credits, the generic SMB response 
processing can
probably deal with those.   For those SMBs which have response processing,
if it makes for smaller more readable functions to break out the decode 
routines, that seems
fine.

> SMB transport header is 32 bits and contains a length with a zeroed out
> upper byte. The RPC over TCP transport header is 32 bits and contains a
> 31 bit length + a "last fragment" flag.
>
> SMB has MID's, RPC has XID's
>
> RPC has the RPC auth info, SMB has the session UID and TID
>
>
>   
There are some similar concepts but at different levels of the network 
stack.  Having
net/sunrpc assemble SMB headers (uid, mids) would be needed if you 
really want
to leverage the similar concepts, and it seems like a strange idea to 
move much of fs/cifs
into net/sunrpc helpers

>>> The code I've proposed more or less has abstractions along those lines.
>>> There's:
>>>
>>> transport class -- (net/sunrpc/xprtsmb.c)
>>> header encoding/decoding -- (net/sunrpc/smb.c)
>>>
>>> ...the other parts will be implemented by the filesystem (similar to
>>> how fs/nfs/nfs?xdr.c work).
>>>
>>>   
>>>       
>>>> I like the idea of the way SunRPC keeps task information, and it may 
>>>> make it easier
>>>> to carry credentials around (although I think using Dave Howell's key 
>>>> management code
>>>> might be ok instead to access Winbind).   I am not sure how easy it 
>>>> would be to tie
>>>> SunRPC credential mapping to Winbind but that could probably be done.  I 
>>>> like the
>>>> async scheduling capability of SunRPC although I suspect that it is a 
>>>> factor in
>>>> a number of the (nfs client) performance problems we have seen so may 
>>>> need more work.
>>>> I don't like adding (in effect) an extra transport and "encoding layer" 
>>>> though to
>>>> protocols (cifs and smb2).   NFS since it is built on SunRPC on the 
>>>> wire, required
>>>> such a layer, and it makes sense for NFS to layer the code, like their 
>>>> protocol,
>>>> over SunRPC.   CIFS and SMB2 don't require (or even allow) XDR translation,
>>>> variable encodings, and SunRPC encapsulation so the idea of abstracting the
>>>> encoding of something that has a single defined encoding seems wrong.
>>>>     
>>>>         
>>> I'm not sure I understand this last comment. CIFS/SMB2 and NFS are just
>>> not that different in this regard. Either way you have to marshal up
>>> the buffer correctly before you send it, and decode it properly.
>>>
>>>   
>>>       
>> As an example of one of the more complicated cases for cifs (setting a 
>> time field).  
>> The code looks something like this and is very straightforward to see where
>> the "linux field" is being put into the packet - and to catch errors in 
>> size or
>> endianness.   Most cases are simpler for cifs.
>>
>> struct file_basic_info buf;  /* the wire format of the file basic info 
>> SMB info level */
>>
>> buf->LastAccessTime = 0;  /* we can't set access time to Windows */
>> buf->LastWriteTime = cpu_to_le64(cifs_UnixTimeToNT(attrs->ia_atime));
>> (followed by a few similar assignment statements for the remaining fields)
>>
>> then send the SMB header followed by the buffer.   There is no marshalling
>> or xdr conversion needed.   One thing I like about this approach is that
>> "sparse" (make modules C=1) immediately catches any mismatch
>> between the wire format (size or endianness) and the vfs data
>> structure element being put in the frame.
>>
>> Looking at nfs4xdr.c as an example for comparison, it is much harder to 
>> see the actual line
>> where a bigendian (or littlenedian) 64 bit quantity is being put into 
>> the request frame.
>>
>>     
>
> Woah...I think we just have a difference in terminology here. When I
> say "encoding" the packet, I just mean that we're stuffing the data
> into the buffer in the format that the wire expects.
>
> Now, whether you do that with packed structures (like CIFS does and
> which I tend to prefer), or via "hand-marshaling" (like the WRITE/READ
> macros in the nfs?xdr.c files) is simply an implementation detail.
>   
If we are stuffing the data into the buffer in the format the wire 
expects ie a series of assignment
statements more or less as we do today (but broken into distinct encode 
and decode helpers)

pSMB->SomeField = cpu_to_le64(inode->some_attribute_field);

and simply using SunRPC as a convenient wrapper around the socket API for
async dispatch that is probably fine (not sure if it is slower or faster 
... and
RPC has various performance limitations like the RPC_SLOT_COUNT)
although not sure it is worth the trouble.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 0/9] sunrpc: teach the SUNRPC layer how to speak SMB
  2009-09-28 21:40             ` Steve French (smfltc)
@ 2009-09-29 11:30               ` Jeff Layton
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Layton @ 2009-09-29 11:30 UTC (permalink / raw)
  To: Steve French (smfltc); +Cc: linux-nfs, linux-cifs-client

On Mon, 28 Sep 2009 16:40:35 -0500
"Steve French (smfltc)" <smfltc@us.ibm.com> wrote:

> Jeff Layton wrote:
> > On Mon, 28 Sep 2009 13:40:23 -0500
> > "Steve French (smfltc)" <smfltc@us.ibm.com> wrote:
> >
> >   
> >> Jeff Layton wrote:
> >>     
> >>> On Mon, 28 Sep 2009 09:41:08 -0500
> >>> "Steve French (smfltc)" <smfltc@us.ibm.com> wrote:
> >>>
> >>>     
> >>>       
> >>>   
> >>>       
> >> I don't think it changes the number of tasks sleeping.   For sync 
> >> operations it doesn't
> >> change how many tasks that are sleeping.   For async operations, you no 
> >> longer have a
> >> task sleeping in cifs_writepages or cifs_readpages, but do have the 
> >> ability to dispatch
> >> a decode routine when the SMB Write or Read response is returned (may or
> >> may not have a pool to do this).   Seems like about the same number of 
> >> task (possibly
> >> less).   Moving to all asynchronous operations under open, unlink, close 
> >> doesn't reduce
> >> the number of sleeping tasks (it may increase it or stay the same).
> >>     
> >
> > I think it does. Take the example of asynchronous writepages requests.
> > You want to send a bunch of write calls in quick succession and then
> > process the replies as they come in. If you rely on a SendReceive-type
> > interface, you have no choice but to spawn a separate slow_work task
> > for each call on the wire. No guarantee they'll run in parallel though
> > (depends on the rules for spawning new kslowd threads).
> >
> > Now, you could hack up a routine that just sends the calls, sprinkle in
> > a callback routine that's run by cifsd or something when the writes
> > come back (though you'll need to be wary of deadlock, of course).
> >
> >   
> I was assuming the latter ... ie that cifsd processes the completion (or 
> spawns a slow
> work task to process the response).   For the case of write (actually 
> writepages)
> there is not processing to be done (checking the rc, updating the status of
> the corresponding pages in the page cache).   For read there are various 
> choices
> (launching a slow work thread in readpages - but you only need one for what
> is potentially a very large number of reads from that readpages call) or 
> processing
> in cifs or launching a slow work thread to process (my current 
> preference since
> this is just cache updates/readahead anyway).
> 

Fair enough. In fact, I have a patch in my for-2.6.33 branch that
generalizes the callback routine when cifsd matches a response to a
MID. That may help get that moving.

Still, the fact of the matter is that cifs carries a thread (cifsd)
that's wouldn't truly be needed were the responses to be processed in
softirq context as sunrpc does.

> >>
> >> That (splitting decode into a distinct helper) makes sense (at least for 
> >> async capable ops,
> >> in particular write, read and directory change notification and byte 
> >> range locks).  The
> >> "smb_decode_write_response" case is a particularly simple and useful one 
> >> to do and would
> >> be an easy one to do as a test.   I think the prototype patches for 
> >> async write that someone did
> >> for cifs a few years ago did that.
> >>     
> >
> > Even for ops that are fundamentally synchronous, it makes sense to
> > split the encoding and decoding into separate routines. Maybe it's just
> > my personal bias, but I think it encourages a cleaner, more flexible
> > design.
> >   
> I don't have a strong preference either way.   There are a large number 
> of SMBs which
> are basically "responseless" (we just look at the rc, but they don't 
> include parameters)
> so they would not have interesting decode routines.    Even for SMB2 
> although we
> probably have to look at the credits, the generic SMB response 
> processing can
> probably deal with those.   For those SMBs which have response processing,
> if it makes for smaller more readable functions to break out the decode 
> routines, that seems
> fine.
> 

Sure, it's the same way with NFS too. For those we can probably just
use a standard "decode" routine that just looks at the error code. Code
consolidation is a good thing.

> > SMB transport header is 32 bits and contains a length with a zeroed out
> > upper byte. The RPC over TCP transport header is 32 bits and contains a
> > 31 bit length + a "last fragment" flag.
> >
> > SMB has MID's, RPC has XID's
> >
> > RPC has the RPC auth info, SMB has the session UID and TID
> >
> >
> >   
> There are some similar concepts but at different levels of the network 
> stack.  Having
> net/sunrpc assemble SMB headers (uid, mids) would be needed if you 
> really want
> to leverage the similar concepts, and it seems like a strange idea to 
> move much of fs/cifs
> into net/sunrpc helpers
> 

Are they really at different levels of the network stack? SMB/SMB2
carries a little more info within the protocol header than we need at
the transport layer, but it's not really that big a deal. The code I
have just copies off the data until it gets to the MID.

Assembling UID's/TID's etc for a packet I see as more a function for
the credential/auth management code in the RPC layer. Essentially I
think I'll have to write a "RPC_AUTH_SMB" rpc_auth plugin for it and
probably also a corresponding credential piece.

It'll be quite a bit different from how it's handled with "real" RPC
protocols, but should still fit within the basic framework.
 
> >>> The code I've proposed more or less has abstractions along those lines.
> >>> There's:
> >>>
> >>> transport class -- (net/sunrpc/xprtsmb.c)
> >>> header encoding/decoding -- (net/sunrpc/smb.c)
> >>>
> >>> ...the other parts will be implemented by the filesystem (similar to
> >>> how fs/nfs/nfs?xdr.c work).
> >>>
> >>>   
> >>>       
> >>>> I like the idea of the way SunRPC keeps task information, and it may 
> >>>> make it easier
> >>>> to carry credentials around (although I think using Dave Howell's key 
> >>>> management code
> >>>> might be ok instead to access Winbind).   I am not sure how easy it 
> >>>> would be to tie
> >>>> SunRPC credential mapping to Winbind but that could probably be done.  I 
> >>>> like the
> >>>> async scheduling capability of SunRPC although I suspect that it is a 
> >>>> factor in
> >>>> a number of the (nfs client) performance problems we have seen so may 
> >>>> need more work.
> >>>> I don't like adding (in effect) an extra transport and "encoding layer" 
> >>>> though to
> >>>> protocols (cifs and smb2).   NFS since it is built on SunRPC on the 
> >>>> wire, required
> >>>> such a layer, and it makes sense for NFS to layer the code, like their 
> >>>> protocol,
> >>>> over SunRPC.   CIFS and SMB2 don't require (or even allow) XDR translation,
> >>>> variable encodings, and SunRPC encapsulation so the idea of abstracting the
> >>>> encoding of something that has a single defined encoding seems wrong.
> >>>>     
> >>>>         
> >>> I'm not sure I understand this last comment. CIFS/SMB2 and NFS are just
> >>> not that different in this regard. Either way you have to marshal up
> >>> the buffer correctly before you send it, and decode it properly.
> >>>
> >>>   
> >>>       
> >> As an example of one of the more complicated cases for cifs (setting a 
> >> time field).  
> >> The code looks something like this and is very straightforward to see where
> >> the "linux field" is being put into the packet - and to catch errors in 
> >> size or
> >> endianness.   Most cases are simpler for cifs.
> >>
> >> struct file_basic_info buf;  /* the wire format of the file basic info 
> >> SMB info level */
> >>
> >> buf->LastAccessTime = 0;  /* we can't set access time to Windows */
> >> buf->LastWriteTime = cpu_to_le64(cifs_UnixTimeToNT(attrs->ia_atime));
> >> (followed by a few similar assignment statements for the remaining fields)
> >>
> >> then send the SMB header followed by the buffer.   There is no marshalling
> >> or xdr conversion needed.   One thing I like about this approach is that
> >> "sparse" (make modules C=1) immediately catches any mismatch
> >> between the wire format (size or endianness) and the vfs data
> >> structure element being put in the frame.
> >>
> >> Looking at nfs4xdr.c as an example for comparison, it is much harder to 
> >> see the actual line
> >> where a bigendian (or littlenedian) 64 bit quantity is being put into 
> >> the request frame.
> >>
> >>     
> >
> > Woah...I think we just have a difference in terminology here. When I
> > say "encoding" the packet, I just mean that we're stuffing the data
> > into the buffer in the format that the wire expects.
> >
> > Now, whether you do that with packed structures (like CIFS does and
> > which I tend to prefer), or via "hand-marshaling" (like the WRITE/READ
> > macros in the nfs?xdr.c files) is simply an implementation detail.
> >   
> If we are stuffing the data into the buffer in the format the wire 
> expects ie a series of assignment
> statements more or less as we do today (but broken into distinct encode 
> and decode helpers)
> 
> pSMB->SomeField = cpu_to_le64(inode->some_attribute_field);
> 
> and simply using SunRPC as a convenient wrapper around the socket API for
> async dispatch that is probably fine (not sure if it is slower or faster 
> ... and
> RPC has various performance limitations like the RPC_SLOT_COUNT)
> although not sure it is worth the trouble.

Why would it be any extra trouble when you're designing something new
in the first place? As I said above, trying to bolt CIFS onto this is
probably going to be more trouble than its worth. SMB2 is another
matter though.

When it comes to encoding, all of the header and call encoding routines
have a defined set of args and return values. How you actually do the
encoding is up to the person writing the implementation. I too think
that packed structures are easier to deal with and would probably
prefer those over hand-marshaling.

CIFS has a "slot count" limitation too (we default to 50 outstanding
calls). In fact, for the prototype here, I made the slot count default
to 50 because of that precedent.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH 0/9] sunrpc: teach the SUNRPC layer how to speak SMB
  2009-09-27 16:50 Jeff Layton
@ 2009-09-28  2:21 ` Shirish Pargaonkar
  0 siblings, 0 replies; 8+ messages in thread
From: Shirish Pargaonkar @ 2009-09-28  2:21 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs, linux-cifs-client


[-- Attachment #1.1: Type: text/plain, Size: 6415 bytes --]

On Sun, Sep 27, 2009 at 11:50 AM, Jeff Layton <jlayton@redhat.com> wrote:

> This patchset is still preliminary and is just an RFC...
>
> First, some background. When I was at Connectathon this year, Trond
> mentioned an interesting idea to me. He said (paraphrasing):
>
> "Why doesn't CIFS just use the RPC layer for transport? It's very
> efficient at shoveling bits out onto the wire. You'd just need to
> abstract out the XDR/RPC specific bits."
>
> The idea has been floating around in the back of my head until recently,
> when I decided to give it a go. This patchset represents a first, rough
> draft at a patchset to do this. There's also a proof of concept module
> that demonstrates that it works as expected.
>
> The patchset is still very rough. Reconnect behavior is still very
> "RPC-ish", for instance. There are doubtless other changes that will
> need to be made before I had anything merge-worthy.
>
> At this point, I'm just interested in feedback on the general approach.
> Should I continue to pursue this or is it a non-starter for some
> reason?
>
> The next step here, obviously is to start building a fs on top of it.
> I'd particularly be interested in using this as a foundation of a
> new smb2fs.
>
> I've also got this patchset up on my public git tree:
>
>    http://git.samba.org/?p=jlayton/cifs.git;a=summary
>
> Here are some questions I anticipate on this, and their answers:
>
> ------------------------------------------------------------------------
> Q: Are you insane? Why would you attempt to do this?
>
> A: Maybe...but in the last couple of years, I've spent a substantial
> amount of time working on the CIFS code. Much of that time has been
> spent fixing bugs. Many of those bugs exist in the low-level transport
> code which has been hacked on, kludged around and hand tweaked to
> where it is today. Unfortunately that's made it a very hard to work
> on mess. This drives away potential developers.
>
> CIFS in particular is also designed around synchronous ops, which
> seriously limits throughput. Retrofitting it for asynchronous operation
> will be adding even more kludges. The sunrpc code is already
> fundamentally asynchronous.
> ------------------------------------------------------------------------
> Q: Okay, what's the benefit of hooking it up to sunrpc rather than
> building a new transport layer (or fixing the transport in the other two
> filesystems)?
>
> A: Using sunrpc allows us to share a lot of the rpc scheduler code with
> sunrpc. At a high level, NFS/RPC and SMB aren't really very different.
> Sure, they have different formats, and one is big endian on the wire and
> the other isn't...still there are significant similarities.
>
> We also get access to the great upcall mechanisms that sunrpc has, and
> the possibility to share code like the gssapi upcalls. The sunrpc layer
> has a credential and authentication management framework that we can
> build on to make a truly multiuser CIFS/SMB filesystem.
>
> I've heard it claimed before that Linux's sunrpc layer is
> over-engineered, but in this case that works in our favor...
> ------------------------------------------------------------------------
> Q: can we hook up cifs or smbfs to use this as a transport?
>
> A: Not trivially. CIFS in particular is not designed with each call
> having discrete encode and decode functions. They're sort of mashed
> together. smbfs might be possible...I'm a little less familiar with it,
> but it does have a transport layer that more closely resembles the
> sunrpc one. Still though, it'd take significant work to make that
> happen. I'm not opposed to the idea however.
>
> In the end though, I think we'll probably need to design something new
> to sit on top of this. We will probably be able to borrow code and
> concepts from the other filesystems however.
> ------------------------------------------------------------------------
> Q: could we use this as a transport layer for a smb2fs ?
>
> A: Yes, I think so. This particular prototype is build around SMB1, but
> SMB2 could be supported with only minor modifications. One of the
> reasons for sending this patchset now before I've built a filesystem on
> top of it is because I know that SMB2 work is in progress. I'd like to
> see it based around a more asynchronous transport model, or at least
> built with cleaner layering so that we can eventually bolt on a different
> transport layer if we so choose.
>
> Jeff Layton (9):
>  sunrpc: abstract out encoding function at rpc_clnt level
>  sunrpc: move check for too small reply size into rpc_verify_header
>  sunrpc: abstract our call decoding routine
>  sunrpc: move rpc_xdr_buf_init to clnt.h
>  sunrpc: make call_bind non-static
>  sunrpc: add new SMB transport class for sunrpc
>  sunrpc: add encoding and decoding routines for SMB
>  sunrpc: add Kconfig option for CONFIG_SUNRPC_SMB
>  smbtest: simple module for testing SMB/RPC code
>
>  fs/Makefile                    |    2 +
>  fs/lockd/host.c                |    4 +
>  fs/lockd/mon.c                 |    4 +
>  fs/nfs/client.c                |    4 +
>  fs/nfs/mount_clnt.c            |    4 +
>  fs/nfsd/nfs4callback.c         |    4 +
>  fs/smbtest/Makefile            |    1 +
>  fs/smbtest/smbtest.c           |  204 +++++
>  include/linux/sunrpc/clnt.h    |   24 +-
>  include/linux/sunrpc/smb.h     |   42 +
>  include/linux/sunrpc/xprtsmb.h |   59 ++
>  net/sunrpc/Kconfig             |   11 +
>  net/sunrpc/Makefile            |    1 +
>  net/sunrpc/clnt.c              |   98 ++-
>  net/sunrpc/rpcb_clnt.c         |    8 +
>  net/sunrpc/smb.c               |  120 +++
>  net/sunrpc/sunrpc_syms.c       |    3 +
>  net/sunrpc/xprtsmb.c           | 1723
> ++++++++++++++++++++++++++++++++++++++++
>  18 files changed, 2272 insertions(+), 44 deletions(-)
>  create mode 100644 fs/smbtest/Makefile
>  create mode 100644 fs/smbtest/smbtest.c
>  create mode 100644 include/linux/sunrpc/smb.h
>  create mode 100644 include/linux/sunrpc/xprtsmb.h
>  create mode 100644 net/sunrpc/smb.c
>  create mode 100644 net/sunrpc/xprtsmb.c
>
> _______________________________________________
> linux-cifs-client mailing list
> linux-cifs-client@lists.samba.org
> https://lists.samba.org/mailman/listinfo/linux-cifs-client
>


Jeff,

So servers need to speak SUNRPC as well right?
Are_threre/how_many servers are out there that speak CIFS/SMB over SUNRPC or
SMB2 over SUNRPC?

[-- Attachment #1.2: Type: text/html, Size: 7291 bytes --]

[-- Attachment #2: Type: text/plain, Size: 172 bytes --]

_______________________________________________
linux-cifs-client mailing list
linux-cifs-client@lists.samba.org
https://lists.samba.org/mailman/listinfo/linux-cifs-client

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC PATCH 0/9] sunrpc: teach the SUNRPC layer how to speak SMB
@ 2009-09-27 16:50 Jeff Layton
  2009-09-28  2:21 ` Shirish Pargaonkar
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2009-09-27 16:50 UTC (permalink / raw)
  To: linux-nfs, linux-cifs-client

This patchset is still preliminary and is just an RFC...

First, some background. When I was at Connectathon this year, Trond
mentioned an interesting idea to me. He said (paraphrasing):

"Why doesn't CIFS just use the RPC layer for transport? It's very
efficient at shoveling bits out onto the wire. You'd just need to
abstract out the XDR/RPC specific bits."

The idea has been floating around in the back of my head until recently,
when I decided to give it a go. This patchset represents a first, rough
draft at a patchset to do this. There's also a proof of concept module
that demonstrates that it works as expected.

The patchset is still very rough. Reconnect behavior is still very
"RPC-ish", for instance. There are doubtless other changes that will
need to be made before I had anything merge-worthy.

At this point, I'm just interested in feedback on the general approach.
Should I continue to pursue this or is it a non-starter for some
reason?

The next step here, obviously is to start building a fs on top of it.
I'd particularly be interested in using this as a foundation of a
new smb2fs.

I've also got this patchset up on my public git tree:

    http://git.samba.org/?p=jlayton/cifs.git;a=summary

Here are some questions I anticipate on this, and their answers:

------------------------------------------------------------------------
Q: Are you insane? Why would you attempt to do this?

A: Maybe...but in the last couple of years, I've spent a substantial
amount of time working on the CIFS code. Much of that time has been
spent fixing bugs. Many of those bugs exist in the low-level transport
code which has been hacked on, kludged around and hand tweaked to
where it is today. Unfortunately that's made it a very hard to work
on mess. This drives away potential developers.

CIFS in particular is also designed around synchronous ops, which
seriously limits throughput. Retrofitting it for asynchronous operation
will be adding even more kludges. The sunrpc code is already
fundamentally asynchronous.
------------------------------------------------------------------------
Q: Okay, what's the benefit of hooking it up to sunrpc rather than
building a new transport layer (or fixing the transport in the other two
filesystems)?

A: Using sunrpc allows us to share a lot of the rpc scheduler code with
sunrpc. At a high level, NFS/RPC and SMB aren't really very different.
Sure, they have different formats, and one is big endian on the wire and
the other isn't...still there are significant similarities.

We also get access to the great upcall mechanisms that sunrpc has, and
the possibility to share code like the gssapi upcalls. The sunrpc layer
has a credential and authentication management framework that we can
build on to make a truly multiuser CIFS/SMB filesystem.

I've heard it claimed before that Linux's sunrpc layer is
over-engineered, but in this case that works in our favor...
------------------------------------------------------------------------
Q: can we hook up cifs or smbfs to use this as a transport?

A: Not trivially. CIFS in particular is not designed with each call
having discrete encode and decode functions. They're sort of mashed
together. smbfs might be possible...I'm a little less familiar with it,
but it does have a transport layer that more closely resembles the
sunrpc one. Still though, it'd take significant work to make that
happen. I'm not opposed to the idea however.

In the end though, I think we'll probably need to design something new
to sit on top of this. We will probably be able to borrow code and
concepts from the other filesystems however.
------------------------------------------------------------------------
Q: could we use this as a transport layer for a smb2fs ?

A: Yes, I think so. This particular prototype is build around SMB1, but
SMB2 could be supported with only minor modifications. One of the
reasons for sending this patchset now before I've built a filesystem on
top of it is because I know that SMB2 work is in progress. I'd like to
see it based around a more asynchronous transport model, or at least
built with cleaner layering so that we can eventually bolt on a different
transport layer if we so choose.

Jeff Layton (9):
  sunrpc: abstract out encoding function at rpc_clnt level
  sunrpc: move check for too small reply size into rpc_verify_header
  sunrpc: abstract our call decoding routine
  sunrpc: move rpc_xdr_buf_init to clnt.h
  sunrpc: make call_bind non-static
  sunrpc: add new SMB transport class for sunrpc
  sunrpc: add encoding and decoding routines for SMB
  sunrpc: add Kconfig option for CONFIG_SUNRPC_SMB
  smbtest: simple module for testing SMB/RPC code

 fs/Makefile                    |    2 +
 fs/lockd/host.c                |    4 +
 fs/lockd/mon.c                 |    4 +
 fs/nfs/client.c                |    4 +
 fs/nfs/mount_clnt.c            |    4 +
 fs/nfsd/nfs4callback.c         |    4 +
 fs/smbtest/Makefile            |    1 +
 fs/smbtest/smbtest.c           |  204 +++++
 include/linux/sunrpc/clnt.h    |   24 +-
 include/linux/sunrpc/smb.h     |   42 +
 include/linux/sunrpc/xprtsmb.h |   59 ++
 net/sunrpc/Kconfig             |   11 +
 net/sunrpc/Makefile            |    1 +
 net/sunrpc/clnt.c              |   98 ++-
 net/sunrpc/rpcb_clnt.c         |    8 +
 net/sunrpc/smb.c               |  120 +++
 net/sunrpc/sunrpc_syms.c       |    3 +
 net/sunrpc/xprtsmb.c           | 1723 ++++++++++++++++++++++++++++++++++++++++
 18 files changed, 2272 insertions(+), 44 deletions(-)
 create mode 100644 fs/smbtest/Makefile
 create mode 100644 fs/smbtest/smbtest.c
 create mode 100644 include/linux/sunrpc/smb.h
 create mode 100644 include/linux/sunrpc/xprtsmb.h
 create mode 100644 net/sunrpc/smb.c
 create mode 100644 net/sunrpc/xprtsmb.c


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-09-29 11:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <mailman.1290.1254136357.2670.linux-cifs-client@lists.samba.org>
2009-09-28 14:41 ` linux-cifs-client Digest, Vol 70, Issue 25 Steve French (smfltc)
2009-09-28 15:54   ` Jeff Layton
     [not found]     ` <20090928115427.4103826e-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2009-09-28 18:40       ` [RFC PATCH 0/9] sunrpc: teach the SUNRPC layer how to speak SMB Steve French (smfltc)
2009-09-28 19:37         ` Jeff Layton
     [not found]           ` <20090928153737.65c1ba3e-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2009-09-28 21:40             ` Steve French (smfltc)
2009-09-29 11:30               ` Jeff Layton
2009-09-27 16:50 Jeff Layton
2009-09-28  2:21 ` Shirish Pargaonkar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.