All of lore.kernel.org
 help / color / mirror / Atom feed
* nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
@ 2017-03-11 16:53 Chuck Lever
  2017-03-11 17:08 ` Jeff Layton
  0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2017-03-11 16:53 UTC (permalink / raw)
  To: J. Bruce Fields, Jeff Layton; +Cc: Linux NFS Mailing List

Hi Bruce, Jeff-

I've observed some interesting Linux NFS server behavior (v4.1.12).

We have a single system that has an NFSv4 mount via the kernel NFS
client, and an NFSv3 mount of the same export via a user space NFS
client. These two clients are accessing the same set of files.

The following pattern is seen on the wire. I've filtered a recent
capture on the FH of one of the shared files.

---- cut here ----

18507  19.483085    10.0.2.11 -> 10.0.1.8     NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE]
18508  19.483827     10.0.1.8 -> 10.0.2.11    NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]
18510  19.484676     10.0.1.8 -> 10.0.2.11    NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3

This OPEN reply offers a read delegation to the kernel NFS client.

18511  19.484806    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
18512  19.485549     10.0.1.8 -> 10.0.2.11    NFS 274 V4 Reply (Call In 18511) GETATTR
18513  19.485611    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
18514  19.486375     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18513) GETATTR
18515  19.486464    10.0.2.11 -> 10.0.1.8     NFS 254 V4 Call CLOSE StateID: 0x6de3
18516  19.487201     10.0.1.8 -> 10.0.2.11    NFS 202 V4 Reply (Call In 18515) CLOSE
18556  19.498617    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192

This READ call by the user space client does not conflict with the
read delegation.

18559  19.499396     10.0.1.8 -> 10.0.2.11    NFS 8390 V3 READ Reply (Call In 18556) Len: 8192
18726  19.568975     10.0.1.8 -> 10.0.2.11    NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f
18727  19.569170    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512
18728  19.569923     10.0.1.8 -> 10.0.2.11    NFS 710 V3 READ Reply (Call In 18727) Len: 512
18729  19.570135    10.0.2.11 -> 10.0.1.8     NFS 234 V3 SETATTR Call, FH: 0xc930444f
18730  19.570901     10.0.1.8 -> 10.0.2.11    NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX

The user space client has attempted to extend the file. This does
conflict with the read delegation held by the kernel NFS client,
so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY.
This causes a negative performance impact on the user space NFS
client.

18731  19.575396    10.0.2.11 -> 10.0.1.8     NFS 250 V4 Call DELEGRETURN StateID: 0x6de3
18732  19.576132     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18731) DELEGRETURN

No CB_RECALL was done to trigger this DELEGRETURN. Apparently
the application that was accessing this file via the kernel OS
client decided already that it no longer needed the file before
the server could send the CB_RECALL. Sign of perhaps a race
between the applications accessing the file via these two
mounts.

---- cut here ----

The server is aware of non-NFSv4 accessors of this file in frame
18556. NFSv3 has no OPEN operation, of course, so it's not
possible for the server to determine how the NFSv3 client will
subsequently access this file.

Seems like at frame 18556, it would be a best practice to recall
the delegation to avoid potential future conflicts, such as the
SETATTR in frame 18729.

Or, perhaps that READ isn't the first NFSv3 access of that file.
After all, a LOOKUP would have to be done to retrieve that file's
FH. The OPEN in frame 18556 perhaps could have avoided offering
the READ delegation, knowing there is a recent non-NFSv4 accessor
of that file.

Would these be difficult or inappropriate policies to implement?


--
Chuck Lever




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-11 16:53 nfsd: delegation conflicts between NFSv3 and NFSv4 accessors Chuck Lever
@ 2017-03-11 17:08 ` Jeff Layton
  2017-03-11 20:46   ` Chuck Lever
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2017-03-11 17:08 UTC (permalink / raw)
  To: Chuck Lever, J. Bruce Fields; +Cc: Linux NFS Mailing List

On Sat, 2017-03-11 at 11:53 -0500, Chuck Lever wrote:
> Hi Bruce, Jeff-
> 
> I've observed some interesting Linux NFS server behavior (v4.1.12).
> 
> We have a single system that has an NFSv4 mount via the kernel NFS
> client, and an NFSv3 mount of the same export via a user space NFS
> client. These two clients are accessing the same set of files.
> 
> The following pattern is seen on the wire. I've filtered a recent
> capture on the FH of one of the shared files.
> 
> ---- cut here ----
> 
> 18507  19.483085    10.0.2.11 -> 10.0.1.8     NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE]
> 18508  19.483827     10.0.1.8 -> 10.0.2.11    NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]
> 18510  19.484676     10.0.1.8 -> 10.0.2.11    NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3
> 
> This OPEN reply offers a read delegation to the kernel NFS client.
> 
> 18511  19.484806    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> 18512  19.485549     10.0.1.8 -> 10.0.2.11    NFS 274 V4 Reply (Call In 18511) GETATTR
> 18513  19.485611    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> 18514  19.486375     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18513) GETATTR
> 18515  19.486464    10.0.2.11 -> 10.0.1.8     NFS 254 V4 Call CLOSE StateID: 0x6de3
> 18516  19.487201     10.0.1.8 -> 10.0.2.11    NFS 202 V4 Reply (Call In 18515) CLOSE
> 18556  19.498617    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192
> 
> This READ call by the user space client does not conflict with the
> read delegation.
> 
> 18559  19.499396     10.0.1.8 -> 10.0.2.11    NFS 8390 V3 READ Reply (Call In 18556) Len: 8192
> 18726  19.568975     10.0.1.8 -> 10.0.2.11    NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f
> 18727  19.569170    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512
> 18728  19.569923     10.0.1.8 -> 10.0.2.11    NFS 710 V3 READ Reply (Call In 18727) Len: 512
> 18729  19.570135    10.0.2.11 -> 10.0.1.8     NFS 234 V3 SETATTR Call, FH: 0xc930444f
> 18730  19.570901     10.0.1.8 -> 10.0.2.11    NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX
> 
> The user space client has attempted to extend the file. This does
> conflict with the read delegation held by the kernel NFS client,
> so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY.
> This causes a negative performance impact on the user space NFS
> client.
> 
> 18731  19.575396    10.0.2.11 -> 10.0.1.8     NFS 250 V4 Call DELEGRETURN StateID: 0x6de3
> 18732  19.576132     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18731) DELEGRETURN
> 
> No CB_RECALL was done to trigger this DELEGRETURN. Apparently
> the application that was accessing this file via the kernel OS
> client decided already that it no longer needed the file before
> the server could send the CB_RECALL. Sign of perhaps a race
> between the applications accessing the file via these two
> mounts.
> 
> ---- cut here ----
> 
> The server is aware of non-NFSv4 accessors of this file in frame
> 18556. NFSv3 has no OPEN operation, of course, so it's not
> possible for the server to determine how the NFSv3 client will
> subsequently access this file.
> 

Right. Why should we assume that the v3 client will do anything other
than read there? If we recall the delegation just for reads, then we
potentially negatively affect the performance of the v4 client.

> Seems like at frame 18556, it would be a best practice to recall
> the delegation to avoid potential future conflicts, such as the
> SETATTR in frame 18729.
> 
> Or, perhaps that READ isn't the first NFSv3 access of that file.
> After all, a LOOKUP would have to be done to retrieve that file's
> FH. The OPEN in frame 18556 perhaps could have avoided offering
> the READ delegation, knowing there is a recent non-NFSv4 accessor
> of that file.
> 
> Would these be difficult or inappropriate policies to implement?
> 
> 

Reads are not currently considered to be conflicting access vs. a read
delegation. I think that's the correct thing to do. Until we have some
sort of conflicting behavior I don't see why you'd want to prematurely
recall the delegation.

Note that we do have a bloom filter now that prevents us from handing
out a delegation on a file that was recently recalled. Does that help at
all here?
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-11 17:08 ` Jeff Layton
@ 2017-03-11 20:46   ` Chuck Lever
  2017-03-11 21:04     ` Jeff Layton
  0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2017-03-11 20:46 UTC (permalink / raw)
  To: Jeff Layton; +Cc: J. Bruce Fields, Linux NFS Mailing List


> On Mar 11, 2017, at 12:08 PM, Jeff Layton <jlayton@redhat.com> wrote:
> 
> On Sat, 2017-03-11 at 11:53 -0500, Chuck Lever wrote:
>> Hi Bruce, Jeff-
>> 
>> I've observed some interesting Linux NFS server behavior (v4.1.12).
>> 
>> We have a single system that has an NFSv4 mount via the kernel NFS
>> client, and an NFSv3 mount of the same export via a user space NFS
>> client. These two clients are accessing the same set of files.
>> 
>> The following pattern is seen on the wire. I've filtered a recent
>> capture on the FH of one of the shared files.
>> 
>> ---- cut here ----
>> 
>> 18507  19.483085    10.0.2.11 -> 10.0.1.8     NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE]
>> 18508  19.483827     10.0.1.8 -> 10.0.2.11    NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]
>> 18510  19.484676     10.0.1.8 -> 10.0.2.11    NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3
>> 
>> This OPEN reply offers a read delegation to the kernel NFS client.
>> 
>> 18511  19.484806    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
>> 18512  19.485549     10.0.1.8 -> 10.0.2.11    NFS 274 V4 Reply (Call In 18511) GETATTR
>> 18513  19.485611    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
>> 18514  19.486375     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18513) GETATTR
>> 18515  19.486464    10.0.2.11 -> 10.0.1.8     NFS 254 V4 Call CLOSE StateID: 0x6de3
>> 18516  19.487201     10.0.1.8 -> 10.0.2.11    NFS 202 V4 Reply (Call In 18515) CLOSE
>> 18556  19.498617    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192
>> 
>> This READ call by the user space client does not conflict with the
>> read delegation.
>> 
>> 18559  19.499396     10.0.1.8 -> 10.0.2.11    NFS 8390 V3 READ Reply (Call In 18556) Len: 8192
>> 18726  19.568975     10.0.1.8 -> 10.0.2.11    NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f
>> 18727  19.569170    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512
>> 18728  19.569923     10.0.1.8 -> 10.0.2.11    NFS 710 V3 READ Reply (Call In 18727) Len: 512
>> 18729  19.570135    10.0.2.11 -> 10.0.1.8     NFS 234 V3 SETATTR Call, FH: 0xc930444f
>> 18730  19.570901     10.0.1.8 -> 10.0.2.11    NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX
>> 
>> The user space client has attempted to extend the file. This does
>> conflict with the read delegation held by the kernel NFS client,
>> so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY.
>> This causes a negative performance impact on the user space NFS
>> client.
>> 
>> 18731  19.575396    10.0.2.11 -> 10.0.1.8     NFS 250 V4 Call DELEGRETURN StateID: 0x6de3
>> 18732  19.576132     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18731) DELEGRETURN
>> 
>> No CB_RECALL was done to trigger this DELEGRETURN. Apparently
>> the application that was accessing this file via the kernel OS
>> client decided already that it no longer needed the file before
>> the server could send the CB_RECALL. Sign of perhaps a race
>> between the applications accessing the file via these two
>> mounts.
>> 
>> ---- cut here ----
>> 
>> The server is aware of non-NFSv4 accessors of this file in frame
>> 18556. NFSv3 has no OPEN operation, of course, so it's not
>> possible for the server to determine how the NFSv3 client will
>> subsequently access this file.
>> 
> 
> Right. Why should we assume that the v3 client will do anything other
> than read there? If we recall the delegation just for reads, then we
> potentially negatively affect the performance of the v4 client.
> 
>> Seems like at frame 18556, it would be a best practice to recall
>> the delegation to avoid potential future conflicts, such as the
>> SETATTR in frame 18729.
>> 
>> Or, perhaps that READ isn't the first NFSv3 access of that file.
>> After all, a LOOKUP would have to be done to retrieve that file's
>> FH. The OPEN in frame 18556 perhaps could have avoided offering
>> the READ delegation, knowing there is a recent non-NFSv4 accessor
>> of that file.
>> 
>> Would these be difficult or inappropriate policies to implement?
>> 
>> 
> 
> Reads are not currently considered to be conflicting access vs. a read
> delegation.

Strictly speaking, a single NFSv3 READ does not violate the guarantee
made by the read delegation. And, strictly speaking, there can be no
OPEN conflict because NFSv3 does not have an OPEN operation.

The question is whether the server has an adequate mechanism for
delaying NFSv3 accessors when an NFSv4 delegation must be recalled.

NFS3ERR_JUKEBOX and NFS4ERR_DELAY share the same numeric value, but
imply different semantics.

RFC1813 says:
 
NFS3ERR_JUKEBOX
    The server initiated the request, but was not able to
    complete it in a timely fashion. The client should wait
    and then try the request with a new RPC transaction ID.
    For example, this error should be returned from a server
    that supports hierarchical storage and receives a request
    to process a file that has been migrated. In this case,
    the server should start the immigration process and
    respond to client with this error.

Some clients respond to NFS3ERR_JUKEBOX by waiting quite some time
before retrying.

RFC7530 says:

13.1.1.3.  NFS4ERR_DELAY (Error Code 10008)

   For any of a number of reasons, the replier could not process this
   operation in what was deemed a reasonable time.  The client should
   wait and then try the request with a new RPC transaction ID.

   The following are two examples of what might lead to this situation:

   o  A server that supports hierarchical storage receives a request to
      process a file that had been migrated.

   o  An operation requires a delegation recall to proceed, and waiting
      for this delegation recall makes processing this request in a
      timely fashion impossible.

An NFSv4 client is prepared to retry this error almost immediately
because most of the time it is due to the second bullet.

I agree that not recalling after an NFSv3 READ is reasonable in some
cases. However, I demonstrated a case where the current policy does
not serve one of these clients well at all. In fact, the NFSv3
accessor in this case is the performance-sensitive one.

To put it another way, the NFSv4 protocol does not forbid the
current Linux server policy, but interoperating well with existing
NFSv3 clients suggests it's not an optimal policy choice.


> I think that's the correct thing to do. Until we have some
> sort of conflicting behavior I don't see why you'd want to prematurely
> recall the delegation.

The reason to recall a delegation is to avoid returning
NFS3ERR_JUKEBOX if at all possible, because doing so is a drastic
remedy that results in a performance regression.

The negative impact of not having a delegation is small. The negative
impact of returning NFS3ERR_JUKEBOX to a SETATTR or WRITE can be as
much as a 5 minute wait. (This is intolerably long for, say, online
transaction processing workloads).

The server can detect there are other accessors that do not provide
OPEN/CLOSE semantics. In addition, the server cannot predict when one
of these accessors may use a WRITE or SETATTR. And finally it does
not have a reasonably performant mechanism for delaying those
accessors when a delegation must be recalled.


> Note that we do have a bloom filter now that prevents us from handing
> out a delegation on a file that was recently recalled. Does that help at
> all here?

Not offering a delegation again will help during subsequent accesses,
though not for the initial write access.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-11 20:46   ` Chuck Lever
@ 2017-03-11 21:04     ` Jeff Layton
  2017-03-13 13:27       ` J. Bruce Fields
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2017-03-11 21:04 UTC (permalink / raw)
  To: Chuck Lever; +Cc: J. Bruce Fields, Linux NFS Mailing List

On Sat, 2017-03-11 at 15:46 -0500, Chuck Lever wrote:
> > On Mar 11, 2017, at 12:08 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > On Sat, 2017-03-11 at 11:53 -0500, Chuck Lever wrote:
> > > Hi Bruce, Jeff-
> > > 
> > > I've observed some interesting Linux NFS server behavior (v4.1.12).
> > > 
> > > We have a single system that has an NFSv4 mount via the kernel NFS
> > > client, and an NFSv3 mount of the same export via a user space NFS
> > > client. These two clients are accessing the same set of files.
> > > 
> > > The following pattern is seen on the wire. I've filtered a recent
> > > capture on the FH of one of the shared files.
> > > 
> > > ---- cut here ----
> > > 
> > > 18507  19.483085    10.0.2.11 -> 10.0.1.8     NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE]
> > > 18508  19.483827     10.0.1.8 -> 10.0.2.11    NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]
> > > 18510  19.484676     10.0.1.8 -> 10.0.2.11    NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3
> > > 
> > > This OPEN reply offers a read delegation to the kernel NFS client.
> > > 
> > > 18511  19.484806    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> > > 18512  19.485549     10.0.1.8 -> 10.0.2.11    NFS 274 V4 Reply (Call In 18511) GETATTR
> > > 18513  19.485611    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> > > 18514  19.486375     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18513) GETATTR
> > > 18515  19.486464    10.0.2.11 -> 10.0.1.8     NFS 254 V4 Call CLOSE StateID: 0x6de3
> > > 18516  19.487201     10.0.1.8 -> 10.0.2.11    NFS 202 V4 Reply (Call In 18515) CLOSE
> > > 18556  19.498617    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192
> > > 
> > > This READ call by the user space client does not conflict with the
> > > read delegation.
> > > 
> > > 18559  19.499396     10.0.1.8 -> 10.0.2.11    NFS 8390 V3 READ Reply (Call In 18556) Len: 8192
> > > 18726  19.568975     10.0.1.8 -> 10.0.2.11    NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f
> > > 18727  19.569170    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512
> > > 18728  19.569923     10.0.1.8 -> 10.0.2.11    NFS 710 V3 READ Reply (Call In 18727) Len: 512
> > > 18729  19.570135    10.0.2.11 -> 10.0.1.8     NFS 234 V3 SETATTR Call, FH: 0xc930444f
> > > 18730  19.570901     10.0.1.8 -> 10.0.2.11    NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX
> > > 
> > > The user space client has attempted to extend the file. This does
> > > conflict with the read delegation held by the kernel NFS client,
> > > so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY.
> > > This causes a negative performance impact on the user space NFS
> > > client.
> > > 
> > > 18731  19.575396    10.0.2.11 -> 10.0.1.8     NFS 250 V4 Call DELEGRETURN StateID: 0x6de3
> > > 18732  19.576132     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18731) DELEGRETURN
> > > 
> > > No CB_RECALL was done to trigger this DELEGRETURN. Apparently
> > > the application that was accessing this file via the kernel OS
> > > client decided already that it no longer needed the file before
> > > the server could send the CB_RECALL. Sign of perhaps a race
> > > between the applications accessing the file via these two
> > > mounts.
> > > 
> > > ---- cut here ----
> > > 
> > > The server is aware of non-NFSv4 accessors of this file in frame
> > > 18556. NFSv3 has no OPEN operation, of course, so it's not
> > > possible for the server to determine how the NFSv3 client will
> > > subsequently access this file.
> > > 
> > 
> > Right. Why should we assume that the v3 client will do anything other
> > than read there? If we recall the delegation just for reads, then we
> > potentially negatively affect the performance of the v4 client.
> > 
> > > Seems like at frame 18556, it would be a best practice to recall
> > > the delegation to avoid potential future conflicts, such as the
> > > SETATTR in frame 18729.
> > > 
> > > Or, perhaps that READ isn't the first NFSv3 access of that file.
> > > After all, a LOOKUP would have to be done to retrieve that file's
> > > FH. The OPEN in frame 18556 perhaps could have avoided offering
> > > the READ delegation, knowing there is a recent non-NFSv4 accessor
> > > of that file.
> > > 
> > > Would these be difficult or inappropriate policies to implement?
> > > 
> > > 
> > 
> > Reads are not currently considered to be conflicting access vs. a read
> > delegation.
> 
> Strictly speaking, a single NFSv3 READ does not violate the guarantee
> made by the read delegation. And, strictly speaking, there can be no
> OPEN conflict because NFSv3 does not have an OPEN operation.
> 
> The question is whether the server has an adequate mechanism for
> delaying NFSv3 accessors when an NFSv4 delegation must be recalled.
> 
> NFS3ERR_JUKEBOX and NFS4ERR_DELAY share the same numeric value, but
> imply different semantics.
> 
> RFC1813 says:
>  
> NFS3ERR_JUKEBOX
>     The server initiated the request, but was not able to
>     complete it in a timely fashion. The client should wait
>     and then try the request with a new RPC transaction ID.
>     For example, this error should be returned from a server
>     that supports hierarchical storage and receives a request
>     to process a file that has been migrated. In this case,
>     the server should start the immigration process and
>     respond to client with this error.
> 
> Some clients respond to NFS3ERR_JUKEBOX by waiting quite some time
> before retrying.
> 
> RFC7530 says:
> 
> 13.1.1.3.  NFS4ERR_DELAY (Error Code 10008)
> 
>    For any of a number of reasons, the replier could not process this
>    operation in what was deemed a reasonable time.  The client should
>    wait and then try the request with a new RPC transaction ID.
> 
>    The following are two examples of what might lead to this situation:
> 
>    o  A server that supports hierarchical storage receives a request to
>       process a file that had been migrated.
> 
>    o  An operation requires a delegation recall to proceed, and waiting
>       for this delegation recall makes processing this request in a
>       timely fashion impossible.
> 
> An NFSv4 client is prepared to retry this error almost immediately
> because most of the time it is due to the second bullet.
> 
> I agree that not recalling after an NFSv3 READ is reasonable in some
> cases. However, I demonstrated a case where the current policy does
> not serve one of these clients well at all. In fact, the NFSv3
> accessor in this case is the performance-sensitive one.
> 
> To put it another way, the NFSv4 protocol does not forbid the
> current Linux server policy, but interoperating well with existing
> NFSv3 clients suggests it's not an optimal policy choice.
> 

I think that is entirely dependent on the workload. If we proactively
recall delegations because we think the v3 client _might_ do some
conflicting access, and then it doesn't, then that's also a non-optimal
choice.

> 
> > I think that's the correct thing to do. Until we have some
> > sort of conflicting behavior I don't see why you'd want to prematurely
> > recall the delegation.
> 
> The reason to recall a delegation is to avoid returning
> NFS3ERR_JUKEBOX if at all possible, because doing so is a drastic
> remedy that results in a performance regression.
> 
> The negative impact of not having a delegation is small. The negative
> impact of returning NFS3ERR_JUKEBOX to a SETATTR or WRITE can be as
> much as a 5 minute wait. (This is intolerably long for, say, online
> transaction processing workloads).
> 

That sounds like a deficient v3 client, IMO. There's nothing in the v3
spec that I know of that advocates a delay that long before
reattempting. I'm pretty sure the Linux client treats NFSERR3_JUKEBOX
and NFS4ERR_DELAY more or less equivalently.

> The server can detect there are other accessors that do not provide
> OPEN/CLOSE semantics. In addition, the server cannot predict when one
> of these accessors may use a WRITE or SETATTR. And finally it does
> not have a reasonably performant mechanism for delaying those
> accessors when a delegation must be recalled.
> 

Interoperability is hard (and sometimes it doesn't work well :). We
simply don't have enough info to reliably guess what the v3 client will
do in this situation.

That said, I wouldn't have a huge objection to a server side tunable
(module parameter?) that says "Recall read delegations on v2/3 READ
calls". Make it default to off, and then people in your situation could
set it if they thought it a better policy for their workload.

> 
> > Note that we do have a bloom filter now that prevents us from handing
> > out a delegation on a file that was recently recalled. Does that help at
> > all here?
> 
> Not offering a delegation again will help during subsequent accesses,
> though not for the initial write access.
> 
> 

Yeah, I wasn't sure how long-lived the v4 opens are in this situation.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-11 21:04     ` Jeff Layton
@ 2017-03-13 13:27       ` J. Bruce Fields
  2017-03-13 15:30         ` Chuck Lever
  0 siblings, 1 reply; 13+ messages in thread
From: J. Bruce Fields @ 2017-03-13 13:27 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Chuck Lever, Linux NFS Mailing List

On Sat, Mar 11, 2017 at 04:04:34PM -0500, Jeff Layton wrote:
> On Sat, 2017-03-11 at 15:46 -0500, Chuck Lever wrote:
> > > On Mar 11, 2017, at 12:08 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > > On Sat, 2017-03-11 at 11:53 -0500, Chuck Lever wrote:
> > > > Hi Bruce, Jeff-
> > > > 
> > > > I've observed some interesting Linux NFS server behavior (v4.1.12).
> > > > 
> > > > We have a single system that has an NFSv4 mount via the kernel NFS
> > > > client, and an NFSv3 mount of the same export via a user space NFS
> > > > client. These two clients are accessing the same set of files.
> > > > 
> > > > The following pattern is seen on the wire. I've filtered a recent
> > > > capture on the FH of one of the shared files.
> > > > 
> > > > ---- cut here ----
> > > > 
> > > > 18507  19.483085    10.0.2.11 -> 10.0.1.8     NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE]
> > > > 18508  19.483827     10.0.1.8 -> 10.0.2.11    NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]
> > > > 18510  19.484676     10.0.1.8 -> 10.0.2.11    NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3
> > > > 
> > > > This OPEN reply offers a read delegation to the kernel NFS client.
> > > > 
> > > > 18511  19.484806    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> > > > 18512  19.485549     10.0.1.8 -> 10.0.2.11    NFS 274 V4 Reply (Call In 18511) GETATTR
> > > > 18513  19.485611    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> > > > 18514  19.486375     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18513) GETATTR
> > > > 18515  19.486464    10.0.2.11 -> 10.0.1.8     NFS 254 V4 Call CLOSE StateID: 0x6de3
> > > > 18516  19.487201     10.0.1.8 -> 10.0.2.11    NFS 202 V4 Reply (Call In 18515) CLOSE
> > > > 18556  19.498617    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192
> > > > 
> > > > This READ call by the user space client does not conflict with the
> > > > read delegation.
> > > > 
> > > > 18559  19.499396     10.0.1.8 -> 10.0.2.11    NFS 8390 V3 READ Reply (Call In 18556) Len: 8192
> > > > 18726  19.568975     10.0.1.8 -> 10.0.2.11    NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f
> > > > 18727  19.569170    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512
> > > > 18728  19.569923     10.0.1.8 -> 10.0.2.11    NFS 710 V3 READ Reply (Call In 18727) Len: 512
> > > > 18729  19.570135    10.0.2.11 -> 10.0.1.8     NFS 234 V3 SETATTR Call, FH: 0xc930444f
> > > > 18730  19.570901     10.0.1.8 -> 10.0.2.11    NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX
> > > > 
> > > > The user space client has attempted to extend the file. This does
> > > > conflict with the read delegation held by the kernel NFS client,
> > > > so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY.
> > > > This causes a negative performance impact on the user space NFS
> > > > client.
> > > > 
> > > > 18731  19.575396    10.0.2.11 -> 10.0.1.8     NFS 250 V4 Call DELEGRETURN StateID: 0x6de3
> > > > 18732  19.576132     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18731) DELEGRETURN
> > > > 
> > > > No CB_RECALL was done to trigger this DELEGRETURN. Apparently
> > > > the application that was accessing this file via the kernel OS
> > > > client decided already that it no longer needed the file before
> > > > the server could send the CB_RECALL. Sign of perhaps a race
> > > > between the applications accessing the file via these two
> > > > mounts.
> > > > 
> > > > ---- cut here ----
> > > > 
> > > > The server is aware of non-NFSv4 accessors of this file in frame
> > > > 18556. NFSv3 has no OPEN operation, of course, so it's not
> > > > possible for the server to determine how the NFSv3 client will
> > > > subsequently access this file.
> > > > 
> > > 
> > > Right. Why should we assume that the v3 client will do anything other
> > > than read there? If we recall the delegation just for reads, then we
> > > potentially negatively affect the performance of the v4 client.
> > > 
> > > > Seems like at frame 18556, it would be a best practice to recall
> > > > the delegation to avoid potential future conflicts, such as the
> > > > SETATTR in frame 18729.
> > > > 
> > > > Or, perhaps that READ isn't the first NFSv3 access of that file.
> > > > After all, a LOOKUP would have to be done to retrieve that file's
> > > > FH. The OPEN in frame 18556 perhaps could have avoided offering
> > > > the READ delegation, knowing there is a recent non-NFSv4 accessor
> > > > of that file.
> > > > 
> > > > Would these be difficult or inappropriate policies to implement?
> > > > 
> > > > 
> > > 
> > > Reads are not currently considered to be conflicting access vs. a read
> > > delegation.
> > 
> > Strictly speaking, a single NFSv3 READ does not violate the guarantee
> > made by the read delegation. And, strictly speaking, there can be no
> > OPEN conflict because NFSv3 does not have an OPEN operation.
> > 
> > The question is whether the server has an adequate mechanism for
> > delaying NFSv3 accessors when an NFSv4 delegation must be recalled.
> > 
> > NFS3ERR_JUKEBOX and NFS4ERR_DELAY share the same numeric value, but
> > imply different semantics.
> > 
> > RFC1813 says:
> >  
> > NFS3ERR_JUKEBOX
> >     The server initiated the request, but was not able to
> >     complete it in a timely fashion. The client should wait
> >     and then try the request with a new RPC transaction ID.
> >     For example, this error should be returned from a server
> >     that supports hierarchical storage and receives a request
> >     to process a file that has been migrated. In this case,
> >     the server should start the immigration process and
> >     respond to client with this error.
> > 
> > Some clients respond to NFS3ERR_JUKEBOX by waiting quite some time
> > before retrying.
> > 
> > RFC7530 says:
> > 
> > 13.1.1.3.  NFS4ERR_DELAY (Error Code 10008)
> > 
> >    For any of a number of reasons, the replier could not process this
> >    operation in what was deemed a reasonable time.  The client should
> >    wait and then try the request with a new RPC transaction ID.
> > 
> >    The following are two examples of what might lead to this situation:
> > 
> >    o  A server that supports hierarchical storage receives a request to
> >       process a file that had been migrated.
> > 
> >    o  An operation requires a delegation recall to proceed, and waiting
> >       for this delegation recall makes processing this request in a
> >       timely fashion impossible.
> > 
> > An NFSv4 client is prepared to retry this error almost immediately
> > because most of the time it is due to the second bullet.
> > 
> > I agree that not recalling after an NFSv3 READ is reasonable in some
> > cases. However, I demonstrated a case where the current policy does
> > not serve one of these clients well at all. In fact, the NFSv3
> > accessor in this case is the performance-sensitive one.
> > 
> > To put it another way, the NFSv4 protocol does not forbid the
> > current Linux server policy, but interoperating well with existing
> > NFSv3 clients suggests it's not an optimal policy choice.
> > 
> 
> I think that is entirely dependent on the workload. If we proactively
> recall delegations because we think the v3 client _might_ do some
> conflicting access, and then it doesn't, then that's also a non-optimal
> choice.
> 
> > 
> > > I think that's the correct thing to do. Until we have some
> > > sort of conflicting behavior I don't see why you'd want to prematurely
> > > recall the delegation.
> > 
> > The reason to recall a delegation is to avoid returning
> > NFS3ERR_JUKEBOX if at all possible, because doing so is a drastic
> > remedy that results in a performance regression.
> > 
> > The negative impact of not having a delegation is small. The negative
> > impact of returning NFS3ERR_JUKEBOX to a SETATTR or WRITE can be as
> > much as a 5 minute wait. (This is intolerably long for, say, online
> > transaction processing workloads).
> > 
> 
> That sounds like a deficient v3 client, IMO. There's nothing in the v3
> spec that I know of that advocates a delay that long before
> reattempting. I'm pretty sure the Linux client treats NFSERR3_JUKEBOX
> and NFS4ERR_DELAY more or less equivalently.

The v3 client uses a 5 second delay (see NFS_JUKEBOX_RETRY_TIME).

The v4 client, at least in the case of operations that could break a
deleg, does exponential backoff starting with a tenth of a second--see
nfs4_delay.

So Trond's been taking the spec at its word here.

Like Jeff I'm pretty unhappy at the idea of revoking delegations
preemptively on v3 read and lookup.  And a 5 minute wait does sound like
a client problem.

> > The server can detect there are other accessors that do not provide
> > OPEN/CLOSE semantics. In addition, the server cannot predict when one
> > of these accessors may use a WRITE or SETATTR. And finally it does
> > not have a reasonably performant mechanism for delaying those
> > accessors when a delegation must be recalled.
> > 
> 
> Interoperability is hard (and sometimes it doesn't work well :). We
> simply don't have enough info to reliably guess what the v3 client will
> do in this situation.
> 
> That said, I wouldn't have a huge objection to a server side tunable
> (module parameter?) that says "Recall read delegations on v2/3 READ
> calls". Make it default to off, and then people in your situation could
> set it if they thought it a better policy for their workload.

I also wonder if in v3 case we should try a small synchronous wait
before returning JUKEBOX.  Read delegations shouldn't require the client
to do very much, so it could be they're typically returned in a
fraction of a second.

Since we have a fixed number of threads, I don't think we'd want to keep
one waiting much longer than that.  Also, it'd be nice if we could get
woken up early when the delegation return comes in before our wait's
over, but I haven't thought about how to do that.

And I don't know if that actually helps.

--b.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-13 13:27       ` J. Bruce Fields
@ 2017-03-13 15:30         ` Chuck Lever
  2017-03-13 16:01           ` J. Bruce Fields
  2017-03-13 16:33           ` Jeff Layton
  0 siblings, 2 replies; 13+ messages in thread
From: Chuck Lever @ 2017-03-13 15:30 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, Linux NFS Mailing List

Hi Bruce-


> On Mar 13, 2017, at 9:27 AM, J. Bruce Fields <bfields@redhat.com> wrote:
> 
> On Sat, Mar 11, 2017 at 04:04:34PM -0500, Jeff Layton wrote:
>> On Sat, 2017-03-11 at 15:46 -0500, Chuck Lever wrote:
>>>> On Mar 11, 2017, at 12:08 PM, Jeff Layton <jlayton@redhat.com> wrote:
>>>> 
>>>> On Sat, 2017-03-11 at 11:53 -0500, Chuck Lever wrote:
>>>>> Hi Bruce, Jeff-
>>>>> 
>>>>> I've observed some interesting Linux NFS server behavior (v4.1.12).
>>>>> 
>>>>> We have a single system that has an NFSv4 mount via the kernel NFS
>>>>> client, and an NFSv3 mount of the same export via a user space NFS
>>>>> client. These two clients are accessing the same set of files.
>>>>> 
>>>>> The following pattern is seen on the wire. I've filtered a recent
>>>>> capture on the FH of one of the shared files.
>>>>> 
>>>>> ---- cut here ----
>>>>> 
>>>>> 18507  19.483085    10.0.2.11 -> 10.0.1.8     NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE]
>>>>> 18508  19.483827     10.0.1.8 -> 10.0.2.11    NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]
>>>>> 18510  19.484676     10.0.1.8 -> 10.0.2.11    NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3
>>>>> 
>>>>> This OPEN reply offers a read delegation to the kernel NFS client.
>>>>> 
>>>>> 18511  19.484806    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
>>>>> 18512  19.485549     10.0.1.8 -> 10.0.2.11    NFS 274 V4 Reply (Call In 18511) GETATTR
>>>>> 18513  19.485611    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
>>>>> 18514  19.486375     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18513) GETATTR
>>>>> 18515  19.486464    10.0.2.11 -> 10.0.1.8     NFS 254 V4 Call CLOSE StateID: 0x6de3
>>>>> 18516  19.487201     10.0.1.8 -> 10.0.2.11    NFS 202 V4 Reply (Call In 18515) CLOSE
>>>>> 18556  19.498617    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192
>>>>> 
>>>>> This READ call by the user space client does not conflict with the
>>>>> read delegation.
>>>>> 
>>>>> 18559  19.499396     10.0.1.8 -> 10.0.2.11    NFS 8390 V3 READ Reply (Call In 18556) Len: 8192
>>>>> 18726  19.568975     10.0.1.8 -> 10.0.2.11    NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f
>>>>> 18727  19.569170    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512
>>>>> 18728  19.569923     10.0.1.8 -> 10.0.2.11    NFS 710 V3 READ Reply (Call In 18727) Len: 512
>>>>> 18729  19.570135    10.0.2.11 -> 10.0.1.8     NFS 234 V3 SETATTR Call, FH: 0xc930444f
>>>>> 18730  19.570901     10.0.1.8 -> 10.0.2.11    NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX
>>>>> 
>>>>> The user space client has attempted to extend the file. This does
>>>>> conflict with the read delegation held by the kernel NFS client,
>>>>> so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY.
>>>>> This causes a negative performance impact on the user space NFS
>>>>> client.
>>>>> 
>>>>> 18731  19.575396    10.0.2.11 -> 10.0.1.8     NFS 250 V4 Call DELEGRETURN StateID: 0x6de3
>>>>> 18732  19.576132     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18731) DELEGRETURN
>>>>> 
>>>>> No CB_RECALL was done to trigger this DELEGRETURN. Apparently
>>>>> the application that was accessing this file via the kernel OS
>>>>> client decided already that it no longer needed the file before
>>>>> the server could send the CB_RECALL. Sign of perhaps a race
>>>>> between the applications accessing the file via these two
>>>>> mounts.
>>>>> 
>>>>> ---- cut here ----
>>>>> 
>>>>> The server is aware of non-NFSv4 accessors of this file in frame
>>>>> 18556. NFSv3 has no OPEN operation, of course, so it's not
>>>>> possible for the server to determine how the NFSv3 client will
>>>>> subsequently access this file.
>>>>> 
>>>> 
>>>> Right. Why should we assume that the v3 client will do anything other
>>>> than read there? If we recall the delegation just for reads, then we
>>>> potentially negatively affect the performance of the v4 client.
>>>> 
>>>>> Seems like at frame 18556, it would be a best practice to recall
>>>>> the delegation to avoid potential future conflicts, such as the
>>>>> SETATTR in frame 18729.
>>>>> 
>>>>> Or, perhaps that READ isn't the first NFSv3 access of that file.
>>>>> After all, a LOOKUP would have to be done to retrieve that file's
>>>>> FH. The OPEN in frame 18556 perhaps could have avoided offering
>>>>> the READ delegation, knowing there is a recent non-NFSv4 accessor
>>>>> of that file.
>>>>> 
>>>>> Would these be difficult or inappropriate policies to implement?
>>>>> 
>>>>> 
>>>> 
>>>> Reads are not currently considered to be conflicting access vs. a read
>>>> delegation.
>>> 
>>> Strictly speaking, a single NFSv3 READ does not violate the guarantee
>>> made by the read delegation. And, strictly speaking, there can be no
>>> OPEN conflict because NFSv3 does not have an OPEN operation.
>>> 
>>> The question is whether the server has an adequate mechanism for
>>> delaying NFSv3 accessors when an NFSv4 delegation must be recalled.
>>> 
>>> NFS3ERR_JUKEBOX and NFS4ERR_DELAY share the same numeric value, but
>>> imply different semantics.
>>> 
>>> RFC1813 says:
>>> 
>>> NFS3ERR_JUKEBOX
>>>    The server initiated the request, but was not able to
>>>    complete it in a timely fashion. The client should wait
>>>    and then try the request with a new RPC transaction ID.
>>>    For example, this error should be returned from a server
>>>    that supports hierarchical storage and receives a request
>>>    to process a file that has been migrated. In this case,
>>>    the server should start the immigration process and
>>>    respond to client with this error.
>>> 
>>> Some clients respond to NFS3ERR_JUKEBOX by waiting quite some time
>>> before retrying.
>>> 
>>> RFC7530 says:
>>> 
>>> 13.1.1.3.  NFS4ERR_DELAY (Error Code 10008)
>>> 
>>>   For any of a number of reasons, the replier could not process this
>>>   operation in what was deemed a reasonable time.  The client should
>>>   wait and then try the request with a new RPC transaction ID.
>>> 
>>>   The following are two examples of what might lead to this situation:
>>> 
>>>   o  A server that supports hierarchical storage receives a request to
>>>      process a file that had been migrated.
>>> 
>>>   o  An operation requires a delegation recall to proceed, and waiting
>>>      for this delegation recall makes processing this request in a
>>>      timely fashion impossible.
>>> 
>>> An NFSv4 client is prepared to retry this error almost immediately
>>> because most of the time it is due to the second bullet.
>>> 
>>> I agree that not recalling after an NFSv3 READ is reasonable in some
>>> cases. However, I demonstrated a case where the current policy does
>>> not serve one of these clients well at all. In fact, the NFSv3
>>> accessor in this case is the performance-sensitive one.
>>> 
>>> To put it another way, the NFSv4 protocol does not forbid the
>>> current Linux server policy, but interoperating well with existing
>>> NFSv3 clients suggests it's not an optimal policy choice.
>>> 
>> 
>> I think that is entirely dependent on the workload. If we proactively
>> recall delegations because we think the v3 client _might_ do some
>> conflicting access, and then it doesn't, then that's also a non-optimal
>> choice.
>> 
>>> 
>>>> I think that's the correct thing to do. Until we have some
>>>> sort of conflicting behavior I don't see why you'd want to prematurely
>>>> recall the delegation.
>>> 
>>> The reason to recall a delegation is to avoid returning
>>> NFS3ERR_JUKEBOX if at all possible, because doing so is a drastic
>>> remedy that results in a performance regression.
>>> 
>>> The negative impact of not having a delegation is small. The negative
>>> impact of returning NFS3ERR_JUKEBOX to a SETATTR or WRITE can be as
>>> much as a 5 minute wait. (This is intolerably long for, say, online
>>> transaction processing workloads).
>>> 
>> 
>> That sounds like a deficient v3 client, IMO. There's nothing in the v3
>> spec that I know of that advocates a delay that long before
>> reattempting. I'm pretty sure the Linux client treats NFSERR3_JUKEBOX
>> and NFS4ERR_DELAY more or less equivalently.
> 
> The v3 client uses a 5 second delay (see NFS_JUKEBOX_RETRY_TIME).

> The v4 client, at least in the case of operations that could break a
> deleg, does exponential backoff starting with a tenth of a second--see
> nfs4_delay.
> 
> So Trond's been taking the spec at its word here.
> 
> Like Jeff I'm pretty unhappy at the idea of revoking delegations
> preemptively on v3 read and lookup.

To completely avoid JUKEBOX, you'd have to recall asynchronously.
Even better would be not to offer delegations when it is clear
there is an active NFSv3 accessor.

Is there a specific use case where holding onto delegations in
this case is measurably valuable?

As Jeff said above, it is workload dependent, but it seems that
we are choosing arbitrarily which workloads work well and which
will be penalized.

Clearly, speculating about future access is not allowed when
only NFSv4 is in play.


> And a 5 minute wait does sound like a client problem.

Even a 5 second wait is not good. A simple "touch" that takes
five seconds can generate user complaints.

I do see the point that a NFSv3 client implementation can be
changed to retry JUKEBOX more aggressively. Not all NFSv3 code
bases are actively maintained, however.


>>> The server can detect there are other accessors that do not provide
>>> OPEN/CLOSE semantics. In addition, the server cannot predict when one
>>> of these accessors may use a WRITE or SETATTR. And finally it does
>>> not have a reasonably performant mechanism for delaying those
>>> accessors when a delegation must be recalled.
>>> 
>> 
>> Interoperability is hard (and sometimes it doesn't work well :). We
>> simply don't have enough info to reliably guess what the v3 client will
>> do in this situation.

(This is in response to Jeff's comment)

Interoperability means following the spec, but IMO it also
means respecting longstanding implementation practice when
a specification does not prescribe particular behavior.

In this case, strictly speaking interoperability is not the
concern.

-> The spec authors clearly believed this is an area where
implementations are to be given free rein. Otherwise the text
would have provided RFC 2119 directives or other specific
guidelines. There was opportunity to add specifics in RFCs
3530, 7530, and 5661, but that wasn't done.

-> The scenario I reported does not involve operational
failure. It eventually succeeds whether the client's retry
is aggressive or lazy. It just works _better_ when there is
no DELAY/JUKEBOX.

There are a few normative constraints here, and I think we
have a bead on what those are, but IMO the issue is one of
implementation quality (on both ends).

/soapbox


>> That said, I wouldn't have a huge objection to a server side tunable
>> (module parameter?) that says "Recall read delegations on v2/3 READ
>> calls". Make it default to off, and then people in your situation could
>> set it if they thought it a better policy for their workload.

> I also wonder if in v3 case we should try a small synchronous wait
> before returning JUKEBOX.  Read delegations shouldn't require the client
> to do very much, so it could be they're typically returned in a
> fraction of a second.

That wait would have to be very short in the NFSv3 / UDP case
to avoid a retransmit timeout. I know, UDP is going away.

It's hard to say how long to wait. The RTT to the client might
have to be taken into account. In WAN deployments, this could
be as long as 50ms, for instance.

Although, again, waiting is speculative. A fixed 20ms wait
would be appropriate for most LAN deployments, and that's
where the expectation of consistently fast operation lies.


> Since we have a fixed number of threads, I don't think we'd want to keep
> one waiting much longer than that.  Also, it'd be nice if we could get
> woken up early when the delegation return comes in before our wait's
> over, but I haven't thought about how to do that.
> 
> And I don't know if that actually helps.

When there is a lot of file sharing between clients, it might
be good to reduce the penalty of delegation recalls.

Clients, after all, cannot know when a recall has completed,
so they have to guess about when to retransmit, and usually
make a conservative estimate. If server behavior can shorten
the delay without introducing race windows, that would be good
added value.

But I'm not clear why waiting must tie up the nfsd thread (pun
intended). How is a COMMIT or synchronous WRITE handled? Seems
like waiting for a delegation recall to complete is a similar
kind of thing.

--
Chuck Lever




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-13 15:30         ` Chuck Lever
@ 2017-03-13 16:01           ` J. Bruce Fields
  2017-03-13 16:06             ` J. Bruce Fields
  2017-03-13 16:33           ` Jeff Layton
  1 sibling, 1 reply; 13+ messages in thread
From: J. Bruce Fields @ 2017-03-13 16:01 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, Linux NFS Mailing List

On Mon, Mar 13, 2017 at 11:30:21AM -0400, Chuck Lever wrote:
> Hi Bruce-
> 
> 
> > On Mar 13, 2017, at 9:27 AM, J. Bruce Fields <bfields@redhat.com> wrote:
> > 
> > On Sat, Mar 11, 2017 at 04:04:34PM -0500, Jeff Layton wrote:
> >> On Sat, 2017-03-11 at 15:46 -0500, Chuck Lever wrote:
> >>>> On Mar 11, 2017, at 12:08 PM, Jeff Layton <jlayton@redhat.com> wrote:
> >>>> 
> >>>> On Sat, 2017-03-11 at 11:53 -0500, Chuck Lever wrote:
> >>>>> Hi Bruce, Jeff-
> >>>>> 
> >>>>> I've observed some interesting Linux NFS server behavior (v4.1.12).
> >>>>> 
> >>>>> We have a single system that has an NFSv4 mount via the kernel NFS
> >>>>> client, and an NFSv3 mount of the same export via a user space NFS
> >>>>> client. These two clients are accessing the same set of files.
> >>>>> 
> >>>>> The following pattern is seen on the wire. I've filtered a recent
> >>>>> capture on the FH of one of the shared files.
> >>>>> 
> >>>>> ---- cut here ----
> >>>>> 
> >>>>> 18507  19.483085    10.0.2.11 -> 10.0.1.8     NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE]
> >>>>> 18508  19.483827     10.0.1.8 -> 10.0.2.11    NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]
> >>>>> 18510  19.484676     10.0.1.8 -> 10.0.2.11    NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3
> >>>>> 
> >>>>> This OPEN reply offers a read delegation to the kernel NFS client.
> >>>>> 
> >>>>> 18511  19.484806    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> >>>>> 18512  19.485549     10.0.1.8 -> 10.0.2.11    NFS 274 V4 Reply (Call In 18511) GETATTR
> >>>>> 18513  19.485611    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> >>>>> 18514  19.486375     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18513) GETATTR
> >>>>> 18515  19.486464    10.0.2.11 -> 10.0.1.8     NFS 254 V4 Call CLOSE StateID: 0x6de3
> >>>>> 18516  19.487201     10.0.1.8 -> 10.0.2.11    NFS 202 V4 Reply (Call In 18515) CLOSE
> >>>>> 18556  19.498617    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192
> >>>>> 
> >>>>> This READ call by the user space client does not conflict with the
> >>>>> read delegation.
> >>>>> 
> >>>>> 18559  19.499396     10.0.1.8 -> 10.0.2.11    NFS 8390 V3 READ Reply (Call In 18556) Len: 8192
> >>>>> 18726  19.568975     10.0.1.8 -> 10.0.2.11    NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f
> >>>>> 18727  19.569170    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512
> >>>>> 18728  19.569923     10.0.1.8 -> 10.0.2.11    NFS 710 V3 READ Reply (Call In 18727) Len: 512
> >>>>> 18729  19.570135    10.0.2.11 -> 10.0.1.8     NFS 234 V3 SETATTR Call, FH: 0xc930444f
> >>>>> 18730  19.570901     10.0.1.8 -> 10.0.2.11    NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX
> >>>>> 
> >>>>> The user space client has attempted to extend the file. This does
> >>>>> conflict with the read delegation held by the kernel NFS client,
> >>>>> so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY.
> >>>>> This causes a negative performance impact on the user space NFS
> >>>>> client.
> >>>>> 
> >>>>> 18731  19.575396    10.0.2.11 -> 10.0.1.8     NFS 250 V4 Call DELEGRETURN StateID: 0x6de3
> >>>>> 18732  19.576132     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18731) DELEGRETURN
> >>>>> 
> >>>>> No CB_RECALL was done to trigger this DELEGRETURN. Apparently
> >>>>> the application that was accessing this file via the kernel OS
> >>>>> client decided already that it no longer needed the file before
> >>>>> the server could send the CB_RECALL. Sign of perhaps a race
> >>>>> between the applications accessing the file via these two
> >>>>> mounts.
> >>>>> 
> >>>>> ---- cut here ----
> >>>>> 
> >>>>> The server is aware of non-NFSv4 accessors of this file in frame
> >>>>> 18556. NFSv3 has no OPEN operation, of course, so it's not
> >>>>> possible for the server to determine how the NFSv3 client will
> >>>>> subsequently access this file.
> >>>>> 
> >>>> 
> >>>> Right. Why should we assume that the v3 client will do anything other
> >>>> than read there? If we recall the delegation just for reads, then we
> >>>> potentially negatively affect the performance of the v4 client.
> >>>> 
> >>>>> Seems like at frame 18556, it would be a best practice to recall
> >>>>> the delegation to avoid potential future conflicts, such as the
> >>>>> SETATTR in frame 18729.
> >>>>> 
> >>>>> Or, perhaps that READ isn't the first NFSv3 access of that file.
> >>>>> After all, a LOOKUP would have to be done to retrieve that file's
> >>>>> FH. The OPEN in frame 18556 perhaps could have avoided offering
> >>>>> the READ delegation, knowing there is a recent non-NFSv4 accessor
> >>>>> of that file.
> >>>>> 
> >>>>> Would these be difficult or inappropriate policies to implement?
> >>>>> 
> >>>>> 
> >>>> 
> >>>> Reads are not currently considered to be conflicting access vs. a read
> >>>> delegation.
> >>> 
> >>> Strictly speaking, a single NFSv3 READ does not violate the guarantee
> >>> made by the read delegation. And, strictly speaking, there can be no
> >>> OPEN conflict because NFSv3 does not have an OPEN operation.
> >>> 
> >>> The question is whether the server has an adequate mechanism for
> >>> delaying NFSv3 accessors when an NFSv4 delegation must be recalled.
> >>> 
> >>> NFS3ERR_JUKEBOX and NFS4ERR_DELAY share the same numeric value, but
> >>> imply different semantics.
> >>> 
> >>> RFC1813 says:
> >>> 
> >>> NFS3ERR_JUKEBOX
> >>>    The server initiated the request, but was not able to
> >>>    complete it in a timely fashion. The client should wait
> >>>    and then try the request with a new RPC transaction ID.
> >>>    For example, this error should be returned from a server
> >>>    that supports hierarchical storage and receives a request
> >>>    to process a file that has been migrated. In this case,
> >>>    the server should start the immigration process and
> >>>    respond to client with this error.
> >>> 
> >>> Some clients respond to NFS3ERR_JUKEBOX by waiting quite some time
> >>> before retrying.
> >>> 
> >>> RFC7530 says:
> >>> 
> >>> 13.1.1.3.  NFS4ERR_DELAY (Error Code 10008)
> >>> 
> >>>   For any of a number of reasons, the replier could not process this
> >>>   operation in what was deemed a reasonable time.  The client should
> >>>   wait and then try the request with a new RPC transaction ID.
> >>> 
> >>>   The following are two examples of what might lead to this situation:
> >>> 
> >>>   o  A server that supports hierarchical storage receives a request to
> >>>      process a file that had been migrated.
> >>> 
> >>>   o  An operation requires a delegation recall to proceed, and waiting
> >>>      for this delegation recall makes processing this request in a
> >>>      timely fashion impossible.
> >>> 
> >>> An NFSv4 client is prepared to retry this error almost immediately
> >>> because most of the time it is due to the second bullet.
> >>> 
> >>> I agree that not recalling after an NFSv3 READ is reasonable in some
> >>> cases. However, I demonstrated a case where the current policy does
> >>> not serve one of these clients well at all. In fact, the NFSv3
> >>> accessor in this case is the performance-sensitive one.
> >>> 
> >>> To put it another way, the NFSv4 protocol does not forbid the
> >>> current Linux server policy, but interoperating well with existing
> >>> NFSv3 clients suggests it's not an optimal policy choice.
> >>> 
> >> 
> >> I think that is entirely dependent on the workload. If we proactively
> >> recall delegations because we think the v3 client _might_ do some
> >> conflicting access, and then it doesn't, then that's also a non-optimal
> >> choice.
> >> 
> >>> 
> >>>> I think that's the correct thing to do. Until we have some
> >>>> sort of conflicting behavior I don't see why you'd want to prematurely
> >>>> recall the delegation.
> >>> 
> >>> The reason to recall a delegation is to avoid returning
> >>> NFS3ERR_JUKEBOX if at all possible, because doing so is a drastic
> >>> remedy that results in a performance regression.
> >>> 
> >>> The negative impact of not having a delegation is small. The negative
> >>> impact of returning NFS3ERR_JUKEBOX to a SETATTR or WRITE can be as
> >>> much as a 5 minute wait. (This is intolerably long for, say, online
> >>> transaction processing workloads).
> >>> 
> >> 
> >> That sounds like a deficient v3 client, IMO. There's nothing in the v3
> >> spec that I know of that advocates a delay that long before
> >> reattempting. I'm pretty sure the Linux client treats NFSERR3_JUKEBOX
> >> and NFS4ERR_DELAY more or less equivalently.
> > 
> > The v3 client uses a 5 second delay (see NFS_JUKEBOX_RETRY_TIME).
> 
> > The v4 client, at least in the case of operations that could break a
> > deleg, does exponential backoff starting with a tenth of a second--see
> > nfs4_delay.
> > 
> > So Trond's been taking the spec at its word here.
> > 
> > Like Jeff I'm pretty unhappy at the idea of revoking delegations
> > preemptively on v3 read and lookup.
> 
> To completely avoid JUKEBOX, you'd have to recall asynchronously.
> Even better would be not to offer delegations when it is clear
> there is an active NFSv3 accessor.
> 
> Is there a specific use case where holding onto delegations in
> this case is measurably valuable?
> 
> As Jeff said above, it is workload dependent, but it seems that
> we are choosing arbitrarily which workloads work well and which
> will be penalized.
> 
> Clearly, speculating about future access is not allowed when
> only NFSv4 is in play.
> 
> 
> > And a 5 minute wait does sound like a client problem.
> 
> Even a 5 second wait is not good. A simple "touch" that takes
> five seconds can generate user complaints.
> 
> I do see the point that a NFSv3 client implementation can be
> changed to retry JUKEBOX more aggressively. Not all NFSv3 code
> bases are actively maintained, however.
> 
> 
> >>> The server can detect there are other accessors that do not provide
> >>> OPEN/CLOSE semantics. In addition, the server cannot predict when one
> >>> of these accessors may use a WRITE or SETATTR. And finally it does
> >>> not have a reasonably performant mechanism for delaying those
> >>> accessors when a delegation must be recalled.
> >>> 
> >> 
> >> Interoperability is hard (and sometimes it doesn't work well :). We
> >> simply don't have enough info to reliably guess what the v3 client will
> >> do in this situation.
> 
> (This is in response to Jeff's comment)
> 
> Interoperability means following the spec, but IMO it also
> means respecting longstanding implementation practice when
> a specification does not prescribe particular behavior.
> 
> In this case, strictly speaking interoperability is not the
> concern.
> 
> -> The spec authors clearly believed this is an area where
> implementations are to be given free rein. Otherwise the text
> would have provided RFC 2119 directives or other specific
> guidelines. There was opportunity to add specifics in RFCs
> 3530, 7530, and 5661, but that wasn't done.
> 
> -> The scenario I reported does not involve operational
> failure. It eventually succeeds whether the client's retry
> is aggressive or lazy. It just works _better_ when there is
> no DELAY/JUKEBOX.
> 
> There are a few normative constraints here, and I think we
> have a bead on what those are, but IMO the issue is one of
> implementation quality (on both ends).
> 
> /soapbox
> 
> 
> >> That said, I wouldn't have a huge objection to a server side tunable
> >> (module parameter?) that says "Recall read delegations on v2/3 READ
> >> calls". Make it default to off, and then people in your situation could
> >> set it if they thought it a better policy for their workload.
> 
> > I also wonder if in v3 case we should try a small synchronous wait
> > before returning JUKEBOX.  Read delegations shouldn't require the client
> > to do very much, so it could be they're typically returned in a
> > fraction of a second.
> 
> That wait would have to be very short in the NFSv3 / UDP case
> to avoid a retransmit timeout. I know, UDP is going away.
> 
> It's hard to say how long to wait. The RTT to the client might
> have to be taken into account. In WAN deployments, this could
> be as long as 50ms, for instance.
> 
> Although, again, waiting is speculative. A fixed 20ms wait
> would be appropriate for most LAN deployments, and that's
> where the expectation of consistently fast operation lies.

That wait doesn't sound bad at all.

The server could track round trip times to the clients holding
delegations and use that to estimate the right wait time, but hopefully
that's overkill.

I think the occasional retransmitted truncate probably isn't a big deal.
Seems like the kind of thing the reply cache would handle well.

> > Since we have a fixed number of threads, I don't think we'd want to keep
> > one waiting much longer than that.  Also, it'd be nice if we could get
> > woken up early when the delegation return comes in before our wait's
> > over, but I haven't thought about how to do that.
> > 
> > And I don't know if that actually helps.
> 
> When there is a lot of file sharing between clients, it might
> be good to reduce the penalty of delegation recalls.
> 
> Clients, after all, cannot know when a recall has completed,
> so they have to guess about when to retransmit, and usually
> make a conservative estimate. If server behavior can shorten
> the delay without introducing race windows, that would be good
> added value.
> 
> But I'm not clear why waiting must tie up the nfsd thread (pun
> intended). How is a COMMIT or synchronous WRITE handled? Seems
> like waiting for a delegation recall to complete is a similar
> kind of thing.

I agree, if the wait time is on the order of the time you'd spend
waiting for disk, then it's quite reasonable.

There is one odd issue, which maybe wouldn't be an issue in practice,
but: to release the delegation, we need a thread free to process the
delegation return.  So there's a deadlock if we allow threads to wait
for delegation returns.  The timeout should prevent the deadlock.  But
if there was some event that caused a bunch of near-simultaneous
delegation-recalling NFSv3 operations, then in theory waiting could make
the problem worse by delaying the handling of the delegation return.

But I think it's not a big risk, and we could manage it somehow if I
turned out to be wrong....

--b.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-13 16:01           ` J. Bruce Fields
@ 2017-03-13 16:06             ` J. Bruce Fields
  0 siblings, 0 replies; 13+ messages in thread
From: J. Bruce Fields @ 2017-03-13 16:06 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, Linux NFS Mailing List

On Mon, Mar 13, 2017 at 12:01:45PM -0400, J. Bruce Fields wrote:
> That wait doesn't sound bad at all.
> 
> The server could track round trip times to the clients holding
> delegations and use that to estimate the right wait time, but hopefully
> that's overkill.
> 
> I think the occasional retransmitted truncate probably isn't a big deal.
> Seems like the kind of thing the reply cache would handle well.

I haven't thought about what it'd take to implement.  Just for testing
purposes, maybe stick a 50ms wait and a retry in the JUKEBOX case in
anything under fs/nfsd/nfs3proc.c that looks like it could break a
delegation.

Jeff's proposal should also be pretty easy to prototype.

--b.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-13 15:30         ` Chuck Lever
  2017-03-13 16:01           ` J. Bruce Fields
@ 2017-03-13 16:33           ` Jeff Layton
  2017-03-13 17:12             ` Chuck Lever
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2017-03-13 16:33 UTC (permalink / raw)
  To: Chuck Lever, J. Bruce Fields; +Cc: Linux NFS Mailing List

On Mon, 2017-03-13 at 11:30 -0400, Chuck Lever wrote:
> Hi Bruce-
> 
> 
> > On Mar 13, 2017, at 9:27 AM, J. Bruce Fields <bfields@redhat.com> wrote:
> > 
> > On Sat, Mar 11, 2017 at 04:04:34PM -0500, Jeff Layton wrote:
> > > On Sat, 2017-03-11 at 15:46 -0500, Chuck Lever wrote:
> > > > > On Mar 11, 2017, at 12:08 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > > 
> > > > > On Sat, 2017-03-11 at 11:53 -0500, Chuck Lever wrote:
> > > > > > Hi Bruce, Jeff-
> > > > > > 
> > > > > > I've observed some interesting Linux NFS server behavior (v4.1.12).
> > > > > > 
> > > > > > We have a single system that has an NFSv4 mount via the kernel NFS
> > > > > > client, and an NFSv3 mount of the same export via a user space NFS
> > > > > > client. These two clients are accessing the same set of files.
> > > > > > 
> > > > > > The following pattern is seen on the wire. I've filtered a recent
> > > > > > capture on the FH of one of the shared files.
> > > > > > 
> > > > > > ---- cut here ----
> > > > > > 
> > > > > > 18507  19.483085    10.0.2.11 -> 10.0.1.8     NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE]
> > > > > > 18508  19.483827     10.0.1.8 -> 10.0.2.11    NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]
> > > > > > 18510  19.484676     10.0.1.8 -> 10.0.2.11    NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3
> > > > > > 
> > > > > > This OPEN reply offers a read delegation to the kernel NFS client.
> > > > > > 
> > > > > > 18511  19.484806    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> > > > > > 18512  19.485549     10.0.1.8 -> 10.0.2.11    NFS 274 V4 Reply (Call In 18511) GETATTR
> > > > > > 18513  19.485611    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> > > > > > 18514  19.486375     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18513) GETATTR
> > > > > > 18515  19.486464    10.0.2.11 -> 10.0.1.8     NFS 254 V4 Call CLOSE StateID: 0x6de3
> > > > > > 18516  19.487201     10.0.1.8 -> 10.0.2.11    NFS 202 V4 Reply (Call In 18515) CLOSE
> > > > > > 18556  19.498617    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192
> > > > > > 
> > > > > > This READ call by the user space client does not conflict with the
> > > > > > read delegation.
> > > > > > 
> > > > > > 18559  19.499396     10.0.1.8 -> 10.0.2.11    NFS 8390 V3 READ Reply (Call In 18556) Len: 8192
> > > > > > 18726  19.568975     10.0.1.8 -> 10.0.2.11    NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f
> > > > > > 18727  19.569170    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512
> > > > > > 18728  19.569923     10.0.1.8 -> 10.0.2.11    NFS 710 V3 READ Reply (Call In 18727) Len: 512
> > > > > > 18729  19.570135    10.0.2.11 -> 10.0.1.8     NFS 234 V3 SETATTR Call, FH: 0xc930444f
> > > > > > 18730  19.570901     10.0.1.8 -> 10.0.2.11    NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX
> > > > > > 
> > > > > > The user space client has attempted to extend the file. This does
> > > > > > conflict with the read delegation held by the kernel NFS client,
> > > > > > so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY.
> > > > > > This causes a negative performance impact on the user space NFS
> > > > > > client.
> > > > > > 
> > > > > > 18731  19.575396    10.0.2.11 -> 10.0.1.8     NFS 250 V4 Call DELEGRETURN StateID: 0x6de3
> > > > > > 18732  19.576132     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18731) DELEGRETURN
> > > > > > 
> > > > > > No CB_RECALL was done to trigger this DELEGRETURN. Apparently
> > > > > > the application that was accessing this file via the kernel OS
> > > > > > client decided already that it no longer needed the file before
> > > > > > the server could send the CB_RECALL. Sign of perhaps a race
> > > > > > between the applications accessing the file via these two
> > > > > > mounts.
> > > > > > 
> > > > > > ---- cut here ----
> > > > > > 
> > > > > > The server is aware of non-NFSv4 accessors of this file in frame
> > > > > > 18556. NFSv3 has no OPEN operation, of course, so it's not
> > > > > > possible for the server to determine how the NFSv3 client will
> > > > > > subsequently access this file.
> > > > > > 
> > > > > 
> > > > > Right. Why should we assume that the v3 client will do anything other
> > > > > than read there? If we recall the delegation just for reads, then we
> > > > > potentially negatively affect the performance of the v4 client.
> > > > > 
> > > > > > Seems like at frame 18556, it would be a best practice to recall
> > > > > > the delegation to avoid potential future conflicts, such as the
> > > > > > SETATTR in frame 18729.
> > > > > > 
> > > > > > Or, perhaps that READ isn't the first NFSv3 access of that file.
> > > > > > After all, a LOOKUP would have to be done to retrieve that file's
> > > > > > FH. The OPEN in frame 18556 perhaps could have avoided offering
> > > > > > the READ delegation, knowing there is a recent non-NFSv4 accessor
> > > > > > of that file.
> > > > > > 
> > > > > > Would these be difficult or inappropriate policies to implement?
> > > > > > 
> > > > > > 
> > > > > 
> > > > > Reads are not currently considered to be conflicting access vs. a read
> > > > > delegation.
> > > > 
> > > > Strictly speaking, a single NFSv3 READ does not violate the guarantee
> > > > made by the read delegation. And, strictly speaking, there can be no
> > > > OPEN conflict because NFSv3 does not have an OPEN operation.
> > > > 
> > > > The question is whether the server has an adequate mechanism for
> > > > delaying NFSv3 accessors when an NFSv4 delegation must be recalled.
> > > > 
> > > > NFS3ERR_JUKEBOX and NFS4ERR_DELAY share the same numeric value, but
> > > > imply different semantics.
> > > > 
> > > > RFC1813 says:
> > > > 
> > > > NFS3ERR_JUKEBOX
> > > >    The server initiated the request, but was not able to
> > > >    complete it in a timely fashion. The client should wait
> > > >    and then try the request with a new RPC transaction ID.
> > > >    For example, this error should be returned from a server
> > > >    that supports hierarchical storage and receives a request
> > > >    to process a file that has been migrated. In this case,
> > > >    the server should start the immigration process and
> > > >    respond to client with this error.
> > > > 
> > > > Some clients respond to NFS3ERR_JUKEBOX by waiting quite some time
> > > > before retrying.
> > > > 
> > > > RFC7530 says:
> > > > 
> > > > 13.1.1.3.  NFS4ERR_DELAY (Error Code 10008)
> > > > 
> > > >   For any of a number of reasons, the replier could not process this
> > > >   operation in what was deemed a reasonable time.  The client should
> > > >   wait and then try the request with a new RPC transaction ID.
> > > > 
> > > >   The following are two examples of what might lead to this situation:
> > > > 
> > > >   o  A server that supports hierarchical storage receives a request to
> > > >      process a file that had been migrated.
> > > > 
> > > >   o  An operation requires a delegation recall to proceed, and waiting
> > > >      for this delegation recall makes processing this request in a
> > > >      timely fashion impossible.
> > > > 
> > > > An NFSv4 client is prepared to retry this error almost immediately
> > > > because most of the time it is due to the second bullet.
> > > > 
> > > > I agree that not recalling after an NFSv3 READ is reasonable in some
> > > > cases. However, I demonstrated a case where the current policy does
> > > > not serve one of these clients well at all. In fact, the NFSv3
> > > > accessor in this case is the performance-sensitive one.
> > > > 
> > > > To put it another way, the NFSv4 protocol does not forbid the
> > > > current Linux server policy, but interoperating well with existing
> > > > NFSv3 clients suggests it's not an optimal policy choice.
> > > > 
> > > 
> > > I think that is entirely dependent on the workload. If we proactively
> > > recall delegations because we think the v3 client _might_ do some
> > > conflicting access, and then it doesn't, then that's also a non-optimal
> > > choice.
> > > 
> > > > 
> > > > > I think that's the correct thing to do. Until we have some
> > > > > sort of conflicting behavior I don't see why you'd want to prematurely
> > > > > recall the delegation.
> > > > 
> > > > The reason to recall a delegation is to avoid returning
> > > > NFS3ERR_JUKEBOX if at all possible, because doing so is a drastic
> > > > remedy that results in a performance regression.
> > > > 
> > > > The negative impact of not having a delegation is small. The negative
> > > > impact of returning NFS3ERR_JUKEBOX to a SETATTR or WRITE can be as
> > > > much as a 5 minute wait. (This is intolerably long for, say, online
> > > > transaction processing workloads).
> > > > 
> > > 
> > > That sounds like a deficient v3 client, IMO. There's nothing in the v3
> > > spec that I know of that advocates a delay that long before
> > > reattempting. I'm pretty sure the Linux client treats NFSERR3_JUKEBOX
> > > and NFS4ERR_DELAY more or less equivalently.
> > 
> > The v3 client uses a 5 second delay (see NFS_JUKEBOX_RETRY_TIME).
> > The v4 client, at least in the case of operations that could break a
> > deleg, does exponential backoff starting with a tenth of a second--see
> > nfs4_delay.
> > 
> > So Trond's been taking the spec at its word here.
> > 
> > Like Jeff I'm pretty unhappy at the idea of revoking delegations
> > preemptively on v3 read and lookup.
> 
> To completely avoid JUKEBOX, you'd have to recall asynchronously.
> Even better would be not to offer delegations when it is clear
> there is an active NFSv3 accessor.
> 
> Is there a specific use case where holding onto delegations in
> this case is measurably valuable?
> 
> As Jeff said above, it is workload dependent, but it seems that
> we are choosing arbitrarily which workloads work well and which
> will be penalized.
> 
> Clearly, speculating about future access is not allowed when
> only NFSv4 is in play.
> 
> 
> > And a 5 minute wait does sound like a client problem.
> 
> Even a 5 second wait is not good. A simple "touch" that takes
> five seconds can generate user complaints.
> 
> I do see the point that a NFSv3 client implementation can be
> changed to retry JUKEBOX more aggressively. Not all NFSv3 code
> bases are actively maintained, however.
> 
> 
> > > > The server can detect there are other accessors that do not provide
> > > > OPEN/CLOSE semantics. In addition, the server cannot predict when one
> > > > of these accessors may use a WRITE or SETATTR. And finally it does
> > > > not have a reasonably performant mechanism for delaying those
> > > > accessors when a delegation must be recalled.
> > > > 
> > > 
> > > Interoperability is hard (and sometimes it doesn't work well :). We
> > > simply don't have enough info to reliably guess what the v3 client will
> > > do in this situation.
> 
> (This is in response to Jeff's comment)
> 
> Interoperability means following the spec, but IMO it also
> means respecting longstanding implementation practice when
> a specification does not prescribe particular behavior.
> 
> In this case, strictly speaking interoperability is not the
> concern.
> 
> -> The spec authors clearly believed this is an area where
> implementations are to be given free rein. Otherwise the text
> would have provided RFC 2119 directives or other specific
> guidelines. There was opportunity to add specifics in RFCs
> 3530, 7530, and 5661, but that wasn't done.
> 
> -> The scenario I reported does not involve operational
> failure. It eventually succeeds whether the client's retry
> is aggressive or lazy. It just works _better_ when there is
> no DELAY/JUKEBOX.
> 
> There are a few normative constraints here, and I think we
> have a bead on what those are, but IMO the issue is one of
> implementation quality (on both ends).
> 

Yes. I'm just not sold that what you're proposing would be any better
than what we have for the vast majority of people. It might be, but I
don't think that's necessarily the case.

> 
> > > That said, I wouldn't have a huge objection to a server side tunable
> > > (module parameter?) that says "Recall read delegations on v2/3 READ
> > > calls". Make it default to off, and then people in your situation could
> > > set it if they thought it a better policy for their workload.
> > I also wonder if in v3 case we should try a small synchronous wait
> > before returning JUKEBOX.  Read delegations shouldn't require the client
> > to do very much, so it could be they're typically returned in a
> > fraction of a second.
> 
> That wait would have to be very short in the NFSv3 / UDP case
> to avoid a retransmit timeout. I know, UDP is going away.
> 
> It's hard to say how long to wait. The RTT to the client might
> have to be taken into account. In WAN deployments, this could
> be as long as 50ms, for instance.
> 
> Although, again, waiting is speculative. A fixed 20ms wait
> would be appropriate for most LAN deployments, and that's
> where the expectation of consistently fast operation lies.
> 

Not a bad idea. That delay could be tunable as well.

> 
> > Since we have a fixed number of threads, I don't think we'd want to keep
> > one waiting much longer than that.  Also, it'd be nice if we could get
> > woken up early when the delegation return comes in before our wait's
> > over, but I haven't thought about how to do that.
> > 
> > And I don't know if that actually helps.
> 
> When there is a lot of file sharing between clients, it might
> be good to reduce the penalty of delegation recalls.
> 

The best way to do that would probably be to have better heuristics for
deciding whether to hand them out in the first place. We have a little
of that now with the bloom filter, but maybe those rules could be more
friendly to this use-case?

> Clients, after all, cannot know when a recall has completed,
> so they have to guess about when to retransmit, and usually
> make a conservative estimate. If server behavior can shorten
> the delay without introducing race windows, that would be good
> added value.
> 
> But I'm not clear why waiting must tie up the nfsd thread (pun
> intended). How is a COMMIT or synchronous WRITE handled? Seems
> like waiting for a delegation recall to complete is a similar
> kind of thing.
> 

It's not required per-se, but there currently isn't a good mechanism to
idle RPCs in the server without putting the thread to sleep. It may be
possible to do that with the svc_defer stuff, but I'm a little leery of
that code.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-13 16:33           ` Jeff Layton
@ 2017-03-13 17:12             ` Chuck Lever
  2017-03-13 18:26               ` Chuck Lever
  2017-03-14 13:55               ` J. Bruce Fields
  0 siblings, 2 replies; 13+ messages in thread
From: Chuck Lever @ 2017-03-13 17:12 UTC (permalink / raw)
  To: Jeff Layton; +Cc: J. Bruce Fields, Linux NFS Mailing List


> On Mar 13, 2017, at 12:33 PM, Jeff Layton <jlayton@redhat.com> wrote:
> 
> On Mon, 2017-03-13 at 11:30 -0400, Chuck Lever wrote:
>> Hi Bruce-
>> 
>> 
>>> On Mar 13, 2017, at 9:27 AM, J. Bruce Fields <bfields@redhat.com> wrote:
>>> 
>>> On Sat, Mar 11, 2017 at 04:04:34PM -0500, Jeff Layton wrote:
>>>> On Sat, 2017-03-11 at 15:46 -0500, Chuck Lever wrote:
>>>>>> On Mar 11, 2017, at 12:08 PM, Jeff Layton <jlayton@redhat.com> wrote:
>>>>>> 
>>>>>> On Sat, 2017-03-11 at 11:53 -0500, Chuck Lever wrote:
>>>>>>> Hi Bruce, Jeff-
>>>>>>> 
>>>>>>> I've observed some interesting Linux NFS server behavior (v4.1.12).
>>>>>>> 
>>>>>>> We have a single system that has an NFSv4 mount via the kernel NFS
>>>>>>> client, and an NFSv3 mount of the same export via a user space NFS
>>>>>>> client. These two clients are accessing the same set of files.
>>>>>>> 
>>>>>>> The following pattern is seen on the wire. I've filtered a recent
>>>>>>> capture on the FH of one of the shared files.
>>>>>>> 
>>>>>>> ---- cut here ----
>>>>>>> 
>>>>>>> 18507  19.483085    10.0.2.11 -> 10.0.1.8     NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE]
>>>>>>> 18508  19.483827     10.0.1.8 -> 10.0.2.11    NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]
>>>>>>> 18510  19.484676     10.0.1.8 -> 10.0.2.11    NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3
>>>>>>> 
>>>>>>> This OPEN reply offers a read delegation to the kernel NFS client.
>>>>>>> 
>>>>>>> 18511  19.484806    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
>>>>>>> 18512  19.485549     10.0.1.8 -> 10.0.2.11    NFS 274 V4 Reply (Call In 18511) GETATTR
>>>>>>> 18513  19.485611    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
>>>>>>> 18514  19.486375     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18513) GETATTR
>>>>>>> 18515  19.486464    10.0.2.11 -> 10.0.1.8     NFS 254 V4 Call CLOSE StateID: 0x6de3
>>>>>>> 18516  19.487201     10.0.1.8 -> 10.0.2.11    NFS 202 V4 Reply (Call In 18515) CLOSE
>>>>>>> 18556  19.498617    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192
>>>>>>> 
>>>>>>> This READ call by the user space client does not conflict with the
>>>>>>> read delegation.
>>>>>>> 
>>>>>>> 18559  19.499396     10.0.1.8 -> 10.0.2.11    NFS 8390 V3 READ Reply (Call In 18556) Len: 8192
>>>>>>> 18726  19.568975     10.0.1.8 -> 10.0.2.11    NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f
>>>>>>> 18727  19.569170    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512
>>>>>>> 18728  19.569923     10.0.1.8 -> 10.0.2.11    NFS 710 V3 READ Reply (Call In 18727) Len: 512
>>>>>>> 18729  19.570135    10.0.2.11 -> 10.0.1.8     NFS 234 V3 SETATTR Call, FH: 0xc930444f
>>>>>>> 18730  19.570901     10.0.1.8 -> 10.0.2.11    NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX
>>>>>>> 
>>>>>>> The user space client has attempted to extend the file. This does
>>>>>>> conflict with the read delegation held by the kernel NFS client,
>>>>>>> so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY.
>>>>>>> This causes a negative performance impact on the user space NFS
>>>>>>> client.
>>>>>>> 
>>>>>>> 18731  19.575396    10.0.2.11 -> 10.0.1.8     NFS 250 V4 Call DELEGRETURN StateID: 0x6de3
>>>>>>> 18732  19.576132     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18731) DELEGRETURN
>>>>>>> 
>>>>>>> No CB_RECALL was done to trigger this DELEGRETURN. Apparently
>>>>>>> the application that was accessing this file via the kernel OS
>>>>>>> client decided already that it no longer needed the file before
>>>>>>> the server could send the CB_RECALL. Sign of perhaps a race
>>>>>>> between the applications accessing the file via these two
>>>>>>> mounts.
>>>>>>> 
>>>>>>> ---- cut here ----
>>>>>>> 
>>>>>>> The server is aware of non-NFSv4 accessors of this file in frame
>>>>>>> 18556. NFSv3 has no OPEN operation, of course, so it's not
>>>>>>> possible for the server to determine how the NFSv3 client will
>>>>>>> subsequently access this file.
>>>>>>> 
>>>>>> 
>>>>>> Right. Why should we assume that the v3 client will do anything other
>>>>>> than read there? If we recall the delegation just for reads, then we
>>>>>> potentially negatively affect the performance of the v4 client.
>>>>>> 
>>>>>>> Seems like at frame 18556, it would be a best practice to recall
>>>>>>> the delegation to avoid potential future conflicts, such as the
>>>>>>> SETATTR in frame 18729.
>>>>>>> 
>>>>>>> Or, perhaps that READ isn't the first NFSv3 access of that file.
>>>>>>> After all, a LOOKUP would have to be done to retrieve that file's
>>>>>>> FH. The OPEN in frame 18556 perhaps could have avoided offering
>>>>>>> the READ delegation, knowing there is a recent non-NFSv4 accessor
>>>>>>> of that file.
>>>>>>> 
>>>>>>> Would these be difficult or inappropriate policies to implement?
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> Reads are not currently considered to be conflicting access vs. a read
>>>>>> delegation.
>>>>> 
>>>>> Strictly speaking, a single NFSv3 READ does not violate the guarantee
>>>>> made by the read delegation. And, strictly speaking, there can be no
>>>>> OPEN conflict because NFSv3 does not have an OPEN operation.
>>>>> 
>>>>> The question is whether the server has an adequate mechanism for
>>>>> delaying NFSv3 accessors when an NFSv4 delegation must be recalled.
>>>>> 
>>>>> NFS3ERR_JUKEBOX and NFS4ERR_DELAY share the same numeric value, but
>>>>> imply different semantics.
>>>>> 
>>>>> RFC1813 says:
>>>>> 
>>>>> NFS3ERR_JUKEBOX
>>>>>   The server initiated the request, but was not able to
>>>>>   complete it in a timely fashion. The client should wait
>>>>>   and then try the request with a new RPC transaction ID.
>>>>>   For example, this error should be returned from a server
>>>>>   that supports hierarchical storage and receives a request
>>>>>   to process a file that has been migrated. In this case,
>>>>>   the server should start the immigration process and
>>>>>   respond to client with this error.
>>>>> 
>>>>> Some clients respond to NFS3ERR_JUKEBOX by waiting quite some time
>>>>> before retrying.
>>>>> 
>>>>> RFC7530 says:
>>>>> 
>>>>> 13.1.1.3.  NFS4ERR_DELAY (Error Code 10008)
>>>>> 
>>>>>  For any of a number of reasons, the replier could not process this
>>>>>  operation in what was deemed a reasonable time.  The client should
>>>>>  wait and then try the request with a new RPC transaction ID.
>>>>> 
>>>>>  The following are two examples of what might lead to this situation:
>>>>> 
>>>>>  o  A server that supports hierarchical storage receives a request to
>>>>>     process a file that had been migrated.
>>>>> 
>>>>>  o  An operation requires a delegation recall to proceed, and waiting
>>>>>     for this delegation recall makes processing this request in a
>>>>>     timely fashion impossible.
>>>>> 
>>>>> An NFSv4 client is prepared to retry this error almost immediately
>>>>> because most of the time it is due to the second bullet.
>>>>> 
>>>>> I agree that not recalling after an NFSv3 READ is reasonable in some
>>>>> cases. However, I demonstrated a case where the current policy does
>>>>> not serve one of these clients well at all. In fact, the NFSv3
>>>>> accessor in this case is the performance-sensitive one.
>>>>> 
>>>>> To put it another way, the NFSv4 protocol does not forbid the
>>>>> current Linux server policy, but interoperating well with existing
>>>>> NFSv3 clients suggests it's not an optimal policy choice.
>>>>> 
>>>> 
>>>> I think that is entirely dependent on the workload. If we proactively
>>>> recall delegations because we think the v3 client _might_ do some
>>>> conflicting access, and then it doesn't, then that's also a non-optimal
>>>> choice.
>>>> 
>>>>> 
>>>>>> I think that's the correct thing to do. Until we have some
>>>>>> sort of conflicting behavior I don't see why you'd want to prematurely
>>>>>> recall the delegation.
>>>>> 
>>>>> The reason to recall a delegation is to avoid returning
>>>>> NFS3ERR_JUKEBOX if at all possible, because doing so is a drastic
>>>>> remedy that results in a performance regression.
>>>>> 
>>>>> The negative impact of not having a delegation is small. The negative
>>>>> impact of returning NFS3ERR_JUKEBOX to a SETATTR or WRITE can be as
>>>>> much as a 5 minute wait. (This is intolerably long for, say, online
>>>>> transaction processing workloads).
>>>>> 
>>>> 
>>>> That sounds like a deficient v3 client, IMO. There's nothing in the v3
>>>> spec that I know of that advocates a delay that long before
>>>> reattempting. I'm pretty sure the Linux client treats NFSERR3_JUKEBOX
>>>> and NFS4ERR_DELAY more or less equivalently.
>>> 
>>> The v3 client uses a 5 second delay (see NFS_JUKEBOX_RETRY_TIME).
>>> The v4 client, at least in the case of operations that could break a
>>> deleg, does exponential backoff starting with a tenth of a second--see
>>> nfs4_delay.
>>> 
>>> So Trond's been taking the spec at its word here.
>>> 
>>> Like Jeff I'm pretty unhappy at the idea of revoking delegations
>>> preemptively on v3 read and lookup.
>> 
>> To completely avoid JUKEBOX, you'd have to recall asynchronously.
>> Even better would be not to offer delegations when it is clear
>> there is an active NFSv3 accessor.
>> 
>> Is there a specific use case where holding onto delegations in
>> this case is measurably valuable?
>> 
>> As Jeff said above, it is workload dependent, but it seems that
>> we are choosing arbitrarily which workloads work well and which
>> will be penalized.
>> 
>> Clearly, speculating about future access is not allowed when
>> only NFSv4 is in play.
>> 
>> 
>>> And a 5 minute wait does sound like a client problem.
>> 
>> Even a 5 second wait is not good. A simple "touch" that takes
>> five seconds can generate user complaints.
>> 
>> I do see the point that a NFSv3 client implementation can be
>> changed to retry JUKEBOX more aggressively. Not all NFSv3 code
>> bases are actively maintained, however.
>> 
>> 
>>>>> The server can detect there are other accessors that do not provide
>>>>> OPEN/CLOSE semantics. In addition, the server cannot predict when one
>>>>> of these accessors may use a WRITE or SETATTR. And finally it does
>>>>> not have a reasonably performant mechanism for delaying those
>>>>> accessors when a delegation must be recalled.
>>>>> 
>>>> 
>>>> Interoperability is hard (and sometimes it doesn't work well :). We
>>>> simply don't have enough info to reliably guess what the v3 client will
>>>> do in this situation.
>> 
>> (This is in response to Jeff's comment)
>> 
>> Interoperability means following the spec, but IMO it also
>> means respecting longstanding implementation practice when
>> a specification does not prescribe particular behavior.
>> 
>> In this case, strictly speaking interoperability is not the
>> concern.
>> 
>> -> The spec authors clearly believed this is an area where
>> implementations are to be given free rein. Otherwise the text
>> would have provided RFC 2119 directives or other specific
>> guidelines. There was opportunity to add specifics in RFCs
>> 3530, 7530, and 5661, but that wasn't done.
>> 
>> -> The scenario I reported does not involve operational
>> failure. It eventually succeeds whether the client's retry
>> is aggressive or lazy. It just works _better_ when there is
>> no DELAY/JUKEBOX.
>> 
>> There are a few normative constraints here, and I think we
>> have a bead on what those are, but IMO the issue is one of
>> implementation quality (on both ends).
>> 
> 
> Yes. I'm just not sold that what you're proposing would be any better
> than what we have for the vast majority of people. It might be, but I
> don't think that's necessarily the case.

In other words, both of you are comparing my use case with
a counterfactual. That doesn't seem like a fair fight.

Can you demonstrate a specific use case where not offering
a delegation during mixed NFSv3 and NFSv4 access is a true
detriment? (I am open to hearing about it).

What happens when an NFSv3 client sends an NLM LOCK on a
delegated file? I assume the correct response is for the
server to return NLM_LCK_BLOCKED, recall the delegation, and
then call the client back when the delegation has been
returned. Is that known to work?


>>>> That said, I wouldn't have a huge objection to a server side tunable
>>>> (module parameter?) that says "Recall read delegations on v2/3 READ
>>>> calls". Make it default to off, and then people in your situation could
>>>> set it if they thought it a better policy for their workload.
>>> I also wonder if in v3 case we should try a small synchronous wait
>>> before returning JUKEBOX.  Read delegations shouldn't require the client
>>> to do very much, so it could be they're typically returned in a
>>> fraction of a second.
>> 
>> That wait would have to be very short in the NFSv3 / UDP case
>> to avoid a retransmit timeout. I know, UDP is going away.
>> 
>> It's hard to say how long to wait. The RTT to the client might
>> have to be taken into account. In WAN deployments, this could
>> be as long as 50ms, for instance.
>> 
>> Although, again, waiting is speculative. A fixed 20ms wait
>> would be appropriate for most LAN deployments, and that's
>> where the expectation of consistently fast operation lies.
>> 
> 
> Not a bad idea. That delay could be tunable as well.

>>> Since we have a fixed number of threads, I don't think we'd want to keep
>>> one waiting much longer than that.  Also, it'd be nice if we could get
>>> woken up early when the delegation return comes in before our wait's
>>> over, but I haven't thought about how to do that.
>>> 
>>> And I don't know if that actually helps.
>> 
>> When there is a lot of file sharing between clients, it might
>> be good to reduce the penalty of delegation recalls.
>> 
> 
> The best way to do that would probably be to have better heuristics for
> deciding whether to hand them out in the first place.

I thought that was exactly what I was suggesting. ;-)
See above ("To completely avoid...").


> We have a little
> of that now with the bloom filter, but maybe those rules could be more
> friendly to this use-case?
> 
>> Clients, after all, cannot know when a recall has completed,
>> so they have to guess about when to retransmit, and usually
>> make a conservative estimate. If server behavior can shorten
>> the delay without introducing race windows, that would be good
>> added value.
>> 
>> But I'm not clear why waiting must tie up the nfsd thread (pun
>> intended). How is a COMMIT or synchronous WRITE handled? Seems
>> like waiting for a delegation recall to complete is a similar
>> kind of thing.
>> 
> 
> It's not required per-se, but there currently isn't a good mechanism to
> idle RPCs in the server without putting the thread to sleep. It may be
> possible to do that with the svc_defer stuff, but I'm a little leery of
> that code.

There are other cases where context switching an nfsd would be
useful. For example, inserting an opportunity for nfsd_write
to perform transport reads (after having allocated pages in
the right file) could provide some benefits by reducing data
copies and page allocator calls.

I'm agnostic about exactly how this is done.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-13 17:12             ` Chuck Lever
@ 2017-03-13 18:26               ` Chuck Lever
  2017-03-14 14:05                 ` Jeff Layton
  2017-03-14 13:55               ` J. Bruce Fields
  1 sibling, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2017-03-13 18:26 UTC (permalink / raw)
  To: Jeff Layton; +Cc: J. Bruce Fields, Linux NFS Mailing List


> On Mar 13, 2017, at 1:12 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
>> 
>> On Mar 13, 2017, at 12:33 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> 
>> On Mon, 2017-03-13 at 11:30 -0400, Chuck Lever wrote:
>>> Hi Bruce-
>>> 
>>> 
>>>> On Mar 13, 2017, at 9:27 AM, J. Bruce Fields <bfields@redhat.com> wrote:
>>>> 
>>>> On Sat, Mar 11, 2017 at 04:04:34PM -0500, Jeff Layton wrote:
>>>>> On Sat, 2017-03-11 at 15:46 -0500, Chuck Lever wrote:
>>>>>>> On Mar 11, 2017, at 12:08 PM, Jeff Layton <jlayton@redhat.com> wrote:
>>>>>>> 
>>>>>>> On Sat, 2017-03-11 at 11:53 -0500, Chuck Lever wrote:
>>>>>>>> Hi Bruce, Jeff-
>>>>>>>> 
>>>>>>>> I've observed some interesting Linux NFS server behavior (v4.1.12).
>>>>>>>> 
>>>>>>>> We have a single system that has an NFSv4 mount via the kernel NFS
>>>>>>>> client, and an NFSv3 mount of the same export via a user space NFS
>>>>>>>> client. These two clients are accessing the same set of files.
>>>>>>>> 
>>>>>>>> The following pattern is seen on the wire. I've filtered a recent
>>>>>>>> capture on the FH of one of the shared files.
>>>>>>>> 
>>>>>>>> ---- cut here ----
>>>>>>>> 
>>>>>>>> 18507  19.483085    10.0.2.11 -> 10.0.1.8     NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE]
>>>>>>>> 18508  19.483827     10.0.1.8 -> 10.0.2.11    NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]
>>>>>>>> 18510  19.484676     10.0.1.8 -> 10.0.2.11    NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3
>>>>>>>> 
>>>>>>>> This OPEN reply offers a read delegation to the kernel NFS client.
>>>>>>>> 
>>>>>>>> 18511  19.484806    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
>>>>>>>> 18512  19.485549     10.0.1.8 -> 10.0.2.11    NFS 274 V4 Reply (Call In 18511) GETATTR
>>>>>>>> 18513  19.485611    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
>>>>>>>> 18514  19.486375     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18513) GETATTR
>>>>>>>> 18515  19.486464    10.0.2.11 -> 10.0.1.8     NFS 254 V4 Call CLOSE StateID: 0x6de3
>>>>>>>> 18516  19.487201     10.0.1.8 -> 10.0.2.11    NFS 202 V4 Reply (Call In 18515) CLOSE
>>>>>>>> 18556  19.498617    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192
>>>>>>>> 
>>>>>>>> This READ call by the user space client does not conflict with the
>>>>>>>> read delegation.
>>>>>>>> 
>>>>>>>> 18559  19.499396     10.0.1.8 -> 10.0.2.11    NFS 8390 V3 READ Reply (Call In 18556) Len: 8192
>>>>>>>> 18726  19.568975     10.0.1.8 -> 10.0.2.11    NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f
>>>>>>>> 18727  19.569170    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512
>>>>>>>> 18728  19.569923     10.0.1.8 -> 10.0.2.11    NFS 710 V3 READ Reply (Call In 18727) Len: 512
>>>>>>>> 18729  19.570135    10.0.2.11 -> 10.0.1.8     NFS 234 V3 SETATTR Call, FH: 0xc930444f
>>>>>>>> 18730  19.570901     10.0.1.8 -> 10.0.2.11    NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX
>>>>>>>> 
>>>>>>>> The user space client has attempted to extend the file. This does
>>>>>>>> conflict with the read delegation held by the kernel NFS client,
>>>>>>>> so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY.
>>>>>>>> This causes a negative performance impact on the user space NFS
>>>>>>>> client.
>>>>>>>> 
>>>>>>>> 18731  19.575396    10.0.2.11 -> 10.0.1.8     NFS 250 V4 Call DELEGRETURN StateID: 0x6de3
>>>>>>>> 18732  19.576132     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18731) DELEGRETURN
>>>>>>>> 
>>>>>>>> No CB_RECALL was done to trigger this DELEGRETURN. Apparently
>>>>>>>> the application that was accessing this file via the kernel OS
>>>>>>>> client decided already that it no longer needed the file before
>>>>>>>> the server could send the CB_RECALL. Sign of perhaps a race
>>>>>>>> between the applications accessing the file via these two
>>>>>>>> mounts.
>>>>>>>> 
>>>>>>>> ---- cut here ----
>>>>>>>> 
>>>>>>>> The server is aware of non-NFSv4 accessors of this file in frame
>>>>>>>> 18556. NFSv3 has no OPEN operation, of course, so it's not
>>>>>>>> possible for the server to determine how the NFSv3 client will
>>>>>>>> subsequently access this file.
>>>>>>>> 
>>>>>>> 
>>>>>>> Right. Why should we assume that the v3 client will do anything other
>>>>>>> than read there? If we recall the delegation just for reads, then we
>>>>>>> potentially negatively affect the performance of the v4 client.
>>>>>>> 
>>>>>>>> Seems like at frame 18556, it would be a best practice to recall
>>>>>>>> the delegation to avoid potential future conflicts, such as the
>>>>>>>> SETATTR in frame 18729.
>>>>>>>> 
>>>>>>>> Or, perhaps that READ isn't the first NFSv3 access of that file.
>>>>>>>> After all, a LOOKUP would have to be done to retrieve that file's
>>>>>>>> FH. The OPEN in frame 18556 perhaps could have avoided offering
>>>>>>>> the READ delegation, knowing there is a recent non-NFSv4 accessor
>>>>>>>> of that file.
>>>>>>>> 
>>>>>>>> Would these be difficult or inappropriate policies to implement?
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> Reads are not currently considered to be conflicting access vs. a read
>>>>>>> delegation.
>>>>>> 
>>>>>> Strictly speaking, a single NFSv3 READ does not violate the guarantee
>>>>>> made by the read delegation. And, strictly speaking, there can be no
>>>>>> OPEN conflict because NFSv3 does not have an OPEN operation.
>>>>>> 
>>>>>> The question is whether the server has an adequate mechanism for
>>>>>> delaying NFSv3 accessors when an NFSv4 delegation must be recalled.
>>>>>> 
>>>>>> NFS3ERR_JUKEBOX and NFS4ERR_DELAY share the same numeric value, but
>>>>>> imply different semantics.
>>>>>> 
>>>>>> RFC1813 says:
>>>>>> 
>>>>>> NFS3ERR_JUKEBOX
>>>>>>  The server initiated the request, but was not able to
>>>>>>  complete it in a timely fashion. The client should wait
>>>>>>  and then try the request with a new RPC transaction ID.
>>>>>>  For example, this error should be returned from a server
>>>>>>  that supports hierarchical storage and receives a request
>>>>>>  to process a file that has been migrated. In this case,
>>>>>>  the server should start the immigration process and
>>>>>>  respond to client with this error.
>>>>>> 
>>>>>> Some clients respond to NFS3ERR_JUKEBOX by waiting quite some time
>>>>>> before retrying.
>>>>>> 
>>>>>> RFC7530 says:
>>>>>> 
>>>>>> 13.1.1.3.  NFS4ERR_DELAY (Error Code 10008)
>>>>>> 
>>>>>> For any of a number of reasons, the replier could not process this
>>>>>> operation in what was deemed a reasonable time.  The client should
>>>>>> wait and then try the request with a new RPC transaction ID.
>>>>>> 
>>>>>> The following are two examples of what might lead to this situation:
>>>>>> 
>>>>>> o  A server that supports hierarchical storage receives a request to
>>>>>>    process a file that had been migrated.
>>>>>> 
>>>>>> o  An operation requires a delegation recall to proceed, and waiting
>>>>>>    for this delegation recall makes processing this request in a
>>>>>>    timely fashion impossible.
>>>>>> 
>>>>>> An NFSv4 client is prepared to retry this error almost immediately
>>>>>> because most of the time it is due to the second bullet.
>>>>>> 
>>>>>> I agree that not recalling after an NFSv3 READ is reasonable in some
>>>>>> cases. However, I demonstrated a case where the current policy does
>>>>>> not serve one of these clients well at all. In fact, the NFSv3
>>>>>> accessor in this case is the performance-sensitive one.
>>>>>> 
>>>>>> To put it another way, the NFSv4 protocol does not forbid the
>>>>>> current Linux server policy, but interoperating well with existing
>>>>>> NFSv3 clients suggests it's not an optimal policy choice.
>>>>>> 
>>>>> 
>>>>> I think that is entirely dependent on the workload. If we proactively
>>>>> recall delegations because we think the v3 client _might_ do some
>>>>> conflicting access, and then it doesn't, then that's also a non-optimal
>>>>> choice.
>>>>> 
>>>>>> 
>>>>>>> I think that's the correct thing to do. Until we have some
>>>>>>> sort of conflicting behavior I don't see why you'd want to prematurely
>>>>>>> recall the delegation.
>>>>>> 
>>>>>> The reason to recall a delegation is to avoid returning
>>>>>> NFS3ERR_JUKEBOX if at all possible, because doing so is a drastic
>>>>>> remedy that results in a performance regression.
>>>>>> 
>>>>>> The negative impact of not having a delegation is small. The negative
>>>>>> impact of returning NFS3ERR_JUKEBOX to a SETATTR or WRITE can be as
>>>>>> much as a 5 minute wait. (This is intolerably long for, say, online
>>>>>> transaction processing workloads).
>>>>>> 
>>>>> 
>>>>> That sounds like a deficient v3 client, IMO. There's nothing in the v3
>>>>> spec that I know of that advocates a delay that long before
>>>>> reattempting. I'm pretty sure the Linux client treats NFSERR3_JUKEBOX
>>>>> and NFS4ERR_DELAY more or less equivalently.
>>>> 
>>>> The v3 client uses a 5 second delay (see NFS_JUKEBOX_RETRY_TIME).
>>>> The v4 client, at least in the case of operations that could break a
>>>> deleg, does exponential backoff starting with a tenth of a second--see
>>>> nfs4_delay.
>>>> 
>>>> So Trond's been taking the spec at its word here.
>>>> 
>>>> Like Jeff I'm pretty unhappy at the idea of revoking delegations
>>>> preemptively on v3 read and lookup.
>>> 
>>> To completely avoid JUKEBOX, you'd have to recall asynchronously.
>>> Even better would be not to offer delegations when it is clear
>>> there is an active NFSv3 accessor.
>>> 
>>> Is there a specific use case where holding onto delegations in
>>> this case is measurably valuable?
>>> 
>>> As Jeff said above, it is workload dependent, but it seems that
>>> we are choosing arbitrarily which workloads work well and which
>>> will be penalized.
>>> 
>>> Clearly, speculating about future access is not allowed when
>>> only NFSv4 is in play.
>>> 
>>> 
>>>> And a 5 minute wait does sound like a client problem.
>>> 
>>> Even a 5 second wait is not good. A simple "touch" that takes
>>> five seconds can generate user complaints.
>>> 
>>> I do see the point that a NFSv3 client implementation can be
>>> changed to retry JUKEBOX more aggressively. Not all NFSv3 code
>>> bases are actively maintained, however.
>>> 
>>> 
>>>>>> The server can detect there are other accessors that do not provide
>>>>>> OPEN/CLOSE semantics. In addition, the server cannot predict when one
>>>>>> of these accessors may use a WRITE or SETATTR. And finally it does
>>>>>> not have a reasonably performant mechanism for delaying those
>>>>>> accessors when a delegation must be recalled.
>>>>>> 
>>>>> 
>>>>> Interoperability is hard (and sometimes it doesn't work well :). We
>>>>> simply don't have enough info to reliably guess what the v3 client will
>>>>> do in this situation.
>>> 
>>> (This is in response to Jeff's comment)
>>> 
>>> Interoperability means following the spec, but IMO it also
>>> means respecting longstanding implementation practice when
>>> a specification does not prescribe particular behavior.
>>> 
>>> In this case, strictly speaking interoperability is not the
>>> concern.
>>> 
>>> -> The spec authors clearly believed this is an area where
>>> implementations are to be given free rein. Otherwise the text
>>> would have provided RFC 2119 directives or other specific
>>> guidelines. There was opportunity to add specifics in RFCs
>>> 3530, 7530, and 5661, but that wasn't done.
>>> 
>>> -> The scenario I reported does not involve operational
>>> failure. It eventually succeeds whether the client's retry
>>> is aggressive or lazy. It just works _better_ when there is
>>> no DELAY/JUKEBOX.
>>> 
>>> There are a few normative constraints here, and I think we
>>> have a bead on what those are, but IMO the issue is one of
>>> implementation quality (on both ends).
>>> 
>> 
>> Yes. I'm just not sold that what you're proposing would be any better
>> than what we have for the vast majority of people. It might be, but I
>> don't think that's necessarily the case.
> 
> In other words, both of you are comparing my use case with
> a counterfactual. That doesn't seem like a fair fight.
> 
> Can you demonstrate a specific use case where not offering
> a delegation during mixed NFSv3 and NFSv4 access is a true
> detriment? (I am open to hearing about it).
> 
> What happens when an NFSv3 client sends an NLM LOCK on a
> delegated file? I assume the correct response is for the
> server to return NLM_LCK_BLOCKED, recall the delegation, and
> then call the client back when the delegation has been
> returned. Is that known to work?
> 
> 
>>>>> That said, I wouldn't have a huge objection to a server side tunable
>>>>> (module parameter?) that says "Recall read delegations on v2/3 READ
>>>>> calls". Make it default to off, and then people in your situation could
>>>>> set it if they thought it a better policy for their workload.
>>>> I also wonder if in v3 case we should try a small synchronous wait
>>>> before returning JUKEBOX.  Read delegations shouldn't require the client
>>>> to do very much, so it could be they're typically returned in a
>>>> fraction of a second.
>>> 
>>> That wait would have to be very short in the NFSv3 / UDP case
>>> to avoid a retransmit timeout. I know, UDP is going away.
>>> 
>>> It's hard to say how long to wait. The RTT to the client might
>>> have to be taken into account. In WAN deployments, this could
>>> be as long as 50ms, for instance.
>>> 
>>> Although, again, waiting is speculative. A fixed 20ms wait
>>> would be appropriate for most LAN deployments, and that's
>>> where the expectation of consistently fast operation lies.
>>> 
>> 
>> Not a bad idea. That delay could be tunable as well.
> 
>>>> Since we have a fixed number of threads, I don't think we'd want to keep
>>>> one waiting much longer than that.  Also, it'd be nice if we could get
>>>> woken up early when the delegation return comes in before our wait's
>>>> over, but I haven't thought about how to do that.
>>>> 
>>>> And I don't know if that actually helps.
>>> 
>>> When there is a lot of file sharing between clients, it might
>>> be good to reduce the penalty of delegation recalls.
>>> 
>> 
>> The best way to do that would probably be to have better heuristics for
>> deciding whether to hand them out in the first place.
> 
> I thought that was exactly what I was suggesting. ;-)
> See above ("To completely avoid...").
> 
> 
>> We have a little
>> of that now with the bloom filter, but maybe those rules could be more
>> friendly to this use-case?
>> 
>>> Clients, after all, cannot know when a recall has completed,
>>> so they have to guess about when to retransmit, and usually
>>> make a conservative estimate. If server behavior can shorten
>>> the delay without introducing race windows, that would be good
>>> added value.
>>> 
>>> But I'm not clear why waiting must tie up the nfsd thread (pun
>>> intended). How is a COMMIT or synchronous WRITE handled? Seems
>>> like waiting for a delegation recall to complete is a similar
>>> kind of thing.
>>> 
>> 
>> It's not required per-se, but there currently isn't a good mechanism to
>> idle RPCs in the server without putting the thread to sleep. It may be
>> possible to do that with the svc_defer stuff, but I'm a little leery of
>> that code.
> 
> There are other cases where context switching an nfsd would be
> useful. For example, inserting an opportunity for nfsd_write
> to perform transport reads (after having allocated pages in
> the right file) could provide some benefits by reducing data
> copies and page allocator calls.
> 
> I'm agnostic about exactly how this is done.

Meaning I don't have any particular design preferences.

I'd like to help with implementation, though, if there is
agreement about what approach is preferred.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-13 17:12             ` Chuck Lever
  2017-03-13 18:26               ` Chuck Lever
@ 2017-03-14 13:55               ` J. Bruce Fields
  1 sibling, 0 replies; 13+ messages in thread
From: J. Bruce Fields @ 2017-03-14 13:55 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, J. Bruce Fields, Linux NFS Mailing List

On Mon, Mar 13, 2017 at 01:12:32PM -0400, Chuck Lever wrote:
> > On Mar 13, 2017, at 12:33 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > Yes. I'm just not sold that what you're proposing would be any better
> > than what we have for the vast majority of people. It might be, but I
> > don't think that's necessarily the case.
> 
> In other words, both of you are comparing my use case with
> a counterfactual. That doesn't seem like a fair fight.

There's always a bias towards existing behavior.

> Can you demonstrate a specific use case where not offering
> a delegation during mixed NFSv3 and NFSv4 access is a true
> detriment? (I am open to hearing about it).

I'd look for a read-mostly workload with frequent opens and
closes--maybe libraries or config files that every executable opens at
startup?

> What happens when an NFSv3 client sends an NLM LOCK on a
> delegated file? I assume the correct response is for the
> server to return NLM_LCK_BLOCKED, recall the delegation, and
> then call the client back when the delegation has been
> returned. Is that known to work?

I don't currently have a test for that, and from looking at the code I'm
not confident it's correct.

Looks like the call chain is:

	nlmsvc_proc_lock->
	  nlmsvc_retrieve_args->
	    nlm_lookup_file->
	      fopen = nlm_open->
	        nfsd_open->nfsd_open_break_lease

And I'd expect that to return EWOULDBLOCK.  But nfsd_open is getting
called just with MAY_LOCK, and I think it needs MAY_READ or _WRITE to
recognize conflicts correctly.

And then if it did hit a conflict it would return EWOULDBLOCK, but it
looks like that ends up as nlm_lck_denied_nolocks, which is totally
unhelpful.

It's possible I'm missing something.  Obviously it needs testing.

> There are other cases where context switching an nfsd would be
> useful. For example, inserting an opportunity for nfsd_write
> to perform transport reads (after having allocated pages in
> the right file) could provide some benefits by reducing data
> copies and page allocator calls.

For now I think we can just live with threads blocking briefly--as you
say, we already do for disk.  Though if a slow client could cause that
rdma read to block too long then we'd need a maximum timeout.

--b.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors
  2017-03-13 18:26               ` Chuck Lever
@ 2017-03-14 14:05                 ` Jeff Layton
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff Layton @ 2017-03-14 14:05 UTC (permalink / raw)
  To: Chuck Lever; +Cc: J. Bruce Fields, Linux NFS Mailing List

On Mon, 2017-03-13 at 14:26 -0400, Chuck Lever wrote:
> > On Mar 13, 2017, at 1:12 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > 
> > > 
> > > On Mar 13, 2017, at 12:33 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > > On Mon, 2017-03-13 at 11:30 -0400, Chuck Lever wrote:
> > > > Hi Bruce-
> > > > 
> > > > 
> > > > > On Mar 13, 2017, at 9:27 AM, J. Bruce Fields <bfields@redhat.com> wrote:
> > > > > 
> > > > > On Sat, Mar 11, 2017 at 04:04:34PM -0500, Jeff Layton wrote:
> > > > > > On Sat, 2017-03-11 at 15:46 -0500, Chuck Lever wrote:
> > > > > > > > On Mar 11, 2017, at 12:08 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > > > 
> > > > > > > > On Sat, 2017-03-11 at 11:53 -0500, Chuck Lever wrote:
> > > > > > > > > Hi Bruce, Jeff-
> > > > > > > > > 
> > > > > > > > > I've observed some interesting Linux NFS server behavior (v4.1.12).
> > > > > > > > > 
> > > > > > > > > We have a single system that has an NFSv4 mount via the kernel NFS
> > > > > > > > > client, and an NFSv3 mount of the same export via a user space NFS
> > > > > > > > > client. These two clients are accessing the same set of files.
> > > > > > > > > 
> > > > > > > > > The following pattern is seen on the wire. I've filtered a recent
> > > > > > > > > capture on the FH of one of the shared files.
> > > > > > > > > 
> > > > > > > > > ---- cut here ----
> > > > > > > > > 
> > > > > > > > > 18507  19.483085    10.0.2.11 -> 10.0.1.8     NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE]
> > > > > > > > > 18508  19.483827     10.0.1.8 -> 10.0.2.11    NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]
> > > > > > > > > 18510  19.484676     10.0.1.8 -> 10.0.2.11    NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3
> > > > > > > > > 
> > > > > > > > > This OPEN reply offers a read delegation to the kernel NFS client.
> > > > > > > > > 
> > > > > > > > > 18511  19.484806    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> > > > > > > > > 18512  19.485549     10.0.1.8 -> 10.0.2.11    NFS 274 V4 Reply (Call In 18511) GETATTR
> > > > > > > > > 18513  19.485611    10.0.2.11 -> 10.0.1.8     NFS 230 V4 Call GETATTR FH: 0xc930444f
> > > > > > > > > 18514  19.486375     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18513) GETATTR
> > > > > > > > > 18515  19.486464    10.0.2.11 -> 10.0.1.8     NFS 254 V4 Call CLOSE StateID: 0x6de3
> > > > > > > > > 18516  19.487201     10.0.1.8 -> 10.0.2.11    NFS 202 V4 Reply (Call In 18515) CLOSE
> > > > > > > > > 18556  19.498617    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192
> > > > > > > > > 
> > > > > > > > > This READ call by the user space client does not conflict with the
> > > > > > > > > read delegation.
> > > > > > > > > 
> > > > > > > > > 18559  19.499396     10.0.1.8 -> 10.0.2.11    NFS 8390 V3 READ Reply (Call In 18556) Len: 8192
> > > > > > > > > 18726  19.568975     10.0.1.8 -> 10.0.2.11    NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f
> > > > > > > > > 18727  19.569170    10.0.2.11 -> 10.0.1.8     NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512
> > > > > > > > > 18728  19.569923     10.0.1.8 -> 10.0.2.11    NFS 710 V3 READ Reply (Call In 18727) Len: 512
> > > > > > > > > 18729  19.570135    10.0.2.11 -> 10.0.1.8     NFS 234 V3 SETATTR Call, FH: 0xc930444f
> > > > > > > > > 18730  19.570901     10.0.1.8 -> 10.0.2.11    NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX
> > > > > > > > > 
> > > > > > > > > The user space client has attempted to extend the file. This does
> > > > > > > > > conflict with the read delegation held by the kernel NFS client,
> > > > > > > > > so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY.
> > > > > > > > > This causes a negative performance impact on the user space NFS
> > > > > > > > > client.
> > > > > > > > > 
> > > > > > > > > 18731  19.575396    10.0.2.11 -> 10.0.1.8     NFS 250 V4 Call DELEGRETURN StateID: 0x6de3
> > > > > > > > > 18732  19.576132     10.0.1.8 -> 10.0.2.11    NFS 186 V4 Reply (Call In 18731) DELEGRETURN
> > > > > > > > > 
> > > > > > > > > No CB_RECALL was done to trigger this DELEGRETURN. Apparently
> > > > > > > > > the application that was accessing this file via the kernel OS
> > > > > > > > > client decided already that it no longer needed the file before
> > > > > > > > > the server could send the CB_RECALL. Sign of perhaps a race
> > > > > > > > > between the applications accessing the file via these two
> > > > > > > > > mounts.
> > > > > > > > > 
> > > > > > > > > ---- cut here ----
> > > > > > > > > 
> > > > > > > > > The server is aware of non-NFSv4 accessors of this file in frame
> > > > > > > > > 18556. NFSv3 has no OPEN operation, of course, so it's not
> > > > > > > > > possible for the server to determine how the NFSv3 client will
> > > > > > > > > subsequently access this file.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Right. Why should we assume that the v3 client will do anything other
> > > > > > > > than read there? If we recall the delegation just for reads, then we
> > > > > > > > potentially negatively affect the performance of the v4 client.
> > > > > > > > 
> > > > > > > > > Seems like at frame 18556, it would be a best practice to recall
> > > > > > > > > the delegation to avoid potential future conflicts, such as the
> > > > > > > > > SETATTR in frame 18729.
> > > > > > > > > 
> > > > > > > > > Or, perhaps that READ isn't the first NFSv3 access of that file.
> > > > > > > > > After all, a LOOKUP would have to be done to retrieve that file's
> > > > > > > > > FH. The OPEN in frame 18556 perhaps could have avoided offering
> > > > > > > > > the READ delegation, knowing there is a recent non-NFSv4 accessor
> > > > > > > > > of that file.
> > > > > > > > > 
> > > > > > > > > Would these be difficult or inappropriate policies to implement?
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Reads are not currently considered to be conflicting access vs. a read
> > > > > > > > delegation.
> > > > > > > 
> > > > > > > Strictly speaking, a single NFSv3 READ does not violate the guarantee
> > > > > > > made by the read delegation. And, strictly speaking, there can be no
> > > > > > > OPEN conflict because NFSv3 does not have an OPEN operation.
> > > > > > > 
> > > > > > > The question is whether the server has an adequate mechanism for
> > > > > > > delaying NFSv3 accessors when an NFSv4 delegation must be recalled.
> > > > > > > 
> > > > > > > NFS3ERR_JUKEBOX and NFS4ERR_DELAY share the same numeric value, but
> > > > > > > imply different semantics.
> > > > > > > 
> > > > > > > RFC1813 says:
> > > > > > > 
> > > > > > > NFS3ERR_JUKEBOX
> > > > > > >  The server initiated the request, but was not able to
> > > > > > >  complete it in a timely fashion. The client should wait
> > > > > > >  and then try the request with a new RPC transaction ID.
> > > > > > >  For example, this error should be returned from a server
> > > > > > >  that supports hierarchical storage and receives a request
> > > > > > >  to process a file that has been migrated. In this case,
> > > > > > >  the server should start the immigration process and
> > > > > > >  respond to client with this error.
> > > > > > > 
> > > > > > > Some clients respond to NFS3ERR_JUKEBOX by waiting quite some time
> > > > > > > before retrying.
> > > > > > > 
> > > > > > > RFC7530 says:
> > > > > > > 
> > > > > > > 13.1.1.3.  NFS4ERR_DELAY (Error Code 10008)
> > > > > > > 
> > > > > > > For any of a number of reasons, the replier could not process this
> > > > > > > operation in what was deemed a reasonable time.  The client should
> > > > > > > wait and then try the request with a new RPC transaction ID.
> > > > > > > 
> > > > > > > The following are two examples of what might lead to this situation:
> > > > > > > 
> > > > > > > o  A server that supports hierarchical storage receives a request to
> > > > > > >    process a file that had been migrated.
> > > > > > > 
> > > > > > > o  An operation requires a delegation recall to proceed, and waiting
> > > > > > >    for this delegation recall makes processing this request in a
> > > > > > >    timely fashion impossible.
> > > > > > > 
> > > > > > > An NFSv4 client is prepared to retry this error almost immediately
> > > > > > > because most of the time it is due to the second bullet.
> > > > > > > 
> > > > > > > I agree that not recalling after an NFSv3 READ is reasonable in some
> > > > > > > cases. However, I demonstrated a case where the current policy does
> > > > > > > not serve one of these clients well at all. In fact, the NFSv3
> > > > > > > accessor in this case is the performance-sensitive one.
> > > > > > > 
> > > > > > > To put it another way, the NFSv4 protocol does not forbid the
> > > > > > > current Linux server policy, but interoperating well with existing
> > > > > > > NFSv3 clients suggests it's not an optimal policy choice.
> > > > > > > 
> > > > > > 
> > > > > > I think that is entirely dependent on the workload. If we proactively
> > > > > > recall delegations because we think the v3 client _might_ do some
> > > > > > conflicting access, and then it doesn't, then that's also a non-optimal
> > > > > > choice.
> > > > > > 
> > > > > > > 
> > > > > > > > I think that's the correct thing to do. Until we have some
> > > > > > > > sort of conflicting behavior I don't see why you'd want to prematurely
> > > > > > > > recall the delegation.
> > > > > > > 
> > > > > > > The reason to recall a delegation is to avoid returning
> > > > > > > NFS3ERR_JUKEBOX if at all possible, because doing so is a drastic
> > > > > > > remedy that results in a performance regression.
> > > > > > > 
> > > > > > > The negative impact of not having a delegation is small. The negative
> > > > > > > impact of returning NFS3ERR_JUKEBOX to a SETATTR or WRITE can be as
> > > > > > > much as a 5 minute wait. (This is intolerably long for, say, online
> > > > > > > transaction processing workloads).
> > > > > > > 
> > > > > > 
> > > > > > That sounds like a deficient v3 client, IMO. There's nothing in the v3
> > > > > > spec that I know of that advocates a delay that long before
> > > > > > reattempting. I'm pretty sure the Linux client treats NFSERR3_JUKEBOX
> > > > > > and NFS4ERR_DELAY more or less equivalently.
> > > > > 
> > > > > The v3 client uses a 5 second delay (see NFS_JUKEBOX_RETRY_TIME).
> > > > > The v4 client, at least in the case of operations that could break a
> > > > > deleg, does exponential backoff starting with a tenth of a second--see
> > > > > nfs4_delay.
> > > > > 
> > > > > So Trond's been taking the spec at its word here.
> > > > > 
> > > > > Like Jeff I'm pretty unhappy at the idea of revoking delegations
> > > > > preemptively on v3 read and lookup.
> > > > 
> > > > To completely avoid JUKEBOX, you'd have to recall asynchronously.
> > > > Even better would be not to offer delegations when it is clear
> > > > there is an active NFSv3 accessor.
> > > > 
> > > > Is there a specific use case where holding onto delegations in
> > > > this case is measurably valuable?
> > > > 
> > > > As Jeff said above, it is workload dependent, but it seems that
> > > > we are choosing arbitrarily which workloads work well and which
> > > > will be penalized.
> > > > 
> > > > Clearly, speculating about future access is not allowed when
> > > > only NFSv4 is in play.
> > > > 
> > > > 
> > > > > And a 5 minute wait does sound like a client problem.
> > > > 
> > > > Even a 5 second wait is not good. A simple "touch" that takes
> > > > five seconds can generate user complaints.
> > > > 
> > > > I do see the point that a NFSv3 client implementation can be
> > > > changed to retry JUKEBOX more aggressively. Not all NFSv3 code
> > > > bases are actively maintained, however.
> > > > 
> > > > 
> > > > > > > The server can detect there are other accessors that do not provide
> > > > > > > OPEN/CLOSE semantics. In addition, the server cannot predict when one
> > > > > > > of these accessors may use a WRITE or SETATTR. And finally it does
> > > > > > > not have a reasonably performant mechanism for delaying those
> > > > > > > accessors when a delegation must be recalled.
> > > > > > > 
> > > > > > 
> > > > > > Interoperability is hard (and sometimes it doesn't work well :). We
> > > > > > simply don't have enough info to reliably guess what the v3 client will
> > > > > > do in this situation.
> > > > 
> > > > (This is in response to Jeff's comment)
> > > > 
> > > > Interoperability means following the spec, but IMO it also
> > > > means respecting longstanding implementation practice when
> > > > a specification does not prescribe particular behavior.
> > > > 
> > > > In this case, strictly speaking interoperability is not the
> > > > concern.
> > > > 
> > > > -> The spec authors clearly believed this is an area where
> > > > implementations are to be given free rein. Otherwise the text
> > > > would have provided RFC 2119 directives or other specific
> > > > guidelines. There was opportunity to add specifics in RFCs
> > > > 3530, 7530, and 5661, but that wasn't done.
> > > > 
> > > > -> The scenario I reported does not involve operational
> > > > failure. It eventually succeeds whether the client's retry
> > > > is aggressive or lazy. It just works _better_ when there is
> > > > no DELAY/JUKEBOX.
> > > > 
> > > > There are a few normative constraints here, and I think we
> > > > have a bead on what those are, but IMO the issue is one of
> > > > implementation quality (on both ends).
> > > > 
> > > 
> > > Yes. I'm just not sold that what you're proposing would be any better
> > > than what we have for the vast majority of people. It might be, but I
> > > don't think that's necessarily the case.
> > 
> > In other words, both of you are comparing my use case with
> > a counterfactual. That doesn't seem like a fair fight.
> > 
> > Can you demonstrate a specific use case where not offering
> > a delegation during mixed NFSv3 and NFSv4 access is a true
> > detriment? (I am open to hearing about it).
> > 
> > What happens when an NFSv3 client sends an NLM LOCK on a
> > delegated file? I assume the correct response is for the
> > server to return NLM_LCK_BLOCKED, recall the delegation, and
> > then call the client back when the delegation has been
> > returned. Is that known to work?
> > 
> > 
> > > > > > That said, I wouldn't have a huge objection to a server side tunable
> > > > > > (module parameter?) that says "Recall read delegations on v2/3 READ
> > > > > > calls". Make it default to off, and then people in your situation could
> > > > > > set it if they thought it a better policy for their workload.
> > > > > 
> > > > > I also wonder if in v3 case we should try a small synchronous wait
> > > > > before returning JUKEBOX.  Read delegations shouldn't require the client
> > > > > to do very much, so it could be they're typically returned in a
> > > > > fraction of a second.
> > > > 
> > > > That wait would have to be very short in the NFSv3 / UDP case
> > > > to avoid a retransmit timeout. I know, UDP is going away.
> > > > 
> > > > It's hard to say how long to wait. The RTT to the client might
> > > > have to be taken into account. In WAN deployments, this could
> > > > be as long as 50ms, for instance.
> > > > 
> > > > Although, again, waiting is speculative. A fixed 20ms wait
> > > > would be appropriate for most LAN deployments, and that's
> > > > where the expectation of consistently fast operation lies.
> > > > 
> > > 
> > > Not a bad idea. That delay could be tunable as well.
> > > > > Since we have a fixed number of threads, I don't think we'd want to keep
> > > > > one waiting much longer than that.  Also, it'd be nice if we could get
> > > > > woken up early when the delegation return comes in before our wait's
> > > > > over, but I haven't thought about how to do that.
> > > > > 
> > > > > And I don't know if that actually helps.
> > > > 
> > > > When there is a lot of file sharing between clients, it might
> > > > be good to reduce the penalty of delegation recalls.
> > > > 
> > > 
> > > The best way to do that would probably be to have better heuristics for
> > > deciding whether to hand them out in the first place.
> > 
> > I thought that was exactly what I was suggesting. ;-)
> > See above ("To completely avoid...").
> > 
> > 
> > > We have a little
> > > of that now with the bloom filter, but maybe those rules could be more
> > > friendly to this use-case?
> > > 
> > > > Clients, after all, cannot know when a recall has completed,
> > > > so they have to guess about when to retransmit, and usually
> > > > make a conservative estimate. If server behavior can shorten
> > > > the delay without introducing race windows, that would be good
> > > > added value.
> > > > 
> > > > But I'm not clear why waiting must tie up the nfsd thread (pun
> > > > intended). How is a COMMIT or synchronous WRITE handled? Seems
> > > > like waiting for a delegation recall to complete is a similar
> > > > kind of thing.
> > > > 
> > > 
> > > It's not required per-se, but there currently isn't a good mechanism to
> > > idle RPCs in the server without putting the thread to sleep. It may be
> > > possible to do that with the svc_defer stuff, but I'm a little leery of
> > > that code.
> > 
> > There are other cases where context switching an nfsd would be
> > useful. For example, inserting an opportunity for nfsd_write
> > to perform transport reads (after having allocated pages in
> > the right file) could provide some benefits by reducing data
> > copies and page allocator calls.
> > 
> > I'm agnostic about exactly how this is done.
> 
> Meaning I don't have any particular design preferences.
> 
> I'd like to help with implementation, though, if there is
> agreement about what approach is preferred.
> 
> 

Yeah, it would be wonderful to move knfsd to a more asynchronous model.
Being able to free up a thread to do other work when it would otherwise
be blocked would likely do wonders for scalability.

FWIW, that was a long-term goal that I had in mind when I did the
patches to move knfsd to workqueues. Once you have it running in a
workqueue you could more easily idle RPCs and re-dispatch them to the
workqueue for further processing as needed.

You could do it with dedicated threads too of course, it's just not
quite as conducive to it.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-03-14 14:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-11 16:53 nfsd: delegation conflicts between NFSv3 and NFSv4 accessors Chuck Lever
2017-03-11 17:08 ` Jeff Layton
2017-03-11 20:46   ` Chuck Lever
2017-03-11 21:04     ` Jeff Layton
2017-03-13 13:27       ` J. Bruce Fields
2017-03-13 15:30         ` Chuck Lever
2017-03-13 16:01           ` J. Bruce Fields
2017-03-13 16:06             ` J. Bruce Fields
2017-03-13 16:33           ` Jeff Layton
2017-03-13 17:12             ` Chuck Lever
2017-03-13 18:26               ` Chuck Lever
2017-03-14 14:05                 ` Jeff Layton
2017-03-14 13:55               ` J. Bruce Fields

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.