On Thu, Nov 03 2016, Benjamin Coddington wrote: > On 13 Oct 2016, at 0:26, NeilBrown wrote: >> @@ -3694,20 +3695,17 @@ nfs4_proc_setattr(struct dentry *dentry, >> struct nfs_fattr *fattr, >> >> /* Search for an existing open(O_WRITE) file */ >> if (sattr->ia_valid & ATTR_FILE) { >> - struct nfs_open_context *ctx; >> >> ctx = nfs_file_open_context(sattr->ia_file); >> - if (ctx) { >> + if (ctx) >> cred = ctx->cred; >> - state = ctx->state; >> - } >> } > > > Does this need a get_nfs_open_context() there to make sure the open > context > doesn't drop away? I can't see why you would. The ia_file must hold a reference to the ctx, and this code doesn't keep any reference to the ctx after nfs4_proc_setattr completes - does it? > I'm getting this on generic/089: > > [ 651.855291] run fstests generic/089 at 2016-11-01 11:15:57 > [ 652.645828] NFS: nfs4_reclaim_open_state: Lock reclaim failed! > [ 653.166259] NFSD: client ::1 testing state ID with incorrect client > ID > [ 653.167218] BUG: unable to handle kernel NULL pointer dereference at > 0000000000000018 I think this BUG is happening in nfs41_check_expired_locks. This: list_for_each_entry(lsp, &state->lock_states, ls_locks) { walks off the end of a list, finding a NULL on a list which should never have a NULL pointer. That does suggest a use-after-free of an nfs4_lock_state, or possibly of an nfs4_state. I can't see it in the code yet though. > > Something else is also wrong there.. wrapping that with > get_nfs_open_context() makes the crash go away, but there are still > several > "NFS: nfs4_reclaim_open_state: Lock reclaim failed!" in the log. Why > would > we be doing reclaim at all? I'll look at a network capture next. The > [ 653.166259] NFSD: client ::1 testing state ID with incorrect client ID errors suggests that the client is sending a stateid that doesn't match the client id, so the server reports and error and the client enters state recovery. Maybe one thread is dropping a flock lock while another thread is using it for some IO and they race? I think there are refcounts in place to protect that but something might be missing. I look forward to seeing the network capture. Thanks, NeilBrown