NFSv4 open sequencing can lead to incorrect ESTALE

* NFSv4 open sequencing can lead to incorrect ESTALE
@ 2014-07-02  1:05 NeilBrown
  0 siblings, 0 replies; only message in thread
From: NeilBrown @ 2014-07-02  1:05 UTC (permalink / raw)
  To: Trond Myklebust, Alexander Viro; +Cc: NFS

[-- Attachment #1: Type: text/plain, Size: 4132 bytes --]

Hi,
 The bash script below demonstrates a problem with NFSv4 - maybe two.
 I'm testing on 3.16-rc2.

 It mounts a local filesystem via NFS.  The main thread then replaces
 a particular file on the local file system and then accesses it via NFS,
 checking that the contents are correct.  It does this repeatedly.

 In parallel a number of threads repeated open/close the same file (using
 'grep' rather than 'cat' so tracing in the kernel can see which is which).

 With NFSv4, none of these should ever get ESTALE.  The server holds the
 file open while the client has it open so even if it is unlinked in the
 local filesystem, the client will still be able to access it.
 Yet we do get ESTALE errors.

 These are caused by the may_open() call in do_last().
 may_open() calls inode_permission() which calls nfs_permission() which
 performs an ACCESS call over NFS, which can get ESTALE.  This error will
 stop do_last() from calling finish_open() which, in the NFSv4 case, does the
 final lookup and would for this test find the correct, non-stale, inode.

 When may_open() and then do_last() return ESTALE, do_filp_open() will call
 path_openat() again, this time with LOOKUP_REVAL, but that doesn't help.
 nfs_permission only gets the inode and so cannot d_drop the dentry or
 otherwise trigger a reval.

 I added a d_drop() to may_open() when inode_permission() returns ESTALE and
 the symptom went away, but I doubt that is the right thing to do.
 For NFSv4 I think it is really best to leave all the work to the 'open' call
 and not perform any access tests before hand.  All the access tests should
 happen inside the open call once the server knows that the file is 'open'.

 I have no suggestion for how to fix this properly.

 Once that bug is fixed, the script still shows unexpected behaviour.  It
 will eventually report that the file seen over NFS has the old content
 instead of the new content.

 This happens because the file (which has been unlinked on the server via a
 rename) is still open by one of the background threads and so
 can_open_cached() reports that "cat" doesn't need to actually open the file
 - it can re-use the open that the 'grep' has.
 This seems a little odd.  It is a bit like treating an active  'open' of a
 file as a mini-delegation, you don't need to open it again.  However if it
 was a real delegation, then when it was unlinked on the server the
 delegation would be lost.
 I tried running with "lookupcache=none" but I still get the same errors.
 That certainly seems like an error.  With lookupcache=none, opening a file
 should check the name on the server, not assume that the name cached on the
 client is correct.  But that isn't what happens.

 I have no idea how to fix this one either.  I'm not even 100% sure which bit
 is the bug, but something definitely seems wrong.
 I changed can_open_cached() to always return 0 and the problem went away,
 but again I don't think that is a correct fix.

Thanks,
NeilBrown

cnt=${1-10000}
local=${2-/export}
nfs=${3-/mnt}
max_errs=${4-1}

echo "using: $cnt $local $nfs $max_errs"
mount -o vers=4,lookupcache=none localhost:$local $nfs || { echo mount failed ; exit 1; }

rm -f $nfs/afile
touch $nfs/afile

for i in {1..5}; do    while [ -f $nfs/afile ]; do grep . $nfs/afile > /dev/null 2>&1 ; done &  done

i=0
err=0
while [ $i -lt $cnt ]; do
  mydate=$(date +%s.%N)
  want="$i $mydate"
  echo  $want > $local/bfile
  mv $local/bfile $local/afile
  have=`cat $nfs/afile`
  if [ "$want" == "$have" ]; then
     echo -n -e "$want\r"
  else
     echo fail > /dev/kmsg
     echo "Wanted $want have $have."
     for x in {1..1000}; do
        sleep 0.1
        grep "$mydate" $nfs/afile && { echo File now correct; break;}
     done
     let err=$err+1
     if [ $err -ge $max_errs ]; then
         echo
         echo $err failures after $i attempts
         rm $local/afile
         umount $nfs
         exit 1
     fi
  fi
  let i=$i+1
done
rm $local/afile
echo
echo $err failures in $i attempts
sleep 2
umount $nfs
exit 0

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] only message in thread