* Problems with NFS4.1 on ESXi
       [not found] <AM9P191MB1665484E1EFD2088D22C2E2F8EF59@AM9P191MB1665.EURP191.PROD.OUTLOOK.COM>
@ 2022-04-21  4:55 ` Andreas Nagy
  2022-04-21 14:58   ` Chuck Lever III
  0 siblings, 1 reply; 25+ messages in thread
From: Andreas Nagy @ 2022-04-21  4:55 UTC (permalink / raw)
  To: linux-nfs

Hi,

I hope this mailing list is the right place to discuss some problems with nfs4.1.

After switching from a FreeBSD host as NFS server to a Proxmox environment that also serves NFS, I see some strange issues in combination with VMware ESXi.

At first it seemed to work fine, but then I realized that there are problems with ESXi datastores on NFS4.1 when trying to import VMs (OVF).

Importing ESXi OVF VM templates fails nearly every time with the ESXi error message "postNFCData failed: Not Found". With NFS3 it works fine.

NFS server is running on a Proxmox host:

 root@sepp-sto-01:~# hostnamectl
 Static hostname: sepp-sto-01
 Icon name: computer-server
 Chassis: server
 Machine ID: 028da2386e514db19a3793d876fadf12
 Boot ID: c5130c8524c64bc38994f6cdd170d9fd
 Operating System: Debian GNU/Linux 11 (bullseye)
 Kernel: Linux 5.13.19-4-pve
 Architecture: x86-64


The file system is ZFS, but I also tried it with others and the behaviour is the same.
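
For reference, the export is configured along these lines in /etc/exports (the path, client network and fsid below are placeholders, not the exact values from my setup):

  /tank/esxi-datastore  192.168.10.0/24(rw,sync,no_subtree_check,no_root_squash,fsid=1)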


ESXi version 7.2U3

ESXi vmkernel.log:
2022-04-19T17:46:38.933Z cpu0:262261)cswitch: L2Sec_EnforcePortCompliance:209: [nsx@6876 comp="nsx-esx" subcomp="vswitch"]client vmk1 requested promiscuous mode on port 0x4000010, disallowed by vswitch policy
2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)World: 12075: VC opID esxui-d6ab-f678 maps to vmkernel opID 936118c3
2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)WARNING: NFS41: NFS41FileDoCloseFile:3128: file handle close on obj 0x4303fce02850 failed: Stale file handle
2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)WARNING: NFS41: NFS41FileOpCloseFile:3718: NFS41FileCloseFile failed: Stale file handle
2022-04-19T17:46:41.164Z cpu4:266351 opID=936118c3)WARNING: NFS41: NFS41FileDoCloseFile:3128: file handle close on obj 0x4303fcdaa000 failed: Stale file handle
2022-04-19T17:46:41.164Z cpu4:266351 opID=936118c3)WARNING: NFS41: NFS41FileOpCloseFile:3718: NFS41FileCloseFile failed: Stale file handle
2022-04-19T17:47:25.166Z cpu18:262376)ScsiVmas: 1074: Inquiry for VPD page 00 to device mpx.vmhba32:C0:T0:L0 failed with error Not supported
2022-04-19T17:47:25.167Z cpu18:262375)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)World: 12075: VC opID esxui-6787-f694 maps to vmkernel opID 9529ace7
2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2465: Evicting VM with path:/vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: 209: Creating crypto hash
2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2479: Could not find MemXferFS region for /vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2465: Evicting VM with path:/vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: 209: Creating crypto hash
2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2479: Could not find MemXferFS region for /vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx

A tcpdump taken on the ESXi host, filtered on the NFS server IP, is uploaded here:
https://easyupload.io/xvtpt1
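
For completeness, the capture was taken with tcpdump-uw on the ESXi host, roughly like this (the vmkernel interface, server IP and output path are placeholders):

  tcpdump-uw -i vmk1 -s 0 -w /tmp/nfs41-trace.pcap host 192.168.10.10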

I tried to analyze it, but have no idea what exactly the problem is. Maybe it is an issue with the VMware implementation?
It would be nice if someone with better NFS knowledge could have a look at the traces.
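
For anyone looking at the capture, the NFS traffic can be skimmed with something like the following (the file name is a placeholder); it lists the COMPOUND calls and replies, which makes the failing operations easy to spot:

  tshark -r nfs41-trace.pcap -Y nfs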

Best regards,
cd


* Re: Problems with NFS4.1 on ESXi
  2022-04-21  4:55 ` Problems with NFS4.1 on ESXi Andreas Nagy
@ 2022-04-21 14:58   ` Chuck Lever III
  2022-04-21 15:30     ` AW: " crispyduck
  0 siblings, 1 reply; 25+ messages in thread
From: Chuck Lever III @ 2022-04-21 14:58 UTC (permalink / raw)
  To: Andreas Nagy; +Cc: Linux NFS Mailing List

Hi Andreas-

> On Apr 21, 2022, at 12:55 AM, Andreas Nagy <crispyduck@outlook.at> wrote:
> 
> Hi,
> 
> I hope this mailing list is the right place to discuss some problems with nfs4.1.

Well, yes and no. This is an upstream developer mailing list,
not really for user support.

You seem to be asking about products that are currently supported,
and I'm not sure if the Debian kernel is stock upstream 5.13 or
something else. ZFS is not an upstream Linux filesystem and the
ESXi NFS client is something we have little to no experience with.

I recommend contacting the support desk for your products. If
they find a specific problem with the Linux NFS server's
implementation of the NFSv4.1 protocol, then come back here.


> [rest of quoted message snipped]

--
Chuck Lever





* AW: Problems with NFS4.1 on ESXi
  2022-04-21 14:58   ` Chuck Lever III
@ 2022-04-21 15:30     ` crispyduck
  2022-04-21 16:40       ` J. Bruce Fields
  0 siblings, 1 reply; 25+ messages in thread
From: crispyduck @ 2022-04-21 15:30 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: Linux NFS Mailing List

Hi Chuck!

Thanks. On the VMware side nobody will help here, as this is not supported. They support NFS4.1, but officially only with some storage vendors.

I had it running in the past on FreeBSD, where I also had some problems in the beginning (RECLAIM_COMPLETE); Rick Macklem helped to figure out the problem and fixed it with some patches that should now be part of FreeBSD.

I plan to use it with ZFS, but I also tested it on ext4 with exactly the same behavior.

NFS3 works fine, and NFS4.1 seems to work fine except for the described problems.

The reason for NFS4.1 is session trunking, which gives really awesome speeds when using multiple NICs/subnets, comparable to iSCSI. The aim is NFS4.1-based storage for ESXi and other hypervisors.
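
To illustrate, the datastore is mounted on ESXi with multiple server addresses roughly like this (the addresses, share path and datastore name are placeholders):

  esxcli storage nfs41 add -H 10.0.1.1,10.0.2.1 -s /tank/esxi-datastore -v nfs41-ds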

The tests described above were also done without session trunking.

This needs NFS expertise; I have no idea where else I could ask for someone to have a look at the traces.

Br,
Andi






From: Chuck Lever III <chuck.lever@oracle.com>
Sent: Thursday, April 21, 2022 16:58
To: Andreas Nagy <crispyduck@outlook.at>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: Problems with NFS4.1 on ESXi
 
[rest of quoted message snipped]




* Re: Problems with NFS4.1 on ESXi
  2022-04-21 15:30     ` AW: " crispyduck
@ 2022-04-21 16:40       ` J. Bruce Fields
       [not found]         ` <AM9P191MB16654F5B7541CD1E489D75608EF49@AM9P191MB1665.EURP191.PROD.OUTLOOK.COM>
  2022-04-21 18:54         ` J. Bruce Fields
  0 siblings, 2 replies; 25+ messages in thread
From: J. Bruce Fields @ 2022-04-21 16:40 UTC (permalink / raw)
  To: crispyduck; +Cc: Chuck Lever III, Linux NFS Mailing List

On Thu, Apr 21, 2022 at 03:30:19PM +0000, crispyduck@outlook.at wrote:
> Thanks. On the VMware side nobody will help here, as this is not supported. They support NFS4.1, but officially only with some storage vendors.
> 
> I had it running in the past on FreeBSD, where I also had some problems in the beginning (RECLAIM_COMPLETE); Rick Macklem helped to figure out the problem and fixed it with some patches that should now be part of FreeBSD.
> 
> I plan to use it with ZFS, but I also tested it on ext4 with exactly the same behavior.
> 
> NFS3 works fine, and NFS4.1 seems to work fine except for the described problems.
> 
> The reason for NFS4.1 is session trunking, which gives really awesome speeds when using multiple NICs/subnets, comparable to iSCSI. The aim is NFS4.1-based storage for ESXi and other hypervisors.
> 
> The tests described above were also done without session trunking.
> 
> This needs NFS expertise; I have no idea where else I could ask for someone to have a look at the traces.

Stale filehandles aren't normal, and suggest some bug or
misconfiguration on the server side, either in NFS or the exported
filesystem.  Figuring out more than that would require more
investigation.

--b.

> [rest of quoted message snipped]


* Re: Problems with NFS4.1 on ESXi
       [not found]         ` <AM9P191MB16654F5B7541CD1E489D75608EF49@AM9P191MB1665.EURP191.PROD.OUTLOOK.COM>
@ 2022-04-21 18:41           ` crispyduck
  0 siblings, 0 replies; 25+ messages in thread
From: crispyduck @ 2022-04-21 18:41 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Chuck Lever III, Linux NFS Mailing List

I can provide traces as needed if someone is interested and would investigate deeper.
I can read traces, but my knowledge here and of NFS is very limited.

Br,
Andi


From: J. Bruce Fields <bfields@fieldses.org>
Sent: Thursday, April 21, 2022 6:40:49 PM
To: crispyduck@outlook.at <crispyduck@outlook.at>
Cc: Chuck Lever III <chuck.lever@oracle.com>; Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: Problems with NFS4.1 on ESXi 
 
[rest of quoted message snipped]


* Re: Problems with NFS4.1 on ESXi
  2022-04-21 16:40       ` J. Bruce Fields
       [not found]         ` <AM9P191MB16654F5B7541CD1E489D75608EF49@AM9P191MB1665.EURP191.PROD.OUTLOOK.COM>
@ 2022-04-21 18:54         ` J. Bruce Fields
  2022-04-21 23:52           ` Rick Macklem
  2022-04-22 14:23           ` Olga Kornievskaia
  1 sibling, 2 replies; 25+ messages in thread
From: J. Bruce Fields @ 2022-04-21 18:54 UTC (permalink / raw)
  To: crispyduck; +Cc: Chuck Lever III, Linux NFS Mailing List

On Thu, Apr 21, 2022 at 12:40:49PM -0400, bfields wrote:
> On Thu, Apr 21, 2022 at 03:30:19PM +0000, crispyduck@outlook.at wrote:
> > [quoted text snipped]
> 
> Stale filehandles aren't normal, and suggest some bug or
> misconfiguration on the server side, either in NFS or the exported
> filesystem.

Actually, I should take that back: if one client removes files while a
second client is using them, it'd be normal for applications on that
second client to see ESTALE.

So it might be interesting to know what actually happens when VM
templates are imported.

I suppose you could also try NFSv4.0 or try varying kernel versions to
try to narrow down the problem.

No easy ideas off the top of my head, sorry.

--b.

> [rest of quoted message snipped]


* Re: Problems with NFS4.1 on ESXi
  2022-04-21 18:54         ` J. Bruce Fields
@ 2022-04-21 23:52           ` Rick Macklem
  2022-04-21 23:58             ` Rick Macklem
                               ` (2 more replies)
  2022-04-22 14:23           ` Olga Kornievskaia
  1 sibling, 3 replies; 25+ messages in thread
From: Rick Macklem @ 2022-04-21 23:52 UTC (permalink / raw)
  To: J. Bruce Fields, crispyduck; +Cc: Chuck Lever III, Linux NFS Mailing List

J. Bruce Fields <bfields@fieldses.org> wrote:
[stuff snipped]
> On Thu, Apr 21, 2022 at 12:40:49PM -0400, bfields wrote:
> >
> >
> > Stale filehandles aren't normal, and suggest some bug or
> > misconfiguration on the server side, either in NFS or the exported
> > filesystem.
> 
> Actually, I should take that back: if one client removes files while a
> second client is using them, it'd be normal for applications on that
> second client to see ESTALE.
I took a look at crispyduck's packet trace and here's what I saw:
Packet#
48 Lookup of test-ovf.vmx
49 NFS_OK FH is 0x7c9ce14b (the hash)
...
51 Open Claim_FH for 0x7c9ce14b
52 NFS_OK Open Stateid 0x35be
...
138 Rename test-ovf.vmx~ to test-ovf.vmx
139 NFS_OK
...
141 Close with PutFH 0x7c9ce14b
142 NFS4ERR_STALE for the PutFH

So, it seems that the Rename deletes the file (it renames another file to the
same name "test-ovf.vmx"). Then the subsequent Close's PutFH fails,
because the file for the FH has been deleted.
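
A rough way to mimic that sequence from a Linux NFSv4.1 client would be something like the following (paths are made up, and the Linux client may well order the operations differently, so this is only a sketch, not a verified reproducer):

  exec 3<> /mnt/nfs41/test-ovf.vmx                      # OPEN, keep a reference to the original file
  mv /mnt/nfs41/test-ovf.vmx~ /mnt/nfs41/test-ovf.vmx   # RENAME another file over it, removing the original
  exec 3>&-                                             # CLOSE the old filehandle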

Looks like yet another ESXi client bug to me?
(I've seen assorted other ones, but not this one. I have no idea how this
 might work on a FreeBSD server. I can only assume the RPC sequence
 ends up different for FreeBSD for some reason? Maybe the Close gets
 processed before the Rename? I didn't look at the Sequence args for
 these RPCs to see if they use different slots.)


> So it might be interesting to know what actually happens when VM
> templates are imported.
If you look at the packet trace, it is somewhat weird, like most things with this
client. It does a Lookup of the same file name over and over again, for
example.

> I suppose you could also try NFSv4.0 or try varying kernel versions to
> try to narrow down the problem.
I think it only does NFSv4.1.
I've tried to contact the VMware engineers, but never had any luck.
I wish they'd show up at a bakeathon, but...

> No easy ideas off the top of my head, sorry.
I once posted a list of problems I had found with ESXi 6.5 to a FreeBSD
mailing list and someone who worked for VMware cut/pasted it into their
problem database.  They responded to him with "might be fixed in a future
release" and, indeed, they were fixed in ESXi 6.7, so if you can get this to
them, they might fix it?

rick

[rest of quoted message snipped]


* Re: Problems with NFS4.1 on ESXi
  2022-04-21 23:52           ` Rick Macklem
@ 2022-04-21 23:58             ` Rick Macklem
  2022-04-22 14:29             ` Chuck Lever III
  2022-04-22 15:15             ` J. Bruce Fields
  2 siblings, 0 replies; 25+ messages in thread
From: Rick Macklem @ 2022-04-21 23:58 UTC (permalink / raw)
  To: J. Bruce Fields, crispyduck; +Cc: Chuck Lever III, Linux NFS Mailing List

I looked and both the Rename and Close are done on slotid 0, so
the client does do them in that order.

Also, I should mention that this bug may not be what causes crispyduck's
problem. (It could result in an accumulation of Opens on the server, I think?)

rick

________________________________________
From: Rick Macklem <rmacklem@uoguelph.ca>
Sent: Thursday, April 21, 2022 7:52 PM
To: J. Bruce Fields; crispyduck@outlook.at
Cc: Chuck Lever III; Linux NFS Mailing List
Subject: Re: Problems with NFS4.1 on ESXi

[rest of quoted message snipped]


* Re: Problems with NFS4.1 on ESXi
  2022-04-21 18:54         ` J. Bruce Fields
  2022-04-21 23:52           ` Rick Macklem
@ 2022-04-22 14:23           ` Olga Kornievskaia
  1 sibling, 0 replies; 25+ messages in thread
From: Olga Kornievskaia @ 2022-04-22 14:23 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: crispyduck, Chuck Lever III, Linux NFS Mailing List

On Fri, Apr 22, 2022 at 8:59 AM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Thu, Apr 21, 2022 at 12:40:49PM -0400, bfields wrote:
> > On Thu, Apr 21, 2022 at 03:30:19PM +0000, crispyduck@outlook.at wrote:
> > > [quoted text snipped]
> >
> > Stale filehandles aren't normal, and suggest some bug or
> > misconfiguration on the server side, either in NFS or the exported
> > filesystem.
>
> Actually, I should take that back: if one client removes files while a
> second client is using them, it'd be normal for applications on that
> second client to see ESTALE.

I looked at the traces and they looked OK to me. The ESTALE came from
the VMware client sending a RENAME onto a file that was opened
previously and then sending a CLOSE on that filehandle, which resulted
in ESTALE. So something like this:
OPEN (foobar)
RENAME (something else, foobar)
CLOSE (foobar) leads to ESTALE

I agree with Chuck's suggestion, which was to ask VMware support.

> [rest of quoted message snipped]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Problems with NFS4.1 on ESXi
  2022-04-21 23:52           ` Rick Macklem
  2022-04-21 23:58             ` Rick Macklem
@ 2022-04-22 14:29             ` Chuck Lever III
  2022-04-22 14:59               ` AW: " crispyduck
  2022-04-22 22:58               ` Rick Macklem
  2022-04-22 15:15             ` J. Bruce Fields
  2 siblings, 2 replies; 25+ messages in thread
From: Chuck Lever III @ 2022-04-22 14:29 UTC (permalink / raw)
  To: Rick Macklem; +Cc: Bruce Fields, crispyduck, Linux NFS Mailing List



> On Apr 21, 2022, at 7:52 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> 
> J. Bruce Fields <bfields@fieldses.org> wrote:
> [stuff snipped]
>> On Thu, Apr 21, 2022 at 12:40:49PM -0400, bfields wrote:
>>> 
>>> 
>>> Stale filehandles aren't normal, and suggest some bug or
>>> misconfiguration on the server side, either in NFS or the exported
>>> filesystem.
>> 
>> Actually, I should take that back: if one client removes files while a
>> second client is using them, it'd be normal for applications on that
>> second client to see ESTALE.
> I took a look at crispyduck's packet trace and here's what I saw:
> Packet#
> 48 Lookup of test-ovf.vmx
> 49 NFS_OK FH is 0x7c9ce14b (the hash)
> ...
> 51 Open Claim_FH for 0x7c9ce14b
> 52 NFS_OK Open Stateid 0x35be
> ...
> 138 Rename test-ovf.vmx~ to test-ovf.vmx
> 139 NFS_OK
> ...
> 141 Close with PutFH 0x7c9ce14b
> 142 NFS4ERR_STALE for the PutFH
> 
> So, it seems that the Rename will delete the file (it renames another file to
> the same name "test-ovf.vmx").  Then the subsequent Close's PutFH fails,
> because the file the FH refers to has been deleted.

So, Rick, Andreas: does this sequence of operations work without
error against a FreeBSD NFS server?


--
Chuck Lever




^ permalink raw reply	[flat|nested] 25+ messages in thread

* AW: Problems with NFS4.1 on ESXi
  2022-04-22 14:29             ` Chuck Lever III
@ 2022-04-22 14:59               ` crispyduck
  2022-04-22 15:02                 ` Chuck Lever III
  2022-04-22 22:58               ` Rick Macklem
  1 sibling, 1 reply; 25+ messages in thread
From: crispyduck @ 2022-04-22 14:59 UTC (permalink / raw)
  To: Chuck Lever III, Rick Macklem; +Cc: Bruce Fields, Linux NFS Mailing List

I will try to make all the requested traces (NFS3, NFS4.1), and if a VM is acceptable I will also set up a FreeBSD NFS server.
Not sure if I can do it over the weekend, but Monday should work.

I cannot say with 100% certainty that there are no such errors on FreeBSD, as I don't have the old servers running anymore, but I had this running for years and did not see any issues (besides the usual ESXi/browser ones) when importing/exporting VMs.

It is also not ESXi 7.2 specific, as I get a similar error on ESXi 6.7.

What I forgot to mention is that it sometimes worked, maybe 1 out of 30 tries, but I have no clue what is different and why.
I will also try to trace one of these cases.

I am afraid that VMware will not help here. They simply say that a Linux server is not supported. 🙁
If I could I would avoid ESXi, but I am somewhat forced to use it alongside other hypervisors.

I thought NFS4.1 should be NFS4.1, independent of the vendor.
On the other hand, this setup using an NFS server as a datastore is really great. NFS3 also works without any issues, but NFS4.1 session trunking makes this usable on hosts connected with several 1G NICs.

Best regards,
Andreas

From: Chuck Lever III <chuck.lever@oracle.com>
Sent: Friday, 22 April 2022 16:29
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: Bruce Fields <bfields@fieldses.org>; crispyduck@outlook.at <crispyduck@outlook.at>; Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: Problems with NFS4.1 on ESXi
 


> On Apr 21, 2022, at 7:52 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> 
> J. Bruce Fields <bfields@fieldses.org> wrote:
> [stuff snipped]
>> On Thu, Apr 21, 2022 at 12:40:49PM -0400, bfields wrote:
>>> 
>>> 
>>> Stale filehandles aren't normal, and suggest some bug or
>>> misconfiguration on the server side, either in NFS or the exported
>>> filesystem.
>> 
>> Actually, I should take that back: if one client removes files while a
>> second client is using them, it'd be normal for applications on that
>> second client to see ESTALE.
> I took a look at crispyduck's packet trace and here's what I saw:
> Packet#
> 48 Lookup of test-ovf.vmx
> 49 NFS_OK FH is 0x7c9ce14b (the hash)
> ...
> 51 Open Claim_FH fo 0x7c9ce14b
> 52 NFS_OK Open Stateid 0x35be
> ...
> 138 Rename test-ovf.vmx~ to test-ovf.vmx
> 139 NFS_OK
> ...
> 141 Close with PutFH 0x7c9ce14b
> 142 NFS4ERR_STALE for the PutFH
> 
> So, it seems that the Rename will delete the file (names another file to the
> same name "test-ovf.vmx".  Then the subsequent Close's PutFH fails,
> because the file for the FH has been deleted.

So, Rick, Andreas: does this sequence of operations work without
error against a FreeBSD NFS server?


--
Chuck Lever




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Problems with NFS4.1 on ESXi
  2022-04-22 14:59               ` AW: " crispyduck
@ 2022-04-22 15:02                 ` Chuck Lever III
  0 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever III @ 2022-04-22 15:02 UTC (permalink / raw)
  To: crispyduck; +Cc: Rick Macklem, Bruce Fields, Linux NFS Mailing List



> On Apr 22, 2022, at 10:59 AM, crispyduck@outlook.at wrote:
> 
> I will try to make all the requested traces, NFS3, NFS4.1 and if a VM is okay I will also set up a FreeBSD NFS server.

Someone mentioned NFSv4.0 is a "working" case, so that would be interesting to see too.


> Not sure if I can do it on weekend, but Monday should work.
> 
> I can not to 100% say that there are no such errors on FreeBSD, as I don't have the old servers running anymore, but I had this running now for years and did not see any issues (beside the usual ESXi/browser ones) when importing/exporting VMs.
> 
> It is also not ESXi 7.2 specific, as I receive similar error on ESXi 6.7.
> 
> What I forgot to mention is that it sometimes, worked,, maybe 1 out of 30 try's. but no clue what is different and why. 
> I will try to trace also one of this cases.
> 
> I am afraid that VMware will not support here. They simply tell that a Linux Server is not supported. 🙁
> If I could I would abode ESXi, but I am somehow forced to also use it beside other hypervisors.
> 
> I thought NFS4.1 should be NFS4.1, independent of the vendor.

Theoretically that is correct. The implementations of that standard
can vary and have bugs.


> On the other hand this setup using NFS server as datastore is really great. NFS3, works also without any issues, but NFS4.1 session trunking makes this also useable on hosts connected with several 1G NICs.

--
Chuck Lever




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Problems with NFS4.1 on ESXi
  2022-04-21 23:52           ` Rick Macklem
  2022-04-21 23:58             ` Rick Macklem
  2022-04-22 14:29             ` Chuck Lever III
@ 2022-04-22 15:15             ` J. Bruce Fields
  2022-04-22 18:43               ` AW: " crispyduck
  2022-04-22 23:03               ` Rick Macklem
  2 siblings, 2 replies; 25+ messages in thread
From: J. Bruce Fields @ 2022-04-22 15:15 UTC (permalink / raw)
  To: Rick Macklem; +Cc: crispyduck, Chuck Lever III, Linux NFS Mailing List

On Thu, Apr 21, 2022 at 11:52:32PM +0000, Rick Macklem wrote:
> J. Bruce Fields <bfields@fieldses.org> wrote:
> [stuff snipped]
> > On Thu, Apr 21, 2022 at 12:40:49PM -0400, bfields wrote:
> > >
> > >
> > > Stale filehandles aren't normal, and suggest some bug or
> > > misconfiguration on the server side, either in NFS or the exported
> > > filesystem.
> > 
> > Actually, I should take that back: if one client removes files while a
> > second client is using them, it'd be normal for applications on that
> > second client to see ESTALE.
> I took a look at crispyduck's packet trace and here's what I saw:
> Packet#
> 48 Lookup of test-ovf.vmx
> 49 NFS_OK FH is 0x7c9ce14b (the hash)
> ...
> 51 Open Claim_FH fo 0x7c9ce14b
> 52 NFS_OK Open Stateid 0x35be
> ...
> 138 Rename test-ovf.vmx~ to test-ovf.vmx
> 139 NFS_OK
> ...
> 141 Close with PutFH 0x7c9ce14b
> 142 NFS4ERR_STALE for the PutFH
> 
> So, it seems that the Rename will delete the file (names another file to the
> same name "test-ovf.vmx".  Then the subsequent Close's PutFH fails,
> because the file for the FH has been deleted.

Actually (sorry I'm slow to understand this)--why would our 4.1 server
ever be returning STALE on a close?  We normally hold a reference to the
file.

Oh, wait, is subtree_check set on the export?  You don't want to do
that.  (The freebsd server probably doesn't even give that as an
option?)

--b.

> 
> Looks like yet another ESXi client bug to me?
> (I've seen assorted other ones, but not this one. I have no idea how this
>  might work on a FreeBSD server. I can only assume the RPC sequence
>  ends up different for FreeBSD for some reason? Maybe the Close gets
>  processed before the Rename? I didn't look at the Sequence args for
>  these RPCs to see if they use different slots.)
> 
> 
> > So it might be interesting to know what actually happens when VM
> > templates are imported.
> If you look at the packet trace, somewhat weird, like most things for this
> client. It does a Lookup of the same file name over and over again, for
> example.
> 
> > I suppose you could also try NFSv4.0 or try varying kernel versions to
> > try to narrow down the problem.
> I think it only does NFSv4.1.
> I've tried to contact the VMware engineers, but never had any luck.
> I wish they'd show up at a bakeathon, but...
> 
> > No easy ideas off the top of my head, sorry.
> I once posted a list of problems I had found with ESXi 6.5 to a FreeBSD
> mailing list and someone who worked for VMware cut/pasted it into their
> problem database.  They responded to him with "might be fixed in a future
> release" and, indeed, they were fixed in ESXi 6.7, so if you can get this to
> them, they might fix it?
> 
> rick
> 
> --b.
> 
> > Figuring out more than that would require more
> > investigation.
> >
> > --b.
> >
> > >
> > > Br,
> > > Andi
> > >
> > >
> > >
> > >
> > >
> > >
> > > Von: Chuck Lever III <chuck.lever@oracle.com>
> > > Gesendet: Donnerstag, 21. April 2022 16:58
> > > An: Andreas Nagy <crispyduck@outlook.at>
> > > Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
> > > Betreff: Re: Problems with NFS4.1 on ESXi
> > >
> > > Hi Andreas-
> > >
> > > > On Apr 21, 2022, at 12:55 AM, Andreas Nagy <crispyduck@outlook.at> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I hope this mailing list is the right place to discuss some problems with nfs4.1.
> > >
> > > Well, yes and no. This is an upstream developer mailing list,
> > > not really for user support.
> > >
> > > You seem to be asking about products that are currently supported,
> > > and I'm not sure if the Debian kernel is stock upstream 5.13 or
> > > something else. ZFS is not an upstream Linux filesystem and the
> > > ESXi NFS client is something we have little to no experience with.
> > >
> > > I recommend contacting the support desk for your products. If
> > > they find a specific problem with the Linux NFS server's
> > > implementation of the NFSv4.1 protocol, then come back here.
> > >
> > >
> > > > Switching from FreeBSD host as NFS server to a Proxmox environment also serving NFS I see some strange issues in combination with VMWare ESXi.
> > > >
> > > > After first thinking it works fine, I started to realize that there are problems with ESXi datastores on NFS4.1 when trying to import VMs (OVF).
> > > >
> > > > Importing ESXi OVF VM Templates fails nearly every time with a ESXi error message "postNFCData failed: Not Found". With NFS3 it is working fine.
> > > >
> > > > NFS server is running on a Proxmox host:
> > > >
> > > >  root@sepp-sto-01:~# hostnamectl
> > > >  Static hostname: sepp-sto-01
> > > >  Icon name: computer-server
> > > >  Chassis: server
> > > >  Machine ID: 028da2386e514db19a3793d876fadf12
> > > >  Boot ID: c5130c8524c64bc38994f6cdd170d9fd
> > > >  Operating System: Debian GNU/Linux 11 (bullseye)
> > > >  Kernel: Linux 5.13.19-4-pve
> > > >  Architecture: x86-64
> > > >
> > > >
> > > > File system is ZFS, but also tried it with others and it is the same behaivour.
> > > >
> > > >
> > > > ESXi version 7.2U3
> > > >
> > > > ESXi vmkernel.log:
> > > > 2022-04-19T17:46:38.933Z cpu0:262261)cswitch: L2Sec_EnforcePortCompliance:209: [nsx@6876 comp="nsx-esx" subcomp="vswitch"]client vmk1 requested promiscuous mode on port 0x4000010, disallowed by vswitch policy
> > > > 2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)World: 12075: VC opID esxui-d6ab-f678 maps to vmkernel opID 936118c3
> > > > 2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)WARNING: NFS41: NFS41FileDoCloseFile:3128: file handle close on obj 0x4303fce02850 failed: Stale file handle
> > > > 2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)WARNING: NFS41: NFS41FileOpCloseFile:3718: NFS41FileCloseFile failed: Stale file handle
> > > > 2022-04-19T17:46:41.164Z cpu4:266351 opID=936118c3)WARNING: NFS41: NFS41FileDoCloseFile:3128: file handle close on obj 0x4303fcdaa000 failed: Stale file handle
> > > > 2022-04-19T17:46:41.164Z cpu4:266351 opID=936118c3)WARNING: NFS41: NFS41FileOpCloseFile:3718: NFS41FileCloseFile failed: Stale file handle
> > > > 2022-04-19T17:47:25.166Z cpu18:262376)ScsiVmas: 1074: Inquiry for VPD page 00 to device mpx.vmhba32:C0:T0:L0 failed with error Not supported
> > > > 2022-04-19T17:47:25.167Z cpu18:262375)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)World: 12075: VC opID esxui-6787-f694 maps to vmkernel opID 9529ace7
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2465: Evicting VM with path:/vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: 209: Creating crypto hash
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2479: Could not find MemXferFS region for /vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > > 2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2465: Evicting VM with path:/vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > > 2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: 209: Creating crypto hash
> > > > 2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2479: Could not find MemXferFS region for /vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > >
> > > > tcpdump taken on the esxi with filter on the nfs server ip is attached here:
> > > > https://easyupload.io/xvtpt1
> > > >
> > > > I tried to analyze, but have no idea what exactly the problem is. Maybe it is some issue with the VMWare implementation?
> > > > Would be nice if someone with better NFS knowledge could have a look on the traces.
> > > >
> > > > Best regards,
> > > > cd
> > >
> > > --
> > > Chuck Lever
> > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* AW: Problems with NFS4.1 on ESXi
  2022-04-22 15:15             ` J. Bruce Fields
@ 2022-04-22 18:43               ` crispyduck
  2022-04-22 23:03               ` Rick Macklem
  1 sibling, 0 replies; 25+ messages in thread
From: crispyduck @ 2022-04-22 18:43 UTC (permalink / raw)
  To: J. Bruce Fields, Rick Macklem; +Cc: Chuck Lever III, Linux NFS Mailing List

Output of exportfs -v:
sync,wdelay,hide,crossmnt,no_subtree_check,fsid=74345722,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash

ESXi only supports NFS3 and NFS4.1; NFS4.0 is not supported. No idea why; I think they only implemented 4.1 to get session trunking and Kerberos.

Br,
Andreas




From: J. Bruce Fields <bfields@fieldses.org>
Sent: Friday, 22 April 2022 17:15
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: crispyduck@outlook.at <crispyduck@outlook.at>; Chuck Lever III <chuck.lever@oracle.com>; Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: Problems with NFS4.1 on ESXi
 
On Thu, Apr 21, 2022 at 11:52:32PM +0000, Rick Macklem wrote:
> J. Bruce Fields <bfields@fieldses.org> wrote:
> [stuff snipped]
> > On Thu, Apr 21, 2022 at 12:40:49PM -0400, bfields wrote:
> > >
> > >
> > > Stale filehandles aren't normal, and suggest some bug or
> > > misconfiguration on the server side, either in NFS or the exported
> > > filesystem.
> > 
> > Actually, I should take that back: if one client removes files while a
> > second client is using them, it'd be normal for applications on that
> > second client to see ESTALE.
> I took a look at crispyduck's packet trace and here's what I saw:
> Packet#
> 48 Lookup of test-ovf.vmx
> 49 NFS_OK FH is 0x7c9ce14b (the hash)
> ...
> 51 Open Claim_FH fo 0x7c9ce14b
> 52 NFS_OK Open Stateid 0x35be
> ...
> 138 Rename test-ovf.vmx~ to test-ovf.vmx
> 139 NFS_OK
> ...
> 141 Close with PutFH 0x7c9ce14b
> 142 NFS4ERR_STALE for the PutFH
> 
> So, it seems that the Rename will delete the file (names another file to the
> same name "test-ovf.vmx".  Then the subsequent Close's PutFH fails,
> because the file for the FH has been deleted.

Actually (sorry I'm slow to understand this)--why would our 4.1 server
ever be returning STALE on a close?  We normally hold a reference to the
file.

Oh, wait, is subtree_check set on the export?  You don't want to do
that.  (The freebsd server probably doesn't even give that as an
option?)

--b.

> 
> Looks like yet another ESXi client bug to me?
> (I've seen assorted other ones, but not this one. I have no idea how this
>  might work on a FreeBSD server. I can only assume the RPC sequence
>  ends up different for FreeBSD for some reason? Maybe the Close gets
>  processed before the Rename? I didn't look at the Sequence args for
>  these RPCs to see if they use different slots.)
> 
> 
> > So it might be interesting to know what actually happens when VM
> > templates are imported.
> If you look at the packet trace, somewhat weird, like most things for this
> client. It does a Lookup of the same file name over and over again, for
> example.
> 
> > I suppose you could also try NFSv4.0 or try varying kernel versions to
> > try to narrow down the problem.
> I think it only does NFSv4.1.
> I've tried to contact the VMware engineers, but never had any luck.
> I wish they'd show up at a bakeathon, but...
> 
> > No easy ideas off the top of my head, sorry.
> I once posted a list of problems I had found with ESXi 6.5 to a FreeBSD
> mailing list and someone who worked for VMware cut/pasted it into their
> problem database.  They responded to him with "might be fixed in a future
> release" and, indeed, they were fixed in ESXi 6.7, so if you can get this to
> them, they might fix it?
> 
> rick
> 
> --b.
> 
> > Figuring out more than that would require more
> > investigation.
> >
> > --b.
> >
> > >
> > > Br,
> > > Andi
> > >
> > >
> > >
> > >
> > >
> > >
> > > Von: Chuck Lever III <chuck.lever@oracle.com>
> > > Gesendet: Donnerstag, 21. April 2022 16:58
> > > An: Andreas Nagy <crispyduck@outlook.at>
> > > Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
> > > Betreff: Re: Problems with NFS4.1 on ESXi
> > >
> > > Hi Andreas-
> > >
> > > > On Apr 21, 2022, at 12:55 AM, Andreas Nagy <crispyduck@outlook.at> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I hope this mailing list is the right place to discuss some problems with nfs4.1.
> > >
> > > Well, yes and no. This is an upstream developer mailing list,
> > > not really for user support.
> > >
> > > You seem to be asking about products that are currently supported,
> > > and I'm not sure if the Debian kernel is stock upstream 5.13 or
> > > something else. ZFS is not an upstream Linux filesystem and the
> > > ESXi NFS client is something we have little to no experience with.
> > >
> > > I recommend contacting the support desk for your products. If
> > > they find a specific problem with the Linux NFS server's
> > > implementation of the NFSv4.1 protocol, then come back here.
> > >
> > >
> > > > Switching from FreeBSD host as NFS server to a Proxmox environment also serving NFS I see some strange issues in combination with VMWare ESXi.
> > > >
> > > > After first thinking it works fine, I started to realize that there are problems with ESXi datastores on NFS4.1 when trying to import VMs (OVF).
> > > >
> > > > Importing ESXi OVF VM Templates fails nearly every time with a ESXi error message "postNFCData failed: Not Found". With NFS3 it is working fine.
> > > >
> > > > NFS server is running on a Proxmox host:
> > > >
> > > >  root@sepp-sto-01:~# hostnamectl
> > > >  Static hostname: sepp-sto-01
> > > >  Icon name: computer-server
> > > >  Chassis: server
> > > >  Machine ID: 028da2386e514db19a3793d876fadf12
> > > >  Boot ID: c5130c8524c64bc38994f6cdd170d9fd
> > > >  Operating System: Debian GNU/Linux 11 (bullseye)
> > > >  Kernel: Linux 5.13.19-4-pve
> > > >  Architecture: x86-64
> > > >
> > > >
> > > > File system is ZFS, but also tried it with others and it is the same behaivour.
> > > >
> > > >
> > > > ESXi version 7.2U3
> > > >
> > > > ESXi vmkernel.log:
> > > > 2022-04-19T17:46:38.933Z cpu0:262261)cswitch: L2Sec_EnforcePortCompliance:209: [nsx@6876 comp="nsx-esx" subcomp="vswitch"]client vmk1 requested promiscuous mode on port 0x4000010, disallowed by vswitch policy
> > > > 2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)World: 12075: VC opID esxui-d6ab-f678 maps to vmkernel opID 936118c3
> > > > 2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)WARNING: NFS41: NFS41FileDoCloseFile:3128: file handle close on obj 0x4303fce02850 failed: Stale file handle
> > > > 2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)WARNING: NFS41: NFS41FileOpCloseFile:3718: NFS41FileCloseFile failed: Stale file handle
> > > > 2022-04-19T17:46:41.164Z cpu4:266351 opID=936118c3)WARNING: NFS41: NFS41FileDoCloseFile:3128: file handle close on obj 0x4303fcdaa000 failed: Stale file handle
> > > > 2022-04-19T17:46:41.164Z cpu4:266351 opID=936118c3)WARNING: NFS41: NFS41FileOpCloseFile:3718: NFS41FileCloseFile failed: Stale file handle
> > > > 2022-04-19T17:47:25.166Z cpu18:262376)ScsiVmas: 1074: Inquiry for VPD page 00 to device mpx.vmhba32:C0:T0:L0 failed with error Not supported
> > > > 2022-04-19T17:47:25.167Z cpu18:262375)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)World: 12075: VC opID esxui-6787-f694 maps to vmkernel opID 9529ace7
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2465: Evicting VM with path:/vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: 209: Creating crypto hash
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2479: Could not find MemXferFS region for /vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > > 2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2465: Evicting VM with path:/vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > > 2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: 209: Creating crypto hash
> > > > 2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2479: Could not find MemXferFS region for /vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > >
> > > > tcpdump taken on the esxi with filter on the nfs server ip is attached here:
> > > > https://easyupload.io/xvtpt1
> > > >
> > > > I tried to analyze, but have no idea what exactly the problem is. Maybe it is some issue with the VMWare implementation?
> > > > Would be nice if someone with better NFS knowledge could have a look on the traces.
> > > >
> > > > Best regards,
> > > > cd
> > >
> > > --
> > > Chuck Lever
> > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Problems with NFS4.1 on ESXi
  2022-04-22 14:29             ` Chuck Lever III
  2022-04-22 14:59               ` AW: " crispyduck
@ 2022-04-22 22:58               ` Rick Macklem
  1 sibling, 0 replies; 25+ messages in thread
From: Rick Macklem @ 2022-04-22 22:58 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: Bruce Fields, crispyduck, Linux NFS Mailing List

Chuck Lever III <chuck.lever@oracle.com> wrote:
> On Apr 21, 2022, at 7:52 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
[stuff snipped]
> > I took a look at crispyduck's packet trace and here's what I saw:
> > Packet#
> > 48 Lookup of test-ovf.vmx
> > 49 NFS_OK FH is 0x7c9ce14b (the hash)
> > ...
> > 51 Open Claim_FH fo 0x7c9ce14b
> > 52 NFS_OK Open Stateid 0x35be
> > ...
> > 138 Rename test-ovf.vmx~ to test-ovf.vmx
> > 139 NFS_OK
> > ...
> > 141 Close with PutFH 0x7c9ce14b
> > 142 NFS4ERR_STALE for the PutFH
> >
> > So, it seems that the Rename will delete the file (names another file to the
> > same name "test-ovf.vmx".  Then the subsequent Close's PutFH fails,
> > because the file for the FH has been deleted.
>
> So, Rick, Andreas: does this sequence of operations work without
> error against a FreeBSD NFS server?
Good question. For a UFS exported file system I am pretty sure the server
would reply with ESTALE to the PutFH, just like Linux.

For ZFS, I am not so sure. The translation from FH to vnode is done by
a file system specific method. If that fails, ESTALE is replied. If ZFS can still
generate a vnode for a file when it has been removed (or while the remove
is in progress, as it might be in this case), then no error would be replied.
(The NFSv4 Close operation doesn't actually use the vnode, it only uses
 the StateID and FH to find/close the NFSv4 open state.)

The FreeBSD server never sets OPEN_RESULT_PRESERVE_UNLINKED
in the Open reply and it is not intended to retain the file until Close.

Maybe Andreas will find that out, if he can do more testing against a
FreeBSD server?

I am also not sure if the ESTALE replies are an issue for the ESXi client, since
they happen multiple times in the packet trace and only generate a warning
message in the client's log.

I did not see anything else in the trace that would indicate why the client
might be failing; however, it did look like some packets were missing from
the trace.

rick

--
Chuck Lever





^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Problems with NFS4.1 on ESXi
  2022-04-22 15:15             ` J. Bruce Fields
  2022-04-22 18:43               ` AW: " crispyduck
@ 2022-04-22 23:03               ` Rick Macklem
  2022-04-24 15:07                 ` J. Bruce Fields
  1 sibling, 1 reply; 25+ messages in thread
From: Rick Macklem @ 2022-04-22 23:03 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: crispyduck, Chuck Lever III, Linux NFS Mailing List

J. Bruce Fields <bfields@fieldses.org> wrote:
> On Thu, Apr 21, 2022 at 11:52:32PM +0000, Rick Macklem wrote:
> > J. Bruce Fields <bfields@fieldses.org> wrote:
> > [stuff snipped]
> > > On Thu, Apr 21, 2022 at 12:40:49PM -0400, bfields wrote:
> > > >
> > > >
> > > > Stale filehandles aren't normal, and suggest some bug or
> > > > misconfiguration on the server side, either in NFS or the exported
> > > > filesystem.
> > >
> > > Actually, I should take that back: if one client removes files while a
> > > second client is using them, it'd be normal for applications on that
> > > second client to see ESTALE.
> > I took a look at crispyduck's packet trace and here's what I saw:
> > Packet#
> > 48 Lookup of test-ovf.vmx
> > 49 NFS_OK FH is 0x7c9ce14b (the hash)
> > ...
> > 51 Open Claim_FH fo 0x7c9ce14b
> > 52 NFS_OK Open Stateid 0x35be
> > ...
> > 138 Rename test-ovf.vmx~ to test-ovf.vmx
> > 139 NFS_OK
> > ...
> > 141 Close with PutFH 0x7c9ce14b
> > 142 NFS4ERR_STALE for the PutFH
> >
> > So, it seems that the Rename will delete the file (names another file to the
> > same name "test-ovf.vmx".  Then the subsequent Close's PutFH fails,
> > because the file for the FH has been deleted.
>
> Actually (sorry I'm slow to understand this)--why would our 4.1 server
> ever be returning STALE on a close?  We normally hold a reference to the
> file.
Well, OPEN_RESULT_PRESERVE_UNLINKED is not set in the Open reply,
so even if it normally does so, it is not telling the ESXi client that it
will retain it.

> Oh, wait, is subtree_check set on the export?  You don't want to do
> that.  (The freebsd server probably doesn't even give that as an
> option?)
Nope, never heard of it.

rick

--b.

>
> Looks like yet another ESXi client bug to me?
> (I've seen assorted other ones, but not this one. I have no idea how this
>  might work on a FreeBSD server. I can only assume the RPC sequence
>  ends up different for FreeBSD for some reason? Maybe the Close gets
>  processed before the Rename? I didn't look at the Sequence args for
>  these RPCs to see if they use different slots.)
>
>
> > So it might be interesting to know what actually happens when VM
> > templates are imported.
> If you look at the packet trace, somewhat weird, like most things for this
> client. It does a Lookup of the same file name over and over again, for
> example.
>
> > I suppose you could also try NFSv4.0 or try varying kernel versions to
> > try to narrow down the problem.
> I think it only does NFSv4.1.
> I've tried to contact the VMware engineers, but never had any luck.
> I wish they'd show up at a bakeathon, but...
>
> > No easy ideas off the top of my head, sorry.
> I once posted a list of problems I had found with ESXi 6.5 to a FreeBSD
> mailing list and someone who worked for VMware cut/pasted it into their
> problem database.  They responded to him with "might be fixed in a future
> release" and, indeed, they were fixed in ESXi 6.7, so if you can get this to
> them, they might fix it?
>
> rick
>
> --b.
>
> > Figuring out more than that would require more
> > investigation.
> >
> > --b.
> >
> > >
> > > Br,
> > > Andi
> > >
> > >
> > >
> > >
> > >
> > >
> > > Von: Chuck Lever III <chuck.lever@oracle.com>
> > > Gesendet: Donnerstag, 21. April 2022 16:58
> > > An: Andreas Nagy <crispyduck@outlook.at>
> > > Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
> > > Betreff: Re: Problems with NFS4.1 on ESXi
> > >
> > > Hi Andreas-
> > >
> > > > On Apr 21, 2022, at 12:55 AM, Andreas Nagy <crispyduck@outlook.at> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I hope this mailing list is the right place to discuss some problems with nfs4.1.
> > >
> > > Well, yes and no. This is an upstream developer mailing list,
> > > not really for user support.
> > >
> > > You seem to be asking about products that are currently supported,
> > > and I'm not sure if the Debian kernel is stock upstream 5.13 or
> > > something else. ZFS is not an upstream Linux filesystem and the
> > > ESXi NFS client is something we have little to no experience with.
> > >
> > > I recommend contacting the support desk for your products. If
> > > they find a specific problem with the Linux NFS server's
> > > implementation of the NFSv4.1 protocol, then come back here.
> > >
> > >
> > > > Switching from FreeBSD host as NFS server to a Proxmox environment also serving NFS I see some strange issues in combination with VMWare ESXi.
> > > >
> > > > After first thinking it works fine, I started to realize that there are problems with ESXi datastores on NFS4.1 when trying to import VMs (OVF).
> > > >
> > > > Importing ESXi OVF VM Templates fails nearly every time with a ESXi error message "postNFCData failed: Not Found". With NFS3 it is working fine.
> > > >
> > > > NFS server is running on a Proxmox host:
> > > >
> > > >  root@sepp-sto-01:~# hostnamectl
> > > >  Static hostname: sepp-sto-01
> > > >  Icon name: computer-server
> > > >  Chassis: server
> > > >  Machine ID: 028da2386e514db19a3793d876fadf12
> > > >  Boot ID: c5130c8524c64bc38994f6cdd170d9fd
> > > >  Operating System: Debian GNU/Linux 11 (bullseye)
> > > >  Kernel: Linux 5.13.19-4-pve
> > > >  Architecture: x86-64
> > > >
> > > >
> > > > File system is ZFS, but also tried it with others and it is the same behaivour.
> > > >
> > > >
> > > > ESXi version 7.2U3
> > > >
> > > > ESXi vmkernel.log:
> > > > 2022-04-19T17:46:38.933Z cpu0:262261)cswitch: L2Sec_EnforcePortCompliance:209: [nsx@6876 comp="nsx-esx" subcomp="vswitch"]client vmk1 requested promiscuous mode on port 0x4000010, disallowed by vswitch policy
> > > > 2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)World: 12075: VC opID esxui-d6ab-f678 maps to vmkernel opID 936118c3
> > > > 2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)WARNING: NFS41: NFS41FileDoCloseFile:3128: file handle close on obj 0x4303fce02850 failed: Stale file handle
> > > > 2022-04-19T17:46:40.897Z cpu10:266351 opID=936118c3)WARNING: NFS41: NFS41FileOpCloseFile:3718: NFS41FileCloseFile failed: Stale file handle
> > > > 2022-04-19T17:46:41.164Z cpu4:266351 opID=936118c3)WARNING: NFS41: NFS41FileDoCloseFile:3128: file handle close on obj 0x4303fcdaa000 failed: Stale file handle
> > > > 2022-04-19T17:46:41.164Z cpu4:266351 opID=936118c3)WARNING: NFS41: NFS41FileOpCloseFile:3718: NFS41FileCloseFile failed: Stale file handle
> > > > 2022-04-19T17:47:25.166Z cpu18:262376)ScsiVmas: 1074: Inquiry for VPD page 00 to device mpx.vmhba32:C0:T0:L0 failed with error Not supported
> > > > 2022-04-19T17:47:25.167Z cpu18:262375)StorageDevice: 7059: End path evaluation for device mpx.vmhba32:C0:T0:L0
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)World: 12075: VC opID esxui-6787-f694 maps to vmkernel opID 9529ace7
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2465: Evicting VM with path:/vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: 209: Creating crypto hash
> > > > 2022-04-19T17:47:30.645Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2479: Could not find MemXferFS region for /vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > > 2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2465: Evicting VM with path:/vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > > 2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: 209: Creating crypto hash
> > > > 2022-04-19T17:47:30.693Z cpu4:264565 opID=9529ace7)VmMemXfer: vm 264565: 2479: Could not find MemXferFS region for /vmfs/volumes/9f10677f-697882ed-0000-000000000000/test-ovf/test-ovf.vmx
> > > >
> > > > tcpdump taken on the esxi with filter on the nfs server ip is attached here:
> > > > https://easyupload.io/xvtpt1
> > > >
> > > > I tried to analyze, but have no idea what exactly the problem is. Maybe it is some issue with the VMWare implementation?
> > > > Would be nice if someone with better NFS knowledge could have a look on the traces.
> > > >
> > > > Best regards,
> > > > cd
> > >
> > > --
> > > Chuck Lever
> > >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Problems with NFS4.1 on ESXi
  2022-04-22 23:03               ` Rick Macklem
@ 2022-04-24 15:07                 ` J. Bruce Fields
  2022-04-24 20:36                   ` Rick Macklem
  0 siblings, 1 reply; 25+ messages in thread
From: J. Bruce Fields @ 2022-04-24 15:07 UTC (permalink / raw)
  To: Rick Macklem; +Cc: crispyduck, Chuck Lever III, Linux NFS Mailing List

On Fri, Apr 22, 2022 at 11:03:17PM +0000, Rick Macklem wrote:
> J. Bruce Fields <bfields@fieldses.org> wrote:
> > Actually (sorry I'm slow to understand this)--why would our 4.1 server
> > ever be returning STALE on a close?  We normally hold a reference to the
> > file.
> Well, OPEN_RESULT_PRESERVE_UNLINKED is not set in the Open reply,
> so even if it normally does so, it is not telling the ESXi client that it
> will retain it.

Yeah, we don't guarantee it, but I thought in this case we did.  The
object we use to represent the open stateid (indirectly) holds a
reference on the inode that prevents it from being removed, so the
filehandle lookup should still work.  If I had the time, I'd write an
open-rename over-close test in pynfs and see if we could reproduce this,
and if so see what's happening.
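
For what it's worth, the local-filesystem analogue of what the client seems to
expect is easy to sketch with plain POSIX calls (a minimal Python sketch against
a temporary directory, no NFS involved; the file names just mirror the trace):

    import os
    import tempfile

    # An open descriptor stays usable even after another file is renamed
    # over its name; the old inode lives on until the last close.
    d = tempfile.mkdtemp()
    victim = os.path.join(d, "test-ovf.vmx")
    replacement = os.path.join(d, "test-ovf.vmx~")

    with open(victim, "w") as f:
        f.write("original")
    with open(replacement, "w") as f:
        f.write("replacement")

    fd = os.open(victim, os.O_RDONLY)   # roughly the client's OPEN
    os.rename(replacement, victim)      # RENAME replaces the open file's name
    print(os.read(fd, 100))             # still b'original'
    os.close(fd)                        # roughly the CLOSE; should not fail

Over NFSv4.1 that same guarantee is supposed to come from the open state
pinning the file on the server, which is why STALE on the CLOSE is surprising.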

> > Oh, wait, is subtree_check set on the export?  You don't want to do
> > that.  (The freebsd server probably doesn't even give that as an
> > option?)
> Nope, Never heard of it.

It adds a reference to the parent directory into the filehandle, so we can
foil filehandle-guessing attacks on exports of subdirectories of filesystems.
The major drawback is that it breaks on cross-directory rename, for example,
so it's not the default.

--b.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Problems with NFS4.1 on ESXi
  2022-04-24 15:07                 ` J. Bruce Fields
@ 2022-04-24 20:36                   ` Rick Macklem
  2022-04-24 20:39                     ` Rick Macklem
  0 siblings, 1 reply; 25+ messages in thread
From: Rick Macklem @ 2022-04-24 20:36 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: crispyduck, Chuck Lever III, Linux NFS Mailing List

J. Bruce Fields <bfields@fieldses.org> wrote:
> On Fri, Apr 22, 2022 at 11:03:17PM +0000, Rick Macklem wrote:
> > J. Bruce Fields <bfields@fieldses.org> wrote:
> > > Actually (sorry I'm slow to understand this)--why would our 4.1 server
> > > ever be returning STALE on a close?  We normally hold a reference to the
> > > file.
> > Well, OPEN_RESULT_PRESERVE_UNLINKED is not set in the Open reply,
> > so even if it normally does so, it is not telling the ESXi client that it
> > will retain it.
>
> Yeah, we don't guarantee it, but I thought in this cases we did.  The
> object we use to represent the open stateid (indirectly) holds a
> reference on the inode that prevents it from being removed, so the
> filehandle lookup should still work.  If I had the time, I'd write an
> open-rename over-close test in pynfs and see if we could reproduce this,
> and if so see what's happening.
Then I guess the next question would be...
What happens to the file/open when the close never happens?

Could that be causing problems for the client, since we know the Close
never happens?

> > > Oh, wait, is subtree_check set on the export?  You don't want to do
> > > that.  (The freebsd server probably doesn't even give that as an
> > > option?)
> > Nope, Never heard of it.
> 
> It adds a reference to the parent into the filehandle, so we can foil
> filehandle-guessing attacks on exports of subdirectories of filesystems.
> With the major drawback that it breaks on cross-directory rename, for
> example.  So it's not the default.
In FreeBSD, it actually hangs onto the parent's FH (verbatim), but mostly
so it can do Open/Claim_NULLs for it. There is nothing in FreeBSD that
tries to subvert FH guessing.

rick

--b.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Problems with NFS4.1 on ESXi
  2022-04-24 20:36                   ` Rick Macklem
@ 2022-04-24 20:39                     ` Rick Macklem
  2022-04-25  9:00                       ` AW: " crispyduck
  0 siblings, 1 reply; 25+ messages in thread
From: Rick Macklem @ 2022-04-24 20:39 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: crispyduck, Chuck Lever III, Linux NFS Mailing List

Rick Macklem <rmacklem@uoguelph.ca> wrote:
[stuff snipped]
> In FreeBSD, it actually hangs onto the parent's FH (verbatim), but mostly
> so it can do Open/Claim_NULLs for it. There is nothing in FreeBSD that
> tries to subvert FH guessing.
Oops, this is client side, not server side. (I forgot which hat I was wearing;-)
The FreeBSD server does not keep track of parents.

rick

--b.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* AW: Problems with NFS4.1 on ESXi
  2022-04-24 20:39                     ` Rick Macklem
@ 2022-04-25  9:00                       ` crispyduck
  2022-04-27  6:08                         ` crispyduck
  0 siblings, 1 reply; 25+ messages in thread
From: crispyduck @ 2022-04-25  9:00 UTC (permalink / raw)
  To: Rick Macklem, J. Bruce Fields; +Cc: Chuck Lever III, Linux NFS Mailing List

I have made some more traces, one with NFS3 and one with NFS4.1,
uploaded here:
https://easyupload.io/7bt624

Both cover the period from mount until the start of the VM import (testvm).

NFS3 works; NFS4.1 fails.

exportfs -v:
/zfstank/sto1/ds110
                <world>(async,wdelay,hide,crossmnt,no_subtree_check,fsid=74345722,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)

I have now tried it many times, but was not able to reproduce a "good" case with NFS4.1.

I can do the tests with FreeBSD when I am back from my business trip, as I need to add some disks to the servers.

regards,
Andreas



From: Rick Macklem <rmacklem@uoguelph.ca>
Sent: Sunday, 24 April 2022 22:39
To: J. Bruce Fields <bfields@fieldses.org>
Cc: crispyduck@outlook.at <crispyduck@outlook.at>; Chuck Lever III <chuck.lever@oracle.com>; Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: Problems with NFS4.1 on ESXi
 
Rick Macklem <rmacklem@uoguelph.ca> wrote:
[stuff snipped]
> In FreeBSD, it actually hangs onto the parent's FH (verbatim), but mostly
> so it can do Open/Claim_NULLs for it. There is nothing in FreeBSD that
> tries to subvert FH guessing.
Oops, this is client side, not server side. (I forgot which hat I was wearing;-)
The FreeBSD server does not keep track of parents.

rick

--b.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* AW: Problems with NFS4.1 on ESXi
  2022-04-25  9:00                       ` AW: " crispyduck
@ 2022-04-27  6:08                         ` crispyduck
  2022-05-05  5:31                           ` andreas-nagy
  0 siblings, 1 reply; 25+ messages in thread
From: crispyduck @ 2022-04-27  6:08 UTC (permalink / raw)
  To: Rick Macklem, J. Bruce Fields, linux-nfs; +Cc: Chuck Lever III

I tried again to reproduce the "sometimes working" case, but at the moment it always fails. No idea why it sometimes worked in the past.
Why are there so many lookups in the trace? I don't see this with other NFS clients.

The traces with NFS3 (where it works) and NFS4.1 (where it always fails) are here:
https://easyupload.io/7bt624

Both cover the period from mount until the start of the VM import (testvm).

exportfs -v:
/zfstank/sto1/ds110
                <world>(async,wdelay,hide,crossmnt,no_subtree_check,fsid=74345722,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)


I hope I can also do some tests against a FreeBSD server at the end of the week.

regards,
Andreas



From: Rick Macklem <rmacklem@uoguelph.ca>
Sent: Sunday, 24 April 2022 22:39
To: J. Bruce Fields <bfields@fieldses.org>
Cc: crispyduck@outlook.at <crispyduck@outlook.at>; Chuck Lever III <chuck.lever@oracle.com>; Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: Problems with NFS4.1 on ESXi
 
Rick Macklem <rmacklem@uoguelph.ca> wrote:
[stuff snipped]
> In FreeBSD, it actually hangs onto the parent's FH (verbatim), but mostly
> so it can do Open/Claim_NULLs for it. There is nothing in FreeBSD that
> tries to subvert FH guessing.
Oops, this is client side, not server side. (I forgot which hat I was wearing;-)
The FreeBSD server does not keep track of parents.

rick

--b.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* AW: Problems with NFS4.1 on ESXi
  2022-04-27  6:08                         ` crispyduck
@ 2022-05-05  5:31                           ` andreas-nagy
  2022-05-05 14:19                             ` Rick Macklem
  2022-05-05 16:38                             ` Chuck Lever III
  0 siblings, 2 replies; 25+ messages in thread
From: andreas-nagy @ 2022-05-05  5:31 UTC (permalink / raw)
  To: crispyduck, Rick Macklem, J. Bruce Fields, linux-nfs; +Cc: Chuck Lever III

Hi,

Was someone able to check the NFS3 vs NFS4.1 traces (https://easyupload.io/7bt624)? Due to quarantine, I have so far not been able to test against FreeBSD.

Would it make any difference to update the Ubuntu-based Linux kernel from 5.13 to 5.15?

Br
Andreas




From: crispyduck@outlook.at <crispyduck@outlook.at>
Sent: Wednesday, 27 April 2022 08:08
To: Rick Macklem <rmacklem@uoguelph.ca>; J. Bruce Fields <bfields@fieldses.org>; linux-nfs@vger.kernel.org <linux-nfs@vger.kernel.org>
Cc: Chuck Lever III <chuck.lever@oracle.com>
Subject: AW: Problems with NFS4.1 on ESXi
 
I tried again to reproduce the "sometimes working" case, but at the moment it always fails. No Idea why it in the past sometimes worked. 
Why are this much lookups in the trace? Dont see this on other NFS clients.
 
The traces with nfs3 where it works and nfs41 where it always fails are here:
https://easyupload.io/7bt624

Both from mount till start of vm import (testvm).

exportfs -v:
/zfstank/sto1/ds110
                <world>(async,wdelay,hide,crossmnt,no_subtree_check,fsid=74345722,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)


I hope I can also do some tests against a FreeBSD server end of the week.

regards,
Andreas



From: Rick Macklem <rmacklem@uoguelph.ca>
Sent: Sunday, 24 April 2022 22:39
To: J. Bruce Fields <bfields@fieldses.org>
Cc: crispyduck@outlook.at <crispyduck@outlook.at>; Chuck Lever III <chuck.lever@oracle.com>; Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: Problems with NFS4.1 on ESXi
 
Rick Macklem <rmacklem@uoguelph.ca> wrote:
[stuff snipped]
> In FreeBSD, it actually hangs onto the parent's FH (verbatim), but mostly
> so it can do Open/Claim_NULLs for it. There is nothing in FreeBSD that
> tries to subvert FH guessing.
Oops, this is client side, not server side. (I forgot which hat I was wearing;-)
The FreeBSD server does not keep track of parents.

rick

--b.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Problems with NFS4.1 on ESXi
  2022-05-05  5:31                           ` andreas-nagy
@ 2022-05-05 14:19                             ` Rick Macklem
  2022-05-05 16:38                             ` Chuck Lever III
  1 sibling, 0 replies; 25+ messages in thread
From: Rick Macklem @ 2022-05-05 14:19 UTC (permalink / raw)
  To: andreas-nagy, crispyduck, J. Bruce Fields, linux-nfs; +Cc: Chuck Lever III

I took a quick look at the 4.1 capture, but did not see anything
that would explain the problem. There appear to be several
Read requests near the end of the capture that do not have
replies, but that was all I saw that looked unusual.

rick


________________________________________
From: andreas-nagy@outlook.com <andreas-nagy@outlook.com>
Sent: Thursday, May 5, 2022 1:31 AM
To: crispyduck@outlook.at; Rick Macklem; J. Bruce Fields; linux-nfs@vger.kernel.org
Cc: Chuck Lever III
Subject: AW: Problems with NFS4.1 on ESXi

Hi,

was someone able to check the NFS3 vs NFS4.1 traces (https://easyupload.io/7bt624)? I was due to quarantine I was so far not able to test it against FreeBSD.

Would it maybe make any difference updating the Ubuntu based Linux kernel from 5.13 to 5.15?

Br
Andreas




From: crispyduck@outlook.at <crispyduck@outlook.at>
Sent: Wednesday, 27 April 2022 08:08
To: Rick Macklem <rmacklem@uoguelph.ca>; J. Bruce Fields <bfields@fieldses.org>; linux-nfs@vger.kernel.org <linux-nfs@vger.kernel.org>
Cc: Chuck Lever III <chuck.lever@oracle.com>
Subject: AW: Problems with NFS4.1 on ESXi

I tried again to reproduce the "sometimes working" case, but at the moment it always fails. No Idea why it in the past sometimes worked.
Why are this much lookups in the trace? Dont see this on other NFS clients.

The traces with nfs3 where it works and nfs41 where it always fails are here:
https://easyupload.io/7bt624

Both from mount till start of vm import (testvm).

exportfs -v:
/zfstank/sto1/ds110
                <world>(async,wdelay,hide,crossmnt,no_subtree_check,fsid=74345722,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)


I hope I can also do some tests against a FreeBSD server end of the week.

regards,
Andreas



From: Rick Macklem <rmacklem@uoguelph.ca>
Sent: Sunday, 24 April 2022 22:39
To: J. Bruce Fields <bfields@fieldses.org>
Cc: crispyduck@outlook.at <crispyduck@outlook.at>; Chuck Lever III <chuck.lever@oracle.com>; Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: Problems with NFS4.1 on ESXi

Rick Macklem <rmacklem@uoguelph.ca> wrote:
[stuff snipped]
> In FreeBSD, it actually hangs onto the parent's FH (verbatim), but mostly
> so it can do Open/Claim_NULLs for it. There is nothing in FreeBSD that
> tries to subvert FH guessing.
Oops, this is client side, not server side. (I forgot which hat I was wearing;-)
The FreeBSD server does not keep track of parents.

rick

--b.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Problems with NFS4.1 on ESXi
  2022-05-05  5:31                           ` andreas-nagy
  2022-05-05 14:19                             ` Rick Macklem
@ 2022-05-05 16:38                             ` Chuck Lever III
  2022-05-07  1:53                               ` Chuck Lever III
  1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever III @ 2022-05-05 16:38 UTC (permalink / raw)
  To: andreas-nagy
  Cc: crispyduck, Rick Macklem, Bruce Fields, Linux NFS Mailing List



> On May 5, 2022, at 1:31 AM, andreas-nagy@outlook.com wrote:
> 
> Hi,
> 
> was someone able to check the NFS3 vs NFS4.1 traces (https://easyupload.io/7bt624)? I was due to quarantine I was so far not able to test it against FreeBSD.

I don't see anything new in the NFSv4.1 trace from the above package.

The NFSv3 trace doesn't have any remarkable failures. But since the
NFSv3 protocol doesn't have a CLOSE operation, it shouldn't be
surprising that there is no failure there.

Seeing the FreeBSD behavior is the next step. I have a little time
today to audit code to see if there's anything obvious there. I will
have to stick with ext4 since I don't have any ZFS code here and you
said you were able to reproduce on an ext4 export.


> Would it maybe make any difference updating the Ubuntu based Linux kernel from 5.13 to 5.15?

I don't yet know enough about the issue to say whether it might
have been addressed between .13 and .15. So far the issue is not
familiar from recent code changes.


> Br
> Andreas
> 
> 
> 
> 
> Von: crispyduck@outlook.at <crispyduck@outlook.at>
> Gesendet: Mittwoch, 27. April 2022 08:08
> An: Rick Macklem <rmacklem@uoguelph.ca>; J. Bruce Fields <bfields@fieldses.org>; linux-nfs@vger.kernel.org <linux-nfs@vger.kernel.org>
> Cc: Chuck Lever III <chuck.lever@oracle.com>
> Betreff: AW: Problems with NFS4.1 on ESXi 
>  
> I tried again to reproduce the "sometimes working" case, but at the moment it always fails. No Idea why it in the past sometimes worked. 
> Why are this much lookups in the trace? Dont see this on other NFS clients.
>  
> The traces with nfs3 where it works and nfs41 where it always fails are here:
> https://easyupload.io/7bt624
> 
> Both from mount till start of vm import (testvm).
> 
> exportfs -v:
> /zfstank/sto1/ds110
>                <world>(async,wdelay,hide,crossmnt,no_subtree_check,fsid=74345722,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)
> 
> 
> I hope I can also do some tests against a FreeBSD server end of the week.
> 
> regards,
> Andreas
> 
> 
> 
> Von: Rick Macklem <rmacklem@uoguelph.ca>
> Gesendet: Sonntag, 24. April 2022 22:39
> An: J. Bruce Fields <bfields@fieldses.org>
> Cc: crispyduck@outlook.at <crispyduck@outlook.at>; Chuck Lever III <chuck.lever@oracle.com>; Linux NFS Mailing List <linux-nfs@vger.kernel.org>
> Betreff: Re: Problems with NFS4.1 on ESXi 
>  
> Rick Macklem <rmacklem@uoguelph.ca> wrote:
> [stuff snipped]
>> In FreeBSD, it actually hangs onto the parent's FH (verbatim), but mostly
>> so it can do Open/Claim_NULLs for it. There is nothing in FreeBSD that
>> tries to subvert FH guessing.
> Oops, this is client side, not server side. (I forgot which hat I was wearing;-)
> The FreeBSD server does not keep track of parents.
> 
> rick
> 
> --b.

--
Chuck Lever




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Problems with NFS4.1 on ESXi
  2022-05-05 16:38                             ` Chuck Lever III
@ 2022-05-07  1:53                               ` Chuck Lever III
  0 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever III @ 2022-05-07  1:53 UTC (permalink / raw)
  To: andreas-nagy; +Cc: crispyduck, Rick Macklem, Linux NFS Mailing List



> On May 5, 2022, at 12:38 PM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
> 
> 
>> On May 5, 2022, at 1:31 AM, andreas-nagy@outlook.com wrote:
>> 
>> Hi,
>> 
>> was someone able to check the NFS3 vs NFS4.1 traces (https://easyupload.io/7bt624)? I was due to quarantine I was so far not able to test it against FreeBSD.
> 
> I don't see anything new in the NFSv4.1 trace from the above package.
> 
> The NFSv3 trace doesn't have any remarkable failures. But since the
> NFSv3 protocol doesn't have a CLOSE operation, it shouldn't be
> surprising that there is no failure there.
> 
> Seeing the FreeBSD behavior is the next step. I have a little time
> today to audit code to see if there's anything obvious there. I will
> have to stick with ext4 since I don't have any ZFS code here and you
> said you were able to reproduce on an ext4 export.

I looked for ways in which a cached open might be unintentionally
closed by a RENAME. Code audit revealed two potential candidates:
commit 7775ec57f4c7 ("nfsd: close cached files prior to a REMOVE
or RENAME that would replace target") and commit 7f84b488f9ad
("nfsd: close cached files prior to a REMOVE or RENAME that would
replace target") (Yes, they have the same short description).

I need to explore these two patches and possibly build a pynfs
test that does OPEN(CREATE)/RENAME/CLOSE. I'm away from the office
for another few days so it will take a while.
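
In the meantime, a crude approximation of that sequence can be driven from any
Linux client against the export (a sketch only: the mount point below is
hypothetical, and the Linux client will not necessarily put RENAME and CLOSE on
the wire in the same order ESXi does, so it may not reproduce the exact trace):

    import os

    MNT = "/mnt/nfs41-test"   # assumed NFSv4.1 mount of the export under test

    victim = os.path.join(MNT, "test-ovf.vmx")
    replacement = os.path.join(MNT, "test-ovf.vmx~")

    with open(victim, "w") as f:          # create the target file
        f.write("original")
    with open(replacement, "w") as f:     # the file that will replace it
        f.write("replacement")

    fd = os.open(victim, os.O_RDONLY)     # the OPEN the later CLOSE belongs to
    os.rename(replacement, victim)        # RENAME that replaces the open target
    os.close(fd)                          # CLOSE; check the capture for STALE

A capture taken while this runs should show whether the server answers the
final CLOSE with NFS4_OK or NFS4ERR_STALE.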

Until then, I guess reproducing with FreeBSD isn't needed.


>> Would it maybe make any difference updating the Ubuntu based Linux kernel from 5.13 to 5.15?
> 
> I don't yet know enough about the issue to say whether it might
> have been addressed between .13 and .15. So far the issue is not
> familiar from recent code changes.
> 
> 
>> Br
>> Andreas
>> 
>> 
>> 
>> 
>> Von: crispyduck@outlook.at <crispyduck@outlook.at>
>> Gesendet: Mittwoch, 27. April 2022 08:08
>> An: Rick Macklem <rmacklem@uoguelph.ca>; J. Bruce Fields <bfields@fieldses.org>; linux-nfs@vger.kernel.org <linux-nfs@vger.kernel.org>
>> Cc: Chuck Lever III <chuck.lever@oracle.com>
>> Betreff: AW: Problems with NFS4.1 on ESXi 
>> 
>> I tried again to reproduce the "sometimes working" case, but at the moment it always fails. No Idea why it in the past sometimes worked. 
>> Why are this much lookups in the trace? Dont see this on other NFS clients.
>> 
>> The traces with nfs3 where it works and nfs41 where it always fails are here:
>> https://easyupload.io/7bt624
>> 
>> Both from mount till start of vm import (testvm).
>> 
>> exportfs -v:
>> /zfstank/sto1/ds110
>>               <world>(async,wdelay,hide,crossmnt,no_subtree_check,fsid=74345722,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)
>> 
>> 
>> I hope I can also do some tests against a FreeBSD server end of the week.
>> 
>> regards,
>> Andreas
>> 
>> 
>> 
>> Von: Rick Macklem <rmacklem@uoguelph.ca>
>> Gesendet: Sonntag, 24. April 2022 22:39
>> An: J. Bruce Fields <bfields@fieldses.org>
>> Cc: crispyduck@outlook.at <crispyduck@outlook.at>; Chuck Lever III <chuck.lever@oracle.com>; Linux NFS Mailing List <linux-nfs@vger.kernel.org>
>> Betreff: Re: Problems with NFS4.1 on ESXi 
>> 
>> Rick Macklem <rmacklem@uoguelph.ca> wrote:
>> [stuff snipped]
>>> In FreeBSD, it actually hangs onto the parent's FH (verbatim), but mostly
>>> so it can do Open/Claim_NULLs for it. There is nothing in FreeBSD that
>>> tries to subvert FH guessing.
>> Oops, this is client side, not server side. (I forgot which hat I was wearing;-)
>> The FreeBSD server does not keep track of parents.
>> 
>> rick
>> 
>> --b.
> 
> --
> Chuck Lever

--
Chuck Lever




^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-05-07  1:53 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <AM9P191MB1665484E1EFD2088D22C2E2F8EF59@AM9P191MB1665.EURP191.PROD.OUTLOOK.COM>
2022-04-21  4:55 ` Problems with NFS4.1 on ESXi Andreas Nagy
2022-04-21 14:58   ` Chuck Lever III
2022-04-21 15:30     ` AW: " crispyduck
2022-04-21 16:40       ` J. Bruce Fields
     [not found]         ` <AM9P191MB16654F5B7541CD1E489D75608EF49@AM9P191MB1665.EURP191.PROD.OUTLOOK.COM>
2022-04-21 18:41           ` crispyduck
2022-04-21 18:54         ` J. Bruce Fields
2022-04-21 23:52           ` Rick Macklem
2022-04-21 23:58             ` Rick Macklem
2022-04-22 14:29             ` Chuck Lever III
2022-04-22 14:59               ` AW: " crispyduck
2022-04-22 15:02                 ` Chuck Lever III
2022-04-22 22:58               ` Rick Macklem
2022-04-22 15:15             ` J. Bruce Fields
2022-04-22 18:43               ` AW: " crispyduck
2022-04-22 23:03               ` Rick Macklem
2022-04-24 15:07                 ` J. Bruce Fields
2022-04-24 20:36                   ` Rick Macklem
2022-04-24 20:39                     ` Rick Macklem
2022-04-25  9:00                       ` AW: " crispyduck
2022-04-27  6:08                         ` crispyduck
2022-05-05  5:31                           ` andreas-nagy
2022-05-05 14:19                             ` Rick Macklem
2022-05-05 16:38                             ` Chuck Lever III
2022-05-07  1:53                               ` Chuck Lever III
2022-04-22 14:23           ` Olga Kornievskaia
