linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* NFS Caching broken in 4.19.37
@ 2019-07-08 18:19 Anton Ivanov
  2021-02-20 20:04 ` Salvatore Bonaccorso
  0 siblings, 1 reply; 11+ messages in thread
From: Anton Ivanov @ 2019-07-08 18:19 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hi list,

NFS caching appears broken in 4.19.37.

The more cores/threads the easier to reproduce. Tested with identical 
results on Ryzen 1600 and 1600X.

1. Mount an openwrt build tree over NFS v4
2. Run make -j `cat /proc/cpuinfo | grep vendor | wc -l` ; make clean in 
a loop
3. Result after 3-4 iterations:

State on the client

ls -laF 
/var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm

total 8
drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../

State as seen on the server (mounted via nfs from localhost):

ls -laF 
/var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
total 12
drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
-rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h

Actual state on the filesystem:

ls -laF 
/exports/work/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
total 12
drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
-rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h

So the client has quite clearly lost the plot. Telling it to drop caches 
and re-reading the directory shows the file present.

It is possible to reproduce this using a linux kernel tree too, just 
takes much more iterations - 10+ at least.

Both client and server run 4.19.37 from Debian buster. This is filed as 
debian bug 931500. I originally thought it to be autofs related, but 
IMHO it is actually something fundamentally broken in nfs caching 
resulting in cache corruption.

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NFS Caching broken in 4.19.37
  2019-07-08 18:19 NFS Caching broken in 4.19.37 Anton Ivanov
@ 2021-02-20 20:04 ` Salvatore Bonaccorso
  2021-02-20 20:13   ` Anton Ivanov
  0 siblings, 1 reply; 11+ messages in thread
From: Salvatore Bonaccorso @ 2021-02-20 20:04 UTC (permalink / raw)
  To: Anton Ivanov
  Cc: Linux Kernel Mailing List, 940821, linux-nfs, bfields, chuck.lever

Hi,

On Mon, Jul 08, 2019 at 07:19:54PM +0100, Anton Ivanov wrote:
> Hi list,
> 
> NFS caching appears broken in 4.19.37.
> 
> The more cores/threads the easier to reproduce. Tested with identical
> results on Ryzen 1600 and 1600X.
> 
> 1. Mount an openwrt build tree over NFS v4
> 2. Run make -j `cat /proc/cpuinfo | grep vendor | wc -l` ; make clean in a
> loop
> 3. Result after 3-4 iterations:
> 
> State on the client
> 
> ls -laF /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
> 
> total 8
> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
> 
> State as seen on the server (mounted via nfs from localhost):
> 
> ls -laF /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
> total 12
> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
> 
> Actual state on the filesystem:
> 
> ls -laF /exports/work/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
> total 12
> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
> 
> So the client has quite clearly lost the plot. Telling it to drop caches and
> re-reading the directory shows the file present.
> 
> It is possible to reproduce this using a linux kernel tree too, just takes
> much more iterations - 10+ at least.
> 
> Both client and server run 4.19.37 from Debian buster. This is filed as
> debian bug 931500. I originally thought it to be autofs related, but IMHO it
> is actually something fundamentally broken in nfs caching resulting in cache
> corruption.

According to the reporter downstream in Debian, at
https://bugs.debian.org/940821#26 thi seem still reproducible with
more recent kernels than the initial reported. Is there anything Anton
can provide to try to track down the issue?

Anton, can you reproduce with current stable series?

Regards,
Salvatore

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NFS Caching broken in 4.19.37
  2021-02-20 20:04 ` Salvatore Bonaccorso
@ 2021-02-20 20:13   ` Anton Ivanov
  2021-02-20 20:16     ` Chuck Lever
  0 siblings, 1 reply; 11+ messages in thread
From: Anton Ivanov @ 2021-02-20 20:13 UTC (permalink / raw)
  To: Salvatore Bonaccorso
  Cc: Linux Kernel Mailing List, 940821, linux-nfs, bfields, chuck.lever

On 20/02/2021 20:04, Salvatore Bonaccorso wrote:
> Hi,
>
> On Mon, Jul 08, 2019 at 07:19:54PM +0100, Anton Ivanov wrote:
>> Hi list,
>>
>> NFS caching appears broken in 4.19.37.
>>
>> The more cores/threads the easier to reproduce. Tested with identical
>> results on Ryzen 1600 and 1600X.
>>
>> 1. Mount an openwrt build tree over NFS v4
>> 2. Run make -j `cat /proc/cpuinfo | grep vendor | wc -l` ; make clean in a
>> loop
>> 3. Result after 3-4 iterations:
>>
>> State on the client
>>
>> ls -laF /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>>
>> total 8
>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>>
>> State as seen on the server (mounted via nfs from localhost):
>>
>> ls -laF /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>> total 12
>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
>>
>> Actual state on the filesystem:
>>
>> ls -laF /exports/work/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>> total 12
>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
>>
>> So the client has quite clearly lost the plot. Telling it to drop caches and
>> re-reading the directory shows the file present.
>>
>> It is possible to reproduce this using a linux kernel tree too, just takes
>> much more iterations - 10+ at least.
>>
>> Both client and server run 4.19.37 from Debian buster. This is filed as
>> debian bug 931500. I originally thought it to be autofs related, but IMHO it
>> is actually something fundamentally broken in nfs caching resulting in cache
>> corruption.
> According to the reporter downstream in Debian, at
> https://bugs.debian.org/940821#26 thi seem still reproducible with
> more recent kernels than the initial reported. Is there anything Anton
> can provide to try to track down the issue?
>
> Anton, can you reproduce with current stable series?

100% reproducible with any kernel from 4.9 to 5.4, stable or backports. 
It may exist in earlier versions, but I do not have a machine with 
anything before 4.9 to test at present.

 From 1-2 make clean && make  cycles to one afternoon depending on the 
number of machine cores. More cores/threads the faster it does it.

I tried playing with protocol minor versions, caching options, etc - it 
is still reproducible for any nfs4 settings as long as there is client 
side caching of metadata.

A.

>
> Regards,
> Salvatore
>

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NFS Caching broken in 4.19.37
  2021-02-20 20:13   ` Anton Ivanov
@ 2021-02-20 20:16     ` Chuck Lever
  2021-02-21  9:13       ` Salvatore Bonaccorso
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2021-02-20 20:16 UTC (permalink / raw)
  To: Anton Ivanov
  Cc: Salvatore Bonaccorso, Linux Kernel Mailing List, 940821,
	Linux NFS Mailing List, Bruce Fields



> On Feb 20, 2021, at 3:13 PM, Anton Ivanov <anton.ivanov@cambridgegreys.com> wrote:
> 
> On 20/02/2021 20:04, Salvatore Bonaccorso wrote:
>> Hi,
>> 
>> On Mon, Jul 08, 2019 at 07:19:54PM +0100, Anton Ivanov wrote:
>>> Hi list,
>>> 
>>> NFS caching appears broken in 4.19.37.
>>> 
>>> The more cores/threads the easier to reproduce. Tested with identical
>>> results on Ryzen 1600 and 1600X.
>>> 
>>> 1. Mount an openwrt build tree over NFS v4
>>> 2. Run make -j `cat /proc/cpuinfo | grep vendor | wc -l` ; make clean in a
>>> loop
>>> 3. Result after 3-4 iterations:
>>> 
>>> State on the client
>>> 
>>> ls -laF /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>>> 
>>> total 8
>>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>>> 
>>> State as seen on the server (mounted via nfs from localhost):
>>> 
>>> ls -laF /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>>> total 12
>>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>>> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
>>> 
>>> Actual state on the filesystem:
>>> 
>>> ls -laF /exports/work/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>>> total 12
>>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>>> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
>>> 
>>> So the client has quite clearly lost the plot. Telling it to drop caches and
>>> re-reading the directory shows the file present.
>>> 
>>> It is possible to reproduce this using a linux kernel tree too, just takes
>>> much more iterations - 10+ at least.
>>> 
>>> Both client and server run 4.19.37 from Debian buster. This is filed as
>>> debian bug 931500. I originally thought it to be autofs related, but IMHO it
>>> is actually something fundamentally broken in nfs caching resulting in cache
>>> corruption.
>> According to the reporter downstream in Debian, at
>> https://bugs.debian.org/940821#26 thi seem still reproducible with
>> more recent kernels than the initial reported. Is there anything Anton
>> can provide to try to track down the issue?
>> 
>> Anton, can you reproduce with current stable series?
> 
> 100% reproducible with any kernel from 4.9 to 5.4, stable or backports. It may exist in earlier versions, but I do not have a machine with anything before 4.9 to test at present.

Confirming you are varying client-side kernels. Should the Linux
NFS client maintainers be Cc'd?


> From 1-2 make clean && make  cycles to one afternoon depending on the number of machine cores. More cores/threads the faster it does it.
> 
> I tried playing with protocol minor versions, caching options, etc - it is still reproducible for any nfs4 settings as long as there is client side caching of metadata.
> 
> A.
> 
>> 
>> Regards,
>> Salvatore
>> 
> 
> -- 
> Anton R. Ivanov
> Cambridgegreys Limited. Registered in England. Company Number 10273661
> https://www.cambridgegreys.com/

--
Chuck Lever




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NFS Caching broken in 4.19.37
  2021-02-20 20:16     ` Chuck Lever
@ 2021-02-21  9:13       ` Salvatore Bonaccorso
  2021-02-21 11:38         ` Anton Ivanov
  0 siblings, 1 reply; 11+ messages in thread
From: Salvatore Bonaccorso @ 2021-02-21  9:13 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Anton Ivanov, Linux Kernel Mailing List, 940821,
	Linux NFS Mailing List, Bruce Fields, trond.myklebust,
	anna.schumaker

Hi,

On Sat, Feb 20, 2021 at 08:16:26PM +0000, Chuck Lever wrote:
> 
> 
> > On Feb 20, 2021, at 3:13 PM, Anton Ivanov <anton.ivanov@cambridgegreys.com> wrote:
> > 
> > On 20/02/2021 20:04, Salvatore Bonaccorso wrote:
> >> Hi,
> >> 
> >> On Mon, Jul 08, 2019 at 07:19:54PM +0100, Anton Ivanov wrote:
> >>> Hi list,
> >>> 
> >>> NFS caching appears broken in 4.19.37.
> >>> 
> >>> The more cores/threads the easier to reproduce. Tested with identical
> >>> results on Ryzen 1600 and 1600X.
> >>> 
> >>> 1. Mount an openwrt build tree over NFS v4
> >>> 2. Run make -j `cat /proc/cpuinfo | grep vendor | wc -l` ; make clean in a
> >>> loop
> >>> 3. Result after 3-4 iterations:
> >>> 
> >>> State on the client
> >>> 
> >>> ls -laF /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
> >>> 
> >>> total 8
> >>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
> >>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
> >>> 
> >>> State as seen on the server (mounted via nfs from localhost):
> >>> 
> >>> ls -laF /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
> >>> total 12
> >>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
> >>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
> >>> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
> >>> 
> >>> Actual state on the filesystem:
> >>> 
> >>> ls -laF /exports/work/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
> >>> total 12
> >>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
> >>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
> >>> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
> >>> 
> >>> So the client has quite clearly lost the plot. Telling it to drop caches and
> >>> re-reading the directory shows the file present.
> >>> 
> >>> It is possible to reproduce this using a linux kernel tree too, just takes
> >>> much more iterations - 10+ at least.
> >>> 
> >>> Both client and server run 4.19.37 from Debian buster. This is filed as
> >>> debian bug 931500. I originally thought it to be autofs related, but IMHO it
> >>> is actually something fundamentally broken in nfs caching resulting in cache
> >>> corruption.
> >> According to the reporter downstream in Debian, at
> >> https://bugs.debian.org/940821#26 thi seem still reproducible with
> >> more recent kernels than the initial reported. Is there anything Anton
> >> can provide to try to track down the issue?
> >> 
> >> Anton, can you reproduce with current stable series?
> > 
> > 100% reproducible with any kernel from 4.9 to 5.4, stable or backports. It may exist in earlier versions, but I do not have a machine with anything before 4.9 to test at present.
> 
> Confirming you are varying client-side kernels. Should the Linux
> NFS client maintainers be Cc'd?

Ok, agreed. Let's add them as well. NFS client maintainers any ideas
on how to trackle this?

> 
> > From 1-2 make clean && make  cycles to one afternoon depending on the number of machine cores. More cores/threads the faster it does it.
> > 
> > I tried playing with protocol minor versions, caching options, etc - it is still reproducible for any nfs4 settings as long as there is client side caching of metadata.
> > 
> > A.
> > 
> >> 
> >> Regards,
> >> Salvatore
> >> 
> > 
> > -- 
> > Anton R. Ivanov
> > Cambridgegreys Limited. Registered in England. Company Number 10273661
> > https://www.cambridgegreys.com/
> 
> --
> Chuck Lever

Regards,
Salvatore

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NFS Caching broken in 4.19.37
  2021-02-21  9:13       ` Salvatore Bonaccorso
@ 2021-02-21 11:38         ` Anton Ivanov
  2021-02-21 14:37           ` Bruce Fields
  0 siblings, 1 reply; 11+ messages in thread
From: Anton Ivanov @ 2021-02-21 11:38 UTC (permalink / raw)
  To: Salvatore Bonaccorso, Chuck Lever
  Cc: Linux Kernel Mailing List, 940821, Linux NFS Mailing List,
	Bruce Fields, trond.myklebust, anna.schumaker

On 21/02/2021 09:13, Salvatore Bonaccorso wrote:
> Hi,
> 
> On Sat, Feb 20, 2021 at 08:16:26PM +0000, Chuck Lever wrote:
>>
>>
>>> On Feb 20, 2021, at 3:13 PM, Anton Ivanov <anton.ivanov@cambridgegreys.com> wrote:
>>>
>>> On 20/02/2021 20:04, Salvatore Bonaccorso wrote:
>>>> Hi,
>>>>
>>>> On Mon, Jul 08, 2019 at 07:19:54PM +0100, Anton Ivanov wrote:
>>>>> Hi list,
>>>>>
>>>>> NFS caching appears broken in 4.19.37.
>>>>>
>>>>> The more cores/threads the easier to reproduce. Tested with identical
>>>>> results on Ryzen 1600 and 1600X.
>>>>>
>>>>> 1. Mount an openwrt build tree over NFS v4
>>>>> 2. Run make -j `cat /proc/cpuinfo | grep vendor | wc -l` ; make clean in a
>>>>> loop
>>>>> 3. Result after 3-4 iterations:
>>>>>
>>>>> State on the client
>>>>>
>>>>> ls -laF /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>>>>>
>>>>> total 8
>>>>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>>>>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>>>>>
>>>>> State as seen on the server (mounted via nfs from localhost):
>>>>>
>>>>> ls -laF /var/autofs/local/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>>>>> total 12
>>>>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>>>>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>>>>> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
>>>>>
>>>>> Actual state on the filesystem:
>>>>>
>>>>> ls -laF /exports/work/src/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.125/arch/mips/include/generated/uapi/asm
>>>>> total 12
>>>>> drwxr-xr-x 2 anivanov anivanov 4096 Jul  8 11:40 ./
>>>>> drwxr-xr-x 3 anivanov anivanov 4096 Jul  8 11:40 ../
>>>>> -rw-r--r-- 1 anivanov anivanov   32 Jul  8 11:40 ipcbuf.h
>>>>>
>>>>> So the client has quite clearly lost the plot. Telling it to drop caches and
>>>>> re-reading the directory shows the file present.
>>>>>
>>>>> It is possible to reproduce this using a linux kernel tree too, just takes
>>>>> much more iterations - 10+ at least.
>>>>>
>>>>> Both client and server run 4.19.37 from Debian buster. This is filed as
>>>>> debian bug 931500. I originally thought it to be autofs related, but IMHO it
>>>>> is actually something fundamentally broken in nfs caching resulting in cache
>>>>> corruption.
>>>> According to the reporter downstream in Debian, at
>>>> https://bugs.debian.org/940821#26 thi seem still reproducible with
>>>> more recent kernels than the initial reported. Is there anything Anton
>>>> can provide to try to track down the issue?
>>>>
>>>> Anton, can you reproduce with current stable series?
>>>
>>> 100% reproducible with any kernel from 4.9 to 5.4, stable or backports. It may exist in earlier versions, but I do not have a machine with anything before 4.9 to test at present.
>>
>> Confirming you are varying client-side kernels. Should the Linux
>> NFS client maintainers be Cc'd?
> 
> Ok, agreed. Let's add them as well. NFS client maintainers any ideas
> on how to trackle this?

This is not observed with Debian backports 5.10 package

uname -a
Linux madding 5.10.0-0.bpo.3-amd64 #1 SMP Debian 5.10.13-1~bpo10+1 
(2021-02-11) x86_64 GNU/Linux

I left the testcase running for ~ 4 hours on a 6core/12thread Ryzen. It 
should have blown up 10 times by now.

So one of the commits between 5.4 and 5.10.13 fixed it.

If nobody can think of a particular commit which fixes it, I can try 
dissecting it during the week.

A.

> 
>>
>>>  From 1-2 make clean && make  cycles to one afternoon depending on the number of machine cores. More cores/threads the faster it does it.
>>>
>>> I tried playing with protocol minor versions, caching options, etc - it is still reproducible for any nfs4 settings as long as there is client side caching of metadata.
>>>
>>> A.
>>>
>>>>
>>>> Regards,
>>>> Salvatore
>>>>
>>>
>>> -- 
>>> Anton R. Ivanov
>>> Cambridgegreys Limited. Registered in England. Company Number 10273661
>>> https://www.cambridgegreys.com/
>>
>> --
>> Chuck Lever
> 
> Regards,
> Salvatore
> 


-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NFS Caching broken in 4.19.37
  2021-02-21 11:38         ` Anton Ivanov
@ 2021-02-21 14:37           ` Bruce Fields
  2021-02-21 15:53             ` Anton Ivanov
  0 siblings, 1 reply; 11+ messages in thread
From: Bruce Fields @ 2021-02-21 14:37 UTC (permalink / raw)
  To: Anton Ivanov
  Cc: Salvatore Bonaccorso, Chuck Lever, Linux Kernel Mailing List,
	940821, Linux NFS Mailing List, trond.myklebust, anna.schumaker

On Sun, Feb 21, 2021 at 11:38:51AM +0000, Anton Ivanov wrote:
> On 21/02/2021 09:13, Salvatore Bonaccorso wrote:
> >On Sat, Feb 20, 2021 at 08:16:26PM +0000, Chuck Lever wrote:
> >>Confirming you are varying client-side kernels. Should the Linux
> >>NFS client maintainers be Cc'd?
> >
> >Ok, agreed. Let's add them as well. NFS client maintainers any ideas
> >on how to trackle this?
> 
> This is not observed with Debian backports 5.10 package
> 
> uname -a
> Linux madding 5.10.0-0.bpo.3-amd64 #1 SMP Debian 5.10.13-1~bpo10+1
> (2021-02-11) x86_64 GNU/Linux

I'm still unclear: when you say you tested a certain kernel: are you
varying the client-side kernel version, or the server side, or both at
once?

--b.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NFS Caching broken in 4.19.37
  2021-02-21 14:37           ` Bruce Fields
@ 2021-02-21 15:53             ` Anton Ivanov
  2021-02-26 15:03               ` Timo Rothenpieler
  0 siblings, 1 reply; 11+ messages in thread
From: Anton Ivanov @ 2021-02-21 15:53 UTC (permalink / raw)
  To: Bruce Fields
  Cc: Salvatore Bonaccorso, Chuck Lever, Linux Kernel Mailing List,
	940821, Linux NFS Mailing List, trond.myklebust, anna.schumaker

On 21/02/2021 14:37, Bruce Fields wrote:
> On Sun, Feb 21, 2021 at 11:38:51AM +0000, Anton Ivanov wrote:
>> On 21/02/2021 09:13, Salvatore Bonaccorso wrote:
>>> On Sat, Feb 20, 2021 at 08:16:26PM +0000, Chuck Lever wrote:
>>>> Confirming you are varying client-side kernels. Should the Linux
>>>> NFS client maintainers be Cc'd?
>>> Ok, agreed. Let's add them as well. NFS client maintainers any ideas
>>> on how to trackle this?
>> This is not observed with Debian backports 5.10 package
>>
>> uname -a
>> Linux madding 5.10.0-0.bpo.3-amd64 #1 SMP Debian 5.10.13-1~bpo10+1
>> (2021-02-11) x86_64 GNU/Linux
> I'm still unclear: when you say you tested a certain kernel: are you
> varying the client-side kernel version, or the server side, or both at
> once?

Client side. This seems to be an entirely client side issue.

A variety of kernels on the clients starting from 4.9 and up to 5.10 
using 4.19 servers. I have observed it on a 4.9 client versus 4.9 server 
earlier.

4.9 fails, 4.19 fails, 5.2 fails, 5.4 fails, 5.10 works.

At present the server is at 4.19.67 in all tests.

Linux jain 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) 
x86_64 GNU/Linux

I can set-up a couple of alternative servers during the week, but so far 
everything is pointing towards a client fs cache issue, not a server one.

Brgds,

> --b.
>

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NFS Caching broken in 4.19.37
  2021-02-21 15:53             ` Anton Ivanov
@ 2021-02-26 15:03               ` Timo Rothenpieler
  2021-02-26 15:40                 ` Anton Ivanov
  0 siblings, 1 reply; 11+ messages in thread
From: Timo Rothenpieler @ 2021-02-26 15:03 UTC (permalink / raw)
  To: Anton Ivanov, Bruce Fields
  Cc: Salvatore Bonaccorso, Chuck Lever, Linux Kernel Mailing List,
	940821, Linux NFS Mailing List, trond.myklebust, anna.schumaker

[-- Attachment #1: Type: text/plain, Size: 2184 bytes --]

I think I can reproduce this, or something that at least looks very 
similar to this, on 5.10. Namely on 5.10.17 (On both Client and Server).

We are running slurm, and since a while now (coincides with updating 
from 5.4 to 5.10, but a whole bunch of other stuff was updated at the 
same time, so it took me a while to correlate this) the logs it writes 
have been truncated, but only while they're being observed on the 
client, using tail -f or something like that.

Looks like this then:

On Server:
> store01 /srv/export/home/users/timo/TestRun # ls -l slurm-41101.out
> -rw-r--r-- 1 timo timo 1931 Feb 26 15:46 slurm-41101.out
> store01 /srv/export/home/users/timo/TestRun # wc -l slurm-41101.out
> 61 slurm-41101.out

On Client:
> timo@login01 ~/TestRun $ ls -l slurm-41101.out
> -rw-r--r-- 1 timo timo 1931 Feb 26 15:46 slurm-41101.out
> timo@login01 ~/TestRun $ wc -l slurm-41101.out
> 24 slurm-41101.out

See https://gist.github.com/BtbN/b9eb4fc08ccc53bb20087bce0bf9f826 for 
the respective file-contents.

If I run the same test job, wait until its done, and then look at its 
slurm.out file, it matches between NFS Client and Server.
If I tail -f the slurm.out on an NFS client, the file stops getting 
updated on the client, but keeps getting more logs written to it on the 
NFS server.

The slurm.out file is being written to by another NFS client, which is 
running on one of the compute nodes of the system. It's being reads from 
a login node.



Timo


On 21.02.2021 16:53, Anton Ivanov wrote:
> Client side. This seems to be an entirely client side issue.
> 
> A variety of kernels on the clients starting from 4.9 and up to 5.10 
> using 4.19 servers. I have observed it on a 4.9 client versus 4.9 server 
> earlier.
> 
> 4.9 fails, 4.19 fails, 5.2 fails, 5.4 fails, 5.10 works.
> 
> At present the server is at 4.19.67 in all tests.
> 
> Linux jain 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) 
> x86_64 GNU/Linux
> 
> I can set-up a couple of alternative servers during the week, but so far 
> everything is pointing towards a client fs cache issue, not a server one.
> 
> Brgds,
> 



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4494 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NFS Caching broken in 4.19.37
  2021-02-26 15:03               ` Timo Rothenpieler
@ 2021-02-26 15:40                 ` Anton Ivanov
  2021-02-26 15:48                   ` Timo Rothenpieler
  0 siblings, 1 reply; 11+ messages in thread
From: Anton Ivanov @ 2021-02-26 15:40 UTC (permalink / raw)
  To: Timo Rothenpieler, Bruce Fields
  Cc: Salvatore Bonaccorso, Chuck Lever, Linux Kernel Mailing List,
	940821, Linux NFS Mailing List, trond.myklebust, anna.schumaker

On 26/02/2021 15:03, Timo Rothenpieler wrote:
> I think I can reproduce this, or something that at least looks very 
> similar to this, on 5.10. Namely on 5.10.17 (On both Client and Server).

I think this is a different issue - see below.

>
> We are running slurm, and since a while now (coincides with updating 
> from 5.4 to 5.10, but a whole bunch of other stuff was updated at the 
> same time, so it took me a while to correlate this) the logs it writes 
> have been truncated, but only while they're being observed on the 
> client, using tail -f or something like that.
>
> Looks like this then:
>
> On Server:
>> store01 /srv/export/home/users/timo/TestRun # ls -l slurm-41101.out
>> -rw-r--r-- 1 timo timo 1931 Feb 26 15:46 slurm-41101.out
>> store01 /srv/export/home/users/timo/TestRun # wc -l slurm-41101.out
>> 61 slurm-41101.out
>
> On Client:
>> timo@login01 ~/TestRun $ ls -l slurm-41101.out
>> -rw-r--r-- 1 timo timo 1931 Feb 26 15:46 slurm-41101.out
>> timo@login01 ~/TestRun $ wc -l slurm-41101.out
>> 24 slurm-41101.out
>
> See https://gist.github.com/BtbN/b9eb4fc08ccc53bb20087bce0bf9f826 for 
> the respective file-contents.
>
> If I run the same test job, wait until its done, and then look at its 
> slurm.out file, it matches between NFS Client and Server.
> If I tail -f the slurm.out on an NFS client, the file stops getting 
> updated on the client, but keeps getting more logs written to it on 
> the NFS server.
>
> The slurm.out file is being written to by another NFS client, which is 
> running on one of the compute nodes of the system. It's being reads 
> from a login node.

These are two different clients, then what you see is possible on NFS 
with client side caching. If you have multiple clients reading/writing 
to the same files you usually need to tune the caching options and/or 
use locking. I suspect that if you leave it for a while (until the cache 
expires) it will sort itself out.

In my test-case it is just one client, it missed a file deletion and 
nothing short of an unmount and remount fixes that. I have waited for 30 
mins+. It does not seem to refresh or expire. I also see the opposite 
behavior - the bug shows up on 4.x up to at least 5.4. I do not see it 
on 5.10.

Brgds,


>
>
>
>
> Timo
>
>
> On 21.02.2021 16:53, Anton Ivanov wrote:
>> Client side. This seems to be an entirely client side issue.
>>
>> A variety of kernels on the clients starting from 4.9 and up to 5.10 
>> using 4.19 servers. I have observed it on a 4.9 client versus 4.9 
>> server earlier.
>>
>> 4.9 fails, 4.19 fails, 5.2 fails, 5.4 fails, 5.10 works.
>>
>> At present the server is at 4.19.67 in all tests.
>>
>> Linux jain 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 
>> (2019-11-11) x86_64 GNU/Linux
>>
>> I can set-up a couple of alternative servers during the week, but so 
>> far everything is pointing towards a client fs cache issue, not a 
>> server one.
>>
>> Brgds,
>>
>
>

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: NFS Caching broken in 4.19.37
  2021-02-26 15:40                 ` Anton Ivanov
@ 2021-02-26 15:48                   ` Timo Rothenpieler
  0 siblings, 0 replies; 11+ messages in thread
From: Timo Rothenpieler @ 2021-02-26 15:48 UTC (permalink / raw)
  To: Anton Ivanov, Bruce Fields
  Cc: Salvatore Bonaccorso, Chuck Lever, Linux Kernel Mailing List,
	Linux NFS Mailing List, trond.myklebust, anna.schumaker

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]

On 26.02.2021 16:40, Anton Ivanov wrote:
> These are two different clients, then what you see is possible on NFS 
> with client side caching. If you have multiple clients reading/writing 
> to the same files you usually need to tune the caching options and/or 
> use locking. I suspect that if you leave it for a while (until the cache 
> expires) it will sort itself out.

Yes, letting the client sit for just a few minutes (without interacting 
with file or directory in question) gets it back in sync with the server.

> In my test-case it is just one client, it missed a file deletion and 
> nothing short of an unmount and remount fixes that. I have waited for 30 
> mins+. It does not seem to refresh or expire. I also see the opposite 
> behavior - the bug shows up on 4.x up to at least 5.4. I do not see it 
> on 5.10.

Yeah, that's indeed different, though still looks somewhat similar.
Makes me wonder if what fixed that issue is what's causing mine.

The primarily broken use case here is users starting their SLURM jobs, 
and then observing them via "tail -f slurm.out", which has worked 
perfectly fine in the past, prior to the update from 5.4 to 5.10.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4494 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-02-26 15:49 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-08 18:19 NFS Caching broken in 4.19.37 Anton Ivanov
2021-02-20 20:04 ` Salvatore Bonaccorso
2021-02-20 20:13   ` Anton Ivanov
2021-02-20 20:16     ` Chuck Lever
2021-02-21  9:13       ` Salvatore Bonaccorso
2021-02-21 11:38         ` Anton Ivanov
2021-02-21 14:37           ` Bruce Fields
2021-02-21 15:53             ` Anton Ivanov
2021-02-26 15:03               ` Timo Rothenpieler
2021-02-26 15:40                 ` Anton Ivanov
2021-02-26 15:48                   ` Timo Rothenpieler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).