* Re: [syzbot] WARNING in p9_client_destroy
       [not found] <CAAZOf26g-L2nSV-Siw6mwWQv1nv6on8c0fWqB4bKmX73QAFzow@mail.gmail.com>
@ 2022-03-26 11:46 ` David Kahurani
  2022-03-26 11:48 ` Christian Schoenebeck
  1 sibling, 0 replies; 48+ messages in thread
From: David Kahurani @ 2022-03-26 11:46 UTC (permalink / raw)
  To: davem, ericvh, kuba, linux-kernel, linux_oss, lucho, netdev,
	syzkaller-bugs, v9fs-developer, syzbot+5e28cdb7ebd0f2389ca4

Sorry, got to resend this in plain text. It doesn't look like it is
getting through to the mailing lists.

On Thu, Mar 24, 2022 at 3:13 PM David Kahurani <k.kahurani@gmail.com> wrote:
>
> On Monday, February 28, 2022 at 4:38:57 AM UTC+3 asmadeus@codewreck.org wrote:
>>
>> syzbot wrote on Sun, Feb 27, 2022 at 04:53:29PM -0800:
>> > kmem_cache_destroy 9p-fcall-cache: Slab cache still has objects when
>> > called from p9_client_destroy+0x213/0x370 net/9p/client.c:1100
>>
>> hmm, there is no previous "Packet with tag %d has still references"
>> (sic) message, so this is probably because p9_tag_cleanup only relies on
>> rcu read lock for consistency, so even if the connection has been closed
>> above (clnt->trans_mode->close) there could have been a request sent
>> (= tag added) just before that which isn't visible on the destroying
>> side?
>>
>> I guess adding an rcu_barrier() is what makes most sense here to protect
>> this case?
>> I'll send a patch in the next few days unless it was a stupid idea.
>
>
> Looking at this brought me to the same conclusion.
>
> ---------------------
>
> From cd5a11207a140004bf55005fac7f7e4cec2fd075 Mon Sep 17 00:00:00 2001
> From: David Kahurani <k.kahurani@gmail.com>
> Date: Thu, 24 Mar 2022 15:00:23 +0300
> Subject: [PATCH] net/9p: Flush any delayed rcu free
>
> As is best practice, make sure any RCU-delayed frees have completed
> before destroying the fcall cache. This fixes the following syzbot report:
>
> kmem_cache_destroy 9p-fcall-cache: Slab cache still has objects when called from p9_client_destroy+0x213/0x370 net/9p/client.c:1100
> WARNING: CPU: 1 PID: 3701 at mm/slab_common.c:502 kmem_cache_destroy mm/slab_common.c:502 [inline]
> WARNING: CPU: 1 PID: 3701 at mm/slab_common.c:502 kmem_cache_destroy+0x13b/0x140 mm/slab_common.c:490
> Modules linked in:
> CPU: 1 PID: 3701 Comm: syz-executor.3 Not tainted 5.17.0-rc5-syzkaller-00021-g23d04328444a #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
> RIP: 0010:kmem_cache_destroy mm/slab_common.c:502 [inline]
> RIP: 0010:kmem_cache_destroy+0x13b/0x140 mm/slab_common.c:490
> Code: da a8 0e 48 89 ee e8 44 6e 15 00 eb c1 c3 48 8b 55 58 48 c7 c6 60 cd b6 89 48 c7 c7 30 83 3a 8b 48 8b 4c 24 18 e8 9b 30 60 07 <0f> 0b eb a0 90 41 55 49 89 d5 41 54 49 89 f4 55 48 89 fd 53 48 83
> RSP: 0018:ffffc90002767cf0 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: 1ffff920004ecfa5 RCX: 0000000000000000
> RDX: ffff88801e56a280 RSI: ffffffff815f4b38 RDI: fffff520004ecf90
> RBP: ffff888020ba8b00 R08: 0000000000000000 R09: 0000000000000000
> R10: ffffffff815ef1ce R11: 0000000000000000 R12: 0000000000000001
> R13: ffffc90002767d68 R14: dffffc0000000000 R15: 0000000000000000
> FS:  00005555561b0400(0000) GS:ffff88802ca00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000555556ead708 CR3: 0000000068b97000 CR4: 0000000000150ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  <TASK>
>  p9_client_destroy+0x213/0x370 net/9p/client.c:1100
>  v9fs_session_close+0x45/0x2d0 fs/9p/v9fs.c:504
>  v9fs_kill_super+0x49/0x90 fs/9p/vfs_super.c:226
>  deactivate_locked_super+0x94/0x160 fs/super.c:332
>  deactivate_super+0xad/0xd0 fs/super.c:363
>  cleanup_mnt+0x3a2/0x540 fs/namespace.c:1173
>  task_work_run+0xdd/0x1a0 kernel/task_work.c:164
>  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
>  exit_to_user_mode_loop kernel/entry/common.c:175 [inline]
>  exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207
>  __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
>  syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
>  do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x7f5ff63ed4c7
> Code: ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007fff01862e98 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f5ff63ed4c7
> RDX: 00007fff01862f6c RSI: 000000000000000a RDI: 00007fff01862f60
> RBP: 00007fff01862f60 R08: 00000000ffffffff R09: 00007fff01862d30
> R10: 00005555561b18b3 R11: 0000000000000246 R12: 00007f5ff64451ea
> R13: 00007fff01864020 R14: 00005555561b1810 R15: 00007fff01864060
>  </TASK>
>
> Signed-off-by: David Kahurani <k.kahurani@gmail.com>
> Reported-by: syzbot+5e28cdb7ebd0f2389ca4@syzkaller.appspotmail.com
> ---
>  net/9p/client.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 8bba0d9cf..67c51913a 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -1097,6 +1097,7 @@ void p9_client_destroy(struct p9_client *clnt)
>
>   p9_tag_cleanup(clnt);
>
> + rcu_barrier();
>   kmem_cache_destroy(clnt->fcall_cache);
>   kfree(clnt);
>  }
> --
> 2.25.1
>
>
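For reference, the rule being applied here is the standard one from the
kernel's RCU documentation: if objects of a kmem_cache may be freed via
call_rcu(), all pending callbacks must have completed before
kmem_cache_destroy() is called. A minimal sketch of the pattern (made-up
names, not the actual 9p structures):

	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct foo {
		struct rcu_head rcu;
	};

	static struct kmem_cache *foo_cache;

	static void foo_free_rcu(struct rcu_head *head)
	{
		kmem_cache_free(foo_cache, container_of(head, struct foo, rcu));
	}

	static void foo_release(struct foo *f)
	{
		call_rcu(&f->rcu, foo_free_rcu); /* free deferred past a grace period */
	}

	static void foo_teardown(void)
	{
		rcu_barrier();                 /* wait for pending foo_free_rcu() calls */
		kmem_cache_destroy(foo_cache); /* cache is now guaranteed to be empty */
	}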

* Re: [syzbot] WARNING in p9_client_destroy
       [not found] <CAAZOf26g-L2nSV-Siw6mwWQv1nv6on8c0fWqB4bKmX73QAFzow@mail.gmail.com>
  2022-03-26 11:46 ` [syzbot] WARNING in p9_client_destroy David Kahurani
@ 2022-03-26 11:48 ` Christian Schoenebeck
  2022-03-26 12:24   ` asmadeus
  1 sibling, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-03-26 11:48 UTC (permalink / raw)
  To: David Kahurani
  Cc: davem, ericvh, kuba, linux-kernel, lucho, netdev, syzkaller-bugs,
	v9fs-developer, syzbot+5e28cdb7ebd0f2389ca4, asmadeus

On Thursday, 24 March 2022 13:13:25 CET David Kahurani wrote:
> On Monday, February 28, 2022 at 4:38:57 AM UTC+3 asmadeus@codewreck.org
> 
> wrote:
> > syzbot wrote on Sun, Feb 27, 2022 at 04:53:29PM -0800:
> > > kmem_cache_destroy 9p-fcall-cache: Slab cache still has objects when
> > > called from p9_client_destroy+0x213/0x370 net/9p/client.c:1100
> > 
> > hmm, there is no previous "Packet with tag %d has still references"
> > (sic) message, so this is probably because p9_tag_cleanup only relies on
> > rcu read lock for consistency, so even if the connection has been closed
> > above (clnt->trans_mode->close) there could have been a request sent
> > (= tag added) just before that which isn't visible on the destroying
> > side?
> > 
> > I guess adding an rcu_barrier() is what makes most sense here to protect
> > this case?
> > I'll send a patch in the next few days unless it was a stupid idea.
> 
> Looking at this brought me to the same conclusion.
> 
> ---------------------
> 
> From cd5a11207a140004bf55005fac7f7e4cec2fd075 Mon Sep 17 00:00:00 2001
> From: David Kahurani <k.kahurani@gmail.com>
> Date: Thu, 24 Mar 2022 15:00:23 +0300
> Subject: [PATCH] net/9p: Flush any delayed rcu free
> 
> As is best practice, make sure any RCU-delayed frees have completed
> before destroying the fcall cache. This fixes the following syzbot report:
> 
> [...]
> 
> Signed-off-by: David Kahurani <k.kahurani@gmail.com>
> Reported-by: syzbot+5e28cdb7ebd0f2389ca4@syzkaller.appspotmail.com

I'm not absolutely sure that this will really fix the issue, but it seems to
be a good idea to add an rcu_barrier() call here nevertheless.

Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>

> ---
>  net/9p/client.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 8bba0d9cf..67c51913a 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -1097,6 +1097,7 @@ void p9_client_destroy(struct p9_client *clnt)
> 
>   p9_tag_cleanup(clnt);
> 
> + rcu_barrier();
>   kmem_cache_destroy(clnt->fcall_cache);
>   kfree(clnt);
>  }





* Re: [syzbot] WARNING in p9_client_destroy
  2022-03-26 11:48 ` Christian Schoenebeck
@ 2022-03-26 12:24   ` asmadeus
  2022-03-26 12:36     ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: asmadeus @ 2022-03-26 12:24 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	syzkaller-bugs, v9fs-developer, syzbot+5e28cdb7ebd0f2389ca4

Christian Schoenebeck wrote on Sat, Mar 26, 2022 at 12:48:26PM +0100:
> [...]
>
> > Signed-off-by: David Kahurani <k.kahurani@gmail.com>
> > Reported-by: syzbot+5e28cdb7ebd0f2389ca4@syzkaller.appspotmail.com

Looks good to me - it's pretty much what I'd have done if I hadn't
forgotten!
It doesn't strike me as anything critical and I don't have anything else
for this cycle so I'll just queue it in -next for now, and submit it
at the start of the 5.19 cycle in ~2 months.

> I'm not absolutely sure that this will really fix this issue, but it seems to 
> be a good idea to add a rcu_barrier() call here nevertheless.

Yeah, I'm not really sure either but this is the only idea I have given
the debug code doesn't list anything left in the cache, and David came
to the same conclusion :/

Can't hurt though, so let's try and see if syzbot complains
again. Thanks for the review!

-- 
Dominique

* Re: [syzbot] WARNING in p9_client_destroy
  2022-03-26 12:24   ` asmadeus
@ 2022-03-26 12:36     ` Christian Schoenebeck
  2022-03-26 13:35       ` 9p fscache Duplicate cookie detected (Was: [syzbot] WARNING in p9_client_destroy) asmadeus
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-03-26 12:36 UTC (permalink / raw)
  To: asmadeus
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	syzkaller-bugs, v9fs-developer, syzbot+5e28cdb7ebd0f2389ca4

On Saturday, 26 March 2022 13:24:10 CET asmadeus@codewreck.org wrote:
> Christian Schoenebeck wrote on Sat, Mar 26, 2022 at 12:48:26PM +0100:
> > [...]
> > 
> > > Signed-off-by: David Kahurani <k.kahurani@gmail.com>
> > > Reported-by: syzbot+5e28cdb7ebd0f2389ca4@syzkaller.appspotmail.com
> 
> Looks good to me - it's pretty much what I'd have done if I hadn't
> forgotten!
> It doesn't strike me as anything critical and I don't have anything else
> for this cycle so I'll just queue it in -next for now, and submit it
> at the start of the 5.19 cycle in ~2months.

BTW, another issue that I have been seeing for a long time affects the
fs-cache: when I use cache=mmap, things seem to be harmless; I periodically
see messages like these, but that's about it:

[90763.435562] FS-Cache: Duplicate cookie detected
[90763.436514] FS-Cache: O-cookie c=00dcb42f [p=00000003 fl=216 nc=0 na=0]
[90763.437795] FS-Cache: O-cookie d=0000000000000000{?} n=0000000000000000
[90763.440096] FS-Cache: O-key=[8] 'a7ab2c0000000000'
[90763.441656] FS-Cache: N-cookie c=00dcb4a7 [p=00000003 fl=2 nc=0 na=1]
[90763.446753] FS-Cache: N-cookie d=000000005b583d5a{9p.inode} n=00000000212184fb
[90763.448196] FS-Cache: N-key=[8] 'a7ab2c0000000000'

The real trouble starts when I use cache=loose though; in that case I get all
sorts of misbehaviours from time to time, especially complaints about invalid
file descriptors.

Any clues?

Best regards,
Christian Schoenebeck



* Re: 9p fscache Duplicate cookie detected (Was: [syzbot] WARNING in p9_client_destroy)
  2022-03-26 12:36     ` Christian Schoenebeck
@ 2022-03-26 13:35       ` asmadeus
  2022-03-30 12:21         ` 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected) Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: asmadeus @ 2022-03-26 13:35 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells

(+David Howells in Cc as he's knows how that works better than me;
 -syzbot lists as it doesn't really concern this bug)

Christian Schoenebeck wrote on Sat, Mar 26, 2022 at 01:36:31PM +0100:
> BTW, another issue that I am seeing for a long time affects the fs-cache: When
> I use cache=mmap then things seem to be harmless, I periodically see messages
> like these, but that's about it:
> 
> [90763.435562] FS-Cache: Duplicate cookie detected
> [90763.436514] FS-Cache: O-cookie c=00dcb42f [p=00000003 fl=216 nc=0 na=0]
> [90763.437795] FS-Cache: O-cookie d=0000000000000000{?} n=0000000000000000
> [90763.440096] FS-Cache: O-key=[8] 'a7ab2c0000000000'
> [90763.441656] FS-Cache: N-cookie c=00dcb4a7 [p=00000003 fl=2 nc=0 na=1]
> [90763.446753] FS-Cache: N-cookie d=000000005b583d5a{9p.inode} n=00000000212184fb
> [90763.448196] FS-Cache: N-key=[8] 'a7ab2c0000000000'

hm, fscache code shouldn't be used for cache=mmap, I'm surprised you can
hit this...

> The real trouble starts when I use cache=loose though, in this case I get all
> sorts of misbehaviours from time to time, especially complaining about invalid
> file descriptors.

... but I did encounter these on cache=loose/fscache, although I hadn't
noticed any bad behaviour such as invalid file descriptors.

> Any clues?

Since I hadn't noticed real harm I didn't look too hard into it, I have
a couple of ideas:
- the cookie is just a truncated part of the inode number, it's possible
we get real collisions because there are no guarantees there won't be
identical inodes there.
In particular, it's trivial to reproduce by exporting submounts:

## on host in export directory
# mount -t tmpfs tmpfs m1
# mount -t tmpfs tmpfs m2
# echo foo > m1/a
# echo bar > m2/a
# ls -li m1 m2
m1:
total 4
2 -rw-r--r-- 1 asmadeus users 4 Mar 26 22:23 a

m2:
total 4
2 -rw-r--r-- 1 asmadeus users 4 Mar 26 22:23 a

## on client
# cat /mnt/t/m*/a
foo
bar
FS-Cache: Duplicate cookie detected
FS-Cache: O-cookie c=0000099a [fl=4000 na=0 nA=0 s=-]
FS-Cache: O-cookie V=00000006 [9p,tmp,]
FS-Cache: O-key=[8] '0200000000000000'
FS-Cache: N-cookie c=0000099b [fl=0 na=0 nA=0 s=-]
FS-Cache: N-cookie V=00000006 [9p,tmp,]
FS-Cache: N-key=[8] '0200000000000000'


But as you can see despite the warning the content is properly
different, and writing also works, so this probably isn't it... Although
the fscache code we're using is totally different -- your dmesg output
is from the "pre-netfs" code, so that might have gotten fixed as a side
effect?


- lifecycle difference between inode and fscache entry.
David pushed a patch a few years back to address this but it looks like
it never got merged:
https://lore.kernel.org/lkml/155231584487.2992.17466330160329385162.stgit@warthog.procyon.org.uk/

the rationale is that we could evict the inode then reallocate it, and
it'd generate a new fscache entry with the same key before the previous
fscache entry had been freed.
I'm not sure if that got fixed otherwise and it might not be possible
anymore, I didn't follow that, but given 


 - some other bug...

If you have some kind of reproducer of invalid file descriptor or similar
errors I'd be happy to dig a bit more; I don't particularly like all aspects
of our cache model but it's not good if it corrupts things.

-- 
Dominique

* 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-03-26 13:35       ` 9p fscache Duplicate cookie detected (Was: [syzbot] WARNING in p9_client_destroy) asmadeus
@ 2022-03-30 12:21         ` Christian Schoenebeck
  2022-03-30 21:47           ` asmadeus
                             ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Christian Schoenebeck @ 2022-03-30 12:21 UTC (permalink / raw)
  To: asmadeus
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

I made some tests & benchmarks regarding the fs-cache issue of 9p, comparing
different kernel versions and kernel configurations.

Setup: all tests were run with QEMU v7.0.0-rc0+ [3]. The Linux guest system
was running 9p as its root filesystem, as described in the QEMU 9p root fs
HOWTO [4], and I then installed the build tools required for this test into
the guest OS. In each test run I compiled the same source files as a parallel
build (make -jN). Between runs I deleted the build directory, rebooted the
guest system with a different kernel config, and restarted the build on the
guest. Results:

Case  Linux kernel version           .config  msize    cache  duration  host cpu  errors/warnings

A)    5.17.0+[2] + msize patches[1]  debug    4186112  mmap   20m 40s   ~80%      none
B)    5.17.0+[2] + msize patches[1]  debug    4186112  loose  31m 28s   ~35%      several errors (compilation completed)
C)    5.17.0+[2] + msize patches[1]  debug    507904   mmap   20m 25s   ~84%      none
D)    5.17.0+[2] + msize patches[1]  debug    507904   loose  31m 2s    ~33%      several errors (compilation completed)
E)    5.17.0+[2]                     debug    512000   mmap   23m 45s   ~75%      none
F)    5.17.0+[2]                     debug    512000   loose  32m 6s    ~31%      several errors (compilation completed)
G)    5.17.0+[2]                     release  512000   mmap   23m 18s   ~76%      none
H)    5.17.0+[2]                     release  512000   loose  32m 33s   ~31%      several errors (compilation completed)
I)    5.17.0+[2] + msize patches[1]  release  4186112  mmap   20m 30s   ~83%      none
J)    5.17.0+[2] + msize patches[1]  release  4186112  loose  31m 21s   ~31%      several errors (compilation completed)
K)    5.10.84                        release  512000   mmap   39m 20s   ~80%      none
L)    5.10.84                        release  512000   loose  13m 40s   ~55%      none

[1] 9p msize patches v4 (2021-12-30): https://lore.kernel.org/netdev/cover.1640870037.git.linux_oss@crudebyte.com/
[2] Linux kernel "5.17.0+": SHA-1 710f5d627a98 ("Merge tag 'usb-5.18-rc1'", 2022-03-26 13:08:25)
[3] QEMU "v7.0.0-rc0+": SHA-1 1d60bb4b146 ("Merge tag 'pull-request-2022-03-15v2'", 2022-03-16 10:43:58)
[4] 9p as root filesystem: https://wiki.qemu.org/Documentation/9p_root_fs

As for fs-cache issues:
======================

Disclaimer: I have not looked into the fs-cache sources yet, so I am not
sure, but my first impression is that something probably got broken with the
recent fs-cache changes (see the errors column, especially in comparison to
case L, which did not generate any errors). Also note the huge build duration
differences, especially in comparison to case L: fs-cache (cache=loose) got
significantly slower, while cache=mmap OTOH became significantly faster.

About the errors: I actually already see errors with cache=loose and a recent
kernel version just when booting the guest OS. For these tests I chose sources
which allowed me to complete the build, so that I could capture a benchmark as
well; I got some "soft" errors with those, but the build completed at least.
Other sources OTOH did not complete and aborted with certain invalid file
descriptor errors, so I obviously could not use them for these benchmarks.

debug/release .config: In the first runs with recent kernel 5.17.0+ I still
had debugging turned on, whereas the older kernel was optimized. So I repeated
the tests of kernel 5.17.0+ with -O2 and debugging options turned off, but the
numbers only slightly improved. So debug vs. release does not seem to have a
significant impact on the results.

host cpu column: these were just very approximate numbers that I additionally
wrote down to compare host CPU saturation during these tests.

As for latest msize patches (v4):
================================

large msize: In these tests there is a very large amount of rather small-chunk
I/O in parallel, where a huge msize (e.g. 4MB) does not really bring
advantages. So this is different to my previous benchmarks, which focused on
large-chunk sequential I/O, where large msize values could shine. You can see
that case A) is probably even a bit slower with msize=4MB; I am assuming that
Treaddir requests still being msize large might hurt here. I still need to
verify that though.

small msize: The results also suggest that the msize patches bring advantages
with a smaller msize value in comparison to unpatched kernels. I assume that's
because of the last bunch of patches, which reduce the size of most 9p
requests to what they really need, instead of always allocating 'msize' for
each 9p request, as is still the case on master.
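To illustrate the difference (a simplified sketch; names partly hypothetical,
not the exact patch contents):

	/* on master: every request buffer is a full msize allocation */
	fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
	fc->capacity = c->msize;

	/* with the msize patches (sketch): size the buffer to what the
	 * particular message actually needs */
	size_t need = p9_msg_buf_size(c, type, ap);	/* hypothetical helper */
	fc->sdata = kmalloc(need, GFP_NOFS);
	fc->capacity = need;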

  ...

And finally in response to your previous email, see below ...

On Saturday, 26 March 2022 14:35:14 CEST asmadeus@codewreck.org wrote:
> (+David Howells in Cc as he's knows how that works better than me;
>  -syzbot lists as it doesn't really concern this bug)

+Greg Kurz, for 9p server part

> Christian Schoenebeck wrote on Sat, Mar 26, 2022 at 01:36:31PM +0100:
> > BTW, another issue that I am seeing for a long time affects the fs-cache:
> > When I use cache=mmap then things seem to be harmless, I periodically see
> > messages like these, but that's about it:
> > 
> > [90763.435562] FS-Cache: Duplicate cookie detected
> > [90763.436514] FS-Cache: O-cookie c=00dcb42f [p=00000003 fl=216 nc=0 na=0]
> > [90763.437795] FS-Cache: O-cookie d=0000000000000000{?} n=0000000000000000
> > [90763.440096] FS-Cache: O-key=[8] 'a7ab2c0000000000'
> > [90763.441656] FS-Cache: N-cookie c=00dcb4a7 [p=00000003 fl=2 nc=0 na=1]
> > [90763.446753] FS-Cache: N-cookie d=000000005b583d5a{9p.inode}
> > n=00000000212184fb [90763.448196] FS-Cache: N-key=[8] 'a7ab2c0000000000'
> 
> hm, fscache code shouldn't be used for cache=mmap, I'm surprised you can
> hit this...

I assume you mean that the 9p driver does not explicitly ask for fs-cache to
be used for mmap. I see that 9p uses the kernel's generalized mmap
implementation:

https://github.com/torvalds/linux/blob/d888c83fcec75194a8a48ccd283953bdba7b2550/fs/9p/vfs_file.c#L481

I haven't dived further into this, but the kernel has to use some kind of
filesystem cache anyway to provide the mmap functionality, so I guess it makes
sense that I got those warning messages from the FS-Cache subsystem?

> > The real trouble starts when I use cache=loose though, in this case I get
> > all sorts of misbehaviours from time to time, especially complaining
> > about invalid file descriptors.
> 
> ... but I did encounter these on cache=loose/fscache, although I hadn't
> noticed any bad behaviour such as invalid file descriptors.
> 
> > Any clues?
> 
> Since I hadn't noticed real harm I didn't look too hard into it, I have
> a couple of ideas:
> - the cookie is just a truncated part of the inode number, it's possible
> we get real collisions because there are no guarantees there won't be
> identical inodes there.

I think by 'truncated' you actually mean what's going on on the 9p server
(QEMU) side, see below ...

> In particular, it's trivial to reproduce by exporting submounts:
> 
> ## on host in export directory
> # mount -t tmpfs tmpfs m1
> # mount -t tmpfs tmpfs m2
> # echo foo > m1/a
> # echo bar > m2/a
> # ls -li m1 m2
> m1:
> total 4
> 2 -rw-r--r-- 1 asmadeus users 4 Mar 26 22:23 a
> 
> m2:
> total 4
> 2 -rw-r--r-- 1 asmadeus users 4 Mar 26 22:23 a
> 
> ## on client
> # cat /mnt/t/m*/a
> foo
> bar
> FS-Cache: Duplicate cookie detected
> FS-Cache: O-cookie c=0000099a [fl=4000 na=0 nA=0 s=-]
> FS-Cache: O-cookie V=00000006 [9p,tmp,]
> FS-Cache: O-key=[8] '0200000000000000'
> FS-Cache: N-cookie c=0000099b [fl=0 na=0 nA=0 s=-]
> FS-Cache: N-cookie V=00000006 [9p,tmp,]
> FS-Cache: N-key=[8] '0200000000000000'

With QEMU >= 5.2 you should see the following QEMU warning with your reproducer:

"
qemu-system-x86_64: warning: 9p: Multiple devices detected in same VirtFS
export, which might lead to file ID collisions and severe misbehaviours on
guest! You should either use a separate export for each device shared from
host or use virtfs option 'multidevs=remap'!
"

And after restarting QEMU with 'multidevs=remap' you won't get such errors
anymore. I just tested this right now: without 'multidevs=remap' I would get
those errors with your reproducer above, with 'multidevs=remap' there were
no errors.

Background: the Linux 9p driver is using the 9p "QID path" as file ID, i.e. as
key for looking up entries in the fs-cache:
https://github.com/torvalds/linux/blob/d888c83fcec75194a8a48ccd283953bdba7b2550/fs/9p/cache.c#L65

By default QEMU just uses the host file's inode number as "QID path". So if
you have multiple filesystems inside the exported tree, this can lead to
collisions. Usually we "should" place both the device ID number and the inode
number into "QID path" to prevent that, but the problem is that "QID path" is
currently only 64 bits large in the 9p protocol, so it is too small to hold
both device ID and inode number:
http://ericvh.github.io/9p-rfc/rfc9p2000.html#anchor32

If 'multidevs=remap' is passed to QEMU though, then guaranteed unique "QID
path" numbers are generated, even if there are multiple filesystems mounted
inside the exported tree. So you won't get collisions in this case. This is
usually cost-free, because we use the fact that inode numbers are always
generated sequentially by host file systems from 1 upwards. So on the
left-hand side of inode numbers we usually have plenty of zeros and can
prefix them with our own numbers there to prevent collisions while still
being able to squeeze them into 64 bits.
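To illustrate the remapping idea, here is a toy sketch (not QEMU's actual
algorithm):

	#include <assert.h>
	#include <stdint.h>

	/* Encode a small per-device prefix into the otherwise unused high
	 * bits of the inode number, so that inodes from different host
	 * filesystems cannot collide within the 64-bit QID path. */
	static uint64_t qid_path_remap(uint64_t dev_prefix, uint64_t st_ino,
	                               unsigned int prefix_bits)
	{
		/* only works while real inode numbers stay below the prefix */
		assert(prefix_bits < 64 && (st_ino >> (64 - prefix_bits)) == 0);
		return (dev_prefix << (64 - prefix_bits)) | st_ino;
	}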

> But as you can see despite the warning the content is properly
> different, and writing also works, so this probably isn't it... Although
> the fscache code we're using is totally different -- your dmesg output
> is from the "pre-netfs" code, so that might have gotten fixed as a side
> effect?
> 
> - lifecycle différence between inode and fscache entry.
> David pushed a patch a few years back to address this but it looks like
> it never got merged:
> https://lore.kernel.org/lkml/155231584487.2992.17466330160329385162.stgit@wa
> rthog.procyon.org.uk/
> 
> the rationale is that we could evict the inode then reallocate it, and
> it'd generate a new fscache entry with the same key before the previous
> fscache entry had been freed.
> I'm not sure if that got fixed otherwise and it might not be possible
> anymore, I didn't follow that, but given

I don't know the current fs-cache implementation in the Linux kernel yet, so I
can't comment on this part at this point.

>  - some other bug...
> 
> If you have some kind of reproducer of invalid filedescriptor or similar
> errors I'd be happy to dig a bit more, I don't particularly like all
> aspect of our cache model but it's not good if it corrupts things.

Maybe you can reproduce this with the root fs setup [4] described above? As I
said, I immediately get errors when the guest OS is booting. So I don't have
to run anything fancy to get errors with cache=loose & a recent kernel.

Best regards,
Christian Schoenebeck



* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-03-30 12:21         ` 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected) Christian Schoenebeck
@ 2022-03-30 21:47           ` asmadeus
  2022-04-01 14:19             ` Christian Schoenebeck
  2022-04-11  8:10             ` David Howells
  2022-04-09 11:16           ` Christian Schoenebeck
  2022-04-11  7:59           ` 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected) David Howells
  2 siblings, 2 replies; 48+ messages in thread
From: asmadeus @ 2022-03-30 21:47 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

Thanks Christian!

Christian Schoenebeck wrote on Wed, Mar 30, 2022 at 02:21:16PM +0200:
> Case  Linux kernel version           .config  msize    cache  duration  host cpu  errors/warnings
> 
> A)    5.17.0+[2] + msize patches[1]  debug    4186112  mmap   20m 40s   ~80%      none
> B)    5.17.0+[2] + msize patches[1]  debug    4186112  loose  31m 28s   ~35%      several errors (compilation completed)
> C)    5.17.0+[2] + msize patches[1]  debug    507904   mmap   20m 25s   ~84%      none
> D)    5.17.0+[2] + msize patches[1]  debug    507904   loose  31m 2s    ~33%      several errors (compilation completed)
> E)    5.17.0+[2]                     debug    512000   mmap   23m 45s   ~75%      none
> F)    5.17.0+[2]                     debug    512000   loose  32m 6s    ~31%      several errors (compilation completed)
> G)    5.17.0+[2]                     release  512000   mmap   23m 18s   ~76%      none
> H)    5.17.0+[2]                     release  512000   loose  32m 33s   ~31%      several errors (compilation completed)
> I)    5.17.0+[2] + msize patches[1]  release  4186112  mmap   20m 30s   ~83%      none
> J)    5.17.0+[2] + msize patches[1]  release  4186112  loose  31m 21s   ~31%      several errors (compilation completed)
> K)    5.10.84                        release  512000   mmap   39m 20s   ~80%      none
> L)    5.10.84                        release  512000   loose  13m 40s   ~55%      none

ow.

> Disclaimer: I have not looked into the fs-cache sources yet, so I am not sure,
> but my first impression is that probably something got broken with recent
> fs-cache changes (see column errors, especially in comparison to case L) which
> did not generate any errors)? And also note the huge build duration 
> differences, especially in comparison to case L), so fs-cache (cache=loose)
> also got significantly slower while cache=mmap OTOH became significantly
> faster?

Yes, that's a big regression; I didn't do any performance benchmark with
the new patches as I didn't think it'd matter but I obviously should
have.

There is one thing I must check: I know new kernels will be writing in
4k chunks and that is going to be very slow until the netfs write
helpers are finished, but I thought the old code did the same.
If the old code had bigger writes that performance will probably come
back.
Otherwise there's some other problem, like not reusing cached content that we
should be reusing.


> About the errors: I actually already see errors with cache=loose and recent
> kernel version just when booting the guest OS. For these tests I chose some
> sources which allowed me to complete the build to capture some benchmark as
> well, I got some "soft" errors with those, but the build completed at least.
> I had other sources OTOH which did not complete though and aborted with
> certain invalid file descriptor errors, which I obviously could not use for
> those benchmarks here.

That's less surprising, the change was really huge. I'm annoyed because
I did test part of a parallel Linux kernel compilation with
cache=fscache without noticing a problem :/

I'll try to reproduce this weekend-ish.

> > Christian Schoenebeck wrote on Sat, Mar 26, 2022 at 01:36:31PM +0100:
> > hm, fscache code shouldn't be used for cache=mmap, I'm surprised you can
> > hit this...
> 
> I assume that you mean that 9p driver does not explicitly ask for fs-cache
> being used for mmap. I see that 9p uses the kernel's generalized mmap
> implementation:
> 
> https://github.com/torvalds/linux/blob/d888c83fcec75194a8a48ccd283953bdba7b2550/fs/9p/vfs_file.c#L481
> 
> I haven't dived further into this, but the kernel has to use some kind of
> filesystem cache anyway to provide the mmap functionality, so I guess it makes
> sense that I got those warning messages from the FS-Cache subsystem?

It uses the generic mmap, which has VFS caching, but definitely not
fs-cache.
fs-cache adds more hooks for cachefilesd (writing file contents to disk
for a bigger cache) and things like that, which cache=loose/mmap shouldn't
care about. cache=loose actually just disables some key parts, so I'm
not surprised it shares bugs with the new code, but cache=mmap is really
independent and I need to trace where these come from...

> With QEMU >= 5.2 you should see the following QEMU warning with your reproducer:
> 
> "
> qemu-system-x86_64: warning: 9p: Multiple devices detected in same VirtFS
> export, which might lead to file ID collisions and severe misbehaviours on
> guest! You should either use a separate export for each device shared from
> host or use virtfs option 'multidevs=remap'!
> "

oh, I wasn't aware of the new option. Good job there!

It's the easiest way to reproduce, but there are also harder-to-fix
collisions: file systems only guarantee uniqueness for (fsid, inode
number, version), which is usually bigger than 128 bits (although version
is often 0), and version isn't exposed to userspace easily...
What we'd want for uniqueness is a handle from e.g. name_to_handle_at, but
that'd add overhead, wouldn't fit in qid path, and not all filesystems are
capable of providing one... The 9p protocol just doesn't want bigger handles
than qid path.
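For reference, a minimal userspace sketch of that API; the handles it returns
are typically larger than the 64 bits a qid path can carry:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		struct file_handle *fh;
		int mount_id;

		if (argc < 2)
			return 1;
		fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
		fh->handle_bytes = MAX_HANDLE_SZ;
		if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0) < 0) {
			perror("name_to_handle_at");
			return 1;
		}
		printf("handle: %u bytes, type %d\n", fh->handle_bytes,
		       fh->handle_type);
		return 0;
	}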



And, err, looking at the qemu code

  qidp->version = stbuf->st_mtime ^ (stbuf->st_size << 8);

so the qid version is treated as a "data version",
but on the kernel side we've treated it as an inode version (i_version, see
include/linux/iversion.h).

(v9fs_test_inode_dotl checks that the version is the same when comparing two
inodes), so it will incorrectly identify two identical inodes as
different.
That will cause problems...
Since you'll be faster than me, could you try keeping it at 0 there?

I see fscache also uses the qid version as 'auxiliary data', but I'm
not sure what this is used for -- if it's a data-version-like thing then
it will at least invalidate the cache content all the time.


Note there is also a data_version field in the protocol, in the response
to getattr, which the protocol side of 9p in Linux diligently fills in as
st_data_version, but we never use it that I can see.
This is probably what 9p meant to fill, and what fscache should rely on to
detect file changes, if that helps.
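Something along these lines, as a sketch of that idea (hypothetical and
untested; it mirrors the cookie acquisition in fs/9p/cache.c but swaps the
auxiliary datum):

	/* use the protocol's data_version instead of qid.version as the
	 * fscache coherency datum, so cached contents are only considered
	 * stale when the file data actually changed */
	__le64 version = cpu_to_le64(st->st_data_version);

	v9inode->fscache =
		fscache_acquire_cookie(v9fs_session_cache(v9ses), 0,
				       &path, sizeof(path),
				       &version, sizeof(version),
				       i_size_read(&v9inode->vfs_inode));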


I'm sorry I didn't see this sooner....

> > If you have some kind of reproducer of invalid filedescriptor or similar
> > errors I'd be happy to dig a bit more, I don't particularly like all
> > aspect of our cache model but it's not good if it corrupts things.
> 
> Maybe you can reproduce this with the root fs setup [4] described above? As I
> said, I immediately get errors when guest OS is booting. So I don't have to
> run something fancy to get errors with cache=loose & recent kernel.

Yes, this is much worse than I had first assumed when you first brought
it up; I'll definitely set some time aside to investigate.

-- 
Dominique

* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-03-30 21:47           ` asmadeus
@ 2022-04-01 14:19             ` Christian Schoenebeck
  2022-04-01 23:11               ` asmadeus
  2022-04-11  8:10             ` David Howells
  1 sibling, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-04-01 14:19 UTC (permalink / raw)
  To: asmadeus
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

On Wednesday, 30 March 2022 23:47:41 CEST asmadeus@codewreck.org wrote:
> Thanks Christian!
> 
> Christian Schoenebeck wrote on Wed, Mar 30, 2022 at 02:21:16PM +0200:
[...]
> > > Christian Schoenebeck wrote on Sat, Mar 26, 2022 at 01:36:31PM +0100:
> > > hm, fscache code shouldn't be used for cache=mmap, I'm surprised you can
> > > hit this...
> > 
> > I assume that you mean that 9p driver does not explicitly ask for fs-cache
> > being used for mmap. I see that 9p uses the kernel's generalized mmap
> > implementation:
> > 
> > https://github.com/torvalds/linux/blob/d888c83fcec75194a8a48ccd283953bdba7
> > b2550/fs/9p/vfs_file.c#L481
> > 
> > I haven't dived further into this, but the kernel has to use some kind of
> > filesystem cache anyway to provide the mmap functionality, so I guess it
> > makes sense that I got those warning messages from the FS-Cache
> > subsystem?
> It uses the generic mmap which has vfs caching, but definitely not
> fs-cache.
> fs-cache adds more hooks for cachefilesd (writing file contents to disk
> for bigger cache) and things like that cache=loose/mmap shouldn't be
> caring about. cache=loose actually just disables some key parts so I'm
> not surprised it shares bugs with the new code, but cache=mmap is really
> independant and I need to trace where these come from...

From looking at the sources, the call stack for emitting the "FS-Cache:
Duplicate cookie detected" error messages with the 9p "cache=mmap" option
seems to be:

1. v9fs_vfs_lookup [fs/9p/vfs_inode.c, 788]:

	inode = v9fs_get_new_inode_from_fid(v9ses, fid, dir->i_sb);

2. v9fs_get_new_inode_from_fid [fs/9p/v9fs.h, 228]:

	return v9fs_inode_from_fid_dotl(v9ses, fid, sb, 1);

3. v9fs_inode_from_fid_dotl [fs/9p/vfs_inode_dotl.c, 157]:

	inode = v9fs_qid_iget_dotl(sb, &st->qid, fid, st, new);

4. v9fs_qid_iget_dotl [fs/9p/vfs_inode_dotl.c, 133]:

	v9fs_cache_inode_get_cookie(inode);
	^--- Called here independently of the value of the "new" argument
   https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/fs/9p/vfs_inode_dotl.c#L133

5. v9fs_cache_inode_get_cookie [fs/9p/cache.c, 68]:

	v9inode->fscache =
		fscache_acquire_cookie(v9fs_session_cache(v9ses),
				       0,
				       &path, sizeof(path),
				       &version, sizeof(version),
				       i_size_read(&v9inode->vfs_inode));

6. fscache_acquire_cookie [include/linux/fscache.h, 251]:

	return __fscache_acquire_cookie(volume, advice,
					index_key, index_key_len,
					aux_data, aux_data_len,
					object_size);

7. __fscache_acquire_cookie [fs/fscache/cookie.c, 472]:

	if (!fscache_hash_cookie(cookie)) {
		fscache_see_cookie(cookie, fscache_cookie_discard);
		fscache_free_cookie(cookie);
		return NULL;
	}

8. fscache_hash_cookie [fs/fscache/cookie.c, 430]:

	pr_err("Duplicate cookie detected\n");

> > With QEMU >= 5.2 you should see the following QEMU warning with your
> > reproducer:
> > 
> > "
> > qemu-system-x86_64: warning: 9p: Multiple devices detected in same VirtFS
> > export, which might lead to file ID collisions and severe misbehaviours on
> > guest! You should either use a separate export for each device shared from
> > host or use virtfs option 'multidevs=remap'!
> > "
> 
> oh, I wasn't aware of the new option. Good job there!
> 
> It's the easiest way to reproduce but there are also harder to fix
> collisions, file systems only guarantee unicity for (fsid,inode
> number,version) which is usually bigger than 128 bits (although version
> is often 0), but version isn't exposed to userspace easily...
> What we'd want for unicity is handle from e.g. name_to_handle_at but
> that'd add overhead, wouldn't fit in qid path and not all fs are capable
> of providing one... The 9p protocol just doesn't want bigger handles
> than qid path.

No bigger qid.path on 9p protocol level in future? Why?

> And, err, looking at the qemu code
> 
>   qidp->version = stbuf->st_mtime ^ (stbuf->st_size << 8);
> 
> so the qid is treated as "data version",
> but on kernel side we've treated it as inode version (i_version, see
> include/linux/iversion.h)
> 
> (v9fs_test_inode_dotl checks the version is the same when comparing two
> inodes) so it will incorrectly identify two identical inodes as
> different.
> That will cause problems...
> Since you'll be faster than me could you try keeping it at 0 there?

I tried with your suggested change on QEMU side:

diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index a6d6b3f835..5d9be87758 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -981,7 +981,7 @@ static int stat_to_qid(V9fsPDU *pdu, const struct stat *stbuf, V9fsQID *qidp)
         memcpy(&qidp->path, &stbuf->st_ino, size);
     }
 
-    qidp->version = stbuf->st_mtime ^ (stbuf->st_size << 8);
+    qidp->version = 0;
     qidp->type = 0;
     if (S_ISDIR(stbuf->st_mode)) {
         qidp->type |= P9_QID_TYPE_DIR;

Unfortunately it did not make any difference for these 2 Linux kernel fs-cache
issues at least; still same errors, and same suboptimal performance.

Best regards,
Christian Schoenebeck



* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-04-01 14:19             ` Christian Schoenebeck
@ 2022-04-01 23:11               ` asmadeus
  2022-04-02 12:43                 ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: asmadeus @ 2022-04-01 23:11 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

Christian Schoenebeck wrote on Fri, Apr 01, 2022 at 04:19:20PM +0200:
> 4. v9fs_qid_iget_dotl [fs/9p/vfs_inode_dotl.c, 133]:
> 
> 	v9fs_cache_inode_get_cookie(inode);
> 	^--- Called independent of function argument "new"'s value here
>    https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9fe5282696/fs/9p/vfs_inode_dotl.c#L133


uh, I'd have assumed either this call or the function to check
v9ses->cache, but it doesn't look like either does...
There's a nice compile-time static inline empty definition if FSCACHE is
not compiled in, but there should -also- be a check at runtime based on
the session struct.
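Something like this at the call site, as a sketch of that missing runtime
check (hypothetical; cache mode constants as in fs/9p/v9fs.h):

	/* only acquire an fscache cookie on cache-enabled mounts */
	if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE)
		v9fs_cache_inode_get_cookie(inode);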

For your remark vs. the 'new' argument, it does depend on it:
 - new determines which check is used for iget5_locked.
In the 'new' case, v9fs_test_new_inode_dotl always returns 0 so we
always get a new inode.
 - if iget5_locked found an existing inode (i_state & I_NEW false) we
return it.
 - at this point we're allocating a new inode, so we should initialize
its cookie if it's on a fscache-enabled mount

So that part looks OK to me.

What isn't correct with qemu setting the qid version is the non-new case's
v9fs_test_inode_dotl: it'll consider the inode to be new if the version
changed, so it would recreate new, different inodes with the same inode
number/cookie. I was sure that was the problem, but it looks like
there's more to it from your reply below :(
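For reference, the check in question, paraphrased and abridged from
fs/9p/vfs_inode_dotl.c (may differ slightly from the exact tree):

	static int v9fs_test_inode_dotl(struct inode *inode, void *data)
	{
		struct v9fs_inode *v9inode = V9FS_I(inode);
		struct p9_stat_dotl *st = data;

		/* type, generation and qid.path checks elided ... */

		/* a changed qid.version makes an existing inode not match,
		 * so iget5_locked() allocates a fresh one */
		if (memcmp(&v9inode->qid.version, &st->qid.version,
			   sizeof(v9inode->qid.version)))
			return 0;
		return 1;
	}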


>> Well, at least that one is an easy fix: we just don't need this.
>> It's the easiest way to reproduce but there are also harder to fix
>> collisions, file systems only guarantee unicity for (fsid,inode
>> number,version) which is usually bigger than 128 bits (although version
>> is often 0), but version isn't exposed to userspace easily...
>> What we'd want for unicity is handle from e.g. name_to_handle_at but
>> that'd add overhead, wouldn't fit in qid path and not all fs are capable
>> of providing one... The 9p protocol just doesn't want bigger handles
>> than qid path.
> 
> No bigger qid.path on 9p protocol level in future? Why?

I have no idea about the "9p protocol" as a standard, nor who decides
how that evolves -- I guess if we wanted to we could probably make a
9p2022.L without coordinating much with others, but I have no plan to do
that... But if we do, I can probably add quite a few things to the
list of things that might need to change :)


That being said, this particular change of qid format is rather
annoying. 9p2000.L basically just added new message types, so dissectors
such as Wireshark can barge into the middle of the TCP flow and more or
less understand it; modifying a basic type like this would require either
catching the TVERSION message or making new message types for
everything that deals with qids (auth/attach, walk, lopen, lcreate,
mknod, getattr, readdir, mkdir... that's quite a few)

So I think we're better off with the status quo here.

>> And, err, looking at the qemu code
>> 
>>   qidp->version = stbuf->st_mtime ^ (stbuf->st_size << 8);
>> 
>> so the qid is treated as "data version",
>> but on kernel side we've treated it as inode version (i_version, see
>> include/linux/iversion.h)
>> 
>> (v9fs_test_inode_dotl checks the version is the same when comparing two
>> inodes) so it will incorrectly identify two identical inodes as
>> different.
>> That will cause problems...
>> Since you'll be faster than me could you try keeping it at 0 there?
> 
> I tried with your suggested change on QEMU side:
> 
> diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
> index a6d6b3f835..5d9be87758 100644
> --- a/hw/9pfs/9p.c
> +++ b/hw/9pfs/9p.c
> @@ -981,7 +981,7 @@ static int stat_to_qid(V9fsPDU *pdu, const struct stat *stbuf, V9fsQID *qidp)
>          memcpy(&qidp->path, &stbuf->st_ino, size);
>      }
>  
> -    qidp->version = stbuf->st_mtime ^ (stbuf->st_size << 8);
> +    qidp->version = 0;
>      qidp->type = 0;
>      if (S_ISDIR(stbuf->st_mode)) {
>          qidp->type |= P9_QID_TYPE_DIR;
> 
> Unfortunately it did not make any difference for these 2 Linux kernel fs-cache
> issues at least; still same errors, and same suboptimal performance.

Thanks, I'll give it a try today or tomorrow; adding some tracing when we
create inodes might give a clue...

-- 
Dominique

* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-04-01 23:11               ` asmadeus
@ 2022-04-02 12:43                 ` Christian Schoenebeck
  0 siblings, 0 replies; 48+ messages in thread
From: Christian Schoenebeck @ 2022-04-02 12:43 UTC (permalink / raw)
  To: asmadeus
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

On Saturday, 2 April 2022 01:11:04 CEST asmadeus@codewreck.org wrote:
> Christian Schoenebeck wrote on Fri, Apr 01, 2022 at 04:19:20PM +0200:
> > 4. v9fs_qid_iget_dotl [fs/9p/vfs_inode_dotl.c, 133]:
> > 	v9fs_cache_inode_get_cookie(inode);
> > 	^--- Called independent of function argument "new"'s value here
> > 	
> >    https://github.com/torvalds/linux/blob/e8b767f5e04097aaedcd6e06e2270f9f
> >    e5282696/fs/9p/vfs_inode_dotl.c#L133
> uh, I'd have assumed either this call or the function to check
> v9ses->cache, but it doesn't look like either do...
> There's a nice compile-time static inline empty definition if FSCACHE is
> not compiled in, but that should -also- be a check at runtime based on
> the session struct.
> 
> For your remark vs. the 'new' argument, it does depend on it:
>  - new determines which check is used for iget5_locked.
> In the 'new' case, v9fs_test_new_inode_dotl always returns 0 so we
> always get a new inode.
>  - if iget5_locked found an existing inode (i_state & I_NEW false) we
> return it.

Yes, I saw that. This part so far is correct.

The only minor issue I see here from a performance PoV: the test function
passed (always returning zero) unnecessarily causes find_inode() to iterate
over all inodes of the associated hash bucket:

https://github.com/torvalds/linux/blob/88e6c0207623874922712e162e25d9dafd39661e/fs/inode.c#L912

IMO it would make sense to introduce an official version of what's called an
"identically zero function" in shared code space and let 9p use that official
function instead. Then inode.c could simply compare the test function pointer
and not bother to iterate over the entire list in such a case.
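A sketch of that suggestion (hypothetical helper, not in mainline):

	/* Shared "never matches" test callback for iget5_locked() callers
	 * that always want a fresh inode; find_inode() could compare the
	 * test callback against this pointer and skip walking the hash
	 * bucket entirely. */
	int inode_test_always_fail(struct inode *inode, void *data)
	{
		return 0;
	}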

>  - at this point we're allocating a new inode, so we should initialize
> its cookie if it's on a fscache-enabled mount
> 
> So that part looks OK to me.

Mmm, but you agree that it also does that for cache=mmap right now, right?

> What isn't correct with qemu setting qid version is the non-new case's
> v9fs_test_inode_dotl, it'll consider the inode to be new if version
> changed so it would recreate new, different inodes with same inode
> number/cookie and I was sure that was the problem, but it looks like
> there's more to it from your reply below :(

Yes, it does not seem to be related, and I mean this part of the code has not
changed for 11 years. So if that was the cause, then old kernels would suffer
from the same issues, which does not seem to be the case.

I would not say though that QEMU is necessarily wrong in filling in mtime for
qid.version. The 9p spec just says:

"The version is a version number for a file; typically, it is incremented
every time the file is modified."

So yes, it does recommend sequential numbering, but OTOH does not require it.
And implementing that would be expensive, because the 9p server would need to
maintain its own version for every single file, whereas using the host
filesystem's mtime is cheap.

> >> Well, at least that one is an easy fix: we just don't need this.
> >> It's the easiest way to reproduce but there are also harder to fix
> >> collisions, file systems only guarantee unicity for (fsid,inode
> >> number,version) which is usually bigger than 128 bits (although version
> >> is often 0), but version isn't exposed to userspace easily...
> >> What we'd want for unicity is handle from e.g. name_to_handle_at but
> >> that'd add overhead, wouldn't fit in qid path and not all fs are capable
> >> of providing one... The 9p protocol just doesn't want bigger handles
> >> than qid path.
> > 
> > No bigger qid.path on 9p protocol level in future? Why?
> 
> I have no idea about the "9p protocol" as a standard, nor who decides
> how that evolves -- I guess if we wanted to we could probably make a
> 9p2022.L without concerting much around, but I have no plan to do
> that... But if we do, I can probably add quite a few things to the
> list of things that might need to change :)

Yes, I agree, 9p protocol changes are a long-term thing which I don't want to
hurry either. But I do think it makes sense to at least collect an informal
list of ideas/features/issues that should be addressed in the future, e.g. on
a wiki page. For now I am using the QEMU wiki for this:

https://wiki.qemu.org/Documentation/9p#Protocol_Plans

You can use that wiki page as well of course, or if somebody thinks there
would be a better place, no problem for me either.

> That being said, this particular change of qid format is rather
> annoying. 9p2000.L basically just added new message types, so dissectors
> such as wireshark could barge in the middle of the tcp flow and more or
> less understand; modifying a basic type like this would require to
> either catch the TVERSION message or make new message types for
> everything that deals wth qids (auth/attach, walk, lopen, lcreate,
> mknod, getattr, readdir, mkdir... that's quite a few)
> 
> So I think we're better off with the status quo here.

Well, in a future protocol revision I would try to merge the individual 9p
dialects as much as possible anyway and I would also release one document that
covers all supported messages instead of requiring people to read 4+
individual specs, which I don't find helpful. Most changes were just about
data types, which could also be covered in a spec by just naming a message
once, and listing the differences between individual 9p versions for that
particular message.

But again: no priority for me either ATM. There is still enough to do fixing
the implementations on the server and client side for the current 9p protocol
versions.

Best regards,
Christian Schoenebeck



* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-03-30 12:21         ` 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected) Christian Schoenebeck
  2022-03-30 21:47           ` asmadeus
@ 2022-04-09 11:16           ` Christian Schoenebeck
  2022-04-10 16:18             ` Christian Schoenebeck
  2022-04-11  7:59           ` 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected) David Howells
  2 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-04-09 11:16 UTC (permalink / raw)
  To: asmadeus
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

On Wednesday, 30 March 2022 14:21:16 CEST Christian Schoenebeck wrote:
> I made some tests & benchmarks regarding the fs-cache issue of 9p, running
> different kernel versions and kernel configurations in comparison.
[...]
> Case  Linux kernel version           .config  msize    cache  duration  host cpu  errors/warnings
>
> A)    5.17.0+[2] + msize patches[1]  debug    4186112  mmap   20m 40s   ~80%      none
> B)    5.17.0+[2] + msize patches[1]  debug    4186112  loose  31m 28s   ~35%      several errors (compilation completed)
> C)    5.17.0+[2] + msize patches[1]  debug    507904   mmap   20m 25s   ~84%      none
> D)    5.17.0+[2] + msize patches[1]  debug    507904   loose  31m 2s    ~33%      several errors (compilation completed)
> E)    5.17.0+[2]                     debug    512000   mmap   23m 45s   ~75%      none
> F)    5.17.0+[2]                     debug    512000   loose  32m 6s    ~31%      several errors (compilation completed)
> G)    5.17.0+[2]                     release  512000   mmap   23m 18s   ~76%      none
> H)    5.17.0+[2]                     release  512000   loose  32m 33s   ~31%      several errors (compilation completed)
> I)    5.17.0+[2] + msize patches[1]  release  4186112  mmap   20m 30s   ~83%      none
> J)    5.17.0+[2] + msize patches[1]  release  4186112  loose  31m 21s   ~31%      several errors (compilation completed)
> K)    5.10.84                        release  512000   mmap   39m 20s   ~80%      none
> L)    5.10.84                        release  512000   loose  13m 40s   ~55%      none
[...]
> About the errors: I actually already see errors with cache=loose and recent
> kernel version just when booting the guest OS. For these tests I chose some
> sources which allowed me to complete the build to capture some benchmark as
> well, I got some "soft" errors with those, but the build completed at least.
> I had other sources OTOH which did not complete though and aborted with
> certain invalid file descriptor errors, which I obviously could not use for
> those benchmarks here.

I used git-bisect to identify the commit that broke 9p behaviour, and it is
indeed this one:

commit eb497943fa215897f2f60fd28aa6fe52da27ca6c (HEAD, refs/bisect/bad)
Author: David Howells <dhowells@redhat.com>
Date:   Tue Nov 2 08:29:55 2021 +0000

    9p: Convert to using the netfs helper lib to do reads and caching
    
    Convert the 9p filesystem to use the netfs helper lib to handle readpage,
    readahead and write_begin, converting those into a common issue_op for the
    filesystem itself to handle.  The netfs helper lib also handles reading
    from fscache if a cache is available, and interleaving reads from both
    sources.
    
    This change also switches from the old fscache I/O API to the new one,
    meaning that fscache no longer keeps track of netfs pages and instead does
    async DIO between the backing files and the 9p file pagecache.  As a part
    of this change, the handling of PG_fscache changes.  It now just means that
    the cache has a write I/O operation in progress on a page (PG_locked
    is used for a read I/O op).
    
    Note that this is a cut-down version of the fscache rewrite and does not
    change any of the cookie and cache coherency handling.
    
    Changes
    =======
    ver #4:
      - Rebase on top of folios.
      - Don't use wait_on_page_bit_killable().
    
    ver #3:
      - v9fs_req_issue_op() needs to terminate the subrequest.
      - v9fs_write_end() needs to call SetPageUptodate() a bit more often.
      - It's not CONFIG_{AFS,V9FS}_FSCACHE[1]
      - v9fs_init_rreq() should take a ref on the p9_fid and the cleanup should
        drop it [from Dominique Martinet].
    
    Signed-off-by: David Howells <dhowells@redhat.com>
    Reviewed-and-tested-by: Dominique Martinet <asmadeus@codewreck.org>
    cc: v9fs-developer@lists.sourceforge.net
    cc: linux-cachefs@redhat.com
    Link: https://lore.kernel.org/r/YUm+xucHxED+1MJp@codewreck.org/ [1]
    Link: https://lore.kernel.org/r/163162772646.438332.16323773205855053535.stgit@warthog.procyon.org.uk/ # rfc
    Link: https://lore.kernel.org/r/163189109885.2509237.7153668924503399173.stgit@warthog.procyon.org.uk/ # rfc v2
    Link: https://lore.kernel.org/r/163363943896.1980952.1226527304649419689.stgit@warthog.procyon.org.uk/ # v3
    Link: https://lore.kernel.org/r/163551662876.1877519.14706391695553204156.stgit@warthog.procyon.org.uk/ # v4
    Link: https://lore.kernel.org/r/163584179557.4023316.11089762304657644342.stgit@warthog.procyon.org.uk # rebase on folio
    Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>

So Linux kernel v5.15 is fine, v5.16 is broken.

Best regards,
Christian Schoenebeck





* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-04-09 11:16           ` Christian Schoenebeck
@ 2022-04-10 16:18             ` Christian Schoenebeck
  2022-04-10 22:54               ` asmadeus
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-04-10 16:18 UTC (permalink / raw)
  To: asmadeus
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

On Saturday, 9 April 2022 13:16:11 CEST Christian Schoenebeck wrote:
> On Wednesday, 30 March 2022 14:21:16 CEST Christian Schoenebeck wrote:
> > I made some tests & benchmarks regarding the fs-cache issue of 9p, running
> > different kernel versions and kernel configurations in comparison.
> [...]
> > Case  Linux kernel version           .config  msize    cache  duration  host cpu  errors/warnings
> >
> > A)    5.17.0+[2] + msize patches[1]  debug    4186112  mmap   20m 40s   ~80%      none
> > B)    5.17.0+[2] + msize patches[1]  debug    4186112  loose  31m 28s   ~35%      several errors (compilation completed)
> > C)    5.17.0+[2] + msize patches[1]  debug    507904   mmap   20m 25s   ~84%      none
> > D)    5.17.0+[2] + msize patches[1]  debug    507904   loose  31m 2s    ~33%      several errors (compilation completed)
> > E)    5.17.0+[2]                     debug    512000   mmap   23m 45s   ~75%      none
> > F)    5.17.0+[2]                     debug    512000   loose  32m 6s    ~31%      several errors (compilation completed)
> > G)    5.17.0+[2]                     release  512000   mmap   23m 18s   ~76%      none
> > H)    5.17.0+[2]                     release  512000   loose  32m 33s   ~31%      several errors (compilation completed)
> > I)    5.17.0+[2] + msize patches[1]  release  4186112  mmap   20m 30s   ~83%      none
> > J)    5.17.0+[2] + msize patches[1]  release  4186112  loose  31m 21s   ~31%      several errors (compilation completed)
> > K)    5.10.84                        release  512000   mmap   39m 20s   ~80%      none
> > L)    5.10.84                        release  512000   loose  13m 40s   ~55%      none
> [...]
> > About the errors: I actually already see errors with cache=loose and a recent
> > kernel version just when booting the guest OS. For these tests I chose some
> > sources which allowed me to complete the build, so that I could capture some
> > benchmarks as well; I got some "soft" errors with those, but the build at
> > least completed. Other sources OTOH did not complete and aborted with
> > invalid file descriptor errors, so I obviously could not use them for the
> > benchmarks here.
> 
> I used git-bisect to identify the commit that broke 9p behaviour, and it is
> indeed this one:
> 
> commit eb497943fa215897f2f60fd28aa6fe52da27ca6c (HEAD, refs/bisect/bad)
> Author: David Howells <dhowells@redhat.com>
> Date:   Tue Nov 2 08:29:55 2021 +0000
> 
>     9p: Convert to using the netfs helper lib to do reads and caching
>     
>     Convert the 9p filesystem to use the netfs helper lib to handle readpage,
>     readahead and write_begin, converting those into a common issue_op for the
>     filesystem itself to handle.  The netfs helper lib also handles reading
>     from fscache if a cache is available, and interleaving reads from both
>     sources.

I looked into the errors I get, and as far as I can see, all the
misbehaviours boil down to "Bad file descriptor" (EBADF) errors as the
originating cause.

The easiest misbehaviours on the guest system for me to look into are errors
with the git client. For instance 'git fetch origin' fails this way:

...
write(3, "d16782889ee07005d1f57eb884f4a06b"..., 40) = 40
write(3, "\n", 1)                       = 1
close(3)                                = 0
access(".git/hooks/reference-transaction", X_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, ".git/logs/refs/remotes/origin/master", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 2326
lseek(7, -1467, SEEK_CUR)               = 859
read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 1467
close(7)                                = 0
write(3, "d8a68c5027ef629d93b9d9519ff4da95"..., 168) = -1 EBADF (Bad file descriptor)
...
error: cannot update the ref 'refs/remotes/origin/master': unable to append to '.git/logs/refs/remotes/origin/master': Bad file descriptor

I tried to manually replicate those file access operations on the
.git/logs/refs/remotes/origin/master file in question, and it worked. But when
I look at the strace output above, I see there is a close(3) call just before
the subsequent openat(".git/logs/refs/remotes/origin/master") call returning 3,
which makes me wonder: is this maybe a concurrency issue in file descriptor
management?
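
(Side note: the close(3)/openat(...) = 3 pattern is, by itself, expected
POSIX behaviour, since open() always returns the lowest free descriptor
number. A minimal user-space sketch, hypothetical and not taken from the
failing trace, just to illustrate that reuse of the number alone is normal:

---fdreuse.c----
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int a = open("/etc/hostname", O_RDONLY);  /* typically returns 3 */
	close(a);
	int b = open("/etc/hostname", O_RDONLY);  /* returns 3 again */
	printf("first fd %d, reopened fd %d\n", a, b);
	close(b);
	return 0;
}
---

So the suspicious part would be the state the client attaches to the
descriptor, not the reuse of the number itself.)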

Ideas anyone?

Best regards,
Christian Schoenebeck






* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-04-10 16:18             ` Christian Schoenebeck
@ 2022-04-10 22:54               ` asmadeus
  2022-04-11 13:41                 ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: asmadeus @ 2022-04-10 22:54 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

Thanks for keeping it up!

Christian Schoenebeck wrote on Sun, Apr 10, 2022 at 06:18:38PM +0200:
> > I used git-bisect to identify the commit that broke 9p behaviour, and it is
> > indeed this one:
> > 
> > commit eb497943fa215897f2f60fd28aa6fe52da27ca6c (HEAD, refs/bisect/bad)
> > Author: David Howells <dhowells@redhat.com>
> > Date:   Tue Nov 2 08:29:55 2021 +0000
> > 
> >     9p: Convert to using the netfs helper lib to do reads and caching

Yes, quite a few things changed with that.

> I looked into the errors I get, and as far as I can see, all the
> misbehaviours boil down to "Bad file descriptor" (EBADF) errors as the
> originating cause.
> 
> The easiest misbehaviours on the guest system for me to look into are errors
> with the git client. For instance 'git fetch origin' fails this way:

FWIW I didn't report it, but I did try to reproduce: on my machines (tried a
couple) booting a small alpine rootfs over 9p works, and I tried git
clone/git fetch on local repos of varying sizes (tmpfs in VM -> 9p mount of
tmpfs on host), to no avail.
Perhaps backing filesystem dependent? qemu version? virtfs access options?

It's all extremely slow though... like the final checkout counting files
at less than 10/s

> ...
> write(3, "d16782889ee07005d1f57eb884f4a06b"..., 40) = 40
> write(3, "\n", 1)                       = 1
> close(3)                                = 0
> access(".git/hooks/reference-transaction", X_OK) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, ".git/logs/refs/remotes/origin/master", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
> openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 7
> fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
> fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
> read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 2326
> lseek(7, -1467, SEEK_CUR)               = 859
> read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 1467
> close(7)                                = 0
> write(3, "d8a68c5027ef629d93b9d9519ff4da95"..., 168) = -1 EBADF (Bad file descriptor)
> ...
> error: cannot update the ref 'refs/remotes/origin/master': unable to append to '.git/logs/refs/remotes/origin/master': Bad file descriptor
> 
> I tried to manually replicate those file access operations on that
> .git/logs/refs/remotes/origin/master file in question, and it worked. But when
> I look at the strace output above, I see there is a close(3) call just before
> the subsequent openat(".git/logs/refs/remotes/origin/master") call returning 3,
> which makes me wonder, is this maybe a concurrency issue on file descriptor
> management?

hmm, in the cache=loose case write should just be updating the page cache
for buffers to be flushed later, so this is definitely weird.

If you can reproduce well enough for this, could you first confirm that
the EBADF comes from the client and not qemu? Either mounting with debug
or getting traces from qemu at the protocol level would show that.

If it's local, there are only so many places EBADF can come from, and it
should be possible to trace it back with e.g. perf probe or bpftrace; but
even if we confirm that e.g. the process' fd table is messed up, that won't
tell us why it got that way, so it's going to be annoying... I'd really
like to be able to reproduce this somehow :/

-- 
Dominique



* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-03-30 12:21         ` 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected) Christian Schoenebeck
  2022-03-30 21:47           ` asmadeus
  2022-04-09 11:16           ` Christian Schoenebeck
@ 2022-04-11  7:59           ` David Howells
  2 siblings, 0 replies; 48+ messages in thread
From: David Howells @ 2022-04-11  7:59 UTC (permalink / raw)
  To: asmadeus
  Cc: dhowells, Christian Schoenebeck, David Kahurani, davem, ericvh,
	kuba, linux-kernel, lucho, netdev, v9fs-developer, Greg Kurz

asmadeus@codewreck.org wrote:

> I see fscache also uses the qid version as 'auxiliary data', but I'm
> not sure what this is used for -- if it's a data-version-like thing then
> it will also at least invalidate the cache content all the time.

I should really have renamed "auxiliary data" to "coherency data".  It's used
by direct comparison when fscache binds a cookie to a backing cache object to
work out if the content of the backing object is still valid.

David



* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-03-30 21:47           ` asmadeus
  2022-04-01 14:19             ` Christian Schoenebeck
@ 2022-04-11  8:10             ` David Howells
  1 sibling, 0 replies; 48+ messages in thread
From: David Howells @ 2022-04-11  8:10 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: dhowells, asmadeus, David Kahurani, davem, ericvh, kuba,
	linux-kernel, lucho, netdev, v9fs-developer, Greg Kurz

Christian Schoenebeck <linux_oss@crudebyte.com> wrote:

> From looking at the sources, the call stack for emitting "FS-Cache: Duplicate
> cookie detected" error messages with 9p "cache=mmap" option seems to be:

You might find these tracepoints useful:

echo 1 >/sys/kernel/debug/tracing/events/fscache/fscache_cookie/enable
echo 1 >/sys/kernel/debug/tracing/events/fscache/fscache_acquire/enable
echo 1 >/sys/kernel/debug/tracing/events/fscache/fscache_relinquish/enable
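
The resulting events can then be read back from the trace buffer, assuming
the usual tracefs mount point:

cat /sys/kernel/debug/tracing/trace_pipe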

David



* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-04-10 22:54               ` asmadeus
@ 2022-04-11 13:41                 ` Christian Schoenebeck
  2022-04-12 22:38                   ` asmadeus
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-04-11 13:41 UTC (permalink / raw)
  To: asmadeus
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

On Monday, 11 April 2022 00:54:47 CEST asmadeus@codewreck.org wrote:
> Thanks for keeping it up!
> 
> Christian Schoenebeck wrote on Sun, Apr 10, 2022 at 06:18:38PM +0200:
> > > I used git-bisect to identify the commit that broke 9p behaviour, and
> > > it is indeed this one:
> > > 
> > > commit eb497943fa215897f2f60fd28aa6fe52da27ca6c (HEAD, refs/bisect/bad)
> > > Author: David Howells <dhowells@redhat.com>
> > > Date:   Tue Nov 2 08:29:55 2021 +0000
> > > 
> > >     9p: Convert to using the netfs helper lib to do reads and caching
> 
> Yes, quite a few things changed with that.
> 
> > I looked into the errors I get, and as far as I can see, all the
> > misbehaviours boil down to "Bad file descriptor" (EBADF) errors as the
> > originating cause.
> > 
> > The easiest misbehaviours on the guest system for me to look into are errors
> > with the git client. For instance 'git fetch origin' fails this way:
>
> FWIW I didn't report but did try to reproduce, on my machines (tried a
> couple) booting on a small alpine rootfs over 9p works, and I tried some
> git clone/git fetch of variying sizes of local repo (tmpfs in VM -> 9p
> mount of tmpfs on host) to no avail.

I'm getting more convinced that it's a bug on the Linux kernel side. When the
guest is booted I always immediately get a read("/var/log/wtmp") = EBADF error
on the guest. And the 9p command sequence sent to the QEMU 9p server was:

...
v9fs_clunk tag 0 id 120 fid 568
v9fs_walk tag 0 id 110 fid 1 newfid 568 nwnames 1
v9fs_rerror tag 0 id 110 err 2
v9fs_walk tag 0 id 110 fid 26 newfid 568 nwnames 1
v9fs_rerror tag 0 id 110 err 2
v9fs_readlink tag 0 id 22 fid 474
v9fs_readlink_return tag 0 id 22 name /run
v9fs_readlink tag 0 id 22 fid 474
v9fs_readlink_return tag 0 id 22 name /run
v9fs_readlink tag 0 id 22 fid 474
v9fs_readlink_return tag 0 id 22 name /run
v9fs_readlink tag 0 id 22 fid 474
v9fs_readlink_return tag 0 id 22 name /run
v9fs_walk tag 0 id 110 fid 633 newfid 568 nwnames 1
v9fs_rerror tag 0 id 110 err 2
v9fs_walk tag 0 id 110 fid 875 newfid 568 nwnames 0
v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
v9fs_open tag 0 id 12 fid 568 mode 32769
v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
v9fs_walk tag 0 id 110 fid 875 newfid 900 nwnames 0
v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
v9fs_open tag 0 id 12 fid 900 mode 2
v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
v9fs_lock tag 0 id 52 fid 568 type 1 start 0 length 0
v9fs_lock_return tag 0 id 52 status 0
v9fs_xattrwalk tag 0 id 30 fid 568 newfid 901 name security.capability
v9fs_rerror tag 0 id 30 err 95
v9fs_read tag 0 id 116 fid 568 off 192512 max_count 256

So the guest opens /var/log/wtmp with fid=568 mode=32769, which is write-only
mode, and then it tries to read that fid 568, which eventually causes the
read() call on the host to fail with EBADF. That makes sense: the file was
opened in write-only mode, so read() is not possible with that file
descriptor.
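
Decoding those numbers, assuming 9p2000.L Topen modes use the Linux open(2)
flag encoding (which is my reading of it): 32769 = 0x8001 =
O_WRONLY|O_LARGEFILE, while the later open with mode 2 is plain O_RDWR.
A tiny hypothetical check:

---decode.c----
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	/* 0100000 octal (32768) is O_LARGEFILE on x86-64 */
	printf("write-only open mode: %d\n", O_WRONLY | 0100000);  /* prints 32769 */
	printf("read-write open mode: %d\n", O_RDWR);              /* prints 2 */
	return 0;
}
---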

Other things I noticed when looking at the 9p command sequence above: there
is an earlier Twalk to fid 568, which is not clunked before fid 568 is reused
with the later Topen. And before that Twalk there is a Tclunk on fid 568,
although that fid apparently had not been used before.

> Perhaps backing filesystem dependent? qemu version? virtfs access options?

I tried with different hardware and different file systems (ext4, btrfs);
same misbehaviours.

QEMU is the latest git version. I also tried several different QEMU versions;
same thing.

QEMU command line used:

~/git/qemu/build/qemu-system-x86_64 \
-machine pc,accel=kvm,usb=off,dump-guest-core=off -m 16384 \
-smp 8,sockets=8,cores=1,threads=1 -rtc base=utc -boot strict=on \
-kernel ~/vm/bullseye/boot/vmlinuz \
-initrd ~/vm/bullseye/boot/initrd.img \
-append 'root=fsRoot rw rootfstype=9p rootflags=trans=virtio,version=9p2000.L,msize=4186112,cache=loose console=ttyS0' \
-fsdev local,security_model=mapped,multidevs=remap,id=fsdev-fs0,path=$HOME/vm/bullseye/ \
-device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=fsRoot \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-nographic

Important for reproducing this issue:

  * cache=loose
  * -smp N (with N>1)
  * Guest booted with Linux kernel containing commit eb497943fa21
    (uname >= 5.16)

I'm pretty sure that you can reproduce this issue with the QEMU 9p rootfs
setup HOWTO linked before.

> It's all extremely slow though... like the final checkout counting files
> at less than 10/s

It is VERY slow. And the weird thing is that cache=loose got much slower than
cache=mmap. My worst-case expectation would be for cache=loose to at least
not perform worse than cache=mmap.

Best regards,
Christian Schoenebeck




* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-04-11 13:41                 ` Christian Schoenebeck
@ 2022-04-12 22:38                   ` asmadeus
  2022-04-14 12:44                     ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: asmadeus @ 2022-04-12 22:38 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

Christian Schoenebeck wrote on Mon, Apr 11, 2022 at 03:41:45PM +0200:
> I'm getting more convinced that it's a bug on the Linux kernel side. When the
> guest is booted I always immediately get a read("/var/log/wtmp") = EBADF error
> on the guest. And the 9p command sequence sent to the QEMU 9p server was:

Yes, I'm not pointing fingers, just trying to understand :)

> 
> ...
> v9fs_clunk tag 0 id 120 fid 568
> v9fs_walk tag 0 id 110 fid 1 newfid 568 nwnames 1
> v9fs_rerror tag 0 id 110 err 2
> v9fs_walk tag 0 id 110 fid 26 newfid 568 nwnames 1
> v9fs_rerror tag 0 id 110 err 2
> v9fs_readlink tag 0 id 22 fid 474
> v9fs_readlink_return tag 0 id 22 name /run
> v9fs_readlink tag 0 id 22 fid 474
> v9fs_readlink_return tag 0 id 22 name /run
> v9fs_readlink tag 0 id 22 fid 474
> v9fs_readlink_return tag 0 id 22 name /run
> v9fs_readlink tag 0 id 22 fid 474
> v9fs_readlink_return tag 0 id 22 name /run
> v9fs_walk tag 0 id 110 fid 633 newfid 568 nwnames 1
> v9fs_rerror tag 0 id 110 err 2
> v9fs_walk tag 0 id 110 fid 875 newfid 568 nwnames 0
> v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
> v9fs_open tag 0 id 12 fid 568 mode 32769
> v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
> v9fs_walk tag 0 id 110 fid 875 newfid 900 nwnames 0
> v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
> v9fs_open tag 0 id 12 fid 900 mode 2
> v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
> v9fs_lock tag 0 id 52 fid 568 type 1 start 0 length 0
> v9fs_lock_return tag 0 id 52 status 0
> v9fs_xattrwalk tag 0 id 30 fid 568 newfid 901 name security.capability
> v9fs_rerror tag 0 id 30 err 95
> v9fs_read tag 0 id 116 fid 568 off 192512 max_count 256
> 
> So guest opens /var/log/wtmp with fid=568 mode=32769, which is write-only
> mode, and then it tries to read that fid 568, which eventually causes the
> read() call on host to error with EBADF. Which makes sense, as the file was
> opened in write-only mode, hence read() is not possible with that file
> descriptor.

Oh! That's something we can work on. The vfs code has different caches
for read-only and read-write fids; perhaps the new netfs code just used
the wrong one somewhere. I'll have a look.

> The other things I noticed when looking at the 9p command sequence above:
> there is a Twalk on fid 568 before, which is not clunked before reusing fid
> 568 with Topen later. And before that Twalk on fid 568 there is a Tclunk on
> fid 568, but apparently that fid was not used before.

This one though is just weird; I don't see where Linux would make up a fid to
clunk like this... Could messages be ordered a bit weirdly through
multithreading?
E.g. thread 1 opens, thread 2 clunks almost immediately afterwards, and they
would be printed the other way around?
It should still be serialized through the virtio ring buffer, so I don't
believe what I'm saying myself... It might be worth digging further as
well.

> > Perhaps backing filesystem dependent? qemu version? virtfs access options?
> 
> I tried with different hardware and different file systems (ext4, btrfs), same
> misbehaviours.
> 
> QEMU is latest git version. I also tried several different QEMU versions, same
> thing.
> 
> QEMU command line used:
> 
> ~/git/qemu/build/qemu-system-x86_64 \
> -machine pc,accel=kvm,usb=off,dump-guest-core=off -m 16384 \
> -smp 8,sockets=8,cores=1,threads=1 -rtc base=utc -boot strict=on \
> -kernel ~/vm/bullseye/boot/vmlinuz \
> -initrd ~/vm/bullseye/boot/initrd.img \
> -append 'root=fsRoot rw rootfstype=9p rootflags=trans=virtio,version=9p2000.L,msize=4186112,cache=loose console=ttyS0' \
> -fsdev local,security_model=mapped,multidevs=remap,id=fsdev-fs0,path=$HOME/vm/bullseye/ \
> -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=fsRoot \
> -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
> -nographic
> 
> Important for reproducing this issue:
> 
>   * cache=loose
>   * -smp N (with N>1)
>   * Guest booted with Linux kernel containing commit eb497943fa21
>     (uname >= 5.16)
> 
> I'm pretty sure that you can reproduce this issue with the QEMU 9p rootfs
> setup HOWTO linked before.

Yes, I'm not sure why I can't reproduce... All my computers are pretty
slow but the conditions should be met.
I'll try again with a command line closer to what you just gave here.


> > It's all extremely slow though... like the final checkout counting files
> > at less than 10/s
> 
> It is VERY slow. And the weird thing is that cache=loose got much slower than
> cache=mmap. My worst case expactation would be cache=loose at least not
> performing worse than cache=mmap.

Yes, some profiling is also in order, it didn't use to be that slow so
it must not be reusing previously open fids as it should have or
something..

-- 
Dominique


* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-04-12 22:38                   ` asmadeus
@ 2022-04-14 12:44                     ` Christian Schoenebeck
  2022-04-17 12:56                       ` asmadeus
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-04-14 12:44 UTC (permalink / raw)
  To: asmadeus
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

On Wednesday, 13 April 2022 00:38:21 CEST asmadeus@codewreck.org wrote:
> Christian Schoenebeck wrote on Mon, Apr 11, 2022 at 03:41:45PM +0200:
> > I'm getting more convinced that it's a bug on the Linux kernel side. When
> > the guest is booted I always immediately get a read("/var/log/wtmp") =
> > EBADF error on the guest. And the 9p command sequence sent to the QEMU 9p
> > server was:
> 
> Yes, I'm not pointing fingers, just trying to understand :)

Don't worry, that was not my impression, nor was it my intention either. I'm
just trying to interpret what I'm seeing here.

> > ...
> > v9fs_clunk tag 0 id 120 fid 568
> > v9fs_walk tag 0 id 110 fid 1 newfid 568 nwnames 1
> > v9fs_rerror tag 0 id 110 err 2
> > v9fs_walk tag 0 id 110 fid 26 newfid 568 nwnames 1
> > v9fs_rerror tag 0 id 110 err 2
> > v9fs_readlink tag 0 id 22 fid 474
> > v9fs_readlink_return tag 0 id 22 name /run
> > v9fs_readlink tag 0 id 22 fid 474
> > v9fs_readlink_return tag 0 id 22 name /run
> > v9fs_readlink tag 0 id 22 fid 474
> > v9fs_readlink_return tag 0 id 22 name /run
> > v9fs_readlink tag 0 id 22 fid 474
> > v9fs_readlink_return tag 0 id 22 name /run
> > v9fs_walk tag 0 id 110 fid 633 newfid 568 nwnames 1
> > v9fs_rerror tag 0 id 110 err 2
> > v9fs_walk tag 0 id 110 fid 875 newfid 568 nwnames 0
> > v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
> > v9fs_open tag 0 id 12 fid 568 mode 32769
> > v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
> > v9fs_walk tag 0 id 110 fid 875 newfid 900 nwnames 0
> > v9fs_walk_return tag 0 id 110 nwnames 0 qids (nil)
> > v9fs_open tag 0 id 12 fid 900 mode 2
> > v9fs_open_return tag 0 id 12 qid={type 0 version 0 path 820297} iounit 507904
> > v9fs_lock tag 0 id 52 fid 568 type 1 start 0 length 0
> > v9fs_lock_return tag 0 id 52 status 0
> > v9fs_xattrwalk tag 0 id 30 fid 568 newfid 901 name security.capability
> > v9fs_rerror tag 0 id 30 err 95
> > v9fs_read tag 0 id 116 fid 568 off 192512 max_count 256
> > 
> > So the guest opens /var/log/wtmp with fid=568 mode=32769, which is
> > write-only mode, and then it tries to read that fid 568, which eventually
> > causes the read() call on the host to fail with EBADF. That makes sense:
> > the file was opened in write-only mode, so read() is not possible with
> > that file descriptor.
> 
> Oh! That's something we can work on. The vfs code has different caches
> for read-only and read-write fids; perhaps the new netfs code just used
> the wrong one somewhere. I'll have a look.
> 
> > Other things I noticed when looking at the 9p command sequence above:
> > there is an earlier Twalk to fid 568, which is not clunked before fid 568
> > is reused with the later Topen. And before that Twalk there is a Tclunk
> > on fid 568, although that fid apparently had not been used before.
> 
> This one though is just weird; I don't see where Linux would make up a fid
> to clunk like this... Could messages be ordered a bit weirdly through
> multithreading?
> E.g. thread 1 opens, thread 2 clunks almost immediately afterwards, and
> they would be printed the other way around?

Yeah, something like that was also my guess.

> Should still be serialized through the virtio ring buffer so I don't
> believe what I'm saying myself... It might be worth digging further as
> well.
> 
> > > Perhaps backing filesystem dependent? qemu version? virtfs access
> > > options?
> > 
> > I tried with different hardware and different file systems (ext4, btrfs),
> > same misbehaviours.
> > 
> > QEMU is latest git version. I also tried several different QEMU versions,
> > same thing.
> > 
> > QEMU command line used:
> > 
> > ~/git/qemu/build/qemu-system-x86_64 \
> > -machine pc,accel=kvm,usb=off,dump-guest-core=off -m 16384 \
> > -smp 8,sockets=8,cores=1,threads=1 -rtc base=utc -boot strict=on \
> > -kernel ~/vm/bullseye/boot/vmlinuz \
> > -initrd ~/vm/bullseye/boot/initrd.img \
> > -append 'root=fsRoot rw rootfstype=9p rootflags=trans=virtio,version=9p2000.L,msize=4186112,cache=loose console=ttyS0' \
> > -fsdev local,security_model=mapped,multidevs=remap,id=fsdev-fs0,path=$HOME/vm/bullseye/ \
> > -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=fsRoot \
> > -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
> > -nographic
> > 
> > Important for reproducing this issue:
> >   * cache=loose
> >   * -smp N (with N>1)
> >   * Guest booted with Linux kernel containing commit eb497943fa21
> >     (uname >= 5.16)
> > 
> > I'm pretty sure that you can reproduce this issue with the QEMU 9p rootfs
> > setup HOWTO linked before.
> 
> Yes, I'm not sure why I can't reproduce... All my computers are pretty
> slow but the conditions should be met.
> I'll try again with a command line closer to what you just gave here.

I'm not surprised that you could not reproduce the EBADF errors yet. To make
this clearer, as for the git client errors: I have some 200+ git repositories
checked out on that test VM, and only about 5 of them trigger EBADF errors
on 'git pull'. But those few repositories reproduce the EBADF errors reliably
here.

In other words: these EBADF errors only seem to trigger under certain
circumstances, so it takes quite a bunch of test material to get a
reproducer.

Like I said though, with the Bullseye installation I immediately get EBADF 
errors already when booting, whereas with a Buster VM it boots without errors.

> > > It's all extremely slow though... like the final checkout counting files
> > > at less than 10/s
> > 
> > It is VERY slow. And the weird thing is that cache=loose got much slower
> > than cache=mmap. My worst-case expectation would be for cache=loose to at
> > least not perform worse than cache=mmap.
> 
> Yes, some profiling is also in order, it didn't use to be that slow so
> it must not be reusing previously open fids as it should have or
> something..

If somebody has more ideas on what I can try/test, let me know. However, ATM
I won't be able to review the netfs and vfs code to actually find the cause
of these issues.

Best regards,
Christian Schoenebeck




* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-04-14 12:44                     ` Christian Schoenebeck
@ 2022-04-17 12:56                       ` asmadeus
  2022-04-17 13:52                         ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: asmadeus @ 2022-04-17 12:56 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

Christian Schoenebeck wrote on Thu, Apr 14, 2022 at 02:44:53PM +0200:
> > Yes, I'm not sure why I can't reproduce... All my computers are pretty
> > slow but the conditions should be met.
> > I'll try again with a command line closer to what you just gave here.
> 
> I'm not surprised that you could not reproduce the EBADF errors yet. To make 
> this more clear, as for the git client errors: I have like 200+ git 
> repositories checked out on that test VM, and only about 5 of them trigger 
> EBADF errors on 'git pull'. But those few repositories reproduce the EBADF 
> errors reliably here.
> 
> In other words: these EBADF errors only seem to trigger under certain 
> circumstances, so it requires quite a bunch of test material to get a 
> reproducer.
> 
> Like I said though, with the Bullseye installation I immediately get EBADF 
> errors already when booting, whereas with a Buster VM it boots without errors.

Okay, I had missed that!

I've managed to reproduce with git:
https://gaia.codewreck.org/local/tmp/c.tar.zst

This archive (~300KB) when decompressed is a ~150MB repo where git reset
produces EBADF reliably for me.

From the looks of it, the write fails in v9fs_write_begin, which itself
fails because it tries to read first on a file that was opened with
O_WRONLY|O_CREAT|O_APPEND.
Since this is an append, the read is necessary to populate the local page
cache before writing, and we're careful that the writeback fid is opened for
writing, but not for reading...

Will have to think about how we might want to handle this; perhaps just
giving the writeback fid read rights all the time as well...
Ran out of time for tonight, but hopefully we can sort it out soonish now!

> If somebody has some more ideas what I can try/test, let me know. However ATM 
> I won't be able to review the netfs and vfs code to actually find the cause of 
> these issues.

You've been of great help already, thanks!

-- 
Dominique


* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-04-17 12:56                       ` asmadeus
@ 2022-04-17 13:52                         ` Christian Schoenebeck
  2022-04-17 21:22                           ` asmadeus
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-04-17 13:52 UTC (permalink / raw)
  To: asmadeus
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

On Sunday, 17 April 2022 14:56:22 CEST asmadeus@codewreck.org wrote:
> Christian Schoenebeck wrote on Thu, Apr 14, 2022 at 02:44:53PM +0200:
> > > Yes, I'm not sure why I can't reproduce... All my computers are pretty
> > > slow but the conditions should be met.
> > > I'll try again with a command line closer to what you just gave here.
> > 
> > I'm not surprised that you could not reproduce the EBADF errors yet. To
> > make this more clear, as for the git client errors: I have like 200+ git
> > repositories checked out on that test VM, and only about 5 of them
> > trigger EBADF errors on 'git pull'. But those few repositories reproduce
> > the EBADF errors reliably here.
> > 
> > In other words: these EBADF errors only seem to trigger under certain
> > circumstances, so it requires quite a bunch of test material to get a
> > reproducer.
> > 
> > Like I said though, with the Bullseye installation I immediately get EBADF
> > errors already when booting, whereas with a Buster VM it boots without
> > errors.
> Okay, I had missed that!
> 
> I've managed to reproduce with git:
> https://gaia.codewreck.org/local/tmp/c.tar.zst
> 
> This archive (~300KB) when decompressed is a ~150MB repo where git reset
> produces EBADF reliably for me.

I'm glad you were able to reproduce these EBADF errors!

> From the looks of it, write fails in v9fs_write_begin, which itself
> fails because it tries to read first on a file that was open with
> O_WRONLY|O_CREAT|O_APPEND.
> Since this is an append the read is necessary to populate the local page
> cache when writing, and we're careful that the writeback fid is open in
> write, but not about read...
> 
> Will have to think how we might want to handle this; perhaps just giving
> the writeback fid read rights all the time as well...
> Ran out of time for tonight, but hopefully we can sort it out soonish now!

I fear that would just trade symptoms: there are use cases for write-only
permissions, which would then fail after such a simple change.

Independent of this EBADF issue, it would be good to know why 9p performance
got so slow with cache=loose after the netfs changes. Maybe David has an idea?

Best regards,
Christian Schoenebeck




* Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)
  2022-04-17 13:52                         ` Christian Schoenebeck
@ 2022-04-17 21:22                           ` asmadeus
  2022-04-17 22:17                             ` 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)) asmadeus
  2022-04-21 10:36                             ` David Howells
  0 siblings, 2 replies; 48+ messages in thread
From: asmadeus @ 2022-04-17 21:22 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, David Howells, Greg Kurz

Christian Schoenebeck wrote on Sun, Apr 17, 2022 at 03:52:43PM +0200:
> > From the looks of it, write fails in v9fs_write_begin, which itself
> > fails because it tries to read first on a file that was open with
> > O_WRONLY|O_CREAT|O_APPEND.
> > Since this is an append the read is necessary to populate the local page
> > cache when writing, and we're careful that the writeback fid is open in
> > write, but not about read...
> > 
> > Will have to think how we might want to handle this; perhaps just giving
> > the writeback fid read rights all the time as well...
> > Ran out of time for tonight, but hopefully we can sort it out soonish now!
> 
> I fear that would just trade symptoms: There are use cases for write-only 
> permissions, which would then fail after such kind of simple change.

The writeback fid is only used by async ops to flush (and, apparently since
5.10ish, populate) the cache; I actually wonder how that "populate the
cache" worked before!
Anyway, since it's not used by direct operations, I believe we can mess
with its open mode, but that assumes permission checks are properly done
at the vfs level (this is pretty much the line of thinking that allowed
dirty cow... But in this case, if a file is opened read-only, the
writeback fid isn't allocated afaik, so it's probably ok?...)

Alternatively, we could have the cache issue yet another open for read
when needed, but I think a single RW fid is probably good enough if we
might read from it (no TRUNC)...
It'd break opening the writeback fid on files with -w- permission if the
open is not done as root, but I don't see how we could make appending to a
write-only file work at an offset that is not a page boundary either way
(the rest of the page has to be read in before it can be marked uptodate).

David, netfs doesn't allow cache at byte granularity, correct?

If it does we could fix the problem by only triggering a read when
really needed.



> Independent of this EBADF issue, it would be good to know why 9p performance 
> got so slow with cache=loose by the netfs changes. Maybe David has an idea?

Yes, I've just compared the behaviour of the old cache and the new one (with
cache=loose), and the main differences in behaviour I can see are that the
time until flush is longer on the older version and that reads have recently
become bigger with the new version; the rest is all identical as far as I
can see (4k IOs for writes, walk/open/clunk sequences to read a cached file
(we could delay these until reading into the cache or a metadata op is
required?), TFSYNC after a series of writes or on directories after a
while...), so I don't see a real difference.

In particular I don't observe any cache invalidation when the mtime (and
so the qid 'version', i.e. the cache ancillary data) changes, but for
cache=loose that's how I'd expect it to work as well.


Perhaps the performance difference can be explained just by how aggressively
it's flushed out of memory: since it's written to disk faster, it'd also be
easier to forget about, leading to re-issued slow reads?
hmm... I need to spend more time on that as well...

-- 
Dominique




* 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-04-17 21:22                           ` asmadeus
@ 2022-04-17 22:17                             ` asmadeus
  2022-04-21 10:36                             ` David Howells
  1 sibling, 0 replies; 48+ messages in thread
From: asmadeus @ 2022-04-17 22:17 UTC (permalink / raw)
  To: David Howells, Christian Schoenebeck
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, Greg Kurz

(fixed the subject again and promoted David Howells to To, please read
the previous couple of mails when you have time)

asmadeus@codewreck.org wrote on Mon, Apr 18, 2022 at 06:22:26AM +0900:
> Christian Schoenebeck wrote on Sun, Apr 17, 2022 at 03:52:43PM +0200:
> > > From the looks of it, write fails in v9fs_write_begin, which itself
> > > fails because it tries to read first on a file that was open with
> > > O_WRONLY|O_CREAT|O_APPEND.
> > > Since this is an append the read is necessary to populate the local page
> > > cache when writing, and we're careful that the writeback fid is open in
> > > write, but not about read...

BTW, now that this is understood, here's a much simpler reproducer:

---append.c----
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
	if (argc < 2)
		return 1;
	/* write-only + append: the client cannot read this file back */
	int fd = open(argv[1], O_WRONLY|O_APPEND);
	if (fd < 0)
		return 1;
	/* a sub-page append: the client must read the page in first */
	if (write(fd, "test\n", 5) < 0)
		return 1;
	return 0;
}
---

---
echo foo > foo
echo 3 > /proc/sys/vm/drop_caches
strace ./append foo
...
openat(AT_FDCWD, "foo", O_WRONLY|O_APPEND) = 3
write(3, "test\n", 5)                   = -1 EBADF (Bad file descriptor)
---

At the 9p client level (note the TLOPEN of fid 5 with mode 33793, i.e.
O_WRONLY|O_APPEND|O_LARGEFILE in the Linux flag encoding, followed by a
TREAD on that same write-only fid 5, which fails with RLERROR -9, i.e.
EBADF):
----
9pnet: (00000460) >>> TWALK fids 1,2 nwname 1d wname[0] t
9pnet: (00000460) >>> size=20 type: 110 tag: 0
9pnet: (00000460) <<< size=22 type: 111 tag: 0
9pnet: (00000460) <<< RWALK nwqid 1:
9pnet: (00000460) <<<     [0] 0.6e672b.6289a895
9pnet: (00000460) >>> TGETATTR fid 2, request_mask 6143
9pnet: (00000460) >>> size=19 type: 24 tag: 0
9pnet: (00000460) <<< size=160 type: 25 tag: 0
9pnet: (00000460) <<< RGETATTR st_result_mask=6143
<<< qid=0.6e672b.6289a895
<<< st_mode=000081ed st_nlink=1
<<< st_uid=1000 st_gid=100
<<< st_rdev=0 st_size=d538 st_blksize=126976 st_blocks=112
<<< st_atime_sec=1650233493 st_atime_nsec=697920121
<<< st_mtime_sec=1650233493 st_mtime_nsec=19911120
<<< st_ctime_sec=1650233493 st_ctime_nsec=19911120
<<< st_btime_sec=0 st_btime_nsec=0
<<< st_gen=0 st_data_version=0
9pnet: (00000460) >>> TWALK fids 2,3 nwname 0d wname[0] (null)
9pnet: (00000460) >>> size=17 type: 110 tag: 0
9pnet: (00000460) <<< size=9 type: 111 tag: 0
9pnet: (00000460) <<< RWALK nwqid 0:
9pnet: (00000460) >>> TLOPEN fid 3 mode 32768
9pnet: (00000460) >>> size=15 type: 12 tag: 0
9pnet: (00000460) <<< size=24 type: 13 tag: 0
9pnet: (00000460) <<< RLOPEN qid 0.6e672b.6289a895 iounit 1f000
9pnet: (00000460) >>> TREAD fid 3 offset 0 8192
9pnet: (00000460) >>> size=23 type: 116 tag: 0
9pnet: (00000460) <<< size=8203 type: 117 tag: 0
9pnet: (00000460) <<< RREAD count 8192
9pnet: (00000460) >>> TREAD fid 3 offset 8192 16384
9pnet: (00000460) >>> size=23 type: 116 tag: 0
9pnet: (00000460) <<< size=16395 type: 117 tag: 0
9pnet: (00000460) <<< RREAD count 16384
9pnet: (00000460) >>> TXATTRWALK file_fid 2, attr_fid 4 name security.capability
9pnet: (00000460) >>> size=36 type: 30 tag: 0
9pnet: (00000460) <<< size=11 type: 7 tag: 0
9pnet: (00000460) <<< RLERROR (-95)
9pnet: (00000460) >>> TREAD fid 3 offset 24576 30008
9pnet: (00000460) >>> size=23 type: 116 tag: 0
9pnet: (00000460) <<< size=30019 type: 117 tag: 0
9pnet: (00000460) <<< RREAD count 30008
9pnet: (00000460) >>> TWALK fids 1,4 nwname 1d wname[0] foo
9pnet: (00000460) >>> size=22 type: 110 tag: 0
9pnet: (00000460) <<< size=22 type: 111 tag: 0
9pnet: (00000460) <<< RWALK nwqid 1:
9pnet: (00000460) <<<     [0] 0.6e66f9.625c86a5
9pnet: (00000460) >>> TGETATTR fid 4, request_mask 6143
9pnet: (00000460) >>> size=19 type: 24 tag: 0
9pnet: (00000460) <<< size=160 type: 25 tag: 0
9pnet: (00000460) <<< RGETATTR st_result_mask=6143
<<< qid=0.6e66f9.625c86a5
<<< st_mode=000081a4 st_nlink=1
<<< st_uid=0 st_gid=0
<<< st_rdev=0 st_size=9 st_blksize=126976 st_blocks=8
<<< st_atime_sec=1650233249 st_atime_nsec=226674419
<<< st_mtime_sec=1650233253 st_mtime_nsec=226727529
<<< st_ctime_sec=1650233253 st_ctime_nsec=226727529
<<< st_btime_sec=0 st_btime_nsec=0
<<< st_gen=0 st_data_version=0
9pnet: (00000460) >>> TWALK fids 4,5 nwname 0d wname[0] (null)
9pnet: (00000460) >>> size=17 type: 110 tag: 0
9pnet: (00000460) <<< size=9 type: 111 tag: 0
9pnet: (00000460) <<< RWALK nwqid 0:
9pnet: (00000460) >>> TLOPEN fid 5 mode 33793
9pnet: (00000460) >>> size=15 type: 12 tag: 0
9pnet: (00000460) <<< size=24 type: 13 tag: 0
9pnet: (00000460) <<< RLOPEN qid 0.6e66f9.625c86a5 iounit 1f000
9pnet: (00000460) >>> TWALK fids 4,6 nwname 0d wname[0] (null)
9pnet: (00000460) >>> size=17 type: 110 tag: 0
9pnet: (00000460) <<< size=9 type: 111 tag: 0
9pnet: (00000460) <<< RWALK nwqid 0:
9pnet: (00000460) >>> TLOPEN fid 6 mode 2
9pnet: (00000460) >>> size=15 type: 12 tag: 0
9pnet: (00000460) <<< size=24 type: 13 tag: 0
9pnet: (00000460) <<< RLOPEN qid 0.6e66f9.625c86a5 iounit 1f000
9pnet: (00000460) >>> TXATTRWALK file_fid 4, attr_fid 7 name security.capability
9pnet: (00000460) >>> size=36 type: 30 tag: 0
9pnet: (00000460) <<< size=11 type: 7 tag: 0
9pnet: (00000460) <<< RLERROR (-95)
9pnet: (00000460) >>> TREAD fid 5 offset 0 9
9pnet: (00000460) >>> size=23 type: 116 tag: 0
9pnet: (00000460) <<< size=11 type: 7 tag: 0
9pnet: (00000460) <<< RLERROR (-9)
9pnet: (00000460) >>> TCLUNK fid 5 (try 0)
9pnet: (00000460) >>> size=11 type: 120 tag: 0
9pnet: (00000460) <<< size=7 type: 121 tag: 0
9pnet: (00000460) <<< RCLUNK fid 5
9pnet: (00000460) >>> TCLUNK fid 3 (try 0)
9pnet: (00000460) >>> size=11 type: 120 tag: 0
9pnet: (00000460) <<< size=7 type: 121 tag: 0
9pnet: (00000460) <<< RCLUNK fid 3
-------

-- 
Dominique


* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-04-17 21:22                           ` asmadeus
  2022-04-17 22:17                             ` 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)) asmadeus
@ 2022-04-21 10:36                             ` David Howells
  2022-04-21 11:36                               ` Christian Schoenebeck
  1 sibling, 1 reply; 48+ messages in thread
From: David Howells @ 2022-04-21 10:36 UTC (permalink / raw)
  To: asmadeus
  Cc: dhowells, Christian Schoenebeck, David Kahurani, davem, ericvh,
	kuba, linux-kernel, lucho, netdev, v9fs-developer, Greg Kurz

asmadeus@codewreck.org wrote:

> 	int fd = open(argv[1], O_WRONLY|O_APPEND);
> 	if (fd < 0)
> 		return 1;
> 	if (write(fd, "test\n", 5) < 0)

I think I need to implement the ability to store writes in non-uptodate pages
without needing to read from the server as NFS does.  This may fix the
performance drop also.

David



* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-04-21 10:36                             ` David Howells
@ 2022-04-21 11:36                               ` Christian Schoenebeck
  2022-04-22 13:13                                 ` asmadeus
  2022-04-25 14:10                                 ` David Howells
  0 siblings, 2 replies; 48+ messages in thread
From: Christian Schoenebeck @ 2022-04-21 11:36 UTC (permalink / raw)
  To: asmadeus, David Howells
  Cc: dhowells, David Kahurani, davem, ericvh, kuba, linux-kernel,
	lucho, netdev, v9fs-developer, Greg Kurz

On Thursday, 21 April 2022 12:36:12 CEST David Howells wrote:
> asmadeus@codewreck.org wrote:
> > 	int fd = open(argv[1], O_WRONLY|O_APPEND);
> > 	if (fd < 0)
> > 	
> > 		return 1;
> > 	
> > 	if (write(fd, "test\n", 5) < 0)
> 
> I think I need to implement the ability to store writes in non-uptodate
> pages without needing to read from the server as NFS does.  This may fix
> the performance drop also.
> 
> David

I hope this does not sound harsh, but wouldn't it make sense to revert
eb497943fa215897f2f60fd28aa6fe52da27ca6c for now, until those issues are
sorted out? My concern is that it might take a long time to address them,
and these are not minor issues.

Best regards,
Christian Schoenebeck




* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-04-21 11:36                               ` Christian Schoenebeck
@ 2022-04-22 13:13                                 ` asmadeus
  2022-04-25 14:10                                 ` David Howells
  1 sibling, 0 replies; 48+ messages in thread
From: asmadeus @ 2022-04-22 13:13 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: David Howells, David Kahurani, davem, ericvh, kuba, linux-kernel,
	lucho, netdev, v9fs-developer, Greg Kurz

Christian Schoenebeck wrote on Thu, Apr 21, 2022 at 01:36:14PM +0200:
> I hope this does not sound harsh, wouldn't it make sense to revert 
> eb497943fa215897f2f60fd28aa6fe52da27ca6c for now until those issues are sorted 
> out? My concern is that it might take a long time to address them, and these 
> are not minor issues.

I'm not sure that's possible at all; the related old fscache code has
been ripped out since, and just reverting won't work.

I'm also curious why that behaviour changed, though; I don't think the
old code had any special handling of partially written pages either...
Understanding that might give a key to a small quick fix.


It is quite a bad bug though, and I really wish I could give it the
attention it deserves. Early next month has a few holidays here; hopefully
I'll be able to look at it more closely then :/

-- 
Dominique


* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-04-21 11:36                               ` Christian Schoenebeck
  2022-04-22 13:13                                 ` asmadeus
@ 2022-04-25 14:10                                 ` David Howells
  2022-04-26 15:38                                   ` Christian Schoenebeck
  1 sibling, 1 reply; 48+ messages in thread
From: David Howells @ 2022-04-25 14:10 UTC (permalink / raw)
  To: asmadeus
  Cc: dhowells, Christian Schoenebeck, David Kahurani, davem, ericvh,
	kuba, linux-kernel, lucho, netdev, v9fs-developer, Greg Kurz

There may be a quick and dirty workaround.  I think the problem is that unless
the O_APPEND write starts at the beginning of a page, netfs is going to
enforce a read.  Does the attached patch fix the problem?  (note that it's
untested)

Also, can you get the contents of /proc/fs/fscache/stats from after
reproducing the problem?

David
---
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 501128188343..5f61fdb950b0 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -291,16 +291,25 @@ static int v9fs_write_end(struct file *filp, struct address_space *mapping,
 	struct folio *folio = page_folio(subpage);
 	struct inode *inode = mapping->host;
 	struct v9fs_inode *v9inode = V9FS_I(inode);
+	size_t fsize = folio_size(folio);
+	size_t offset = pos & (fsize - 1);
+	/* With multipage folio support, we may be given len > fsize */
+	size_t copy_size = min_t(size_t, len, fsize - offset);
 
 	p9_debug(P9_DEBUG_VFS, "filp %p, mapping %p\n", filp, mapping);
 
 	if (!folio_test_uptodate(folio)) {
-		if (unlikely(copied < len)) {
+		if (unlikely(copied < copy_size)) {
 			copied = 0;
 			goto out;
 		}
-
-		folio_mark_uptodate(folio);
+		if (offset == 0) {
+			if (copied == fsize)
+				folio_mark_uptodate(folio);
+			/* Could clear to end of page if last_pos == new EOF
+			 * and then mark uptodate
+			 */
+		}
 	}
 
 	/*
diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 281a88a5b8dc..78439f628c23 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -364,6 +364,12 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 	if (folio_test_uptodate(folio))
 		goto have_folio;
 
+	if (!netfs_is_cache_enabled(ctx) &&
+	    (file->f_flags & (O_APPEND | O_ACCMODE)) == (O_APPEND | O_WRONLY)) {
+		netfs_stat(&netfs_n_rh_write_append);
+		goto have_folio_no_wait;
+	}
+
 	/* If the page is beyond the EOF, we want to clear it - unless it's
 	 * within the cache granule containing the EOF, in which case we need
 	 * to preload the granule.
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index b7b0e3d18d9e..a1cd649197dc 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -67,6 +67,7 @@ extern atomic_t netfs_n_rh_read_failed;
 extern atomic_t netfs_n_rh_zero;
 extern atomic_t netfs_n_rh_short_read;
 extern atomic_t netfs_n_rh_write;
+extern atomic_t netfs_n_rh_write_append;
 extern atomic_t netfs_n_rh_write_begin;
 extern atomic_t netfs_n_rh_write_done;
 extern atomic_t netfs_n_rh_write_failed;
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index 5510a7a14a40..fce87f86f950 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -23,6 +23,7 @@ atomic_t netfs_n_rh_read_failed;
 atomic_t netfs_n_rh_zero;
 atomic_t netfs_n_rh_short_read;
 atomic_t netfs_n_rh_write;
+atomic_t netfs_n_rh_write_append;
 atomic_t netfs_n_rh_write_begin;
 atomic_t netfs_n_rh_write_done;
 atomic_t netfs_n_rh_write_failed;
@@ -37,10 +38,11 @@ void netfs_stats_show(struct seq_file *m)
 		   atomic_read(&netfs_n_rh_write_zskip),
 		   atomic_read(&netfs_n_rh_rreq),
 		   atomic_read(&netfs_n_rh_sreq));
-	seq_printf(m, "RdHelp : ZR=%u sh=%u sk=%u\n",
+	seq_printf(m, "RdHelp : ZR=%u sh=%u sk=%u wa=%u\n",
 		   atomic_read(&netfs_n_rh_zero),
 		   atomic_read(&netfs_n_rh_short_read),
-		   atomic_read(&netfs_n_rh_write_zskip));
+		   atomic_read(&netfs_n_rh_write_zskip),
+		   atomic_read(&netfs_n_rh_write_append));
 	seq_printf(m, "RdHelp : DL=%u ds=%u df=%u di=%u\n",
 		   atomic_read(&netfs_n_rh_download),
 		   atomic_read(&netfs_n_rh_download_done),



* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-04-25 14:10                                 ` David Howells
@ 2022-04-26 15:38                                   ` Christian Schoenebeck
  2022-05-03 10:21                                     ` asmadeus
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-04-26 15:38 UTC (permalink / raw)
  To: David Howells
  Cc: asmadeus, David Kahurani, davem, ericvh, kuba, linux-kernel,
	lucho, netdev, v9fs-developer, Greg Kurz

On Monday, 25 April 2022 16:10:16 CEST David Howells wrote:
> There may be a quick and dirty workaround.  I think the problem is that
> unless the O_APPEND write starts at the beginning of a page, netfs is going
> to enforce a read.  Does the attached patch fix the problem?  (note that
> it's untested)

Patch doesn't apply for me on master:

checking file fs/9p/vfs_addr.c
Hunk #1 FAILED at 291.
1 out of 1 hunk FAILED
checking file fs/netfs/buffered_read.c
Hunk #1 FAILED at 364.
1 out of 1 hunk FAILED
checking file fs/netfs/internal.h
checking file fs/netfs/stats.c
Hunk #2 FAILED at 38.
1 out of 2 hunks FAILED

commit d615b5416f8a1afeb82d13b238f8152c572d59c0 (HEAD -> master, origin/master, origin/HEAD)
Merge: 0fc74d820a01 4d8ec9120819
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Mon Apr 25 10:53:56 2022 -0700

What was it based on?

> Also, can you get the contents of /proc/fs/fscache/stats from after
> reproducing the problem?

FS-Cache statistics
Cookies: n=684 v=1 vcol=0 voom=0
Acquire: n=689 ok=689 oom=0
LRU    : n=0 exp=0 rmv=0 drp=0 at=0
Invals : n=0
Updates: n=2095 rsz=0 rsn=0
Relinqs: n=5 rtr=0 drop=5
NoSpace: nwr=0 ncr=0 cull=0
IO     : rd=0 wr=0
RdHelp : RA=974 RP=0 WB=13323 WBZ=2072 rr=0 sr=0
RdHelp : ZR=13854 sh=0 sk=2072
RdHelp : DL=14297 ds=14297 df=13322 di=0
RdHelp : RD=0 rs=0 rf=0
RdHelp : WR=0 ws=0 wf=0

Best regards,
Christian Schoenebeck




* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-04-26 15:38                                   ` Christian Schoenebeck
@ 2022-05-03 10:21                                     ` asmadeus
  2022-05-04 18:33                                       ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: asmadeus @ 2022-05-03 10:21 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: David Howells, David Kahurani, davem, ericvh, kuba, linux-kernel,
	lucho, netdev, v9fs-developer, Greg Kurz

Sorry for the delay.

Christian Schoenebeck wrote on Tue, Apr 26, 2022 at 05:38:30PM +0200:
> On Montag, 25. April 2022 16:10:16 CEST David Howells wrote:
> > There may be a quick and dirty workaround.  I think the problem is that
> > unless the O_APPEND read starts at the beginning of a page, netfs is going
> > to enforce a read.  Does the attached patch fix the problem?  (note that
> > it's untested)

It might work for this particular case (O_APPEND), but what about an
arbitrary pwrite or seek+write in the middle of a file?
e.g.

$ dd if=/dev/zero of=test bs=1M count=1
$ chmod 400 test
# drop cache or remount
$ dd if=/dev/urandom of=test bs=102 seek=2 count=1 conv=notrunc
dd: error writing 'test': Bad file descriptor
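
(bs=102 seek=2 puts the write at offset 204, inside the first page but
not at its start, so the client has to read the page back to fill it
before the partial write; that implied read on a write-only fid is what
produces the EBADF.)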


Silly question: how does that work on ceph or AFS? Does the readback
callback always work regardless of permissions?

Basically I think we really only have two choices there:
 - make the readback call work regardless of open mode, e.g. make it use
the writeback fid if it wasn't, and make that writeback_fid usable for
everything

Now I'm looking, v9fs_writeback_fid() calls
v9fs_fid_lookup_with_uid(GLOBAL_ROOT_UID) and opens with O_RDWR, so it
should be a root fid we can read regardless of file perms!

The more I think about it, the more I think that's the way to go and
probably how it used to work; I'll look into why this isn't working
(main fid used, or writeback fid not root)


 - add some complex code to track the exact byte range that got updated
in some conditions e.g. WRONLY or read fails?
That'd still be useful depending on how the backend tracks file mode,
qemu as user with security_model=mapped-file keeps files 600 but with
passthrough or none qemu wouldn't be able to read the file regardless of
what we do on client...
Christian, if you still have an old kernel around did that use to work?


> Patch doesn't apply for me on master:

It applies on fscache-next
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-next

But on that branch with the patch (works fine without) I get another
problem just writing normally:
[   94.327094] ------------[ cut here ]------------
[   94.327809] WARNING: CPU: 0 PID: 93 at mm/page-writeback.c:2498 __folio_mark_dirty+0x397/0x510
[   94.329191] Modules linked in:
[   94.329491] CPU: 0 PID: 93 Comm: cat Not tainted 5.18.0-rc1+ #56
[   94.330195] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[   94.331709] RIP: 0010:__folio_mark_dirty+0x397/0x510
[   94.332312] Code: 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 14 01 00 00 44 8b 7b 5c 44 89 3c 24 4c 89 fd 49 63 d7 e9 4d fe ff ff <0f> 0b e9 c0 fc ff f0
[   94.335341] RSP: 0018:ffffc90000257ad0 EFLAGS: 00010046
[   94.336031] RAX: 4000000000000009 RBX: ffffea0001ffb080 RCX: ffffffff815144cc
[   94.336937] RDX: 1ffffd40003ff610 RSI: 0000000000000008 RDI: ffffea0001ffb080
[   94.337749] RBP: ffff8880056c4488 R08: 0000000000000000 R09: ffffea0001ffb087
[   94.338612] R10: fffff940003ff610 R11: 0000000000000001 R12: 0000000000000246
[   94.339551] R13: ffff8880056c4490 R14: 0000000000000001 R15: 0000000000000068
[   94.340487] FS:  00007f18dbc1eb80(0000) GS:ffff88806ca00000(0000) knlGS:0000000000000000
[   94.341558] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   94.342369] CR2: 00007f18dbbfd000 CR3: 000000000b5b4000 CR4: 00000000000006b0
[   94.343613] Call Trace:
[   94.343856]  <TASK>
[   94.344052]  filemap_dirty_folio+0x73/0xc0
[   94.344646]  v9fs_write_end+0x18f/0x300
[   94.345195]  generic_perform_write+0x2bd/0x4a0
[   94.345834]  ? __bpf_trace_file_check_and_advance_wb_err+0x10/0x10
[   94.346807]  ? discard_new_inode+0x100/0x100
[   94.347398]  ? generic_write_checks+0x1e8/0x360
[   94.347926]  __generic_file_write_iter+0x247/0x3d0
[   94.348420]  generic_file_write_iter+0xbe/0x1d0
[   94.348885]  new_sync_write+0x2f0/0x540
[   94.349250]  ? new_sync_read+0x530/0x530
[   94.349634]  vfs_write+0x517/0x7b0
[   94.349939]  ksys_write+0xed/0x1c0
[   94.350318]  ? __ia32_sys_read+0xb0/0xb0
[   94.350817]  do_syscall_64+0x43/0x90
[   94.351257]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   94.351955] RIP: 0033:0x7f18dbe0eea3
[   94.352438] Code: 54 ff ff 48 83 c4 58 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 75
[   94.355597] RSP: 002b:00007fffdf4661d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   94.356520] RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007f18dbe0eea3
[   94.357392] RDX: 0000000000000068 RSI: 00007f18dbbfd000 RDI: 0000000000000001
[   94.358287] RBP: 00007f18dbbfd000 R08: 00007f18dbbfc010 R09: 0000000000000000
[   94.359318] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000001
[   94.360349] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000020000
[   94.361295]  </TASK>
[   94.361462] ---[ end trace 0000000000000000 ]---

I got it with cat, but dd with bs >= 2 also reproduces it; the second
write fails with EBADF:

110   openat(AT_FDCWD, "bar", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
110   dup2(3, 1)                        = 1
110   close(3)                          = 0
110   execve("/run/current-system/sw/bin/cat", ["cat"], 0x12e1010 /* 10 vars */) = 0
110   read(0, "[   94.327094] ------------[ cut"..., 131072) = 52
110   write(1, "[   94.327094] ------------[ cut"..., 52) = 52
110   read(0, "[   94.327809] WARNING: CPU: 0 P"..., 131072) = 98
110   write(1, "[   94.327809] WARNING: CPU: 0 P"..., 98) = -1 EBADF (Bad file descriptor)

I'm sure that could be fixed, but as said above I don't think it's the
right approach.

> > Also, can you get the contents of /proc/fs/fscache/stats from after
> > reproducing the problem?
> 
> FS-Cache statistics

(He probably wanted to confirm the new trace he added got hit with the
workaround pattern, I didn't get that far as I couldn't compile my
reproducer on that fs...)

-- 
Dominique

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-05-03 10:21                                     ` asmadeus
@ 2022-05-04 18:33                                       ` Christian Schoenebeck
  2022-05-04 21:48                                         ` asmadeus
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-05-04 18:33 UTC (permalink / raw)
  To: asmadeus
  Cc: David Howells, David Kahurani, davem, ericvh, kuba, linux-kernel,
	lucho, netdev, v9fs-developer, Greg Kurz

On Dienstag, 3. Mai 2022 12:21:23 CEST asmadeus@codewreck.org wrote:
[...]
>  - add some complex code to track the exact byte range that got updated
> in some conditions e.g. WRONLY or read fails?
> That'd still be useful depending on how the backend tracks file mode,
> qemu as user with security_model=mapped-file keeps files 600 but with
> passthrough or none qemu wouldn't be able to read the file regardless of
> what we do on client...
> Christian, if you still have an old kernel around did that use to work?

Sorry, what was the question, i.e. what should I test / look for precisely? :)

[...]
> > > Also, can you get the contents of /proc/fs/fscache/stats from after
> > > reproducing the problem?
> > 
> > FS-Cache statistics
> 
> (He probably wanted to confirm the new trace he added got hit with the
> workaround pattern, I didn't get that far as I couldn't compile my
> reproducer on that fs...)

Yeah, I got that. But since his patch did not apply, I just dumped what I got 
so far in case the existing stats might be useful anyway.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-05-04 18:33                                       ` Christian Schoenebeck
@ 2022-05-04 21:48                                         ` asmadeus
  2022-05-06 19:14                                           ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: asmadeus @ 2022-05-04 21:48 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: David Howells, David Kahurani, davem, ericvh, kuba, linux-kernel,
	lucho, netdev, v9fs-developer, Greg Kurz

Christian Schoenebeck wrote on Wed, May 04, 2022 at 08:33:36PM +0200:
> On Dienstag, 3. Mai 2022 12:21:23 CEST asmadeus@codewreck.org wrote:
> >  - add some complex code to track the exact byte range that got updated
> > in some conditions e.g. WRONLY or read fails?
> > That'd still be useful depending on how the backend tracks file mode,
> > qemu as user with security_model=mapped-file keeps files 600 but with
> > passthrough or none qemu wouldn't be able to read the file regardless of
> > what we do on client...
> > Christian, if you still have an old kernel around did that use to work?
> 
> Sorry, what was the question, i.e. what should I test / look for precisely? :)

I was curious if the older kernel did not issue the read at all, or issued
the read on the writeback fid correctly opened as root/RDWR

You can try either the append.c I pasted a few mails back or the dd
commands, as regular user.

$ dd if=/dev/zero of=test bs=1M count=1
$ chmod 400 test
# drop cache or remount
$ dd if=/dev/urandom of=test bs=102 seek=2 count=1 conv=notrunc
dd: error writing 'test': Bad file descriptor
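
For reference, a minimal stand-in for that reproducer could look like
this (a sketch only, not the original append.c, which isn't reposted in
this thread):

  /* append.c stand-in: append to a file we may only open write-only */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      const char *msg = "appended\n";
      int fd;

      if (argc < 2) {
          fprintf(stderr, "usage: %s <file>\n", argv[0]);
          return 1;
      }
      fd = open(argv[1], O_WRONLY | O_APPEND);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      /* on an affected kernel the client first reads the partial page
       * back on this write-only fid, and the write fails with EBADF */
      if (write(fd, msg, strlen(msg)) < 0)
          perror("write");
      return close(fd);
  }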

... But honestly I should just find the time to do it myself, this has
been dragging on for too long...
-- 
Dominique

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-05-04 21:48                                         ` asmadeus
@ 2022-05-06 19:14                                           ` Christian Schoenebeck
  2022-06-03 16:46                                             ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-05-06 19:14 UTC (permalink / raw)
  To: asmadeus
  Cc: David Howells, David Kahurani, davem, ericvh, kuba, linux-kernel,
	lucho, netdev, v9fs-developer, Greg Kurz

On Mittwoch, 4. Mai 2022 23:48:47 CEST asmadeus@codewreck.org wrote:
> Christian Schoenebeck wrote on Wed, May 04, 2022 at 08:33:36PM +0200:
> > On Dienstag, 3. Mai 2022 12:21:23 CEST asmadeus@codewreck.org wrote:
> > >  - add some complex code to track the exact byte range that got updated
> > > in some conditions e.g. WRONLY or read fails?
> > > That'd still be useful depending on how the backend tracks file mode,
> > > qemu as user with security_model=mapped-file keeps files 600 but with
> > > passthrough or none qemu wouldn't be able to read the file regardless of
> > > what we do on client...
> > > Christian, if you still have an old kernel around did that use to work?
> > 
> > Sorry, what was the question, i.e. what should I test / look for
> > precisely? :)
> I was curious if the older kernel did not issue the read at all, or issued
> the read on the writeback fid correctly opened as root/RDWR
> 
> You can try either the append.c I pasted a few mails back or the dd
> commands, as regular user.
> 
> $ dd if=/dev/zero of=test bs=1M count=1
> $ chmod 400 test
> # drop cache or remount
> $ dd if=/dev/urandom of=test bs=102 seek=2 count=1 conv=notrunc
> dd: error writing 'test': Bad file descriptor

Seems you were right, the old kernel opens the file with O_RDWR.

The following was taken with cache=loose, pre-netfs kernel version, using your
append code and file to be appended already containing 34 bytes, relevant file is fid 7:

  v9fs_open tag 0 id 12 fid 7 mode 2
  v9fs_open_return tag 0 id 12 qid={type 0 version 1651854932 path 3108899} iounit 4096
  v9fs_xattrwalk tag 0 id 30 fid 5 newfid 8 name security.capability
  v9fs_rerror tag 0 id 30 err 95
  v9fs_read tag 0 id 116 fid 7 off 0 max_count 4096
  v9fs_read_return tag 0 id 116 count 34 err 45
  v9fs_read tag 0 id 116 fid 7 off 34 max_count 4062
  v9fs_read_return tag 0 id 116 count 0 err 11
  v9fs_clunk tag 0 id 120 fid 6
  v9fs_clunk tag 0 id 120 fid 4
  [delay]
  v9fs_write tag 0 id 118 fid 7 off 0 count 39 cnt 1
  v9fs_write_return tag 0 id 118 total 39 err 11
  v9fs_fsync tag 0 id 50 fid 7 datasync 0

BTW to see this protocol debug output with QEMU:

  cd qemu/build
  ../configure --enable-trace-backends=log ...
  make -jN
  ./qemu-system-x86_64 -trace 'v9fs*' ...

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-05-06 19:14                                           ` Christian Schoenebeck
@ 2022-06-03 16:46                                             ` Christian Schoenebeck
  2022-06-12 10:02                                               ` asmadeus
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-06-03 16:46 UTC (permalink / raw)
  To: asmadeus
  Cc: David Howells, David Kahurani, davem, ericvh, kuba, linux-kernel,
	lucho, netdev, v9fs-developer, Greg Kurz

On Freitag, 6. Mai 2022 21:14:52 CEST Christian Schoenebeck wrote:
> On Mittwoch, 4. Mai 2022 23:48:47 CEST asmadeus@codewreck.org wrote:
> > Christian Schoenebeck wrote on Wed, May 04, 2022 at 08:33:36PM +0200:
> > > On Dienstag, 3. Mai 2022 12:21:23 CEST asmadeus@codewreck.org wrote:
> > > >  - add some complex code to track the exact byte range that got updated
> > > > in some conditions e.g. WRONLY or read fails?
> > > > That'd still be useful depending on how the backend tracks file mode,
> > > > qemu as user with security_model=mapped-file keeps files 600 but with
> > > > passthrough or none qemu wouldn't be able to read the file regardless
> > > > of what we do on client...
> > > > Christian, if you still have an old kernel around did that use to
> > > > work?
> > > 
> > > Sorry, what was the question, i.e. what should I test / look for
> > > precisely? :)
> > 
> > I was curious if the older kernel did not issue the read at all, or issued
> > the read on the writeback fid correctly opened as root/RDWR
> > 
> > You can try either the append.c I pasted a few mails back or the dd
> > commands, as regular user.
> > 
> > $ dd if=/dev/zero of=test bs=1M count=1
> > $ chmod 400 test
> > # drop cache or remount
> > $ dd if=/dev/urandom of=test bs=102 seek=2 count=1 conv=notrunc
> > dd: error writing 'test': Bad file descriptor
> 
> Seems you were right, the old kernel opens the file with O_RDWR.
> 
> The following was taken with cache=loose, pre-netfs kernel version, using
> your append code and file to be appended already containing 34 bytes,
> relevant file is fid 7:
> 
>   v9fs_open tag 0 id 12 fid 7 mode 2
>   v9fs_open_return tag 0 id 12 qid={type 0 version 1651854932 path 3108899} iounit 4096
>   v9fs_xattrwalk tag 0 id 30 fid 5 newfid 8 name security.capability
>   v9fs_rerror tag 0 id 30 err 95
>   v9fs_read tag 0 id 116 fid 7 off 0 max_count 4096
>   v9fs_read_return tag 0 id 116 count 34 err 45
>   v9fs_read tag 0 id 116 fid 7 off 34 max_count 4062
>   v9fs_read_return tag 0 id 116 count 0 err 11
>   v9fs_clunk tag 0 id 120 fid 6
>   v9fs_clunk tag 0 id 120 fid 4
>   [delay]
>   v9fs_write tag 0 id 118 fid 7 off 0 count 39 cnt 1
>   v9fs_write_return tag 0 id 118 total 39 err 11
>   v9fs_fsync tag 0 id 50 fid 7 datasync 0
> 
> BTW to see this protocol debug output with QEMU:
> 
>   cd qemu/build
>   ../configure --enable-trace-backends=log ...
>   make -jN
>   ./qemu-system-x86_64 -trace 'v9fs*' ...

I had another time slice on this issue today. As Dominique pointed out before,
the writeback_fid was and still is opened with O_RDWR [fs/9p/fid.c]:

struct p9_fid *v9fs_writeback_fid(struct dentry *dentry)
{
	int err;
	struct p9_fid *fid, *ofid;

	ofid = v9fs_fid_lookup_with_uid(dentry, GLOBAL_ROOT_UID, 0);
	fid = clone_fid(ofid);
	if (IS_ERR(fid))
		goto error_out;
	p9_client_clunk(ofid);
	/*
	 * writeback fid will only be used to write back the
	 * dirty pages. We always request for the open fid in read-write
	 * mode so that a partial page write which result in page
	 * read can work.
	 */
	err = p9_client_open(fid, O_RDWR);
	if (err < 0) {
		p9_client_clunk(fid);
		fid = ERR_PTR(err);
		goto error_out;
	}
error_out:
	return fid;
}

The problem rather seems to be that the new netfs code does not use the
writeback_fid when doing an implied read before the actual partial writeback.

As I showed in my previous email, the old pre-netfs kernel versions also did a
read before partial writebacks, but apparently used the special writeback_fid
for that.

I added some trap code to a recent netfs kernel version:

diff --git a/net/9p/client.c b/net/9p/client.c
index 8bba0d9cf975..11ff1ee2130e 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1549,12 +1549,21 @@ int p9_client_unlinkat(struct p9_fid *dfid, const char *name, int flags)
 }
 EXPORT_SYMBOL(p9_client_unlinkat);
 
+void p9_bug(void) {
+    BUG_ON(true);
+}
+EXPORT_SYMBOL(p9_bug);
+
 int
 p9_client_read(struct p9_fid *fid, u64 offset, struct iov_iter *to, int *err)
 {
        int total = 0;
        *err = 0;
 
+    if ((fid->mode & O_ACCMODE) == O_WRONLY) {
+        p9_bug();
+    }
+
        while (iov_iter_count(to)) {
                int count;
 
@@ -1648,6 +1657,10 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
        p9_debug(P9_DEBUG_9P, ">>> TWRITE fid %d offset %llu count %zd\n",
                 fid->fid, offset, iov_iter_count(from));
 
+    if ((fid->mode & O_ACCMODE) == O_RDONLY) {
+        p9_bug();
+    }
+
        while (iov_iter_count(from)) {
                int count = iov_iter_count(from);
                int rsize = fid->iounit;

Which triggers the trap in p9_client_read() with cache=loose. Here is the
backtrace [based on d615b5416f8a1afeb82d13b238f8152c572d59c0]:

[  139.365314] p9_client_read (net/9p/client.c:1553 net/9p/client.c:1564) 9pnet
[  139.148806] v9fs_issue_read (fs/9p/vfs_addr.c:45) 9p
[  139.149268] netfs_begin_read (fs/netfs/io.c:91 fs/netfs/io.c:579 fs/netfs/io.c:625) netfs
[  139.149725] ? xas_load (lib/xarray.c:211 lib/xarray.c:242) 
[  139.150057] ? xa_load (lib/xarray.c:1469) 
[  139.150398] netfs_write_begin (fs/netfs/buffered_read.c:407) netfs
[  139.150883] v9fs_write_begin (fs/9p/vfs_addr.c:279 (discriminator 2)) 9p
[  139.151293] generic_perform_write (mm/filemap.c:3789) 
[  139.151721] ? generic_update_time (fs/inode.c:1858) 
[  139.152112] ? file_update_time (fs/inode.c:2089) 
[  139.152504] __generic_file_write_iter (mm/filemap.c:3916) 
[  139.152943] generic_file_write_iter (./include/linux/fs.h:753 mm/filemap.c:3948) 
[  139.153348] new_sync_write (fs/read_write.c:505 (discriminator 1)) 
[  139.153754] vfs_write (fs/read_write.c:591) 
[  139.154090] ksys_write (fs/read_write.c:644) 
[  139.154417] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) 
[  139.154776] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:115)

I still have not had time to read the netfs code part yet, but I assume netfs
falls back to a generic 9p read on the O_WRONLY opened fid here, instead of
using the special O_RDWR opened 'writeback_fid'.

Is there already some info available in the netfs API indicating that the
read is actually part of a writeback task, so that we could force the 9p
driver to use the special writeback_fid for the read in this case instead?

Best regards,
Christian Schoenebeck





^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
  2022-06-03 16:46                                             ` Christian Schoenebeck
@ 2022-06-12 10:02                                               ` asmadeus
  2022-06-14  3:38                                                 ` [PATCH] 9p: fix EBADF errors in cached mode Dominique Martinet
  0 siblings, 1 reply; 48+ messages in thread
From: asmadeus @ 2022-06-12 10:02 UTC (permalink / raw)
  To: David Howells, Christian Schoenebeck
  Cc: David Kahurani, davem, ericvh, kuba, linux-kernel, lucho, netdev,
	v9fs-developer, Greg Kurz


Sorry, I had planned on working on this today but the other patchset
ended up taking all my time... I think I'm bad at priorities, this
is definitely important...

David, I think with the latest comments we made it should be relatively
straightforward to make netfs use the writeback fid? Could you find some
time to have a look? It should be trivial to reproduce, I gave these
commands a few mails ago (needs to run as a regular user, on a fscache mount)
---
$ dd if=/dev/zero of=test bs=1M count=1
$ chmod 200 test
# drop cache or remount
$ dd if=/dev/urandom of=test bs=102 seek=2 count=1 conv=notrunc
dd: error writing 'test': Bad file descriptor
---

Otherwise I'll try to make some more 9p time again, but it's getting
more and more difficult for me...

Christian Schoenebeck wrote on Fri, Jun 03, 2022 at 06:46:04PM +0200:
> I had another time slice on this issue today. As Dominique pointed out before,
> the writeback_fid was and still is opened with O_RDWR [fs/9p/fid.c]:
> 
> struct p9_fid *v9fs_writeback_fid(struct dentry *dentry)
> {
> 	int err;
> 	struct p9_fid *fid, *ofid;
> 
> 	ofid = v9fs_fid_lookup_with_uid(dentry, GLOBAL_ROOT_UID, 0);
> 	fid = clone_fid(ofid);
> 	if (IS_ERR(fid))
> 		goto error_out;
> 	p9_client_clunk(ofid);
> 	/*
> 	 * writeback fid will only be used to write back the
> 	 * dirty pages. We always request for the open fid in read-write
> 	 * mode so that a partial page write which result in page
> 	 * read can work.
> 	 */
> 	err = p9_client_open(fid, O_RDWR);
> 	if (err < 0) {
> 		p9_client_clunk(fid);
> 		fid = ERR_PTR(err);
> 		goto error_out;
> 	}
> error_out:
> 	return fid;
> }
> 
> The problem rather seems to be that the new netfs code does not use the
> writeback_fid when doing an implied read before the actual partial writeback.
> 
> As I showed in my previous email, the old pre-netfs kernel versions also did a
> read before partial writebacks, but apparently used the special writeback_fid
> for that.

This looks good! Thanks for keeping it up.

> 
> I added some trap code to a recent netfs kernel version:
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 8bba0d9cf975..11ff1ee2130e 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -1549,12 +1549,21 @@ int p9_client_unlinkat(struct p9_fid *dfid, const char *name, int flags)
>  }
>  EXPORT_SYMBOL(p9_client_unlinkat);
>  
> +void p9_bug(void) {
> +    BUG_ON(true);
> +}
> +EXPORT_SYMBOL(p9_bug);
> +
>  int
>  p9_client_read(struct p9_fid *fid, u64 offset, struct iov_iter *to, int *err)
>  {
>         int total = 0;
>         *err = 0;
>  
> +    if ((fid->mode & O_ACCMODE) == O_WRONLY) {
> +        p9_bug();
> +    }
> +
>         while (iov_iter_count(to)) {
>                 int count;
>  
> @@ -1648,6 +1657,10 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
>         p9_debug(P9_DEBUG_9P, ">>> TWRITE fid %d offset %llu count %zd\n",
>                  fid->fid, offset, iov_iter_count(from));
>  
> +    if ((fid->mode & O_ACCMODE) == O_RDONLY) {
> +        p9_bug();
> +    }
> +
>         while (iov_iter_count(from)) {
>                 int count = iov_iter_count(from);
>                 int rsize = fid->iounit;
> 
> Which triggers the trap in p9_client_read() with cache=loose. Here is the
> backtrace [based on d615b5416f8a1afeb82d13b238f8152c572d59c0]:
> 
> [  139.365314] p9_client_read (net/9p/client.c:1553 net/9p/client.c:1564) 9pnet
> [  139.148806] v9fs_issue_read (fs/9p/vfs_addr.c:45) 9p
> [  139.149268] netfs_begin_read (fs/netfs/io.c:91 fs/netfs/io.c:579 fs/netfs/io.c:625) netfs
> [  139.149725] ? xas_load (lib/xarray.c:211 lib/xarray.c:242) 
> [  139.150057] ? xa_load (lib/xarray.c:1469) 
> [  139.150398] netfs_write_begin (fs/netfs/buffered_read.c:407) netfs
> [  139.150883] v9fs_write_begin (fs/9p/vfs_addr.c:279 (discriminator 2)) 9p
> [  139.151293] generic_perform_write (mm/filemap.c:3789) 
> [  139.151721] ? generic_update_time (fs/inode.c:1858) 
> [  139.152112] ? file_update_time (fs/inode.c:2089) 
> [  139.152504] __generic_file_write_iter (mm/filemap.c:3916) 
> [  139.152943] generic_file_write_iter (./include/linux/fs.h:753 mm/filemap.c:3948) 
> [  139.153348] new_sync_write (fs/read_write.c:505 (discriminator 1)) 
> [  139.153754] vfs_write (fs/read_write.c:591) 
> [  139.154090] ksys_write (fs/read_write.c:644) 
> [  139.154417] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) 
> [  139.154776] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:115)
> 
> I still have not had time to read the netfs code part yet, but I assume netfs
> falls back to a generic 9p read on the O_WRONLY opened fid here, instead of
> using the special O_RDWR opened 'writeback_fid'.
> 
> Is there already some info available in the netfs API indicating that the
> read is actually part of a writeback task, so that we could force the 9p
> driver to use the special writeback_fid for the read in this case instead?

-- 
Dominique

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH] 9p: fix EBADF errors in cached mode
  2022-06-12 10:02                                               ` asmadeus
@ 2022-06-14  3:38                                                 ` Dominique Martinet
  2022-06-14  3:41                                                   ` Dominique Martinet
  0 siblings, 1 reply; 48+ messages in thread
From: Dominique Martinet @ 2022-06-14  3:38 UTC (permalink / raw)
  To: Eric Van Hensbergen, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, David Howells
  Cc: linux-fsdevel, stable, v9fs-developer, linux-kernel

cached operations sometimes need to do invalid operations (e.g. a read
on a write-only file).
Historic fscache had added a "writeback fid" for this, but the conversion
to the new fscache somehow lost usage of it: use the writeback fid instead
of the normal one.

Note that the way this works (writeback fid being linked to the inode) means
we might use an overprivileged fid for some operations, e.g. write as root
when we shouldn't.
Ideally we should keep both fids handy, and only use the writeback fid
when really required e.g. reads to a write-only file to fill in the page
cache (read-modify-write); but this is the situation we've always had
and this commit only fixes an issue we've had for too long.

Fixes: eb497943fa21 ("9p: Convert to using the netfs helper lib to do reads and caching")
Cc: stable@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Reported-By: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
---
Ok so finally had time to look at this, and it's not a lot, so this is
the most straightforward way to do it: just reverting to how the old
fscache worked.

This appears to work from quick testing, Christian, could you test it?

I think the warnings you added in p9_client_read/write that check
fid->mode might make a lot of sense; if you care to resend it as
WARN_ON((fid->mode & O_ACCMODE) == O_xyz);
instead, I'll queue that for 5.20
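
Spelled out, that check would look something like this (an untested
sketch, one check per direction):

  /* in p9_client_read(): reading from a write-only fid is a bug */
  WARN_ON((fid->mode & O_ACCMODE) == O_WRONLY);

  /* in p9_client_write(): writing to a read-only fid is a bug */
  WARN_ON((fid->mode & O_ACCMODE) == O_RDONLY);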


@Stable people, I've checked it applies to 5.17 and 5.18 so should be
good to grab once I submit it for inclusion (that commit was included in
5.16, which is no longer stable)


 fs/9p/vfs_addr.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 7382c5227e94..262968d02f55 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -58,7 +58,11 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
  */
 static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
 {
-	struct p9_fid *fid = file->private_data;
+	struct inode *inode = file_inode(file);
+	struct v9fs_inode *v9inode = V9FS_I(inode);
+	struct p9_fid *fid = v9inode->writeback_fid;
+
+	BUG_ON(!fid);
 
 	p9_fid_get(fid);
 	rreq->netfs_priv = fid;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH] 9p: fix EBADF errors in cached mode
  2022-06-14  3:38                                                 ` [PATCH] 9p: fix EBADF errors in cached mode Dominique Martinet
@ 2022-06-14  3:41                                                   ` Dominique Martinet
  2022-06-14 12:10                                                     ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: Dominique Martinet @ 2022-06-14  3:41 UTC (permalink / raw)
  To: Eric Van Hensbergen, Latchesar Ionkov, Christian Schoenebeck,
	David Howells
  Cc: linux-fsdevel, stable, v9fs-developer, linux-kernel

Dominique Martinet wrote on Tue, Jun 14, 2022 at 12:38:02PM +0900:
> cached operations sometimes need to do invalid operations (e.g. a read
> on a write-only file).
> Historic fscache had added a "writeback fid" for this, but the conversion
> to the new fscache somehow lost usage of it: use the writeback fid instead
> of the normal one.
> 
> Note that the way this works (writeback fid being linked to the inode) means
> we might use an overprivileged fid for some operations, e.g. write as root
> when we shouldn't.
> Ideally we should keep both fids handy, and only use the writeback fid
> when really required e.g. reads to a write-only file to fill in the page
> cache (read-modify-write); but this is the situation we've always had
> and this commit only fixes an issue we've had for too long.
> 
> Fixes: eb497943fa21 ("9p: Convert to using the netfs helper lib to do reads and caching")
> Cc: stable@vger.kernel.org
> Cc: David Howells <dhowells@redhat.com>
> Reported-By: Christian Schoenebeck <linux_oss@crudebyte.com>
> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
> ---
> Ok so finally had time to look at this, and it's not a lot, so this is
> the most straightforward way to do it: just reverting to how the old
> fscache worked.
> 
> This appears to work from quick testing, Christian, could you test it?
> 
> I think the warnings you added in p9_client_read/write that check
> fid->mode might make a lot of sense; if you care to resend it as
> WARN_ON((fid->mode & O_ACCMODE) == O_xyz);
> instead, I'll queue that for 5.20
> 
> 
> @Stable people, I've checked it applies to 5.17 and 5.18 so should be
> good to grab once I submit it for inclusion (that commit was included in
> 5.16, which is no longer stable)
> 
> 
>  fs/9p/vfs_addr.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
> index 7382c5227e94..262968d02f55 100644
> --- a/fs/9p/vfs_addr.c
> +++ b/fs/9p/vfs_addr.c
> @@ -58,7 +58,11 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
>   */
>  static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
>  {
> -	struct p9_fid *fid = file->private_data;
> +	struct inode *inode = file_inode(file);
> +	struct v9fs_inode *v9inode = V9FS_I(inode);
> +	struct p9_fid *fid = v9inode->writeback_fid;
> +

Sorry for mails back-to-back (grmbl I hate git commit --amend not
warning I only have unstaged changes), this is missing the following
here:

+    /* If there is no writeback fid this file has only ever had
+     * read-only opens, so we can use the file's fid, which should
+     * always be set instead */
+    if (!fid)
+        fid = file->private_data;

Christian, you can find it here to test:
https://github.com/martinetd/linux/commit/a6e033c41cc9f0ec105f5d208b0a820118e2bda8

> +	BUG_ON(!fid);
>  
>  	p9_fid_get(fid);
>  	rreq->netfs_priv = fid;

Thanks,
-- 
Dominique

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] 9p: fix EBADF errors in cached mode
  2022-06-14  3:41                                                   ` Dominique Martinet
@ 2022-06-14 12:10                                                     ` Christian Schoenebeck
  2022-06-14 12:45                                                       ` Dominique Martinet
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-06-14 12:10 UTC (permalink / raw)
  To: Eric Van Hensbergen, Latchesar Ionkov, David Howells, Dominique Martinet
  Cc: linux-fsdevel, stable, v9fs-developer, linux-kernel

On Dienstag, 14. Juni 2022 05:41:40 CEST Dominique Martinet wrote:
> Dominique Martinet wrote on Tue, Jun 14, 2022 at 12:38:02PM +0900:
> > cached operations sometimes need to do invalid operations (e.g. a read
> > on a write-only file).
> > Historic fscache had added a "writeback fid" for this, but the conversion
> > to the new fscache somehow lost usage of it: use the writeback fid instead
> > of the normal one.
> > 
> > Note that the way this works (writeback fid being linked to the inode) means
> > we might use an overprivileged fid for some operations, e.g. write as root
> > when we shouldn't.
> > Ideally we should keep both fids handy, and only use the writeback fid
> > when really required e.g. reads to a write-only file to fill in the page
> > cache (read-modify-write); but this is the situation we've always had
> > and this commit only fixes an issue we've had for too long.
> > 
> > Fixes: eb497943fa21 ("9p: Convert to using the netfs helper lib to do reads and caching")
> > Cc: stable@vger.kernel.org
> > Cc: David Howells <dhowells@redhat.com>
> > Reported-By: Christian Schoenebeck <linux_oss@crudebyte.com>
> > Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
> > ---
> > Ok so finally had time to look at this, and it's not a lot, so this is
> > the most straightforward way to do it: just reverting to how the old
> > fscache worked.
> > 
> > This appears to work from quick testing, Christian, could you test it?
> > 
> > I think the warnings you added in p9_client_read/write that check
> > fid->mode might make a lot of sense; if you care to resend it as
> > WARN_ON((fid->mode & O_ACCMODE) == O_xyz);
> > instead, I'll queue that for 5.20
> > 
> > 
> > @Stable people, I've checked it applies to 5.17 and 5.18 so should be
> > good to grab once I submit it for inclusion (that commit was included in
> > 5.16, which is no longer stable)
> > 
> >  fs/9p/vfs_addr.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
> > index 7382c5227e94..262968d02f55 100644
> > --- a/fs/9p/vfs_addr.c
> > +++ b/fs/9p/vfs_addr.c
> > @@ -58,7 +58,11 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
> >   */
> >  static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
> >  {
> > -	struct p9_fid *fid = file->private_data;
> > +	struct inode *inode = file_inode(file);
> > +	struct v9fs_inode *v9inode = V9FS_I(inode);
> > +	struct p9_fid *fid = v9inode->writeback_fid;
> > +
> 
> Sorry for mails back-to-back (grmbl I hate git commit --amend not
> warning I only have unstaged changes), this is missing the following
> here:

I think git does actually. It shows you staged and unstaged changes as comment 
below the commit log text inside the editor. Not as a big fat warning, but the 
info is there.

> +    /* If there is no writeback fid this file only ever has had
> +     * read-only opens, so we can use file's fid which should
> +     * always be set instead */
> +    if (!fid)
> +        fid = file->private_data;
> 
> Christian, you can find it here to test:
> https://github.com/martinetd/linux/commit/a6e033c41cc9f0ec105f5d208b0a820118e2bda8
> > +	BUG_ON(!fid);
> > 
> >  	p9_fid_get(fid);
> >  	rreq->netfs_priv = fid;

It definitely goes in the right direction, but I think it's going a bit too
far by using writeback_fid also in cases where it is not necessary and wasn't
used in the past.

What about something like this in v9fs_init_request() (yet untested):

    /* writeback_fid is always opened O_RDWR (instead of just O_WRONLY)
     * explicitly for this case: partial writebacks that require a read
     * prior to the actual write and therefore require a fid with read
     * capability.
     */
    if (rreq->origin == NETFS_READ_FOR_WRITE)
        fid = v9inode->writeback_fid;

If desired, this could be further constrained later on like:

    if (rreq->origin == NETFS_READ_FOR_WRITE &&
        (fid->mode & O_ACCMODE) == O_WRONLY)
    {
        fid = v9inode->writeback_fid;
    }

I will definitely give these options some test spins here, though some
quick feedback beforehand would be appreciated.

Thanks Dominique!

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] 9p: fix EBADF errors in cached mode
  2022-06-14 12:10                                                     ` Christian Schoenebeck
@ 2022-06-14 12:45                                                       ` Dominique Martinet
  2022-06-14 14:11                                                         ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: Dominique Martinet @ 2022-06-14 12:45 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Eric Van Hensbergen, Latchesar Ionkov, David Howells,
	linux-fsdevel, stable, v9fs-developer, linux-kernel

Christian Schoenebeck wrote on Tue, Jun 14, 2022 at 02:10:01PM +0200:
> It definitely goes in the right direction, but I think it's going a bit too
> far by using writeback_fid also in cases where it is not necessary and wasn't
> used in the past.

Would help if I had an idea of what was used where in the past.. :)

From a quick look at the code, checking out v5.10,
v9fs_vfs_writepage_locked() always used the writeback fid for all writes;
v9fs_vfs_readpages is a bit more complex but only seems to be using the
"direct" private_data fid for reads...
It took me a bit of time but I think the reads you were seeing on the
writeback fid come from v9fs_write_begin, which does a readpage on the
writeback fid to populate the page before a non-filling write happens.
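
For reference, the relevant bit of v5.10's v9fs_write_begin()
[fs/9p/vfs_addr.c], quoted from memory so worth double-checking:

  /* fill the page via the inode's writeback fid, not the fid
   * the file was opened with */
  retval = v9fs_fid_readpage(v9inode->writeback_fid, page);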

> What about something like this in v9fs_init_request() (yet untested):
> 
>     /* writeback_fid is always opened O_RDWR (instead of just O_WRONLY)
>      * explicitly for this case: partial writebacks that require a read
>      * prior to the actual write and therefore require a fid with read
>      * capability.
>      */
>     if (rreq->origin == NETFS_READ_FOR_WRITE)
>         fid = v9inode->writeback_fid;

... Which seems to be exactly what this origin is about, so if that
works I'm all for it.

> If desired, this could be further constrained later on like:
> 
>     if (rreq->origin == NETFS_READ_FOR_WRITE &&
>         (fid->mode & O_ACCMODE) == O_WRONLY)
>     {
>         fid = v9inode->writeback_fid;
>     }

That also makes sense: if the fid mode has read permissions we might as
well use those, as the writeback fid would needlessly be doing root IOs.

> I will definitely give these options some test spins here, a short feedback 
> ahead would be appreciated though.

Please let me know how that works out, I'd be happy to use either of
your versions instead of mine.
If I can be greedy though I'd like to post it together with the other
couple of fixes next week, so having something before the end of the
week would be great -- I think even taking my first overkill version
early and building on it would make sense at this point.

But I think you've got the right end of it, so hopefully we won't need
to delay


Cheers,
-- 
Dominique

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] 9p: fix EBADF errors in cached mode
  2022-06-14 12:45                                                       ` Dominique Martinet
@ 2022-06-14 14:11                                                         ` Christian Schoenebeck
  2022-06-16 13:35                                                           ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-06-14 14:11 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Eric Van Hensbergen, Latchesar Ionkov, David Howells,
	linux-fsdevel, stable, v9fs-developer, linux-kernel

On Dienstag, 14. Juni 2022 14:45:38 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Tue, Jun 14, 2022 at 02:10:01PM +0200:
> > It definitely goes in the right direction, but I think it's going a bit
> > too far by using writeback_fid also in cases where it is not necessary
> > and wasn't used in the past.
> 
> Would help if I had an idea of what was used where in the past.. :)
> 
> From a quick look at the code, checking out v5.10,
> v9fs_vfs_writepage_locked() always used the writeback fid for all writes;
> v9fs_vfs_readpages is a bit more complex but only seems to be using the
> "direct" private_data fid for reads...
> It took me a bit of time but I think the reads you were seeing on the
> writeback fid come from v9fs_write_begin, which does a readpage on the
> writeback fid to populate the page before a non-filling write happens.

Yes, the overall picture in the past was not clear to me either.

To be more specific, I was reading your patch as if it would e.g. also use the
writeback_fid if somebody explicitly called read() (i.e. not an implied read
caused by a partial writeback), and was concerned about a potential privilege
escalation. Maybe it's just a theoretical issue, as this case is probably
already caught at a higher, general fs handling level, but it is worth
consideration.

> > What about something like this in v9fs_init_request() (yet untested):
> >
> >     /* writeback_fid is always opened O_RDWR (instead of just O_WRONLY)
> >      * explicitly for this case: partial writebacks that require a read
> >      * prior to the actual write and therefore require a fid with read
> >      * capability.
> >      */
> >     if (rreq->origin == NETFS_READ_FOR_WRITE)
> >         fid = v9inode->writeback_fid;
> 
> ... Which seems to be exactly what this origin is about, so if that
> works I'm all for it.
> 
> > If desired, this could be further constrained later on like:
> >
> >     if (rreq->origin == NETFS_READ_FOR_WRITE &&
> >         (fid->mode & O_ACCMODE) == O_WRONLY)
> >     {
> >         fid = v9inode->writeback_fid;
> >     }
> 
> That also makes sense: if the fid mode has read permissions we might as
> well use those, as the writeback fid would needlessly be doing root IOs.
> 
> > I will definitely give these options some test spins here, though some
> > quick feedback beforehand would be appreciated.
> 
> Please let me know how that works out, I'd be happy to use either of
> your versions instead of mine.
> If I can be greedy though I'd like to post it together with the other
> couple of fixes next week, so having something before the end of the
> week would be great -- I think even taking my first overkill version
> early and building on it would make sense at this point.
> 
> But I think you've got the right end of it, so hopefully we won't need
> to delay

I need a day or two for testing, then I will report back for sure. So it 
should perfectly fit into your intended schedule.

Thanks!

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] 9p: fix EBADF errors in cached mode
  2022-06-14 14:11                                                         ` Christian Schoenebeck
@ 2022-06-16 13:35                                                           ` Christian Schoenebeck
  2022-06-16 13:51                                                             ` Dominique Martinet
  2022-06-16 13:52                                                             ` [PATCH v2] " Dominique Martinet
  0 siblings, 2 replies; 48+ messages in thread
From: Christian Schoenebeck @ 2022-06-16 13:35 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Eric Van Hensbergen, Latchesar Ionkov, David Howells,
	linux-fsdevel, stable, v9fs-developer, linux-kernel

On Dienstag, 14. Juni 2022 16:11:35 CEST Christian Schoenebeck wrote:
> On Dienstag, 14. Juni 2022 14:45:38 CEST Dominique Martinet wrote:
[...]
> > Please let me know how that works out, I'd be happy to use either of
> > your versions instead of mine.
> > If I can be greedy though I'd like to post it together with the other
> > couple of fixes next week, so having something before the end of the
> > week would be great -- I think even my first overkill version early and
> > building on it would make sense at this point.
> > 
> > But I think you've got the right end, so hopefully won't be needing to
> > delay
> 
> I need a day or two for testing, then I will report back for sure. So it
> should perfectly fit into your intended schedule.

Two things:

1. your EBADF patch is based on your recent get/put refactoring patch, so it won't apply on stable.

2. I fixed the conflict and gave your patch a test spin, and it triggers
the BUG_ON(!fid); that you added with that patch. Backtrace based on
30306f6194ca ("Merge tag 'hardening-v5.19-rc3' ..."):

[    2.211473] kernel BUG at fs/9p/vfs_addr.c:65!
...
[    2.244415] netfs_alloc_request (fs/netfs/objects.c:42) netfs
[    2.245438] netfs_readahead (fs/netfs/buffered_read.c:166) netfs
[    2.246392] read_pages (./include/linux/pagemap.h:1264 ./include/linux/pagemap.h:1306 mm/readahead.c:164) 
[    2.247120] ? folio_add_lru (./arch/x86/include/asm/preempt.h:103 mm/swap.c:468) 
[    2.247911] page_cache_ra_unbounded (./include/linux/fs.h:808 mm/readahead.c:264) 
[    2.248875] filemap_get_pages (mm/filemap.c:2594) 
[    2.249723] filemap_read (mm/filemap.c:2679) 
[    2.250478] ? ptep_set_access_flags (./arch/x86/include/asm/paravirt.h:441 arch/x86/mm/pgtable.c:493) 
[    2.251417] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:186) 
[    2.252253] ? do_wp_page (mm/memory.c:3293 mm/memory.c:3393) 
[    2.253012] ? aa_file_perm (security/apparmor/file.c:604) 
[    2.253824] new_sync_read (fs/read_write.c:402 (discriminator 1)) 
[    2.254616] vfs_read (fs/read_write.c:482) 
[    2.255313] ksys_read (fs/read_write.c:620) 
[    2.256000] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) 
[    2.256764] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:115)

Did your patch work there for you? I mean I have not applied the other pending
9p patches, but they should not really make a difference, right? I won't have
time today, but I will continue to look at it tomorrow. If you already had
some thoughts on this, that would be great of course.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] 9p: fix EBADF errors in cached mode
  2022-06-16 13:35                                                           ` Christian Schoenebeck
@ 2022-06-16 13:51                                                             ` Dominique Martinet
  2022-06-16 14:11                                                               ` Dominique Martinet
  2022-06-16 13:52                                                             ` [PATCH v2] " Dominique Martinet
  1 sibling, 1 reply; 48+ messages in thread
From: Dominique Martinet @ 2022-06-16 13:51 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Eric Van Hensbergen, Latchesar Ionkov, David Howells,
	linux-fsdevel, stable, v9fs-developer, linux-kernel

Christian Schoenebeck wrote on Thu, Jun 16, 2022 at 03:35:59PM +0200:
> 2. I fixed the conflict and gave your patch a test spin, and it triggers
> the BUG_ON(!fid); that you added with that patch. Backtrace based on
> 30306f6194ca ("Merge tag 'hardening-v5.19-rc3' ..."):

hm, that's probably the version I sent without the fallback to the
private_data fid if the writeback fid was unset (I've only commented
without sending a v2)

> 1. your EBADF patch is based on your recent get/put refactoring patch, so it won't apply on stable.

ugh, you are correct, that was wrong as well in the version I sent by
mail... I rushed that way too much.

The patch that's currently on the tip of my 9p-next branch should be
alright though, I'll resend it now so you can apply cleanly if you don't
want to fetch https://github.com/martinetd/linux/commits/9p-next

> Did your patch work there for you? I mean I have not applied the other pending
> 9p patches, but they should not really make a difference, right? I won't have
> time today, but I will continue to look at it tomorrow. If you already had
> some thoughts on this, that would be great of course.

Yes, my version passes basic tests at least, and I could no longer
reproduce the problem.

Without the if (!fid) fid = file->private_data though it does fail
horribly like you've found out.

--
Dominique

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2] 9p: fix EBADF errors in cached mode
  2022-06-16 13:35                                                           ` Christian Schoenebeck
  2022-06-16 13:51                                                             ` Dominique Martinet
@ 2022-06-16 13:52                                                             ` Dominique Martinet
  1 sibling, 0 replies; 48+ messages in thread
From: Dominique Martinet @ 2022-06-16 13:52 UTC (permalink / raw)
  To: Eric Van Hensbergen, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, David Howells
  Cc: stable, v9fs-developer, linux-kernel

cached operations sometimes need to do invalid operations (e.g. a read
on a write-only file).
Historic fscache had added a "writeback fid" for this, but the conversion
to the new fscache somehow lost usage of it: use the writeback fid instead
of the normal one.

Note that the way this works (writeback fid being linked to the inode) means
we might use an overprivileged fid for some operations, e.g. write as root
when we shouldn't.
Ideally we should keep both fids handy, and only use the writeback fid
when really required e.g. reads to a write-only file to fill in the page
cache (read-modify-write); but this is the situation we've always had
and this commit only fixes an issue we've had for too long.

Link: https://lkml.kernel.org/r/20220614033802.1606738-1-asmadeus@codewreck.org
Fixes: eb497943fa21 ("9p: Convert to using the netfs helper lib to do reads and caching")
Cc: stable@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Reported-By: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
---
 fs/9p/vfs_addr.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index a8f512b44a85..7f924e671e3e 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -58,7 +58,17 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
  */
 static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
 {
-	struct p9_fid *fid = file->private_data;
+	struct inode *inode = file_inode(file);
+	struct v9fs_inode *v9inode = V9FS_I(inode);
+	struct p9_fid *fid = v9inode->writeback_fid;
+
+	/* If there is no writeback fid this file has only ever had
+	 * read-only opens, so we can use the file's fid, which should
+	 * always be set instead */
+	if (!fid)
+		fid = file->private_data;
+
+	BUG_ON(!fid);
 
 	refcount_inc(&fid->count);
 	rreq->netfs_priv = fid;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH] 9p: fix EBADF errors in cached mode
  2022-06-16 13:51                                                             ` Dominique Martinet
@ 2022-06-16 14:11                                                               ` Dominique Martinet
  2022-06-16 20:14                                                                 ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: Dominique Martinet @ 2022-06-16 14:11 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Eric Van Hensbergen, Latchesar Ionkov, David Howells,
	linux-fsdevel, stable, v9fs-developer, linux-kernel

Dominique Martinet wrote on Thu, Jun 16, 2022 at 10:51:31PM +0900:
> > Did your patch work there for you? I mean I have not applied the other pending
> > 9p patches, but they should not really make a difference, right? I won't have
> > time today, but I will continue to look at it tomorrow. If you already had
> > some thoughts on this, that would be great of course.
> 
> Yes, my version passes basic tests at least, and I could no longer
> reproduce the problem.

For what it's worth I've also tested a version of your patch:

-----
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index a8f512b44a85..d0833fa69faf 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -58,8 +58,21 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
  */
 static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
 {
+	struct inode *inode = file_inode(file);
+	struct v9fs_inode *v9inode = V9FS_I(inode);
 	struct p9_fid *fid = file->private_data;
 
+	BUG_ON(!fid);
+
+	/* we might need to read from a fid that was opened write-only
+	 * for read-modify-write of page cache, use the writeback fid
+	 * for that */
+	if (rreq->origin == NETFS_READ_FOR_WRITE &&
+			(fid->mode & O_ACCMODE) == O_WRONLY) {
+		fid = v9inode->writeback_fid;
+		BUG_ON(!fid);
+	}
+
 	refcount_inc(&fid->count);
 	rreq->netfs_priv = fid;
 	return 0;
-----

And this also seems to work alright.

I was about to ask why the original code did writes with the writeback
fid, but I'm noticing now the current code still does (through
v9fs_vfs_write_folio_locked()), so that part hasn't changed from the old
code, and init_request will only be getting reads? Which actually makes
sense now I'm thinking about it because I recall David saying he's
working on netfs writes now...

So that minimal version is probably what we want, give or take style
adjustments (only initializing inode/v9inode in the if case or not) -- I
sure hope compilers optimize it away when not needed.
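
E.g. keeping the lookup out of the common path, something like this
(untested sketch):

  if (rreq->origin == NETFS_READ_FOR_WRITE &&
          (fid->mode & O_ACCMODE) == O_WRONLY) {
      struct v9fs_inode *v9inode = V9FS_I(file_inode(file));

      fid = v9inode->writeback_fid;
      BUG_ON(!fid);
  }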


I'll let you test one or both versions and will fix up the commit message
again/credit you/resend if we go with this version, unless you want to
send it.

--
Dominique

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH] 9p: fix EBADF errors in cached mode
  2022-06-16 14:11                                                               ` Dominique Martinet
@ 2022-06-16 20:14                                                                 ` Christian Schoenebeck
  2022-06-16 20:53                                                                   ` Dominique Martinet
  2022-06-16 21:10                                                                   ` [PATCH v3] " Dominique Martinet
  0 siblings, 2 replies; 48+ messages in thread
From: Christian Schoenebeck @ 2022-06-16 20:14 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Eric Van Hensbergen, Latchesar Ionkov, David Howells,
	linux-fsdevel, stable, v9fs-developer, linux-kernel

On Donnerstag, 16. Juni 2022 15:51:31 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Thu, Jun 16, 2022 at 03:35:59PM +0200:
> > 2. I fixed the conflict and gave your patch a test spin, and it triggers
> > the BUG_ON(!fid); that you added with that patch. Backtrace based on
> 
> > 30306f6194ca ("Merge tag 'hardening-v5.19-rc3' ..."):
> hm, that's probably the version I sent without the fallback to
> private_data fid if writeback fid was sent (I've only commented without
> sending a v2)

Right, I forgot that you queued another version, sorry. With your already 
queued patch (today's v2) that's fine now.

On Donnerstag, 16. Juni 2022 16:11:16 CEST Dominique Martinet wrote:
> Dominique Martinet wrote on Thu, Jun 16, 2022 at 10:51:31PM +0900:
> > > Did your patch work there for you? I mean I have not applied the other
> > > pending 9p patches, but they should not really make difference, right?
> > > I won't have time today, but I will continue to look at it tomorrow. If
> > > you already had some thoughts on this, that would be great of course.
> > 
> > Yes, my version passes basic tests at least, and I could no longer
> > reproduce the problem.
> 
> For what it's worth I've also tested a version of your patch:
> 
> -----
> diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
> index a8f512b44a85..d0833fa69faf 100644
> --- a/fs/9p/vfs_addr.c
> +++ b/fs/9p/vfs_addr.c
> @@ -58,8 +58,21 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
>   */
>  static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
>  {
> +	struct inode *inode = file_inode(file);
> +	struct v9fs_inode *v9inode = V9FS_I(inode);
>  	struct p9_fid *fid = file->private_data;
> 
> +	BUG_ON(!fid);
> +
> +	/* we might need to read from a fid that was opened write-only
> +	 * for read-modify-write of page cache, use the writeback fid
> +	 * for that */
> +	if (rreq->origin == NETFS_READ_FOR_WRITE &&
> +			(fid->mode & O_ACCMODE) == O_WRONLY) {
> +		fid = v9inode->writeback_fid;
> +		BUG_ON(!fid);
> +	}
> +
>  	refcount_inc(&fid->count);
>  	rreq->netfs_priv = fid;
>  	return 0;
> -----
> 
> And this also seems to work alright.
> 
> I was about to ask why the original code did writes with the writeback
> fid, but I'm noticing now that the current code still does (through
> v9fs_vfs_write_folio_locked()), so that part hasn't changed from the old
> code, and init_request will only be getting reads? Which actually makes
> sense now that I think about it, because I recall David saying he's
> working on netfs writes now...
> 
> So that minimal version is probably what we want, give or take style
> adjustments (only initializing inode/v9inode in the if case or not) -- I
> sure hope compilers optimize it away when not needed.
> 
> 
> I'll let you test one or both versions and will fix up the commit
> message again/credit you/resend if we go with this version, unless you
> want to send it.
> 
> --
> Dominique

I tested all 3 variants today, and they were all behaving correctly (no EBADF 
errors anymore, no other side effects observed).

The minimalistic version (i.e. your initial suggestion) performed 20% slower
in my tests, but that could be due to the fact that it was simply the first
version I tested, so caching on the host side might be the reason. If
necessary I can check the performance aspect more thoroughly.

Personally I would at least use the NETFS_READ_FOR_WRITE version, but that's
up to you. If in doubt, clarify with David what his plans are.

Feel free to add my RB and TB tags to any of the 3 versions you end up
queuing:

Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Tested-by: Christian Schoenebeck <linux_oss@crudebyte.com>

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] 9p: fix EBADF errors in cached mode
  2022-06-16 20:14                                                                 ` Christian Schoenebeck
@ 2022-06-16 20:53                                                                   ` Dominique Martinet
  2022-06-16 21:10                                                                   ` [PATCH v3] " Dominique Martinet
  1 sibling, 0 replies; 48+ messages in thread
From: Dominique Martinet @ 2022-06-16 20:53 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Eric Van Hensbergen, Latchesar Ionkov, David Howells,
	linux-fsdevel, stable, v9fs-developer, linux-kernel

Christian Schoenebeck wrote on Thu, Jun 16, 2022 at 10:14:16PM +0200:
> I tested all 3 variants today, and they were all behaving correctly (no EBADF 
> errors anymore, no other side effects observed).

Thanks!

> The minimalistic version (i.e. your initial suggestion) performed 20% slower
> in my tests, but that could be due to the fact that it was simply the first
> version I tested, so caching on the host side might be the reason. If
> necessary I can check the performance aspect more thoroughly.

hmm, yeah we open the writeback fids anyway so I'm not sure what would
really be different performance-wise, but I'd tend to go with the most
restricted change anyway.

> Personally I would at least use the NETFS_READ_FOR_WRITE version, but that's
> up to you. If in doubt, clarify with David what his plans are.
> 
> Feel free to add my RB and TB tags to any of the 3 versions you end up
> queuing:
> 
> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
> Tested-by: Christian Schoenebeck <linux_oss@crudebyte.com>

Thanks, I'll add these and resend the last version for archival on the
list / commit message wording check.

At last that issue closed...
--
Dominique

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3] 9p: fix EBADF errors in cached mode
  2022-06-16 20:14                                                                 ` Christian Schoenebeck
  2022-06-16 20:53                                                                   ` Dominique Martinet
@ 2022-06-16 21:10                                                                   ` Dominique Martinet
  2022-06-20 12:47                                                                     ` Christian Schoenebeck
  1 sibling, 1 reply; 48+ messages in thread
From: Dominique Martinet @ 2022-06-16 21:10 UTC (permalink / raw)
  To: Eric Van Hensbergen, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, David Howells
  Cc: stable, v9fs-developer, linux-kernel

Cached operations sometimes need to do invalid operations (e.g. a read
on a write-only file).
Historic fscache had added a "writeback fid", a special handle opened
RW as root, for this. The conversion to the new fscache missed that bit.

This commit reinstates a slightly lesser variant of the original code
that uses the writeback fid for partial page backfills if the regular
user fid had been opened WRONLY, and thus would lack read permissions.

Link: https://lkml.kernel.org/r/20220614033802.1606738-1-asmadeus@codewreck.org
Fixes: eb497943fa21 ("9p: Convert to using the netfs helper lib to do reads and caching")
Cc: stable@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Reported-By: Christian Schoenebeck <linux_oss@crudebyte.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Tested-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
---
v3: use the least permissive version of the patch that only uses the
writeback fid when really required

If no problem shows up by then, I'll post this patch around Wed 23 (next
week) with the other stable fixes.

 fs/9p/vfs_addr.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index a8f512b44a85..d0833fa69faf 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -58,8 +58,21 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
  */
 static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
 {
+	struct inode *inode = file_inode(file);
+	struct v9fs_inode *v9inode = V9FS_I(inode);
 	struct p9_fid *fid = file->private_data;
 
+	BUG_ON(!fid);
+
+	/* we might need to read from a fid that was opened write-only
+	 * for read-modify-write of page cache, use the writeback fid
+	 * for that */
+	if (rreq->origin == NETFS_READ_FOR_WRITE &&
+			(fid->mode & O_ACCMODE) == O_WRONLY) {
+		fid = v9inode->writeback_fid;
+		BUG_ON(!fid);
+	}
+
 	refcount_inc(&fid->count);
 	rreq->netfs_priv = fid;
 	return 0;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] 9p: fix EBADF errors in cached mode
  2022-06-16 21:10                                                                   ` [PATCH v3] " Dominique Martinet
@ 2022-06-20 12:47                                                                     ` Christian Schoenebeck
  2022-06-20 20:34                                                                       ` Dominique Martinet
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Schoenebeck @ 2022-06-20 12:47 UTC (permalink / raw)
  To: Eric Van Hensbergen, Latchesar Ionkov, Dominique Martinet,
	David Howells, stable
  Cc: v9fs-developer, linux-kernel

On Thursday, 16 June 2022 23:10:25 CEST Dominique Martinet wrote:
> Cached operations sometimes need to do invalid operations (e.g. a read
> on a write-only file).
> Historic fscache had added a "writeback fid", a special handle opened
> RW as root, for this. The conversion to the new fscache missed that bit.
> 
> This commit reinstates a slightly lesser variant of the original code
> that uses the writeback fid for partial page backfills if the regular
> user fid had been opened WRONLY, and thus would lack read permissions.
> 
> Link: https://lkml.kernel.org/r/20220614033802.1606738-1-asmadeus@codewreck.org
> Fixes: eb497943fa21 ("9p: Convert to using the netfs helper lib to do reads and caching")
> Cc: stable@vger.kernel.org
> Cc: David Howells <dhowells@redhat.com>
> Reported-By: Christian Schoenebeck <linux_oss@crudebyte.com>
> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
> Tested-by: Christian Schoenebeck <linux_oss@crudebyte.com>
> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
> ---
> v3: use the least permissive version of the patch that only uses the
> writeback fid when really required
> 
> If no problem shows up by then, I'll post this patch around Wed 23 (next
> week) with the other stable fixes.
> 
>  fs/9p/vfs_addr.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
> index a8f512b44a85..d0833fa69faf 100644
> --- a/fs/9p/vfs_addr.c
> +++ b/fs/9p/vfs_addr.c
> @@ -58,8 +58,21 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
>   */
>  static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
>  {
> +	struct inode *inode = file_inode(file);
> +	struct v9fs_inode *v9inode = V9FS_I(inode);
>  	struct p9_fid *fid = file->private_data;
> 
> +	BUG_ON(!fid);
> +
> +	/* we might need to read from a fid that was opened write-only
> +	 * for read-modify-write of page cache, use the writeback fid
> +	 * for that */
> +	if (rreq->origin == NETFS_READ_FOR_WRITE &&
> +			(fid->mode & O_ACCMODE) == O_WRONLY) {
> +		fid = v9inode->writeback_fid;
> +		BUG_ON(!fid);
> +	}
> +
>  	refcount_inc(&fid->count);
>  	rreq->netfs_priv = fid;
>  	return 0;

Some more tests this weekend; all looks fine. It appears that this also fixed
the performance degradation that I reported earlier in this thread. Again,
benchmarks compiling a bunch of sources:

Case  Linux kernel version         msize   cache  duration (average)

A)    EBADF fix only [1]           512000  loose  31m 14s
B)    EBADF fix only [1]           512000  mmap   44m 1s
C)    EBADF fix + clunk fixes [2]  512000  loose  29m 32s
D)    EBADF fix + clunk fixes [2]  512000  mmap   44m 0s
E)    5.10.84                      512000  loose  35m 5s
F)    5.10.84                      512000  mmap   65m 5s

[1] 5.19.0-rc2 + EBADF fix v3 patch (alone):
https://lore.kernel.org/lkml/20220616211025.1790171-1-asmadeus@codewreck.org/

[2] 5.19.0-rc2 + EBADF fix v3 patch + clunk fix patches, a.k.a. 9p-next:
https://github.com/martinetd/linux/commit/b0017602fdf6bd3f344dd49eaee8b6ffeed6dbac

Conclusion: all thumbs in my possession pointing upwards. :)

Thanks Dominique!

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] 9p: fix EBADF errors in cached mode
  2022-06-20 12:47                                                                     ` Christian Schoenebeck
@ 2022-06-20 20:34                                                                       ` Dominique Martinet
  2022-06-21 12:13                                                                         ` Christian Schoenebeck
  0 siblings, 1 reply; 48+ messages in thread
From: Dominique Martinet @ 2022-06-20 20:34 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Eric Van Hensbergen, Latchesar Ionkov, David Howells, stable,
	v9fs-developer, linux-kernel

Christian Schoenebeck wrote on Mon, Jun 20, 2022 at 02:47:24PM +0200:
> Some more tests this weekend; all looks fine. It appears that this also fixed
> the performance degradation that I reported earlier in this thread.

wow, I wouldn't have expected the EBADF fix patch to have any impact on
performance. Maybe the build just behaved differently enough to take
more time with the errors?

> Again, benchmarks compiling a bunch of sources:
> 
> Case  Linux kernel version         msize   cache  duration (average)
> 
> A)    EBADF fix only [1]           512000  loose  31m 14s
> B)    EBADF fix only [1]           512000  mmap   44m 1s
> C)    EBADF fix + clunk fixes [2]  512000  loose  29m 32s
> D)    EBADF fix + clunk fixes [2]  512000  mmap   44m 0s
> E)    5.10.84                      512000  loose  35m 5s
> F)    5.10.84                      512000  mmap   65m 5s
> 
> [1] 5.19.0-rc2 + EBADF fix v3 patch (alone):
> https://lore.kernel.org/lkml/20220616211025.1790171-1-asmadeus@codewreck.org/
> 
> [2] 5.19.0-rc2 + EBADF fix v3 patch + clunk fix patches, a.k.a. 9p-next:
> https://github.com/martinetd/linux/commit/b0017602fdf6bd3f344dd49eaee8b6ffeed6dbac
> 
> Conclusion: all thumbs in my possession pointing upwards. :)
> 
> Thanks Dominique!

Great news, thanks for the tests! :)

--
Dominique

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] 9p: fix EBADF errors in cached mode
  2022-06-20 20:34                                                                       ` Dominique Martinet
@ 2022-06-21 12:13                                                                         ` Christian Schoenebeck
  0 siblings, 0 replies; 48+ messages in thread
From: Christian Schoenebeck @ 2022-06-21 12:13 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Eric Van Hensbergen, Latchesar Ionkov, David Howells, stable,
	v9fs-developer, linux-kernel

On Monday, 20 June 2022 22:34:38 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Mon, Jun 20, 2022 at 02:47:24PM +0200:
> > Some more tests this weekend; all looks fine. It appears that this also
> > fixed the performance degradation that I reported earlier in this thread.
> 
> wow, I wouldn't have expected the EBADF fix patch to have any impact on
> performance. Maybe the build just behaved differently enough to take
> more time with the errors?

Maybe. It could also be the lower overhead of using the writeback_fid vs. a
dedicated fid, i.e. no walking and fid cloning is required when just using
the writeback_fid, which is already there (reduced latency).
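
Roughly, the difference between the two paths on the client side looks
like this (a sketch of the shape of the calls only, with base_fid and
v9inode assumed from context -- not actual v9fs code):

-----
/* dedicated fid: one TWALK (clone) plus one open round trip */
fid = p9_client_walk(base_fid, 0, NULL, 1);	/* clone the base fid */
if (IS_ERR(fid))
	return PTR_ERR(fid);
err = p9_client_open(fid, P9_ORDWR);		/* second round trip */
if (err)
	return err;

/* writeback fid: opened RW once at first open; reusing it later costs
 * no protocol round trip at all, just a local reference */
fid = v9inode->writeback_fid;
refcount_inc(&fid->count);
-----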

The overall reduced number of fids might also have some (smaller) impact, as
on the QEMU 9p server side we still have a simple linked list of fids, which
is iterated on each fid lookup. A proc-like interface for statistics (e.g.
the maximum number of fids) would be useful.
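
Something like the following would already do as a starting point (a
purely illustrative sketch; the list and field names are hypothetical
and don't claim to match current QEMU internals):

-----
/* hypothetical: walk the server's fid list and track a high-water mark */
static unsigned int v9fs_count_fids(V9fsState *s)
{
    unsigned int n = 0;
    V9fsFidState *f;

    for (f = s->fid_list; f; f = f->next) {
        n++;
    }
    if (n > s->max_fids_seen) {    /* hypothetical statistics field */
        s->max_fids_seen = n;
    }
    return n;
}
-----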

But honestly, all these things still don't really explain such a performance
difference to me with regard to this patch, as the particular case it handles
does not appear to happen often.
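
(For reference, the case in question is essentially a partial-page write
through a write-only descriptor, where netfs first has to backfill the
rest of the page with a read. A minimal userspace sketch of that shape,
with a made-up path:)

-----
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* a write-only fd on a 9p mount in cached mode */
	int fd = open("/mnt/9p/existing-file", O_WRONLY);

	if (fd < 0)
		return 1;
	/* offset 100 only partially covers a page, so the rest of that
	 * page must first be read in (NETFS_READ_FOR_WRITE); before the
	 * fix this read went through the O_WRONLY fid -> EBADF */
	pwrite(fd, "x", 1, 100);
	close(fd);
	return 0;
}
-----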

Anyway, my plan is to identify performance bottlenecks in general more
analytically this year. Now that we have macOS support for 9p in QEMU, I'll
probably use Xcode's "Instruments" tool, which offers a great way to
graphically investigate complex performance aspects in a very intuitive and
customizable way that goes beyond standard profiling. Then I can hunt down
performance issues by weight.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2022-06-21 12:14 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAAZOf26g-L2nSV-Siw6mwWQv1nv6on8c0fWqB4bKmX73QAFzow@mail.gmail.com>
2022-03-26 11:46 ` [syzbot] WARNING in p9_client_destroy David Kahurani
2022-03-26 11:48 ` Christian Schoenebeck
2022-03-26 12:24   ` asmadeus
2022-03-26 12:36     ` Christian Schoenebeck
2022-03-26 13:35       ` 9p fscache Duplicate cookie detected (Was: [syzbot] WARNING in p9_client_destroy) asmadeus
2022-03-30 12:21         ` 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected) Christian Schoenebeck
2022-03-30 21:47           ` asmadeus
2022-04-01 14:19             ` Christian Schoenebeck
2022-04-01 23:11               ` asmadeus
2022-04-02 12:43                 ` Christian Schoenebeck
2022-04-11  8:10             ` David Howells
2022-04-09 11:16           ` Christian Schoenebeck
2022-04-10 16:18             ` Christian Schoenebeck
2022-04-10 22:54               ` asmadeus
2022-04-11 13:41                 ` Christian Schoenebeck
2022-04-12 22:38                   ` asmadeus
2022-04-14 12:44                     ` Christian Schoenebeck
2022-04-17 12:56                       ` asmadeus
2022-04-17 13:52                         ` Christian Schoenebeck
2022-04-17 21:22                           ` asmadeus
2022-04-17 22:17                             ` 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)) asmadeus
2022-04-21 10:36                             ` David Howells
2022-04-21 11:36                               ` Christian Schoenebeck
2022-04-22 13:13                                 ` asmadeus
2022-04-25 14:10                                 ` David Howells
2022-04-26 15:38                                   ` Christian Schoenebeck
2022-05-03 10:21                                     ` asmadeus
2022-05-04 18:33                                       ` Christian Schoenebeck
2022-05-04 21:48                                         ` asmadeus
2022-05-06 19:14                                           ` Christian Schoenebeck
2022-06-03 16:46                                             ` Christian Schoenebeck
2022-06-12 10:02                                               ` asmadeus
2022-06-14  3:38                                                 ` [PATCH] 9p: fix EBADF errors in cached mode Dominique Martinet
2022-06-14  3:41                                                   ` Dominique Martinet
2022-06-14 12:10                                                     ` Christian Schoenebeck
2022-06-14 12:45                                                       ` Dominique Martinet
2022-06-14 14:11                                                         ` Christian Schoenebeck
2022-06-16 13:35                                                           ` Christian Schoenebeck
2022-06-16 13:51                                                             ` Dominique Martinet
2022-06-16 14:11                                                               ` Dominique Martinet
2022-06-16 20:14                                                                 ` Christian Schoenebeck
2022-06-16 20:53                                                                   ` Dominique Martinet
2022-06-16 21:10                                                                   ` [PATCH v3] " Dominique Martinet
2022-06-20 12:47                                                                     ` Christian Schoenebeck
2022-06-20 20:34                                                                       ` Dominique Martinet
2022-06-21 12:13                                                                         ` Christian Schoenebeck
2022-06-16 13:52                                                             ` [PATCH v2] " Dominique Martinet
2022-04-11  7:59           ` 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected) David Howells
