linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [linus:master] [eventfs]  852e46e239: BUG:unable_to_handle_page_fault_for_address
@ 2024-01-29  2:58 kernel test robot
  2024-01-29  4:36 ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: kernel test robot @ 2024-01-29  2:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: oe-lkp, lkp, linux-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Linus Torvalds, Christian Brauner, Al Viro,
	Ajay Kaher, kernel test robot, linux-trace-kernel



Hello,

kernel test robot noticed "BUG:unable_to_handle_page_fault_for_address" on:

commit: 852e46e239ee6db3cd220614cf8bce96e79227c2 ("eventfs: Do not create dentries nor inodes in iterate_shared")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master 7ed2632ec7d72e926b9e8bcc9ad1bb0cd37274bf]
[test failed on linux-next/master 774551425799cb5bbac94e1768fd69eec4f78dd4]

in testcase: stress-ng
version: stress-ng-x86_64-1744999cb-1_20240112
with following parameters:

	nr_threads: 10%
	disk: 1HDD
	testtime: 60s
	fs: xfs
	class: filesystem
	test: getdent
	cpufreq_governor: performance



compiler: gcc-12
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202401291043.e62e89dc-oliver.sang@intel.com


[   52.318955][ T4398] BUG: unable to handle page fault for address: 0000042c000004db
[   52.326780][ T4398] #PF: supervisor read access in kernel mode
[   52.332830][ T4398] #PF: error_code(0x0000) - not-present page
[   52.338864][ T4398] PGD 186970067 P4D 0
[   52.342993][ T4398] Oops: 0000 [#1] SMP NOPTI
[   52.347552][ T4398] CPU: 32 PID: 4398 Comm: stress-ng-getde Not tainted 6.7.0-rc2-00049-g852e46e239ee #1
[   52.357235][ T4398] Hardware name: Inspur NF5180M6/NF5180M6, BIOS 06.00.04 04/12/2022
[ 52.365278][ T4398] RIP: 0010:set_top_events_ownership (fs/tracefs/event_inode.c:172 (discriminator 1)) 
[ 52.371424][ T4398] Code: 00 00 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 8b 47 f8 48 85 c0 74 11 <f6> 40 78 02 74 0b 8b 50 50 f7 c2 00 00 08 00 75 05 c3 cc cc cc cc
All code
========
   0:	00 00                	add    %al,(%rax)
   2:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
   9:	00 00 00 
   c:	90                   	nop
   d:	90                   	nop
   e:	90                   	nop
   f:	90                   	nop
  10:	90                   	nop
  11:	90                   	nop
  12:	90                   	nop
  13:	90                   	nop
  14:	90                   	nop
  15:	90                   	nop
  16:	90                   	nop
  17:	90                   	nop
  18:	90                   	nop
  19:	90                   	nop
  1a:	90                   	nop
  1b:	90                   	nop
  1c:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  21:	48 8b 47 f8          	mov    -0x8(%rdi),%rax
  25:	48 85 c0             	test   %rax,%rax
  28:	74 11                	je     0x3b
  2a:*	f6 40 78 02          	testb  $0x2,0x78(%rax)		<-- trapping instruction
  2e:	74 0b                	je     0x3b
  30:	8b 50 50             	mov    0x50(%rax),%edx
  33:	f7 c2 00 00 08 00    	test   $0x80000,%edx
  39:	75 05                	jne    0x40
  3b:	c3                   	retq   
  3c:	cc                   	int3   
  3d:	cc                   	int3   
  3e:	cc                   	int3   
  3f:	cc                   	int3   

Code starting with the faulting instruction
===========================================
   0:	f6 40 78 02          	testb  $0x2,0x78(%rax)
   4:	74 0b                	je     0x11
   6:	8b 50 50             	mov    0x50(%rax),%edx
   9:	f7 c2 00 00 08 00    	test   $0x80000,%edx
   f:	75 05                	jne    0x16
  11:	c3                   	retq   
  12:	cc                   	int3   
  13:	cc                   	int3   
  14:	cc                   	int3   
  15:	cc                   	int3   
[   52.391324][ T4398] RSP: 0018:ffa000000f2efc90 EFLAGS: 00010206
[   52.397481][ T4398] RAX: 0000042c00000463 RBX: ff110002452ae2c8 RCX: 0000000000004000
[   52.405547][ T4398] RDX: 0000000000000024 RSI: ff110002452ae2c8 RDI: ff110002452ae2c8
[   52.413616][ T4398] RBP: ffffffff83080320 R08: 0000000000000064 R09: ff1100018a25dd40
[   52.421686][ T4398] R10: ffffffffffff8c98 R11: 0000000000000000 R12: 0000000000000024
[   52.429755][ T4398] R13: ffffffff83080320 R14: ffa000000f2efedc R15: 0000000000018000
[   52.437828][ T4398] FS:  00007f3b60e3d740(0000) GS:ff11001fffe00000(0000) knlGS:0000000000000000
[   52.446854][ T4398] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   52.453547][ T4398] CR2: 0000042c000004db CR3: 0000000113750004 CR4: 0000000000771ef0
[   52.461629][ T4398] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   52.469707][ T4398] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   52.477783][ T4398] PKRU: 55555554
[   52.481444][ T4398] Call Trace:
[   52.484848][ T4398]  <TASK>
[ 52.487894][ T4398] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) 
[ 52.491898][ T4398] ? page_fault_oops (arch/x86/mm/fault.c:707) 
[ 52.496854][ T4398] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1561) 
[ 52.501725][ T4398] ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570) 
[ 52.506858][ T4398] ? set_top_events_ownership (fs/tracefs/event_inode.c:172 (discriminator 1)) 
[ 52.512425][ T4398] eventfs_permission (fs/tracefs/event_inode.c:203) 
[ 52.517390][ T4398] inode_permission (fs/namei.c:462 fs/namei.c:529 fs/namei.c:504) 
[ 52.522271][ T4398] may_open (fs/namei.c:3250) 
[ 52.526453][ T4398] do_open (fs/namei.c:3621) 
[ 52.530540][ T4398] ? open_last_lookups (fs/namei.c:3569) 
[ 52.535670][ T4398] path_openat (fs/namei.c:3780) 
[ 52.540187][ T4398] do_filp_open (fs/namei.c:3809) 
[ 52.544696][ T4398] do_sys_openat2 (fs/open.c:1440) 
[ 52.549285][ T4398] __x64_sys_openat (fs/open.c:1466) 
[ 52.554044][ T4398] do_syscall_64 (arch/x86/entry/common.c:51 arch/x86/entry/common.c:82) 
[ 52.558541][ T4398] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) 
[   52.564509][ T4398] RIP: 0033:0x7f3b60fcb127
[ 52.568991][ T4398] Code: 25 00 00 41 00 3d 00 00 41 00 74 47 64 8b 04 25 18 00 00 00 85 c0 75 6b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 95 00 00 00 48 8b 4c 24 28 64 48 2b 0c 25
All code
========
   0:	25 00 00 41 00       	and    $0x410000,%eax
   5:	3d 00 00 41 00       	cmp    $0x410000,%eax
   a:	74 47                	je     0x53
   c:	64 8b 04 25 18 00 00 	mov    %fs:0x18,%eax
  13:	00 
  14:	85 c0                	test   %eax,%eax
  16:	75 6b                	jne    0x83
  18:	44 89 e2             	mov    %r12d,%edx
  1b:	48 89 ee             	mov    %rbp,%rsi
  1e:	bf 9c ff ff ff       	mov    $0xffffff9c,%edi
  23:	b8 01 01 00 00       	mov    $0x101,%eax
  28:	0f 05                	syscall 
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	0f 87 95 00 00 00    	ja     0xcb
  36:	48 8b 4c 24 28       	mov    0x28(%rsp),%rcx
  3b:	64                   	fs
  3c:	48                   	rex.W
  3d:	2b                   	.byte 0x2b
  3e:	0c 25                	or     $0x25,%al

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	0f 87 95 00 00 00    	ja     0xa1
   c:	48 8b 4c 24 28       	mov    0x28(%rsp),%rcx
  11:	64                   	fs
  12:	48                   	rex.W
  13:	2b                   	.byte 0x2b
  14:	0c 25                	or     $0x25,%al


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240129/202401291043.e62e89dc-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29  2:58 [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address kernel test robot
@ 2024-01-29  4:36 ` Linus Torvalds
  2024-01-29 17:01   ` Steven Rostedt
  2024-01-30  2:08   ` Al Viro
  0 siblings, 2 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-29  4:36 UTC (permalink / raw)
  To: kernel test robot
  Cc: Steven Rostedt, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Sun, 28 Jan 2024 at 18:59, kernel test robot <oliver.sang@intel.com> wrote:
>
>   21:   48 8b 47 f8             mov    -0x8(%rdi),%rax
>   25:   48 85 c0                test   %rax,%rax
>   28:   74 11                   je     0x3b
>   2a:*  f6 40 78 02             testb  $0x2,0x78(%rax)          <-- trapping instruction

So this is

        struct tracefs_inode *ti = get_tracefs(inode);
        struct eventfs_inode *ei = ti->private;

        if (!ei || !ei->is_events || ..

in set_top_events_ownership(), and it's that 'ei->is_events' test that oopses.

The 'inode' is the incoming argument (in %rdi), and you don't see code
generation for the "get_tracefs(inode)" because it's just an offset
off the inode.

So the "ti->private" read is that read off "-8(%rdi)", because

  struct tracefs_inode {
        unsigned long           flags;
        void                    *private;
        struct inode            vfs_inode;
  };

so 'private' is 8 bytes below the 'struct inode' pointer.

So 'ei' isn't NULL, but it's a bad pointer, and 'ei->is_events' is
that "testb  $0x2,0x78(%rax)" and it oopses as a result.

I don't think this is directly related to that commit 852e46e239ee
("eventfs: Do not create dentries nor inodes in iterate_shared") that
the kernel test robot talks about.

It looks like some inode creation never filled in the ->private field
at all, and it's just garbage.

I note that we have code like this:

 create_dir_dentry():
        ...
        mutex_unlock(&eventfs_mutex);

        dentry = create_dir(ei, parent);

        mutex_lock(&eventfs_mutex);
        ...
        if (!ei->dentry && !ei->is_freed) {
                ei->dentry = dentry;
                eventfs_post_create_dir(ei);
                dentry->d_fsdata = ei;
        } else {

and that eventfs_post_create_dir() seems to be where it fills in that
->private pointer:

        ti = get_tracefs(ei->dentry->d_inode);
        ti->private = ei;

but notice how create_dir() has done that

        d_instantiate(dentry, inode);

which exposes the inode to lookup - long before it's actually then filled in.

IOW, what I think is going on is a very basic race, where
create_dir_dentry() will allocate the inode and attach it to the
dentry long before the inode has been fully initialized.

So if somebody comes in *immediately* after that, and does a 'stat()'
on that name that now is exposed, it will see an inode that has not
yet made it to that eventfs_post_create_dir() phase, and that in turn
explains why

        struct eventfs_inode *ei = ti->private;

just reads random garbage values.

So if I'm right (big "if" - it looks likely, but who knows) this bug
is entirely unrelated to any dentry caching or any reference counting.

It just needs just the right timing, where one CPU happens to do a
'stat()' just as another one creates the directory.

It's not a big window, so you do need some timing "luck".

End result: the d_instantiate() needs to be done *after* the inode has
been fully filled in.

Alternatively, you could

 (a) not drop the eventfs_mutex around the create_dir() call

 (b) take the eventfs_mutex around all of set_top_events_ownership()

and just fix it by having the lock protect the lack of ordering.

                 Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29  4:36 ` Linus Torvalds
@ 2024-01-29 17:01   ` Steven Rostedt
  2024-01-29 17:40     ` Linus Torvalds
  2024-01-30  2:08   ` Al Viro
  1 sibling, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2024-01-29 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Sun, 28 Jan 2024 20:36:12 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:


> End result: the d_instantiate() needs to be done *after* the inode has
> been fully filled in.
> 
> Alternatively, you could
> 
>  (a) not drop the eventfs_mutex around the create_dir() call
> 
>  (b) take the eventfs_mutex around all of set_top_events_ownership()
> 
> and just fix it by having the lock protect the lack of ordering.

Hi Linus,

Thanks for the analysis. I have a patch that removes the dropping of the
mutex over the create_dir/file() calls, and lockdep hasn't complained about
it.

I was going to add that to my queue for the next merge window along with
other clean ups but this looks like it actually fixes a real bug. I'll move
that over to my urgent queue and start testing it.

-- Steve


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 17:01   ` Steven Rostedt
@ 2024-01-29 17:40     ` Linus Torvalds
  2024-01-29 17:44       ` Steven Rostedt
                         ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-29 17:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 at 09:01, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Thanks for the analysis. I have a patch that removes the dropping of the
> mutex over the create_dir/file() calls, and lockdep hasn't complained about
> it.
>
> I was going to add that to my queue for the next merge window along with
> other clean ups but this looks like it actually fixes a real bug. I'll move
> that over to my urgent queue and start testing it.

Note that it is *not* enough to just keep the mutex.

All the *users* need to get the mutex too.

Otherwise you have this:

CPU1:

   create_dir_dentry():
      mutex locked the whole time..
        dentry = create_dir(ei, parent);
           does d_instantiate(dentry, inode);
        eventfs_post_create_dir(ei);
           dentry->d_fsdata = ei;
        mutex dropped;

but CPU2 can still come in, see the dentry immediately after the
"d_instantiate()", and do an "open()" or "stat()" on the dentry (which
will *not* cause a 'lookup()', since it's in the dentry cache), and
that will then cause either

 ->permission() -> eventfs_permission() -> set_top_events_ownership()

or

 ->get_attr() -> eventfs_get_attr() -> set_top_events_ownership()

and both of those will now do the dentry->inode->ei lookup. And
neither of them takes the mutex.

So then it doesn't even matter that you didn't drop the mutex in the
middle, because the users simply won't be serializing with it anyway.

So you'd have to make set_top_events_ownership() take the mutex around
it all too.

In fact, pretty much *any* use of "ti->private" needs the mutex.

Which is obviously a bit painful.

Honestly, I think the right model is to just make sure that the inode
is fully initialized when you do 'd_instantiate()'

The patch looks obvious, and I think this actually fixes *another*
bug, namely that the old

        ti = get_tracefs(inode);
        ti->flags |= TRACEFS_EVENT_INODE;

was buggy, because 'ti->flags' was uninitialized before.

eventfs_create_events_dir() seems to have the same bug with ti->flags,
btw, but got the ti->private initialization right.

Funnily enough, create_file() got this right. I don't even understand
why create_dir() did what it did.

IOW, I think the right fix is really just this:

  --- a/fs/tracefs/event_inode.c
  +++ b/fs/tracefs/event_inode.c
  @@ -328,7 +328,8 @@
        inode->i_ino = EVENTFS_FILE_INODE_INO;

        ti = get_tracefs(inode);
  -     ti->flags |= TRACEFS_EVENT_INODE;
  +     ti->flags = TRACEFS_EVENT_INODE;
  +     ti->private = ei;
        d_instantiate(dentry, inode);
        fsnotify_create(dentry->d_parent->d_inode, dentry);
        return eventfs_end_creating(dentry);
  @@ -513,7 +514,6 @@
   static void eventfs_post_create_dir(struct eventfs_inode *ei)
   {
        struct eventfs_inode *ei_child;
  -     struct tracefs_inode *ti;

        lockdep_assert_held(&eventfs_mutex);

  @@ -523,9 +523,6 @@
                                 srcu_read_lock_held(&eventfs_srcu)) {
                ei_child->d_parent = ei->dentry;
        }
  -
  -     ti = get_tracefs(ei->dentry->d_inode);
  -     ti->private = ei;
   }

   /**
  @@ -943,7 +940,7 @@
        INIT_LIST_HEAD(&ei->list);

        ti = get_tracefs(inode);
  -     ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
  +     ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
        ti->private = ei;

        inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;

which fixes the initialization errors with the 'ti' fields.

Now, I do agree that your locking is strange, and that should be fixed
*too*, but I think the above is the "right" fix for this particular
issue.

                Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 17:40     ` Linus Torvalds
@ 2024-01-29 17:44       ` Steven Rostedt
  2024-01-29 17:45         ` Steven Rostedt
  2024-01-29 17:56         ` Linus Torvalds
  2024-01-29 17:55       ` Linus Torvalds
  2024-01-29 19:24       ` Linus Torvalds
  2 siblings, 2 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-29 17:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 09:40:06 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Now, I do agree that your locking is strange, and that should be fixed
> *too*, but I think the above is the "right" fix for this particular
> issue.

Would you be OK if I did both as a "fix"?

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 17:44       ` Steven Rostedt
@ 2024-01-29 17:45         ` Steven Rostedt
  2024-01-29 17:56         ` Linus Torvalds
  1 sibling, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-29 17:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 12:44:36 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 29 Jan 2024 09:40:06 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > Now, I do agree that your locking is strange, and that should be fixed
> > *too*, but I think the above is the "right" fix for this particular
> > issue.  
> 
> Would you be OK if I did both as a "fix"?
> 

In separate patches of course. And I may even give the same tags to both
patches as well.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 17:40     ` Linus Torvalds
  2024-01-29 17:44       ` Steven Rostedt
@ 2024-01-29 17:55       ` Linus Torvalds
  2024-01-29 19:24       ` Linus Torvalds
  2 siblings, 0 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-29 17:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 at 09:40, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> eventfs_create_events_dir() seems to have the same bug with ti->flags,
> btw, but got the ti->private initialization right.
>
> Funnily enough, create_file() got this right. I don't even understand
> why create_dir() did what it did.
>
> IOW, I think the right fix is really just this:

Actually, I think you have another uninitialized field here too:
'dentry->d_fsdata'.

And it looks like both create_file and create_dir got that wrong, but
eventfs_create_events_dir() got it right.

So you *also* need to do that

        dentry->d_fsdata = ei;

before you do the d_instantiate().

Now, from a quick look, all the d_fsdata accesses *do* seem to be
protected by the eventfs_mutex, except

 (a) eventfs_create_events_dir() doesn't seem to take the mutex, but
gets the ordering right, so is ok

 (b) create_dir_dentry() drops the mutex in the middle, so the mutex
doesn't actually serialize anything

Not dropping the mutex in create_dir_dentry() *will* fix this bug, but
honestly, I'd suggest that in addition to not dropping the mutex, you
just fix the ordering too.

IOW, just do that

        dentry->d_fsdata = ei;

before you do d_instantiate(), and now accessing d_fsdata is just
always safe and doesn't even need the mutex.

The whole "initialize everything before exposing it to others" is
simply just a good idea.

                Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 17:44       ` Steven Rostedt
  2024-01-29 17:45         ` Steven Rostedt
@ 2024-01-29 17:56         ` Linus Torvalds
  1 sibling, 0 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-29 17:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 at 09:44, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 29 Jan 2024 09:40:06 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Now, I do agree that your locking is strange, and that should be fixed
> > *too*, but I think the above is the "right" fix for this particular
> > issue.
>
> Would you be OK if I did both as a "fix"?

See my crossed email - not dropping the mutex *is* actually a fix for
another piece of data.

          Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 17:40     ` Linus Torvalds
  2024-01-29 17:44       ` Steven Rostedt
  2024-01-29 17:55       ` Linus Torvalds
@ 2024-01-29 19:24       ` Linus Torvalds
  2024-01-29 19:51         ` Linus Torvalds
  2 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2024-01-29 19:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 674 bytes --]

On Mon, 29 Jan 2024 at 09:40, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> IOW, I think the right fix is really just this:

Oh, for some reason I sent out the original patch I had which didn't
fix the create_dir() case.

So that patch was missing the important hunk that added the

        ti->flags = TRACEFS_EVENT_INODE;
        ti->private = ei;

to create_dir() (to match the removal in eventfs_post_create_dir()).

I had incorrectly put it in the create_file() case, that should just
set ->private to NULL. afaik

So the patch was completely broken. Here's the one that should
actually compile (although still not actually *tested*).

               Linus

[-- Attachment #2: 0001-eventfsfs-initialize-the-tracefs-inode-properly.patch --]
[-- Type: text/x-patch, Size: 2544 bytes --]

From 6e5db10ebc96ebe6b9707c9938c450f51e9a3ae0 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 29 Jan 2024 11:06:32 -0800
Subject: [PATCH] eventfsfs: initialize the tracefs inode properly

The tracefs-specific fields in the inode were not initialized before the
inode was exposed to others through the dentry with 'd_instantiate()'.

And the ->flags file was initialized incorrectly with a '|=', when the
old value was stale.  It should have just been a straight assignment.

Move the field initializations up to before the d_instantiate, and fix
the use of uninitialized data.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2d128bedd654..c0d977e6c0f2 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	inode->i_ino = EVENTFS_FILE_INODE_INO;
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = NULL;			// Directories have 'ei', files not
+
 	d_instantiate(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
@@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	inode->i_ino = eventfs_dir_ino(ei);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = ei;
 
 	inc_nlink(inode);
 	d_instantiate(dentry, inode);
@@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 static void eventfs_post_create_dir(struct eventfs_inode *ei)
 {
 	struct eventfs_inode *ei_child;
-	struct tracefs_inode *ti;
 
 	lockdep_assert_held(&eventfs_mutex);
 
@@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei)
 				 srcu_read_lock_held(&eventfs_srcu)) {
 		ei_child->d_parent = ei->dentry;
 	}
-
-	ti = get_tracefs(ei->dentry->d_inode);
-	ti->private = ei;
 }
 
 /**
@@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	INIT_LIST_HEAD(&ei->list);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+	ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
 	ti->private = ei;
 
 	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
-- 
2.43.0.5.g38fb137bdb


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 19:24       ` Linus Torvalds
@ 2024-01-29 19:51         ` Linus Torvalds
  2024-01-29 20:26           ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2024-01-29 19:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 at 11:24, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So the patch was completely broken. Here's the one that should
> actually compile (although still not actually *tested*).

Note that this fixes the d_instantiate() ordering wrt initializing the inode.

But as I look up the call chain, I see many more fundamental mistakes.

Steven - the reason you think that the VFS doesn't have documentation
is that we *do* have tons of documentation, but it's of the kind "Here
is what you should do".

It is *not* of the kind that says "You messed up and did something
else, and how do you recover from it?".

So the fundamental bug I now find is that eventfs_root_lookup() gets a
target dentry, and for some unfathomable reason it then does

        ret = simple_lookup(dir, dentry, flags);

on it. Which is *completely* broken, because what "simple_lookup()"
does is just say "oh, you didn't have a dentry of this kind before, so
clearly a lookup must be a non-existent file". Remember: this is for
'tmpfs' kinds of filesystems where the dentry cache cotnains *ALL*
files.

For the tracefs kind of filesystem, it's TOTALLY BOGUS. What the
"simple_lookup()" will do is just a plain

        d_add(dentry, NULL);

and nothing else. And guess what *that* does? It basically
instantiates a negative dentry, telling all other lookups that the
path does not exist.

So if you have two concurrent lookups, one will do that
simple_lookup(), and the other will then - depending on timing -
either see the negative dentry and return -ENOENT, or - if it comes in
a bit later - see the new inode that then later gets added by the
first lookup with d_instantiate().

See? That simple_lookup() is not just unnecessary, but it's also
actively completely WRONG. Because it instantiates a NULL pointer,
other processes that race with the lookup may now end up saying "that
file doesn't exist", even though it should.

Basically, you can't use *any* of the "simple" filesystem helpers.
Because they are all designed for that "the dentry tree is all there
is" case.

                Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 19:51         ` Linus Torvalds
@ 2024-01-29 20:26           ` Steven Rostedt
  2024-01-29 20:51             ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2024-01-29 20:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 11:51:52 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 29 Jan 2024 at 11:24, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So the patch was completely broken. Here's the one that should
> > actually compile (although still not actually *tested*).  
> 
> Note that this fixes the d_instantiate() ordering wrt initializing the inode.
> 
> But as I look up the call chain, I see many more fundamental mistakes.
> 
> Steven - the reason you think that the VFS doesn't have documentation
> is that we *do* have tons of documentation, but it's of the kind "Here
> is what you should do".
> 
> It is *not* of the kind that says "You messed up and did something
> else, and how do you recover from it?".

When I first worked on this, I did read all the VFS documentation, and it
was difficult to understand what I needed and what I didn't. It is focused
on a real file system and not a pseudo ones. Another mistake I made was
thinking that debugfs was doing things the "right" way as well. And having
a good idea of how debugfs worked, and thinking it was correct, just made
reading VFS documentation even more difficult as I couldn't relate what I
knew (about debugfs) with what was being explained. I thought it was
perfectly fine to use dentry as a handle for the file system. I did until
you told me it wasn't. That made a profound change in my understanding of
how things are supposed to work.

> 
> So the fundamental bug I now find is that eventfs_root_lookup() gets a
> target dentry, and for some unfathomable reason it then does
> 
>         ret = simple_lookup(dir, dentry, flags);
> 
> on it. Which is *completely* broken, because what "simple_lookup()"
> does is just say "oh, you didn't have a dentry of this kind before, so
> clearly a lookup must be a non-existent file". Remember: this is for
> 'tmpfs' kinds of filesystems where the dentry cache cotnains *ALL*
> files.

Sorry, I don't really understand what you mean by "ALL files"? You mean
that all files in the pseudo file system has a dentry to it (like debugfs,
and the rest of tracefs)?

> 
> For the tracefs kind of filesystem, it's TOTALLY BOGUS. What the
> "simple_lookup()" will do is just a plain
> 
>         d_add(dentry, NULL);
> 
> and nothing else. And guess what *that* does? It basically
> instantiates a negative dentry, telling all other lookups that the
> path does not exist.
> 
> So if you have two concurrent lookups, one will do that
> simple_lookup(), and the other will then - depending on timing -
> either see the negative dentry and return -ENOENT, or - if it comes in
> a bit later - see the new inode that then later gets added by the
> first lookup with d_instantiate().
> 
> See? That simple_lookup() is not just unnecessary, but it's also
> actively completely WRONG. Because it instantiates a NULL pointer,
> other processes that race with the lookup may now end up saying "that
> file doesn't exist", even though it should.
> 
> Basically, you can't use *any* of the "simple" filesystem helpers.
> Because they are all designed for that "the dentry tree is all there
> is" case.

Yeah, the above code did come about with me not fully understanding the
above. It's not that it wasn't documented, but I admit, when I read the VFS
documentation, I had a lot of trouble trying to make sense of things like
negative dentries and how they relate.

I now have a much better understanding of most of this, thanks to our
discussion here, and also knowing that using dentry as the main handle to a
file is *not* how to do it. When thinking it was, it made things much more
difficult to comprehend.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 20:26           ` Steven Rostedt
@ 2024-01-29 20:51             ` Linus Torvalds
  2024-01-29 21:45               ` Steven Rostedt
                                 ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-29 20:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 2648 bytes --]

On Mon, 29 Jan 2024 at 12:25, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > So the fundamental bug I now find is that eventfs_root_lookup() gets a
> > target dentry, and for some unfathomable reason it then does
> >
> >         ret = simple_lookup(dir, dentry, flags);
> >
> > on it. Which is *completely* broken, because what "simple_lookup()"
> > does is just say "oh, you didn't have a dentry of this kind before, so
> > clearly a lookup must be a non-existent file". Remember: this is for
> > 'tmpfs' kinds of filesystems where the dentry cache cotnains *ALL*
> > files.
>
> Sorry, I don't really understand what you mean by "ALL files"? You mean
> that all files in the pseudo file system has a dentry to it (like debugfs,
> and the rest of tracefs)?

Yes.

So the whole - and *ONLY* - point of 'simple_lookup()' is for
filesystems like tmpfs, or like debugfs or other filesystems like
that, which never actually *need* to look anything up, because
everything is already cached in the dentry tree.

That's what the "simple" part of the simple functions mean. They are
simple from a dcache standpoint, because the dcache is all there is.

End result: what simple_lookup() does is say "oh, you didn't have the
file, so it's by definition a negative dentry", and thus all it does
is to do "d_add(dentry, NULL)".

Anyway, removing this was painful. I initially thought "I'll just
remove the calls". But it all ended up cascading into "that's also
wrong".

So now I have a patch that tries to fix this all up, and it looks like thisL:

 1 file changed, 50 insertions(+), 219 deletions(-)

because it basically removed all the old code, and replaced it with
much simpler code.

I'm including the patch here as an attachment, but I want to note very
clearly that this *builds* for me, and it looks a *lot* more obvious
and correct than the old code did, but I haven't tested it. AT ALL.

Also note that it depends on my previous patches, so I guess I'll
include them here again just to make it unambiguous.

Finally - this does *not* fix up the refcounting. I still think the
SRCU stuff is completely broken. But that's another headache. But at
least now the *lookup* parts look like they DTRT wrt eventfs_mutex.

The SRCU logic from the directory iteration parts still needs crapectomy.

AGAIN: these patches (ie particularly that last one - 0004) were all
done entirely "blindly" - I've looked at the code, and fixed the bugs
and problems I've seen by pure code inspection.

That's great, but it really means that it's all untested. It *looks*
better than the old code, but there may be some silly gotcha that I
have missed.

            Linus

[-- Attachment #2: 0002-tracefs-avoid-using-the-ei-dentry-pointer-unnecessar.patch --]
[-- Type: text/x-patch, Size: 3561 bytes --]

From b1f487acf6f4e9093d8b0fa00f864a6d07a3c4c2 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 27 Jan 2024 13:27:01 -0800
Subject: [PATCH 2/4] tracefs: avoid using the ei->dentry pointer unnecessarily

The eventfs_find_events() code tries to walk up the tree to find the
event directory that a dentry belongs to, in order to then find the
eventfs inode that is associated with that event directory.

However, it uses an odd combination of walking the dentry parent,
looking up the eventfs inode associated with that, and then looking up
the dentry from there.  Repeat.

But the code shouldn't have back-pointers to dentries in the first
place, and it should just walk the dentry parenthood chain directly.

Similarly, 'set_top_events_ownership()' looks up the dentry from the
eventfs inode, but the only reason it wants a dentry is to look up the
superblock in order to look up the root dentry.

But it already has the real filesystem inode, which has that same
superblock pointer.  So just pass in the superblock pointer using the
information that's already there, instead of looking up extraneous data
that is irrelevant.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1c3dd0ad4660..2d128bedd654 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -156,33 +156,30 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
 	return ret;
 }
 
-static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry)
+static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb)
 {
-	struct inode *inode;
+	struct inode *root;
 
 	/* Only update if the "events" was on the top level */
 	if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
 		return;
 
 	/* Get the tracefs root inode. */
-	inode = d_inode(dentry->d_sb->s_root);
-	ei->attr.uid = inode->i_uid;
-	ei->attr.gid = inode->i_gid;
+	root = d_inode(sb->s_root);
+	ei->attr.uid = root->i_uid;
+	ei->attr.gid = root->i_gid;
 }
 
 static void set_top_events_ownership(struct inode *inode)
 {
 	struct tracefs_inode *ti = get_tracefs(inode);
 	struct eventfs_inode *ei = ti->private;
-	struct dentry *dentry;
 
 	/* The top events directory doesn't get automatically updated */
 	if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
 		return;
 
-	dentry = ei->dentry;
-
-	update_top_events_attr(ei, dentry);
+	update_top_events_attr(ei, inode->i_sb);
 
 	if (!(ei->attr.mode & EVENTFS_SAVE_UID))
 		inode->i_uid = ei->attr.uid;
@@ -235,8 +232,10 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 
 	mutex_lock(&eventfs_mutex);
 	do {
-		/* The parent always has an ei, except for events itself */
-		ei = dentry->d_parent->d_fsdata;
+		// The parent is stable because we do not do renames
+		dentry = dentry->d_parent;
+		// ... and directories always have d_fsdata
+		ei = dentry->d_fsdata;
 
 		/*
 		 * If the ei is being freed, the ownership of the children
@@ -246,12 +245,11 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 			ei = NULL;
 			break;
 		}
-
-		dentry = ei->dentry;
+		// Walk upwards until you find the events inode
 	} while (!ei->is_events);
 	mutex_unlock(&eventfs_mutex);
 
-	update_top_events_attr(ei, dentry);
+	update_top_events_attr(ei, dentry->d_sb);
 
 	return ei;
 }
-- 
2.43.0.5.g38fb137bdb


[-- Attachment #3: 0003-eventfsfs-initialize-the-tracefs-inode-properly.patch --]
[-- Type: text/x-patch, Size: 2548 bytes --]

From 6e5db10ebc96ebe6b9707c9938c450f51e9a3ae0 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 29 Jan 2024 11:06:32 -0800
Subject: [PATCH 3/4] eventfsfs: initialize the tracefs inode properly

The tracefs-specific fields in the inode were not initialized before the
inode was exposed to others through the dentry with 'd_instantiate()'.

And the ->flags file was initialized incorrectly with a '|=', when the
old value was stale.  It should have just been a straight assignment.

Move the field initializations up to before the d_instantiate, and fix
the use of uninitialized data.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2d128bedd654..c0d977e6c0f2 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	inode->i_ino = EVENTFS_FILE_INODE_INO;
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = NULL;			// Directories have 'ei', files not
+
 	d_instantiate(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
@@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	inode->i_ino = eventfs_dir_ino(ei);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = ei;
 
 	inc_nlink(inode);
 	d_instantiate(dentry, inode);
@@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 static void eventfs_post_create_dir(struct eventfs_inode *ei)
 {
 	struct eventfs_inode *ei_child;
-	struct tracefs_inode *ti;
 
 	lockdep_assert_held(&eventfs_mutex);
 
@@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei)
 				 srcu_read_lock_held(&eventfs_srcu)) {
 		ei_child->d_parent = ei->dentry;
 	}
-
-	ti = get_tracefs(ei->dentry->d_inode);
-	ti->private = ei;
 }
 
 /**
@@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	INIT_LIST_HEAD(&ei->list);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+	ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
 	ti->private = ei;
 
 	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
-- 
2.43.0.5.g38fb137bdb


[-- Attachment #4: 0001-tracefs-remove-stale-update_gid-code.patch --]
[-- Type: text/x-patch, Size: 2612 bytes --]

From 41699c9c5830ca662badc3fa0d8cb59c491ce775 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 27 Jan 2024 13:21:14 -0800
Subject: [PATCH 1/4] tracefs: remove stale 'update_gid' code

The 'eventfs_update_gid()' function is no longer called, so remove it
(and the helper function it uses).

Fixes: 8186fff7ab64 ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 38 --------------------------------------
 fs/tracefs/internal.h    |  1 -
 2 files changed, 39 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 6b211522a13e..1c3dd0ad4660 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -281,44 +281,6 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode,
 		inode->i_gid = attr->gid;
 }
 
-static void update_gid(struct eventfs_inode *ei, kgid_t gid, int level)
-{
-	struct eventfs_inode *ei_child;
-
-	/* at most we have events/system/event */
-	if (WARN_ON_ONCE(level > 3))
-		return;
-
-	ei->attr.gid = gid;
-
-	if (ei->entry_attrs) {
-		for (int i = 0; i < ei->nr_entries; i++) {
-			ei->entry_attrs[i].gid = gid;
-		}
-	}
-
-	/*
-	 * Only eventfs_inode with dentries are updated, make sure
-	 * all eventfs_inodes are updated. If one of the children
-	 * do not have a dentry, this function must traverse it.
-	 */
-	list_for_each_entry_srcu(ei_child, &ei->children, list,
-				 srcu_read_lock_held(&eventfs_srcu)) {
-		if (!ei_child->dentry)
-			update_gid(ei_child, gid, level + 1);
-	}
-}
-
-void eventfs_update_gid(struct dentry *dentry, kgid_t gid)
-{
-	struct eventfs_inode *ei = dentry->d_fsdata;
-	int idx;
-
-	idx = srcu_read_lock(&eventfs_srcu);
-	update_gid(ei, gid, 0);
-	srcu_read_unlock(&eventfs_srcu, idx);
-}
-
 /**
  * create_file - create a file in the tracefs filesystem
  * @name: the name of the file to create.
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 45397df9bb65..91c2bf0b91d9 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -82,7 +82,6 @@ struct inode *tracefs_get_inode(struct super_block *sb);
 struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
 struct dentry *eventfs_failed_creating(struct dentry *dentry);
 struct dentry *eventfs_end_creating(struct dentry *dentry);
-void eventfs_update_gid(struct dentry *dentry, kgid_t gid);
 void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry);
 
 #endif /* _TRACEFS_INTERNAL_H */
-- 
2.43.0.5.g38fb137bdb


[-- Attachment #5: 0004-tracefs-dentry-lookup-crapectomy.patch --]
[-- Type: text/x-patch, Size: 11761 bytes --]

From 9fe4f80dba499a1367f6563b205837efa54c9a93 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 29 Jan 2024 12:25:53 -0800
Subject: [PATCH 4/4] tracefs: dentry lookup crapectomy

The dentry lookup for eventfs files was very broken, and had lots of
signs of the old situation where the filesystem names were all created
statically in the dentry tree, rather than being looked up dynamically
based on the eventfs data structures.

You could see it in the naming - how it claimed to "create" dentries
rather than just look up the dentries that were given it.

You could see it in various nonsensical and very incorrect operations,
like using "simple_lookup()" on the dentries that were passed in, which
only results in those dentries becoming negative dentries.  Which meant
that any other lookup would possibly return ENOENT if it saw that
negative dentry before the data rwas then later filled in.

You could see it in the immesnse amount of nonsensical code that didn't
actually just do lookups.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 269 ++++++++-------------------------------
 1 file changed, 50 insertions(+), 219 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index c0d977e6c0f2..cd6de3244442 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -280,11 +280,10 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode,
 }
 
 /**
- * create_file - create a file in the tracefs filesystem
- * @name: the name of the file to create.
+ * lookup_file - look up a file in the tracefs filesystem
+ * @dentry: the dentry to look up
  * @mode: the permission that the file should have.
  * @attr: saved attributes changed by user
- * @parent: parent dentry for this file.
  * @data: something that the caller will want to get to later on.
  * @fop: struct file_operations that should be used for this file.
  *
@@ -292,13 +291,13 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode,
  * directory. The inode.i_private pointer will point to @data in the open()
  * call.
  */
-static struct dentry *create_file(const char *name, umode_t mode,
+static struct dentry *lookup_file(struct dentry *dentry,
+				  umode_t mode,
 				  struct eventfs_attr *attr,
-				  struct dentry *parent, void *data,
+				  void *data,
 				  const struct file_operations *fop)
 {
 	struct tracefs_inode *ti;
-	struct dentry *dentry;
 	struct inode *inode;
 
 	if (!(mode & S_IFMT))
@@ -307,12 +306,6 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	if (WARN_ON_ONCE(!S_ISREG(mode)))
 		return NULL;
 
-	WARN_ON_ONCE(!parent);
-	dentry = eventfs_start_creating(name, parent);
-
-	if (IS_ERR(dentry))
-		return dentry;
-
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
 		return eventfs_failed_creating(dentry);
@@ -337,23 +330,19 @@ static struct dentry *create_file(const char *name, umode_t mode,
 };
 
 /**
- * create_dir - create a dir in the tracefs filesystem
+ * lookup_dir_entry - look up a dir in the tracefs filesystem
+ * @dentry: the directory to look up
  * @ei: the eventfs_inode that represents the directory to create
- * @parent: parent dentry for this file.
  *
- * This function will create a dentry for a directory represented by
+ * This function will look up a dentry for a directory represented by
  * a eventfs_inode.
  */
-static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent)
+static struct dentry *lookup_dir_entry(struct dentry *dentry,
+	struct eventfs_inode *pei, struct eventfs_inode *ei)
 {
 	struct tracefs_inode *ti;
-	struct dentry *dentry;
 	struct inode *inode;
 
-	dentry = eventfs_start_creating(ei->name, parent);
-	if (IS_ERR(dentry))
-		return dentry;
-
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
 		return eventfs_failed_creating(dentry);
@@ -372,6 +361,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = ei;
 
+	dentry->d_fsdata = ei;
+        ei->dentry = dentry;	// Remove me!
+
 	inc_nlink(inode);
 	d_instantiate(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
@@ -426,7 +418,7 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
 }
 
 /**
- * create_file_dentry - create a dentry for a file of an eventfs_inode
+ * lookup_file_dentry - create a dentry for a file of an eventfs_inode
  * @ei: the eventfs_inode that the file will be created under
  * @idx: the index into the d_children[] of the @ei
  * @parent: The parent dentry of the created file.
@@ -439,157 +431,21 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
  * address located at @e_dentry.
  */
 static struct dentry *
-create_file_dentry(struct eventfs_inode *ei, int idx,
-		   struct dentry *parent, const char *name, umode_t mode, void *data,
+lookup_file_dentry(struct dentry *dentry,
+		   struct eventfs_inode *ei, int idx,
+		   umode_t mode, void *data,
 		   const struct file_operations *fops)
 {
 	struct eventfs_attr *attr = NULL;
 	struct dentry **e_dentry = &ei->d_children[idx];
-	struct dentry *dentry;
 
-	WARN_ON_ONCE(!inode_is_locked(parent->d_inode));
-
-	mutex_lock(&eventfs_mutex);
-	if (ei->is_freed) {
-		mutex_unlock(&eventfs_mutex);
-		return NULL;
-	}
-	/* If the e_dentry already has a dentry, use it */
-	if (*e_dentry) {
-		dget(*e_dentry);
-		mutex_unlock(&eventfs_mutex);
-		return *e_dentry;
-	}
-
-	/* ei->entry_attrs are protected by SRCU */
 	if (ei->entry_attrs)
 		attr = &ei->entry_attrs[idx];
 
-	mutex_unlock(&eventfs_mutex);
+	dentry->d_fsdata = ei;		// NOTE: ei of _parent_
+	lookup_file(dentry, mode, attr, data, fops);
 
-	dentry = create_file(name, mode, attr, parent, data, fops);
-
-	mutex_lock(&eventfs_mutex);
-
-	if (IS_ERR_OR_NULL(dentry)) {
-		/*
-		 * When the mutex was released, something else could have
-		 * created the dentry for this e_dentry. In which case
-		 * use that one.
-		 *
-		 * If ei->is_freed is set, the e_dentry is currently on its
-		 * way to being freed, don't return it. If e_dentry is NULL
-		 * it means it was already freed.
-		 */
-		if (ei->is_freed) {
-			dentry = NULL;
-		} else {
-			dentry = *e_dentry;
-			dget(dentry);
-		}
-		mutex_unlock(&eventfs_mutex);
-		return dentry;
-	}
-
-	if (!*e_dentry && !ei->is_freed) {
-		*e_dentry = dentry;
-		dentry->d_fsdata = ei;
-	} else {
-		/*
-		 * Should never happen unless we get here due to being freed.
-		 * Otherwise it means two dentries exist with the same name.
-		 */
-		WARN_ON_ONCE(!ei->is_freed);
-		dentry = NULL;
-	}
-	mutex_unlock(&eventfs_mutex);
-
-	return dentry;
-}
-
-/**
- * eventfs_post_create_dir - post create dir routine
- * @ei: eventfs_inode of recently created dir
- *
- * Map the meta-data of files within an eventfs dir to their parent dentry
- */
-static void eventfs_post_create_dir(struct eventfs_inode *ei)
-{
-	struct eventfs_inode *ei_child;
-
-	lockdep_assert_held(&eventfs_mutex);
-
-	/* srcu lock already held */
-	/* fill parent-child relation */
-	list_for_each_entry_srcu(ei_child, &ei->children, list,
-				 srcu_read_lock_held(&eventfs_srcu)) {
-		ei_child->d_parent = ei->dentry;
-	}
-}
-
-/**
- * create_dir_dentry - Create a directory dentry for the eventfs_inode
- * @pei: The eventfs_inode parent of ei.
- * @ei: The eventfs_inode to create the directory for
- * @parent: The dentry of the parent of this directory
- *
- * This creates and attaches a directory dentry to the eventfs_inode @ei.
- */
-static struct dentry *
-create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
-		  struct dentry *parent)
-{
-	struct dentry *dentry = NULL;
-
-	WARN_ON_ONCE(!inode_is_locked(parent->d_inode));
-
-	mutex_lock(&eventfs_mutex);
-	if (pei->is_freed || ei->is_freed) {
-		mutex_unlock(&eventfs_mutex);
-		return NULL;
-	}
-	if (ei->dentry) {
-		/* If the eventfs_inode already has a dentry, use it */
-		dentry = ei->dentry;
-		dget(dentry);
-		mutex_unlock(&eventfs_mutex);
-		return dentry;
-	}
-	mutex_unlock(&eventfs_mutex);
-
-	dentry = create_dir(ei, parent);
-
-	mutex_lock(&eventfs_mutex);
-
-	if (IS_ERR_OR_NULL(dentry) && !ei->is_freed) {
-		/*
-		 * When the mutex was released, something else could have
-		 * created the dentry for this e_dentry. In which case
-		 * use that one.
-		 *
-		 * If ei->is_freed is set, the e_dentry is currently on its
-		 * way to being freed.
-		 */
-		dentry = ei->dentry;
-		if (dentry)
-			dget(dentry);
-		mutex_unlock(&eventfs_mutex);
-		return dentry;
-	}
-
-	if (!ei->dentry && !ei->is_freed) {
-		ei->dentry = dentry;
-		eventfs_post_create_dir(ei);
-		dentry->d_fsdata = ei;
-	} else {
-		/*
-		 * Should never happen unless we get here due to being freed.
-		 * Otherwise it means two dentries exist with the same name.
-		 */
-		WARN_ON_ONCE(!ei->is_freed);
-		dentry = NULL;
-	}
-	mutex_unlock(&eventfs_mutex);
+	*e_dentry = dentry;	// Remove me
 
 	return dentry;
 }
@@ -608,79 +464,54 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags)
 {
-	const struct file_operations *fops;
-	const struct eventfs_entry *entry;
 	struct eventfs_inode *ei_child;
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
-	struct dentry *ei_dentry = NULL;
-	struct dentry *ret = NULL;
-	struct dentry *d;
 	const char *name = dentry->d_name.name;
-	umode_t mode;
-	void *data;
-	int idx;
-	int i;
-	int r;
 
 	ti = get_tracefs(dir);
 	if (!(ti->flags & TRACEFS_EVENT_INODE))
-		return NULL;
+		return ERR_PTR(-EIO);
 
-	/* Grab srcu to prevent the ei from going away */
-	idx = srcu_read_lock(&eventfs_srcu);
-
-	/*
-	 * Grab the eventfs_mutex to consistent value from ti->private.
-	 * This s
-	 */
 	mutex_lock(&eventfs_mutex);
-	ei = READ_ONCE(ti->private);
-	if (ei && !ei->is_freed)
-		ei_dentry = READ_ONCE(ei->dentry);
-	mutex_unlock(&eventfs_mutex);
 
-	if (!ei || !ei_dentry)
-		goto out;
+	ei = ti->private;
+	if (!ei || ei->is_freed)
+		goto enoent;
 
-	data = ei->data;
-
-	list_for_each_entry_srcu(ei_child, &ei->children, list,
-				 srcu_read_lock_held(&eventfs_srcu)) {
+	list_for_each_entry(ei_child, &ei->children, list) {
 		if (strcmp(ei_child->name, name) != 0)
 			continue;
-		ret = simple_lookup(dir, dentry, flags);
-		if (IS_ERR(ret))
-			goto out;
-		d = create_dir_dentry(ei, ei_child, ei_dentry);
-		dput(d);
+		if (ei_child->is_freed)
+			goto enoent;
+		lookup_dir_entry(dentry, ei, ei_child);
 		goto out;
 	}
 
-	for (i = 0; i < ei->nr_entries; i++) {
-		entry = &ei->entries[i];
-		if (strcmp(name, entry->name) == 0) {
-			void *cdata = data;
-			mutex_lock(&eventfs_mutex);
-			/* If ei->is_freed, then the event itself may be too */
-			if (!ei->is_freed)
-				r = entry->callback(name, &mode, &cdata, &fops);
-			else
-				r = -1;
-			mutex_unlock(&eventfs_mutex);
-			if (r <= 0)
-				continue;
-			ret = simple_lookup(dir, dentry, flags);
-			if (IS_ERR(ret))
-				goto out;
-			d = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops);
-			dput(d);
-			break;
-		}
+	for (int i = 0; i < ei->nr_entries; i++) {
+		void *data;
+		umode_t mode;
+		const struct file_operations *fops;
+		const struct eventfs_entry *entry = &ei->entries[i];
+
+		if (strcmp(name, entry->name) != 0)
+			continue;
+
+		data = ei->data;
+		if (entry->callback(name, &mode, &data, &fops) <= 0)
+			goto enoent;
+
+		lookup_file_dentry(dentry, ei, i, mode, data, fops);
+		goto out;
 	}
+
+ enoent:
+	/* Nothing found? */
+	d_add(dentry, NULL);
+
  out:
-	srcu_read_unlock(&eventfs_srcu, idx);
-	return ret;
+	mutex_unlock(&eventfs_mutex);
+	return NULL;
 }
 
 /*
-- 
2.43.0.5.g38fb137bdb


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 20:51             ` Linus Torvalds
@ 2024-01-29 21:45               ` Steven Rostedt
  2024-01-29 22:19                 ` Linus Torvalds
  2024-01-29 21:55               ` Steven Rostedt
  2024-01-29 22:22               ` Steven Rostedt
  2 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2024-01-29 21:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 12:51:59 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> End result: what simple_lookup() does is say "oh, you didn't have the
> file, so it's by definition a negative dentry", and thus all it does
> is to do "d_add(dentry, NULL)".
> 
> Anyway, removing this was painful. I initially thought "I'll just
> remove the calls". But it all ended up cascading into "that's also
> wrong".
> 
> So now I have a patch that tries to fix this all up, and it looks like thisL:
> 
>  1 file changed, 50 insertions(+), 219 deletions(-)

Thanks, much appreciated.

> 
> because it basically removed all the old code, and replaced it with
> much simpler code.
> 
> I'm including the patch here as an attachment, but I want to note very
> clearly that this *builds* for me, and it looks a *lot* more obvious
> and correct than the old code did, but I haven't tested it. AT ALL.

I'm going to stare at them as I test them. Because I want to understand
them. I may come back with questions.

> 
> Also note that it depends on my previous patches, so I guess I'll
> include them here again just to make it unambiguous.
> 
> Finally - this does *not* fix up the refcounting. I still think the
> SRCU stuff is completely broken. But that's another headache. But at
> least now the *lookup* parts look like they DTRT wrt eventfs_mutex.
> 
> The SRCU logic from the directory iteration parts still needs crapectomy.

I think the not dropping the mutex lock lets me get rid of the SRCU. I
added the SRCU when I was hitting the deadlocks with the iput code which
I'm not hitting anymore. So getting rid of the SRCU shouldn't be hard.

> 
> AGAIN: these patches (ie particularly that last one - 0004) were all
> done entirely "blindly" - I've looked at the code, and fixed the bugs
> and problems I've seen by pure code inspection.
> 
> That's great, but it really means that it's all untested. It *looks*
> better than the old code, but there may be some silly gotcha that I
> have missed.

I'll let you know. 

Oh, does b4 handle attachments? Because this breaks the patchwork flow.
I haven't used b4 yet.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 20:51             ` Linus Torvalds
  2024-01-29 21:45               ` Steven Rostedt
@ 2024-01-29 21:55               ` Steven Rostedt
  2024-01-29 22:22               ` Steven Rostedt
  2 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-29 21:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 12:51:59 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [0001-tracefs-remove-stale-update_gid-code.patch  text/x-patch (2612 bytes)] 

Oh, I already applied this and even sent you a pull request with it.

  https://lore.kernel.org/all/20240128223151.2dad6599@rorschach.local.home/

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 21:45               ` Steven Rostedt
@ 2024-01-29 22:19                 ` Linus Torvalds
  0 siblings, 0 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-29 22:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 at 13:45, Steven Rostedt <rostedt@goodmis.org> wrote:
> >  1 file changed, 50 insertions(+), 219 deletions(-)
>
> Thanks, much appreciated.

Well, I decided I might as well give it a test-run, and there's an
immediate deadlock on eventfs_mutex, because I missed removing it from
eventfs_find_events() when the callers now already hold it.

So at a minimum, it will require this patch on top:

  --- a/fs/tracefs/event_inode.c
  +++ b/fs/tracefs/event_inode.c
  @@ -230,7 +230,6 @@ static struct eventfs_inode
*eventfs_find_events(
   {
        struct eventfs_inode *ei;

  -     mutex_lock(&eventfs_mutex);
        do {
                // The parent is stable because we do not do renames
                dentry = dentry->d_parent;
  @@ -247,7 +246,6 @@
                }
                // Walk upwards until you find the events inode
        } while (!ei->is_events);
  -     mutex_unlock(&eventfs_mutex);

        update_top_events_attr(ei, dentry->d_sb);

to not deadlock immediately on the first lookup.

And honestly, there might be other such obvious "I missed that when
reading the code".

Let me reboot into a fixed system and do some more basic smoke-testing.

                Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 20:51             ` Linus Torvalds
  2024-01-29 21:45               ` Steven Rostedt
  2024-01-29 21:55               ` Steven Rostedt
@ 2024-01-29 22:22               ` Steven Rostedt
  2024-01-29 22:35                 ` Linus Torvalds
  2 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2024-01-29 22:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 12:51:59 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [0004-tracefs-dentry-lookup-crapectomy.patch  text/x-patch (11761 bytes)] 

I had to add:

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cd6de3244442..89897d934302 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
 	struct eventfs_inode *ei;
 
-	mutex_lock(&eventfs_mutex);
 	do {
 		// The parent is stable because we do not do renames
 		dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 		}
 		// Walk upwards until you find the events inode
 	} while (!ei->is_events);
-	mutex_unlock(&eventfs_mutex);
 
 	update_top_events_attr(ei, dentry->d_sb);
 

As eventfs_find_events() is only called by update_inode_attr() which is
only called by: lookup_file() and lookup_dir_entry() which are called by
eventfs_root_lookup() where eventfs_mutex is already held.

But crashes with just a:

 # ls /sys/kernel/tracing/events

[   66.423983] ------------[ cut here ]------------
[   66.426447] kernel BUG at fs/dcache.c:1876!
[   66.428363] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[   66.430320] CPU: 4 PID: 863 Comm: ls Not tainted 6.8.0-rc1-test-00009-gcaff43732484-dirty #463
[   66.433192] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   66.436122] RIP: 0010:d_instantiate+0x69/0x70
[   66.437537] Code: 00 4c 89 e7 e8 18 f0 49 01 48 89 df 48 89 ee e8 0d fe ff ff 4c 89 e7 5b 5d 41 5c e9 e1 f3 49 01 5b 5d 41 5c c3 cc cc cc cc 90 <0f> 0b 90 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[   66.442759] RSP: 0018:ffffc900015f7830 EFLAGS: 00010282
[   66.444315] RAX: 0000000000000000 RBX: ffff8881006aec40 RCX: ffffffffb77fcbff
[   66.446237] RDX: dffffc0000000000 RSI: ffff888127c46ec0 RDI: ffff8881006aed70
[   66.448170] RBP: ffff888127c46ec0 R08: 0000000000000001 R09: fffffbfff7766c4f
[   66.450094] R10: ffffffffbbb3627f R11: 0000000000000000 R12: 00000000000081a0
[   66.452007] R13: ffff88810f850000 R14: 0000000000000000 R15: 0000000000000000
[   66.455224] FS:  00007fe1e56d9800(0000) GS:ffff88823c800000(0000) knlGS:0000000000000000
[   66.457484] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   66.459139] CR2: 0000556b5293b000 CR3: 000000010f056001 CR4: 0000000000170ef0
[   66.461192] Call Trace:
[   66.462094]  <TASK>
[   66.462915]  ? die+0x36/0x90
[   66.463928]  ? do_trap+0x133/0x230
[   66.465089]  ? d_instantiate+0x69/0x70
[   66.466325]  ? d_instantiate+0x69/0x70
[   66.467532]  ? do_error_trap+0x90/0x130
[   66.468787]  ? d_instantiate+0x69/0x70
[   66.470020]  ? handle_invalid_op+0x2c/0x40
[   66.471340]  ? d_instantiate+0x69/0x70
[   66.472559]  ? exc_invalid_op+0x2e/0x50
[   66.473807]  ? asm_exc_invalid_op+0x1a/0x20
[   66.475154]  ? d_instantiate+0x1f/0x70
[   66.476396]  ? d_instantiate+0x69/0x70
[   66.477629]  eventfs_root_lookup+0x366/0x660
[   66.479021]  ? __pfx_eventfs_root_lookup+0x10/0x10
[   66.480567]  ? print_circular_bug_entry+0x170/0x170
[   66.482107]  ? lockdep_init_map_type+0xd3/0x3a0
[   66.485243]  __lookup_slow+0x194/0x2a0
[   66.486410]  ? __pfx___lookup_slow+0x10/0x10
[   66.487694]  ? rwsem_read_trylock+0x118/0x1b0
[   66.489057]  ? i915_ttm_backup+0x2a0/0x5e0
[   66.490358]  ? down_read+0xbb/0x240
[   66.491506]  ? down_read+0xbb/0x240
[   66.492674]  ? trace_preempt_on+0xc8/0xe0
[   66.493962]  ? i915_ttm_backup+0x2a0/0x5e0
[   66.495291]  walk_component+0x166/0x220
[   66.496564]  path_lookupat+0xa9/0x2e0
[   66.497766]  ? __pfx_mark_lock+0x10/0x10
[   66.499026]  filename_lookup+0x19c/0x350
[   66.500335]  ? __pfx_filename_lookup+0x10/0x10
[   66.501688]  ? __pfx___lock_acquire+0x10/0x10
[   66.502996]  ? __pfx___lock_acquire+0x10/0x10
[   66.504305]  ? _raw_read_unlock_irqrestore+0x40/0x80
[   66.505765]  ? stack_depot_save_flags+0x1f0/0x790
[   66.507137]  vfs_statx+0xe1/0x270
[   66.508196]  ? __pfx_vfs_statx+0x10/0x10
[   66.509385]  ? __virt_addr_valid+0x155/0x330
[   66.510667]  ? __pfx_lock_release+0x10/0x10
[   66.511924]  do_statx+0xac/0x110
[   66.515394]  ? __pfx_do_statx+0x10/0x10
[   66.516631]  ? getname_flags.part.0+0xd6/0x260
[   66.517956]  __x64_sys_statx+0xa0/0xc0
[   66.519079]  do_syscall_64+0xca/0x1e0
[   66.520206]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
[   66.521674] RIP: 0033:0x7fe1e586e2ea
[   66.522780] Code: 48 8b 05 31 bb 0d 00 ba ff ff ff ff 64 c7 00 16 00 00 00 e9 a5 fd ff ff e8 03 10 02 00 0f 1f 00 41 89 ca b8 4c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2e 89 c1 85 c0 74 0f 48 8b 05 f9 ba 0d 00 64
[   66.527728] RSP: 002b:00007fff06f5ef98 EFLAGS: 00000246 ORIG_RAX: 000000000000014c
[   66.529922] RAX: ffffffffffffffda RBX: 0000556b52935d18 RCX: 00007fe1e586e2ea
[   66.531863] RDX: 0000000000000900 RSI: 00007fff06f5f0d0 RDI: 00000000ffffff9c
[   66.533832] RBP: 0000000000000002 R08: 00007fff06f5efa0 R09: 0000000000000000
[   66.535768] R10: 0000000000000002 R11: 0000000000000246 R12: 00007fff06f5f0d0
[   66.537733] R13: 0000000000000005 R14: 0000556b52935d00 R15: 0000000000000003
[   66.539682]  </TASK>
[   66.540454] Modules linked in: vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common vsock ip_tables
[   66.543272] ---[ end trace 0000000000000000 ]---


void d_instantiate(struct dentry *entry, struct inode * inode)
{
	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));  <<<--- HERE
	if (inode) {
		security_d_instantiate(entry, inode);
		spin_lock(&inode->i_lock);
		__d_instantiate(entry, inode);
		spin_unlock(&inode->i_lock);
	}
}

-- Steve

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 22:22               ` Steven Rostedt
@ 2024-01-29 22:35                 ` Linus Torvalds
  2024-01-29 22:42                   ` Linus Torvalds
  2024-01-29 22:47                   ` Steven Rostedt
  0 siblings, 2 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-29 22:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 at 14:21, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> But crashes with just a:
>
>  # ls /sys/kernel/tracing/events
>
> [   66.423983] ------------[ cut here ]------------
> [   66.426447] kernel BUG at fs/dcache.c:1876!

Duh.

That's a bit too much copy-and-paste by me.

So what is going on is that a ->lookup() function should *not* call
d_instantiate() at all, and the only reason it actually used to work
here was due to the incorrect "simple_lookup()", which basically did
all the preliminaries.

A ->lookup() should do 'd_add()' on the dentry.

So just replace all the d_instantiate() calls there with "d_add()"
instead. I think that will fix it.

Basically the "simple_lookup()" had done the "d_add(dentry, NULL)",
and at that point the "d_instantiate()" just exposed the inode and
turned the negative dentry into a positive one.

So "d_add()" is "I'm adding the inode to a new dentry under lookup".
And "d_instantiate()" is "I'm adding this inode to an existing dentry
that used to be negative"

And so the old "d_add(NULL)+d_instantiate(inode)" _kind_ of worked,
except it made that negative dentry visible for a short while.

And when I did the cleanup, I didn't think of this thing, so I left
the d_instantiate() calls as such, even though they now really need to
be d_add().

Hope that explains it.

And I hope there aren't any other stupid things I missed like that.

         Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 22:35                 ` Linus Torvalds
@ 2024-01-29 22:42                   ` Linus Torvalds
  2024-01-29 22:49                     ` Steven Rostedt
  2024-01-29 22:47                   ` Steven Rostedt
  1 sibling, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2024-01-29 22:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 535 bytes --]

On Mon, 29 Jan 2024 at 14:35, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So just replace all the d_instantiate() calls there with "d_add()"
> instead. I think that will fix it.

I can confirm that with the mutex deadlock removed and the d_add()
fix, at least things *look* superficially ok.

I didn't actually do anything with it. So it might be leaking dentry
refs like mad or something like that, but at least the obvious cases
look fine.

just for completeness, here's the fixup diff I used.

                  Linus

[-- Attachment #2: fixup.diff --]
[-- Type: text/x-patch, Size: 1795 bytes --]

 fs/tracefs/event_inode.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cd6de3244442..5b307bb64f8f 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
 	struct eventfs_inode *ei;
 
-	mutex_lock(&eventfs_mutex);
 	do {
 		// The parent is stable because we do not do renames
 		dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 		}
 		// Walk upwards until you find the events inode
 	} while (!ei->is_events);
-	mutex_unlock(&eventfs_mutex);
 
 	update_top_events_attr(ei, dentry->d_sb);
 
@@ -324,7 +322,7 @@ static struct dentry *lookup_file(struct dentry *dentry,
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = NULL;			// Directories have 'ei', files not
 
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
 };
@@ -365,7 +363,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
         ei->dentry = dentry;	// Remove me!
 
 	inc_nlink(inode);
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
 	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
@@ -786,7 +784,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 
 	/* directory inodes start off with i_nlink == 2 (for "." entry) */
 	inc_nlink(inode);
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
 	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
 	tracefs_end_creating(dentry);

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 22:35                 ` Linus Torvalds
  2024-01-29 22:42                   ` Linus Torvalds
@ 2024-01-29 22:47                   ` Steven Rostedt
  2024-01-29 23:15                     ` Steven Rostedt
  1 sibling, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2024-01-29 22:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 14:35:37 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And I hope there aren't any other stupid things I missed like that.

Well the preliminary tests pass with this added to your patch:

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cd6de3244442..ad11063bdd53 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
 	struct eventfs_inode *ei;
 
-	mutex_lock(&eventfs_mutex);
 	do {
 		// The parent is stable because we do not do renames
 		dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 		}
 		// Walk upwards until you find the events inode
 	} while (!ei->is_events);
-	mutex_unlock(&eventfs_mutex);
 
 	update_top_events_attr(ei, dentry->d_sb);
 
@@ -324,7 +322,7 @@ static struct dentry *lookup_file(struct dentry *dentry,
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = NULL;			// Directories have 'ei', files not
 
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
 };
@@ -365,7 +363,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
         ei->dentry = dentry;	// Remove me!
 
 	inc_nlink(inode);
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
 	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);

-- Steve

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 22:42                   ` Linus Torvalds
@ 2024-01-29 22:49                     ` Steven Rostedt
  2024-01-30  0:01                       ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2024-01-29 22:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 14:42:47 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> @@ -324,7 +322,7 @@ static struct dentry *lookup_file(struct dentry *dentry,
>  	ti->flags = TRACEFS_EVENT_INODE;
>  	ti->private = NULL;			// Directories have 'ei', files not
>  
> -	d_instantiate(dentry, inode);
> +	d_add(dentry, inode);
>  	fsnotify_create(dentry->d_parent->d_inode, dentry);
>  	return eventfs_end_creating(dentry);
>  };
> @@ -365,7 +363,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
>          ei->dentry = dentry;	// Remove me!
>  
>  	inc_nlink(inode);
> -	d_instantiate(dentry, inode);
> +	d_add(dentry, inode);
>  	inc_nlink(dentry->d_parent->d_inode);
>  	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
>  	return eventfs_end_creating(dentry);
> @@ -786,7 +784,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
>  
>  	/* directory inodes start off with i_nlink == 2 (for "." entry) */
>  	inc_nlink(inode);
> -	d_instantiate(dentry, inode);
> +	d_add(dentry, inode);

Now I didn't change this last d_instantiate, because this is not called
through the lookup code. This is the root events directory and acts more
like debugfs. It's not "dynamically" added.

-- Steve


>  	inc_nlink(dentry->d_parent->d_inode);
>  	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
>  	tracefs_end_creating(dentry);


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 22:47                   ` Steven Rostedt
@ 2024-01-29 23:15                     ` Steven Rostedt
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-29 23:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 17:47:32 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> > And I hope there aren't any other stupid things I missed like that.  
> 
> Well the preliminary tests pass with this added to your patch:

Spoke too soon. The later tests started failing.

It fails on creating a kprobe, deleting it, and then recreating it. Even
though the directory is there, it can't be accessed.

 # cd /sys/kernel/tracing

// Create a kprobe

 # echo 'p:sched schedule' >> kprobe_events 
 # ls events/kprobes/
enable  filter  sched

// Now delete the kprobe

 # echo '-:sched schedule' >> kprobe_events

// Make sure it's gone

 # ls events/kprobes/
ls: cannot access 'events/kprobes/': No such file or directory

// Recreate it

 # echo 'p:sched schedule' >> kprobe_events 
 # ls events/kprobes/
ls: cannot access 'events/kprobes/': No such file or directory
 # ls events | grep kprobes
kprobes

No longer able to access it.

 # ls -l events | grep kprobes
ls: cannot access 'events/kprobes': No such file or directory
d????????? ? ?    ?    ?            ? kprobes


-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29 22:49                     ` Steven Rostedt
@ 2024-01-30  0:01                       ` Linus Torvalds
  2024-01-30  0:35                         ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2024-01-30  0:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 at 14:49, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Now I didn't change this last d_instantiate, because this is not called
> through the lookup code. This is the root events directory and acts more
> like debugfs. It's not "dynamically" added.

Ahh, yes, I see, the dentry was created (as a negative one) with
tracefs_start_creating() -> lookup_one_len().

So  yes, there d_instantiate() is correct, as it's exactly that "turn
negative dentry into a positive one" case.

I'll go see what's up with the "create it again" case - I don't
immediately see what's wrong.

              Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30  0:01                       ` Linus Torvalds
@ 2024-01-30  0:35                         ` Steven Rostedt
  2024-01-30  1:50                           ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2024-01-30  0:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 16:01:25 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I'll go see what's up with the "create it again" case - I don't
> immediately see what's wrong.

Interesting. I added a printk in the lookup, and just did this:

 # cd /sys/kernel/tracing
 # ls events/kprobes

And it showed that it tried to see if "kprobes" existed in the lookup.
Which it did not because I haven't created any kprobes yet.

Then I did:

 # echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events
 # ls -l events/kprobes/
ls: cannot access 'events/kprobes/': No such file or directory

Where it should now exist but doesn't. But the lookup code never triggered.

If the lookup fails, does it cache the result?

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30  0:35                         ` Steven Rostedt
@ 2024-01-30  1:50                           ` Linus Torvalds
  2024-01-30  3:56                             ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2024-01-30  1:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 at 16:35, Steven Rostedt <rostedt@goodmis.org> wrote:
>
>  # echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events
>  # ls -l events/kprobes/
> ls: cannot access 'events/kprobes/': No such file or directory
>
> Where it should now exist but doesn't. But the lookup code never triggered.
>
> If the lookup fails, does it cache the result?

I think you end up still having negative dentries around.

The old code then tried to compensate for that by trying to remember
the old dentry with 'ei->dentry' and 'ei->d_children[]', and would at
lookup time try to use the *old* dentry instead of the new one.

And because dentries are just caches and can go away, it then had that
odd dance with '.d_iput', so that when a dentry was removed, it would
be removed from the 'ei->dentry' and 'ei->d_children[]' array too.

Except that d_iput() of an old dentry isn't actually serialized with
->d_lookup() in any way, so you end up with the whole race that I
already talked about earlier, where you could still have an
'ei->dentry' that pointed to something that had already been unhashed,
but d_iput() hadn't been called *yet*, so d_lookup() is called with a
new dentry, but the tracefs code then desperately tries to use the old
dentry pointer that just isn't _valid_ any more, but it doesn't know
that because d_iput() hasn't been called yet...

And as I *also* pointed out when I described that originally, you'll
practically never hit this race, because you just need to be *very*
unlucky with the whole "dentry is freed due to memory pressure".

But basically, this is why I absolutely *HATE* that "ei->dentry"
backpointer. It's truly fundamentally broken.

You can't reference-count it, since the whole point of your current
tracefs scheme is to *not* keep dentries and inodes around forever,
and doing a "dget()" on that 'ei->dentry' would thus fundamentally
screw that up.

But you also cannot keep it in sync with dentries being released due
to memory pressure, because of the above thing.

See why I've tried to tell you that the back-pointer is basically a
100% sign of a bug.

The *only* time you can have a valid dentry pointer is when you have
also taken a ref to it with dget(), and you can't do that.

So then you have all that completely broken code that _tries_ to
maintain consistency with ->d_children[] etc, and it works 99.9% in
practice, because the race is just so hard to hit because dentries
only normally get evicted either synchronously (which you do under the
eventfs_mutex) or under memory pressure (which is basically never
going to be something you can test).

And yes, my lookup patch removed all the band-aids for "if I have an
ei->dentry, I'll reuse it". So I think it ends up exposing all the
previous bugs that the old "let's reuse the old dentry" code tried to
hide.

But, as mentioned, that ei->dentry pointer really REALLY is broken.

NBow, having looked at this a lot, I think I have a way forward.
Because there is actually *one* case where you actually *do* do the
whole "dget()" to get a stable dentry pointer. And that's exactly the
"events" directory creation (ie eventfs_create_events_dir()).

So what I propose is that

 - ei->dentry and ei->d_children[] need to die. Really. They are
buggy. There is no way to save them. There never was.

 - but we *can* introduce a new 'ei->events_dir' pointer that is
*only* set by eventfs_create_events_dir(), and which is stable exactly
because that function also does a dget() on it, so now the dentry will
actually continue to exist reliably

I think that works. The only thing that actually *needs* the existing
'ei->dentry' is literally the eventfs_remove_events_dir() that gets
rid of the stable events directory. It's undoing
eventfs_create_events_dir(), and it will do the final dput() too.

I will try to make a patch for this. I do think it means that every
time we do that

    dentry->d_fsdata = ei;

we need to also do proper reference counting of said 'ei'. Because we
can't release 'ei' early when we have dentries that point to it.

Let me see how painful this will be.

             Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-29  4:36 ` Linus Torvalds
  2024-01-29 17:01   ` Steven Rostedt
@ 2024-01-30  2:08   ` Al Viro
  1 sibling, 0 replies; 47+ messages in thread
From: Al Viro @ 2024-01-30  2:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, Steven Rostedt, oe-lkp, lkp, linux-kernel,
	Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Christian Brauner, Ajay Kaher, linux-trace-kernel

On Sun, Jan 28, 2024 at 08:36:12PM -0800, Linus Torvalds wrote:

[snip]

apologies for being MIA on that, will post tomorrow morning once I get some
sleep...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30  1:50                           ` Linus Torvalds
@ 2024-01-30  3:56                             ` Linus Torvalds
  2024-01-30  8:43                               ` Linus Torvalds
                                                 ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-30  3:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 1513 bytes --]

On Mon, 29 Jan 2024 at 17:50, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So what I propose is that
>
>  - ei->dentry and ei->d_children[] need to die. Really. They are
> buggy. There is no way to save them. There never was.
>
>  - but we *can* introduce a new 'ei->events_dir' pointer that is
> *only* set by eventfs_create_events_dir(), and which is stable exactly
> because that function also does a dget() on it, so now the dentry will
> actually continue to exist reliably
>
> I think that works.

Well, it doesn't. I don't see where the bug is, but since Al is now
aware of the thread, maybe when he wakes up he will tell me where I've
gone wrong.

In the meantime I did do the pending tracefs pull, so the series has
changed a bit, and this is the rebased series on top of my current
public git tree.

It is still broken wrt 'events' directories. You don't even need the
"create, delete, create" sequence that Steven pointed out, just a
plain sequence of

 # cd /sys/kernel/tracing
 # ls events/kprobes/
 # echo 'p:sched schedule' >> kprobe_events

messes up - ie it's enough to just have 'lookup' create a negative
dentry by trying to look up 'events/kprobes/' before actually trying
to create that kprobe_events.

But I've been staring at this code for too long, so I'm posting this
just as a "it's broken, but _something_ like this", because I'm taking
a break from looking at this.

I'll get back to it tomorrow, but I hope that Al will show me the
error of my ways.

              Linus

[-- Attachment #2: 0005-eventfs-get-rid-of-dentry-pointers-without-refcounts.patch --]
[-- Type: text/x-patch, Size: 15652 bytes --]

From 6763ac4af7ccc0c97fb5f7c98d0c8ae1289ec0fe Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 29 Jan 2024 18:49:42 -0800
Subject: [PATCH 5/5] eventfs: get rid of dentry pointers without refcounts

The eventfs inode had pointers to dentries (and child dentries) without
actually holding a refcount on said pointer.  That is fundamentally
broken, and while eventfs tried to then maintain coherence with dentries
going away by hooking into the '.d_iput' callback, that doesn't actually
work since it's not ordered wrt lookups.

There were two reasonms why eventfs tried to keep a pointer to a dentry:

 - the creation of a 'events' directory would actually have a stable
   dentry pointer that it created with tracefs_start_creating().

   And it needed that dentry when tearing it all down again in
   eventfs_remove_events_dir().

   This use is actually ok, because the special top-level events
   directory dentries are actually stable, not just a temporary cache of
   the eventfs data structures.

 - the 'eventfs_inode' (aka ei) needs to stay around as long as there
   are dentries that refer to it.

   It then used these dentry pointers as a replacement for doing
   reference counting: it would try to make sure that there was only
   ever one dentry associated with an event_inode, and keep a child
   dentry array around to see which dentries might still refer to the
   parent ei.

This gets rid of the invalid dentry pointer use, and renames the one
valid case to a different name to make it clear that it's not just any
random dentry.

The magic child dentry array that is kind of a "reverse reference list"
is simply replaced by having child dentries take a ref to the ei.  As
does the directory dentries.  That makes the broken use case go away.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 245 ++++++++++++---------------------------
 fs/tracefs/internal.h    |   9 +-
 2 files changed, 80 insertions(+), 174 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1d0102bfd7da..a37db0dac302 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -62,6 +62,34 @@ enum {
 
 #define EVENTFS_MODE_MASK	(EVENTFS_SAVE_MODE - 1)
 
+/*
+ * eventfs_inode reference count management.
+ *
+ * NOTE! We count only references from dentries, in the
+ * form 'dentry->d_fsdata'. There are also references from
+ * directory inodes ('ti->private'), but the dentry reference
+ * count is always a superset of the inode reference count.
+ */
+static void release_ei(struct kref *ref)
+{
+	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
+	kfree(ei->entry_attrs);
+	kfree(ei);
+}
+
+static inline void put_ei(struct eventfs_inode *ei)
+{
+	if (ei)
+		kref_put(&ei->kref, release_ei);
+}
+
+static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei)
+{
+	if (ei)
+		kref_get(&ei->kref);
+	return ei;
+}
+
 static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags);
@@ -289,7 +317,8 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode,
  * directory. The inode.i_private pointer will point to @data in the open()
  * call.
  */
-static struct dentry *lookup_file(struct dentry *dentry,
+static struct dentry *lookup_file(struct eventfs_inode *parent_ei,
+				  struct dentry *dentry,
 				  umode_t mode,
 				  struct eventfs_attr *attr,
 				  void *data,
@@ -302,11 +331,11 @@ static struct dentry *lookup_file(struct dentry *dentry,
 		mode |= S_IFREG;
 
 	if (WARN_ON_ONCE(!S_ISREG(mode)))
-		return NULL;
+		return ERR_PTR(-EIO);
 
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
-		return eventfs_failed_creating(dentry);
+		return ERR_PTR(-ENOMEM);
 
 	/* If the user updated the directory's attributes, use them */
 	update_inode_attr(dentry, inode, attr, mode);
@@ -322,9 +351,12 @@ static struct dentry *lookup_file(struct dentry *dentry,
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = NULL;			// Directories have 'ei', files not
 
+	// Files have their parent's ei as their fsdata
+	dentry->d_fsdata = get_ei(parent_ei);
+
 	d_add(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
-	return eventfs_end_creating(dentry);
+	return NULL;
 };
 
 /**
@@ -343,7 +375,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
 
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
-		return eventfs_failed_creating(dentry);
+		return ERR_PTR(-ENOMEM);
 
 	/* If the user updated the directory's attributes, use them */
 	update_inode_attr(dentry, inode, &ei->attr,
@@ -359,24 +391,28 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = ei;
 
-	dentry->d_fsdata = ei;
-        ei->dentry = dentry;	// Remove me!
+	dentry->d_fsdata = get_ei(ei);
 
 	inc_nlink(inode);
 	d_add(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
 	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
-	return eventfs_end_creating(dentry);
+	return NULL;
 }
 
-static void free_ei(struct eventfs_inode *ei)
+static inline struct eventfs_inode *alloc_ei(const char *name)
 {
-	kfree_const(ei->name);
-	kfree(ei->d_children);
-	kfree(ei->entry_attrs);
-	kfree(ei);
+	int namesize = strlen(name) + 1;
+	struct eventfs_inode *ei = kzalloc(sizeof(*ei) + namesize, GFP_KERNEL);
+
+	if (ei) {
+		memcpy((char *)ei->name, name, namesize);
+		kref_init(&ei->kref);
+	}
+	return ei;
 }
 
+
 /**
  * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode
  * @ti: the tracefs_inode of the dentry
@@ -387,38 +423,20 @@ static void free_ei(struct eventfs_inode *ei)
 void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
 {
 	struct eventfs_inode *ei;
-	int i;
 
 	mutex_lock(&eventfs_mutex);
-
 	ei = dentry->d_fsdata;
-	if (!ei)
-		goto out;
-
-	/* This could belong to one of the files of the ei */
-	if (ei->dentry != dentry) {
-		for (i = 0; i < ei->nr_entries; i++) {
-			if (ei->d_children[i] == dentry)
-				break;
-		}
-		if (WARN_ON_ONCE(i == ei->nr_entries))
-			goto out;
-		ei->d_children[i] = NULL;
-	} else if (ei->is_freed) {
-		free_ei(ei);
-	} else {
-		ei->dentry = NULL;
+	if (ei) {
+		dentry->d_fsdata = NULL;
+		put_ei(ei);
 	}
-
-	dentry->d_fsdata = NULL;
- out:
 	mutex_unlock(&eventfs_mutex);
 }
 
 /**
  * lookup_file_dentry - create a dentry for a file of an eventfs_inode
  * @ei: the eventfs_inode that the file will be created under
- * @idx: the index into the d_children[] of the @ei
+ * @idx: the index into the entry_attrs[] of the @ei
  * @parent: The parent dentry of the created file.
  * @name: The name of the file to create
  * @mode: The mode of the file.
@@ -435,17 +453,11 @@ lookup_file_dentry(struct dentry *dentry,
 		   const struct file_operations *fops)
 {
 	struct eventfs_attr *attr = NULL;
-	struct dentry **e_dentry = &ei->d_children[idx];
 
 	if (ei->entry_attrs)
 		attr = &ei->entry_attrs[idx];
 
-	dentry->d_fsdata = ei;		// NOTE: ei of _parent_
-	lookup_file(dentry, mode, attr, data, fops);
-
-	*e_dentry = dentry;	// Remove me
-
-	return dentry;
+	return lookup_file(ei, dentry, mode, attr, data, fops);
 }
 
 /**
@@ -466,6 +478,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
 	const char *name = dentry->d_name.name;
+	struct dentry *result;
 
 	ti = get_tracefs(dir);
 	if (!(ti->flags & TRACEFS_EVENT_INODE))
@@ -482,7 +495,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 			continue;
 		if (ei_child->is_freed)
 			goto enoent;
-		lookup_dir_entry(dentry, ei, ei_child);
+		result = lookup_dir_entry(dentry, ei, ei_child);
 		goto out;
 	}
 
@@ -499,17 +512,18 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 		if (entry->callback(name, &mode, &data, &fops) <= 0)
 			goto enoent;
 
-		lookup_file_dentry(dentry, ei, i, mode, data, fops);
+		result = lookup_file_dentry(dentry, ei, i, mode, data, fops);
 		goto out;
 	}
 
  enoent:
 	/* Nothing found? */
 	d_add(dentry, NULL);
+	result = NULL;
 
  out:
 	mutex_unlock(&eventfs_mutex);
-	return NULL;
+	return result;
 }
 
 /*
@@ -659,25 +673,10 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
 	if (!parent)
 		return ERR_PTR(-EINVAL);
 
-	ei = kzalloc(sizeof(*ei), GFP_KERNEL);
+	ei = alloc_ei(name);
 	if (!ei)
 		return ERR_PTR(-ENOMEM);
 
-	ei->name = kstrdup_const(name, GFP_KERNEL);
-	if (!ei->name) {
-		kfree(ei);
-		return ERR_PTR(-ENOMEM);
-	}
-
-	if (size) {
-		ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL);
-		if (!ei->d_children) {
-			kfree_const(ei->name);
-			kfree(ei);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
-
 	ei->entries = entries;
 	ei->nr_entries = size;
 	ei->data = data;
@@ -691,7 +690,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
 
 	/* Was the parent freed? */
 	if (list_empty(&ei->list)) {
-		free_ei(ei);
+		put_ei(ei);
 		ei = NULL;
 	}
 	return ei;
@@ -726,28 +725,20 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	if (IS_ERR(dentry))
 		return ERR_CAST(dentry);
 
-	ei = kzalloc(sizeof(*ei), GFP_KERNEL);
+	ei = alloc_ei(name);
 	if (!ei)
-		goto fail_ei;
+		goto fail;
 
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
 		goto fail;
 
-	if (size) {
-		ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL);
-		if (!ei->d_children)
-			goto fail;
-	}
-
-	ei->dentry = dentry;
+	// Note: we have a ref to the dentry from tracefs_start_creating()
+	ei->events_dir = dentry;
 	ei->entries = entries;
 	ei->nr_entries = size;
 	ei->is_events = 1;
 	ei->data = data;
-	ei->name = kstrdup_const(name, GFP_KERNEL);
-	if (!ei->name)
-		goto fail;
 
 	/* Save the ownership of this directory */
 	uid = d_inode(dentry->d_parent)->i_uid;
@@ -778,7 +769,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	inode->i_op = &eventfs_root_dir_inode_operations;
 	inode->i_fop = &eventfs_file_operations;
 
-	dentry->d_fsdata = ei;
+	dentry->d_fsdata = get_ei(ei);
 
 	/* directory inodes start off with i_nlink == 2 (for "." entry) */
 	inc_nlink(inode);
@@ -790,72 +781,11 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	return ei;
 
  fail:
-	kfree(ei->d_children);
-	kfree(ei);
- fail_ei:
+	put_ei(ei);
 	tracefs_failed_creating(dentry);
 	return ERR_PTR(-ENOMEM);
 }
 
-static LLIST_HEAD(free_list);
-
-static void eventfs_workfn(struct work_struct *work)
-{
-        struct eventfs_inode *ei, *tmp;
-        struct llist_node *llnode;
-
-	llnode = llist_del_all(&free_list);
-        llist_for_each_entry_safe(ei, tmp, llnode, llist) {
-		/* This dput() matches the dget() from unhook_dentry() */
-		for (int i = 0; i < ei->nr_entries; i++) {
-			if (ei->d_children[i])
-				dput(ei->d_children[i]);
-		}
-		/* This should only get here if it had a dentry */
-		if (!WARN_ON_ONCE(!ei->dentry))
-			dput(ei->dentry);
-        }
-}
-
-static DECLARE_WORK(eventfs_work, eventfs_workfn);
-
-static void free_rcu_ei(struct rcu_head *head)
-{
-	struct eventfs_inode *ei = container_of(head, struct eventfs_inode, rcu);
-
-	if (ei->dentry) {
-		/* Do not free the ei until all references of dentry are gone */
-		if (llist_add(&ei->llist, &free_list))
-			queue_work(system_unbound_wq, &eventfs_work);
-		return;
-	}
-
-	/* If the ei doesn't have a dentry, neither should its children */
-	for (int i = 0; i < ei->nr_entries; i++) {
-		WARN_ON_ONCE(ei->d_children[i]);
-	}
-
-	free_ei(ei);
-}
-
-static void unhook_dentry(struct dentry *dentry)
-{
-	if (!dentry)
-		return;
-	/*
-	 * Need to add a reference to the dentry that is expected by
-	 * simple_recursive_removal(), which will include a dput().
-	 */
-	dget(dentry);
-
-	/*
-	 * Also add a reference for the dput() in eventfs_workfn().
-	 * That is required as that dput() will free the ei after
-	 * the SRCU grace period is over.
-	 */
-	dget(dentry);
-}
-
 /**
  * eventfs_remove_rec - remove eventfs dir or file from list
  * @ei: eventfs_inode to be removed.
@@ -868,8 +798,6 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, int level)
 {
 	struct eventfs_inode *ei_child;
 
-	if (!ei)
-		return;
 	/*
 	 * Check recursion depth. It should never be greater than 3:
 	 * 0 - events/
@@ -881,28 +809,12 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, int level)
 		return;
 
 	/* search for nested folders or files */
-	list_for_each_entry_srcu(ei_child, &ei->children, list,
-				 lockdep_is_held(&eventfs_mutex)) {
-		/* Children only have dentry if parent does */
-		WARN_ON_ONCE(ei_child->dentry && !ei->dentry);
+	list_for_each_entry(ei_child, &ei->children, list)
 		eventfs_remove_rec(ei_child, level + 1);
-	}
-
 
 	ei->is_freed = 1;
-
-	for (int i = 0; i < ei->nr_entries; i++) {
-		if (ei->d_children[i]) {
-			/* Children only have dentry if parent does */
-			WARN_ON_ONCE(!ei->dentry);
-			unhook_dentry(ei->d_children[i]);
-		}
-	}
-
-	unhook_dentry(ei->dentry);
-
-	list_del_rcu(&ei->list);
-	call_srcu(&eventfs_srcu, &ei->rcu, free_rcu_ei);
+	list_del(&ei->list);
+	put_ei(ei);
 }
 
 /**
@@ -913,22 +825,12 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, int level)
  */
 void eventfs_remove_dir(struct eventfs_inode *ei)
 {
-	struct dentry *dentry;
-
 	if (!ei)
 		return;
 
 	mutex_lock(&eventfs_mutex);
-	dentry = ei->dentry;
 	eventfs_remove_rec(ei, 0);
 	mutex_unlock(&eventfs_mutex);
-
-	/*
-	 * If any of the ei children has a dentry, then the ei itself
-	 * must have a dentry.
-	 */
-	if (dentry)
-		simple_recursive_removal(dentry, NULL);
 }
 
 /**
@@ -941,7 +843,11 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 {
 	struct dentry *dentry;
 
-	dentry = ei->dentry;
+	dentry = ei->events_dir;
+	if (!dentry)
+		return;
+
+	ei->events_dir = NULL;
 	eventfs_remove_dir(ei);
 
 	/*
@@ -951,5 +857,6 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 	 * sticks around while the other ei->dentry are created
 	 * and destroyed dynamically.
 	 */
+	simple_recursive_removal(dentry, NULL);
 	dput(dentry);
 }
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 8f38740bfb5b..72db3bdc4dfb 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -34,8 +34,7 @@ struct eventfs_attr {
  * @entries:	the array of entries representing the files in the directory
  * @name:	the name of the directory to create
  * @children:	link list into the child eventfs_inode
- * @dentry:     the dentry of the directory
- * @d_children: The array of dentries to represent the files when created
+ * @events_dir: the dentry of the events directory
  * @entry_attrs: Saved mode and ownership of the @d_children
  * @attr:	Saved mode and ownership of eventfs_inode itself
  * @data:	The private data to pass to the callbacks
@@ -44,12 +43,11 @@ struct eventfs_attr {
  * @nr_entries: The number of items in @entries
  */
 struct eventfs_inode {
+	struct kref			kref;
 	struct list_head		list;
 	const struct eventfs_entry	*entries;
-	const char			*name;
 	struct list_head		children;
-	struct dentry			*dentry; /* Check is_freed to access */
-	struct dentry			**d_children;
+	struct dentry			*events_dir;
 	struct eventfs_attr		*entry_attrs;
 	struct eventfs_attr		attr;
 	void				*data;
@@ -66,6 +64,7 @@ struct eventfs_inode {
 		struct llist_node	llist;
 		struct rcu_head		rcu;
 	};
+	const char name[];
 };
 
 static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
-- 
2.43.0.5.g38fb137bdb


[-- Attachment #3: 0003-tracefs-dentry-lookup-crapectomy.patch --]
[-- Type: text/x-patch, Size: 12628 bytes --]

From b2ee7a777a8201aed790ef8cf8fd1f10f4f08aa3 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 29 Jan 2024 12:25:53 -0800
Subject: [PATCH 3/5] tracefs: dentry lookup crapectomy

The dentry lookup for eventfs files was very broken, and had lots of
signs of the old situation where the filesystem names were all created
statically in the dentry tree, rather than being looked up dynamically
based on the eventfs data structures.

You could see it in the naming - how it claimed to "create" dentries
rather than just look up the dentries that were given it.

You could see it in various nonsensical and very incorrect operations,
like using "simple_lookup()" on the dentries that were passed in, which
only results in those dentries becoming negative dentries.  Which meant
that any other lookup would possibly return ENOENT if it saw that
negative dentry before the data rwas then later filled in.

You could see it in the immesnse amount of nonsensical code that didn't
actually just do lookups.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 275 ++++++++-------------------------------
 1 file changed, 52 insertions(+), 223 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index c0d977e6c0f2..ad11063bdd53 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
 	struct eventfs_inode *ei;
 
-	mutex_lock(&eventfs_mutex);
 	do {
 		// The parent is stable because we do not do renames
 		dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 		}
 		// Walk upwards until you find the events inode
 	} while (!ei->is_events);
-	mutex_unlock(&eventfs_mutex);
 
 	update_top_events_attr(ei, dentry->d_sb);
 
@@ -280,11 +278,10 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode,
 }
 
 /**
- * create_file - create a file in the tracefs filesystem
- * @name: the name of the file to create.
+ * lookup_file - look up a file in the tracefs filesystem
+ * @dentry: the dentry to look up
  * @mode: the permission that the file should have.
  * @attr: saved attributes changed by user
- * @parent: parent dentry for this file.
  * @data: something that the caller will want to get to later on.
  * @fop: struct file_operations that should be used for this file.
  *
@@ -292,13 +289,13 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode,
  * directory. The inode.i_private pointer will point to @data in the open()
  * call.
  */
-static struct dentry *create_file(const char *name, umode_t mode,
+static struct dentry *lookup_file(struct dentry *dentry,
+				  umode_t mode,
 				  struct eventfs_attr *attr,
-				  struct dentry *parent, void *data,
+				  void *data,
 				  const struct file_operations *fop)
 {
 	struct tracefs_inode *ti;
-	struct dentry *dentry;
 	struct inode *inode;
 
 	if (!(mode & S_IFMT))
@@ -307,12 +304,6 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	if (WARN_ON_ONCE(!S_ISREG(mode)))
 		return NULL;
 
-	WARN_ON_ONCE(!parent);
-	dentry = eventfs_start_creating(name, parent);
-
-	if (IS_ERR(dentry))
-		return dentry;
-
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
 		return eventfs_failed_creating(dentry);
@@ -331,29 +322,25 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = NULL;			// Directories have 'ei', files not
 
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
 };
 
 /**
- * create_dir - create a dir in the tracefs filesystem
+ * lookup_dir_entry - look up a dir in the tracefs filesystem
+ * @dentry: the directory to look up
  * @ei: the eventfs_inode that represents the directory to create
- * @parent: parent dentry for this file.
  *
- * This function will create a dentry for a directory represented by
+ * This function will look up a dentry for a directory represented by
  * a eventfs_inode.
  */
-static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent)
+static struct dentry *lookup_dir_entry(struct dentry *dentry,
+	struct eventfs_inode *pei, struct eventfs_inode *ei)
 {
 	struct tracefs_inode *ti;
-	struct dentry *dentry;
 	struct inode *inode;
 
-	dentry = eventfs_start_creating(ei->name, parent);
-	if (IS_ERR(dentry))
-		return dentry;
-
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
 		return eventfs_failed_creating(dentry);
@@ -372,8 +359,11 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = ei;
 
+	dentry->d_fsdata = ei;
+        ei->dentry = dentry;	// Remove me!
+
 	inc_nlink(inode);
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
 	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
@@ -426,7 +416,7 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
 }
 
 /**
- * create_file_dentry - create a dentry for a file of an eventfs_inode
+ * lookup_file_dentry - create a dentry for a file of an eventfs_inode
  * @ei: the eventfs_inode that the file will be created under
  * @idx: the index into the d_children[] of the @ei
  * @parent: The parent dentry of the created file.
@@ -439,157 +429,21 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
  * address located at @e_dentry.
  */
 static struct dentry *
-create_file_dentry(struct eventfs_inode *ei, int idx,
-		   struct dentry *parent, const char *name, umode_t mode, void *data,
+lookup_file_dentry(struct dentry *dentry,
+		   struct eventfs_inode *ei, int idx,
+		   umode_t mode, void *data,
 		   const struct file_operations *fops)
 {
 	struct eventfs_attr *attr = NULL;
 	struct dentry **e_dentry = &ei->d_children[idx];
-	struct dentry *dentry;
 
-	WARN_ON_ONCE(!inode_is_locked(parent->d_inode));
-
-	mutex_lock(&eventfs_mutex);
-	if (ei->is_freed) {
-		mutex_unlock(&eventfs_mutex);
-		return NULL;
-	}
-	/* If the e_dentry already has a dentry, use it */
-	if (*e_dentry) {
-		dget(*e_dentry);
-		mutex_unlock(&eventfs_mutex);
-		return *e_dentry;
-	}
-
-	/* ei->entry_attrs are protected by SRCU */
 	if (ei->entry_attrs)
 		attr = &ei->entry_attrs[idx];
 
-	mutex_unlock(&eventfs_mutex);
+	dentry->d_fsdata = ei;		// NOTE: ei of _parent_
+	lookup_file(dentry, mode, attr, data, fops);
 
-	dentry = create_file(name, mode, attr, parent, data, fops);
-
-	mutex_lock(&eventfs_mutex);
-
-	if (IS_ERR_OR_NULL(dentry)) {
-		/*
-		 * When the mutex was released, something else could have
-		 * created the dentry for this e_dentry. In which case
-		 * use that one.
-		 *
-		 * If ei->is_freed is set, the e_dentry is currently on its
-		 * way to being freed, don't return it. If e_dentry is NULL
-		 * it means it was already freed.
-		 */
-		if (ei->is_freed) {
-			dentry = NULL;
-		} else {
-			dentry = *e_dentry;
-			dget(dentry);
-		}
-		mutex_unlock(&eventfs_mutex);
-		return dentry;
-	}
-
-	if (!*e_dentry && !ei->is_freed) {
-		*e_dentry = dentry;
-		dentry->d_fsdata = ei;
-	} else {
-		/*
-		 * Should never happen unless we get here due to being freed.
-		 * Otherwise it means two dentries exist with the same name.
-		 */
-		WARN_ON_ONCE(!ei->is_freed);
-		dentry = NULL;
-	}
-	mutex_unlock(&eventfs_mutex);
-
-	return dentry;
-}
-
-/**
- * eventfs_post_create_dir - post create dir routine
- * @ei: eventfs_inode of recently created dir
- *
- * Map the meta-data of files within an eventfs dir to their parent dentry
- */
-static void eventfs_post_create_dir(struct eventfs_inode *ei)
-{
-	struct eventfs_inode *ei_child;
-
-	lockdep_assert_held(&eventfs_mutex);
-
-	/* srcu lock already held */
-	/* fill parent-child relation */
-	list_for_each_entry_srcu(ei_child, &ei->children, list,
-				 srcu_read_lock_held(&eventfs_srcu)) {
-		ei_child->d_parent = ei->dentry;
-	}
-}
-
-/**
- * create_dir_dentry - Create a directory dentry for the eventfs_inode
- * @pei: The eventfs_inode parent of ei.
- * @ei: The eventfs_inode to create the directory for
- * @parent: The dentry of the parent of this directory
- *
- * This creates and attaches a directory dentry to the eventfs_inode @ei.
- */
-static struct dentry *
-create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
-		  struct dentry *parent)
-{
-	struct dentry *dentry = NULL;
-
-	WARN_ON_ONCE(!inode_is_locked(parent->d_inode));
-
-	mutex_lock(&eventfs_mutex);
-	if (pei->is_freed || ei->is_freed) {
-		mutex_unlock(&eventfs_mutex);
-		return NULL;
-	}
-	if (ei->dentry) {
-		/* If the eventfs_inode already has a dentry, use it */
-		dentry = ei->dentry;
-		dget(dentry);
-		mutex_unlock(&eventfs_mutex);
-		return dentry;
-	}
-	mutex_unlock(&eventfs_mutex);
-
-	dentry = create_dir(ei, parent);
-
-	mutex_lock(&eventfs_mutex);
-
-	if (IS_ERR_OR_NULL(dentry) && !ei->is_freed) {
-		/*
-		 * When the mutex was released, something else could have
-		 * created the dentry for this e_dentry. In which case
-		 * use that one.
-		 *
-		 * If ei->is_freed is set, the e_dentry is currently on its
-		 * way to being freed.
-		 */
-		dentry = ei->dentry;
-		if (dentry)
-			dget(dentry);
-		mutex_unlock(&eventfs_mutex);
-		return dentry;
-	}
-
-	if (!ei->dentry && !ei->is_freed) {
-		ei->dentry = dentry;
-		eventfs_post_create_dir(ei);
-		dentry->d_fsdata = ei;
-	} else {
-		/*
-		 * Should never happen unless we get here due to being freed.
-		 * Otherwise it means two dentries exist with the same name.
-		 */
-		WARN_ON_ONCE(!ei->is_freed);
-		dentry = NULL;
-	}
-	mutex_unlock(&eventfs_mutex);
+	*e_dentry = dentry;	// Remove me
 
 	return dentry;
 }
@@ -608,79 +462,54 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags)
 {
-	const struct file_operations *fops;
-	const struct eventfs_entry *entry;
 	struct eventfs_inode *ei_child;
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
-	struct dentry *ei_dentry = NULL;
-	struct dentry *ret = NULL;
-	struct dentry *d;
 	const char *name = dentry->d_name.name;
-	umode_t mode;
-	void *data;
-	int idx;
-	int i;
-	int r;
 
 	ti = get_tracefs(dir);
 	if (!(ti->flags & TRACEFS_EVENT_INODE))
-		return NULL;
+		return ERR_PTR(-EIO);
 
-	/* Grab srcu to prevent the ei from going away */
-	idx = srcu_read_lock(&eventfs_srcu);
-
-	/*
-	 * Grab the eventfs_mutex to consistent value from ti->private.
-	 * This s
-	 */
 	mutex_lock(&eventfs_mutex);
-	ei = READ_ONCE(ti->private);
-	if (ei && !ei->is_freed)
-		ei_dentry = READ_ONCE(ei->dentry);
-	mutex_unlock(&eventfs_mutex);
 
-	if (!ei || !ei_dentry)
-		goto out;
+	ei = ti->private;
+	if (!ei || ei->is_freed)
+		goto enoent;
 
-	data = ei->data;
-
-	list_for_each_entry_srcu(ei_child, &ei->children, list,
-				 srcu_read_lock_held(&eventfs_srcu)) {
+	list_for_each_entry(ei_child, &ei->children, list) {
 		if (strcmp(ei_child->name, name) != 0)
 			continue;
-		ret = simple_lookup(dir, dentry, flags);
-		if (IS_ERR(ret))
-			goto out;
-		d = create_dir_dentry(ei, ei_child, ei_dentry);
-		dput(d);
+		if (ei_child->is_freed)
+			goto enoent;
+		lookup_dir_entry(dentry, ei, ei_child);
 		goto out;
 	}
 
-	for (i = 0; i < ei->nr_entries; i++) {
-		entry = &ei->entries[i];
-		if (strcmp(name, entry->name) == 0) {
-			void *cdata = data;
-			mutex_lock(&eventfs_mutex);
-			/* If ei->is_freed, then the event itself may be too */
-			if (!ei->is_freed)
-				r = entry->callback(name, &mode, &cdata, &fops);
-			else
-				r = -1;
-			mutex_unlock(&eventfs_mutex);
-			if (r <= 0)
-				continue;
-			ret = simple_lookup(dir, dentry, flags);
-			if (IS_ERR(ret))
-				goto out;
-			d = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops);
-			dput(d);
-			break;
-		}
+	for (int i = 0; i < ei->nr_entries; i++) {
+		void *data;
+		umode_t mode;
+		const struct file_operations *fops;
+		const struct eventfs_entry *entry = &ei->entries[i];
+
+		if (strcmp(name, entry->name) != 0)
+			continue;
+
+		data = ei->data;
+		if (entry->callback(name, &mode, &data, &fops) <= 0)
+			goto enoent;
+
+		lookup_file_dentry(dentry, ei, i, mode, data, fops);
+		goto out;
 	}
+
+ enoent:
+	/* Nothing found? */
+	d_add(dentry, NULL);
+
  out:
-	srcu_read_unlock(&eventfs_srcu, idx);
-	return ret;
+	mutex_unlock(&eventfs_mutex);
+	return NULL;
 }
 
 /*
-- 
2.43.0.5.g38fb137bdb


[-- Attachment #4: 0001-tracefs-avoid-using-the-ei-dentry-pointer-unnecessar.patch --]
[-- Type: text/x-patch, Size: 3561 bytes --]

From 617b62d5af0913e10701b301706a09fc527e4df9 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 27 Jan 2024 13:27:01 -0800
Subject: [PATCH 1/5] tracefs: avoid using the ei->dentry pointer unnecessarily

The eventfs_find_events() code tries to walk up the tree to find the
event directory that a dentry belongs to, in order to then find the
eventfs inode that is associated with that event directory.

However, it uses an odd combination of walking the dentry parent,
looking up the eventfs inode associated with that, and then looking up
the dentry from there.  Repeat.

But the code shouldn't have back-pointers to dentries in the first
place, and it should just walk the dentry parenthood chain directly.

Similarly, 'set_top_events_ownership()' looks up the dentry from the
eventfs inode, but the only reason it wants a dentry is to look up the
superblock in order to look up the root dentry.

But it already has the real filesystem inode, which has that same
superblock pointer.  So just pass in the superblock pointer using the
information that's already there, instead of looking up extraneous data
that is irrelevant.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1c3dd0ad4660..2d128bedd654 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -156,33 +156,30 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
 	return ret;
 }
 
-static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry)
+static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb)
 {
-	struct inode *inode;
+	struct inode *root;
 
 	/* Only update if the "events" was on the top level */
 	if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
 		return;
 
 	/* Get the tracefs root inode. */
-	inode = d_inode(dentry->d_sb->s_root);
-	ei->attr.uid = inode->i_uid;
-	ei->attr.gid = inode->i_gid;
+	root = d_inode(sb->s_root);
+	ei->attr.uid = root->i_uid;
+	ei->attr.gid = root->i_gid;
 }
 
 static void set_top_events_ownership(struct inode *inode)
 {
 	struct tracefs_inode *ti = get_tracefs(inode);
 	struct eventfs_inode *ei = ti->private;
-	struct dentry *dentry;
 
 	/* The top events directory doesn't get automatically updated */
 	if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
 		return;
 
-	dentry = ei->dentry;
-
-	update_top_events_attr(ei, dentry);
+	update_top_events_attr(ei, inode->i_sb);
 
 	if (!(ei->attr.mode & EVENTFS_SAVE_UID))
 		inode->i_uid = ei->attr.uid;
@@ -235,8 +232,10 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 
 	mutex_lock(&eventfs_mutex);
 	do {
-		/* The parent always has an ei, except for events itself */
-		ei = dentry->d_parent->d_fsdata;
+		// The parent is stable because we do not do renames
+		dentry = dentry->d_parent;
+		// ... and directories always have d_fsdata
+		ei = dentry->d_fsdata;
 
 		/*
 		 * If the ei is being freed, the ownership of the children
@@ -246,12 +245,11 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 			ei = NULL;
 			break;
 		}
-
-		dentry = ei->dentry;
+		// Walk upwards until you find the events inode
 	} while (!ei->is_events);
 	mutex_unlock(&eventfs_mutex);
 
-	update_top_events_attr(ei, dentry);
+	update_top_events_attr(ei, dentry->d_sb);
 
 	return ei;
 }
-- 
2.43.0.5.g38fb137bdb


[-- Attachment #5: 0002-eventfsfs-initialize-the-tracefs-inode-properly.patch --]
[-- Type: text/x-patch, Size: 2548 bytes --]

From fd17eaceb3583c1c84845afce3e1560ba6a15032 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 29 Jan 2024 11:06:32 -0800
Subject: [PATCH 2/5] eventfsfs: initialize the tracefs inode properly

The tracefs-specific fields in the inode were not initialized before the
inode was exposed to others through the dentry with 'd_instantiate()'.

And the ->flags file was initialized incorrectly with a '|=', when the
old value was stale.  It should have just been a straight assignment.

Move the field initializations up to before the d_instantiate, and fix
the use of uninitialized data.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2d128bedd654..c0d977e6c0f2 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	inode->i_ino = EVENTFS_FILE_INODE_INO;
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = NULL;			// Directories have 'ei', files not
+
 	d_instantiate(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
@@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	inode->i_ino = eventfs_dir_ino(ei);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = ei;
 
 	inc_nlink(inode);
 	d_instantiate(dentry, inode);
@@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 static void eventfs_post_create_dir(struct eventfs_inode *ei)
 {
 	struct eventfs_inode *ei_child;
-	struct tracefs_inode *ti;
 
 	lockdep_assert_held(&eventfs_mutex);
 
@@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei)
 				 srcu_read_lock_held(&eventfs_srcu)) {
 		ei_child->d_parent = ei->dentry;
 	}
-
-	ti = get_tracefs(ei->dentry->d_inode);
-	ti->private = ei;
 }
 
 /**
@@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	INIT_LIST_HEAD(&ei->list);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+	ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
 	ti->private = ei;
 
 	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
-- 
2.43.0.5.g38fb137bdb


[-- Attachment #6: 0004-eventfs-remove-unused-d_parent-pointer-field.patch --]
[-- Type: text/x-patch, Size: 1854 bytes --]

From 10edbe91e1ef0c8d3bf0be053662d2bb4fa1a6b0 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 29 Jan 2024 16:07:33 -0800
Subject: [PATCH 4/5] eventfs: remove unused 'd_parent' pointer field

It's never used

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 4 +---
 fs/tracefs/internal.h    | 2 --
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index ad11063bdd53..1d0102bfd7da 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -685,10 +685,8 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
 	INIT_LIST_HEAD(&ei->list);
 
 	mutex_lock(&eventfs_mutex);
-	if (!parent->is_freed) {
+	if (!parent->is_freed)
 		list_add_tail(&ei->list, &parent->children);
-		ei->d_parent = parent->dentry;
-	}
 	mutex_unlock(&eventfs_mutex);
 
 	/* Was the parent freed? */
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 91c2bf0b91d9..8f38740bfb5b 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -35,7 +35,6 @@ struct eventfs_attr {
  * @name:	the name of the directory to create
  * @children:	link list into the child eventfs_inode
  * @dentry:     the dentry of the directory
- * @d_parent:   pointer to the parent's dentry
  * @d_children: The array of dentries to represent the files when created
  * @entry_attrs: Saved mode and ownership of the @d_children
  * @attr:	Saved mode and ownership of eventfs_inode itself
@@ -50,7 +49,6 @@ struct eventfs_inode {
 	const char			*name;
 	struct list_head		children;
 	struct dentry			*dentry; /* Check is_freed to access */
-	struct dentry			*d_parent;
 	struct dentry			**d_children;
 	struct eventfs_attr		*entry_attrs;
 	struct eventfs_attr		attr;
-- 
2.43.0.5.g38fb137bdb


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30  3:56                             ` Linus Torvalds
@ 2024-01-30  8:43                               ` Linus Torvalds
  2024-01-30  9:12                                 ` Linus Torvalds
  2024-01-30 18:23                               ` Steven Rostedt
  2024-01-30 18:36                               ` Steven Rostedt
  2 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2024-01-30  8:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 at 19:56, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I've been staring at this code for too long, so I'm posting this
> just as a "it's broken, but _something_ like this", because I'm taking
> a break from looking at this.

Bah. I literally woke up and realized what the problem is.

We're caching negative dentries, but eventfs_create_dir() has no way
to invalidate the old negative dentry when it adds a new entry.

The old "try to reuse dentry" ended up hiding that issue, because it
would look up the negative dentry from the 'ei' and turn it into a
positive one.

And I was stupidly staring at the wrong code - all these functions
with almost the same names.

eventfs_create_events_dir() is fine, because it gets the parent as a
dentry, so looking up the new dentry is trivial.

But eventfs_create_dir() doesn't get a dentry at all. It gets the parent as a

  struct eventfs_inode *parent

which is no good.

So that explains why creating an event after deleting an old one - or
after just looking it up before it was created - ends up with the
nonsensical "ls" output - it gets listed by readdir() because the
entry is there in the eventfs data structures, but then doing a
"stat()" on it will just hit the negative dentry. So it won't actually
look up the dentry.

The simple solution is probably just to not cache negative dentries
for eventfs.  So instead of doing the "d_add(dentry, NULL);" we should
just return "ERR_PTR(ENOENT)".

Which is actually what /proc/<pid> lookups do, for the same reason.

I'll go back to bed, but I think the fix is something trivial like this:

  --- a/fs/tracefs/event_inode.c
  +++ b/fs/tracefs/event_inode.c
  @@ -517,9 +517,8 @@ static struct dentry *eventfs_root_lookup(
        }

    enoent:
  -     /* Nothing found? */
  -     d_add(dentry, NULL);
  -     result = NULL;
  +     /* Don't cache negative lookups, just return an error */
  +     result = ERR_PTR(ENOENT);

    out:
        mutex_unlock(&eventfs_mutex);

I just wanted to write it down when the likely solution struck me.

               Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30  8:43                               ` Linus Torvalds
@ 2024-01-30  9:12                                 ` Linus Torvalds
  2024-01-30 12:45                                   ` Rasmus Villemoes
  2024-01-30 14:39                                   ` Steven Rostedt
  0 siblings, 2 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-30  9:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Tue, 30 Jan 2024 at 00:43, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll go back to bed, but I think the fix is something trivial like this:

Almost.

>   +     result = ERR_PTR(ENOENT);

That needs a '-' in front of the ENOENT, otherwise you have a positive
error number and things go wrong very quickly.

And that does indeed fix the lookup problem, but you end up with the
same problem later when you do the eventfs_remove_dir(). Again the
eventfs data structure changes, but we don't have a reliable dentry
that we can invalidate.

The dentry cache is just very good at caching those old dentries, and
the interface for eventfs_create_dir() and eventfs_remove_dir() is
just not great.

If those did an actual path lookup (like eventfs_create_events_dir()
does), we'd have the dentry, and it's trivial to get from dentry to
eventfs_inode.

But going the other way is the broken thing because of how the
dentries are just temporary caches.

I suspect the solution is to make eventfs_create_dir() do the same as
the events directory case does, and actually pin the directory dentry
and save it off.

Oh well. At least I understand what the problem is. Now I'm going to
try to go back to sleep.

              Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30  9:12                                 ` Linus Torvalds
@ 2024-01-30 12:45                                   ` Rasmus Villemoes
  2024-01-30 14:39                                   ` Steven Rostedt
  1 sibling, 0 replies; 47+ messages in thread
From: Rasmus Villemoes @ 2024-01-30 12:45 UTC (permalink / raw)
  To: Linus Torvalds, Steven Rostedt; +Cc: linux-kernel, Hector Martin

On 30/01/2024 10.12, Linus Torvalds wrote:
> On Tue, 30 Jan 2024 at 00:43, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I'll go back to bed, but I think the fix is something trivial like this:
> 
> Almost.
> 
>>   +     result = ERR_PTR(ENOENT);
> 
> That needs a '-' in front of the ENOENT, otherwise you have a positive
> error number and things go wrong very quickly.

OT, but 'git grep' suggests that drivers/soc/apple/mailbox.c might need
exactly that fix... (there's also ext4, but their custom E thing is
already negative).

Rasmus


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30  9:12                                 ` Linus Torvalds
  2024-01-30 12:45                                   ` Rasmus Villemoes
@ 2024-01-30 14:39                                   ` Steven Rostedt
  2024-01-30 16:49                                     ` Steven Rostedt
  2024-01-30 16:51                                     ` Linus Torvalds
  1 sibling, 2 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-30 14:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Tue, 30 Jan 2024 01:12:05 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 30 Jan 2024 at 00:43, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I'll go back to bed, but I think the fix is something trivial like this:  
> 
> Almost.
> 
> >   +     result = ERR_PTR(ENOENT);  
> 
> That needs a '-' in front of the ENOENT, otherwise you have a positive
> error number and things go wrong very quickly.
> 
> And that does indeed fix the lookup problem, but you end up with the
> same problem later when you do the eventfs_remove_dir(). Again the
> eventfs data structure changes, but we don't have a reliable dentry
> that we can invalidate.
> 
> The dentry cache is just very good at caching those old dentries, and
> the interface for eventfs_create_dir() and eventfs_remove_dir() is
> just not great.
> 
> If those did an actual path lookup (like eventfs_create_events_dir()
> does), we'd have the dentry, and it's trivial to get from dentry to
> eventfs_inode.
> 
> But going the other way is the broken thing because of how the
> dentries are just temporary caches.
> 
> I suspect the solution is to make eventfs_create_dir() do the same as
> the events directory case does, and actually pin the directory dentry
> and save it off.

I rather not have the create do that because that happens for every event
directory. On my machine that's:

  # find events -type d | wc -l
  2102

And that's regardless if tracefs is mounted or not. And that's how many are
also created with every instance creation. And doesn't pinning the dentry
also require it to be positive? That is, have a full inode allocated with
it?

I may try something that will still let me get rid of the ei->dentry.

> 
> Oh well. At least I understand what the problem is.


 Yep!


> Now I'm going to try to go back to sleep.

Hope you sleep better than I did. We just bought a new mattress, which felt
great in the store, but now after 4 or 5 hours sleeping in it, I wake up
with a sore back and have to sleep on the couch. And we bought the most
expensive "best" mattress in the store :-p

Oh, and NY State law has it that you can't return in once it is delivered.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30 14:39                                   ` Steven Rostedt
@ 2024-01-30 16:49                                     ` Steven Rostedt
  2024-01-30 16:55                                       ` Linus Torvalds
  2024-01-30 16:51                                     ` Linus Torvalds
  1 sibling, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2024-01-30 16:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Tue, 30 Jan 2024 09:39:42 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> I may try something that will still let me get rid of the ei->dentry.

This appears to work, but like always, I may have missed something. I need
to add debugging (printks) to make sure the that I don't leave any dangling
dentries around (dget without a dput).

Here's what I did:

- Removed the ei->dentry that you loved so much

- Removed the ei->d_children that you shared as much love for.

- Removed SRCU. All accesses are under the eventfs_mutex
  I need to add comments that the callbacks need to be aware of this.
  Currently, the callbacks do not take any other locks. I may comment
  that they should never take a lock.

- Added the kref that you recommended

- Created a eventfs_root_inode that has the structure of:

  struct eventfs_root_inode {
       struct eventfs_inode    ei;
       struct dentry           *root_dentry;
  };

  The "events" directory is the only directory that allocates this.
  It is required that its ei->is_events is set, and no other ei has that
  set. This will hold the only non-dynamic dentry.

- I added "parent" to the eventfs_inode that points to the parent
  eventfs_inode.

- On removal, I got rid of the SRCU callback and the work queue.
  Instead, I find the dentry of the current eventfs_inode that is being
  deleted by walking the ei->parent until I find the events inode that has
  a dentry. I then use that to do a lookup walking back down to the
  eventfs_inode I want to delete. This gives me the dentry that I can call
  d_invalidate() on.

This all works with light testing. I'm sure I did something wrong, but
hopefully this is more inline to what you are looking for.

This patch is on top of your last patch series.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index ad11063bdd53..49d4630d5d70 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -24,17 +24,24 @@
 #include <linux/delay.h>
 #include "internal.h"
 
-/*
- * eventfs_mutex protects the eventfs_inode (ei) dentry. Any access
- * to the ei->dentry must be done under this mutex and after checking
- * if ei->is_freed is not set. When ei->is_freed is set, the dentry
- * is on its way to being freed after the last dput() is made on it.
- */
+/* eventfs_mutex protects ei->is_freed and the ei->ref counting. */
 static DEFINE_MUTEX(eventfs_mutex);
 
 /* Choose something "unique" ;-) */
 #define EVENTFS_FILE_INODE_INO		0x12c4e37
 
+/* Used only by the "events" directory */
+struct eventfs_root_inode {
+	struct eventfs_inode	ei;
+	struct dentry		*root_dentry;
+};
+
+static struct eventfs_root_inode *get_root_inode(struct eventfs_inode *ei)
+{
+	WARN_ON_ONCE(!ei->is_events);
+	return container_of(ei, struct eventfs_inode, ei);
+}
+
 /* Just try to make something consistent and unique */
 static int eventfs_dir_ino(struct eventfs_inode *ei)
 {
@@ -44,14 +51,6 @@ static int eventfs_dir_ino(struct eventfs_inode *ei)
 	return ei->ino;
 }
 
-/*
- * The eventfs_inode (ei) itself is protected by SRCU. It is released from
- * its parent's list and will have is_freed set (under eventfs_mutex).
- * After the SRCU grace period is over and the last dput() is called
- * the ei is freed.
- */
-DEFINE_STATIC_SRCU(eventfs_srcu);
-
 /* Mode is unsigned short, use the upper bits for flags */
 enum {
 	EVENTFS_SAVE_MODE	= BIT(16),
@@ -360,7 +359,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
 	ti->private = ei;
 
 	dentry->d_fsdata = ei;
-        ei->dentry = dentry;	// Remove me!
+	kref_get(&ei->kref);
 
 	inc_nlink(inode);
 	d_add(dentry, inode);
@@ -371,10 +370,30 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
 
 static void free_ei(struct eventfs_inode *ei)
 {
+	struct eventfs_root_inode *rei;
+
 	kfree_const(ei->name);
-	kfree(ei->d_children);
 	kfree(ei->entry_attrs);
-	kfree(ei);
+
+	if (ei->is_events) {
+		rei = get_root_inode(ei);
+		kfree(rei);
+	} else {
+		kfree(ei);
+	}
+}
+
+static void kref_ei_release(struct kref *kref)
+{
+	struct eventfs_inode *ei = container_of(kref, struct eventfs_inode, kref);
+	WARN_ON_ONCE(!ei->is_freed);
+	free_ei(ei);
+}
+
+
+static void eventfs_inode_put(struct eventfs_inode *ei)
+{
+	kref_put(&ei->kref, kref_ei_release);
 }
 
 /**
@@ -387,7 +406,6 @@ static void free_ei(struct eventfs_inode *ei)
 void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
 {
 	struct eventfs_inode *ei;
-	int i;
 
 	mutex_lock(&eventfs_mutex);
 
@@ -395,20 +413,7 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
 	if (!ei)
 		goto out;
 
-	/* This could belong to one of the files of the ei */
-	if (ei->dentry != dentry) {
-		for (i = 0; i < ei->nr_entries; i++) {
-			if (ei->d_children[i] == dentry)
-				break;
-		}
-		if (WARN_ON_ONCE(i == ei->nr_entries))
-			goto out;
-		ei->d_children[i] = NULL;
-	} else if (ei->is_freed) {
-		free_ei(ei);
-	} else {
-		ei->dentry = NULL;
-	}
+	eventfs_inode_put(ei);
 
 	dentry->d_fsdata = NULL;
  out:
@@ -435,16 +440,14 @@ lookup_file_dentry(struct dentry *dentry,
 		   const struct file_operations *fops)
 {
 	struct eventfs_attr *attr = NULL;
-	struct dentry **e_dentry = &ei->d_children[idx];
 
 	if (ei->entry_attrs)
 		attr = &ei->entry_attrs[idx];
 
 	dentry->d_fsdata = ei;		// NOTE: ei of _parent_
+	kref_get(&ei->kref);
 	lookup_file(dentry, mode, attr, data, fops);
 
-	*e_dentry = dentry;	// Remove me
-
 	return dentry;
 }
 
@@ -465,6 +468,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 	struct eventfs_inode *ei_child;
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
+	struct dentry *result = NULL;
 	const char *name = dentry->d_name.name;
 
 	ti = get_tracefs(dir);
@@ -505,11 +509,10 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 
  enoent:
 	/* Nothing found? */
-	d_add(dentry, NULL);
-
+	result = ERR_PTR(-ENOENT);
  out:
 	mutex_unlock(&eventfs_mutex);
-	return NULL;
+	return result;
 }
 
 /*
@@ -525,7 +528,6 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 	struct eventfs_inode *ei;
 	const char *name;
 	umode_t mode;
-	int idx;
 	int ret = -EINVAL;
 	int ino;
 	int i, r, c;
@@ -539,15 +541,9 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 
 	c = ctx->pos - 2;
 
-	idx = srcu_read_lock(&eventfs_srcu);
-
 	mutex_lock(&eventfs_mutex);
 	ei = READ_ONCE(ti->private);
 	if (ei && ei->is_freed)
-		ei = NULL;
-	mutex_unlock(&eventfs_mutex);
-
-	if (!ei)
 		goto out;
 
 	/*
@@ -563,14 +559,12 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 		entry = &ei->entries[i];
 		name = entry->name;
 
-		mutex_lock(&eventfs_mutex);
 		/* If ei->is_freed then just bail here, nothing more to do */
 		if (ei->is_freed) {
 			mutex_unlock(&eventfs_mutex);
 			goto out;
 		}
 		r = entry->callback(name, &mode, &cdata, &fops);
-		mutex_unlock(&eventfs_mutex);
 		if (r <= 0)
 			continue;
 
@@ -583,8 +577,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 	/* Subtract the skipped entries above */
 	c -= min((unsigned int)c, (unsigned int)ei->nr_entries);
 
-	list_for_each_entry_srcu(ei_child, &ei->children, list,
-				 srcu_read_lock_held(&eventfs_srcu)) {
+	list_for_each_entry(ei_child, &ei->children, list) {
 
 		if (c > 0) {
 			c--;
@@ -605,7 +598,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 	}
 	ret = 1;
  out:
-	srcu_read_unlock(&eventfs_srcu, idx);
+	mutex_unlock(&eventfs_mutex);
 
 	return ret;
 
@@ -669,25 +662,17 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
 		return ERR_PTR(-ENOMEM);
 	}
 
-	if (size) {
-		ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL);
-		if (!ei->d_children) {
-			kfree_const(ei->name);
-			kfree(ei);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
-
 	ei->entries = entries;
 	ei->nr_entries = size;
 	ei->data = data;
 	INIT_LIST_HEAD(&ei->children);
 	INIT_LIST_HEAD(&ei->list);
+	kref_init(&ei->kref);
 
 	mutex_lock(&eventfs_mutex);
 	if (!parent->is_freed) {
 		list_add_tail(&ei->list, &parent->children);
-		ei->d_parent = parent->dentry;
+		ei->parent = parent;
 	}
 	mutex_unlock(&eventfs_mutex);
 
@@ -716,6 +701,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 						int size, void *data)
 {
 	struct dentry *dentry = tracefs_start_creating(name, parent);
+	struct eventfs_root_inode *rei;
 	struct eventfs_inode *ei;
 	struct tracefs_inode *ti;
 	struct inode *inode;
@@ -728,24 +714,21 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	if (IS_ERR(dentry))
 		return ERR_CAST(dentry);
 
-	ei = kzalloc(sizeof(*ei), GFP_KERNEL);
-	if (!ei)
+	rei = kzalloc(sizeof(*rei), GFP_KERNEL);
+	if (!rei)
 		goto fail_ei;
 
+	ei = &rei->ei;
+	ei->is_events = 1;
+
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
 		goto fail;
 
-	if (size) {
-		ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL);
-		if (!ei->d_children)
-			goto fail;
-	}
+	rei->root_dentry = dentry;
 
-	ei->dentry = dentry;
 	ei->entries = entries;
 	ei->nr_entries = size;
-	ei->is_events = 1;
 	ei->data = data;
 	ei->name = kstrdup_const(name, GFP_KERNEL);
 	if (!ei->name)
@@ -781,6 +764,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	inode->i_fop = &eventfs_file_operations;
 
 	dentry->d_fsdata = ei;
+	kref_init(&ei->kref);
 
 	/* directory inodes start off with i_nlink == 2 (for "." entry) */
 	inc_nlink(inode);
@@ -792,83 +776,24 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	return ei;
 
  fail:
-	kfree(ei->d_children);
-	kfree(ei);
+	kfree(rei);
  fail_ei:
 	tracefs_failed_creating(dentry);
 	return ERR_PTR(-ENOMEM);
 }
 
-static LLIST_HEAD(free_list);
-
-static void eventfs_workfn(struct work_struct *work)
-{
-        struct eventfs_inode *ei, *tmp;
-        struct llist_node *llnode;
-
-	llnode = llist_del_all(&free_list);
-        llist_for_each_entry_safe(ei, tmp, llnode, llist) {
-		/* This dput() matches the dget() from unhook_dentry() */
-		for (int i = 0; i < ei->nr_entries; i++) {
-			if (ei->d_children[i])
-				dput(ei->d_children[i]);
-		}
-		/* This should only get here if it had a dentry */
-		if (!WARN_ON_ONCE(!ei->dentry))
-			dput(ei->dentry);
-        }
-}
-
-static DECLARE_WORK(eventfs_work, eventfs_workfn);
-
-static void free_rcu_ei(struct rcu_head *head)
-{
-	struct eventfs_inode *ei = container_of(head, struct eventfs_inode, rcu);
-
-	if (ei->dentry) {
-		/* Do not free the ei until all references of dentry are gone */
-		if (llist_add(&ei->llist, &free_list))
-			queue_work(system_unbound_wq, &eventfs_work);
-		return;
-	}
-
-	/* If the ei doesn't have a dentry, neither should its children */
-	for (int i = 0; i < ei->nr_entries; i++) {
-		WARN_ON_ONCE(ei->d_children[i]);
-	}
-
-	free_ei(ei);
-}
-
-static void unhook_dentry(struct dentry *dentry)
-{
-	if (!dentry)
-		return;
-	/*
-	 * Need to add a reference to the dentry that is expected by
-	 * simple_recursive_removal(), which will include a dput().
-	 */
-	dget(dentry);
-
-	/*
-	 * Also add a reference for the dput() in eventfs_workfn().
-	 * That is required as that dput() will free the ei after
-	 * the SRCU grace period is over.
-	 */
-	dget(dentry);
-}
-
 /**
  * eventfs_remove_rec - remove eventfs dir or file from list
  * @ei: eventfs_inode to be removed.
+ * @head: List head to add the ei's to remove
  * @level: prevent recursion from going more than 3 levels deep.
  *
  * This function recursively removes eventfs_inodes which
  * contains info of files and/or directories.
  */
-static void eventfs_remove_rec(struct eventfs_inode *ei, int level)
+static void eventfs_remove_rec(struct eventfs_inode *ei, struct list_head *head, int level)
 {
-	struct eventfs_inode *ei_child;
+	struct eventfs_inode *ei_child, *tmp;
 
 	if (!ei)
 		return;
@@ -883,28 +808,67 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, int level)
 		return;
 
 	/* search for nested folders or files */
-	list_for_each_entry_srcu(ei_child, &ei->children, list,
-				 lockdep_is_held(&eventfs_mutex)) {
-		/* Children only have dentry if parent does */
-		WARN_ON_ONCE(ei_child->dentry && !ei->dentry);
-		eventfs_remove_rec(ei_child, level + 1);
+	list_for_each_entry_safe(ei_child, tmp, &ei->children, list) {
+		eventfs_remove_rec(ei_child, head, level + 1);
 	}
 
-
 	ei->is_freed = 1;
 
-	for (int i = 0; i < ei->nr_entries; i++) {
-		if (ei->d_children[i]) {
-			/* Children only have dentry if parent does */
-			WARN_ON_ONCE(!ei->dentry);
-			unhook_dentry(ei->d_children[i]);
-		}
+	list_del(&ei->list);
+	list_add(&ei->list, head);
+}
+
+static struct dentry *find_ei_dentry(struct eventfs_inode *ei, int level)
+{
+	struct dentry *d_parent;
+	struct dentry *dentry;
+	const char *name = ei->name;
+
+	/*
+	 * Check recursion depth. It should never be greater than 2;
+	 * 0 - events/
+	 * 1 - events/group/
+	 * 2 - events/group/event/
+	 */
+	if (WARN_ON_ONCE(level > 2))
+		return NULL;
+
+	/* Only the events directory has a dentry we can use */
+	if (ei->parent->is_events) {
+		struct eventfs_root_inode *rei;
+
+		rei = get_root_inode(ei);
+		d_parent = rei->root_dentry;
+		dget(d_parent);
+	} else {
+		d_parent = find_ei_dentry(ei->parent, level + 1);
+		if (!d_parent)
+			return NULL;
 	}
 
-	unhook_dentry(ei->dentry);
+	inode_lock(d_inode(d_parent));
+	dentry = lookup_one_len(name, d_parent, strlen(name));
+	inode_unlock(d_inode(d_parent));
+	dput(d_parent);
+
+	if (IS_ERR(dentry))
+		dentry = NULL;
+	return dentry;
+}
+
+static void remove_dir(struct eventfs_inode *ei)
+{
+	struct eventfs_inode *tmp;
+	LIST_HEAD(head);
 
-	list_del_rcu(&ei->list);
-	call_srcu(&eventfs_srcu, &ei->rcu, free_rcu_ei);
+	mutex_lock(&eventfs_mutex);
+	eventfs_remove_rec(ei, &head, 0);
+	mutex_unlock(&eventfs_mutex);
+
+	list_for_each_entry_safe(ei, tmp, &head, list) {
+		list_del(&ei->list);
+		eventfs_inode_put(ei);
+	}
 }
 
 /**
@@ -920,17 +884,14 @@ void eventfs_remove_dir(struct eventfs_inode *ei)
 	if (!ei)
 		return;
 
-	mutex_lock(&eventfs_mutex);
-	dentry = ei->dentry;
-	eventfs_remove_rec(ei, 0);
-	mutex_unlock(&eventfs_mutex);
+	dentry = find_ei_dentry(ei, 0);
 
-	/*
-	 * If any of the ei children has a dentry, then the ei itself
-	 * must have a dentry.
-	 */
-	if (dentry)
-		simple_recursive_removal(dentry, NULL);
+	remove_dir(ei);
+
+	if (dentry) {
+		d_invalidate(dentry);
+		dput(dentry);
+	}
 }
 
 /**
@@ -941,10 +902,13 @@ void eventfs_remove_dir(struct eventfs_inode *ei)
  */
 void eventfs_remove_events_dir(struct eventfs_inode *ei)
 {
+	struct eventfs_root_inode *rei;
 	struct dentry *dentry;
 
-	dentry = ei->dentry;
-	eventfs_remove_dir(ei);
+	rei = get_root_inode(ei);
+	dentry = rei->root_dentry;
+
+	remove_dir(ei);
 
 	/*
 	 * Matches the dget() done by tracefs_start_creating()
@@ -953,5 +917,8 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 	 * sticks around while the other ei->dentry are created
 	 * and destroyed dynamically.
 	 */
+	simple_recursive_removal(dentry, NULL);
+
+	eventfs_inode_put(ei);
 	dput(dentry);
 }
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 91c2bf0b91d9..2af78fd95c93 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -34,9 +34,7 @@ struct eventfs_attr {
  * @entries:	the array of entries representing the files in the directory
  * @name:	the name of the directory to create
  * @children:	link list into the child eventfs_inode
- * @dentry:     the dentry of the directory
  * @d_parent:   pointer to the parent's dentry
- * @d_children: The array of dentries to represent the files when created
  * @entry_attrs: Saved mode and ownership of the @d_children
  * @attr:	Saved mode and ownership of eventfs_inode itself
  * @data:	The private data to pass to the callbacks
@@ -49,12 +47,11 @@ struct eventfs_inode {
 	const struct eventfs_entry	*entries;
 	const char			*name;
 	struct list_head		children;
-	struct dentry			*dentry; /* Check is_freed to access */
-	struct dentry			*d_parent;
-	struct dentry			**d_children;
+	struct eventfs_inode		*parent;
 	struct eventfs_attr		*entry_attrs;
 	struct eventfs_attr		attr;
 	void				*data;
+	struct kref			kref;
 	unsigned int			is_freed:1;
 	unsigned int			is_events:1;
 	unsigned int			nr_entries:30;


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30 14:39                                   ` Steven Rostedt
  2024-01-30 16:49                                     ` Steven Rostedt
@ 2024-01-30 16:51                                     ` Linus Torvalds
  1 sibling, 0 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-30 16:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Tue, 30 Jan 2024 at 06:39, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 30 Jan 2024 01:12:05 -0800
> >
> > I suspect the solution is to make eventfs_create_dir() do the same as
> > the events directory case does, and actually pin the directory dentry
> > and save it off.
>
> I rather not have the create do that because that happens for every event
> directory.

I wasn't thinking straight yesterday - the real fix is to just do a
"d_revalidate()". It's literally why that thing exists: check whether
a dentry is still valid.

In fact, that trivially gets rid of the 'events' subdirectory issues
too, so we can stop doing that horrendous simple_recursive_removal()
too.

Let me reboot into the trivial fix. I just needed to think about this
the right way.

None of this is anything that the VFS layer has any problems with.

               Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30 16:49                                     ` Steven Rostedt
@ 2024-01-30 16:55                                       ` Linus Torvalds
  2024-01-30 17:06                                         ` Steven Rostedt
  2024-01-30 17:09                                         ` Linus Torvalds
  0 siblings, 2 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-30 16:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Tue, 30 Jan 2024 at 08:49, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> - On removal, I got rid of the SRCU callback and the work queue.
>   Instead, I find the dentry of the current eventfs_inode that is being
>   deleted by walking the ei->parent until I find the events inode that has
>   a dentry. I then use that to do a lookup walking back down to the
>   eventfs_inode I want to delete. This gives me the dentry that I can call
>   d_invalidate() on.

Yes, that works.

However, I have a patch that is *much* smaller and simpler, and
doesn't need that walk.

The VFS layer already has a good interface for "should I still use
this dentry", which is needed for various network filesystems etc that
want to time out caches (or check explicitly whether the file still
exists etc): it's the dentry d_revalidate() check.

Let me just reboot into it to test that I got all the cases.

It makes the code even more obvious, and avoids all the complexity.

           Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30 16:55                                       ` Linus Torvalds
@ 2024-01-30 17:06                                         ` Steven Rostedt
  2024-01-30 17:09                                         ` Linus Torvalds
  1 sibling, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-30 17:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Tue, 30 Jan 2024 08:55:51 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 30 Jan 2024 at 08:49, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > - On removal, I got rid of the SRCU callback and the work queue.
> >   Instead, I find the dentry of the current eventfs_inode that is being
> >   deleted by walking the ei->parent until I find the events inode that has
> >   a dentry. I then use that to do a lookup walking back down to the
> >   eventfs_inode I want to delete. This gives me the dentry that I can call
> >   d_invalidate() on.  
> 
> Yes, that works.
> 
> However, I have a patch that is *much* smaller and simpler, and
> doesn't need that walk.
> 
> The VFS layer already has a good interface for "should I still use
> this dentry", which is needed for various network filesystems etc that
> want to time out caches (or check explicitly whether the file still
> exists etc): it's the dentry d_revalidate() check.
> 
> Let me just reboot into it to test that I got all the cases.
> 
> It makes the code even more obvious, and avoids all the complexity.


I actually had this before, but it wasn't working (likely to something else
that wasn't working or I did it wrong) so I reverted it.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 49d4630d5d70..9867b39ae24c 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -451,6 +451,13 @@ lookup_file_dentry(struct dentry *dentry,
 	return dentry;
 }
 
+int eventfs_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct eventfs_inode *ei = dentry->d_fsdata;
+
+	return ei && !ei->is_freed;
+}
+
 /**
  * eventfs_root_lookup - lookup routine to create file/dir
  * @dir: in which a lookup is being done
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..0395459d919e 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -392,8 +392,24 @@ static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
 	iput(inode);
 }
 
+static int tracefs_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct inode *inode = dentry->d_inode;
+	struct tracefs_inode *ti;
+
+	if (!dentry || !inode)
+		return 0;
+
+	ti = get_tracefs(inode);
+	if (!ti || !(ti->flags & TRACEFS_EVENT_INODE))
+		return 1;
+
+	return eventfs_revalidate(dentry, flags);
+}
+
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_dentry_iput,
+	.d_iput		= tracefs_dentry_iput,
+	.d_revalidate	= tracefs_revalidate,
 };
 
 static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 2af78fd95c93..a1024202c4e5 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -80,5 +80,6 @@ struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
 struct dentry *eventfs_failed_creating(struct dentry *dentry);
 struct dentry *eventfs_end_creating(struct dentry *dentry);
 void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry);
+int eventfs_revalidate(struct dentry *dentry, unsigned int flags);
 
 #endif /* _TRACEFS_INTERNAL_H */


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30 16:55                                       ` Linus Torvalds
  2024-01-30 17:06                                         ` Steven Rostedt
@ 2024-01-30 17:09                                         ` Linus Torvalds
  1 sibling, 0 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-30 17:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

[-- Attachment #1: Type: text/plain, Size: 427 bytes --]

On Tue, 30 Jan 2024 at 08:55, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Let me just reboot into it to test that I got all the cases.
>
> It makes the code even more obvious, and avoids all the complexity.

Yup, this works at least for the case you pointed out, and it really
feels like the RightThing(tm) from a VFS standpoint.

NOTE! This patch is on top of my previous 5-patch series.

                 Linus

[-- Attachment #2: 0001-eventfs-clean-up-dentry-ops-and-add-revalidate-funct.patch --]
[-- Type: text/x-patch, Size: 4937 bytes --]

From 67b31bb53ef65132a65bfbe915628c481a39e73a Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 30 Jan 2024 09:00:18 -0800
Subject: [PATCH] eventfs: clean up dentry ops and add revalidate function

In order for the dentries to stay up-to-date with the eventfs changes,
just add a 'd_revalidate' function that checks the 'is_freed' bit.

Also, clean up the dentry release to actually use d_release() rather
than the slightly odd d_iput() function.  We don't care about the inode,
all we want to do is to get rid of the refcount to the eventfs data
added by dentry->d_fsdata.

It would probably be cleaner to make eventfs its own filesystem, or at
least set its own dentry ops when looking up eventfs files.  But as it
is, only eventfs dentries use d_fsdata, so we don't really need to split
these things up by use.

Another thing that might be worth doing is to make all eventfs lookups
mark their dentries as not worth caching.  We could do that with
d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.

As it is, the dentries are all freeable, but they only tend to get freed
at memory pressure rather than more proactively.  But that's a separate
issue.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 21 +++++----------------
 fs/tracefs/inode.c       | 27 ++++++++++++++++++---------
 fs/tracefs/internal.h    |  3 ++-
 3 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a37db0dac302..acdc797bd9c0 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -414,23 +414,14 @@ static inline struct eventfs_inode *alloc_ei(const char *name)
 
 
 /**
- * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode
- * @ti: the tracefs_inode of the dentry
+ * eventfs_d_release - dentry is going away
  * @dentry: dentry which has the reference to remove.
  *
  * Remove the association between a dentry from an eventfs_inode.
  */
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
+void eventfs_d_release(struct dentry *dentry)
 {
-	struct eventfs_inode *ei;
-
-	mutex_lock(&eventfs_mutex);
-	ei = dentry->d_fsdata;
-	if (ei) {
-		dentry->d_fsdata = NULL;
-		put_ei(ei);
-	}
-	mutex_unlock(&eventfs_mutex);
+	put_ei(dentry->d_fsdata);
 }
 
 /**
@@ -517,9 +508,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 	}
 
  enoent:
-	/* Nothing found? */
-	d_add(dentry, NULL);
-	result = NULL;
+	/* Don't cache negative lookups, just return an error */
+	result = ERR_PTR(-ENOENT);
 
  out:
 	mutex_unlock(&eventfs_mutex);
@@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 	 * sticks around while the other ei->dentry are created
 	 * and destroyed dynamically.
 	 */
-	simple_recursive_removal(dentry, NULL);
 	dput(dentry);
 }
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..64122787e5d0 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -379,21 +379,30 @@ static const struct super_operations tracefs_super_operations = {
 	.show_options	= tracefs_show_options,
 };
 
-static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
+/*
+ * It would be cleaner if eventfs had its own dentry ops.
+ *
+ * Note that d_revalidate is called potentially under RCU,
+ * so it can't take the eventfs mutex etc. It's fine - if
+ * we open a file just as it's marked dead, things will
+ * still work just fine, and just see the old stale case.
+ */
+static void tracefs_d_release(struct dentry *dentry)
 {
-	struct tracefs_inode *ti;
+	if (dentry->d_fsdata)
+		eventfs_d_release(dentry);
+}
 
-	if (!dentry || !inode)
-		return;
+static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct eventfs_inode *ei = dentry->d_fsdata;
 
-	ti = get_tracefs(inode);
-	if (ti && ti->flags & TRACEFS_EVENT_INODE)
-		eventfs_set_ei_status_free(ti, dentry);
-	iput(inode);
+	return !(ei && ei->is_freed);
 }
 
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_dentry_iput,
+	.d_revalidate = tracefs_d_revalidate,
+	.d_release = tracefs_d_release,
 };
 
 static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 72db3bdc4dfb..d4194466b643 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -79,6 +79,7 @@ struct inode *tracefs_get_inode(struct super_block *sb);
 struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
 struct dentry *eventfs_failed_creating(struct dentry *dentry);
 struct dentry *eventfs_end_creating(struct dentry *dentry);
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry);
+
+void eventfs_d_release(struct dentry *dentry);
 
 #endif /* _TRACEFS_INTERNAL_H */
-- 
2.43.0.5.g38fb137bdb


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30  3:56                             ` Linus Torvalds
  2024-01-30  8:43                               ` Linus Torvalds
@ 2024-01-30 18:23                               ` Steven Rostedt
  2024-01-30 19:19                                 ` Linus Torvalds
  2024-01-30 18:36                               ` Steven Rostedt
  2 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2024-01-30 18:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 19:56:52 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [0005-eventfs-get-rid-of-dentry-pointers-without-refcounts.patch  text/x-patch (15652 bytes)] 

I missed this email when I wrote basically the same thing :-p

It was sent just as I went to bed and this morning I mistaken it as an
email I already read. When you mentioned that your last patch was on top of
your other 5, I said to myself, "Wait? there's five?" and went back though
this thread looking at every email with an attachment. Well, even though I
duplicated your work, I'll take that as a learning experience.

Anyway, I'm assuming that the other 4 patches are exactly the same as the
previous version, as applying patches from attachments is a manual process.
I don't even develop on the machine with my email client, so it either is
me downloading each and scp'ing them over to my development box or finding
the lore link and going through them individually one by one.

I know you don't send patches inlined anymore, which is a shame, because
patchwork takes care of all the administering when patches are inlined, and
I don't miss patches like I use to.

So, I'll down load patch 5 here and just assume that I already have your
other 4 patches from the previous email.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30  3:56                             ` Linus Torvalds
  2024-01-30  8:43                               ` Linus Torvalds
  2024-01-30 18:23                               ` Steven Rostedt
@ 2024-01-30 18:36                               ` Steven Rostedt
  2 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-30 18:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Mon, 29 Jan 2024 19:56:52 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [0001-tracefs-avoid-using-the-ei-dentry-pointer-unnecessar.patch  text/x-patch (3561 bytes)] 
> 
> [0002-eventfsfs-initialize-the-tracefs-inode-properly.patch  text/x-patch (2548 bytes)] 

Ah these two are new.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30 18:23                               ` Steven Rostedt
@ 2024-01-30 19:19                                 ` Linus Torvalds
  2024-01-30 19:37                                   ` Steven Rostedt
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2024-01-30 19:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Tue, 30 Jan 2024 at 10:23, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I know you don't send patches inlined anymore, which is a shame, because
> patchwork takes care of all the administering when patches are inlined, and
> I don't miss patches like I use to.

I just sent the whole series as individual patches., and doing a

  b4 am 20240130190355.11486-1-torvalds@linux-foundation.org

should get them all (or if you use patchwork, do that)

I don't normally do these inlined patches any more, because honestly,
99% of all patches I do end up being "I can't test this very well, I
think you should do something like this".

In fact, for simple ones where I might not have even compile-tested
them, much less booted into a kernel with them, I will actively
whitespace-corrupt the patch, just to make sure they aren't applied as
any kind of real workflow - they are almost always meant as a "I think
you should do this, and take credit for it all".

And so when I'm working on a series like this, I'll send attachments
just because it's easier, and because I don't want to patch-bomb
people with some kind of crazy work-in-progress thing.

But I'm reasonably comfortable with this series now, so I sent it as a
"real" patch series. I like it partly because it just removes a lot of
lines:

 3 files changed, 160 insertions(+), 433 deletions(-)

but mostly because the lines it removes are what I consider actively
broken code. So it's not just getting rid of LOC, it's getting rid of
complexity (and bugs) IMHO.

That said, I also don't think that patch series is any kind of
"final". I didn't fix up the readdir iterator locking, for example. I
don't think the SRCU parts are needed at all any more thanks to the
refcounting - the 'ei' is no longer protected by SRCU, it's protected
by virtue of us having the file open (and thus holding the dentry).

So please think of that series not as any kind of final word. More as
a "I strongly believe this is the direction eventfs should go".

I am perfectly ok with you munging those patches and taking full
credit for them, for example.

My "testing" has not involved any real tracing usage, and while I
*have* booted that thing, and have done some very basic smoke-testing
in /sys/kernel/tracing, 99% of the work was literally me just going
through the lookup code, removing everything I found objectionable,
and replacing it with what the VFS layer generally would want.

                      Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30 19:19                                 ` Linus Torvalds
@ 2024-01-30 19:37                                   ` Steven Rostedt
  2024-01-30 19:54                                     ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2024-01-30 19:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Tue, 30 Jan 2024 11:19:01 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 30 Jan 2024 at 10:23, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I know you don't send patches inlined anymore, which is a shame, because
> > patchwork takes care of all the administering when patches are inlined, and
> > I don't miss patches like I use to.  
> 
> I just sent the whole series as individual patches., and doing a
> 
>   b4 am 20240130190355.11486-1-torvalds@linux-foundation.org
> 
> should get them all (or if you use patchwork, do that)

Thanks. I can get the mbox with:

 wget https://patchwork.kernel.org/series/821406/mbox/

For those interested, the series is at:

  https://patchwork.kernel.org/project/linux-trace-kernel/list/?series=821406

And I got the above mbox link from the first patch and selecting the 'mbox'
link.

I download that and then run my scripts which adds the Link tags to them as
well as Cc's. I'll add the link to this thread as well. Then I simply do a
'git am -s'

Note, I mentioned the pain of pulling in your multi-attachment patches by
hand on IRC, and Konstantin replied that he'll fix b4 to do those too ;-)

But I still prefer patchwork, as that keeps history as well.

> 
> I don't normally do these inlined patches any more, because honestly,
> 99% of all patches I do end up being "I can't test this very well, I
> think you should do something like this".
> 
> In fact, for simple ones where I might not have even compile-tested
> them, much less booted into a kernel with them, I will actively
> whitespace-corrupt the patch, just to make sure they aren't applied as
> any kind of real workflow - they are almost always meant as a "I think
> you should do this, and take credit for it all".
> 
> And so when I'm working on a series like this, I'll send attachments
> just because it's easier, and because I don't want to patch-bomb
> people with some kind of crazy work-in-progress thing.
> 
> But I'm reasonably comfortable with this series now, so I sent it as a
> "real" patch series. I like it partly because it just removes a lot of
> lines:
> 
>  3 files changed, 160 insertions(+), 433 deletions(-)
> 
> but mostly because the lines it removes are what I consider actively
> broken code. So it's not just getting rid of LOC, it's getting rid of
> complexity (and bugs) IMHO.

Do you want me to put this in my urgent branch and mark them for stable,
and then send them for 6.8?

> 
> That said, I also don't think that patch series is any kind of
> "final". I didn't fix up the readdir iterator locking, for example. I
> don't think the SRCU parts are needed at all any more thanks to the
> refcounting - the 'ei' is no longer protected by SRCU, it's protected
> by virtue of us having the file open (and thus holding the dentry).
> 
> So please think of that series not as any kind of final word. More as
> a "I strongly believe this is the direction eventfs should go".
> 
> I am perfectly ok with you munging those patches and taking full
> credit for them, for example.

Before seeing your 5 patch series, I wrote that patch I showed you which
did basically the same thing. It passed all the preliminary tests so I'm
sure your version should work too. I'll still have to run it through my full
ktest suite, but that takes several hours to complete. I do that before I
send out my 'for-linus/next' or any git pulls.


> 
> My "testing" has not involved any real tracing usage, and while I
> *have* booted that thing, and have done some very basic smoke-testing
> in /sys/kernel/tracing, 99% of the work was literally me just going
> through the lookup code, removing everything I found objectionable,
> and replacing it with what the VFS layer generally would want.

I'll take your patches and start working on them.

Thanks for doing this!

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30 19:37                                   ` Steven Rostedt
@ 2024-01-30 19:54                                     ` Linus Torvalds
  2024-01-30 20:04                                       ` Steven Rostedt
  2024-01-31 15:58                                       ` Steven Rostedt
  0 siblings, 2 replies; 47+ messages in thread
From: Linus Torvalds @ 2024-01-30 19:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Tue, 30 Jan 2024 at 11:37, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Do you want me to put this in my urgent branch and mark them for stable,
> and then send them for 6.8?

Hmm. I think the only one that fixes a _reported_ bug is that [PTCH
2/6]. And it turns out that while 'ti->private' really is entirely
uninitialized (and I still think it's the cause of the kernel test
robot report that started this thread), the ti->flags field _is_
initialized to zero in tracefs_alloc_inode().

So even in that [PATCH 2/6], these parts:

  -     ti->flags |= TRACEFS_EVENT_INODE;
  +     ti->flags = TRACEFS_EVENT_INODE;

aren't strictly needed (but aren't wrong either).

The 'make sure to initialize ti->private before exposing the dentry'
part *definitely* needs to be part of 6.8, though. That has an
outstanding actually triggered bug report on it.

I do think that tracefs_alloc_inode() should also initialize
ti->private to NULL, but while that would fix the oops that the test
robot reported, it wouldn't fix the data-race on any ti->private
accesses.

So that "ti->private = ei" needs to be done before the d_instantiate()
(that later became a d_add()) regardless. But not having random fields
left uninitialized for future subtle bugs would be a good idea too.

Anyway.

If you do run the full tracefs tests on the whole series, and there
are no other major problems, I'll happily take it all for 6.8. And
yes, even mark it for stable. I think the other bugs are much harder
to hit, but I do think they exist. And code deletion is always good.

So give it the full test attention, and *if* it all still looks good
and there are no new subtle issues that crop up, let's just put this
saga behind us asap.

               Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30 19:54                                     ` Linus Torvalds
@ 2024-01-30 20:04                                       ` Steven Rostedt
  2024-01-31 15:58                                       ` Steven Rostedt
  1 sibling, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-30 20:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Tue, 30 Jan 2024 11:54:56 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 30 Jan 2024 at 11:37, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Do you want me to put this in my urgent branch and mark them for stable,
> > and then send them for 6.8?  
> 
> Hmm. I think the only one that fixes a _reported_ bug is that [PTCH
> 2/6]. And it turns out that while 'ti->private' really is entirely
> uninitialized (and I still think it's the cause of the kernel test
> robot report that started this thread), the ti->flags field _is_
> initialized to zero in tracefs_alloc_inode().
> 
> So even in that [PATCH 2/6], these parts:
> 
>   -     ti->flags |= TRACEFS_EVENT_INODE;
>   +     ti->flags = TRACEFS_EVENT_INODE;
> 
> aren't strictly needed (but aren't wrong either).
> 
> The 'make sure to initialize ti->private before exposing the dentry'
> part *definitely* needs to be part of 6.8, though. That has an
> outstanding actually triggered bug report on it.
> 
> I do think that tracefs_alloc_inode() should also initialize
> ti->private to NULL, but while that would fix the oops that the test
> robot reported, it wouldn't fix the data-race on any ti->private
> accesses.
> 
> So that "ti->private = ei" needs to be done before the d_instantiate()
> (that later became a d_add()) regardless. But not having random fields
> left uninitialized for future subtle bugs would be a good idea too.


I'll add a patch to add __GFP_ZERO to the tracefs inode allocation, and
modify your patch 2 to just move the ti->private = ei;


> 
> Anyway.
> 
> If you do run the full tracefs tests on the whole series, and there
> are no other major problems, I'll happily take it all for 6.8. And
> yes, even mark it for stable. I think the other bugs are much harder
> to hit, but I do think they exist. And code deletion is always good.

Sounds good.

> 
> So give it the full test attention, and *if* it all still looks good
> and there are no new subtle issues that crop up, let's just put this
> saga behind us asap.

Yes, I've been wanting to get away from eventfs for a month now.

Again, I really do appreciate the time you put in to fixing this for me.
I'm going to be backporting this to older Chromebooks as we really need to
cut down on the memory overhead of instances.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-30 19:54                                     ` Linus Torvalds
  2024-01-30 20:04                                       ` Steven Rostedt
@ 2024-01-31 15:58                                       ` Steven Rostedt
  2024-01-31 16:13                                         ` Steven Rostedt
                                                           ` (2 more replies)
  1 sibling, 3 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-31 15:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Tue, 30 Jan 2024 11:54:56 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> If you do run the full tracefs tests on the whole series, and there
> are no other major problems, I'll happily take it all for 6.8. And
> yes, even mark it for stable. I think the other bugs are much harder
> to hit, but I do think they exist. And code deletion is always good.
> 
> So give it the full test attention, and *if* it all still looks good
> and there are no new subtle issues that crop up, let's just put this
> saga behind us asap.

BTW, I ran my full test suite on your patches with the below updates and it
all passed. Note, I did not run the "bisectable" portion of my test. That
is, the part that runs tests on each patch in the series. Because I know
that fails. I just ran the tests on the last applied patch.

I can break up and clean up the patches so that they are bisectable, and if
that passes the bisectable portion of my tests, I can still send them to
you for 6.8. I think this does fix a lot of hidden bugs, and would like all
this to go back to 6.6 when the eventfs was first introduced.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index acdc797bd9c0..31cbe38739fa 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -74,7 +74,8 @@ static void release_ei(struct kref *ref)
 {
 	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
 	kfree(ei->entry_attrs);
-	kfree(ei);
+	kfree_const(ei->name);
+	kfree_rcu(ei, rcu);
 }
 
 static inline void put_ei(struct eventfs_inode *ei)
@@ -348,8 +349,7 @@ static struct dentry *lookup_file(struct eventfs_inode *parent_ei,
 	inode->i_ino = EVENTFS_FILE_INODE_INO;
 
 	ti = get_tracefs(inode);
-	ti->flags = TRACEFS_EVENT_INODE;
-	ti->private = NULL;			// Directories have 'ei', files not
+	ti->flags |= TRACEFS_EVENT_INODE;
 
 	// Files have their parent's ei as their fsdata
 	dentry->d_fsdata = get_ei(parent_ei);
@@ -388,7 +388,8 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
 	inode->i_ino = eventfs_dir_ino(ei);
 
 	ti = get_tracefs(inode);
-	ti->flags = TRACEFS_EVENT_INODE;
+	ti->flags |= TRACEFS_EVENT_INODE;
+	/* Only directories have ti->private set to an ei, not files */
 	ti->private = ei;
 
 	dentry->d_fsdata = get_ei(ei);
@@ -402,17 +403,20 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
 
 static inline struct eventfs_inode *alloc_ei(const char *name)
 {
-	int namesize = strlen(name) + 1;
-	struct eventfs_inode *ei = kzalloc(sizeof(*ei) + namesize, GFP_KERNEL);
+	struct eventfs_inode *ei = kzalloc(sizeof(*ei), GFP_KERNEL);
 
-	if (ei) {
-		memcpy((char *)ei->name, name, namesize);
-		kref_init(&ei->kref);
+	if (!ei)
+		return NULL;
+
+	ei->name = kstrdup_const(name, GFP_KERNEL);
+	if (!ei->name) {
+		kfree(ei);
+		return NULL;
 	}
+	kref_init(&ei->kref);
 	return ei;
 }
 
-
 /**
  * eventfs_d_release - dentry is going away
  * @dentry: dentry which has the reference to remove.
@@ -750,7 +754,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	INIT_LIST_HEAD(&ei->list);
 
 	ti = get_tracefs(inode);
-	ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+	ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
 	ti->private = ei;
 
 	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
@@ -847,5 +851,6 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 	 * sticks around while the other ei->dentry are created
 	 * and destroyed dynamically.
 	 */
+	d_invalidate(dentry);
 	dput(dentry);
 }
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 64122787e5d0..cf90ea86baf8 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -38,8 +38,6 @@ static struct inode *tracefs_alloc_inode(struct super_block *sb)
 	if (!ti)
 		return NULL;
 
-	ti->flags = 0;
-
 	return &ti->vfs_inode;
 }
 
@@ -506,75 +504,6 @@ struct dentry *tracefs_end_creating(struct dentry *dentry)
 	return dentry;
 }
 
-/**
- * eventfs_start_creating - start the process of creating a dentry
- * @name: Name of the file created for the dentry
- * @parent: The parent dentry where this dentry will be created
- *
- * This is a simple helper function for the dynamically created eventfs
- * files. When the directory of the eventfs files are accessed, their
- * dentries are created on the fly. This function is used to start that
- * process.
- */
-struct dentry *eventfs_start_creating(const char *name, struct dentry *parent)
-{
-	struct dentry *dentry;
-	int error;
-
-	/* Must always have a parent. */
-	if (WARN_ON_ONCE(!parent))
-		return ERR_PTR(-EINVAL);
-
-	error = simple_pin_fs(&trace_fs_type, &tracefs_mount,
-			      &tracefs_mount_count);
-	if (error)
-		return ERR_PTR(error);
-
-	if (unlikely(IS_DEADDIR(parent->d_inode)))
-		dentry = ERR_PTR(-ENOENT);
-	else
-		dentry = lookup_one_len(name, parent, strlen(name));
-
-	if (!IS_ERR(dentry) && dentry->d_inode) {
-		dput(dentry);
-		dentry = ERR_PTR(-EEXIST);
-	}
-
-	if (IS_ERR(dentry))
-		simple_release_fs(&tracefs_mount, &tracefs_mount_count);
-
-	return dentry;
-}
-
-/**
- * eventfs_failed_creating - clean up a failed eventfs dentry creation
- * @dentry: The dentry to clean up
- *
- * If after calling eventfs_start_creating(), a failure is detected, the
- * resources created by eventfs_start_creating() needs to be cleaned up. In
- * that case, this function should be called to perform that clean up.
- */
-struct dentry *eventfs_failed_creating(struct dentry *dentry)
-{
-	dput(dentry);
-	simple_release_fs(&tracefs_mount, &tracefs_mount_count);
-	return NULL;
-}
-
-/**
- * eventfs_end_creating - Finish the process of creating a eventfs dentry
- * @dentry: The dentry that has successfully been created.
- *
- * This function is currently just a place holder to match
- * eventfs_start_creating(). In case any synchronization needs to be added,
- * this function will be used to implement that without having to modify
- * the callers of eventfs_start_creating().
- */
-struct dentry *eventfs_end_creating(struct dentry *dentry)
-{
-	return dentry;
-}
-
 /* Find the inode that this will use for default */
 static struct inode *instance_inode(struct dentry *parent, struct inode *inode)
 {
@@ -788,6 +717,7 @@ static void init_once(void *foo)
 {
 	struct tracefs_inode *ti = (struct tracefs_inode *) foo;
 
+	memset(ti, 0, sizeof(*ti));
 	inode_init_once(&ti->vfs_inode);
 }
 
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index d4194466b643..ca1ccc86986f 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -46,6 +46,7 @@ struct eventfs_inode {
 	struct kref			kref;
 	struct list_head		list;
 	const struct eventfs_entry	*entries;
+	const char			*name;
 	struct list_head		children;
 	struct dentry			*events_dir;
 	struct eventfs_attr		*entry_attrs;
@@ -64,7 +65,6 @@ struct eventfs_inode {
 		struct llist_node	llist;
 		struct rcu_head		rcu;
 	};
-	const char name[];
 };
 
 static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
@@ -76,9 +76,6 @@ struct dentry *tracefs_start_creating(const char *name, struct dentry *parent);
 struct dentry *tracefs_end_creating(struct dentry *dentry);
 struct dentry *tracefs_failed_creating(struct dentry *dentry);
 struct inode *tracefs_get_inode(struct super_block *sb);
-struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
-struct dentry *eventfs_failed_creating(struct dentry *dentry);
-struct dentry *eventfs_end_creating(struct dentry *dentry);
 
 void eventfs_d_release(struct dentry *dentry);
 

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-31 15:58                                       ` Steven Rostedt
@ 2024-01-31 16:13                                         ` Steven Rostedt
  2024-01-31 17:28                                           ` Steven Rostedt
  2024-01-31 17:26                                         ` Steven Rostedt
  2024-01-31 19:35                                         ` Linus Torvalds
  2 siblings, 1 reply; 47+ messages in thread
From: Steven Rostedt @ 2024-01-31 16:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Wed, 31 Jan 2024 10:58:47 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> @@ -788,6 +717,7 @@ static void init_once(void *foo)
>  {
>  	struct tracefs_inode *ti = (struct tracefs_inode *) foo;
>  
> +	memset(ti, 0, sizeof(*ti));
>  	inode_init_once(&ti->vfs_inode);
>  }
>  

Note, that inode_init_once() also does a memset on the entire inode, so the
initial memset is redundant on the inode portion. But I didn't think it was
really worth the time to complicate the code by optimizing it. I guess if I
changed the structure to:

 struct tracefs_inode {
+	struct inode            vfs_inode;
	unsigned long           flags;
	void                    *private;
-	struct inode            vfs_inode;
 };

I could have it do:

	memset_after(ti, 0, vfs_inode);

But this can be done as a separate clean up and doesn't need to be done now.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-31 15:58                                       ` Steven Rostedt
  2024-01-31 16:13                                         ` Steven Rostedt
@ 2024-01-31 17:26                                         ` Steven Rostedt
  2024-01-31 19:35                                         ` Linus Torvalds
  2 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-31 17:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Wed, 31 Jan 2024 10:58:47 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> BTW, I ran my full test suite on your patches with the below updates and it
> all passed. Note, I did not run the "bisectable" portion of my test. That
> is, the part that runs tests on each patch in the series. Because I know
> that fails. I just ran the tests on the last applied patch.
> 
> I can break up and clean up the patches so that they are bisectable, and if
> that passes the bisectable portion of my tests, I can still send them to
> you for 6.8. I think this does fix a lot of hidden bugs, and would like all
> this to go back to 6.6 when the eventfs was first introduced.

So I swapped the order of patch 5 and patch 6, and it appears that patch 6
works without 5 (with some massaging).

Here's the new patch 6 without patch 5, and now patch 5 just finishes the
rest of the changes.

I'll post this series, pull it in through my normal work flow, and rerun
the full test suite with the bisecting check as well.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 0289ec787367..0213a3375d53 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -378,13 +378,12 @@ static void free_ei(struct eventfs_inode *ei)
 }
 
 /**
- * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode
- * @ti: the tracefs_inode of the dentry
+ * eventfs_d_release - dentry is going away
  * @dentry: dentry which has the reference to remove.
  *
  * Remove the association between a dentry from an eventfs_inode.
  */
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
+void eventfs_d_release(struct dentry *dentry)
 {
 	struct eventfs_inode *ei;
 	int i;
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index f403d32bd7cd..cf90ea86baf8 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -377,21 +377,30 @@ static const struct super_operations tracefs_super_operations = {
 	.show_options	= tracefs_show_options,
 };
 
-static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
+/*
+ * It would be cleaner if eventfs had its own dentry ops.
+ *
+ * Note that d_revalidate is called potentially under RCU,
+ * so it can't take the eventfs mutex etc. It's fine - if
+ * we open a file just as it's marked dead, things will
+ * still work just fine, and just see the old stale case.
+ */
+static void tracefs_d_release(struct dentry *dentry)
 {
-	struct tracefs_inode *ti;
+	if (dentry->d_fsdata)
+		eventfs_d_release(dentry);
+}
 
-	if (!dentry || !inode)
-		return;
+static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct eventfs_inode *ei = dentry->d_fsdata;
 
-	ti = get_tracefs(inode);
-	if (ti && ti->flags & TRACEFS_EVENT_INODE)
-		eventfs_set_ei_status_free(ti, dentry);
-	iput(inode);
+	return !(ei && ei->is_freed);
 }
 
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_dentry_iput,
+	.d_revalidate = tracefs_d_revalidate,
+	.d_release = tracefs_d_release,
 };
 
 static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 9e64ca4829c7..687faba9807b 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -77,6 +77,7 @@ struct dentry *tracefs_start_creating(const char *name, struct dentry *parent);
 struct dentry *tracefs_end_creating(struct dentry *dentry);
 struct dentry *tracefs_failed_creating(struct dentry *dentry);
 struct inode *tracefs_get_inode(struct super_block *sb);
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry);
+
+void eventfs_d_release(struct dentry *dentry);
 
 #endif /* _TRACEFS_INTERNAL_H */

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-31 16:13                                         ` Steven Rostedt
@ 2024-01-31 17:28                                           ` Steven Rostedt
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-31 17:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Wed, 31 Jan 2024 11:13:58 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 31 Jan 2024 10:58:47 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > @@ -788,6 +717,7 @@ static void init_once(void *foo)
> >  {
> >  	struct tracefs_inode *ti = (struct tracefs_inode *) foo;
> >  
> > +	memset(ti, 0, sizeof(*ti));
> >  	inode_init_once(&ti->vfs_inode);
> >  }
> >    
> 
> Note, that inode_init_once() also does a memset on the entire inode, so the
> initial memset is redundant on the inode portion. But I didn't think it was
> really worth the time to complicate the code by optimizing it. I guess if I
> changed the structure to:
> 
>  struct tracefs_inode {
> +	struct inode            vfs_inode;
> 	unsigned long           flags;
> 	void                    *private;
> -	struct inode            vfs_inode;
>  };
> 
> I could have it do:
> 
> 	memset_after(ti, 0, vfs_inode);
> 
> But this can be done as a separate clean up and doesn't need to be done now.
> 

Hmm, since I need to run all the tests again, I think I'll change this
patch to do the above.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-31 15:58                                       ` Steven Rostedt
  2024-01-31 16:13                                         ` Steven Rostedt
  2024-01-31 17:26                                         ` Steven Rostedt
@ 2024-01-31 19:35                                         ` Linus Torvalds
  2024-01-31 19:42                                           ` Steven Rostedt
  2 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2024-01-31 19:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Wed, 31 Jan 2024 at 07:58, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> BTW, I ran my full test suite on your patches with the below updates and it
> all passed.

Those patch updates all look sane to me.

> I can break up and clean up the patches so that they are bisectable, and if
> that passes the bisectable portion of my tests, I can still send them to
> you for 6.8.

Ack. That series you posted looks fine. I didn't do any actual testing
or applying the patches, just looking at them.

The one thing I noticed is that the 'llist' removal still needs to be
done. The logical point is that "[PATCH v2 7/7]" where the
eventfs_workfn stuff is ripped out.

And the 'rcu' head should now be a union with something that is no
longer used after the last kref. The only thing that *is* used after
the last kref is the "is_freed" bit, so there's lots of choice. Using
the 'struct list_head listl' that is used for the child list would
seem to be the obvious choice, but it could be anything (including all
of the beginning of that eventfs_inode, but then you would need to
group that as another nested unnamed struct, so picking a "big enough"
entry like 'list' makes it syntactically simpler.

               Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
  2024-01-31 19:35                                         ` Linus Torvalds
@ 2024-01-31 19:42                                           ` Steven Rostedt
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2024-01-31 19:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Christian Brauner, Al Viro,
	Ajay Kaher, linux-trace-kernel

On Wed, 31 Jan 2024 11:35:18 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 31 Jan 2024 at 07:58, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > BTW, I ran my full test suite on your patches with the below updates and it
> > all passed.  
> 
> Those patch updates all look sane to me.
> 
> > I can break up and clean up the patches so that they are bisectable, and if
> > that passes the bisectable portion of my tests, I can still send them to
> > you for 6.8.  
> 
> Ack. That series you posted looks fine. I didn't do any actual testing
> or applying the patches, just looking at them.
> 
> The one thing I noticed is that the 'llist' removal still needs to be
> done. The logical point is that "[PATCH v2 7/7]" where the
> eventfs_workfn stuff is ripped out.
> 
> And the 'rcu' head should now be a union with something that is no
> longer used after the last kref. The only thing that *is* used after
> the last kref is the "is_freed" bit, so there's lots of choice. Using
> the 'struct list_head listl' that is used for the child list would
> seem to be the obvious choice, but it could be anything (including all
> of the beginning of that eventfs_inode, but then you would need to
> group that as another nested unnamed struct, so picking a "big enough"
> entry like 'list' makes it syntactically simpler.

Yeah, that was what I was talking about in my cover letter with:

  Note, there's more clean ups that can happen. One being cleaning up the
  eventfs_inode structure. But that's not critical now and can be added
  later.

I just want to get the majority of the broken parts done. The clean up of
the eventfs_inode is something that I'd add a separate patch. Not sure that
falls in your "fixes" category for 6.8.

-- Steve

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2024-01-31 19:42 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-29  2:58 [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address kernel test robot
2024-01-29  4:36 ` Linus Torvalds
2024-01-29 17:01   ` Steven Rostedt
2024-01-29 17:40     ` Linus Torvalds
2024-01-29 17:44       ` Steven Rostedt
2024-01-29 17:45         ` Steven Rostedt
2024-01-29 17:56         ` Linus Torvalds
2024-01-29 17:55       ` Linus Torvalds
2024-01-29 19:24       ` Linus Torvalds
2024-01-29 19:51         ` Linus Torvalds
2024-01-29 20:26           ` Steven Rostedt
2024-01-29 20:51             ` Linus Torvalds
2024-01-29 21:45               ` Steven Rostedt
2024-01-29 22:19                 ` Linus Torvalds
2024-01-29 21:55               ` Steven Rostedt
2024-01-29 22:22               ` Steven Rostedt
2024-01-29 22:35                 ` Linus Torvalds
2024-01-29 22:42                   ` Linus Torvalds
2024-01-29 22:49                     ` Steven Rostedt
2024-01-30  0:01                       ` Linus Torvalds
2024-01-30  0:35                         ` Steven Rostedt
2024-01-30  1:50                           ` Linus Torvalds
2024-01-30  3:56                             ` Linus Torvalds
2024-01-30  8:43                               ` Linus Torvalds
2024-01-30  9:12                                 ` Linus Torvalds
2024-01-30 12:45                                   ` Rasmus Villemoes
2024-01-30 14:39                                   ` Steven Rostedt
2024-01-30 16:49                                     ` Steven Rostedt
2024-01-30 16:55                                       ` Linus Torvalds
2024-01-30 17:06                                         ` Steven Rostedt
2024-01-30 17:09                                         ` Linus Torvalds
2024-01-30 16:51                                     ` Linus Torvalds
2024-01-30 18:23                               ` Steven Rostedt
2024-01-30 19:19                                 ` Linus Torvalds
2024-01-30 19:37                                   ` Steven Rostedt
2024-01-30 19:54                                     ` Linus Torvalds
2024-01-30 20:04                                       ` Steven Rostedt
2024-01-31 15:58                                       ` Steven Rostedt
2024-01-31 16:13                                         ` Steven Rostedt
2024-01-31 17:28                                           ` Steven Rostedt
2024-01-31 17:26                                         ` Steven Rostedt
2024-01-31 19:35                                         ` Linus Torvalds
2024-01-31 19:42                                           ` Steven Rostedt
2024-01-30 18:36                               ` Steven Rostedt
2024-01-29 22:47                   ` Steven Rostedt
2024-01-29 23:15                     ` Steven Rostedt
2024-01-30  2:08   ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).