All of lore.kernel.org
 help / color / mirror / Atom feed
* [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove
@ 2024-04-08  8:26 syzbot
  2024-04-15 14:31 ` Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 90+ messages in thread
From: syzbot @ 2024-04-08  8:26 UTC (permalink / raw)
  To: axboe, brauner, io-uring, jack, linux-fsdevel, linux-kernel,
	syzkaller-bugs, viro

Hello,

syzbot found the following issue on:

HEAD commit:    480e035fc4c7 Merge tag 'drm-next-2024-03-13' of https://gi..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=15f55413180000
kernel config:  https://syzkaller.appspot.com/x/.config?x=1e5b814e91787669
dashboard link: https://syzkaller.appspot.com/bug?extid=045b454ab35fd82a35fb
compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=101db623180000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10801175180000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/5f73b6ef963d/disk-480e035f.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/46c949396aad/vmlinux-480e035f.xz
kernel image: https://storage.googleapis.com/syzbot-assets/e3b4d0f5a5f8/bzImage-480e035f.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com

general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 5075 Comm: syz-executor119 Not tainted 6.8.0-syzkaller-08073-g480e035fc4c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
RIP: 0010:__ep_remove+0x13a/0x780 fs/eventpoll.c:825
Code: 89 e7 49 c1 ef 03 41 80 3c 2f 00 74 08 4c 89 e7 e8 8b 81 d9 ff 4d 8b 34 24 4c 89 f0 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df <80> 3c 08 00 74 08 4c 89 f7 e8 68 81 d9 ff 49 8b 1e 48 8b 04 24 48
RSP: 0018:ffffc90003b6fa78 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: dffffc0000000000
RDX: ffff88807a310000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: dffffc0000000000 R08: ffffffff821ec2a4 R09: fffff5200076df40
R10: dffffc0000000000 R11: fffff5200076df40 R12: ffff88802916cbb8
R13: ffff88802916ca00 R14: 0000000000000000 R15: 1ffff1100522d977
FS:  00007f034c6ff6c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000004 CR3: 00000000775ec000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 eventpoll_release_file+0xdb/0x1c0 fs/eventpoll.c:1071
 eventpoll_release include/linux/eventpoll.h:53 [inline]
 __fput+0x6d7/0x8a0 fs/file_table.c:413
 task_work_run+0x24f/0x310 kernel/task_work.c:180
 get_signal+0x1673/0x1850 kernel/signal.c:2683
 arch_do_signal_or_restart+0x96/0x860 arch/x86/kernel/signal.c:310
 exit_to_user_mode_loop kernel/entry/common.c:105 [inline]
 exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
 __syscall_exit_to_user_mode_work kernel/entry/common.c:201 [inline]
 syscall_exit_to_user_mode+0xc9/0x360 kernel/entry/common.c:212
 do_syscall_64+0x10a/0x240 arch/x86/entry/common.c:89
 entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7f034c84c059
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 1f 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f034c6ff178 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00007f034c8f0038 RCX: 00007f034c84c059
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f034c8f0038
RBP: 00007f034c8f0030 R08: 00007f034c6ff6c0 R09: 00007f034c6ff6c0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f034c8f003c
R13: 0000000000000002 R14: 00007f034c9ffc40 R15: 00007f034c9ffd28
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__ep_remove+0x13a/0x780 fs/eventpoll.c:825
Code: 89 e7 49 c1 ef 03 41 80 3c 2f 00 74 08 4c 89 e7 e8 8b 81 d9 ff 4d 8b 34 24 4c 89 f0 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df <80> 3c 08 00 74 08 4c 89 f7 e8 68 81 d9 ff 49 8b 1e 48 8b 04 24 48
RSP: 0018:ffffc90003b6fa78 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: dffffc0000000000
RDX: ffff88807a310000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: dffffc0000000000 R08: ffffffff821ec2a4 R09: fffff5200076df40
R10: dffffc0000000000 R11: fffff5200076df40 R12: ffff88802916cbb8
R13: ffff88802916ca00 R14: 0000000000000000 R15: 1ffff1100522d977
FS:  00007f034c6ff6c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000004 CR3: 00000000775ec000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):
   0:	89 e7                	mov    %esp,%edi
   2:	49 c1 ef 03          	shr    $0x3,%r15
   6:	41 80 3c 2f 00       	cmpb   $0x0,(%r15,%rbp,1)
   b:	74 08                	je     0x15
   d:	4c 89 e7             	mov    %r12,%rdi
  10:	e8 8b 81 d9 ff       	call   0xffd981a0
  15:	4d 8b 34 24          	mov    (%r12),%r14
  19:	4c 89 f0             	mov    %r14,%rax
  1c:	48 c1 e8 03          	shr    $0x3,%rax
  20:	48 b9 00 00 00 00 00 	movabs $0xdffffc0000000000,%rcx
  27:	fc ff df
* 2a:	80 3c 08 00          	cmpb   $0x0,(%rax,%rcx,1) <-- trapping instruction
  2e:	74 08                	je     0x38
  30:	4c 89 f7             	mov    %r14,%rdi
  33:	e8 68 81 d9 ff       	call   0xffd981a0
  38:	49 8b 1e             	mov    (%r14),%rbx
  3b:	48 8b 04 24          	mov    (%rsp),%rax
  3f:	48                   	rex.W


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove
  2024-04-08  8:26 [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove syzbot
@ 2024-04-15 14:31 ` Jens Axboe
  2024-04-15 14:57   ` Pavel Begunkov
  2024-05-03 11:54 ` Bui Quang Minh
  2024-05-04  3:23 ` [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove Hillf Danton
  2 siblings, 1 reply; 90+ messages in thread
From: Jens Axboe @ 2024-04-15 14:31 UTC (permalink / raw)
  To: syzbot
  Cc: brauner, io-uring, jack, linux-fsdevel, linux-kernel,
	syzkaller-bugs, viro

This isn't related to io_uring at all, not sure why syzbot has this idea
that anything that involves task_work is iouring.

#syz set subsystems: fs

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove
  2024-04-15 14:31 ` Jens Axboe
@ 2024-04-15 14:57   ` Pavel Begunkov
  0 siblings, 0 replies; 90+ messages in thread
From: Pavel Begunkov @ 2024-04-15 14:57 UTC (permalink / raw)
  To: Jens Axboe, syzbot
  Cc: brauner, io-uring, jack, linux-fsdevel, linux-kernel,
	syzkaller-bugs, viro

On 4/15/24 15:31, Jens Axboe wrote:
> This isn't related to io_uring at all, not sure why syzbot has this idea
> that anything that involves task_work is iouring.

The repro does send an IORING_OP_READ, but I haven't looked deeper
to say if it's anything meaningful.


> #syz set subsystems: fs
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove
  2024-04-08  8:26 [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove syzbot
  2024-04-15 14:31 ` Jens Axboe
@ 2024-05-03 11:54 ` Bui Quang Minh
  2024-05-03 18:26   ` get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove) Kees Cook
  2024-05-04  3:23 ` [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove Hillf Danton
  2 siblings, 1 reply; 90+ messages in thread
From: Bui Quang Minh @ 2024-05-03 11:54 UTC (permalink / raw)
  To: syzbot, axboe, brauner, io-uring, jack, linux-fsdevel,
	linux-kernel, syzkaller-bugs, viro
  Cc: Sumit Semwal, Christian König, linux-media, dri-devel,
	linaro-mm-sig, linux-kernel, Laura Abbott, Kees Cook

Hi everyone,

I've tried to debug this syzkaller's bug report

Here is my minimized proof-of-concept

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <linux/udmabuf.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <pthread.h>
#include <sys/ioctl.h>

#define err_msg(msg) do {perror(msg); exit(1);} while(0)

void *close_thread(void *arg)
{
     int fd = (int) (long) arg;
     close(fd);
}

int main()
{
     int fd, dmabuf_fd, memfd, epoll_fd, ret;
     struct udmabuf_create dmabuf_arg = {};
     struct epoll_event event = {
         .events = EPOLLIN | EPOLLOUT,
     };
     pthread_t thread;

     memfd = memfd_create("test", MFD_ALLOW_SEALING);
     if (memfd < 0)
         err_msg("memfd-create");

     if (ftruncate(memfd, 0x1000) < 0)
         err_msg("ftruncate");

     ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
     if (ret < 0)
         err_msg("add-seal");

     fd = open("/dev/udmabuf", O_RDWR);
     if (fd < 0)
         err_msg("open");

     dmabuf_arg.memfd = memfd;
     dmabuf_arg.size = 0x1000;
     dmabuf_fd = ioctl(fd, UDMABUF_CREATE, &dmabuf_arg);
     if (dmabuf_fd < 0)
         err_msg("ioctl-udmabuf");

     epoll_fd = epoll_create(10);
     if (epoll_fd < 0)
         err_msg("epoll-create");

     ret = epoll_ctl(epoll_fd, EPOLL_CTL_ADD, dmabuf_fd, &event);
     if (ret < 0)
         err_msg("epoll-ctl-add");

     pthread_create(&thread, NULL, close_thread, (void *) (long) dmabuf_fd);
     epoll_wait(epoll_fd, &event, 1, -1);
     return 0;
}

When running the above proof-of-concept on Linux 6.9.0-rc6 with KASAN 
and the
following patch for easier reproducible, I got the KASAN bug report

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..de3463e7d47b 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -27,6 +27,7 @@
  #include <linux/mm.h>
  #include <linux/mount.h>
  #include <linux/pseudo_fs.h>
+#include <linux/delay.h>

  #include <uapi/linux/dma-buf.h>
  #include <uapi/linux/magic.h>
@@ -240,6 +241,7 @@ static __poll_t dma_buf_poll(struct file *file, 
poll_table *poll)
         struct dma_resv *resv;
         __poll_t events;

+       mdelay(1000);
         dmabuf = file->private_data;
         if (!dmabuf || !dmabuf->resv)
                 return EPOLLERR;

 > while true; do ./mypoc_v2; done
==================================================================
BUG: KASAN: slab-use-after-free in __fput+0x164/0x523
Read of size 8 at addr ffff88800051e830 by task mypoc_v2/402

CPU: 0 PID: 402 Comm: mypoc_v2 Not tainted 6.9.0-rc5+ #11
Call Trace:
  <TASK>
  dump_stack_lvl+0x49/0x65
  ? __fput+0x164/0x523
  print_report+0x170/0x4c2
  ? __virt_addr_valid+0x21b/0x22c
  ? kmem_cache_debug_flags+0xc/0x1d
  ? __fput+0x164/0x523
  kasan_report+0xae/0xd5
  ? __fput+0x164/0x523
  __fput+0x164/0x523
  ? __pfx___schedule+0x10/0x10
  task_work_run+0x16a/0x1bb
  ? __pfx_task_work_run+0x10/0x10
  ? __x64_sys_epoll_wait+0x107/0x143
  resume_user_mode_work+0x21/0x44
  syscall_exit_to_user_mode+0x5d/0x76
  do_syscall_64+0xb5/0x107
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x44d99e
Code: 10 89 7c 24 0c 89 4c 24 1c e8 2e 8c 02 00 44 8b 54 24 1c 8b 54 24 
18 41 89 c0 48 8b 74 24 10 8b 7c 24 0c b8 e8 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 32 44 89 c7 89 44 24 0c e8 6e 8c 02 00 8b 44
RSP: 002b:00007fffaec21770 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
RAX: 0000000000000001 RBX: 00007fffaec219e8 RCX: 000000000044d99e
RDX: 0000000000000001 RSI: 00007fffaec217c4 RDI: 0000000000000006
RBP: 00007fffaec217f0 R08: 0000000000000000 R09: 00007fffaec2167f
R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000001
R13: 00007fffaec219d8 R14: 00000000004dc790 R15: 0000000000000001
  </TASK>

Allocated by task 402:
  kasan_save_stack+0x24/0x44
  kasan_save_track+0x14/0x2d
  __kasan_slab_alloc+0x47/0x55
  kmem_cache_alloc_lru+0x12a/0x172
  __d_alloc+0x2d/0x618
  d_alloc_pseudo+0x14/0x8d
  alloc_path_pseudo+0xa5/0x165
  alloc_file_pseudo+0x7f/0x124
  dma_buf_export+0x37f/0x894
  udmabuf_create+0x53e/0x68c
  udmabuf_ioctl+0x133/0x212
  vfs_ioctl+0x7e/0x95
  __do_sys_ioctl+0x51/0x78
  do_syscall_64+0x9b/0x107
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Freed by task 403:
  kasan_save_stack+0x24/0x44
  kasan_save_track+0x14/0x2d
  kasan_save_free_info+0x3f/0x4d
  poison_slab_object+0xcb/0xd8
  __kasan_slab_free+0x19/0x38
  kmem_cache_free+0xd6/0x136
  __dentry_kill+0x22d/0x321
  dput+0x3b/0x7f
  __fput+0x4f1/0x523
  __do_sys_close+0x59/0x87
  do_syscall_64+0x9b/0x107
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The buggy address belongs to the object at ffff88800051e800
  which belongs to the cache dentry of size 192
The buggy address is located 48 bytes inside of
  freed 192-byte region [ffff88800051e800, ffff88800051e8c0)

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x51e
flags: 0x800(slab|zone=0)
page_type: 0xffffffff()
raw: 0000000000000800 ffff888000281780 ffffea0000013ec0 0000000000000002
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff88800051e700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff88800051e780: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
 >ffff88800051e800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                      ^
  ffff88800051e880: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
  ffff88800051e900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================

Root cause:
AFAIK, eventpoll (epoll) does not increase the registered file's reference.
To ensure the safety, when the registered file is deallocated in __fput,
eventpoll_release is called to unregister the file from epoll. When calling
poll on epoll, epoll will loop through registered files and call vfs_poll on
these files. In the file's poll, file is guaranteed to be alive, however, as
epoll does not increase the registered file's reference, the file may be 
dying
and it's not safe the get the file for later use. And dma_buf_poll violates
this. In the dma_buf_poll, it tries to get_file to use in the callback. This
leads to a race where the dmabuf->file can be fput twice.

Here is the race occurs in the above proof-of-concept

close(dmabuf->file)
__fput_sync (f_count == 1, last ref)
f_count-- (f_count == 0 now)
__fput
                                     epoll_wait
                                     vfs_poll(dmabuf->file)
                                     get_file(dmabuf->file)(f_count == 1)
eventpoll_release
dmabuf->file deallocation
                                     fput(dmabuf->file) (f_count == 1)
                                     f_count--
                                     dmabuf->file deallocation

I am not familiar with the dma_buf so I don't know the proper fix for the
issue. About the rule that don't get the file for later use in poll 
callback of
file, I wonder if it is there when only select/poll exist or just after 
epoll
appears.

I hope the analysis helps us to fix the issue.

Thanks,
Quang Minh.

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 11:54 ` Bui Quang Minh
@ 2024-05-03 18:26   ` Kees Cook
  2024-05-03 18:49     ` Jens Axboe
                       ` (2 more replies)
  0 siblings, 3 replies; 90+ messages in thread
From: Kees Cook @ 2024-05-03 18:26 UTC (permalink / raw)
  To: Bui Quang Minh, Al Viro, Christian Brauner
  Cc: syzbot, axboe, brauner, io-uring, jack, linux-fsdevel,
	linux-kernel, syzkaller-bugs, viro, Sumit Semwal,
	Christian König, linux-media, dri-devel, linaro-mm-sig,
	Laura Abbott

On Fri, May 03, 2024 at 06:54:22PM +0700, Bui Quang Minh wrote:
> [...]
> Root cause:
> AFAIK, eventpoll (epoll) does not increase the registered file's reference.
> To ensure the safety, when the registered file is deallocated in __fput,
> eventpoll_release is called to unregister the file from epoll. When calling
> poll on epoll, epoll will loop through registered files and call vfs_poll on
> these files. In the file's poll, file is guaranteed to be alive, however, as
> epoll does not increase the registered file's reference, the file may be
> dying
> and it's not safe the get the file for later use. And dma_buf_poll violates
> this. In the dma_buf_poll, it tries to get_file to use in the callback. This
> leads to a race where the dmabuf->file can be fput twice.
> 
> Here is the race occurs in the above proof-of-concept
> 
> close(dmabuf->file)
> __fput_sync (f_count == 1, last ref)
> f_count-- (f_count == 0 now)
> __fput
>                                     epoll_wait
>                                     vfs_poll(dmabuf->file)
>                                     get_file(dmabuf->file)(f_count == 1)
> eventpoll_release
> dmabuf->file deallocation
>                                     fput(dmabuf->file) (f_count == 1)
>                                     f_count--
>                                     dmabuf->file deallocation
> 
> I am not familiar with the dma_buf so I don't know the proper fix for the
> issue. About the rule that don't get the file for later use in poll callback
> of
> file, I wonder if it is there when only select/poll exist or just after
> epoll
> appears.
> 
> I hope the analysis helps us to fix the issue.

Thanks for doing this analysis! I suspect at least a start of a fix
would be this:

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..15e8f74ee0f2 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
 
 		if (events & EPOLLOUT) {
 			/* Paired with fput in dma_buf_poll_cb */
-			get_file(dmabuf->file);
-
-			if (!dma_buf_poll_add_cb(resv, true, dcb))
+			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
+			    !dma_buf_poll_add_cb(resv, true, dcb))
 				/* No callback queued, wake up any other waiters */
 				dma_buf_poll_cb(NULL, &dcb->cb);
 			else
@@ -290,9 +289,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
 
 		if (events & EPOLLIN) {
 			/* Paired with fput in dma_buf_poll_cb */
-			get_file(dmabuf->file);
-
-			if (!dma_buf_poll_add_cb(resv, false, dcb))
+			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
+			    !dma_buf_poll_add_cb(resv, false, dcb))
 				/* No callback queued, wake up any other waiters */
 				dma_buf_poll_cb(NULL, &dcb->cb);
 			else


But this ends up leaving "active" non-zero, and at close time it runs
into:

        BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);

But the bottom line is that get_file() is unsafe to use in some places,
one of which appears to be in the poll handler. There are maybe some
other fragile places too, like in drivers/gpu/drm/vmwgfx/ttm_object.c:

static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
{
	return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
}

Which I also note involves a dmabuf...

Due to this issue I've proposed fixing get_file() to detect pathological states:
https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/

But that has run into some push-back. I'm hoping that seeing this epoll
example will help illustrate what needs fixing a little better.

I think the best current proposal is to just WARN sooner instead of a
full refcount_t implementation:


diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8dfd53b52744..e09107d0a3d6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1040,7 +1040,8 @@ struct file_handle {
 
 static inline struct file *get_file(struct file *f)
 {
-	atomic_long_inc(&f->f_count);
+	long prior = atomic_long_fetch_inc_relaxed(&f->f_count);
+	WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n");
 	return f;
 }
 


What's the right way to deal with the dmabuf situation? (And I suspect
it applies to get_dma_buf_unless_doomed() as well...)

-Kees

-- 
Kees Cook

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 18:26   ` get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove) Kees Cook
@ 2024-05-03 18:49     ` Jens Axboe
  2024-05-03 19:22       ` Kees Cook
  2024-05-03 21:11     ` [PATCH] epoll: try to be a _bit_ better about file lifetimes Linus Torvalds
  2024-05-04  9:45     ` get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove) Christian Brauner
  2 siblings, 1 reply; 90+ messages in thread
From: Jens Axboe @ 2024-05-03 18:49 UTC (permalink / raw)
  To: Kees Cook, Bui Quang Minh, Al Viro, Christian Brauner
  Cc: syzbot, io-uring, jack, linux-fsdevel, linux-kernel,
	syzkaller-bugs, Sumit Semwal, Christian König, linux-media,
	dri-devel, linaro-mm-sig, Laura Abbott

On 5/3/24 12:26 PM, Kees Cook wrote:
> Thanks for doing this analysis! I suspect at least a start of a fix
> would be this:
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 8fe5aa67b167..15e8f74ee0f2 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
>  
>  		if (events & EPOLLOUT) {
>  			/* Paired with fput in dma_buf_poll_cb */
> -			get_file(dmabuf->file);
> -
> -			if (!dma_buf_poll_add_cb(resv, true, dcb))
> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> +			    !dma_buf_poll_add_cb(resv, true, dcb))
>  				/* No callback queued, wake up any other waiters */

Don't think this is sane at all. I'm assuming you meant:

	atomic_long_inc_not_zero(&dmabuf->file->f_count);

but won't fly as you're not under RCU in the first place. And what
protects it from being long gone before you attempt this anyway? This is
sane way to attempt to fix it, it's completely opposite of what sane ref
handling should look like.

Not sure what the best fix is here, seems like dma-buf should hold an
actual reference to the file upfront rather than just stash a pointer
and then later _hope_ that it can just grab a reference. That seems
pretty horrible, and the real source of the issue.

> Due to this issue I've proposed fixing get_file() to detect pathological states:
> https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/

I don't think this would catch this case, as the memory could just be
garbage at this point.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 18:49     ` Jens Axboe
@ 2024-05-03 19:22       ` Kees Cook
  2024-05-03 19:35         ` Jens Axboe
  0 siblings, 1 reply; 90+ messages in thread
From: Kees Cook @ 2024-05-03 19:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bui Quang Minh, Al Viro, Christian Brauner, syzbot, io-uring,
	jack, linux-fsdevel, linux-kernel, syzkaller-bugs, Sumit Semwal,
	Christian König, linux-media, dri-devel, linaro-mm-sig,
	Laura Abbott

On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
> On 5/3/24 12:26 PM, Kees Cook wrote:
> > Thanks for doing this analysis! I suspect at least a start of a fix
> > would be this:
> > 
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index 8fe5aa67b167..15e8f74ee0f2 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
> >  
> >  		if (events & EPOLLOUT) {
> >  			/* Paired with fput in dma_buf_poll_cb */
> > -			get_file(dmabuf->file);
> > -
> > -			if (!dma_buf_poll_add_cb(resv, true, dcb))
> > +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> > +			    !dma_buf_poll_add_cb(resv, true, dcb))
> >  				/* No callback queued, wake up any other waiters */
> 
> Don't think this is sane at all. I'm assuming you meant:
> 
> 	atomic_long_inc_not_zero(&dmabuf->file->f_count);

Oops, yes, sorry. I was typed from memory instead of copy/paste.

> but won't fly as you're not under RCU in the first place. And what
> protects it from being long gone before you attempt this anyway? This is
> sane way to attempt to fix it, it's completely opposite of what sane ref
> handling should look like.
> 
> Not sure what the best fix is here, seems like dma-buf should hold an
> actual reference to the file upfront rather than just stash a pointer
> and then later _hope_ that it can just grab a reference. That seems
> pretty horrible, and the real source of the issue.

AFAICT, epoll just doesn't hold any references at all. It depends,
I think, on eventpoll_release() (really eventpoll_release_file())
synchronizing with epoll_wait() (but I don't see how this happens, and
the race seems to be against ep_item_poll() ...?)

I'm really confused about how eventpoll manages the lifetime of polled
fds.

> > Due to this issue I've proposed fixing get_file() to detect pathological states:
> > https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/
> 
> I don't think this would catch this case, as the memory could just be
> garbage at this point.

It catches it just fine! :) I tested it against the published PoC.

And for cases where further allocations have progressed far enough to
corrupt the freed struct file and render the check pointless, nothing
different has happened than what happens today. At least now we have a
window to catch the situation across the time frame before it is both
reallocated _and_ the contents at the f_count offset gets changed to
non-zero.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 19:22       ` Kees Cook
@ 2024-05-03 19:35         ` Jens Axboe
  2024-05-03 19:59           ` Kees Cook
  0 siblings, 1 reply; 90+ messages in thread
From: Jens Axboe @ 2024-05-03 19:35 UTC (permalink / raw)
  To: Kees Cook
  Cc: Bui Quang Minh, Al Viro, Christian Brauner, syzbot, io-uring,
	jack, linux-fsdevel, linux-kernel, syzkaller-bugs, Sumit Semwal,
	Christian König, linux-media, dri-devel, linaro-mm-sig,
	Laura Abbott

On 5/3/24 1:22 PM, Kees Cook wrote:
> On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
>> On 5/3/24 12:26 PM, Kees Cook wrote:
>>> Thanks for doing this analysis! I suspect at least a start of a fix
>>> would be this:
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index 8fe5aa67b167..15e8f74ee0f2 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
>>>  
>>>  		if (events & EPOLLOUT) {
>>>  			/* Paired with fput in dma_buf_poll_cb */
>>> -			get_file(dmabuf->file);
>>> -
>>> -			if (!dma_buf_poll_add_cb(resv, true, dcb))
>>> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
>>> +			    !dma_buf_poll_add_cb(resv, true, dcb))
>>>  				/* No callback queued, wake up any other waiters */
>>
>> Don't think this is sane at all. I'm assuming you meant:
>>
>> 	atomic_long_inc_not_zero(&dmabuf->file->f_count);
> 
> Oops, yes, sorry. I was typed from memory instead of copy/paste.

Figured :-)

>> but won't fly as you're not under RCU in the first place. And what
>> protects it from being long gone before you attempt this anyway? This is
>> sane way to attempt to fix it, it's completely opposite of what sane ref
>> handling should look like.
>>
>> Not sure what the best fix is here, seems like dma-buf should hold an
>> actual reference to the file upfront rather than just stash a pointer
>> and then later _hope_ that it can just grab a reference. That seems
>> pretty horrible, and the real source of the issue.
> 
> AFAICT, epoll just doesn't hold any references at all. It depends,
> I think, on eventpoll_release() (really eventpoll_release_file())
> synchronizing with epoll_wait() (but I don't see how this happens, and
> the race seems to be against ep_item_poll() ...?)
>
> I'm really confused about how eventpoll manages the lifetime of polled
> fds.

epoll doesn't hold any references, and it's got some ugly callback to
deal with that. It's not ideal, nor pretty, but that's how it currently
works. See eventpoll_release() and how it's called. This means that
epoll itself is supposedly safe from the file going away, even though it
doesn't hold a reference to it.

Except that in this case, the file is already gone by the time
eventpoll_release() is called. Which presumably is some interaction with
the somewhat suspicious file reference management that dma-buf is doing.
But I didn't look into that much, outside of noting it looks a bit
suspect.

>>> Due to this issue I've proposed fixing get_file() to detect pathological states:
>>> https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/
>>
>> I don't think this would catch this case, as the memory could just be
>> garbage at this point.
> 
> It catches it just fine! :) I tested it against the published PoC.

Sure it _may_ catch the issue, but the memory may also just be garbage
at that point. Not saying it's a useless addition, outside of the usual
gripes of making the hot path slower, just that it won't catch all
cases. Which I guess is fine and expected.

> And for cases where further allocations have progressed far enough to
> corrupt the freed struct file and render the check pointless, nothing
> different has happened than what happens today. At least now we have a
> window to catch the situation across the time frame before it is both
> reallocated _and_ the contents at the f_count offset gets changed to
> non-zero.

Right.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 19:35         ` Jens Axboe
@ 2024-05-03 19:59           ` Kees Cook
  2024-05-03 20:28             ` Kees Cook
  2024-05-04  9:59             ` Christian Brauner
  0 siblings, 2 replies; 90+ messages in thread
From: Kees Cook @ 2024-05-03 19:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bui Quang Minh, Al Viro, Christian Brauner, syzbot, io-uring,
	jack, linux-fsdevel, linux-kernel, syzkaller-bugs, Sumit Semwal,
	Christian König, linux-media, dri-devel, linaro-mm-sig,
	Laura Abbott

On Fri, May 03, 2024 at 01:35:09PM -0600, Jens Axboe wrote:
> On 5/3/24 1:22 PM, Kees Cook wrote:
> > On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
> >> On 5/3/24 12:26 PM, Kees Cook wrote:
> >>> Thanks for doing this analysis! I suspect at least a start of a fix
> >>> would be this:
> >>>
> >>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> >>> index 8fe5aa67b167..15e8f74ee0f2 100644
> >>> --- a/drivers/dma-buf/dma-buf.c
> >>> +++ b/drivers/dma-buf/dma-buf.c
> >>> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
> >>>  
> >>>  		if (events & EPOLLOUT) {
> >>>  			/* Paired with fput in dma_buf_poll_cb */
> >>> -			get_file(dmabuf->file);
> >>> -
> >>> -			if (!dma_buf_poll_add_cb(resv, true, dcb))
> >>> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> >>> +			    !dma_buf_poll_add_cb(resv, true, dcb))
> >>>  				/* No callback queued, wake up any other waiters */
> >>
> >> Don't think this is sane at all. I'm assuming you meant:
> >>
> >> 	atomic_long_inc_not_zero(&dmabuf->file->f_count);
> > 
> > Oops, yes, sorry. I was typed from memory instead of copy/paste.
> 
> Figured :-)
> 
> >> but won't fly as you're not under RCU in the first place. And what
> >> protects it from being long gone before you attempt this anyway? This is
> >> sane way to attempt to fix it, it's completely opposite of what sane ref
> >> handling should look like.
> >>
> >> Not sure what the best fix is here, seems like dma-buf should hold an
> >> actual reference to the file upfront rather than just stash a pointer
> >> and then later _hope_ that it can just grab a reference. That seems
> >> pretty horrible, and the real source of the issue.
> > 
> > AFAICT, epoll just doesn't hold any references at all. It depends,
> > I think, on eventpoll_release() (really eventpoll_release_file())
> > synchronizing with epoll_wait() (but I don't see how this happens, and
> > the race seems to be against ep_item_poll() ...?)
> >
> > I'm really confused about how eventpoll manages the lifetime of polled
> > fds.
> 
> epoll doesn't hold any references, and it's got some ugly callback to
> deal with that. It's not ideal, nor pretty, but that's how it currently
> works. See eventpoll_release() and how it's called. This means that
> epoll itself is supposedly safe from the file going away, even though it
> doesn't hold a reference to it.

Right -- what remains unclear to me is how struct file lifetime is
expected to work in the struct file_operations::poll callbacks. Because
using get_file() there looks clearly unsafe...

> Except that in this case, the file is already gone by the time
> eventpoll_release() is called. Which presumably is some interaction with
> the somewhat suspicious file reference management that dma-buf is doing.
> But I didn't look into that much, outside of noting it looks a bit
> suspect.

Not yet, though. Here's (one) race state from the analysis. I added lines
for the dma_fence_add_callback()/dma_buf_poll_cb() case, since that's
the case that would escape any eventpoll_release/epoll_wait
synchronization (if it exists?):

close(dmabuf->file)
__fput_sync (f_count == 1, last ref)
f_count-- (f_count == 0 now)
__fput
                                     epoll_wait
                                     vfs_poll(dmabuf->file)
                                     get_file(dmabuf->file)(f_count == 1)
                                     dma_fence_add_callback()
eventpoll_release
dmabuf->file deallocation
                                     dma_buf_poll_cb()
                                     fput(dmabuf->file) (f_count == 1)
                                     f_count--
                                     dmabuf->file deallocation

Without fences to create a background callback, we just do a double-free:

close(dmabuf->file)
__fput_sync (f_count == 1, last ref)
f_count-- (f_count == 0 now)
__fput
                                     epoll_wait
                                     vfs_poll(dmabuf->file)
                                     get_file(dmabuf->file)(f_count == 1)
                                     dma_buf_poll_cb()
                                     fput(dmabuf->file) (f_count == 1)
                                     f_count--
                                     eventpoll_release
                                     dmabuf->file deallocation
eventpoll_release
dmabuf->file deallocation


get_file(), via epoll_wait()->vfs_poll()->dma_buf_poll(), has raised
f_count again. Then eventpoll_release() is doing things to remove
dmabuf->file from the eventpoll lists, but I *think* this is synchronized
so that an epoll_wait() will only call .poll handlers with a valid
(though possibly f_count==0) file, but I can't figure out where that
happens. (If it's not happening, we have a much bigger problem, but I
imagine we'd see massive corruption all the time, which we don't.)

So, yeah, I can't figure out how eventpoll_release() and epoll_wait()
are expected to behave safely for .poll handlers.

Regardless, for the simple case: it seems like it's just totally illegal
to use get_file() in a poll handler. Is this known/expected? And if so,
how can dmabuf possibly deal with that?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 19:59           ` Kees Cook
@ 2024-05-03 20:28             ` Kees Cook
  2024-05-03 21:11               ` Al Viro
  2024-05-04  9:59             ` Christian Brauner
  1 sibling, 1 reply; 90+ messages in thread
From: Kees Cook @ 2024-05-03 20:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bui Quang Minh, Al Viro, Christian Brauner, syzbot, io-uring,
	jack, linux-fsdevel, linux-kernel, syzkaller-bugs, Sumit Semwal,
	Christian König, linux-media, dri-devel, linaro-mm-sig,
	Laura Abbott, Linus Torvalds

On Fri, May 03, 2024 at 12:59:52PM -0700, Kees Cook wrote:
> So, yeah, I can't figure out how eventpoll_release() and epoll_wait()
> are expected to behave safely for .poll handlers.
> 
> Regardless, for the simple case: it seems like it's just totally illegal
> to use get_file() in a poll handler. Is this known/expected? And if so,
> how can dmabuf possibly deal with that?

Is this the right approach? It still feels to me like get_file() needs
to happen much earlier...

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 882b89edc52a..c6c29facf228 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -991,9 +991,13 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 	__poll_t res;
 
 	pt->_key = epi->event.events;
-	if (!is_file_epoll(file))
-		res = vfs_poll(file, pt);
-	else
+	if (!is_file_epoll(file)) {
+		if (atomic_long_inc_not_zero(&file->f_count)) {
+			res = vfs_poll(file, pt);
+			fput(file);
+		} else
+			res = EPOLLERR;
+	} else
 		res = __ep_eventpoll_poll(file, pt, depth);
 	return res & epi->event.events;
 }

-- 
Kees Cook

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 20:28             ` Kees Cook
@ 2024-05-03 21:11               ` Al Viro
  2024-05-03 21:24                 ` Linus Torvalds
  2024-05-03 21:36                 ` Al Viro
  0 siblings, 2 replies; 90+ messages in thread
From: Al Viro @ 2024-05-03 21:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: Jens Axboe, Bui Quang Minh, Christian Brauner, syzbot, io-uring,
	jack, linux-fsdevel, linux-kernel, syzkaller-bugs, Sumit Semwal,
	Christian König, linux-media, dri-devel, linaro-mm-sig,
	Laura Abbott, Linus Torvalds

On Fri, May 03, 2024 at 01:28:37PM -0700, Kees Cook wrote:
> 
> Is this the right approach? It still feels to me like get_file() needs
> to happen much earlier...

I don't believe it needs to happen at all.  The problem is not that
->release() can be called during ->poll() - it can't and it doesn't.
It's that this instance of ->poll() is trying to extend the lifetime
of that struct file, when it might very well be past the point of no
return.

What we need is
	* promise that ep_item_poll() won't happen after eventpoll_release_file().
AFAICS, we do have that.
	* ->poll() not playing silly buggers.

As it is, dma_buf ->poll() is very suspicious regardless of that
mess - it can grab reference to file for unspecified interval.
Have that happen shortly before reboot and you are asking for failing
umount.

->poll() must be refcount-neutral wrt file passed to it.  I'm seriously
tempted to make ->poll() take const struct file * and see if there's
anything else that would fall out.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 18:26   ` get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove) Kees Cook
  2024-05-03 18:49     ` Jens Axboe
@ 2024-05-03 21:11     ` Linus Torvalds
  2024-05-03 21:24       ` Al Viro
  2024-05-05 17:31       ` Jens Axboe
  2024-05-04  9:45     ` get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove) Christian Brauner
  2 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-03 21:11 UTC (permalink / raw)
  To: keescook
  Cc: axboe, brauner, christian.koenig, dri-devel, io-uring, jack,
	laura, linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs, viro, Linus Torvalds

epoll is a mess, and does various invalid things in the name of
performance.

Let's try to rein it in a bit. Something like this, perhaps?

Not-yet-signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This is entirely untested, thus the "Not-yet-signed-off-by".  But I
think this may be kind of the right path forward. 

I suspect the ->poll() call is the main case that matters, but there are
other places where eventpoll just looks up the file pointer without then
being very careful about it.  The sock_from_file(epi->ffd.file) uses in
particular should probably also use this to look up the file. 

Comments?

 fs/eventpoll.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 882b89edc52a..bffa8083ff36 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -285,6 +285,30 @@ static inline void free_ephead(struct epitems_head *head)
 		kmem_cache_free(ephead_cache, head);
 }
 
+/*
+ * The ffd.file pointer may be in the process of
+ * being torn down due to being closed, but we
+ * may not have finished eventpoll_release() yet.
+ *
+ * Technically, even with the atomic_long_inc_not_zero,
+ * the file may have been free'd and then gotten
+ * re-allocated to something else (since files are
+ * not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU).
+ *
+ * But for epoll, we don't much care.
+ */
+static struct file *epi_fget(const struct epitem *epi)
+{
+	struct file *file;
+
+	rcu_read_lock();
+	file = epi->ffd.file;
+	if (!atomic_long_inc_not_zero(&file->f_count))
+		file = NULL;
+	rcu_read_unlock();
+	return file;
+}
+
 static void list_file(struct file *file)
 {
 	struct epitems_head *head;
@@ -987,14 +1011,18 @@ static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int dep
 static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 				 int depth)
 {
-	struct file *file = epi->ffd.file;
+	struct file *file = epi_fget(epi);
 	__poll_t res;
 
+	if (!file)
+		return 0;
+
 	pt->_key = epi->event.events;
 	if (!is_file_epoll(file))
 		res = vfs_poll(file, pt);
 	else
 		res = __ep_eventpoll_poll(file, pt, depth);
+	fput(file);
 	return res & epi->event.events;
 }
 
-- 
2.44.0.330.g4d18c88175


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 21:11     ` [PATCH] epoll: try to be a _bit_ better about file lifetimes Linus Torvalds
@ 2024-05-03 21:24       ` Al Viro
  2024-05-03 21:33         ` Linus Torvalds
  2024-05-04  9:25         ` Hillf Danton
  2024-05-05 17:31       ` Jens Axboe
  1 sibling, 2 replies; 90+ messages in thread
From: Al Viro @ 2024-05-03 21:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: keescook, axboe, brauner, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, May 03, 2024 at 02:11:30PM -0700, Linus Torvalds wrote:
> epoll is a mess, and does various invalid things in the name of
> performance.
> 
> Let's try to rein it in a bit. Something like this, perhaps?

> +/*
> + * The ffd.file pointer may be in the process of
> + * being torn down due to being closed, but we
> + * may not have finished eventpoll_release() yet.
> + *
> + * Technically, even with the atomic_long_inc_not_zero,
> + * the file may have been free'd and then gotten
> + * re-allocated to something else (since files are
> + * not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU).

Can we get to ep_item_poll(epi, ...) after eventpoll_release_file()
got past __ep_remove()?  Because if we can, we have a worse problem -
epi freed under us.

If not, we couldn't possibly have reached ->release() yet, let
alone freeing anything.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 21:11               ` Al Viro
@ 2024-05-03 21:24                 ` Linus Torvalds
  2024-05-03 21:30                   ` Al Viro
  2024-05-06 17:46                   ` Stefan Metzmacher
  2024-05-03 21:36                 ` Al Viro
  1 sibling, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-03 21:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Kees Cook, Jens Axboe, Bui Quang Minh, Christian Brauner, syzbot,
	io-uring, jack, linux-fsdevel, linux-kernel, syzkaller-bugs,
	Sumit Semwal, Christian König, linux-media, dri-devel,
	linaro-mm-sig, Laura Abbott

On Fri, 3 May 2024 at 14:11, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What we need is
>         * promise that ep_item_poll() won't happen after eventpoll_release_file().
> AFAICS, we do have that.
>         * ->poll() not playing silly buggers.

No. That is not enough at all.

Because even with perfectly normal "->poll()", and even with the
ep_item_poll() happening *before* eventpoll_release_file(), you have
this trivial race:

  ep_item_poll()
     ->poll()

and *between* those two operations, another CPU does "close()", and
that causes eventpoll_release_file() to be called, and now f_count
goes down to zero while ->poll() is running.

So you do need to increment the file count around the ->poll() call, I feel.

Or, alternatively, you'd need to serialize with
eventpoll_release_file(), but that would need to be some sleeping lock
held over the ->poll() call.

> As it is, dma_buf ->poll() is very suspicious regardless of that
> mess - it can grab reference to file for unspecified interval.

I think that's actually much preferable to what epoll does, which is
to keep using files without having reference counts to them (and then
relying on magically not racing with eventpoll_release_file().

                Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 21:24                 ` Linus Torvalds
@ 2024-05-03 21:30                   ` Al Viro
  2024-05-06 17:46                   ` Stefan Metzmacher
  1 sibling, 0 replies; 90+ messages in thread
From: Al Viro @ 2024-05-03 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Jens Axboe, Bui Quang Minh, Christian Brauner, syzbot,
	io-uring, jack, linux-fsdevel, linux-kernel, syzkaller-bugs,
	Sumit Semwal, Christian König, linux-media, dri-devel,
	linaro-mm-sig, Laura Abbott

On Fri, May 03, 2024 at 02:24:45PM -0700, Linus Torvalds wrote:
> Because even with perfectly normal "->poll()", and even with the
> ep_item_poll() happening *before* eventpoll_release_file(), you have
> this trivial race:
> 
>   ep_item_poll()
>      ->poll()
> 
> and *between* those two operations, another CPU does "close()", and
> that causes eventpoll_release_file() to be called, and now f_count
> goes down to zero while ->poll() is running.
> 
> So you do need to increment the file count around the ->poll() call, I feel.
> 
> Or, alternatively, you'd need to serialize with
> eventpoll_release_file(), but that would need to be some sleeping lock
> held over the ->poll() call.
> 
> > As it is, dma_buf ->poll() is very suspicious regardless of that
> > mess - it can grab reference to file for unspecified interval.
> 
> I think that's actually much preferable to what epoll does, which is
> to keep using files without having reference counts to them (and then
> relying on magically not racing with eventpoll_release_file().

eventpoll_release_file() calling __ep_remove() while ep_item_poll()
is something we need to avoid anyway - having epi freed under
ep_item_poll() would be a problem regardless of struct file
lifetime issues.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 21:24       ` Al Viro
@ 2024-05-03 21:33         ` Linus Torvalds
  2024-05-03 21:45           ` Al Viro
  2024-05-04  9:37           ` Christian Brauner
  2024-05-04  9:25         ` Hillf Danton
  1 sibling, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-03 21:33 UTC (permalink / raw)
  To: Al Viro
  Cc: keescook, axboe, brauner, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, 3 May 2024 at 14:24, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Can we get to ep_item_poll(epi, ...) after eventpoll_release_file()
> got past __ep_remove()?  Because if we can, we have a worse problem -
> epi freed under us.

Look at the hack in __ep_remove(): if it is concurrent with
eventpoll_release_file(), it will hit this code

        spin_lock(&file->f_lock);
        if (epi->dying && !force) {
                spin_unlock(&file->f_lock);
                return false;
        }

and not free the epi.

But as far as I can tell, almost nothing else cares about the f_lock
and dying logic.

And in fact, I don't think doing

        spin_lock(&file->f_lock);

is even valid in the places that look up file through "epi->ffd.file",
because the lock itself is inside the thing that you can't trust until
you've taken the lock...

So I agree with Kees about the use of "atomic_dec_not_zero()" kind of
logic - but it also needs to be in an RCU-readlocked region, I think.

I wish epoll() just took the damn file ref itself. But since it relies
on the file refcount to release the data structure, that obviously
can't work.

                Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 21:11               ` Al Viro
  2024-05-03 21:24                 ` Linus Torvalds
@ 2024-05-03 21:36                 ` Al Viro
  2024-05-03 21:42                   ` Linus Torvalds
  1 sibling, 1 reply; 90+ messages in thread
From: Al Viro @ 2024-05-03 21:36 UTC (permalink / raw)
  To: Kees Cook
  Cc: Jens Axboe, Bui Quang Minh, Christian Brauner, syzbot, io-uring,
	jack, linux-fsdevel, linux-kernel, syzkaller-bugs, Sumit Semwal,
	Christian König, linux-media, dri-devel, linaro-mm-sig,
	Laura Abbott, Linus Torvalds

On Fri, May 03, 2024 at 10:11:09PM +0100, Al Viro wrote:
> On Fri, May 03, 2024 at 01:28:37PM -0700, Kees Cook wrote:
> > 
> > Is this the right approach? It still feels to me like get_file() needs
> > to happen much earlier...
> 
> I don't believe it needs to happen at all.  The problem is not that
> ->release() can be called during ->poll() - it can't and it doesn't.
> It's that this instance of ->poll() is trying to extend the lifetime
> of that struct file, when it might very well be past the point of no
> return.
> 
> What we need is
> 	* promise that ep_item_poll() won't happen after eventpoll_release_file().
> AFAICS, we do have that.
> 	* ->poll() not playing silly buggers.
> 
> As it is, dma_buf ->poll() is very suspicious regardless of that
> mess - it can grab reference to file for unspecified interval.
> Have that happen shortly before reboot and you are asking for failing
> umount.
> 
> ->poll() must be refcount-neutral wrt file passed to it.  I'm seriously
> tempted to make ->poll() take const struct file * and see if there's
> anything else that would fall out.

... the last part is no-go - poll_wait() must be able to grab a reference
(well, the callback in it must)

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 21:36                 ` Al Viro
@ 2024-05-03 21:42                   ` Linus Torvalds
  2024-05-03 21:53                     ` Al Viro
  0 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2024-05-03 21:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Kees Cook, Jens Axboe, Bui Quang Minh, Christian Brauner, syzbot,
	io-uring, jack, linux-fsdevel, linux-kernel, syzkaller-bugs,
	Sumit Semwal, Christian König, linux-media, dri-devel,
	linaro-mm-sig, Laura Abbott

On Fri, 3 May 2024 at 14:36, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ... the last part is no-go - poll_wait() must be able to grab a reference
> (well, the callback in it must)

Yeah. I really think that *poll* itself is doing everything right. It
knows that it's called with a file pointer with a reference, and it
adds its own references as needed.

And I think that's all fine - both for dmabuf in particular, but for
poll in general. That's how things are *supposed* to work. You can
keep references to other things in your 'struct file *', knowing that
files are properly refcounted, and won't go away while you are dealing
with them.

The problem, of course, is that then epoll violates that "called with
reference" part.  epoll very much by design does *not* take references
to the files it keeps track of, and then tears them down at close()
time.

Now, epoll has its reasons for doing that. They are even good reasons.
But that does mean that when epoll needs to deal with that hackery.

I wish we could remove epoll entirely, but that isn't an option, so we
need to just make sure that when it accesses the ffd.file pointer, it
does so more carefully.

              Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 21:33         ` Linus Torvalds
@ 2024-05-03 21:45           ` Al Viro
  2024-05-03 21:52             ` Linus Torvalds
  2024-05-04  9:37           ` Christian Brauner
  1 sibling, 1 reply; 90+ messages in thread
From: Al Viro @ 2024-05-03 21:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: keescook, axboe, brauner, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, May 03, 2024 at 02:33:37PM -0700, Linus Torvalds wrote:

> Look at the hack in __ep_remove(): if it is concurrent with
> eventpoll_release_file(), it will hit this code
> 
>         spin_lock(&file->f_lock);
>         if (epi->dying && !force) {
>                 spin_unlock(&file->f_lock);
>                 return false;
>         }
> 
> and not free the epi.

What does that have to do with ep_item_poll()?

eventpoll_release_file() itself calls __ep_remove().  Have that
happen while ep_item_poll() is running in another thread and
you've got a problem.

AFAICS, exclusion is on ep->mtx.  Callers of ep_item_poll() are
* __ep_eventpoll_poll() - grabs ->mtx
* ep_insert() - called under ->mtx
* ep_modify() - calls are under ->mtx
* ep_send_events() - grabs ->mtx

and eventpoll_release_file() grabs ->mtx around __ep_remove().

How do you get through eventpoll_release_file() while someone
has entered ep_item_poll()?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 21:45           ` Al Viro
@ 2024-05-03 21:52             ` Linus Torvalds
  2024-05-03 22:01               ` Al Viro
  2024-05-03 22:46               ` Kees Cook
  0 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-03 21:52 UTC (permalink / raw)
  To: Al Viro
  Cc: keescook, axboe, brauner, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, 3 May 2024 at 14:45, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> How do you get through eventpoll_release_file() while someone
> has entered ep_item_poll()?

Doesn't matter.

Look, it's enough that the file count has gone down to zero. You may
not even have gotten to eventpoll_release_file() yet - the important
part is that you're on your *way* to it.

That means that the file will be released - and it means that you have
violated all the refcounting rules for poll().

So I think you're barking up the wrong tree.

                  Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 21:42                   ` Linus Torvalds
@ 2024-05-03 21:53                     ` Al Viro
  2024-05-06 12:23                       ` Daniel Vetter
  0 siblings, 1 reply; 90+ messages in thread
From: Al Viro @ 2024-05-03 21:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Jens Axboe, Bui Quang Minh, Christian Brauner, syzbot,
	io-uring, jack, linux-fsdevel, linux-kernel, syzkaller-bugs,
	Sumit Semwal, Christian König, linux-media, dri-devel,
	linaro-mm-sig, Laura Abbott

On Fri, May 03, 2024 at 02:42:22PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 14:36, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > ... the last part is no-go - poll_wait() must be able to grab a reference
> > (well, the callback in it must)
> 
> Yeah. I really think that *poll* itself is doing everything right. It
> knows that it's called with a file pointer with a reference, and it
> adds its own references as needed.

Not really.  Note that select's __pollwait() does *NOT* leave a reference
at the mercy of driver - it's stuck into poll_table_entry->filp and
the poll_freewait() knows how to take those out.


dmabuf does something very different - it grabs the damn thing into
its private data structures and for all we know it could keep it for
a few hours, until some even materializes.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 21:52             ` Linus Torvalds
@ 2024-05-03 22:01               ` Al Viro
  2024-05-03 22:07                 ` Al Viro
  2024-05-03 22:46               ` Kees Cook
  1 sibling, 1 reply; 90+ messages in thread
From: Al Viro @ 2024-05-03 22:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: keescook, axboe, brauner, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 14:45, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > How do you get through eventpoll_release_file() while someone
> > has entered ep_item_poll()?
> 
> Doesn't matter.
> 
> Look, it's enough that the file count has gone down to zero. You may
> not even have gotten to eventpoll_release_file() yet - the important
> part is that you're on your *way* to it.
> 
> That means that the file will be released - and it means that you have
> violated all the refcounting rules for poll().
> 
> So I think you're barking up the wrong tree.

IMO there are several things in that mess (aside of epoll being what it is).

Trying to grab refcount as you do is fine; the comment is seriously
misleading, though - we *are* guaranteed that struct file hasn't hit ->release(),
let alone getting freed and reused.

Having pollwait callback grab references is fine - and the callback belongs
to whoever's calling ->poll().

Having ->poll() instance itself grab reference is really asking for problem,
even on the boxen that have CONFIG_EPOLL turned off.  select(2) is enough;
it will take care of references grabbed by __pollwait(), but it doesn't
know anything about the references driver has stashed hell knows where for
hell knows how long.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 22:01               ` Al Viro
@ 2024-05-03 22:07                 ` Al Viro
  2024-05-03 23:16                   ` Linus Torvalds
  0 siblings, 1 reply; 90+ messages in thread
From: Al Viro @ 2024-05-03 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: keescook, axboe, brauner, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, May 03, 2024 at 11:01:45PM +0100, Al Viro wrote:

> Having ->poll() instance itself grab reference is really asking for problem,
> even on the boxen that have CONFIG_EPOLL turned off.  select(2) is enough;
> it will take care of references grabbed by __pollwait(), but it doesn't
> know anything about the references driver has stashed hell knows where for
> hell knows how long.

Suppose your program calls select() on a pipe and dmabuf, sees data to be read
from pipe, reads it, closes both pipe and dmabuf and exits.

Would you expect that dmabuf file would stick around for hell knows how long
after that?  I would certainly be very surprised by running into that...

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 21:52             ` Linus Torvalds
  2024-05-03 22:01               ` Al Viro
@ 2024-05-03 22:46               ` Kees Cook
  2024-05-03 23:03                 ` Al Viro
  1 sibling, 1 reply; 90+ messages in thread
From: Kees Cook @ 2024-05-03 22:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, axboe, brauner, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote:
> That means that the file will be released - and it means that you have
> violated all the refcounting rules for poll().

I feel like I've been looking at this too long. I think I see another
problem here, but with dmabuf even when epoll is fixed:

dma_buf_poll()
	get_file(dmabuf->file)		/* f_count + 1 */
	dma_buf_poll_add_cb()
		dma_resv_for_each_fence ...
			dma_fence_add_callback(fence, ..., dma_buf_poll_cb)

dma_buf_poll_cb()
	...
        fput(dmabuf->file);		/* f_count - 1 ... for each fence */

Isn't it possible to call dma_buf_poll_cb() (and therefore fput())
multiple times if there is more than 1 fence? Perhaps I've missed a
place where a single struct dma_resv will only ever signal 1 fence? But
looking through dma_fence_signal_timestamp_locked(), I don't see
anything about resv nor somehow looking into other fence cb_list
contents...

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 22:46               ` Kees Cook
@ 2024-05-03 23:03                 ` Al Viro
  2024-05-03 23:23                   ` Kees Cook
  0 siblings, 1 reply; 90+ messages in thread
From: Al Viro @ 2024-05-03 23:03 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linus Torvalds, axboe, brauner, christian.koenig, dri-devel,
	io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, May 03, 2024 at 03:46:25PM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote:
> > That means that the file will be released - and it means that you have
> > violated all the refcounting rules for poll().
> 
> I feel like I've been looking at this too long. I think I see another
> problem here, but with dmabuf even when epoll is fixed:
> 
> dma_buf_poll()
> 	get_file(dmabuf->file)		/* f_count + 1 */
> 	dma_buf_poll_add_cb()
> 		dma_resv_for_each_fence ...
> 			dma_fence_add_callback(fence, ..., dma_buf_poll_cb)
> 
> dma_buf_poll_cb()
> 	...
>         fput(dmabuf->file);		/* f_count - 1 ... for each fence */
> 
> Isn't it possible to call dma_buf_poll_cb() (and therefore fput())
> multiple times if there is more than 1 fence? Perhaps I've missed a
> place where a single struct dma_resv will only ever signal 1 fence? But
> looking through dma_fence_signal_timestamp_locked(), I don't see
> anything about resv nor somehow looking into other fence cb_list
> contents...

At a guess,
                r = dma_fence_add_callback(fence, &dcb->cb, dma_buf_poll_cb);
		if (!r)
			return true;

prevents that - it returns 0 on success and -E... on error;
insertion into the list happens only when it's returning 0,
so...

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 22:07                 ` Al Viro
@ 2024-05-03 23:16                   ` Linus Torvalds
  2024-05-03 23:39                     ` Al Viro
  0 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2024-05-03 23:16 UTC (permalink / raw)
  To: Al Viro
  Cc: keescook, axboe, brauner, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, 3 May 2024 at 15:07, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Suppose your program calls select() on a pipe and dmabuf, sees data to be read
> from pipe, reads it, closes both pipe and dmabuf and exits.
>
> Would you expect that dmabuf file would stick around for hell knows how long
> after that?  I would certainly be very surprised by running into that...

Why?

That's the _point_ of refcounts. They make the thing they refcount
stay around until it's no longer referenced.

Now, I agree that dmabuf's are a bit odd in how they use a 'struct
file' *as* their refcount, but hey, it's a specialty use. Unusual
perhaps, but not exactly wrong.

I suspect that if you saw a dmabuf just have its own 'refcount_t' and
stay around until it was done, you wouldn't bat an eye at it, and it's
really just the "it uses a struct file for counting" that you are
reacting to.

                Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 23:03                 ` Al Viro
@ 2024-05-03 23:23                   ` Kees Cook
  2024-05-03 23:41                     ` Linus Torvalds
  0 siblings, 1 reply; 90+ messages in thread
From: Kees Cook @ 2024-05-03 23:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, axboe, brauner, christian.koenig, dri-devel,
	io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sat, May 04, 2024 at 12:03:18AM +0100, Al Viro wrote:
> On Fri, May 03, 2024 at 03:46:25PM -0700, Kees Cook wrote:
> > On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote:
> > > That means that the file will be released - and it means that you have
> > > violated all the refcounting rules for poll().
> > 
> > I feel like I've been looking at this too long. I think I see another
> > problem here, but with dmabuf even when epoll is fixed:
> > 
> > dma_buf_poll()
> > 	get_file(dmabuf->file)		/* f_count + 1 */
> > 	dma_buf_poll_add_cb()
> > 		dma_resv_for_each_fence ...
> > 			dma_fence_add_callback(fence, ..., dma_buf_poll_cb)
> > 
> > dma_buf_poll_cb()
> > 	...
> >         fput(dmabuf->file);		/* f_count - 1 ... for each fence */
> > 
> > Isn't it possible to call dma_buf_poll_cb() (and therefore fput())
> > multiple times if there is more than 1 fence? Perhaps I've missed a
> > place where a single struct dma_resv will only ever signal 1 fence? But
> > looking through dma_fence_signal_timestamp_locked(), I don't see
> > anything about resv nor somehow looking into other fence cb_list
> > contents...
> 
> At a guess,
>                 r = dma_fence_add_callback(fence, &dcb->cb, dma_buf_poll_cb);
> 		if (!r)
> 			return true;
> 
> prevents that - it returns 0 on success and -E... on error;
> insertion into the list happens only when it's returning 0,
> so...

Yes; thank you. I *have* been looking at it all too long. :)


The last related thing is the drivers/gpu/drm/vmwgfx/ttm_object.c case:

/**
 * get_dma_buf_unless_doomed - get a dma_buf reference if possible.
 *
 * @dmabuf: Non-refcounted pointer to a struct dma-buf.
 *
 * Obtain a file reference from a lookup structure that doesn't refcount
 * the file, but synchronizes with its release method to make sure it
 * has
 * not been freed yet. See for example kref_get_unless_zero
 * documentation.
 * Returns true if refcounting succeeds, false otherwise.
 *
 * Nobody really wants this as a public API yet, so let it mature here
 * for some time...
 */
static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
{
        return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
}

If we end up adding epi_fget(), we'll have 2 cases of using
"atomic_long_inc_not_zero" for f_count. Do we need some kind of blessed
helper to live in file.h or something, with appropriate comments?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 23:16                   ` Linus Torvalds
@ 2024-05-03 23:39                     ` Al Viro
  2024-05-03 23:54                       ` Linus Torvalds
  2024-05-04 10:44                       ` Christian Brauner
  0 siblings, 2 replies; 90+ messages in thread
From: Al Viro @ 2024-05-03 23:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: keescook, axboe, brauner, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, May 03, 2024 at 04:16:15PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 15:07, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Suppose your program calls select() on a pipe and dmabuf, sees data to be read
> > from pipe, reads it, closes both pipe and dmabuf and exits.
> >
> > Would you expect that dmabuf file would stick around for hell knows how long
> > after that?  I would certainly be very surprised by running into that...
> 
> Why?
> 
> That's the _point_ of refcounts. They make the thing they refcount
> stay around until it's no longer referenced.
> 
> Now, I agree that dmabuf's are a bit odd in how they use a 'struct
> file' *as* their refcount, but hey, it's a specialty use. Unusual
> perhaps, but not exactly wrong.
> 
> I suspect that if you saw a dmabuf just have its own 'refcount_t' and
> stay around until it was done, you wouldn't bat an eye at it, and it's
> really just the "it uses a struct file for counting" that you are
> reacting to.

*IF* those files are on purely internal filesystem, that's probably
OK; do that with something on something mountable (char device,
sysfs file, etc.) and you have a problem with filesystem staying
busy.

I'm really unfamiliar with the subsystem; it might be OK with all
objects that use that for ->poll(), but that's definitely not a good
thing to see in ->poll() instance in general.  And code gets copied,
so there really should be a big fat comment about the reasons why
it's OK in this particular case.

Said that, it seems that a better approach might be to have
their ->release() cancel callbacks and drop fence references.
Note that they *do* have refcounts - on fences.  The file
(well, dmabuf, really) is pinned only to protect against the
situation when pending callback is still around.  And Kees'
observation about multiple fences is also interesting - we don't
get extra fput(), but only because we get events only from one
fence, which does look fishy...

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 23:23                   ` Kees Cook
@ 2024-05-03 23:41                     ` Linus Torvalds
  2024-05-04  9:19                       ` Christian Brauner
  2024-05-06 12:37                       ` Daniel Vetter
  0 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-03 23:41 UTC (permalink / raw)
  To: Kees Cook
  Cc: Al Viro, axboe, brauner, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, 3 May 2024 at 16:23, Kees Cook <keescook@chromium.org> wrote:
>
> static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
> {
>         return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
> }
>
> If we end up adding epi_fget(), we'll have 2 cases of using
> "atomic_long_inc_not_zero" for f_count. Do we need some kind of blessed
> helper to live in file.h or something, with appropriate comments?

I wonder if we could try to abstract this out a bit more.

These games with non-ref-counted file structures *feel* a bit like the
games we play with non-ref-counted (aka "stashed") 'struct dentry'
that got fairly recently cleaned up with path_from_stashed() when both
nsfs and pidfs started doing the same thing.

I'm not loving the TTM use of this thing, but at least the locking and
logic feels a lot more straightforward (ie the
atomic_long_inc_not_zero() here is clealy under the 'prime->mutex'
lock

IOW, the tty use looks correct to me, and it has fairly simple locking
and is just catching the the race between 'fput()' decrementing the
refcount and and 'file->f_op->release()' doing the actual release.

You are right that it's similar to the epoll thing in that sense, it
just looks a _lot_ more straightforward to me (and, unlike epoll,
doesn't look actively buggy right now).

Could we abstract out this kind of "stashed file pointer" so that we'd
have a *common* form for this? Not just the inc_not_zero part, but the
locking rule too?

              Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 23:39                     ` Al Viro
@ 2024-05-03 23:54                       ` Linus Torvalds
  2024-05-04 10:44                       ` Christian Brauner
  1 sibling, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-03 23:54 UTC (permalink / raw)
  To: Al Viro
  Cc: keescook, axboe, brauner, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, 3 May 2024 at 16:39, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> *IF* those files are on purely internal filesystem, that's probably
> OK; do that with something on something mountable (char device,
> sysfs file, etc.) and you have a problem with filesystem staying
> busy.

Yeah, I agree, it's a bit annoying in general. That said, it's easy to
do: stash a file descriptor in a unix domain socket, and that's
basically exactly what you have: a random reference to a 'struct file'
that will stay around for as long as you just keep that socket around,
long after the "real" file descriptor has been closed, and entirely
separately from it.

And yes, that's exactly why unix domain socket transfers have caused
so many problems over the years, with both refcount overflows and
nasty garbage collection issues.

So randomly taking references to file descriptors certainly isn't new.

In fact, it's so common that I find the epoll pattern annoying, in
that it does something special and *not* taking a ref - and it does
that special thing to *other* ("innocent") file descriptors. Yes,
dma-buf is a bit like those unix domain sockets in that it can keep
random references alive for random times, but at least it does it just
to its own file descriptors, not random other targets.

So the dmabuf thing is very much a "I'm a special file that describes
a dma buffer", and shouldn't really affect anything outside of active
dmabuf uses (which admittedly is a large portion of the GPU drivers,
and has been expanding from there...). I

So the reason I'm annoyed at epoll in this case is that I think epoll
triggered the bug in some entirely innocent subsystem. dma-buf is
doing something differently odd, yes, but at least it's odd in a "I'm
a specialized thing" sense, not in some "I screw over others" sense.

             Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove
  2024-04-08  8:26 [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove syzbot
  2024-04-15 14:31 ` Jens Axboe
  2024-05-03 11:54 ` Bui Quang Minh
@ 2024-05-04  3:23 ` Hillf Danton
  2024-05-04  3:46   ` [syzbot] [fs] " syzbot
  2 siblings, 1 reply; 90+ messages in thread
From: Hillf Danton @ 2024-05-04  3:23 UTC (permalink / raw)
  To: syzbot; +Cc: linux-kernel, syzkaller-bugs

On Mon, 08 Apr 2024 01:26:16 -0700
> syzbot found the following issue on:
> 
> HEAD commit:    480e035fc4c7 Merge tag 'drm-next-2024-03-13' of https://gi..
> git tree:       upstream
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10801175180000

#syz test https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git  480e035fc4c7

--- x/fs/eventpoll.c
+++ y/fs/eventpoll.c
@@ -991,9 +991,13 @@ static __poll_t ep_item_poll(const struc
 	__poll_t res;
 
 	pt->_key = epi->event.events;
-	if (!is_file_epoll(file))
-		res = vfs_poll(file, pt);
-	else
+	if (!is_file_epoll(file)) {
+		res = 0;
+		if (atomic_long_inc_not_zero(&file->f_count)) {
+			res = vfs_poll(file, pt);
+			fput(file);
+		}
+	} else
 		res = __ep_eventpoll_poll(file, pt, depth);
 	return res & epi->event.events;
 }
--

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [syzbot] [fs] general protection fault in __ep_remove
  2024-05-04  3:23 ` [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove Hillf Danton
@ 2024-05-04  3:46   ` syzbot
  0 siblings, 0 replies; 90+ messages in thread
From: syzbot @ 2024-05-04  3:46 UTC (permalink / raw)
  To: hdanton, linux-kernel, syzkaller-bugs

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-and-tested-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com

Tested on:

commit:         480e035f Merge tag 'drm-next-2024-03-13' of https://gi..
git tree:       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
console output: https://syzkaller.appspot.com/x/log.txt?x=137a0b54980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=1e5b814e91787669
dashboard link: https://syzkaller.appspot.com/bug?extid=045b454ab35fd82a35fb
compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
patch:          https://syzkaller.appspot.com/x/patch.diff?x=1046b2a0980000

Note: testing is done by a robot and is best-effort only.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 23:41                     ` Linus Torvalds
@ 2024-05-04  9:19                       ` Christian Brauner
  2024-05-06 12:37                       ` Daniel Vetter
  1 sibling, 0 replies; 90+ messages in thread
From: Christian Brauner @ 2024-05-04  9:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Al Viro, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, May 03, 2024 at 04:41:19PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 16:23, Kees Cook <keescook@chromium.org> wrote:
> >
> > static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
> > {
> >         return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
> > }
> >
> > If we end up adding epi_fget(), we'll have 2 cases of using
> > "atomic_long_inc_not_zero" for f_count. Do we need some kind of blessed
> > helper to live in file.h or something, with appropriate comments?
> 
> I wonder if we could try to abstract this out a bit more.
> 
> These games with non-ref-counted file structures *feel* a bit like the
> games we play with non-ref-counted (aka "stashed") 'struct dentry'
> that got fairly recently cleaned up with path_from_stashed() when both
> nsfs and pidfs started doing the same thing.
> 
> I'm not loving the TTM use of this thing, but at least the locking and
> logic feels a lot more straightforward (ie the
> atomic_long_inc_not_zero() here is clealy under the 'prime->mutex'
> lock

The TTM stuff is somewhat wild though and I've commented on that in
https://lore.kernel.org/r/20240503-mitmachen-redakteur-2707ab0cacc3@brauner
another thread that it can just use get_active_file().

Afaict, there's dma_buf_export() that allocates a new file and sets:

file->private_data = dmabuf;
dmabuf->file = file;
dentry->d_fsdata = dmabuf;

The file has f_op->release::dma_buf_file_release() as it's f_op->release
method. When that's called the file's refcount is already zero but the
file has not been freed yet. This will remove the dmabuf from some
public list but it won't free it.

dmabuf dentries have dma_buf_dentry_ops which use
dentry->d_release::dma_buf_release() to release the actual dmabuf
stashed in dentry->d_fsdata.

So that ends up with:

__fput()
-> f_op->release::dma_buf_file_release() // handles file specific freeing
-> dput()
   -> d_op->d_release::dma_buf_release() // handles dmabuf freeing
                                         // including the driver specific stuff.

If you fput() the file then the dmabuf will be freed as well immediately
after it when the dput() happens in __fput().

So that TTM thing does something else then in ttm_object_device_init().
It copies the dma_buf_ops into tdev->ops and replaces the dma_buf_ops
release method with it's own ttm_prime_dmabuf_release() and stashes the
old on in tdev->dma_buf_release.

And it uses that to hook into the release path so that @dmabuf will
still be valid for get_dma_buf_unless_doomed() under prime->mutex.

But again, get_dma_buf_unless_doomed() can just be replaced with
get_active_file() and then we're done with that part.

> IOW, the tty use looks correct to me, and it has fairly simple locking
> and is just catching the the race between 'fput()' decrementing the
> refcount and and 'file->f_op->release()' doing the actual release.
> 
> You are right that it's similar to the epoll thing in that sense, it
> just looks a _lot_ more straightforward to me (and, unlike epoll,
> doesn't look actively buggy right now).

It's not buggy afaict. It literally can just switch to get_active_file()
instead of open-coding it and we're done imho.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 21:24       ` Al Viro
  2024-05-03 21:33         ` Linus Torvalds
@ 2024-05-04  9:25         ` Hillf Danton
  1 sibling, 0 replies; 90+ messages in thread
From: Hillf Danton @ 2024-05-04  9:25 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-kernel, Linus Torvalds, Kees Cook, Christian Konig,
	minhquangbui99, syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, 3 May 2024 22:24:28 +0100 Al Viro wrote:
> On Fri, May 03, 2024 at 02:11:30PM -0700, Linus Torvalds wrote:
> > epoll is a mess, and does various invalid things in the name of
> > performance.
> > 
> > Let's try to rein it in a bit. Something like this, perhaps?
> 
> > +/*
> > + * The ffd.file pointer may be in the process of
> > + * being torn down due to being closed, but we
> > + * may not have finished eventpoll_release() yet.
> > + *
> > + * Technically, even with the atomic_long_inc_not_zero,
> > + * the file may have been free'd and then gotten
> > + * re-allocated to something else (since files are
> > + * not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU).
> 
> Can we get to ep_item_poll(epi, ...) after eventpoll_release_file()
> got past __ep_remove()?  Because if we can, we have a worse problem -

Nope but mtx can help poll go before remove, see below.

> epi freed under us.
> 
> If not, we couldn't possibly have reached ->release() yet, let
> alone freeing anything.

Actually poll can see a file with zero f_count, and LT's idea with
trival change survived the syzbot repro [1].

I think fput currently can race with epoll wrt f_count, and checking
it in dma-buf is necessary if his idea looks too aggressive.

	wait_epoll()			__fput()
	do_epoll_wait()			eventpoll_release_file()
	ep_poll()
	ep_send_events()
	mutex_lock(&ep->mtx)
	ep_item_poll()
	vfs_poll()
	mutex_unlock(&ep->mtx)
					mutex_lock(&ep->mtx)
					dispose = __ep_remove(ep, epi, true)
					mutex_unlock(&ep->mtx)

[1] https://lore.kernel.org/lkml/000000000000f1c99d061798ac6d@google.com/

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 21:33         ` Linus Torvalds
  2024-05-03 21:45           ` Al Viro
@ 2024-05-04  9:37           ` Christian Brauner
  2024-05-04 15:32             ` Linus Torvalds
  1 sibling, 1 reply; 90+ messages in thread
From: Christian Brauner @ 2024-05-04  9:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, May 03, 2024 at 02:33:37PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 14:24, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Can we get to ep_item_poll(epi, ...) after eventpoll_release_file()
> > got past __ep_remove()?  Because if we can, we have a worse problem -
> > epi freed under us.
> 
> Look at the hack in __ep_remove(): if it is concurrent with
> eventpoll_release_file(), it will hit this code
> 
>         spin_lock(&file->f_lock);
>         if (epi->dying && !force) {
>                 spin_unlock(&file->f_lock);
>                 return false;
>         }
> 
> and not free the epi.
> 
> But as far as I can tell, almost nothing else cares about the f_lock
> and dying logic.
> 
> And in fact, I don't think doing
> 
>         spin_lock(&file->f_lock);
> 
> is even valid in the places that look up file through "epi->ffd.file",
> because the lock itself is inside the thing that you can't trust until
> you've taken the lock...
> 
> So I agree with Kees about the use of "atomic_dec_not_zero()" kind of
> logic - but it also needs to be in an RCU-readlocked region, I think.

Why isn't it enough to just force dma_buf_poll() to use
get_file_active()? Then that whole problem goes away afaict.

So the fix I had yesterday before I had to step away from the computer
was literally just that [1]. It currently uses two atomic incs
potentially but that can probably be fixed by the dma folks to be
smarter about when they actually need to take a file reference.

> 
> I wish epoll() just took the damn file ref itself. But since it relies
> on the file refcount to release the data structure, that obviously
> can't work.
> 
>                 Linus

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..7149c45976e1 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -244,13 +244,18 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
        if (!dmabuf || !dmabuf->resv)
                return EPOLLERR;

+       if (!get_file_active(&dmabuf->file))
+               return EPOLLERR;
+
        resv = dmabuf->resv;

        poll_wait(file, &dmabuf->poll, poll);

        events = poll_requested_events(poll) & (EPOLLIN | EPOLLOUT);
-       if (!events)
+       if (!events) {
+               fput(file);
                return 0;
+       }

        dma_resv_lock(resv, NULL);

@@ -268,7 +273,6 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
                if (events & EPOLLOUT) {
                        /* Paired with fput in dma_buf_poll_cb */
                        get_file(dmabuf->file);
-
                        if (!dma_buf_poll_add_cb(resv, true, dcb))
                                /* No callback queued, wake up any other waiters */
                                dma_buf_poll_cb(NULL, &dcb->cb);
@@ -301,6 +305,7 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
        }

        dma_resv_unlock(resv);
+       fput(file);
        return events;
 }

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 18:26   ` get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove) Kees Cook
  2024-05-03 18:49     ` Jens Axboe
  2024-05-03 21:11     ` [PATCH] epoll: try to be a _bit_ better about file lifetimes Linus Torvalds
@ 2024-05-04  9:45     ` Christian Brauner
  2 siblings, 0 replies; 90+ messages in thread
From: Christian Brauner @ 2024-05-04  9:45 UTC (permalink / raw)
  To: Kees Cook
  Cc: Bui Quang Minh, Al Viro, syzbot, axboe, io-uring, jack,
	linux-fsdevel, linux-kernel, syzkaller-bugs, Sumit Semwal,
	Christian König, linux-media, dri-devel, linaro-mm-sig,
	Laura Abbott

On Fri, May 03, 2024 at 11:26:32AM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 06:54:22PM +0700, Bui Quang Minh wrote:
> > [...]
> > Root cause:
> > AFAIK, eventpoll (epoll) does not increase the registered file's reference.
> > To ensure the safety, when the registered file is deallocated in __fput,
> > eventpoll_release is called to unregister the file from epoll. When calling
> > poll on epoll, epoll will loop through registered files and call vfs_poll on
> > these files. In the file's poll, file is guaranteed to be alive, however, as
> > epoll does not increase the registered file's reference, the file may be
> > dying
> > and it's not safe the get the file for later use. And dma_buf_poll violates
> > this. In the dma_buf_poll, it tries to get_file to use in the callback. This
> > leads to a race where the dmabuf->file can be fput twice.
> > 
> > Here is the race occurs in the above proof-of-concept
> > 
> > close(dmabuf->file)
> > __fput_sync (f_count == 1, last ref)
> > f_count-- (f_count == 0 now)
> > __fput
> >                                     epoll_wait
> >                                     vfs_poll(dmabuf->file)
> >                                     get_file(dmabuf->file)(f_count == 1)
> > eventpoll_release
> > dmabuf->file deallocation
> >                                     fput(dmabuf->file) (f_count == 1)
> >                                     f_count--
> >                                     dmabuf->file deallocation
> > 
> > I am not familiar with the dma_buf so I don't know the proper fix for the
> > issue. About the rule that don't get the file for later use in poll callback
> > of
> > file, I wonder if it is there when only select/poll exist or just after
> > epoll
> > appears.
> > 
> > I hope the analysis helps us to fix the issue.
> 
> Thanks for doing this analysis! I suspect at least a start of a fix
> would be this:
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 8fe5aa67b167..15e8f74ee0f2 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
>  
>  		if (events & EPOLLOUT) {
>  			/* Paired with fput in dma_buf_poll_cb */
> -			get_file(dmabuf->file);
> -
> -			if (!dma_buf_poll_add_cb(resv, true, dcb))
> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> +			    !dma_buf_poll_add_cb(resv, true, dcb))
>  				/* No callback queued, wake up any other waiters */
>  				dma_buf_poll_cb(NULL, &dcb->cb);
>  			else
> @@ -290,9 +289,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
>  
>  		if (events & EPOLLIN) {
>  			/* Paired with fput in dma_buf_poll_cb */
> -			get_file(dmabuf->file);
> -
> -			if (!dma_buf_poll_add_cb(resv, false, dcb))
> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> +			    !dma_buf_poll_add_cb(resv, false, dcb))
>  				/* No callback queued, wake up any other waiters */
>  				dma_buf_poll_cb(NULL, &dcb->cb);
>  			else
> 
> 
> But this ends up leaving "active" non-zero, and at close time it runs
> into:
> 
>         BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> 
> But the bottom line is that get_file() is unsafe to use in some places,
> one of which appears to be in the poll handler. There are maybe some
> other fragile places too, like in drivers/gpu/drm/vmwgfx/ttm_object.c:
> 
> static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
> {
> 	return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
> }
> 
> Which I also note involves a dmabuf...
> 
> Due to this issue I've proposed fixing get_file() to detect pathological states:
> https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/
> 
> But that has run into some push-back. I'm hoping that seeing this epoll
> example will help illustrate what needs fixing a little better.
> 
> I think the best current proposal is to just WARN sooner instead of a
> full refcount_t implementation:
> 
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8dfd53b52744..e09107d0a3d6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1040,7 +1040,8 @@ struct file_handle {
>  
>  static inline struct file *get_file(struct file *f)
>  {
> -	atomic_long_inc(&f->f_count);
> +	long prior = atomic_long_fetch_inc_relaxed(&f->f_count);
> +	WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n");
>  	return f;
>  }
>  
> 
> 
> What's the right way to deal with the dmabuf situation? (And I suspect
> it applies to get_dma_buf_unless_doomed() as well...)

No, it doesn't. That's safe afaict as I've explained at lenght in
the other thread.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 19:59           ` Kees Cook
  2024-05-03 20:28             ` Kees Cook
@ 2024-05-04  9:59             ` Christian Brauner
  1 sibling, 0 replies; 90+ messages in thread
From: Christian Brauner @ 2024-05-04  9:59 UTC (permalink / raw)
  To: Kees Cook
  Cc: Jens Axboe, Bui Quang Minh, Al Viro, syzbot, io-uring, jack,
	linux-fsdevel, linux-kernel, syzkaller-bugs, Sumit Semwal,
	Christian König, linux-media, dri-devel, linaro-mm-sig,
	Laura Abbott

On Fri, May 03, 2024 at 12:59:52PM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 01:35:09PM -0600, Jens Axboe wrote:
> > On 5/3/24 1:22 PM, Kees Cook wrote:
> > > On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
> > >> On 5/3/24 12:26 PM, Kees Cook wrote:
> > >>> Thanks for doing this analysis! I suspect at least a start of a fix
> > >>> would be this:
> > >>>
> > >>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > >>> index 8fe5aa67b167..15e8f74ee0f2 100644
> > >>> --- a/drivers/dma-buf/dma-buf.c
> > >>> +++ b/drivers/dma-buf/dma-buf.c
> > >>> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
> > >>>  
> > >>>  		if (events & EPOLLOUT) {
> > >>>  			/* Paired with fput in dma_buf_poll_cb */
> > >>> -			get_file(dmabuf->file);
> > >>> -
> > >>> -			if (!dma_buf_poll_add_cb(resv, true, dcb))
> > >>> +			if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> > >>> +			    !dma_buf_poll_add_cb(resv, true, dcb))
> > >>>  				/* No callback queued, wake up any other waiters */
> > >>
> > >> Don't think this is sane at all. I'm assuming you meant:
> > >>
> > >> 	atomic_long_inc_not_zero(&dmabuf->file->f_count);
> > > 
> > > Oops, yes, sorry. I was typed from memory instead of copy/paste.
> > 
> > Figured :-)
> > 
> > >> but won't fly as you're not under RCU in the first place. And what
> > >> protects it from being long gone before you attempt this anyway? This is
> > >> sane way to attempt to fix it, it's completely opposite of what sane ref
> > >> handling should look like.
> > >>
> > >> Not sure what the best fix is here, seems like dma-buf should hold an
> > >> actual reference to the file upfront rather than just stash a pointer
> > >> and then later _hope_ that it can just grab a reference. That seems
> > >> pretty horrible, and the real source of the issue.
> > > 
> > > AFAICT, epoll just doesn't hold any references at all. It depends,
> > > I think, on eventpoll_release() (really eventpoll_release_file())
> > > synchronizing with epoll_wait() (but I don't see how this happens, and
> > > the race seems to be against ep_item_poll() ...?)
> > >
> > > I'm really confused about how eventpoll manages the lifetime of polled
> > > fds.
> > 
> > epoll doesn't hold any references, and it's got some ugly callback to
> > deal with that. It's not ideal, nor pretty, but that's how it currently
> > works. See eventpoll_release() and how it's called. This means that
> > epoll itself is supposedly safe from the file going away, even though it
> > doesn't hold a reference to it.
> 
> Right -- what remains unclear to me is how struct file lifetime is
> expected to work in the struct file_operations::poll callbacks. Because
> using get_file() there looks clearly unsafe...

If you're in ->poll() you're holding the epoll mutex and
eventpoll_release_file() needs to acquire ep->mtx as well. So if you're
in ->poll() then you know that eventpoll_release_file() can't progress
and therefore eventpoll_release() can't make progress. So
f_op->release() won't be able to be called as it happens after
eventpoll_release() in __fput(). But f_count being able to go to zero is
expected.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 23:39                     ` Al Viro
  2024-05-03 23:54                       ` Linus Torvalds
@ 2024-05-04 10:44                       ` Christian Brauner
  1 sibling, 0 replies; 90+ messages in thread
From: Christian Brauner @ 2024-05-04 10:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, keescook, axboe, christian.koenig, dri-devel,
	io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sat, May 04, 2024 at 12:39:00AM +0100, Al Viro wrote:
> On Fri, May 03, 2024 at 04:16:15PM -0700, Linus Torvalds wrote:
> > On Fri, 3 May 2024 at 15:07, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > Suppose your program calls select() on a pipe and dmabuf, sees data to be read
> > > from pipe, reads it, closes both pipe and dmabuf and exits.
> > >
> > > Would you expect that dmabuf file would stick around for hell knows how long
> > > after that?  I would certainly be very surprised by running into that...
> > 
> > Why?
> > 
> > That's the _point_ of refcounts. They make the thing they refcount
> > stay around until it's no longer referenced.
> > 
> > Now, I agree that dmabuf's are a bit odd in how they use a 'struct
> > file' *as* their refcount, but hey, it's a specialty use. Unusual
> > perhaps, but not exactly wrong.
> > 
> > I suspect that if you saw a dmabuf just have its own 'refcount_t' and
> > stay around until it was done, you wouldn't bat an eye at it, and it's
> > really just the "it uses a struct file for counting" that you are
> > reacting to.
> 
> *IF* those files are on purely internal filesystem, that's probably
> OK; do that with something on something mountable (char device,
> sysfs file, etc.) and you have a problem with filesystem staying
> busy.

In this instance it is ok because dma-buf is an internal fs. I had the
exact same reaction you had initially but it doesn't matter for dma-buf
afaict as that thing can never be unmounted.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-04  9:37           ` Christian Brauner
@ 2024-05-04 15:32             ` Linus Torvalds
  2024-05-04 15:40               ` Linus Torvalds
  2024-05-04 18:20               ` Linus Torvalds
  0 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-04 15:32 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sat, 4 May 2024 at 02:37, Christian Brauner <brauner@kernel.org> wrote:
>
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -244,13 +244,18 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
>         if (!dmabuf || !dmabuf->resv)
>                 return EPOLLERR;
>
> +       if (!get_file_active(&dmabuf->file))
> +               return EPOLLERR;
[...]

I *really* don't think anything that touches dma-buf.c can possibly be right.

This is not a dma-buf.c bug.

This is purely an epoll bug.

Lookie here, the fundamental issue is that epoll can call '->poll()'
on a file descriptor that is being closed concurrently.

That means that *ANY* driver that relies on *any* data structure that
is managed by the lifetime of the 'struct file' will have problems.

Look, here's sock_poll():

    static __poll_t sock_poll(struct file *file, poll_table *wait)
    {
        struct socket *sock = file->private_data;

and that first line looks about as innocent as it possibly can, right?

Now, imagine that this is called from 'epoll' concurrently with the
file being closed for the last time (but it just hasn't _quite_
reached eventpoll_release() yet).

Now, imagine that the kernel is built with preemption, and the epoll
thread gets preempted _just_ before it loads 'file->private_data'.

Furthermore, the machine is under heavy load, and it just stays off
its CPU a long time.

Now, during this TOTALLY INNOCENT sock_poll(), in another thread, the
file closing completes, eventpoll_release() finishes, and the
preemption of the poll() thing just takes so long that you go through
an RCU period too, so that the actual file has been released too.

So now that totally innoced file->private_data load in the poll() is
probably going to get random data.

Yes, the file is allocated as SLAB_TYPESAFE_BY_RCU, so it's probably
still a file. Not guaranteed, even the slab will get fully free'd in
some situations. And yes, the above case is impossible to hit in
practice. You have to hit quite the small race window with an
operation that practically never happens in the first place.

But my point is that the fact that the problem with file->f_count
lifetimes happens for that dmabuf driver is not the fault of the
dmabuf code. Not at all.

It is *ENTIRELY* a bug in epoll, and the dmabuf code is probably just
easier to hit because it has a poll() function that does things that
have longer lifetimes than most things, and interacts more directly
with that f_count.

So I really don't understand why Al thinks this is "dmabuf does bad
things with f_count". It damn well does not. dma-buf is the GOOD GUY
here. It's doing things *PROPERLY*. It's taking refcounts like it damn
well should.

The fact that it takes ref-counts on something that the epoll code has
messed up is *NOT* its fault.

                Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-04 15:32             ` Linus Torvalds
@ 2024-05-04 15:40               ` Linus Torvalds
  2024-05-04 15:53                 ` Linus Torvalds
  2024-05-05 10:50                 ` Christian Brauner
  2024-05-04 18:20               ` Linus Torvalds
  1 sibling, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-04 15:40 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sat, 4 May 2024 at 08:32, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Now, during this TOTALLY INNOCENT sock_poll(), in another thread, the
> file closing completes, eventpoll_release() finishes [..]

Actually, Al is right that ep_item_poll() should be holding the
ep->mtx, so eventpoll_release() -> eventpoll_release_file_file() ->
mutex_lock(&ep->mtx) should block and the file doesn't actually get
released.

So I guess the sock_poll() issue cannot happen. It does need some
poll() function that does 'fget()', and believes that it works.

But because the f_count has already gone down to zero, fget() doesn't
work, and doesn't keep the file around, and you have the bug.

The cases that do fget() in poll() are probably race, but they aren't
buggy. epoll is buggy.

So my example wasn't going to work, but the argument isn't really any
different, it's just a much more limited case that breaks.

And maybe it's even *only* dma-buf that does that fget() in its
->poll() function. Even *then* it's not a dma-buf.c bug.

               Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-04 15:40               ` Linus Torvalds
@ 2024-05-04 15:53                 ` Linus Torvalds
  2024-05-05 19:46                   ` Al Viro
  2024-05-05 10:50                 ` Christian Brauner
  1 sibling, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2024-05-04 15:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sat, 4 May 2024 at 08:40, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And maybe it's even *only* dma-buf that does that fget() in its
> ->poll() function. Even *then* it's not a dma-buf.c bug.

They all do in the sense that they do

  poll_wait
    -> __pollwait
     -> get_file (*boom*)

but the boom is very small because the poll_wait() will be undone by
poll_freewait(), and normally poll/select has held the file count
elevated.

Except for epoll. Which leaves those pollwait entries around until
it's done - but again will be held up on the ep->mtx before it does
so.

So everybody does some f_count games, but possibly dma-buf is the only
one that ends up expecting to hold on to the f_count for longer
periods.

             Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-04 15:32             ` Linus Torvalds
  2024-05-04 15:40               ` Linus Torvalds
@ 2024-05-04 18:20               ` Linus Torvalds
  2024-05-06 14:29                 ` [Linaro-mm-sig] " Christian König
  1 sibling, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2024-05-04 18:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sat, 4 May 2024 at 08:32, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Lookie here, the fundamental issue is that epoll can call '->poll()'
> on a file descriptor that is being closed concurrently.

Thinking some more about this, and replying to myself...

Actually, I wonder if we could *really* fix this by simply moving the
eventpoll_release() to where it really belongs.

If we did it in file_close_fd_locked(),  it would actually make a
*lot* more sense. Particularly since eventpoll actually uses this:

    struct epoll_filefd {
        struct file *file;
        int fd;
    } __packed;

ie it doesn't just use the 'struct file *', it uses the 'fd' itself
(for ep_find()).

(Strictly speaking, it should also have a pointer to the 'struct
files_struct' to make the 'int fd' be meaningful).

IOW, eventpoll already considers the file _descriptor_ relevant, not
just the file pointer, and that's destroyed at *close* time, not at
'fput()' time.

Yeah, yeah, the locking situation in file_close_fd_locked() is a bit
inconvenient, but if we can solve that, it would solve the problem in
a fundamentally different way: remove the ep iterm before the
file->f_count has actually been decremented, so the whole "race with
fput()" would just go away entirely.

I dunno. I think that would be the right thing to do, but I wouldn't
be surprised if some disgusting eventpoll user then might depend on
the current situation where the eventpoll thing stays around even
after the close() if you have another copy of the file open.

             Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-04 15:40               ` Linus Torvalds
  2024-05-04 15:53                 ` Linus Torvalds
@ 2024-05-05 10:50                 ` Christian Brauner
  2024-05-05 16:46                   ` Linus Torvalds
  1 sibling, 1 reply; 90+ messages in thread
From: Christian Brauner @ 2024-05-05 10:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sat, May 04, 2024 at 08:40:25AM -0700, Linus Torvalds wrote:
> On Sat, 4 May 2024 at 08:32, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Now, during this TOTALLY INNOCENT sock_poll(), in another thread, the
> > file closing completes, eventpoll_release() finishes [..]
> 
> Actually, Al is right that ep_item_poll() should be holding the
> ep->mtx, so eventpoll_release() -> eventpoll_release_file_file() ->
> mutex_lock(&ep->mtx) should block and the file doesn't actually get
> released.

So I know you've seen it yourself but for my own peace of mind I've said
that in the other mail and in the other thread already that all callers
of ep_item_poll() do already hold the ep->mtx:

do_epoll_ctl()
-> epoll_mutex_lock(&ep->mtx)
-> ep_insert()
   -> ep_item_poll()

do_epoll_ctl()
-> epoll_mutex_lock(&ep->mtx)
-> ep_modify()
   -> ep_item_poll()

ep_send_events()
-> mutex_lock(&ep->mtx)
-> ep_item_poll()

/* nested; and all callers of ep_item_poll() already hold ep->mtx */
__ep_eventpoll_poll()
-> mutex_lock_nested(&ep->mtx, wait)
-> ep_item_poll()

So it's simply not possible to end up with a UAF in f_op->poll() because
eventpoll_release_file_file() serializes on ep->mtx as well:

__fput()
-> eventpoll_release()
   -> eventpoll_release_file()
      {
              // @file->f_count is zero _but file is not freed_
              // so taking file->f_lock is absolutely fine
              spin_lock(&file->f_lock);
              // mark as dying

              // serialzed on ep->mtx
              mutex_lock(&ep->mtx);
              __ep_rmove(ep, epi);
              ...

      }
      -> mutex_lock(&ep->mtx)

-> f_op->release()
-> kfree(file)

So afaict it's simply not possible to end up with a UAF in f_op->poll().

And I agree with you that for some instances it's valid to take another
reference to a file from f_op->poll() but then they need to use
get_file_active() imho and simply handle the case where f_count is zero.
And we need to document that in Documentation/filesystems/file.rst or
locking.rst.

But if it's simply just dma buf that cares about that long-term
reference then really we should just force them to take the reference
like I suggested but don't penalize everyone else. When I took a glance
at all f_op->poll() implementations I didn't spot one that did take
extra references.

But if you absolutely want to have epoll take the reference before it
calls into f_op->poll() that's fine with me as well. But we might end up
forcing epoll to do a lot of final fput()s which I'm not sure is all
that desirable.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-05 10:50                 ` Christian Brauner
@ 2024-05-05 16:46                   ` Linus Torvalds
  2024-05-05 17:55                     ` [PATCH v2] epoll: be " Linus Torvalds
                                       ` (2 more replies)
  0 siblings, 3 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-05 16:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sun, 5 May 2024 at 03:50, Christian Brauner <brauner@kernel.org> wrote:
>
> And I agree with you that for some instances it's valid to take another
> reference to a file from f_op->poll() but then they need to use
> get_file_active() imho and simply handle the case where f_count is zero.

I think this is

 (a) practically impossible to find (since most f_count updates are in
various random helpers)

 (b) not tenable in the first place, since *EVERYBODY* does a f_count
update as part of the bog-standard pollwait

So (b) means that the notion of "warn if somebody increments f_count
from zero" is broken to begin with - but it's doubly broken because it
wouldn't find anything *anyway*, since this never happens in any
normal situation.

And (a) means that any non-automatic finding of this is practically impossible.

> And we need to document that in Documentation/filesystems/file.rst or
> locking.rst.

WHY?

Why cannot you and Al just admit that the problem is in epoll. Always
has been, always will be.

The fact is, it's not dma-buf that is violating any rules. It's epoll.
It's calling out to random driver functions with a file pointer that
is no longer valid.

It really is that simple.

I don't see why you are arguing for "unknown number of drivers - we
know at least *one* - have to be fixed for a bug that is in epoll".

If it was *easy* to fix, and if it was *easy* to validate, then  sure.
But that just isn't the case.

In contrast, in epoll it's *trivial* to fix the one case where it does
a VFS call-out, and just say "you have to follow the rules".

So explain to me again why you want to mess up the driver interface
and everybody who has a '.poll()' function, and not just fix the ONE
clearly buggy piece of code.

Because dammit,. epoll is clearly buggy. It's not enough to say "the
file allocation isn't going away", and claim that that means that it's
not buggy - when the file IS NO LONGER VALID!

                      Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 21:11     ` [PATCH] epoll: try to be a _bit_ better about file lifetimes Linus Torvalds
  2024-05-03 21:24       ` Al Viro
@ 2024-05-05 17:31       ` Jens Axboe
  1 sibling, 0 replies; 90+ messages in thread
From: Jens Axboe @ 2024-05-05 17:31 UTC (permalink / raw)
  To: Linus Torvalds, keescook
  Cc: brauner, christian.koenig, dri-devel, io-uring, jack, laura,
	linux-fsdevel, linux-kernel, linux-media, minhquangbui99,
	sumit.semwal, syzbot+045b454ab35fd82a35fb, syzkaller-bugs, viro

On 5/3/24 3:11 PM, Linus Torvalds wrote:
> epoll is a mess, and does various invalid things in the name of
> performance.
> 
> Let's try to rein it in a bit. Something like this, perhaps?
> 
> Not-yet-signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
> 
> This is entirely untested, thus the "Not-yet-signed-off-by".  But I
> think this may be kind of the right path forward. 
> 
> I suspect the ->poll() call is the main case that matters, but there are
> other places where eventpoll just looks up the file pointer without then
> being very careful about it.  The sock_from_file(epi->ffd.file) uses in
> particular should probably also use this to look up the file. 
> 
> Comments?

FWIW, I agree that epoll is the odd one out and there's no reason NOT to
close this gap, regardless of how safe we currently think the existing
usage is.

I've done some basic testing with this - both to verify it fixes the
actual issue at hand (it does, crashes trivially without it), and
networking/pipe based epoll usage and no ill effects observed. Also
passes all ltp test cases as well, but I was less concerned about that
side.

Reviewed-by: Jens Axboe <axboe@kernel.dk>

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH v2] epoll: be better about file lifetimes
  2024-05-05 16:46                   ` Linus Torvalds
@ 2024-05-05 17:55                     ` Linus Torvalds
  2024-05-05 18:04                       ` Jens Axboe
  2024-05-05 20:01                       ` David Laight
  2024-05-05 20:12                     ` [PATCH] epoll: try to be a _bit_ " Al Viro
  2024-05-06  8:45                     ` Christian Brauner
  2 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-05 17:55 UTC (permalink / raw)
  To: torvalds
  Cc: axboe, brauner, christian.koenig, dri-devel, io-uring, jack,
	keescook, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs, viro

epoll can call out to vfs_poll() with a file pointer that may race with
the last 'fput()'. That would make f_count go down to zero, and while
the ep->mtx locking means that the resulting file pointer tear-down will
be blocked until the poll returns, it means that f_count is already
dead, and any use of it won't actually get a reference to the file any
more: it's dead regardless.

Make sure we have a valid ref on the file pointer before we call down to
vfs_poll() from the epoll routines.

Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/
Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Changes since v1:

 - add Link, Reported-by, and Jens' reviewed-by. And sign off on it
   because it looks fine to me and we have some testing now.

 - move epi_fget() closer to the user

 - more comments about the background

 - remove the rcu_read_lock(), with the comment explaining why it's not
   needed

 - note about returning zero rather than something like EPOLLERR|POLLHUP
   for a file that is going away

 fs/eventpoll.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 882b89edc52a..a3f0f868adc4 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -979,6 +979,37 @@ static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int dep
 	return res;
 }
 
+/*
+ * The ffd.file pointer may be in the process of
+ * being torn down due to being closed, but we
+ * may not have finished eventpoll_release() yet.
+ *
+ * Normally, even with the atomic_long_inc_not_zero,
+ * the file may have been free'd and then gotten
+ * re-allocated to something else (since files are
+ * not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU).
+ *
+ * But for epoll, users hold the ep->mtx mutex, and
+ * as such any file in the process of being free'd
+ * will block in eventpoll_release_file() and thus
+ * the underlying file allocation will not be free'd,
+ * and the file re-use cannot happen.
+ *
+ * For the same reason we can avoid a rcu_read_lock()
+ * around the operation - 'ffd.file' cannot go away
+ * even if the refcount has reached zero (but we must
+ * still not call out to ->poll() functions etc).
+ */
+static struct file *epi_fget(const struct epitem *epi)
+{
+	struct file *file;
+
+	file = epi->ffd.file;
+	if (!atomic_long_inc_not_zero(&file->f_count))
+		file = NULL;
+	return file;
+}
+
 /*
  * Differs from ep_eventpoll_poll() in that internal callers already have
  * the ep->mtx so we need to start from depth=1, such that mutex_lock_nested()
@@ -987,14 +1018,23 @@ static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int dep
 static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 				 int depth)
 {
-	struct file *file = epi->ffd.file;
+	struct file *file = epi_fget(epi);
 	__poll_t res;
 
+	/*
+	 * We could return EPOLLERR | EPOLLHUP or something,
+	 * but let's treat this more as "file doesn't exist,
+	 * poll didn't happen".
+	 */
+	if (!file)
+		return 0;
+
 	pt->_key = epi->event.events;
 	if (!is_file_epoll(file))
 		res = vfs_poll(file, pt);
 	else
 		res = __ep_eventpoll_poll(file, pt, depth);
+	fput(file);
 	return res & epi->event.events;
 }
 
-- 
2.44.0.330.g4d18c88175


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH v2] epoll: be better about file lifetimes
  2024-05-05 17:55                     ` [PATCH v2] epoll: be " Linus Torvalds
@ 2024-05-05 18:04                       ` Jens Axboe
  2024-05-05 20:01                       ` David Laight
  1 sibling, 0 replies; 90+ messages in thread
From: Jens Axboe @ 2024-05-05 18:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: brauner, christian.koenig, dri-devel, io-uring, jack, keescook,
	laura, linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs, viro

On 5/5/24 11:55 AM, Linus Torvalds wrote:
> epoll can call out to vfs_poll() with a file pointer that may race with
> the last 'fput()'. That would make f_count go down to zero, and while
> the ep->mtx locking means that the resulting file pointer tear-down will
> be blocked until the poll returns, it means that f_count is already
> dead, and any use of it won't actually get a reference to the file any
> more: it's dead regardless.
> 
> Make sure we have a valid ref on the file pointer before we call down to
> vfs_poll() from the epoll routines.
> 
> Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/
> Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com
> Reviewed-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
> 
> Changes since v1:
> 
>  - add Link, Reported-by, and Jens' reviewed-by. And sign off on it
>    because it looks fine to me and we have some testing now.
> 
>  - move epi_fget() closer to the user
> 
>  - more comments about the background
> 
>  - remove the rcu_read_lock(), with the comment explaining why it's not
>    needed
> 
>  - note about returning zero rather than something like EPOLLERR|POLLHUP
>    for a file that is going away

I did look at that 0 return as well and agreed this is the right choice,
but adding the comment is a good idea.

Anyway, patch still looks fine to me. I'd word wrap the comment section
above epi_fget() wider, but that's just a stylistic choice...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-04 15:53                 ` Linus Torvalds
@ 2024-05-05 19:46                   ` Al Viro
  2024-05-05 20:03                     ` Linus Torvalds
  0 siblings, 1 reply; 90+ messages in thread
From: Al Viro @ 2024-05-05 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, keescook, axboe, christian.koenig, dri-devel,
	io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sat, May 04, 2024 at 08:53:47AM -0700, Linus Torvalds wrote:

>   poll_wait
>     -> __pollwait
>      -> get_file (*boom*)
> 
> but the boom is very small because the poll_wait() will be undone by
> poll_freewait(), and normally poll/select has held the file count
> elevated.

Not quite.  It's not that poll_wait() calls __pollwait(); it calls
whatever callback that caller of ->poll() has set for it.

__pollwait users (select(2) and poll(2), currently) must (and do) make
sure that refcount is elevated; others (and epoll is not the only one)
need to do whatever's right for their callbacks.

I've no problem with having epoll grab a reference, but if we make that
a universal requirement ->poll() instances can rely upon, we'd better
verify that *all* vfs_poll() are OK.  And that ought to go into
Documentation/filesystems/porting.rst ("callers of vfs_poll() must
make sure that file is pinned; ->poll() instances may rely upon that,
but they'd better be very careful about grabbing extra references themselves -
it's acceptable for files on internal mounts, but *NOT* for anything on
mountable filesystems.  Any instance that does it needs an explicit
comment telling not to reuse that blindly." or something along those
lines).

Excluding epoll, select/poll and callers that have just done fdget() and will
do fdput() after vfs_poll(), we have this:

drivers/vhost/vhost.c:213:      mask = vfs_poll(file, &poll->table);
	vhost_poll_start().  Might get interesting...  Calls working
with vq->kick as file seem to rely upon vq->mutex, but I'll need to
refresh my memories of that code to check if that's all we need - and
then there's vhost_net_enable_vq(), which also needs an audit.

fs/aio.c:1738:          mask = vfs_poll(req->file, &pt) & req->events;
fs/aio.c:1932:  mask = vfs_poll(req->file, &apt.pt) & req->events;
	aio_poll() and aio_poll_wake() resp.  req->file here is actually ->ki_filp
	of iocb that contains work as iocb->req.work; it get dropped only in
	iocb_destroy(), which also frees iocb.  Any call that might've run into
	req->file not pinned is already in UAF land.

io_uring/poll.c:303:                    req->cqe.res = vfs_poll(req->file, &pt) & req->apoll_events;
io_uring/poll.c:622:    mask = vfs_poll(req->file, &ipt->pt) & poll->events;
	Should have req->file pinned, but I'll need to RTFS a bit for
details.  That, or ask Jens to confirm...

net/9p/trans_fd.c:236:  ret = vfs_poll(ts->rd, pt);
net/9p/trans_fd.c:238:          ret = (ret & ~EPOLLOUT) | (vfs_poll(ts->wr, pt) & ~EPOLLIN);
	p9_fd_poll(); ->rd and ->wr are pinned and won't get dropped until
p9_fd_close(), which frees ts immediately afterwards.  IOW, if we risk
being called with ->rd or ->wr not pinned, we are in UAF land already.
Incidentally, what the hell is this in p9_fd_open()?
         * It's technically possible for userspace or concurrent mounts to
         * modify this flag concurrently, which will likely result in a
         * broken filesystem. However, just having bad flags here should
         * not crash the kernel or cause any other sort of bug, so mark this
         * particular data race as intentional so that tooling (like KCSAN)
         * can allow it and detect further problems.
         */
Why not simply fix the race instead?  As in
	spin_lock(&ts->rd->f_lock);
        ts->rd->f_flags |= O_NONBLOCK;
	spin_unlock(&ts->rd->f_lock);
and similar for ts->wr?  Sigh...

^ permalink raw reply	[flat|nested] 90+ messages in thread

* RE: [PATCH v2] epoll: be better about file lifetimes
  2024-05-05 17:55                     ` [PATCH v2] epoll: be " Linus Torvalds
  2024-05-05 18:04                       ` Jens Axboe
@ 2024-05-05 20:01                       ` David Laight
  2024-05-05 20:16                         ` Linus Torvalds
  1 sibling, 1 reply; 90+ messages in thread
From: David Laight @ 2024-05-05 20:01 UTC (permalink / raw)
  To: 'Linus Torvalds'
  Cc: axboe, brauner, christian.koenig, dri-devel, io-uring, jack,
	keescook, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs, viro

From: Linus Torvalds
> Sent: 05 May 2024 18:56
> 
> epoll can call out to vfs_poll() with a file pointer that may race with
> the last 'fput()'. That would make f_count go down to zero, and while
> the ep->mtx locking means that the resulting file pointer tear-down will
> be blocked until the poll returns, it means that f_count is already
> dead, and any use of it won't actually get a reference to the file any
> more: it's dead regardless.
> 
> Make sure we have a valid ref on the file pointer before we call down to
> vfs_poll() from the epoll routines.

How much is the extra pair of atomics going to hurt performance?
IIRC a lot of work was done to (try to) make epoll lockless.

Perhaps the 'hook' into epoll (usually) from sys_close should be
done before any of the references are removed?
(Which is different from Q6/A6 in man epoll - but that seems to be
documenting a bug!)
Then the ->poll() callback can't happen (assuming it is properly
locked) after the ->release() one.

It seems better to add extra atomics to the close/final-fput path
rather than ever ->poll() call epoll makes.

I can get extra references to a driver by dup() open("/dev/fd/n")
and mmap() - but epoll is definitely fd based.
(Which may be why it has the fd number in its data.)

Is there another race between EPOLL_CTL_ADD and close() (from a
different thread)?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-05 19:46                   ` Al Viro
@ 2024-05-05 20:03                     ` Linus Torvalds
  2024-05-05 20:30                       ` Al Viro
  0 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2024-05-05 20:03 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, keescook, axboe, christian.koenig, dri-devel,
	io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sun, 5 May 2024 at 12:46, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I've no problem with having epoll grab a reference, but if we make that
> a universal requirement ->poll() instances can rely upon,

Al, we're note "making that a requirement".

It always has been.

Otgherwise, the docs should have shouted out DAMN LOUDLY that you
can't rely on all the normal refcounting of 'struct file' THAT EVERY
SINGLE NORMAL VFS FUNCTION CAN.

Lookie herte: epoll is unimportant and irrelevant garbage compared to
something fundamental like "read()", and what does read() do?

It does this:

        struct fd f = fdget_pos(fd);

        if (f.file) {
                ...

which is being DAMN CAREFUL to make sure that the file has the proper
refcounts before it then calls "vfs_read()". There's a lot of very
careful and subtle code in fdget_pos() to make this all proper, and
that even if the file is closed by another thread concurrently, we
*always* have a refcount to it, and it's always live over the whole
'vfs_read()' sequence.

'vfs_poll()' is NOT DIFFERENT in this regard. Not at all.

Now, you have two choices that are intellectually honest:

 - admit that epoll() - which is a hell of a lot less important -
should spend a small fraction of that effort on making its vfs_poll()
use sane

 - say that all this fdget_pos() care is uncalled for in the read()
path, and we should make all the filesystem .read() functions be aware
that the file pointer they get may be garbage, and they should use
get_file_active() to make sure every 'struct file *' use they have is
safe?

because if your choice is that "epoll can do whatever the f*&k it
wants", then it's in clear violation of all the effort we go to in a
MUCH MORE IMPORTANT code path, and is clearly not consistent or
logical.

Neither you nor Christian have explained why you think it's ok for
that epoll() garbage to magically violate all our regular rules.

Your claim that those regular rules are some new conditional
requirement that we'd be imposing. NO. They are the rules that
*anybody* who gets a 'struct file *' pointer should always be able to
rely on by default: it's damn well a ref-counted thing, and the caller
holds the refcount.

The exceptional case is exactly the other way around: if you do random
crap with unrefcounted poitners, it's *your* problem, and *you* are
the one who has to be careful. Not some unrelated poor driver that
didn't know about your f*&k-up.

Dammit, epoll is CLEARLY BUGGY. It's passing off random kernel
pointers without holding a refcount to them. THAT'S A BUG.

And fixing that bug is *not* somehow changing existing rules as you
are trying to claim. No. It's just fixing a bug.

So stop claiming that this is some "new requirement". It is absolutely
nothing of the sort. epoll() actively MISUSED file pointer, because
file pointers are fundamentally refcounted (as are pretty much all
sane kernel interfaces).

                Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-05 16:46                   ` Linus Torvalds
  2024-05-05 17:55                     ` [PATCH v2] epoll: be " Linus Torvalds
@ 2024-05-05 20:12                     ` Al Viro
  2024-05-06  8:45                     ` Christian Brauner
  2 siblings, 0 replies; 90+ messages in thread
From: Al Viro @ 2024-05-05 20:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, keescook, axboe, christian.koenig, dri-devel,
	io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sun, May 05, 2024 at 09:46:05AM -0700, Linus Torvalds wrote:

> WHY?
> 
> Why cannot you and Al just admit that the problem is in epoll. Always
> has been, always will be.

Nobody (well, nobody who'd ever read epoll) argues that epoll is not
a problem.

> The fact is, it's not dma-buf that is violating any rules.

Now, that is something I've a trouble with.  Use of get_file() in there
actually looks rather fishy, regardless of epoll.

At the very least it needs a comment discouraging other instances from
blindly copying this.  A reference to struct file pins down more than
driver-internal objects; if nothing else, it pins down a mount and
if you don't have SB_NOUSER on file_inode(file)->i_sb->s_flags, it's
really not a good idea.

What's more, the reason for that get_file() is, AFAICS, that nodes
we put into callback queue for fence(s) in question[*] are embedded
into dmabuf and we don't want them gone before the callbacks have
happened.  Which looks fishy - it would make more sense to cancel
these callbacks and drop the fence(s) in question from ->release().

I've no problem whatsoever with fs/eventpoll.c grabbing/dropping
file reference around vfs_poll() calls.  And I don't believe that
"try to grab" has any place in dma_buf_poll(); it's just that I'm not
happy about get_file() call being there in the first place.

Sure, the call of ->poll() can bloody well lead to references being
grabbed - by the pollwait callback, which the caller of ->poll()
is aware of.  It's ->poll() instance *itself* grabbing such references
with vfs_poll() caller having no idea what's going on that has
potential for being unpleasant.  And we can't constify 'file' argument
of ->poll() because of poll_wait(), so it's hard to catch those who
do that kind of thing; I've explicitly said so upthread, I believe.

But similar calls of get_file() in ->poll() instances (again, not
the ones that are made by pollwait callback) are something to
watch out for and having the caller pin struct file does not solve
the problem.

[*] at most one per direction, and I've no idea whether there can be more
than one signalling fence for given dmabuf) 

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH v2] epoll: be better about file lifetimes
  2024-05-05 20:01                       ` David Laight
@ 2024-05-05 20:16                         ` Linus Torvalds
  0 siblings, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-05 20:16 UTC (permalink / raw)
  To: David Laight
  Cc: axboe, brauner, christian.koenig, dri-devel, io-uring, jack,
	keescook, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs, viro

On Sun, 5 May 2024 at 13:02, David Laight <David.Laight@aculab.com> wrote:
>
> How much is the extra pair of atomics going to hurt performance?
> IIRC a lot of work was done to (try to) make epoll lockless.

If this makes people walk away from epoll, that would be absolutely
*lovely*. Maybe they'd start using io_uring instead, which has had its
problems, but is a lot more capable in the end.

Yes, doing things right is likely more expensive than doing things
wrong. Bugs are cheap. That doesn't make buggy code better.

Epoll really isn't important enough to screw over the VFS subsystem over.

I did point out elsewhere how this could be fixed by epoll() removing
the ep items at a different point:

  https://lore.kernel.org/all/CAHk-=wj6XL9MGCd_nUzRj6SaKeN0TsyTTZDFpGdW34R+zMZaSg@mail.gmail.com/

so if somebody actually wants to fix up epoll properly, that would
probably be great.

In fact, that model would allow epoll() to just keep a proper refcount
as an fd is added to the poll events, and would probably fix a lot of
nastiness. Right now those ep items stay around for basically random
amounts of time.

But maybe there are other ways to fix it. I don't think we have an
actual eventpoll maintainer any more, but what I'm *not* willing to
happen is eventpoll messing up other parts of the kernel. It was
always a ugly performance hack, and was only acceptable as such. "ugly
hack" is ok. "buggy ugly hack" is not.

              Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-05 20:03                     ` Linus Torvalds
@ 2024-05-05 20:30                       ` Al Viro
  2024-05-05 20:53                         ` Linus Torvalds
  0 siblings, 1 reply; 90+ messages in thread
From: Al Viro @ 2024-05-05 20:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, keescook, axboe, christian.koenig, dri-devel,
	io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sun, May 05, 2024 at 01:03:07PM -0700, Linus Torvalds wrote:
> On Sun, 5 May 2024 at 12:46, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > I've no problem with having epoll grab a reference, but if we make that
> > a universal requirement ->poll() instances can rely upon,
> 
> Al, we're note "making that a requirement".
> 
> It always has been.

Argh.   We keep talking past each other.

0.	special-cased ->f_count rule for ->poll() is a wart and it's
better to get rid of it.

1.	fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
git rm taken to it.  Short of that, by all means, let's grab reference
in there around the call of vfs_poll() (see (0)).

2. 	having ->poll() instances grab extra references to file passed
to them is not something that should be encouraged; there's a plenty
of potential problems, and "caller has it pinned, so we are fine with
grabbing extra refs" is nowhere near enough to eliminate those.

3.	dma-buf uses of get_file() are probably safe (epoll shite aside),
but they do look fishy.  That has nothing to do with epoll.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-05 20:30                       ` Al Viro
@ 2024-05-05 20:53                         ` Linus Torvalds
  2024-05-06 12:47                           ` Daniel Vetter
  2024-05-06 16:15                           ` Christian König
  0 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-05 20:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, keescook, axboe, christian.koenig, dri-devel,
	io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sun, 5 May 2024 at 13:30, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> 0.      special-cased ->f_count rule for ->poll() is a wart and it's
> better to get rid of it.
>
> 1.      fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
> git rm taken to it.  Short of that, by all means, let's grab reference
> in there around the call of vfs_poll() (see (0)).

Agreed on 0/1.

> 2.      having ->poll() instances grab extra references to file passed
> to them is not something that should be encouraged; there's a plenty
> of potential problems, and "caller has it pinned, so we are fine with
> grabbing extra refs" is nowhere near enough to eliminate those.

So it's not clear why you hate it so much, since those extra
references are totally normal in all the other VFS paths.

I mean, they are perhaps not the *common* case, but we have a lot of
random get_file() calls sprinkled around in various places when you
end up passing a file descriptor off to some asynchronous operation
thing.

Yeah, I think most of them tend to be special operations (eg the tty
TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl()
is *that* different from vfs_poll. Different operation, not somehow
"one is more special than the other".

cachefiles and backing-file does it for regular IO, and drop it at IO
completion - not that different from what dma-buf does. It's in
->read_iter() rather than ->poll(), but again: different operations,
but not "one of them is somehow fundamentally different".

> 3.      dma-buf uses of get_file() are probably safe (epoll shite aside),
> but they do look fishy.  That has nothing to do with epoll.

Now, what dma-buf basically seems to do is to avoid ref-counting its
own fundamental data structure, and replaces that by refcounting the
'struct file' that *points* to it instead.

And it is a bit odd, but it actually makes some amount of sense,
because then what it passes around is that file pointer (and it allows
passing it around from user space *as* that file).

And honestly, if you look at why it then needs to add its refcount to
it all, it actually makes sense.  dma-bufs have this notion of
"fences" that are basically completion points for the asynchronous
DMA. Doing a "poll()" operation will add a note to the fence to get
that wakeup when it's done.

And yes, logically it takes a ref to the "struct dma_buf", but because
of how the lifetime of the dma_buf is associated with the lifetime of
the 'struct file', that then turns into taking a ref on the file.

Unusual? Yes. But not illogical. Not obviously broken. Tying the
lifetime of the dma_buf to the lifetime of a file that is passed along
makes _sense_ for that use.

I'm sure dma-bufs could add another level of refcounting on the
'struct dma_buf' itself, and not make it be 1:1 with the file, but
it's not clear to me what the advantage would really be, or why it
would be wrong to re-use a refcount that is already there.

                 Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-05 16:46                   ` Linus Torvalds
  2024-05-05 17:55                     ` [PATCH v2] epoll: be " Linus Torvalds
  2024-05-05 20:12                     ` [PATCH] epoll: try to be a _bit_ " Al Viro
@ 2024-05-06  8:45                     ` Christian Brauner
  2024-05-06  9:26                       ` Christian Brauner
  2024-05-07 21:02                       ` David Laight
  2 siblings, 2 replies; 90+ messages in thread
From: Christian Brauner @ 2024-05-06  8:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

> The fact is, it's not dma-buf that is violating any rules. It's epoll.

I agree that epoll() not taking a reference on the file is at least
unexpected and contradicts the usual code patterns for the sake of
performance and that it very likely is the case that most callers of
f_op->poll() don't know this.

Note, I cleary wrote upthread that I'm ok to do it like you suggested
but raised two concerns a) there's currently only one instance of
prolonged @file lifetime in f_op->poll() afaict and b) that there's
possibly going to be some performance impact on epoll().

So it's at least worth discussing what's more important because epoll()
is very widely used and it's not that we haven't favored performance
before.

But you've already said that you aren't concerned with performance on
epoll() upthread. So afaict then there's really not a lot more to
discuss other than take the patch and see whether we get any complaints.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-06  8:45                     ` Christian Brauner
@ 2024-05-06  9:26                       ` Christian Brauner
  2024-05-06 14:19                         ` Christian Brauner
  2024-05-07 21:02                       ` David Laight
  1 sibling, 1 reply; 90+ messages in thread
From: Christian Brauner @ 2024-05-06  9:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Mon, May 06, 2024 at 10:45:35AM +0200, Christian Brauner wrote:
> > The fact is, it's not dma-buf that is violating any rules. It's epoll.
> 
> I agree that epoll() not taking a reference on the file is at least
> unexpected and contradicts the usual code patterns for the sake of
> performance and that it very likely is the case that most callers of
> f_op->poll() don't know this.
> 
> Note, I cleary wrote upthread that I'm ok to do it like you suggested
> but raised two concerns a) there's currently only one instance of
> prolonged @file lifetime in f_op->poll() afaict and b) that there's
> possibly going to be some performance impact on epoll().
> 
> So it's at least worth discussing what's more important because epoll()
> is very widely used and it's not that we haven't favored performance
> before.
> 
> But you've already said that you aren't concerned with performance on
> epoll() upthread. So afaict then there's really not a lot more to
> discuss other than take the patch and see whether we get any complaints.

Two closing thoughts:

(1) I wonder if this won't cause userspace regressions for the semantics
    of epoll because dying files are now silently ignored whereas before
    they'd generated events.

(2) The other part is that this seems to me that epoll() will now
    temporarly pin filesystems opening up the possibility for spurious
    EBUSY errors.

    If you register a file descriptor in an epoll instance and then
    close it and umount the filesystem but epoll managed to do an fget()
    on that fd before that close() call then epoll will pin that
    filesystem.

    If the f_op->poll() method does something that can take a while
    (blocks on a shared mutex of that subsystem) that umount is very
    likely going to return EBUSY suddenly.

    Afaict, before that this wouldn't have been an issue at all and is
    likely more serious than performance.

    (One option would be to only do epi_fget() for stuff like
    dma-buf that's never unmounted. That'll cover nearly every
    driver out there. Only "real" filesystems would have to contend with
    @file count going to zero but honestly they also deal with dentry
    lookup under RCU which is way more adventurous than this.)

    Maybe I'm barking up the wrong tree though.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 21:53                     ` Al Viro
@ 2024-05-06 12:23                       ` Daniel Vetter
  0 siblings, 0 replies; 90+ messages in thread
From: Daniel Vetter @ 2024-05-06 12:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Kees Cook, Jens Axboe, Bui Quang Minh,
	Christian Brauner, syzbot, io-uring, jack, linux-fsdevel,
	linux-kernel, syzkaller-bugs, Sumit Semwal, Christian König,
	linux-media, dri-devel, linaro-mm-sig, Laura Abbott

On Fri, May 03, 2024 at 10:53:03PM +0100, Al Viro wrote:
> On Fri, May 03, 2024 at 02:42:22PM -0700, Linus Torvalds wrote:
> > On Fri, 3 May 2024 at 14:36, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > ... the last part is no-go - poll_wait() must be able to grab a reference
> > > (well, the callback in it must)
> > 
> > Yeah. I really think that *poll* itself is doing everything right. It
> > knows that it's called with a file pointer with a reference, and it
> > adds its own references as needed.
> 
> Not really.  Note that select's __pollwait() does *NOT* leave a reference
> at the mercy of driver - it's stuck into poll_table_entry->filp and
> the poll_freewait() knows how to take those out.
> 
> 
> dmabuf does something very different - it grabs the damn thing into
> its private data structures and for all we know it could keep it for
> a few hours, until some even materializes.

dma_fence must complete in reasonable amount of time, where "reasonable"
is roughly in line with other i/o (including the option that there's
timeouts if the hw's gone busted).

So definitely not hours (aside from driver bugs when things go really
wrong ofc), but more like a few seconds in a worst case scenario.
-Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-03 23:41                     ` Linus Torvalds
  2024-05-04  9:19                       ` Christian Brauner
@ 2024-05-06 12:37                       ` Daniel Vetter
  1 sibling, 0 replies; 90+ messages in thread
From: Daniel Vetter @ 2024-05-06 12:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Al Viro, axboe, brauner, christian.koenig, dri-devel,
	io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Fri, May 03, 2024 at 04:41:19PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 16:23, Kees Cook <keescook@chromium.org> wrote:
> >
> > static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
> > {
> >         return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
> > }
> >
> > If we end up adding epi_fget(), we'll have 2 cases of using
> > "atomic_long_inc_not_zero" for f_count. Do we need some kind of blessed
> > helper to live in file.h or something, with appropriate comments?
> 
> I wonder if we could try to abstract this out a bit more.
> 
> These games with non-ref-counted file structures *feel* a bit like the
> games we play with non-ref-counted (aka "stashed") 'struct dentry'
> that got fairly recently cleaned up with path_from_stashed() when both
> nsfs and pidfs started doing the same thing.
> 
> I'm not loving the TTM use of this thing, but at least the locking and
> logic feels a lot more straightforward (ie the
> atomic_long_inc_not_zero() here is clealy under the 'prime->mutex'
> lock

The one the vmgfx isn't really needed (I think at least), because all
other drivers that use gem or ttm use the dma_buf export cache in
drm/drm_prime.c, which is protected by a bog standard mutex.

vmwgfx is unfortunately special in a lot of ways due to somewhat parallel
dev history. So there might be an uapi reason why the weak reference is
required. I suspect because vmwgfx is reinventing a lot of its own wheels
it can't play the same tricks as gem_prime.c, which hooks into a few core
drm cleanup/release functions.

tldr; drm really has no architectural need for a get_file_unless_doomed,
and I certainly don't want to spread it it further than the vmwgfx
historical special case that was added in 2013.
-Sima

> IOW, the tty use looks correct to me, and it has fairly simple locking
> and is just catching the the race between 'fput()' decrementing the
> refcount and and 'file->f_op->release()' doing the actual release.
> 
> You are right that it's similar to the epoll thing in that sense, it
> just looks a _lot_ more straightforward to me (and, unlike epoll,
> doesn't look actively buggy right now).
> 
> Could we abstract out this kind of "stashed file pointer" so that we'd
> have a *common* form for this? Not just the inc_not_zero part, but the
> locking rule too?
> 
>               Linus

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-05 20:53                         ` Linus Torvalds
@ 2024-05-06 12:47                           ` Daniel Vetter
  2024-05-06 14:46                             ` Christian Brauner
  2024-05-06 16:15                           ` Christian König
  1 sibling, 1 reply; 90+ messages in thread
From: Daniel Vetter @ 2024-05-06 12:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Christian Brauner, keescook, axboe, christian.koenig,
	dri-devel, io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Sun, May 05, 2024 at 01:53:48PM -0700, Linus Torvalds wrote:
> On Sun, 5 May 2024 at 13:30, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > 0.      special-cased ->f_count rule for ->poll() is a wart and it's
> > better to get rid of it.
> >
> > 1.      fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
> > git rm taken to it.  Short of that, by all means, let's grab reference
> > in there around the call of vfs_poll() (see (0)).
> 
> Agreed on 0/1.
> 
> > 2.      having ->poll() instances grab extra references to file passed
> > to them is not something that should be encouraged; there's a plenty
> > of potential problems, and "caller has it pinned, so we are fine with
> > grabbing extra refs" is nowhere near enough to eliminate those.
> 
> So it's not clear why you hate it so much, since those extra
> references are totally normal in all the other VFS paths.
> 
> I mean, they are perhaps not the *common* case, but we have a lot of
> random get_file() calls sprinkled around in various places when you
> end up passing a file descriptor off to some asynchronous operation
> thing.
> 
> Yeah, I think most of them tend to be special operations (eg the tty
> TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl()
> is *that* different from vfs_poll. Different operation, not somehow
> "one is more special than the other".
> 
> cachefiles and backing-file does it for regular IO, and drop it at IO
> completion - not that different from what dma-buf does. It's in
> ->read_iter() rather than ->poll(), but again: different operations,
> but not "one of them is somehow fundamentally different".
> 
> > 3.      dma-buf uses of get_file() are probably safe (epoll shite aside),
> > but they do look fishy.  That has nothing to do with epoll.
> 
> Now, what dma-buf basically seems to do is to avoid ref-counting its
> own fundamental data structure, and replaces that by refcounting the
> 'struct file' that *points* to it instead.
> 
> And it is a bit odd, but it actually makes some amount of sense,
> because then what it passes around is that file pointer (and it allows
> passing it around from user space *as* that file).
> 
> And honestly, if you look at why it then needs to add its refcount to
> it all, it actually makes sense.  dma-bufs have this notion of
> "fences" that are basically completion points for the asynchronous
> DMA. Doing a "poll()" operation will add a note to the fence to get
> that wakeup when it's done.
> 
> And yes, logically it takes a ref to the "struct dma_buf", but because
> of how the lifetime of the dma_buf is associated with the lifetime of
> the 'struct file', that then turns into taking a ref on the file.
> 
> Unusual? Yes. But not illogical. Not obviously broken. Tying the
> lifetime of the dma_buf to the lifetime of a file that is passed along
> makes _sense_ for that use.
> 
> I'm sure dma-bufs could add another level of refcounting on the
> 'struct dma_buf' itself, and not make it be 1:1 with the file, but
> it's not clear to me what the advantage would really be, or why it
> would be wrong to re-use a refcount that is already there.

So there is generally another refcount, because dma_buf is just the
cross-driver interface to some kind of real underlying buffer object from
the various graphics related subsystems we have.

And since it's a pure file based api thing that ceases to serve any
function once the fd/file is gone we tied all the dma_buf refcounting to
the refcount struct file already maintains. But the underlying buffer
object can easily outlive the dma_buf, and over the lifetime of an
underlying buffer object you might actually end up creating different
dma_buf api wrappers for it (but at least in drm we guarantee there's at
most one, hence why vmwgfx does the atomic_inc_unless_zero trick, which I
don't particularly like and isn't really needed).

But we could add another refcount, it just means we have 3 of those then
when only really 2 are needed.

Also maybe here two: dma_fence are bounded like other disk i/o (including
the option of timeouts if things go very wrong), so it's very much not
forever but at most a few seconds worst case (shit hw/driver excluded, as
usual).
-Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-06  9:26                       ` Christian Brauner
@ 2024-05-06 14:19                         ` Christian Brauner
  0 siblings, 0 replies; 90+ messages in thread
From: Christian Brauner @ 2024-05-06 14:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Mon, May 06, 2024 at 11:27:04AM +0200, Christian Brauner wrote:
> On Mon, May 06, 2024 at 10:45:35AM +0200, Christian Brauner wrote:
> > > The fact is, it's not dma-buf that is violating any rules. It's epoll.
> > 
> > I agree that epoll() not taking a reference on the file is at least
> > unexpected and contradicts the usual code patterns for the sake of
> > performance and that it very likely is the case that most callers of
> > f_op->poll() don't know this.
> > 
> > Note, I cleary wrote upthread that I'm ok to do it like you suggested
> > but raised two concerns a) there's currently only one instance of
> > prolonged @file lifetime in f_op->poll() afaict and b) that there's
> > possibly going to be some performance impact on epoll().
> > 
> > So it's at least worth discussing what's more important because epoll()
> > is very widely used and it's not that we haven't favored performance
> > before.
> > 
> > But you've already said that you aren't concerned with performance on
> > epoll() upthread. So afaict then there's really not a lot more to
> > discuss other than take the patch and see whether we get any complaints.
> 
> Two closing thoughts:
> 
> (1) I wonder if this won't cause userspace regressions for the semantics
>     of epoll because dying files are now silently ignored whereas before
>     they'd generated events.
> 
> (2) The other part is that this seems to me that epoll() will now
>     temporarly pin filesystems opening up the possibility for spurious
>     EBUSY errors.
> 
>     If you register a file descriptor in an epoll instance and then
>     close it and umount the filesystem but epoll managed to do an fget()
>     on that fd before that close() call then epoll will pin that
>     filesystem.
> 
>     If the f_op->poll() method does something that can take a while
>     (blocks on a shared mutex of that subsystem) that umount is very
>     likely going to return EBUSY suddenly.
> 
>     Afaict, before that this wouldn't have been an issue at all and is
>     likely more serious than performance.
> 
>     (One option would be to only do epi_fget() for stuff like
>     dma-buf that's never unmounted. That'll cover nearly every
>     driver out there. Only "real" filesystems would have to contend with
>     @file count going to zero but honestly they also deal with dentry
>     lookup under RCU which is way more adventurous than this.)
> 
>     Maybe I'm barking up the wrong tree though.

Sorry, had to step out for an appointment.

Under the assumption that I'm not entirely off with this - and I really
could be ofc - then one possibility would be that we enable persistence
of files from f_op->poll() for SB_NOUSER filesystems.

That'll catch everything that's relying on anonymous inodes (drm and all
drivers) and init_pseudo() so everything that isn't actually unmountable
(pipefs, pidfs, sockfs, etc.).

So something like the _completely untested_ diff on top of your proposal
above:

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a3f0f868adc4..95968a462544 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1018,8 +1018,24 @@ static struct file *epi_fget(const struct epitem *epi)
 static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
                                 int depth)
 {
-       struct file *file = epi_fget(epi);
+       struct file *file = epi->ffd.file;
        __poll_t res;
+       bool unrefd = false;
+
+       /*
+        * Taking a reference for anything that isn't mountable is fine
+        * because we don't have to worry about spurious EBUSY warnings
+        * from umount().
+        *
+        * File count might go to zero in f_op->poll() for mountable
+        * filesystems.
+        */
+       if (file->f_inode->i_sb->s_flags & SB_NOUSER) {
+               unrefd = true;
+               file = epi_fget(epi);
+       } else if (file_count(file) == 0) {
+               file = NULL;
+       }

        /*
         * We could return EPOLLERR | EPOLLHUP or something,
@@ -1034,7 +1050,9 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
                res = vfs_poll(file, pt);
        else
                res = __ep_eventpoll_poll(file, pt, depth);
-       fput(file);
+
+       if (unrefd)
+               fput(file);
        return res & epi->event.events;
 }

Basically, my worry is that we end up with really annoying to debug
EBUSYs caused by epoll(). I'd really like to avoid that. But again, I
might be wrong and this isn't an issue.

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-04 18:20               ` Linus Torvalds
@ 2024-05-06 14:29                 ` Christian König
  2024-05-07 11:02                   ` Daniel Vetter
  2024-05-08 10:08                   ` Christian Brauner
  0 siblings, 2 replies; 90+ messages in thread
From: Christian König @ 2024-05-06 14:29 UTC (permalink / raw)
  To: Linus Torvalds, Christian Brauner
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

Am 04.05.24 um 20:20 schrieb Linus Torvalds:
> On Sat, 4 May 2024 at 08:32, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> Lookie here, the fundamental issue is that epoll can call '->poll()'
>> on a file descriptor that is being closed concurrently.
> Thinking some more about this, and replying to myself...
>
> Actually, I wonder if we could *really* fix this by simply moving the
> eventpoll_release() to where it really belongs.
>
> If we did it in file_close_fd_locked(),  it would actually make a
> *lot* more sense. Particularly since eventpoll actually uses this:
>
>      struct epoll_filefd {
>          struct file *file;
>          int fd;
>      } __packed;
>
> ie it doesn't just use the 'struct file *', it uses the 'fd' itself
> (for ep_find()).
>
> (Strictly speaking, it should also have a pointer to the 'struct
> files_struct' to make the 'int fd' be meaningful).

While I completely agree on this I unfortunately have to ruin the idea.

Before we had KCMP some people relied on the strange behavior of 
eventpoll to compare struct files when the fd is the same.

I just recently suggested that solution to somebody at AMD as a 
workaround when KCMP is disabled because of security hardening and I'm 
pretty sure I've seen it somewhere else as well.

So when we change that it would break (undocumented?) UAPI behavior.

Regards,
Christian.

>
> IOW, eventpoll already considers the file _descriptor_ relevant, not
> just the file pointer, and that's destroyed at *close* time, not at
> 'fput()' time.
>
> Yeah, yeah, the locking situation in file_close_fd_locked() is a bit
> inconvenient, but if we can solve that, it would solve the problem in
> a fundamentally different way: remove the ep iterm before the
> file->f_count has actually been decremented, so the whole "race with
> fput()" would just go away entirely.
>
> I dunno. I think that would be the right thing to do, but I wouldn't
> be surprised if some disgusting eventpoll user then might depend on
> the current situation where the eventpoll thing stays around even
> after the close() if you have another copy of the file open.
>
>               Linus
> _______________________________________________
> Linaro-mm-sig mailing list -- linaro-mm-sig@lists.linaro.org
> To unsubscribe send an email to linaro-mm-sig-leave@lists.linaro.org


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-06 12:47                           ` Daniel Vetter
@ 2024-05-06 14:46                             ` Christian Brauner
  2024-05-07 10:58                               ` Daniel Vetter
  0 siblings, 1 reply; 90+ messages in thread
From: Christian Brauner @ 2024-05-06 14:46 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Linus Torvalds, Al Viro, keescook, axboe, christian.koenig,
	dri-devel, io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Mon, May 06, 2024 at 02:47:23PM +0200, Daniel Vetter wrote:
> On Sun, May 05, 2024 at 01:53:48PM -0700, Linus Torvalds wrote:
> > On Sun, 5 May 2024 at 13:30, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > 0.      special-cased ->f_count rule for ->poll() is a wart and it's
> > > better to get rid of it.
> > >
> > > 1.      fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
> > > git rm taken to it.  Short of that, by all means, let's grab reference
> > > in there around the call of vfs_poll() (see (0)).
> > 
> > Agreed on 0/1.
> > 
> > > 2.      having ->poll() instances grab extra references to file passed
> > > to them is not something that should be encouraged; there's a plenty
> > > of potential problems, and "caller has it pinned, so we are fine with
> > > grabbing extra refs" is nowhere near enough to eliminate those.
> > 
> > So it's not clear why you hate it so much, since those extra
> > references are totally normal in all the other VFS paths.
> > 
> > I mean, they are perhaps not the *common* case, but we have a lot of
> > random get_file() calls sprinkled around in various places when you
> > end up passing a file descriptor off to some asynchronous operation
> > thing.
> > 
> > Yeah, I think most of them tend to be special operations (eg the tty
> > TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl()
> > is *that* different from vfs_poll. Different operation, not somehow
> > "one is more special than the other".
> > 
> > cachefiles and backing-file does it for regular IO, and drop it at IO
> > completion - not that different from what dma-buf does. It's in
> > ->read_iter() rather than ->poll(), but again: different operations,
> > but not "one of them is somehow fundamentally different".
> > 
> > > 3.      dma-buf uses of get_file() are probably safe (epoll shite aside),
> > > but they do look fishy.  That has nothing to do with epoll.
> > 
> > Now, what dma-buf basically seems to do is to avoid ref-counting its
> > own fundamental data structure, and replaces that by refcounting the
> > 'struct file' that *points* to it instead.
> > 
> > And it is a bit odd, but it actually makes some amount of sense,
> > because then what it passes around is that file pointer (and it allows
> > passing it around from user space *as* that file).
> > 
> > And honestly, if you look at why it then needs to add its refcount to
> > it all, it actually makes sense.  dma-bufs have this notion of
> > "fences" that are basically completion points for the asynchronous
> > DMA. Doing a "poll()" operation will add a note to the fence to get
> > that wakeup when it's done.
> > 
> > And yes, logically it takes a ref to the "struct dma_buf", but because
> > of how the lifetime of the dma_buf is associated with the lifetime of
> > the 'struct file', that then turns into taking a ref on the file.
> > 
> > Unusual? Yes. But not illogical. Not obviously broken. Tying the
> > lifetime of the dma_buf to the lifetime of a file that is passed along
> > makes _sense_ for that use.
> > 
> > I'm sure dma-bufs could add another level of refcounting on the
> > 'struct dma_buf' itself, and not make it be 1:1 with the file, but
> > it's not clear to me what the advantage would really be, or why it
> > would be wrong to re-use a refcount that is already there.
> 
> So there is generally another refcount, because dma_buf is just the
> cross-driver interface to some kind of real underlying buffer object from
> the various graphics related subsystems we have.
> 
> And since it's a pure file based api thing that ceases to serve any
> function once the fd/file is gone we tied all the dma_buf refcounting to
> the refcount struct file already maintains. But the underlying buffer
> object can easily outlive the dma_buf, and over the lifetime of an
> underlying buffer object you might actually end up creating different
> dma_buf api wrappers for it (but at least in drm we guarantee there's at
> most one, hence why vmwgfx does the atomic_inc_unless_zero trick, which I
> don't particularly like and isn't really needed).
> 
> But we could add another refcount, it just means we have 3 of those then
> when only really 2 are needed.

Fwiw, the TTM thing described upthread and in the other thread really
tries hard to work around the dma_buf == file lifetime choice by hooking
into the dma-buf specific release function so it can access the dmabuf
and then the file. All that seems like a pretty error prone thing to me.
So a separate refcount for dma_buf wouldn't be the worst as that would
allow that TTM thing to benefit and remove that nasty hacking into your
generic dma_buf ops. But maybe I'm the only one who sees it that way and
I'm certainly not familiar enough with dma-buf.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-05 20:53                         ` Linus Torvalds
  2024-05-06 12:47                           ` Daniel Vetter
@ 2024-05-06 16:15                           ` Christian König
  1 sibling, 0 replies; 90+ messages in thread
From: Christian König @ 2024-05-06 16:15 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: Christian Brauner, keescook, axboe, dri-devel, io-uring, jack,
	laura, linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs

Am 05.05.24 um 22:53 schrieb Linus Torvalds:
> On Sun, 5 May 2024 at 13:30, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> 0.      special-cased ->f_count rule for ->poll() is a wart and it's
>> better to get rid of it.
>>
>> 1.      fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
>> git rm taken to it.  Short of that, by all means, let's grab reference
>> in there around the call of vfs_poll() (see (0)).
> Agreed on 0/1.
>
>> 2.      having ->poll() instances grab extra references to file passed
>> to them is not something that should be encouraged; there's a plenty
>> of potential problems, and "caller has it pinned, so we are fine with
>> grabbing extra refs" is nowhere near enough to eliminate those.
> So it's not clear why you hate it so much, since those extra
> references are totally normal in all the other VFS paths.

Sorry to maybe jumping into the middle of the discussion, but for 
DMA-buf the behavior Al doesn't want is actually desired.

And I totally understand why Al is against it for file system based 
files, but for this case it's completely intentional.

Removing the callback on close is what we used to do a long time ago, 
but that turned out into a locking nightmare because it meant that we 
need to be able to wait for driver specific locks from whatever non 
interrupt context fput() is called from.

Regards,
Christian.

>
> I mean, they are perhaps not the *common* case, but we have a lot of
> random get_file() calls sprinkled around in various places when you
> end up passing a file descriptor off to some asynchronous operation
> thing.
>
> Yeah, I think most of them tend to be special operations (eg the tty
> TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl()
> is *that* different from vfs_poll. Different operation, not somehow
> "one is more special than the other".
>
> cachefiles and backing-file does it for regular IO, and drop it at IO
> completion - not that different from what dma-buf does. It's in
> ->read_iter() rather than ->poll(), but again: different operations,
> but not "one of them is somehow fundamentally different".
>
>> 3.      dma-buf uses of get_file() are probably safe (epoll shite aside),
>> but they do look fishy.  That has nothing to do with epoll.
> Now, what dma-buf basically seems to do is to avoid ref-counting its
> own fundamental data structure, and replaces that by refcounting the
> 'struct file' that *points* to it instead.
>
> And it is a bit odd, but it actually makes some amount of sense,
> because then what it passes around is that file pointer (and it allows
> passing it around from user space *as* that file).
>
> And honestly, if you look at why it then needs to add its refcount to
> it all, it actually makes sense.  dma-bufs have this notion of
> "fences" that are basically completion points for the asynchronous
> DMA. Doing a "poll()" operation will add a note to the fence to get
> that wakeup when it's done.
>
> And yes, logically it takes a ref to the "struct dma_buf", but because
> of how the lifetime of the dma_buf is associated with the lifetime of
> the 'struct file', that then turns into taking a ref on the file.
>
> Unusual? Yes. But not illogical. Not obviously broken. Tying the
> lifetime of the dma_buf to the lifetime of a file that is passed along
> makes _sense_ for that use.
>
> I'm sure dma-bufs could add another level of refcounting on the
> 'struct dma_buf' itself, and not make it be 1:1 with the file, but
> it's not clear to me what the advantage would really be, or why it
> would be wrong to re-use a refcount that is already there.
>
>                   Linus


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-03 21:24                 ` Linus Torvalds
  2024-05-03 21:30                   ` Al Viro
@ 2024-05-06 17:46                   ` Stefan Metzmacher
  2024-05-06 18:17                     ` Linus Torvalds
  1 sibling, 1 reply; 90+ messages in thread
From: Stefan Metzmacher @ 2024-05-06 17:46 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: Kees Cook, Jens Axboe, Bui Quang Minh, Christian Brauner, syzbot,
	io-uring, jack, linux-fsdevel, linux-kernel, syzkaller-bugs,
	Sumit Semwal, Christian König, linux-media, dri-devel,
	linaro-mm-sig, Laura Abbott

Am 03.05.24 um 23:24 schrieb Linus Torvalds:
> On Fri, 3 May 2024 at 14:11, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> What we need is
>>          * promise that ep_item_poll() won't happen after eventpoll_release_file().
>> AFAICS, we do have that.
>>          * ->poll() not playing silly buggers.
> 
> No. That is not enough at all.
> 
> Because even with perfectly normal "->poll()", and even with the
> ep_item_poll() happening *before* eventpoll_release_file(), you have
> this trivial race:
> 
>    ep_item_poll()
>       ->poll()
> 
> and *between* those two operations, another CPU does "close()", and
> that causes eventpoll_release_file() to be called, and now f_count
> goes down to zero while ->poll() is running.
> 
> So you do need to increment the file count around the ->poll() call, I feel.
> 
> Or, alternatively, you'd need to serialize with
> eventpoll_release_file(), but that would need to be some sleeping lock
> held over the ->poll() call.
> 
>> As it is, dma_buf ->poll() is very suspicious regardless of that
>> mess - it can grab reference to file for unspecified interval.
> 
> I think that's actually much preferable to what epoll does, which is
> to keep using files without having reference counts to them (and then
> relying on magically not racing with eventpoll_release_file().

I think it's a very important detail that epoll does not take
real references. Otherwise an application level 'close()' on a socket
would not trigger a tcp disconnect, when an fd is still registered with
epoll.

I noticed that some parts of Samba currently rely on this when I tried
to convert tevent from epoll to IORING_OP_POLL_ADD (which takes a longer term reference)

And I guess there will be other applications also relying on the current epoll
behavior. That a closed fs automatically removes itself from epoll.

A short term reference just around ->poll() might be fine,
but please no reference via EPOLL_CTL_ADD.

Changing that can cause security problems in user space.

I haven't followed all details of this thread,
please ignore me if that's all clear already :-)

Thanks!
metze



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-06 17:46                   ` Stefan Metzmacher
@ 2024-05-06 18:17                     ` Linus Torvalds
  2024-05-08  8:47                       ` David Laight
  0 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2024-05-06 18:17 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Al Viro, Kees Cook, Jens Axboe, Bui Quang Minh,
	Christian Brauner, syzbot, io-uring, jack, linux-fsdevel,
	linux-kernel, syzkaller-bugs, Sumit Semwal, Christian König,
	linux-media, dri-devel, linaro-mm-sig, Laura Abbott

On Mon, 6 May 2024 at 10:46, Stefan Metzmacher <metze@samba.org> wrote:
>
> I think it's a very important detail that epoll does not take
> real references. Otherwise an application level 'close()' on a socket
> would not trigger a tcp disconnect, when an fd is still registered with
> epoll.

Yes, exactly.

epoll() ends up actually wanting the lifetime of the ep item be the
lifetime of the file _descriptor_, not the lifetime of the file
itself.

We approximate that - badly - with epoll not taking a reference on the
file pointer, and then at final fput() it tears things down.

But that has two real issues, and one of them is that "oh, now epoll
has file pointers that are actually dead" that caused this thread.

The other issue is that "approximates" thing, where it means that
duplicating the file pointer (dup*() and fork() end unix socket file
sending etc) will not mean that the epoll ref is also out of sync with
the lifetime of the file descriptor.

That's why I suggested that "clean up epoll references at
file_close_fd() time instead:

  https://lore.kernel.org/all/CAHk-=wj6XL9MGCd_nUzRj6SaKeN0TsyTTZDFpGdW34R+zMZaSg@mail.gmail.com/

because it would actually really *fix* the lifetime issue of ep items.

In the process, it would make it possible to actually take a f_count
reference at EPOLL_CTL_ADD time, since now the lifetime of the EP
wouldn't be tied to the lifetime of the 'struct file *' pointer, it
would be properly tied to the lifetime of the actual file descriptor
that you are adding.

So it would be a huge conceptual cleanup too.

(Of course - at that point EPOLL_CTL_ADD still doesn't actually _need_
a reference to the file, since the file being open in itself is
already that reference - but the point here being that there would
*be* a reference that the epoll code would effectively have, and you'd
never be in the situation we were in where there would be stale "dead"
file pointers that just haven't gone through the cleanup yet).

But I'd rather not touch the epoll code more than I have to.

Which is why I applied the minimal patch for just "refcount over
vfs_poll()", and am just mentioning my suggestion in the hope that
some eager beaver would like to see how painful it would do to make
the bigger surgery...

                   Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-06 14:46                             ` Christian Brauner
@ 2024-05-07 10:58                               ` Daniel Vetter
  0 siblings, 0 replies; 90+ messages in thread
From: Daniel Vetter @ 2024-05-07 10:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Daniel Vetter, Linus Torvalds, Al Viro, keescook, axboe,
	christian.koenig, dri-devel, io-uring, jack, laura,
	linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs

On Mon, May 06, 2024 at 04:46:54PM +0200, Christian Brauner wrote:
> On Mon, May 06, 2024 at 02:47:23PM +0200, Daniel Vetter wrote:
> > On Sun, May 05, 2024 at 01:53:48PM -0700, Linus Torvalds wrote:
> > > On Sun, 5 May 2024 at 13:30, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > 0.      special-cased ->f_count rule for ->poll() is a wart and it's
> > > > better to get rid of it.
> > > >
> > > > 1.      fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
> > > > git rm taken to it.  Short of that, by all means, let's grab reference
> > > > in there around the call of vfs_poll() (see (0)).
> > > 
> > > Agreed on 0/1.
> > > 
> > > > 2.      having ->poll() instances grab extra references to file passed
> > > > to them is not something that should be encouraged; there's a plenty
> > > > of potential problems, and "caller has it pinned, so we are fine with
> > > > grabbing extra refs" is nowhere near enough to eliminate those.
> > > 
> > > So it's not clear why you hate it so much, since those extra
> > > references are totally normal in all the other VFS paths.
> > > 
> > > I mean, they are perhaps not the *common* case, but we have a lot of
> > > random get_file() calls sprinkled around in various places when you
> > > end up passing a file descriptor off to some asynchronous operation
> > > thing.
> > > 
> > > Yeah, I think most of them tend to be special operations (eg the tty
> > > TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl()
> > > is *that* different from vfs_poll. Different operation, not somehow
> > > "one is more special than the other".
> > > 
> > > cachefiles and backing-file does it for regular IO, and drop it at IO
> > > completion - not that different from what dma-buf does. It's in
> > > ->read_iter() rather than ->poll(), but again: different operations,
> > > but not "one of them is somehow fundamentally different".
> > > 
> > > > 3.      dma-buf uses of get_file() are probably safe (epoll shite aside),
> > > > but they do look fishy.  That has nothing to do with epoll.
> > > 
> > > Now, what dma-buf basically seems to do is to avoid ref-counting its
> > > own fundamental data structure, and replaces that by refcounting the
> > > 'struct file' that *points* to it instead.
> > > 
> > > And it is a bit odd, but it actually makes some amount of sense,
> > > because then what it passes around is that file pointer (and it allows
> > > passing it around from user space *as* that file).
> > > 
> > > And honestly, if you look at why it then needs to add its refcount to
> > > it all, it actually makes sense.  dma-bufs have this notion of
> > > "fences" that are basically completion points for the asynchronous
> > > DMA. Doing a "poll()" operation will add a note to the fence to get
> > > that wakeup when it's done.
> > > 
> > > And yes, logically it takes a ref to the "struct dma_buf", but because
> > > of how the lifetime of the dma_buf is associated with the lifetime of
> > > the 'struct file', that then turns into taking a ref on the file.
> > > 
> > > Unusual? Yes. But not illogical. Not obviously broken. Tying the
> > > lifetime of the dma_buf to the lifetime of a file that is passed along
> > > makes _sense_ for that use.
> > > 
> > > I'm sure dma-bufs could add another level of refcounting on the
> > > 'struct dma_buf' itself, and not make it be 1:1 with the file, but
> > > it's not clear to me what the advantage would really be, or why it
> > > would be wrong to re-use a refcount that is already there.
> > 
> > So there is generally another refcount, because dma_buf is just the
> > cross-driver interface to some kind of real underlying buffer object from
> > the various graphics related subsystems we have.
> > 
> > And since it's a pure file based api thing that ceases to serve any
> > function once the fd/file is gone we tied all the dma_buf refcounting to
> > the refcount struct file already maintains. But the underlying buffer
> > object can easily outlive the dma_buf, and over the lifetime of an
> > underlying buffer object you might actually end up creating different
> > dma_buf api wrappers for it (but at least in drm we guarantee there's at
> > most one, hence why vmwgfx does the atomic_inc_unless_zero trick, which I
> > don't particularly like and isn't really needed).
> > 
> > But we could add another refcount, it just means we have 3 of those then
> > when only really 2 are needed.
> 
> Fwiw, the TTM thing described upthread and in the other thread really
> tries hard to work around the dma_buf == file lifetime choice by hooking
> into the dma-buf specific release function so it can access the dmabuf
> and then the file. All that seems like a pretty error prone thing to me.
> So a separate refcount for dma_buf wouldn't be the worst as that would
> allow that TTM thing to benefit and remove that nasty hacking into your
> generic dma_buf ops. But maybe I'm the only one who sees it that way and
> I'm certainly not familiar enough with dma-buf.

So the tricky part is the uniqueness requirement drm has for buffer
objects (and hence dma_buf wrappers), which together with the refcounting
makes dma_buf quite tricky:

- dma_buf needs to hold some reference onto the underlying object, or it
  wont work

- but you're not allowed to just create a new dma_buf every time someone
  exports an underlying object to a dma_buf, because that would break the
  uniqueness requirement. Which means the underlying object must also hold
  some kind of reference to its dma_buf, if it exists. So that on buffer
  export it can just increment the refcount for that and return it,
  instead of creating a new one.

Which would be a reference loop that never gets freed, so you need one of
two tricks:

- Either a weak reference, i.e. just a pointer plus
  atomic_inc_unless_zero trickery like ttm does. Splitting that refcount
  into more refcounts doesn't fundamentally solve the problem, it just
  adds even more refcounts.

- Or you do what all other drm drivers do in drm_prime.c do and careful
  clean up the dma_buf re-export cache when the userspace references (but
  not all kernel internal ones) disappear, to unbreak that reference loop.
  This needs to be done with extreme care and took a lot of screaming to
  get right, because if you have a race you might end up breaking the
  uniqueness requirement and have two dma_buf floating around.

So neither of these solutions really are simple, but I agree with you that
the atomic_inc_unless_zero trickery is less simple. It's definitely not
cool that it's done by digging around in struct file internals.
-Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-06 14:29                 ` [Linaro-mm-sig] " Christian König
@ 2024-05-07 11:02                   ` Daniel Vetter
  2024-05-07 16:46                     ` Linus Torvalds
  2024-05-08 10:08                   ` Christian Brauner
  1 sibling, 1 reply; 90+ messages in thread
From: Daniel Vetter @ 2024-05-07 11:02 UTC (permalink / raw)
  To: Christian König
  Cc: Linus Torvalds, Christian Brauner, Al Viro, keescook, axboe,
	christian.koenig, dri-devel, io-uring, jack, laura,
	linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs

On Mon, May 06, 2024 at 04:29:44PM +0200, Christian König wrote:
> Am 04.05.24 um 20:20 schrieb Linus Torvalds:
> > On Sat, 4 May 2024 at 08:32, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > Lookie here, the fundamental issue is that epoll can call '->poll()'
> > > on a file descriptor that is being closed concurrently.
> > Thinking some more about this, and replying to myself...
> > 
> > Actually, I wonder if we could *really* fix this by simply moving the
> > eventpoll_release() to where it really belongs.
> > 
> > If we did it in file_close_fd_locked(),  it would actually make a
> > *lot* more sense. Particularly since eventpoll actually uses this:
> > 
> >      struct epoll_filefd {
> >          struct file *file;
> >          int fd;
> >      } __packed;
> > 
> > ie it doesn't just use the 'struct file *', it uses the 'fd' itself
> > (for ep_find()).
> > 
> > (Strictly speaking, it should also have a pointer to the 'struct
> > files_struct' to make the 'int fd' be meaningful).
> 
> While I completely agree on this I unfortunately have to ruin the idea.
> 
> Before we had KCMP some people relied on the strange behavior of eventpoll
> to compare struct files when the fd is the same.
> 
> I just recently suggested that solution to somebody at AMD as a workaround
> when KCMP is disabled because of security hardening and I'm pretty sure I've
> seen it somewhere else as well.
> 
> So when we change that it would break (undocumented?) UAPI behavior.

Uh extremely aside, but doesn't this mean we should just enable kcmp on
files unconditionally, since there's an alternative? Or a least everywhere
CONFIG_EPOLL is enabled?

It's really annoying that on some distros/builds we don't have that, and
for gpu driver stack reasons we _really_ need to know whether a fd is the
same as another, due to some messy uniqueness requirements on buffer
objects various drivers have.
-Sima

> 
> Regards,
> Christian.
> 
> > 
> > IOW, eventpoll already considers the file _descriptor_ relevant, not
> > just the file pointer, and that's destroyed at *close* time, not at
> > 'fput()' time.
> > 
> > Yeah, yeah, the locking situation in file_close_fd_locked() is a bit
> > inconvenient, but if we can solve that, it would solve the problem in
> > a fundamentally different way: remove the ep iterm before the
> > file->f_count has actually been decremented, so the whole "race with
> > fput()" would just go away entirely.
> > 
> > I dunno. I think that would be the right thing to do, but I wouldn't
> > be surprised if some disgusting eventpoll user then might depend on
> > the current situation where the eventpoll thing stays around even
> > after the close() if you have another copy of the file open.
> > 
> >               Linus
> > _______________________________________________
> > Linaro-mm-sig mailing list -- linaro-mm-sig@lists.linaro.org
> > To unsubscribe send an email to linaro-mm-sig-leave@lists.linaro.org
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-07 11:02                   ` Daniel Vetter
@ 2024-05-07 16:46                     ` Linus Torvalds
  2024-05-07 17:45                       ` Christian König
  2024-05-07 18:04                       ` Daniel Vetter
  0 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-07 16:46 UTC (permalink / raw)
  To: Christian König, Linus Torvalds, Christian Brauner, Al Viro,
	keescook, axboe, christian.koenig, dri-devel, io-uring, jack,
	laura, linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs

On Tue, 7 May 2024 at 04:03, Daniel Vetter <daniel@ffwll.ch> wrote:
>
> It's really annoying that on some distros/builds we don't have that, and
> for gpu driver stack reasons we _really_ need to know whether a fd is the
> same as another, due to some messy uniqueness requirements on buffer
> objects various drivers have.

It's sad that such a simple thing would require two other horrid
models (EPOLL or KCMP).

There'[s a reason that KCMP is a config option - *some* of that is
horrible code - but the "compare file descriptors for equality" is not
that reason.

Note that KCMP really is a broken mess. It's also a potential security
hole, even for the simple things, because of how it ends up comparing
kernel pointers (ie it doesn't just say "same file descriptor", it
gives an ordering of them, so you can use KCMP to sort things in
kernel space).

And yes, it orders them after obfuscating the pointer, but it's still
not something I would consider sane as a baseline interface. It was
designed for checkpoint-restore, it's the wrong thing to use for some
"are these file descriptors the same".

The same argument goes for using EPOLL for that. Disgusting hack.

Just what are the requirements for the GPU stack? Is one of the file
descriptors "trusted", IOW, you know what kind it is?

Because dammit, it's *so* easy to do. You could just add a core DRM
ioctl for it. Literally just

        struct fd f1 = fdget(fd1);
        struct fd f2 = fdget(fd2);
        int same;

        same = f1.file && f1.file == f2.file;
        fdput(fd1);
        fdput(fd2);
        return same;

where the only question is if you also woudl want to deal with O_PATH
fd's, in which case the "fdget()" would be "fdget_raw()".

Honestly, adding some DRM ioctl for this sounds hacky, but it sounds
less hacky than relying on EPOLL or KCMP.

I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
too, if this is possibly a more common thing. and not just DRM wants
it.

Would something like that work for you?

                 Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-07 16:46                     ` Linus Torvalds
@ 2024-05-07 17:45                       ` Christian König
  2024-05-08  7:51                         ` Michel Dänzer
  2024-05-08  8:23                         ` Christian Brauner
  2024-05-07 18:04                       ` Daniel Vetter
  1 sibling, 2 replies; 90+ messages in thread
From: Christian König @ 2024-05-07 17:45 UTC (permalink / raw)
  To: Linus Torvalds, Christian Brauner, Al Viro, keescook, axboe,
	christian.koenig, dri-devel, io-uring, jack, laura,
	linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs

Am 07.05.24 um 18:46 schrieb Linus Torvalds:
> On Tue, 7 May 2024 at 04:03, Daniel Vetter <daniel@ffwll.ch> wrote:
>> It's really annoying that on some distros/builds we don't have that, and
>> for gpu driver stack reasons we _really_ need to know whether a fd is the
>> same as another, due to some messy uniqueness requirements on buffer
>> objects various drivers have.
> It's sad that such a simple thing would require two other horrid
> models (EPOLL or KCMP).
>
> There'[s a reason that KCMP is a config option - *some* of that is
> horrible code - but the "compare file descriptors for equality" is not
> that reason.
>
> Note that KCMP really is a broken mess. It's also a potential security
> hole, even for the simple things, because of how it ends up comparing
> kernel pointers (ie it doesn't just say "same file descriptor", it
> gives an ordering of them, so you can use KCMP to sort things in
> kernel space).
>
> And yes, it orders them after obfuscating the pointer, but it's still
> not something I would consider sane as a baseline interface. It was
> designed for checkpoint-restore, it's the wrong thing to use for some
> "are these file descriptors the same".
>
> The same argument goes for using EPOLL for that. Disgusting hack.
>
> Just what are the requirements for the GPU stack? Is one of the file
> descriptors "trusted", IOW, you know what kind it is?
>
> Because dammit, it's *so* easy to do. You could just add a core DRM
> ioctl for it. Literally just
>
>          struct fd f1 = fdget(fd1);
>          struct fd f2 = fdget(fd2);
>          int same;
>
>          same = f1.file && f1.file == f2.file;
>          fdput(fd1);
>          fdput(fd2);
>          return same;
>
> where the only question is if you also woudl want to deal with O_PATH
> fd's, in which case the "fdget()" would be "fdget_raw()".
>
> Honestly, adding some DRM ioctl for this sounds hacky, but it sounds
> less hacky than relying on EPOLL or KCMP.
>
> I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
> too, if this is possibly a more common thing. and not just DRM wants
> it.
>
> Would something like that work for you?

Well the generic approach yes, the DRM specific one maybe. IIRC we need 
to be able to compare both DRM as well as DMA-buf file descriptors.

The basic problem userspace tries to solve is that drivers might get the 
same fd through two different code paths.

For example application using OpenGL/Vulkan for rendering and VA-API for 
video decoding/encoding at the same time.

Both APIs get a fd which identifies the device to use. It can be the 
same, but it doesn't have to.

If it's the same device driver connection (or in kernel speak underlying 
struct file) then you can optimize away importing and exporting of 
buffers for example.

Additional to that it makes cgroup accounting much easier because you 
don't count things twice because they are shared etc...

Regards,
Christian.

>
>                   Linus


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-07 16:46                     ` Linus Torvalds
  2024-05-07 17:45                       ` Christian König
@ 2024-05-07 18:04                       ` Daniel Vetter
  2024-05-07 19:07                         ` Linus Torvalds
  1 sibling, 1 reply; 90+ messages in thread
From: Daniel Vetter @ 2024-05-07 18:04 UTC (permalink / raw)
  To: Linus Torvalds, Simon Ser, Pekka Paalanen
  Cc: Christian König, Christian Brauner, Al Viro, keescook,
	axboe, christian.koenig, dri-devel, io-uring, jack, laura,
	linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs

On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:
> On Tue, 7 May 2024 at 04:03, Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > It's really annoying that on some distros/builds we don't have that, and
> > for gpu driver stack reasons we _really_ need to know whether a fd is the
> > same as another, due to some messy uniqueness requirements on buffer
> > objects various drivers have.
>
> It's sad that such a simple thing would require two other horrid
> models (EPOLL or KCMP).
>
> There'[s a reason that KCMP is a config option - *some* of that is
> horrible code - but the "compare file descriptors for equality" is not
> that reason.
>
> Note that KCMP really is a broken mess. It's also a potential security
> hole, even for the simple things, because of how it ends up comparing
> kernel pointers (ie it doesn't just say "same file descriptor", it
> gives an ordering of them, so you can use KCMP to sort things in
> kernel space).
>
> And yes, it orders them after obfuscating the pointer, but it's still
> not something I would consider sane as a baseline interface. It was
> designed for checkpoint-restore, it's the wrong thing to use for some
> "are these file descriptors the same".
>
> The same argument goes for using EPOLL for that. Disgusting hack.
>
> Just what are the requirements for the GPU stack? Is one of the file
> descriptors "trusted", IOW, you know what kind it is?
>
> Because dammit, it's *so* easy to do. You could just add a core DRM
> ioctl for it. Literally just
>
>         struct fd f1 = fdget(fd1);
>         struct fd f2 = fdget(fd2);
>         int same;
>
>         same = f1.file && f1.file == f2.file;
>         fdput(fd1);
>         fdput(fd2);
>         return same;
>
> where the only question is if you also woudl want to deal with O_PATH
> fd's, in which case the "fdget()" would be "fdget_raw()".
>
> Honestly, adding some DRM ioctl for this sounds hacky, but it sounds
> less hacky than relying on EPOLL or KCMP.

Well, in slightly more code (because it's part of the "import this
dma-buf/dma-fence/whatever fd into a driver object" ioctl) this is what we
do.

The issue is that there's generic userspace (like compositors) that sees
these things fly by and would also like to know whether the other side
they receive them from is doing nasty stuff/buggy/evil. And they don't
have access to the device drm fd (since those are a handful of layers away
behind the opengl/vulkan userspace drivers even if the compositor could get
at them, and in some cases not even that).

So if we do this in drm we'd essentially have to create a special
drm_compare_files chardev, put the ioctl there and then tell everyone to
make that thing world-accessible.

Which is just too close to a real syscall that it's offensive, and hey
kcmp does what we want already (but unfortunately also way more). So we
rejected adding that to drm. But we did think about it.

> I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
> too, if this is possibly a more common thing. and not just DRM wants
> it.
>
> Would something like that work for you?

Yes.

Adding Simon and Pekka as two of the usual suspects for this kind of
stuff. Also example code (the int return value is just so that callers know
when kcmp isn't available, they all only care about equality):

https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239
-Sima

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-07 18:04                       ` Daniel Vetter
@ 2024-05-07 19:07                         ` Linus Torvalds
  2024-05-08  5:55                           ` Christian König
                                             ` (2 more replies)
  0 siblings, 3 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-05-07 19:07 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Simon Ser, Pekka Paalanen, Christian König,
	Christian Brauner, Al Viro, keescook, axboe, christian.koenig,
	dri-devel, io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Tue, 7 May 2024 at 11:04, Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:
>
> > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
> > too, if this is possibly a more common thing. and not just DRM wants
> > it.
> >
> > Would something like that work for you?
>
> Yes.
>
> Adding Simon and Pekka as two of the usual suspects for this kind of
> stuff. Also example code (the int return value is just so that callers know
> when kcmp isn't available, they all only care about equality):
>
> https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239

That example thing shows that we shouldn't make it a FISAME ioctl - we
should make it a fcntl() instead, and it would just be a companion to
F_DUPFD.

Doesn't that strike everybody as a *much* cleaner interface? I think
F_ISDUP would work very naturally indeed with F_DUPFD.

Yes? No?

                       Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* RE: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-06  8:45                     ` Christian Brauner
  2024-05-06  9:26                       ` Christian Brauner
@ 2024-05-07 21:02                       ` David Laight
  1 sibling, 0 replies; 90+ messages in thread
From: David Laight @ 2024-05-07 21:02 UTC (permalink / raw)
  To: 'Christian Brauner', Linus Torvalds
  Cc: Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

From: Christian Brauner
> Sent: 06 May 2024 09:45
> 
> > The fact is, it's not dma-buf that is violating any rules. It's epoll.
> 
> I agree that epoll() not taking a reference on the file is at least
> unexpected and contradicts the usual code patterns for the sake of
> performance and that it very likely is the case that most callers of
> f_op->poll() don't know this.
> 
> Note, I cleary wrote upthread that I'm ok to do it like you suggested
> but raised two concerns a) there's currently only one instance of
> prolonged @file lifetime in f_op->poll() afaict and b) that there's
> possibly going to be some performance impact on epoll().
> 
> So it's at least worth discussing what's more important because epoll()
> is very widely used and it's not that we haven't favored performance
> before.
> 
> But you've already said that you aren't concerned with performance on
> epoll() upthread. So afaict then there's really not a lot more to
> discuss other than take the patch and see whether we get any complaints.

Surely there isn't a problem with epoll holding a reference to the file
structure - it isn't really any different to a dup().

'All' that needs to happen is that the 'magic' that makes epoll() remove
files on the last fput happen when the close is done.
I'm sure there are horrid locking issues it that code (separate from
it calling ->poll() after ->release()) eg if you call close() concurrently
with EPOLL_CTL_ADD.

I'm not at all sure it would have mattered if epoll kept the file open.
But it can't do that because it is documented not to.
As well as poll/select holding a reference to all their fd for the duration
of the system call, a successful mmap() holds a reference until the pages
are all unmapped - usually by process exit.

We (dayjob) have code that uses epoll() to monitor large numbers of UDP
sockets. I was doing some tests (trying to) receive RTP (audio) data
concurrently on 10000 sockets with typically one packet every 20ms.
There are 10000 associated RCTP sockets that are usually idle.
A more normal limit would be 1000 RTP sockets.
All the data needs to go into a single (multithreaded) process.
Just getting all the packets queued on the sockets was non-trivial.
epoll is about the only way to actually read the data.
(That needed multiple epoll fd so each thread could process all
the events from one epoll fd then look for another unprocessed fd.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-07 19:07                         ` Linus Torvalds
@ 2024-05-08  5:55                           ` Christian König
  2024-05-08  8:32                             ` Daniel Vetter
  2024-05-08  8:05                           ` Christian Brauner
  2024-05-08 16:19                           ` Linus Torvalds
  2 siblings, 1 reply; 90+ messages in thread
From: Christian König @ 2024-05-08  5:55 UTC (permalink / raw)
  To: Linus Torvalds, Daniel Vetter
  Cc: Simon Ser, Pekka Paalanen, Christian Brauner, Al Viro, keescook,
	axboe, christian.koenig, dri-devel, io-uring, jack, laura,
	linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs

Am 07.05.24 um 21:07 schrieb Linus Torvalds:
> On Tue, 7 May 2024 at 11:04, Daniel Vetter <daniel@ffwll.ch> wrote:
>> On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:
>>
>>> I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
>>> too, if this is possibly a more common thing. and not just DRM wants
>>> it.
>>>
>>> Would something like that work for you?
>> Yes.
>>
>> Adding Simon and Pekka as two of the usual suspects for this kind of
>> stuff. Also example code (the int return value is just so that callers know
>> when kcmp isn't available, they all only care about equality):
>>
>> https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239
> That example thing shows that we shouldn't make it a FISAME ioctl - we
> should make it a fcntl() instead, and it would just be a companion to
> F_DUPFD.
>
> Doesn't that strike everybody as a *much* cleaner interface? I think
> F_ISDUP would work very naturally indeed with F_DUPFD.
>
> Yes? No?

Sounds absolutely sane to me.

Christian.

>
>                         Linus


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-07 17:45                       ` Christian König
@ 2024-05-08  7:51                         ` Michel Dänzer
  2024-05-08  7:59                           ` Christian König
  2024-05-08  8:23                         ` Christian Brauner
  1 sibling, 1 reply; 90+ messages in thread
From: Michel Dänzer @ 2024-05-08  7:51 UTC (permalink / raw)
  To: Christian König, Linus Torvalds, Christian Brauner, Al Viro,
	keescook, axboe, christian.koenig, dri-devel, io-uring, jack,
	laura, linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs

On 2024-05-07 19:45, Christian König wrote:
> Am 07.05.24 um 18:46 schrieb Linus Torvalds:
>>
>> Just what are the requirements for the GPU stack? Is one of the file
>> descriptors "trusted", IOW, you know what kind it is?
>>
>> Because dammit, it's *so* easy to do. You could just add a core DRM
>> ioctl for it. Literally just
>>
>>          struct fd f1 = fdget(fd1);
>>          struct fd f2 = fdget(fd2);
>>          int same;
>>
>>          same = f1.file && f1.file == f2.file;
>>          fdput(fd1);
>>          fdput(fd2);
>>          return same;
>>
>> where the only question is if you also woudl want to deal with O_PATH
>> fd's, in which case the "fdget()" would be "fdget_raw()".
>>
>> Honestly, adding some DRM ioctl for this sounds hacky, but it sounds
>> less hacky than relying on EPOLL or KCMP.
>>
>> I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
>> too, if this is possibly a more common thing. and not just DRM wants
>> it.
>>
>> Would something like that work for you?
> 
> Well the generic approach yes, the DRM specific one maybe. IIRC we need to be able to compare both DRM as well as DMA-buf file descriptors.
> 
> The basic problem userspace tries to solve is that drivers might get the same fd through two different code paths.
> 
> For example application using OpenGL/Vulkan for rendering and VA-API for video decoding/encoding at the same time.
> 
> Both APIs get a fd which identifies the device to use. It can be the same, but it doesn't have to.
> 
> If it's the same device driver connection (or in kernel speak underlying struct file) then you can optimize away importing and exporting of buffers for example.

It's not just about optimization. Mesa needs to know this for correct tracking of GEM handles. If it guesses incorrectly, there can be misbehaviour.


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-08  7:51                         ` Michel Dänzer
@ 2024-05-08  7:59                           ` Christian König
  0 siblings, 0 replies; 90+ messages in thread
From: Christian König @ 2024-05-08  7:59 UTC (permalink / raw)
  To: Michel Dänzer, Christian König, Linus Torvalds,
	Christian Brauner, Al Viro, keescook, axboe, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

Am 08.05.24 um 09:51 schrieb Michel Dänzer:
> On 2024-05-07 19:45, Christian König wrote:
>> Am 07.05.24 um 18:46 schrieb Linus Torvalds:
>>> Just what are the requirements for the GPU stack? Is one of the file
>>> descriptors "trusted", IOW, you know what kind it is?
>>>
>>> Because dammit, it's *so* easy to do. You could just add a core DRM
>>> ioctl for it. Literally just
>>>
>>>           struct fd f1 = fdget(fd1);
>>>           struct fd f2 = fdget(fd2);
>>>           int same;
>>>
>>>           same = f1.file && f1.file == f2.file;
>>>           fdput(fd1);
>>>           fdput(fd2);
>>>           return same;
>>>
>>> where the only question is if you also woudl want to deal with O_PATH
>>> fd's, in which case the "fdget()" would be "fdget_raw()".
>>>
>>> Honestly, adding some DRM ioctl for this sounds hacky, but it sounds
>>> less hacky than relying on EPOLL or KCMP.
>>>
>>> I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
>>> too, if this is possibly a more common thing. and not just DRM wants
>>> it.
>>>
>>> Would something like that work for you?
>> Well the generic approach yes, the DRM specific one maybe. IIRC we need to be able to compare both DRM as well as DMA-buf file descriptors.
>>
>> The basic problem userspace tries to solve is that drivers might get the same fd through two different code paths.
>>
>> For example application using OpenGL/Vulkan for rendering and VA-API for video decoding/encoding at the same time.
>>
>> Both APIs get a fd which identifies the device to use. It can be the same, but it doesn't have to.
>>
>> If it's the same device driver connection (or in kernel speak underlying struct file) then you can optimize away importing and exporting of buffers for example.
> It's not just about optimization. Mesa needs to know this for correct tracking of GEM handles. If it guesses incorrectly, there can be misbehaviour.

Oh, yeah good point as well.

I think we can say in general that if two userspace driver libraries 
would mess with the state of an fd at the same time without knowing of 
each other bad things would happen.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-07 19:07                         ` Linus Torvalds
  2024-05-08  5:55                           ` Christian König
@ 2024-05-08  8:05                           ` Christian Brauner
  2024-05-08 16:19                           ` Linus Torvalds
  2 siblings, 0 replies; 90+ messages in thread
From: Christian Brauner @ 2024-05-08  8:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Vetter, Simon Ser, Pekka Paalanen, Christian König,
	Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Tue, May 07, 2024 at 12:07:10PM -0700, Linus Torvalds wrote:
> On Tue, 7 May 2024 at 11:04, Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:
> >
> > > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
> > > too, if this is possibly a more common thing. and not just DRM wants
> > > it.
> > >
> > > Would something like that work for you?
> >
> > Yes.
> >
> > Adding Simon and Pekka as two of the usual suspects for this kind of
> > stuff. Also example code (the int return value is just so that callers know
> > when kcmp isn't available, they all only care about equality):
> >
> > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239
> 
> That example thing shows that we shouldn't make it a FISAME ioctl - we
> should make it a fcntl() instead, and it would just be a companion to
> F_DUPFD.
> 
> Doesn't that strike everybody as a *much* cleaner interface? I think

+1
See
https://github.com/systemd/systemd/blob/a4f0e0da3573a10bc5404142be8799418760b1d1/src/basic/fd-util.c#L517
that's another heavy user of this kind of functionality.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-07 17:45                       ` Christian König
  2024-05-08  7:51                         ` Michel Dänzer
@ 2024-05-08  8:23                         ` Christian Brauner
  2024-05-08  9:10                           ` Christian König
  1 sibling, 1 reply; 90+ messages in thread
From: Christian Brauner @ 2024-05-08  8:23 UTC (permalink / raw)
  To: Christian König
  Cc: Linus Torvalds, Al Viro, keescook, axboe, christian.koenig,
	dri-devel, io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Tue, May 07, 2024 at 07:45:02PM +0200, Christian König wrote:
> Am 07.05.24 um 18:46 schrieb Linus Torvalds:
> > On Tue, 7 May 2024 at 04:03, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > It's really annoying that on some distros/builds we don't have that, and
> > > for gpu driver stack reasons we _really_ need to know whether a fd is the
> > > same as another, due to some messy uniqueness requirements on buffer
> > > objects various drivers have.
> > It's sad that such a simple thing would require two other horrid
> > models (EPOLL or KCMP).
> > 
> > There'[s a reason that KCMP is a config option - *some* of that is
> > horrible code - but the "compare file descriptors for equality" is not
> > that reason.
> > 
> > Note that KCMP really is a broken mess. It's also a potential security
> > hole, even for the simple things, because of how it ends up comparing
> > kernel pointers (ie it doesn't just say "same file descriptor", it
> > gives an ordering of them, so you can use KCMP to sort things in
> > kernel space).
> > 
> > And yes, it orders them after obfuscating the pointer, but it's still
> > not something I would consider sane as a baseline interface. It was
> > designed for checkpoint-restore, it's the wrong thing to use for some
> > "are these file descriptors the same".
> > 
> > The same argument goes for using EPOLL for that. Disgusting hack.
> > 
> > Just what are the requirements for the GPU stack? Is one of the file
> > descriptors "trusted", IOW, you know what kind it is?
> > 
> > Because dammit, it's *so* easy to do. You could just add a core DRM
> > ioctl for it. Literally just
> > 
> >          struct fd f1 = fdget(fd1);
> >          struct fd f2 = fdget(fd2);
> >          int same;
> > 
> >          same = f1.file && f1.file == f2.file;
> >          fdput(fd1);
> >          fdput(fd2);
> >          return same;
> > 
> > where the only question is if you also woudl want to deal with O_PATH
> > fd's, in which case the "fdget()" would be "fdget_raw()".
> > 
> > Honestly, adding some DRM ioctl for this sounds hacky, but it sounds
> > less hacky than relying on EPOLL or KCMP.
> > 
> > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
> > too, if this is possibly a more common thing. and not just DRM wants
> > it.
> > 
> > Would something like that work for you?
> 
> Well the generic approach yes, the DRM specific one maybe. IIRC we need to
> be able to compare both DRM as well as DMA-buf file descriptors.
> 
> The basic problem userspace tries to solve is that drivers might get the
> same fd through two different code paths.
> 
> For example application using OpenGL/Vulkan for rendering and VA-API for
> video decoding/encoding at the same time.
> 
> Both APIs get a fd which identifies the device to use. It can be the same,
> but it doesn't have to.
> 
> If it's the same device driver connection (or in kernel speak underlying
> struct file) then you can optimize away importing and exporting of buffers
> for example.
> 
> Additional to that it makes cgroup accounting much easier because you don't
> count things twice because they are shared etc...

One thing to keep in mind is that a generic VFS level comparing function
will only catch the obvious case where you have dup() equivalency as
outlined above by Linus. That's what most people are interested in and
that could easily replace most kcmp() use-cases for comparing fds.

But, of course there's the case where you have two file descriptors
referring to two different files that reference the same underlying
object (usually stashed in file->private_data).

For most cases that problem can ofc be solved by comparing the
underlying inode. But that doesn't work for drivers using the generic
anonymous inode infrastructure because it uses the same inode for
everything or for cases where the same underlying object can even be
represented by different inodes.

So for such cases a driver specific ioctl() to compare two fds will
be needed in addition to the generic helper.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-08  5:55                           ` Christian König
@ 2024-05-08  8:32                             ` Daniel Vetter
  2024-05-08 10:16                               ` Christian Brauner
  0 siblings, 1 reply; 90+ messages in thread
From: Daniel Vetter @ 2024-05-08  8:32 UTC (permalink / raw)
  To: Christian König
  Cc: Linus Torvalds, Daniel Vetter, Simon Ser, Pekka Paalanen,
	Christian Brauner, Al Viro, keescook, axboe, christian.koenig,
	dri-devel, io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Wed, May 08, 2024 at 07:55:08AM +0200, Christian König wrote:
> Am 07.05.24 um 21:07 schrieb Linus Torvalds:
> > On Tue, 7 May 2024 at 11:04, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:
> > > 
> > > > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
> > > > too, if this is possibly a more common thing. and not just DRM wants
> > > > it.
> > > > 
> > > > Would something like that work for you?
> > > Yes.
> > > 
> > > Adding Simon and Pekka as two of the usual suspects for this kind of
> > > stuff. Also example code (the int return value is just so that callers know
> > > when kcmp isn't available, they all only care about equality):
> > > 
> > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239
> > That example thing shows that we shouldn't make it a FISAME ioctl - we
> > should make it a fcntl() instead, and it would just be a companion to
> > F_DUPFD.
> > 
> > Doesn't that strike everybody as a *much* cleaner interface? I think
> > F_ISDUP would work very naturally indeed with F_DUPFD.
> > 
> > Yes? No?
> 
> Sounds absolutely sane to me.

Yeah fcntl(fd1, F_ISDUP, fd2); sounds extremely reasonable to me too.

Aside, after some irc discussions I paged a few more of the relevant info
back in, and at least for dma-buf we kinda sorted this out by going away
from the singleton inode in this patch: ed63bb1d1f84 ("dma-buf: give each
buffer a full-fledged inode")

It's uapi now so we can't ever undo that, but with hindsight just the
F_ISDUP is really what we wanted. Because we have no need for that inode
aside from the unique inode number that's only used to compare dma-buf fd
for sameness, e.g.

https://gitlab.freedesktop.org/wlroots/wlroots/-/blob/master/render/vulkan/texture.c#L490

The one question I have is whether this could lead to some exploit tools,
because at least the android conformance test suite verifies that kcmp
isn't available to apps (which is where we need it, because even with all
the binder-based isolation gpu userspace still all run in the application
process due to performance reasons, any ipc at all is just too much).

Otoh if we just add this to drm fd as an ioctl somewhere, then it will
also be available to every android app because they all do need the gpu
for rendering. So going with the full generic fcntl is probably best.
-Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 90+ messages in thread

* RE: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)
  2024-05-06 18:17                     ` Linus Torvalds
@ 2024-05-08  8:47                       ` David Laight
  0 siblings, 0 replies; 90+ messages in thread
From: David Laight @ 2024-05-08  8:47 UTC (permalink / raw)
  To: 'Linus Torvalds', Stefan Metzmacher
  Cc: Al Viro, Kees Cook, Jens Axboe, Bui Quang Minh,
	Christian Brauner, syzbot, io-uring, jack, linux-fsdevel,
	linux-kernel, syzkaller-bugs, Sumit Semwal, Christian König,
	linux-media, dri-devel, linaro-mm-sig, Laura Abbott

From: Linus Torvalds
> Sent: 06 May 2024 19:18
...
> Which is why I applied the minimal patch for just "refcount over
> vfs_poll()", and am just mentioning my suggestion in the hope that
> some eager beaver would like to see how painful it would do to make
> the bigger surgery...

I wonder if I can work out how it (doesn't) currently work...

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-08  8:23                         ` Christian Brauner
@ 2024-05-08  9:10                           ` Christian König
  0 siblings, 0 replies; 90+ messages in thread
From: Christian König @ 2024-05-08  9:10 UTC (permalink / raw)
  To: Christian Brauner, Christian König
  Cc: Linus Torvalds, Al Viro, keescook, axboe, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

Am 08.05.24 um 10:23 schrieb Christian Brauner:
> On Tue, May 07, 2024 at 07:45:02PM +0200, Christian König wrote:
>> Am 07.05.24 um 18:46 schrieb Linus Torvalds:
>>> On Tue, 7 May 2024 at 04:03, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>> It's really annoying that on some distros/builds we don't have that, and
>>>> for gpu driver stack reasons we _really_ need to know whether a fd is the
>>>> same as another, due to some messy uniqueness requirements on buffer
>>>> objects various drivers have.
>>> It's sad that such a simple thing would require two other horrid
>>> models (EPOLL or KCMP).
>>>
>>> There'[s a reason that KCMP is a config option - *some* of that is
>>> horrible code - but the "compare file descriptors for equality" is not
>>> that reason.
>>>
>>> Note that KCMP really is a broken mess. It's also a potential security
>>> hole, even for the simple things, because of how it ends up comparing
>>> kernel pointers (ie it doesn't just say "same file descriptor", it
>>> gives an ordering of them, so you can use KCMP to sort things in
>>> kernel space).
>>>
>>> And yes, it orders them after obfuscating the pointer, but it's still
>>> not something I would consider sane as a baseline interface. It was
>>> designed for checkpoint-restore, it's the wrong thing to use for some
>>> "are these file descriptors the same".
>>>
>>> The same argument goes for using EPOLL for that. Disgusting hack.
>>>
>>> Just what are the requirements for the GPU stack? Is one of the file
>>> descriptors "trusted", IOW, you know what kind it is?
>>>
>>> Because dammit, it's *so* easy to do. You could just add a core DRM
>>> ioctl for it. Literally just
>>>
>>>           struct fd f1 = fdget(fd1);
>>>           struct fd f2 = fdget(fd2);
>>>           int same;
>>>
>>>           same = f1.file && f1.file == f2.file;
>>>           fdput(fd1);
>>>           fdput(fd2);
>>>           return same;
>>>
>>> where the only question is if you also woudl want to deal with O_PATH
>>> fd's, in which case the "fdget()" would be "fdget_raw()".
>>>
>>> Honestly, adding some DRM ioctl for this sounds hacky, but it sounds
>>> less hacky than relying on EPOLL or KCMP.
>>>
>>> I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
>>> too, if this is possibly a more common thing. and not just DRM wants
>>> it.
>>>
>>> Would something like that work for you?
>> Well the generic approach yes, the DRM specific one maybe. IIRC we need to
>> be able to compare both DRM as well as DMA-buf file descriptors.
>>
>> The basic problem userspace tries to solve is that drivers might get the
>> same fd through two different code paths.
>>
>> For example application using OpenGL/Vulkan for rendering and VA-API for
>> video decoding/encoding at the same time.
>>
>> Both APIs get a fd which identifies the device to use. It can be the same,
>> but it doesn't have to.
>>
>> If it's the same device driver connection (or in kernel speak underlying
>> struct file) then you can optimize away importing and exporting of buffers
>> for example.
>>
>> Additional to that it makes cgroup accounting much easier because you don't
>> count things twice because they are shared etc...
> One thing to keep in mind is that a generic VFS level comparing function
> will only catch the obvious case where you have dup() equivalency as
> outlined above by Linus. That's what most people are interested in and
> that could easily replace most kcmp() use-cases for comparing fds.
>
> But, of course there's the case where you have two file descriptors
> referring to two different files that reference the same underlying
> object (usually stashed in file->private_data).
>
> For most cases that problem can ofc be solved by comparing the
> underlying inode. But that doesn't work for drivers using the generic
> anonymous inode infrastructure because it uses the same inode for
> everything or for cases where the same underlying object can even be
> represented by different inodes.
>
> So for such cases a driver specific ioctl() to compare two fds will
> be needed in addition to the generic helper.

At least for the DRM we already have some solution for that, see 
drmGetPrimaryDeviceNameFromFd() for an example.

Basically the file->private_data is still something different, but we 
use this to figure out if we have two file descriptors (with individual 
struct files underneath) pointing to the same hw driver.

This is important if you need to know if just importing/exporting of 
DMA-buf handles between the two file descriptors is enough or if you 
deal with two different hw devices and need to do stuff like format 
conversion etc...

Regards,
Christian.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-06 14:29                 ` [Linaro-mm-sig] " Christian König
  2024-05-07 11:02                   ` Daniel Vetter
@ 2024-05-08 10:08                   ` Christian Brauner
  2024-05-08 15:45                     ` Daniel Vetter
  1 sibling, 1 reply; 90+ messages in thread
From: Christian Brauner @ 2024-05-08 10:08 UTC (permalink / raw)
  To: Christian König, Linus Torvalds
  Cc: Linus Torvalds, Al Viro, keescook, axboe, christian.koenig,
	dri-devel, io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Mon, May 06, 2024 at 04:29:44PM +0200, Christian König wrote:
> Am 04.05.24 um 20:20 schrieb Linus Torvalds:
> > On Sat, 4 May 2024 at 08:32, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > Lookie here, the fundamental issue is that epoll can call '->poll()'
> > > on a file descriptor that is being closed concurrently.
> > Thinking some more about this, and replying to myself...
> > 
> > Actually, I wonder if we could *really* fix this by simply moving the
> > eventpoll_release() to where it really belongs.
> > 
> > If we did it in file_close_fd_locked(),  it would actually make a
> > *lot* more sense. Particularly since eventpoll actually uses this:
> > 
> >      struct epoll_filefd {
> >          struct file *file;
> >          int fd;
> >      } __packed;
> > 
> > ie it doesn't just use the 'struct file *', it uses the 'fd' itself
> > (for ep_find()).
> > 
> > (Strictly speaking, it should also have a pointer to the 'struct
> > files_struct' to make the 'int fd' be meaningful).
> 
> While I completely agree on this I unfortunately have to ruin the idea.
> 
> Before we had KCMP some people relied on the strange behavior of eventpoll
> to compare struct files when the fd is the same.
> 
> I just recently suggested that solution to somebody at AMD as a workaround
> when KCMP is disabled because of security hardening and I'm pretty sure I've
> seen it somewhere else as well.
> 
> So when we change that it would break (undocumented?) UAPI behavior.

I've worked on that a bit yesterday and I learned new things about epoll
and ran into some limitations.

Like, what happens if process P1 has a file descriptor registered in an
epoll instance and now P1 forks and creates P2. So every file that P1
maintains gets copied into a new file descriptor table for P2. And the
same file descriptors refer to the same files for both P1 and P2.

So there's two interesting cases here:

(1) P2 explicitly removes the file descriptor from the epoll instance
    via epoll_ctl(EPOLL_CTL_DEL). That removal affects both P1 and P2
    since the <fd, file> pair is only registered once and it isn't
    marked whether it belongs to P1 and P2 fdtable.

    So effectively fork()ing with epoll creates a weird shared state
    where removal of file descriptors that were registered before the
    fork() affects both child and parent.

    I found that surprising even though I've worked with epoll quite
    extensively in low-level userspace.

(2) P2 doesn't close it's file descriptors. It just exits. Since removal
    of the file descriptor from the epoll instance isn't done during
    close() but during last fput() P1's epoll state remains unaffected
    by P2's sloppy exit because P1 still holds references to all files
    in its fdtable.

    (Sidenote, if one ends up adding every more duped-fds into epoll
    instance that one doesn't explicitly close and all of them refer to
    the same file wouldn't one just be allocating new epitems that
    are kept around for a really long time?)

So if the removal of the fd would now be done during close() or during
exit_files() when we call close_files() and since there's currently no
way of differentiating whether P1 or P2 own that fd it would mean that
(2) collapses into (1) and we'd always alter (1)'s epoll state. That
would be a UAPI break.

So say we record the fdtable to get ownership of that file descriptor so
P2 doesn't close anything in (2) that really belongs to P1 to fix that
problem.

But afaict, that would break another possible use-case. Namely, where P1
creates an epoll instance and registeres fds and then fork()s to create
P2. Now P1 can exit and P2 takes over the epoll loop of P1. This
wouldn't work anymore because P1 would deregister all fds it owns in
that epoll instance during exit. I didn't see an immediate nice way of
fixing that issue.

But note that taking over an epoll loop from the parent doesn't work
reliably for some file descriptors. Consider man signalfd(2):

   epoll(7) semantics
       If a process adds (via epoll_ctl(2)) a signalfd file descriptor to an epoll(7) instance,
       then epoll_wait(2) returns events only for signals sent to that process.  In particular,
       if  the process then uses fork(2) to create a child process, then the child will be able
       to read(2) signals that  are  sent  to  it  using  the  signalfd  file  descriptor,  but
       epoll_wait(2)  will  not  indicate  that the signalfd file descriptor is ready.  In this
       scenario, a possible workaround is that after the fork(2), the child process  can  close
       the  signalfd  file descriptor that it inherited from the parent process and then create
       another signalfd file descriptor and add it to the epoll instance.   Alternatively,  the
       parent and the child could delay creating their (separate) signalfd file descriptors and
       adding them to the epoll instance until after the call to fork(2).

So effectively P1 opens a signalfd and registers it in an epoll
instance. Then it fork()s and creates P2. Now both P1 and P2 call
epoll_wait(). Since signalfds are always relative to the caller and P1
did call signalfd_poll() to register the callback only P1 can get
events. So P2 can't take over signalfds in that epoll loop.

Honestly, the inheritance semantics of epoll across fork() seem pretty
wonky and it would've been better if an epoll fd inherited across
would've returned ESTALE or EINVAL or something. And if that inheritance
of epoll instances would really be a big use-case there'd be some
explicit way to enable this.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-08  8:32                             ` Daniel Vetter
@ 2024-05-08 10:16                               ` Christian Brauner
  0 siblings, 0 replies; 90+ messages in thread
From: Christian Brauner @ 2024-05-08 10:16 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, Linus Torvalds, Simon Ser, Pekka Paalanen,
	Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Wed, May 08, 2024 at 10:32:08AM +0200, Daniel Vetter wrote:
> On Wed, May 08, 2024 at 07:55:08AM +0200, Christian König wrote:
> > Am 07.05.24 um 21:07 schrieb Linus Torvalds:
> > > On Tue, 7 May 2024 at 11:04, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:
> > > > 
> > > > > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
> > > > > too, if this is possibly a more common thing. and not just DRM wants
> > > > > it.
> > > > > 
> > > > > Would something like that work for you?
> > > > Yes.
> > > > 
> > > > Adding Simon and Pekka as two of the usual suspects for this kind of
> > > > stuff. Also example code (the int return value is just so that callers know
> > > > when kcmp isn't available, they all only care about equality):
> > > > 
> > > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239
> > > That example thing shows that we shouldn't make it a FISAME ioctl - we
> > > should make it a fcntl() instead, and it would just be a companion to
> > > F_DUPFD.
> > > 
> > > Doesn't that strike everybody as a *much* cleaner interface? I think
> > > F_ISDUP would work very naturally indeed with F_DUPFD.
> > > 
> > > Yes? No?
> > 
> > Sounds absolutely sane to me.
> 
> Yeah fcntl(fd1, F_ISDUP, fd2); sounds extremely reasonable to me too.
> 
> Aside, after some irc discussions I paged a few more of the relevant info
> back in, and at least for dma-buf we kinda sorted this out by going away
> from the singleton inode in this patch: ed63bb1d1f84 ("dma-buf: give each
> buffer a full-fledged inode")
> 
> It's uapi now so we can't ever undo that, but with hindsight just the
> F_ISDUP is really what we wanted. Because we have no need for that inode
> aside from the unique inode number that's only used to compare dma-buf fd
> for sameness, e.g.
> 
> https://gitlab.freedesktop.org/wlroots/wlroots/-/blob/master/render/vulkan/texture.c#L490
> 
> The one question I have is whether this could lead to some exploit tools,
> because at least the android conformance test suite verifies that kcmp
> isn't available to apps (which is where we need it, because even with all
> the binder-based isolation gpu userspace still all run in the application
> process due to performance reasons, any ipc at all is just too much).
> 
> Otoh if we just add this to drm fd as an ioctl somewhere, then it will
> also be available to every android app because they all do need the gpu
> for rendering. So going with the full generic fcntl is probably best.
> -Sima

fcntl() will call security_file_fcntl(). IIRC, Android uses selinux and
I'm pretty certain they'd disallow any fcntl() operations they deems
unsafe. So a kernel update for them would likely require allow-listing
the new fcntl(). Or if they do allow all new fnctl()s by default they'd
have to disallow it if they thought that's an issue but really I don't
even think there's any issue in that.

I think kcmp() is a different problem because you can use it to compare
objects from different tasks. The generic fcntl() wouldn't allow that.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-08 10:08                   ` Christian Brauner
@ 2024-05-08 15:45                     ` Daniel Vetter
  2024-05-10 10:55                       ` Christian Brauner
  0 siblings, 1 reply; 90+ messages in thread
From: Daniel Vetter @ 2024-05-08 15:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christian König, Linus Torvalds, Al Viro, keescook, axboe,
	christian.koenig, dri-devel, io-uring, jack, laura,
	linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs

On Wed, May 08, 2024 at 12:08:57PM +0200, Christian Brauner wrote:
> On Mon, May 06, 2024 at 04:29:44PM +0200, Christian König wrote:
> > Am 04.05.24 um 20:20 schrieb Linus Torvalds:
> > > On Sat, 4 May 2024 at 08:32, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > > Lookie here, the fundamental issue is that epoll can call '->poll()'
> > > > on a file descriptor that is being closed concurrently.
> > > Thinking some more about this, and replying to myself...
> > > 
> > > Actually, I wonder if we could *really* fix this by simply moving the
> > > eventpoll_release() to where it really belongs.
> > > 
> > > If we did it in file_close_fd_locked(),  it would actually make a
> > > *lot* more sense. Particularly since eventpoll actually uses this:
> > > 
> > >      struct epoll_filefd {
> > >          struct file *file;
> > >          int fd;
> > >      } __packed;
> > > 
> > > ie it doesn't just use the 'struct file *', it uses the 'fd' itself
> > > (for ep_find()).
> > > 
> > > (Strictly speaking, it should also have a pointer to the 'struct
> > > files_struct' to make the 'int fd' be meaningful).
> > 
> > While I completely agree on this I unfortunately have to ruin the idea.
> > 
> > Before we had KCMP some people relied on the strange behavior of eventpoll
> > to compare struct files when the fd is the same.
> > 
> > I just recently suggested that solution to somebody at AMD as a workaround
> > when KCMP is disabled because of security hardening and I'm pretty sure I've
> > seen it somewhere else as well.
> > 
> > So when we change that it would break (undocumented?) UAPI behavior.
> 
> I've worked on that a bit yesterday and I learned new things about epoll
> and ran into some limitations.
> 
> Like, what happens if process P1 has a file descriptor registered in an
> epoll instance and now P1 forks and creates P2. So every file that P1
> maintains gets copied into a new file descriptor table for P2. And the
> same file descriptors refer to the same files for both P1 and P2.

So this is pretty similar to any other struct file that has resources
hanging off the struct file instead of the underlying inode. Like drm
chardev files, where all the buffers, gpu contexts and everything else
hangs off the file and there's no other way to get at them (except when
exporting to some explicitly meant-for-sharing file like dma-buf).

If you fork() that it's utter hilarity, which is why absolutely everyone
should set O_CLOEXEC on these. Or EPOLL_CLOEXEC for epoll_create.

For the uapi issue you describe below my take would be that we should just
try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe
I'm biased from the gpu world, where we've been hammering it in that
"O_CLOEXEC or bust" mantra since well over a decade. Really the only valid
use-case is something like systemd handing open files to a service, where
it drops priviledges even well before the exec() call. But we can't switch
around the defaults for any of these special open files with anything more
than just a current seek position as state, since that breaks uapi.
-Sima

> 
> So there's two interesting cases here:
> 
> (1) P2 explicitly removes the file descriptor from the epoll instance
>     via epoll_ctl(EPOLL_CTL_DEL). That removal affects both P1 and P2
>     since the <fd, file> pair is only registered once and it isn't
>     marked whether it belongs to P1 and P2 fdtable.
> 
>     So effectively fork()ing with epoll creates a weird shared state
>     where removal of file descriptors that were registered before the
>     fork() affects both child and parent.
> 
>     I found that surprising even though I've worked with epoll quite
>     extensively in low-level userspace.
> 
> (2) P2 doesn't close it's file descriptors. It just exits. Since removal
>     of the file descriptor from the epoll instance isn't done during
>     close() but during last fput() P1's epoll state remains unaffected
>     by P2's sloppy exit because P1 still holds references to all files
>     in its fdtable.
> 
>     (Sidenote, if one ends up adding every more duped-fds into epoll
>     instance that one doesn't explicitly close and all of them refer to
>     the same file wouldn't one just be allocating new epitems that
>     are kept around for a really long time?)
> 
> So if the removal of the fd would now be done during close() or during
> exit_files() when we call close_files() and since there's currently no
> way of differentiating whether P1 or P2 own that fd it would mean that
> (2) collapses into (1) and we'd always alter (1)'s epoll state. That
> would be a UAPI break.
> 
> So say we record the fdtable to get ownership of that file descriptor so
> P2 doesn't close anything in (2) that really belongs to P1 to fix that
> problem.
> 
> But afaict, that would break another possible use-case. Namely, where P1
> creates an epoll instance and registeres fds and then fork()s to create
> P2. Now P1 can exit and P2 takes over the epoll loop of P1. This
> wouldn't work anymore because P1 would deregister all fds it owns in
> that epoll instance during exit. I didn't see an immediate nice way of
> fixing that issue.
> 
> But note that taking over an epoll loop from the parent doesn't work
> reliably for some file descriptors. Consider man signalfd(2):
> 
>    epoll(7) semantics
>        If a process adds (via epoll_ctl(2)) a signalfd file descriptor to an epoll(7) instance,
>        then epoll_wait(2) returns events only for signals sent to that process.  In particular,
>        if  the process then uses fork(2) to create a child process, then the child will be able
>        to read(2) signals that  are  sent  to  it  using  the  signalfd  file  descriptor,  but
>        epoll_wait(2)  will  not  indicate  that the signalfd file descriptor is ready.  In this
>        scenario, a possible workaround is that after the fork(2), the child process  can  close
>        the  signalfd  file descriptor that it inherited from the parent process and then create
>        another signalfd file descriptor and add it to the epoll instance.   Alternatively,  the
>        parent and the child could delay creating their (separate) signalfd file descriptors and
>        adding them to the epoll instance until after the call to fork(2).
> 
> So effectively P1 opens a signalfd and registers it in an epoll
> instance. Then it fork()s and creates P2. Now both P1 and P2 call
> epoll_wait(). Since signalfds are always relative to the caller and P1
> did call signalfd_poll() to register the callback only P1 can get
> events. So P2 can't take over signalfds in that epoll loop.
> 
> Honestly, the inheritance semantics of epoll across fork() seem pretty
> wonky and it would've been better if an epoll fd inherited across
> would've returned ESTALE or EINVAL or something. And if that inheritance
> of epoll instances would really be a big use-case there'd be some
> explicit way to enable this.

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-07 19:07                         ` Linus Torvalds
  2024-05-08  5:55                           ` Christian König
  2024-05-08  8:05                           ` Christian Brauner
@ 2024-05-08 16:19                           ` Linus Torvalds
  2024-05-08 17:14                             ` Linus Torvalds
  2 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2024-05-08 16:19 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Simon Ser, Pekka Paalanen, Christian König,
	Christian Brauner, Al Viro, keescook, axboe, christian.koenig,
	dri-devel, io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

[-- Attachment #1: Type: text/plain, Size: 1750 bytes --]

On Tue, 7 May 2024 at 12:07, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> That example thing shows that we shouldn't make it a FISAME ioctl - we
> should make it a fcntl() instead, and it would just be a companion to
> F_DUPFD.
>
> Doesn't that strike everybody as a *much* cleaner interface? I think
> F_ISDUP would work very naturally indeed with F_DUPFD.

So since we already have two versions of F_DUPFD (the other being
F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend
on that existing naming pattern, and called it F_DUPFD_QUERY instead.

I'm not married to the name, so if somebody hates it, feel free to
argue otherwise.

But with that, the suggested patch would end up looking something like
the attached (I also re-ordered the existing "F_LINUX_SPECIFIC_BASE"
users, since one of them was out of numerical order).

This really feels like a very natural thing, and yes, the 'same_fd()'
function in systemd that Christian also pointed at could use this very
naturally.

Also note that I obviously haven't tested this. Because obviously this
is trivially correct and cannot possibly have any bugs. Right? RIGHT?

And yes, I did check - despite the odd jump in numbers, we've never
had anything between F_NOTIFY (+2) and F_CANCELLK (+5).

We added F_SETLEASE (+0) , F_GETLEASE (+1) and F_NOTIFY (+2) in
2.4.0-test9 (roughly October 2000, I didn't dig deeper).

And then back in 2007 we suddenly jumped to F_CANCELLK (+5) in commit
9b9d2ab4154a ("locks: add lock cancel command"). I don't know why 3/4
were shunned.

After that we had 22d2b35b200f ("F_DUPFD_CLOEXEC implementation") add
F_DUPFD_CLOEXEC (+6).

I'd have loved to put F_DUPFD_QUERY next to it, but +5 and +7 are both used.

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2375 bytes --]

 fs/fcntl.c                 | 23 +++++++++++++++++++++++
 include/uapi/linux/fcntl.h | 14 ++++++++------
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 54cc85d3338e..1ddb63f70445 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -327,6 +327,25 @@ static long fcntl_set_rw_hint(struct file *file, unsigned int cmd,
 	return 0;
 }
 
+/*
+ * Is the file descriptor a dup of the file?
+ */
+static long f_dupfd_query(int fd, struct file *filp)
+{
+	struct fd f = fdget_raw(fd);
+
+	/*
+	 * We can do the 'fdput()' immediately, as the only thing that
+	 * matters is the pointer value which isn't changed by the fdput.
+	 *
+	 * Technically we didn't need a ref at all, and 'fdget()' was
+	 * overkill, but given our lockless file pointer lookup, the
+	 * alternatives are complicated.
+	 */
+	fdput(f);
+	return f.file == filp;
+}
+
 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		struct file *filp)
 {
@@ -342,6 +361,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_DUPFD_CLOEXEC:
 		err = f_dupfd(argi, filp, O_CLOEXEC);
 		break;
+	case F_DUPFD_QUERY:
+		err = f_dupfd_query(argi, filp);
+		break;
 	case F_GETFD:
 		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
 		break;
@@ -446,6 +468,7 @@ static int check_fcntl_cmd(unsigned cmd)
 	switch (cmd) {
 	case F_DUPFD:
 	case F_DUPFD_CLOEXEC:
+	case F_DUPFD_QUERY:
 	case F_GETFD:
 	case F_SETFD:
 	case F_GETFL:
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 282e90aeb163..c0bcc185fa48 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -8,6 +8,14 @@
 #define F_SETLEASE	(F_LINUX_SPECIFIC_BASE + 0)
 #define F_GETLEASE	(F_LINUX_SPECIFIC_BASE + 1)
 
+/*
+ * Request nofications on a directory.
+ * See below for events that may be notified.
+ */
+#define F_NOTIFY	(F_LINUX_SPECIFIC_BASE + 2)
+
+#define F_DUPFD_QUERY	(F_LINUX_SPECIFIC_BASE + 3)
+
 /*
  * Cancel a blocking posix lock; internal use only until we expose an
  * asynchronous lock api to userspace:
@@ -17,12 +25,6 @@
 /* Create a file descriptor with FD_CLOEXEC set. */
 #define F_DUPFD_CLOEXEC	(F_LINUX_SPECIFIC_BASE + 6)
 
-/*
- * Request nofications on a directory.
- * See below for events that may be notified.
- */
-#define F_NOTIFY	(F_LINUX_SPECIFIC_BASE+2)
-
 /*
  * Set and get of pipe page size array
  */

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-08 16:19                           ` Linus Torvalds
@ 2024-05-08 17:14                             ` Linus Torvalds
  2024-05-09 11:38                               ` Christian Brauner
  0 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2024-05-08 17:14 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Simon Ser, Pekka Paalanen, Christian König,
	Christian Brauner, Al Viro, keescook, axboe, christian.koenig,
	dri-devel, io-uring, jack, laura, linaro-mm-sig, linux-fsdevel,
	linux-kernel, linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Wed, 8 May 2024 at 09:19, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So since we already have two versions of F_DUPFD (the other being
> F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend
> on that existing naming pattern, and called it F_DUPFD_QUERY instead.
>
> I'm not married to the name, so if somebody hates it, feel free to
> argue otherwise.

Side note: with this patch, doing

   ret = fcntl(fd1, F_DUPFD_QUERY, fd2);

will result in:

 -1 (EBADF): 'fd1' is not a valid file descriptor
 -1 (EINVAL): old kernel that doesn't support F_DUPFD_QUERY
 0: fd2 does not refer to the same file as fd1
 1: fd2 is the same 'struct file' as fd1

and it might be worth noting a couple of things here:

 (a) fd2 being an invalid file descriptor does not cause EBADF, it
just causes "does not match".

 (b) we *could* use more bits for more equality

IOW, it would possibly make sense to extend the 0/1 result to be

- bit #0: same file pointer
- bit #1: same path
- bit #2: same dentry
- bit #3: same inode

which are all different levels of "sameness".

Does anybody care? Do we want to extend on this "sameness"? I'm not
convinced, but it might be a good idea to document this as a possibly
future extension, ie *if* what you care about is "same file pointer",
maybe you should make sure to only look at bit #0.

               Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-08 17:14                             ` Linus Torvalds
@ 2024-05-09 11:38                               ` Christian Brauner
  2024-05-09 15:48                                 ` Linus Torvalds
  0 siblings, 1 reply; 90+ messages in thread
From: Christian Brauner @ 2024-05-09 11:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Vetter, Simon Ser, Pekka Paalanen, Christian König,
	Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Wed, May 08, 2024 at 10:14:44AM -0700, Linus Torvalds wrote:
> On Wed, 8 May 2024 at 09:19, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So since we already have two versions of F_DUPFD (the other being
> > F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend
> > on that existing naming pattern, and called it F_DUPFD_QUERY instead.
> >
> > I'm not married to the name, so if somebody hates it, feel free to
> > argue otherwise.
> 
> Side note: with this patch, doing
> 
>    ret = fcntl(fd1, F_DUPFD_QUERY, fd2);
> 
> will result in:
> 
>  -1 (EBADF): 'fd1' is not a valid file descriptor
>  -1 (EINVAL): old kernel that doesn't support F_DUPFD_QUERY
>  0: fd2 does not refer to the same file as fd1
>  1: fd2 is the same 'struct file' as fd1
> 
> and it might be worth noting a couple of things here:
> 
>  (a) fd2 being an invalid file descriptor does not cause EBADF, it
> just causes "does not match".
> 
>  (b) we *could* use more bits for more equality
> 
> IOW, it would possibly make sense to extend the 0/1 result to be
> 
> - bit #0: same file pointer
> - bit #1: same path
> - bit #2: same dentry
> - bit #3: same inode
> 
> which are all different levels of "sameness".

Not worth it without someone explaining in detail why imho. First pass
should be to try and replace kcmp() in scenarios where it's obviously
not needed or overkill.

I've added a CLASS(fd_raw) in a preliminary patch since we'll need that
anyway which means that your comparison patch becomes even simpler imho.
I've also added a selftest patch:

https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc

?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-09 11:38                               ` Christian Brauner
@ 2024-05-09 15:48                                 ` Linus Torvalds
  2024-05-10  6:33                                   ` Christian Brauner
  0 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2024-05-09 15:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Daniel Vetter, Simon Ser, Pekka Paalanen, Christian König,
	Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Thu, 9 May 2024 at 04:39, Christian Brauner <brauner@kernel.org> wrote:
>
> Not worth it without someone explaining in detail why imho. First pass
> should be to try and replace kcmp() in scenarios where it's obviously
> not needed or overkill.

Ack.

> I've added a CLASS(fd_raw) in a preliminary patch since we'll need that
> anyway which means that your comparison patch becomes even simpler imho.
> I've also added a selftest patch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc

LGTM.

Maybe worth adding an explicit test for "open same file, but two
separate opens, F_DUPFD_QUERY returns 0? Just to clarify the "it's not
testing the file on the filesystem for equality, but the file pointer
itself".

             Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-09 15:48                                 ` Linus Torvalds
@ 2024-05-10  6:33                                   ` Christian Brauner
  0 siblings, 0 replies; 90+ messages in thread
From: Christian Brauner @ 2024-05-10  6:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Vetter, Simon Ser, Pekka Paalanen, Christian König,
	Al Viro, keescook, axboe, christian.koenig, dri-devel, io-uring,
	jack, laura, linaro-mm-sig, linux-fsdevel, linux-kernel,
	linux-media, minhquangbui99, sumit.semwal,
	syzbot+045b454ab35fd82a35fb, syzkaller-bugs

On Thu, May 09, 2024 at 08:48:20AM -0700, Linus Torvalds wrote:
> On Thu, 9 May 2024 at 04:39, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Not worth it without someone explaining in detail why imho. First pass
> > should be to try and replace kcmp() in scenarios where it's obviously
> > not needed or overkill.
> 
> Ack.
> 
> > I've added a CLASS(fd_raw) in a preliminary patch since we'll need that
> > anyway which means that your comparison patch becomes even simpler imho.
> > I've also added a selftest patch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc
> 
> LGTM.
> 
> Maybe worth adding an explicit test for "open same file, but two
> separate opens, F_DUPFD_QUERY returns 0? Just to clarify the "it's not
> testing the file on the filesystem for equality, but the file pointer
> itself".

Yep, good point. Added now.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-08 15:45                     ` Daniel Vetter
@ 2024-05-10 10:55                       ` Christian Brauner
  2024-05-11 18:25                         ` David Laight
  0 siblings, 1 reply; 90+ messages in thread
From: Christian Brauner @ 2024-05-10 10:55 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, Linus Torvalds, Al Viro, keescook, axboe,
	christian.koenig, dri-devel, io-uring, jack, laura,
	linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs

> For the uapi issue you describe below my take would be that we should just
> try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe
> I'm biased from the gpu world, where we've been hammering it in that
> "O_CLOEXEC or bust" mantra since well over a decade. Really the only valid

Oh, we're very much on the same page. All new file descriptor types that
I've added over the years are O_CLOEXEC by default. IOW, you need to
remove O_CLOEXEC explicitly (see pidfd as an example). And imho, any new
fd type that's added should just be O_CLOEXEC by default.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* RE: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
  2024-05-10 10:55                       ` Christian Brauner
@ 2024-05-11 18:25                         ` David Laight
  0 siblings, 0 replies; 90+ messages in thread
From: David Laight @ 2024-05-11 18:25 UTC (permalink / raw)
  To: 'Christian Brauner', Daniel Vetter
  Cc: Christian König, Linus Torvalds, Al Viro, keescook, axboe,
	christian.koenig, dri-devel, io-uring, jack, laura,
	linaro-mm-sig, linux-fsdevel, linux-kernel, linux-media,
	minhquangbui99, sumit.semwal, syzbot+045b454ab35fd82a35fb,
	syzkaller-bugs

From: Christian Brauner
> Sent: 10 May 2024 11:55
> 
> > For the uapi issue you describe below my take would be that we should just
> > try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe
> > I'm biased from the gpu world, where we've been hammering it in that
> > "O_CLOEXEC or bust" mantra since well over a decade. Really the only valid
> 
> Oh, we're very much on the same page. All new file descriptor types that
> I've added over the years are O_CLOEXEC by default. IOW, you need to
> remove O_CLOEXEC explicitly (see pidfd as an example). And imho, any new
> fd type that's added should just be O_CLOEXEC by default.

For fd a shell redirect creates you may want so be able to say
'this fd will have O_CLOEXEC set after the next exec'.
Also (possibly) a flag that can't be cleared once set and that
gets kept by dup() etc.
But maybe that is excessive?

I've certainly used:
# ip netns exec ns command 3</sys/class/net
in order to be able to (easily) read status for interfaces in the
default namespace and a specific namespace.
The would be hard if the O_CLOEXEC flag had got set by default.
(Especially without a shell builtin to clear it.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 90+ messages in thread

end of thread, other threads:[~2024-05-11 18:25 UTC | newest]

Thread overview: 90+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-08  8:26 [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove syzbot
2024-04-15 14:31 ` Jens Axboe
2024-04-15 14:57   ` Pavel Begunkov
2024-05-03 11:54 ` Bui Quang Minh
2024-05-03 18:26   ` get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove) Kees Cook
2024-05-03 18:49     ` Jens Axboe
2024-05-03 19:22       ` Kees Cook
2024-05-03 19:35         ` Jens Axboe
2024-05-03 19:59           ` Kees Cook
2024-05-03 20:28             ` Kees Cook
2024-05-03 21:11               ` Al Viro
2024-05-03 21:24                 ` Linus Torvalds
2024-05-03 21:30                   ` Al Viro
2024-05-06 17:46                   ` Stefan Metzmacher
2024-05-06 18:17                     ` Linus Torvalds
2024-05-08  8:47                       ` David Laight
2024-05-03 21:36                 ` Al Viro
2024-05-03 21:42                   ` Linus Torvalds
2024-05-03 21:53                     ` Al Viro
2024-05-06 12:23                       ` Daniel Vetter
2024-05-04  9:59             ` Christian Brauner
2024-05-03 21:11     ` [PATCH] epoll: try to be a _bit_ better about file lifetimes Linus Torvalds
2024-05-03 21:24       ` Al Viro
2024-05-03 21:33         ` Linus Torvalds
2024-05-03 21:45           ` Al Viro
2024-05-03 21:52             ` Linus Torvalds
2024-05-03 22:01               ` Al Viro
2024-05-03 22:07                 ` Al Viro
2024-05-03 23:16                   ` Linus Torvalds
2024-05-03 23:39                     ` Al Viro
2024-05-03 23:54                       ` Linus Torvalds
2024-05-04 10:44                       ` Christian Brauner
2024-05-03 22:46               ` Kees Cook
2024-05-03 23:03                 ` Al Viro
2024-05-03 23:23                   ` Kees Cook
2024-05-03 23:41                     ` Linus Torvalds
2024-05-04  9:19                       ` Christian Brauner
2024-05-06 12:37                       ` Daniel Vetter
2024-05-04  9:37           ` Christian Brauner
2024-05-04 15:32             ` Linus Torvalds
2024-05-04 15:40               ` Linus Torvalds
2024-05-04 15:53                 ` Linus Torvalds
2024-05-05 19:46                   ` Al Viro
2024-05-05 20:03                     ` Linus Torvalds
2024-05-05 20:30                       ` Al Viro
2024-05-05 20:53                         ` Linus Torvalds
2024-05-06 12:47                           ` Daniel Vetter
2024-05-06 14:46                             ` Christian Brauner
2024-05-07 10:58                               ` Daniel Vetter
2024-05-06 16:15                           ` Christian König
2024-05-05 10:50                 ` Christian Brauner
2024-05-05 16:46                   ` Linus Torvalds
2024-05-05 17:55                     ` [PATCH v2] epoll: be " Linus Torvalds
2024-05-05 18:04                       ` Jens Axboe
2024-05-05 20:01                       ` David Laight
2024-05-05 20:16                         ` Linus Torvalds
2024-05-05 20:12                     ` [PATCH] epoll: try to be a _bit_ " Al Viro
2024-05-06  8:45                     ` Christian Brauner
2024-05-06  9:26                       ` Christian Brauner
2024-05-06 14:19                         ` Christian Brauner
2024-05-07 21:02                       ` David Laight
2024-05-04 18:20               ` Linus Torvalds
2024-05-06 14:29                 ` [Linaro-mm-sig] " Christian König
2024-05-07 11:02                   ` Daniel Vetter
2024-05-07 16:46                     ` Linus Torvalds
2024-05-07 17:45                       ` Christian König
2024-05-08  7:51                         ` Michel Dänzer
2024-05-08  7:59                           ` Christian König
2024-05-08  8:23                         ` Christian Brauner
2024-05-08  9:10                           ` Christian König
2024-05-07 18:04                       ` Daniel Vetter
2024-05-07 19:07                         ` Linus Torvalds
2024-05-08  5:55                           ` Christian König
2024-05-08  8:32                             ` Daniel Vetter
2024-05-08 10:16                               ` Christian Brauner
2024-05-08  8:05                           ` Christian Brauner
2024-05-08 16:19                           ` Linus Torvalds
2024-05-08 17:14                             ` Linus Torvalds
2024-05-09 11:38                               ` Christian Brauner
2024-05-09 15:48                                 ` Linus Torvalds
2024-05-10  6:33                                   ` Christian Brauner
2024-05-08 10:08                   ` Christian Brauner
2024-05-08 15:45                     ` Daniel Vetter
2024-05-10 10:55                       ` Christian Brauner
2024-05-11 18:25                         ` David Laight
2024-05-04  9:25         ` Hillf Danton
2024-05-05 17:31       ` Jens Axboe
2024-05-04  9:45     ` get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove) Christian Brauner
2024-05-04  3:23 ` [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove Hillf Danton
2024-05-04  3:46   ` [syzbot] [fs] " syzbot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.