* [syzbot] KASAN: use-after-free Read in remove_wait_queue (3)
@ 2021-11-16 9:23 syzbot
2021-12-10 22:42 ` syzbot
[not found] ` <20211211015620.1793-1-hdanton@sina.com>
0 siblings, 2 replies; 16+ messages in thread
From: syzbot @ 2021-11-16 9:23 UTC (permalink / raw)
To: linux-fsdevel, linux-kernel, netdev, syzkaller-bugs, viro
Hello,
syzbot found the following issue on:
HEAD commit: cc0356d6a02e Merge tag 'x86_core_for_v5.16_rc1' of git://g..
git tree: net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=102e7ce6b00000
kernel config: https://syzkaller.appspot.com/x/.config?x=a5d447cdc3ae81d9
dashboard link: https://syzkaller.appspot.com/bug?extid=cdb5dd11c97cc532efad
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
Unfortunately, I don't have any reproducer for this issue yet.
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com
==================================================================
BUG: KASAN: use-after-free in __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4885
Read of size 8 at addr ffff888032563540 by task syz-executor.2/8869
CPU: 1 PID: 8869 Comm: syz-executor.2 Not tainted 5.15.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x6c/0x309 mm/kasan/report.c:256
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
__lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4885
lock_acquire kernel/locking/lockdep.c:5625 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
remove_wait_queue+0x1d/0x180 kernel/sched/wait.c:55
ep_remove_wait_queue+0x88/0x1a0 fs/eventpoll.c:545
ep_unregister_pollwait fs/eventpoll.c:561 [inline]
ep_free+0x18a/0x390 fs/eventpoll.c:756
ep_eventpoll_release+0x41/0x60 fs/eventpoll.c:788
__fput+0x286/0x9f0 fs/file_table.c:280
task_work_run+0xdd/0x1a0 kernel/task_work.c:164
exit_task_work include/linux/task_work.h:32 [inline]
do_exit+0xbaa/0x2a20 kernel/exit.c:826
do_group_exit+0x125/0x310 kernel/exit.c:923
get_signal+0x47d/0x21d0 kernel/signal.c:2855
arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
handle_signal_work kernel/entry/common.c:148 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
__syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f8b46801ae9
Code: Unable to access opcode bytes at RIP 0x7f8b46801abf.
RSP: 002b:00007f8b43d77218 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: 0000000000000000 RBX: 00007f8b46914f68 RCX: 00007f8b46801ae9
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f8b46914f68
RBP: 00007f8b46914f60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8b46914f6c
R13: 00007fff0432647f R14: 00007f8b43d77300 R15: 0000000000022000
</TASK>
Allocated by task 8869:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:434 [inline]
____kasan_kmalloc mm/kasan/common.c:513 [inline]
____kasan_kmalloc mm/kasan/common.c:472 [inline]
__kasan_kmalloc+0xa4/0xd0 mm/kasan/common.c:522
kmalloc include/linux/slab.h:591 [inline]
psi_trigger_create.part.0+0x15e/0x7f0 kernel/sched/psi.c:1141
cgroup_pressure_write+0x15d/0x6b0 kernel/cgroup/cgroup.c:3622
cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3829
kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
call_write_iter include/linux/fs.h:2161 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x12d/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Freed by task 8869:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_set_track+0x1c/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free mm/kasan/common.c:328 [inline]
__kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:1700 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1726
slab_free mm/slub.c:3492 [inline]
kfree+0xf3/0x550 mm/slub.c:4552
cgroup_pressure_write+0x18d/0x6b0 kernel/cgroup/cgroup.c:3628
cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3829
kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
call_write_iter include/linux/fs.h:2161 [inline]
new_sync_write+0x429/0x660 fs/read_write.c:503
vfs_write+0x7cd/0xae0 fs/read_write.c:590
ksys_write+0x12d/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Last potentially related work creation:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_record_aux_stack+0xe9/0x110 mm/kasan/generic.c:348
insert_work+0x48/0x370 kernel/workqueue.c:1353
__queue_work+0x5ca/0xee0 kernel/workqueue.c:1519
queue_work_on+0xee/0x110 kernel/workqueue.c:1546
queue_work include/linux/workqueue.h:501 [inline]
call_usermodehelper_exec+0x1f0/0x4c0 kernel/umh.c:435
kobject_uevent_env+0xf8f/0x1650 lib/kobject_uevent.c:618
rx_queue_add_kobject net/core/net-sysfs.c:1069 [inline]
net_rx_queue_update_kobjects+0xf8/0x500 net/core/net-sysfs.c:1109
register_queue_kobjects net/core/net-sysfs.c:1766 [inline]
netdev_register_kobject+0x275/0x430 net/core/net-sysfs.c:2014
register_netdevice+0xd31/0x1500 net/core/dev.c:10330
__ip_tunnel_create+0x398/0x5c0 net/ipv4/ip_tunnel.c:267
ip_tunnel_init_net+0x2e4/0x9d0 net/ipv4/ip_tunnel.c:1070
ops_init+0xaf/0x470 net/core/net_namespace.c:140
setup_net+0x40f/0xa30 net/core/net_namespace.c:326
copy_net_ns+0x319/0x760 net/core/net_namespace.c:470
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
ksys_unshare+0x445/0x920 kernel/fork.c:3076
__do_sys_unshare kernel/fork.c:3150 [inline]
__se_sys_unshare kernel/fork.c:3148 [inline]
__x64_sys_unshare+0x2d/0x40 kernel/fork.c:3148
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Second to last potentially related work creation:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_record_aux_stack+0xe9/0x110 mm/kasan/generic.c:348
insert_work+0x48/0x370 kernel/workqueue.c:1353
__queue_work+0x5ca/0xee0 kernel/workqueue.c:1519
queue_work_on+0xee/0x110 kernel/workqueue.c:1546
queue_work include/linux/workqueue.h:501 [inline]
call_usermodehelper_exec+0x1f0/0x4c0 kernel/umh.c:435
kobject_uevent_env+0xf8f/0x1650 lib/kobject_uevent.c:618
rx_queue_add_kobject net/core/net-sysfs.c:1069 [inline]
net_rx_queue_update_kobjects+0xf8/0x500 net/core/net-sysfs.c:1109
register_queue_kobjects net/core/net-sysfs.c:1766 [inline]
netdev_register_kobject+0x275/0x430 net/core/net-sysfs.c:2014
register_netdevice+0xd31/0x1500 net/core/dev.c:10330
__ip_tunnel_create+0x398/0x5c0 net/ipv4/ip_tunnel.c:267
ip_tunnel_init_net+0x2e4/0x9d0 net/ipv4/ip_tunnel.c:1070
vti_init_net+0x2a/0x370 net/ipv4/ip_vti.c:504
ops_init+0xaf/0x470 net/core/net_namespace.c:140
setup_net+0x40f/0xa30 net/core/net_namespace.c:326
copy_net_ns+0x319/0x760 net/core/net_namespace.c:470
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
ksys_unshare+0x445/0x920 kernel/fork.c:3076
__do_sys_unshare kernel/fork.c:3150 [inline]
__se_sys_unshare kernel/fork.c:3148 [inline]
__x64_sys_unshare+0x2d/0x40 kernel/fork.c:3148
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff888032563500
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 64 bytes inside of
192-byte region [ffff888032563500, ffff8880325635c0)
The buggy address belongs to the page:
page:ffffea0000c958c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x32563
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 ffffea0001ef2780 0000000600000002 ffff888010c41a00
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112cc0(GFP_USER|__GFP_NOWARN|__GFP_NORETRY), pid 3310, ts 611411618313, free_ts 611397869806
prep_new_page mm/page_alloc.c:2426 [inline]
get_page_from_freelist+0xa72/0x2f80 mm/page_alloc.c:4155
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5381
alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
alloc_slab_page mm/slub.c:1770 [inline]
allocate_slab mm/slub.c:1907 [inline]
new_slab+0x319/0x490 mm/slub.c:1970
___slab_alloc+0x921/0xfe0 mm/slub.c:3001
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3088
slab_alloc_node mm/slub.c:3179 [inline]
slab_alloc mm/slub.c:3221 [inline]
__kmalloc_track_caller+0x2ef/0x310 mm/slub.c:4916
__do_krealloc mm/slab_common.c:1208 [inline]
krealloc+0x87/0xf0 mm/slab_common.c:1241
push_jmp_history kernel/bpf/verifier.c:2278 [inline]
is_state_visited kernel/bpf/verifier.c:10967 [inline]
do_check kernel/bpf/verifier.c:11107 [inline]
do_check_common+0x3521/0xcf50 kernel/bpf/verifier.c:13374
do_check_main kernel/bpf/verifier.c:13437 [inline]
bpf_check+0x87ca/0xbc90 kernel/bpf/verifier.c:14004
bpf_prog_load+0xf4c/0x21e0 kernel/bpf/syscall.c:2329
__sys_bpf+0x67e/0x5f00 kernel/bpf/syscall.c:4618
__do_sys_bpf kernel/bpf/syscall.c:4722 [inline]
__se_sys_bpf kernel/bpf/syscall.c:4720 [inline]
__x64_sys_bpf+0x75/0xb0 kernel/bpf/syscall.c:4720
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1340 [inline]
free_pcp_prepare+0x326/0x810 mm/page_alloc.c:1391
free_unref_page_prepare mm/page_alloc.c:3317 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3396
__vunmap+0x781/0xb70 mm/vmalloc.c:2621
__vfree+0x3c/0xd0 mm/vmalloc.c:2669
vfree+0x5a/0x90 mm/vmalloc.c:2700
bpf_jit_free+0xbb/0x1c0
bpf_prog_free_deferred+0x5c1/0x790 kernel/bpf/core.c:2292
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2297
worker_thread+0x658/0x11f0 kernel/workqueue.c:2444
kthread+0x405/0x4f0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
Memory state around the buggy address:
ffff888032563400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888032563480: 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888032563500: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888032563580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff888032563600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [syzbot] KASAN: use-after-free Read in remove_wait_queue (3) 2021-11-16 9:23 [syzbot] KASAN: use-after-free Read in remove_wait_queue (3) syzbot @ 2021-12-10 22:42 ` syzbot [not found] ` <20211211015620.1793-1-hdanton@sina.com> 1 sibling, 0 replies; 16+ messages in thread From: syzbot @ 2021-12-10 22:42 UTC (permalink / raw) To: linux-fsdevel, linux-kernel, netdev, syzkaller-bugs, viro syzbot has found a reproducer for the following issue on: HEAD commit: e5d75fc20b92 sh_eth: Use dev_err_probe() helper git tree: net-next console output: https://syzkaller.appspot.com/x/log.txt?x=1540cdceb00000 kernel config: https://syzkaller.appspot.com/x/.config?x=24fd48984584829b dashboard link: https://syzkaller.appspot.com/bug?extid=cdb5dd11c97cc532efad compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15de00bab00000 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15ad646db00000 IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com ================================================================== BUG: KASAN: use-after-free in __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897 Read of size 8 at addr ffff888015be3740 by task syz-executor161/3598 CPU: 1 PID: 3598 Comm: syz-executor161 Not tainted 5.16.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247 __kasan_report mm/kasan/report.c:433 [inline] kasan_report.cold+0x83/0xdf mm/kasan/report.c:450 __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897 lock_acquire kernel/locking/lockdep.c:5637 [inline] lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162 remove_wait_queue+0x1d/0x180 kernel/sched/wait.c:55 ep_remove_wait_queue+0x88/0x1a0 fs/eventpoll.c:545 ep_unregister_pollwait fs/eventpoll.c:561 [inline] ep_remove+0x106/0x9c0 fs/eventpoll.c:690 eventpoll_release_file+0xe1/0x130 fs/eventpoll.c:923 eventpoll_release include/linux/eventpoll.h:53 [inline] __fput+0x87b/0x9f0 fs/file_table.c:271 task_work_run+0xdd/0x1a0 kernel/task_work.c:164 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:175 [inline] exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f3167c0def3 Code: c7 c2 c0 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8 RSP: 002b:00007ffddef2e488 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007f3167c0def3 RDX: 000000000000002f RSI: 0000000020001340 RDI: 0000000000000004 RBP: 0000000000000000 R08: 0000000000000014 R09: 00007ffddef2e4b0 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffddef2e4ac R13: 00007ffddef2e4c0 R14: 00007ffddef2e500 R15: 0000000000000000 </TASK> Allocated by task 3598: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:434 [inline] ____kasan_kmalloc mm/kasan/common.c:513 [inline] ____kasan_kmalloc mm/kasan/common.c:472 [inline] __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:522 kmalloc include/linux/slab.h:590 [inline] psi_trigger_create.part.0+0x15e/0x7f0 kernel/sched/psi.c:1141 cgroup_pressure_write+0x15d/0x6b0 kernel/cgroup/cgroup.c:3645 cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852 kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296 call_write_iter include/linux/fs.h:2162 [inline] new_sync_write+0x429/0x660 fs/read_write.c:503 vfs_write+0x7cd/0xae0 fs/read_write.c:590 ksys_write+0x12d/0x250 fs/read_write.c:643 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 3598: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 kasan_set_track+0x21/0x30 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370 ____kasan_slab_free mm/kasan/common.c:366 [inline] ____kasan_slab_free mm/kasan/common.c:328 [inline] __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374 kasan_slab_free include/linux/kasan.h:235 [inline] slab_free_hook mm/slub.c:1723 [inline] slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749 slab_free mm/slub.c:3513 [inline] kfree+0xf6/0x560 mm/slub.c:4561 cgroup_pressure_write+0x18d/0x6b0 kernel/cgroup/cgroup.c:3651 cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852 kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296 call_write_iter include/linux/fs.h:2162 [inline] new_sync_write+0x429/0x660 fs/read_write.c:503 vfs_write+0x7cd/0xae0 fs/read_write.c:590 ksys_write+0x12d/0x250 fs/read_write.c:643 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff888015be3700 which belongs to the cache kmalloc-192 of size 192 The buggy address is located 64 bytes inside of 192-byte region [ffff888015be3700, ffff888015be37c0) The buggy address belongs to the page: page:ffffea000056f8c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x15be3 flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000000200 0000000000000000 dead000000000001 ffff888010c41a00 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 1, ts 1983850449, free_ts 0 prep_new_page mm/page_alloc.c:2418 [inline] get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369 alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2036 alloc_pages+0x29f/0x300 mm/mempolicy.c:2186 alloc_slab_page mm/slub.c:1793 [inline] allocate_slab mm/slub.c:1930 [inline] new_slab+0x32d/0x4a0 mm/slub.c:1993 ___slab_alloc+0x918/0xfe0 mm/slub.c:3022 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109 slab_alloc_node mm/slub.c:3200 [inline] slab_alloc mm/slub.c:3242 [inline] kmem_cache_alloc_trace+0x289/0x2c0 mm/slub.c:3259 kmalloc include/linux/slab.h:590 [inline] kzalloc include/linux/slab.h:724 [inline] call_usermodehelper_setup+0x97/0x340 kernel/umh.c:365 kobject_uevent_env+0xf73/0x1650 lib/kobject_uevent.c:614 version_sysfs_builtin kernel/params.c:878 [inline] param_sysfs_init+0x146/0x43b kernel/params.c:969 do_one_initcall+0x103/0x650 init/main.c:1297 do_initcall_level init/main.c:1370 [inline] do_initcalls init/main.c:1386 [inline] do_basic_setup init/main.c:1405 [inline] kernel_init_freeable+0x6b1/0x73a init/main.c:1610 kernel_init+0x1a/0x1d0 init/main.c:1499 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 page_owner free stack trace missing Memory state around the buggy address: ffff888015be3600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888015be3680: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc >ffff888015be3700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888015be3780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff888015be3800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <20211211015620.1793-1-hdanton@sina.com>]
* Re: [syzbot] KASAN: use-after-free Read in remove_wait_queue (3) [not found] ` <20211211015620.1793-1-hdanton@sina.com> @ 2021-12-11 3:00 ` Eric Biggers 2022-01-05 15:20 ` psi_trigger_poll() is completely broken Eric Biggers 0 siblings, 1 reply; 16+ messages in thread From: Eric Biggers @ 2021-12-11 3:00 UTC (permalink / raw) To: Johannes Weiner, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton Cc: syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs, linux-mm, Linus Torvalds On Sat, Dec 11, 2021 at 09:56:20AM +0800, Hillf Danton wrote: > On Fri, 10 Dec 2021 14:42:22 -0800 > > syzbot has found a reproducer for the following issue on: > > > > HEAD commit: e5d75fc20b92 sh_eth: Use dev_err_probe() helper > > git tree: net-next > > console output: https://syzkaller.appspot.com/x/log.txt?x=1540cdceb00000 > > kernel config: https://syzkaller.appspot.com/x/.config?x=24fd48984584829b > > dashboard link: https://syzkaller.appspot.com/bug?extid=cdb5dd11c97cc532efad > > compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2 > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15de00bab00000 > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15ad646db00000 > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com > > > > ================================================================== > > BUG: KASAN: use-after-free in __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897 > > Read of size 8 at addr ffff888015be3740 by task syz-executor161/3598 > > > > CPU: 1 PID: 3598 Comm: syz-executor161 Not tainted 5.16.0-rc4-syzkaller #0 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 > > Call Trace: > > <TASK> > > __dump_stack lib/dump_stack.c:88 [inline] > > dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 > > print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247 > > __kasan_report mm/kasan/report.c:433 [inline] > > kasan_report.cold+0x83/0xdf mm/kasan/report.c:450 > > __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897 > > lock_acquire kernel/locking/lockdep.c:5637 [inline] > > lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602 > > __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] > > _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162 > > remove_wait_queue+0x1d/0x180 kernel/sched/wait.c:55 > > ep_remove_wait_queue+0x88/0x1a0 fs/eventpoll.c:545 > > ep_unregister_pollwait fs/eventpoll.c:561 [inline] > > ep_remove+0x106/0x9c0 fs/eventpoll.c:690 > > eventpoll_release_file+0xe1/0x130 fs/eventpoll.c:923 > > eventpoll_release include/linux/eventpoll.h:53 [inline] > > __fput+0x87b/0x9f0 fs/file_table.c:271 > > task_work_run+0xdd/0x1a0 kernel/task_work.c:164 > > tracehook_notify_resume include/linux/tracehook.h:189 [inline] > > exit_to_user_mode_loop kernel/entry/common.c:175 [inline] > > exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207 > > __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] > > syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 > > do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 > > entry_SYSCALL_64_after_hwframe+0x44/0xae > > RIP: 0033:0x7f3167c0def3 > > Code: c7 c2 c0 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8 > > RSP: 002b:00007ffddef2e488 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 > > RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007f3167c0def3 > > RDX: 000000000000002f RSI: 0000000020001340 RDI: 0000000000000004 > > RBP: 0000000000000000 R08: 0000000000000014 R09: 00007ffddef2e4b0 > > R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffddef2e4ac > > R13: 00007ffddef2e4c0 R14: 00007ffddef2e500 R15: 0000000000000000 > > </TASK> > > > > Allocated by task 3598: > > kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 > > kasan_set_track mm/kasan/common.c:46 [inline] > > set_alloc_info mm/kasan/common.c:434 [inline] > > ____kasan_kmalloc mm/kasan/common.c:513 [inline] > > ____kasan_kmalloc mm/kasan/common.c:472 [inline] > > __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:522 > > kmalloc include/linux/slab.h:590 [inline] > > psi_trigger_create.part.0+0x15e/0x7f0 kernel/sched/psi.c:1141 > > cgroup_pressure_write+0x15d/0x6b0 kernel/cgroup/cgroup.c:3645 > > cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852 > > kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296 > > call_write_iter include/linux/fs.h:2162 [inline] > > new_sync_write+0x429/0x660 fs/read_write.c:503 > > vfs_write+0x7cd/0xae0 fs/read_write.c:590 > > ksys_write+0x12d/0x250 fs/read_write.c:643 > > do_syscall_x64 arch/x86/entry/common.c:50 [inline] > > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 > > entry_SYSCALL_64_after_hwframe+0x44/0xae > > > > Freed by task 3598: > > kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 > > kasan_set_track+0x21/0x30 mm/kasan/common.c:46 > > kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370 > > ____kasan_slab_free mm/kasan/common.c:366 [inline] > > ____kasan_slab_free mm/kasan/common.c:328 [inline] > > __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374 > > kasan_slab_free include/linux/kasan.h:235 [inline] > > slab_free_hook mm/slub.c:1723 [inline] > > slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749 > > slab_free mm/slub.c:3513 [inline] > > kfree+0xf6/0x560 mm/slub.c:4561 > > cgroup_pressure_write+0x18d/0x6b0 kernel/cgroup/cgroup.c:3651 > > cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852 > > kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296 > > call_write_iter include/linux/fs.h:2162 [inline] > > new_sync_write+0x429/0x660 fs/read_write.c:503 > > vfs_write+0x7cd/0xae0 fs/read_write.c:590 > > ksys_write+0x12d/0x250 fs/read_write.c:643 > > do_syscall_x64 arch/x86/entry/common.c:50 [inline] > > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 > > entry_SYSCALL_64_after_hwframe+0x44/0xae > > > > The buggy address belongs to the object at ffff888015be3700 > > which belongs to the cache kmalloc-192 of size 192 > > The buggy address is located 64 bytes inside of > > 192-byte region [ffff888015be3700, ffff888015be37c0) > > The buggy address belongs to the page: > > page:ffffea000056f8c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x15be3 > > flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff) > > raw: 00fff00000000200 0000000000000000 dead000000000001 ffff888010c41a00 > > raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 > > page dumped because: kasan: bad access detected > > page_owner tracks the page as allocated > > page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 1, ts 1983850449, free_ts 0 > > prep_new_page mm/page_alloc.c:2418 [inline] > > get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149 > > __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369 > > alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2036 > > alloc_pages+0x29f/0x300 mm/mempolicy.c:2186 > > alloc_slab_page mm/slub.c:1793 [inline] > > allocate_slab mm/slub.c:1930 [inline] > > new_slab+0x32d/0x4a0 mm/slub.c:1993 > > ___slab_alloc+0x918/0xfe0 mm/slub.c:3022 > > __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109 > > slab_alloc_node mm/slub.c:3200 [inline] > > slab_alloc mm/slub.c:3242 [inline] > > kmem_cache_alloc_trace+0x289/0x2c0 mm/slub.c:3259 > > kmalloc include/linux/slab.h:590 [inline] > > kzalloc include/linux/slab.h:724 [inline] > > call_usermodehelper_setup+0x97/0x340 kernel/umh.c:365 > > kobject_uevent_env+0xf73/0x1650 lib/kobject_uevent.c:614 > > version_sysfs_builtin kernel/params.c:878 [inline] > > param_sysfs_init+0x146/0x43b kernel/params.c:969 > > do_one_initcall+0x103/0x650 init/main.c:1297 > > do_initcall_level init/main.c:1370 [inline] > > do_initcalls init/main.c:1386 [inline] > > do_basic_setup init/main.c:1405 [inline] > > kernel_init_freeable+0x6b1/0x73a init/main.c:1610 > > kernel_init+0x1a/0x1d0 init/main.c:1499 > > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 > > page_owner free stack trace missing > > > > Memory state around the buggy address: > > ffff888015be3600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > ffff888015be3680: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc > > >ffff888015be3700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ^ > > ffff888015be3780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc > > ffff888015be3800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > ================================================================== > > Hey Eric > > Let us know if this uaf adds another call site for what you added [1]. > > Hillf > > [1] https://lore.kernel.org/lkml/20211209010455.42744-2-ebiggers@kernel.org/ > > +++ x/kernel/sched/psi.c > @@ -1193,7 +1193,7 @@ static void psi_trigger_destroy(struct k > * Wakeup waiters to stop polling. Can happen if cgroup is deleted > * from under a polling process. > */ > - wake_up_interruptible(&t->event_wait); > + wake_up_pollfree(&t->event_wait); > > mutex_lock(&group->trigger_lock); [added linux-mm and all maintainers for kernel/sched/psi.c] Well, it is the same sort of issue, but POLLFREE is *not* enough here. POLLFREE only works if the lifetime of waitqueue is tied to the polling task, as blocking polls don't handle it -- only non-blocking polls do. The kernel/sched/psi.c use case is just totally broken, since the lifetime of its waitqueue is totally arbitrary; the open file descriptor can be written to at any time by any process, which causes the waitqueue to be freed. So it will cause a use-after-free even for regular blocking poll(). To fix this, I think the psi trigger stuff will need to be refactored to have just one waitqueue per open file. We need to be removing uses of POLLFREE, not adding new ones. (See Linus' comments on POLLFREE here: https://lore.kernel.org/lkml/CAHk-=wgvt7PH+AU_29H95tJQZ9FnhS8vVmymbhpZ6NZ7yaAigw@mail.gmail.com/) Here are some repros: #include <fcntl.h> #include <sys/epoll.h> #include <unistd.h> int main() { int fd = open("/proc/pressure/cpu", O_RDWR); int epfd = epoll_create(1); const char trigger[] = "some 100000 1000000"; struct epoll_event event = { .events = EPOLLIN }; write(fd, trigger, sizeof(trigger)); epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event); write(fd, trigger, sizeof(trigger)); } #include <fcntl.h> #include <sys/poll.h> #include <unistd.h> int main() { int fd = open("/proc/pressure/cpu", O_RDWR); const char trigger[] = "some 100000 1000000"; if (fork()) { struct pollfd pfd = { .fd = fd, .events = POLLIN }; for (;;) poll(&pfd, 1, -1); } else { for (;;) write(fd, trigger, sizeof(trigger)); } } ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2021-12-11 3:00 ` Eric Biggers @ 2022-01-05 15:20 ` Eric Biggers 2022-01-05 18:50 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Eric Biggers @ 2022-01-05 15:20 UTC (permalink / raw) To: Johannes Weiner, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton Cc: syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs, linux-mm, Linus Torvalds [changed subject line to hopefully get people to stop ignoring this] Please see my message below where I explained the problem in detail. Any response from the maintainers of kernel/sched/psi.c? There are a lot of you: $ ./scripts/get_maintainer.pl kernel/sched/psi.c Johannes Weiner <hannes@cmpxchg.org> (maintainer:PRESSURE STALL INFORMATION (PSI)) Ingo Molnar <mingo@redhat.com> (maintainer:SCHEDULER) Peter Zijlstra <peterz@infradead.org> (maintainer:SCHEDULER) Juri Lelli <juri.lelli@redhat.com> (maintainer:SCHEDULER) Vincent Guittot <vincent.guittot@linaro.org> (maintainer:SCHEDULER) Dietmar Eggemann <dietmar.eggemann@arm.com> (reviewer:SCHEDULER) Steven Rostedt <rostedt@goodmis.org> (reviewer:SCHEDULER) Ben Segall <bsegall@google.com> (reviewer:SCHEDULER) Mel Gorman <mgorman@suse.de> (reviewer:SCHEDULER) Daniel Bristot de Oliveira <bristot@redhat.com> (reviewer:SCHEDULER) linux-kernel@vger.kernel.org (open list:SCHEDULER) On Fri, Dec 10, 2021 at 07:00:26PM -0800, Eric Biggers wrote: > On Sat, Dec 11, 2021 at 09:56:20AM +0800, Hillf Danton wrote: > > On Fri, 10 Dec 2021 14:42:22 -0800 > > > syzbot has found a reproducer for the following issue on: > > > > > > HEAD commit: e5d75fc20b92 sh_eth: Use dev_err_probe() helper > > > git tree: net-next > > > console output: https://syzkaller.appspot.com/x/log.txt?x=1540cdceb00000 > > > kernel config: https://syzkaller.appspot.com/x/.config?x=24fd48984584829b > > > dashboard link: https://syzkaller.appspot.com/bug?extid=cdb5dd11c97cc532efad > > > compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2 > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15de00bab00000 > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15ad646db00000 > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com > > > > > > ================================================================== > > > BUG: KASAN: use-after-free in __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897 > > > Read of size 8 at addr ffff888015be3740 by task syz-executor161/3598 > > > > > > CPU: 1 PID: 3598 Comm: syz-executor161 Not tainted 5.16.0-rc4-syzkaller #0 > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 > > > Call Trace: > > > <TASK> > > > __dump_stack lib/dump_stack.c:88 [inline] > > > dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 > > > print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247 > > > __kasan_report mm/kasan/report.c:433 [inline] > > > kasan_report.cold+0x83/0xdf mm/kasan/report.c:450 > > > __lock_acquire+0x3d86/0x54a0 kernel/locking/lockdep.c:4897 > > > lock_acquire kernel/locking/lockdep.c:5637 [inline] > > > lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602 > > > __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] > > > _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162 > > > remove_wait_queue+0x1d/0x180 kernel/sched/wait.c:55 > > > ep_remove_wait_queue+0x88/0x1a0 fs/eventpoll.c:545 > > > ep_unregister_pollwait fs/eventpoll.c:561 [inline] > > > ep_remove+0x106/0x9c0 fs/eventpoll.c:690 > > > eventpoll_release_file+0xe1/0x130 fs/eventpoll.c:923 > > > eventpoll_release include/linux/eventpoll.h:53 [inline] > > > __fput+0x87b/0x9f0 fs/file_table.c:271 > > > task_work_run+0xdd/0x1a0 kernel/task_work.c:164 > > > tracehook_notify_resume include/linux/tracehook.h:189 [inline] > > > exit_to_user_mode_loop kernel/entry/common.c:175 [inline] > > > exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207 > > > __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] > > > syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 > > > do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 > > > entry_SYSCALL_64_after_hwframe+0x44/0xae > > > RIP: 0033:0x7f3167c0def3 > > > Code: c7 c2 c0 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8 > > > RSP: 002b:00007ffddef2e488 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 > > > RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007f3167c0def3 > > > RDX: 000000000000002f RSI: 0000000020001340 RDI: 0000000000000004 > > > RBP: 0000000000000000 R08: 0000000000000014 R09: 00007ffddef2e4b0 > > > R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffddef2e4ac > > > R13: 00007ffddef2e4c0 R14: 00007ffddef2e500 R15: 0000000000000000 > > > </TASK> > > > > > > Allocated by task 3598: > > > kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 > > > kasan_set_track mm/kasan/common.c:46 [inline] > > > set_alloc_info mm/kasan/common.c:434 [inline] > > > ____kasan_kmalloc mm/kasan/common.c:513 [inline] > > > ____kasan_kmalloc mm/kasan/common.c:472 [inline] > > > __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:522 > > > kmalloc include/linux/slab.h:590 [inline] > > > psi_trigger_create.part.0+0x15e/0x7f0 kernel/sched/psi.c:1141 > > > cgroup_pressure_write+0x15d/0x6b0 kernel/cgroup/cgroup.c:3645 > > > cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852 > > > kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296 > > > call_write_iter include/linux/fs.h:2162 [inline] > > > new_sync_write+0x429/0x660 fs/read_write.c:503 > > > vfs_write+0x7cd/0xae0 fs/read_write.c:590 > > > ksys_write+0x12d/0x250 fs/read_write.c:643 > > > do_syscall_x64 arch/x86/entry/common.c:50 [inline] > > > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 > > > entry_SYSCALL_64_after_hwframe+0x44/0xae > > > > > > Freed by task 3598: > > > kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 > > > kasan_set_track+0x21/0x30 mm/kasan/common.c:46 > > > kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370 > > > ____kasan_slab_free mm/kasan/common.c:366 [inline] > > > ____kasan_slab_free mm/kasan/common.c:328 [inline] > > > __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374 > > > kasan_slab_free include/linux/kasan.h:235 [inline] > > > slab_free_hook mm/slub.c:1723 [inline] > > > slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749 > > > slab_free mm/slub.c:3513 [inline] > > > kfree+0xf6/0x560 mm/slub.c:4561 > > > cgroup_pressure_write+0x18d/0x6b0 kernel/cgroup/cgroup.c:3651 > > > cgroup_file_write+0x1ec/0x780 kernel/cgroup/cgroup.c:3852 > > > kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296 > > > call_write_iter include/linux/fs.h:2162 [inline] > > > new_sync_write+0x429/0x660 fs/read_write.c:503 > > > vfs_write+0x7cd/0xae0 fs/read_write.c:590 > > > ksys_write+0x12d/0x250 fs/read_write.c:643 > > > do_syscall_x64 arch/x86/entry/common.c:50 [inline] > > > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 > > > entry_SYSCALL_64_after_hwframe+0x44/0xae > > > > > > The buggy address belongs to the object at ffff888015be3700 > > > which belongs to the cache kmalloc-192 of size 192 > > > The buggy address is located 64 bytes inside of > > > 192-byte region [ffff888015be3700, ffff888015be37c0) > > > The buggy address belongs to the page: > > > page:ffffea000056f8c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x15be3 > > > flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff) > > > raw: 00fff00000000200 0000000000000000 dead000000000001 ffff888010c41a00 > > > raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 > > > page dumped because: kasan: bad access detected > > > page_owner tracks the page as allocated > > > page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 1, ts 1983850449, free_ts 0 > > > prep_new_page mm/page_alloc.c:2418 [inline] > > > get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149 > > > __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369 > > > alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2036 > > > alloc_pages+0x29f/0x300 mm/mempolicy.c:2186 > > > alloc_slab_page mm/slub.c:1793 [inline] > > > allocate_slab mm/slub.c:1930 [inline] > > > new_slab+0x32d/0x4a0 mm/slub.c:1993 > > > ___slab_alloc+0x918/0xfe0 mm/slub.c:3022 > > > __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109 > > > slab_alloc_node mm/slub.c:3200 [inline] > > > slab_alloc mm/slub.c:3242 [inline] > > > kmem_cache_alloc_trace+0x289/0x2c0 mm/slub.c:3259 > > > kmalloc include/linux/slab.h:590 [inline] > > > kzalloc include/linux/slab.h:724 [inline] > > > call_usermodehelper_setup+0x97/0x340 kernel/umh.c:365 > > > kobject_uevent_env+0xf73/0x1650 lib/kobject_uevent.c:614 > > > version_sysfs_builtin kernel/params.c:878 [inline] > > > param_sysfs_init+0x146/0x43b kernel/params.c:969 > > > do_one_initcall+0x103/0x650 init/main.c:1297 > > > do_initcall_level init/main.c:1370 [inline] > > > do_initcalls init/main.c:1386 [inline] > > > do_basic_setup init/main.c:1405 [inline] > > > kernel_init_freeable+0x6b1/0x73a init/main.c:1610 > > > kernel_init+0x1a/0x1d0 init/main.c:1499 > > > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 > > > page_owner free stack trace missing > > > > > > Memory state around the buggy address: > > > ffff888015be3600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > ffff888015be3680: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc > > > >ffff888015be3700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > > ^ > > > ffff888015be3780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc > > > ffff888015be3800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > ================================================================== > > > > Hey Eric > > > > Let us know if this uaf adds another call site for what you added [1]. > > > > Hillf > > > > [1] https://lore.kernel.org/lkml/20211209010455.42744-2-ebiggers@kernel.org/ > > > > +++ x/kernel/sched/psi.c > > @@ -1193,7 +1193,7 @@ static void psi_trigger_destroy(struct k > > * Wakeup waiters to stop polling. Can happen if cgroup is deleted > > * from under a polling process. > > */ > > - wake_up_interruptible(&t->event_wait); > > + wake_up_pollfree(&t->event_wait); > > > > mutex_lock(&group->trigger_lock); > > [added linux-mm and all maintainers for kernel/sched/psi.c] > > Well, it is the same sort of issue, but POLLFREE is *not* enough here. POLLFREE > only works if the lifetime of waitqueue is tied to the polling task, as blocking > polls don't handle it -- only non-blocking polls do. > > The kernel/sched/psi.c use case is just totally broken, since the lifetime of > its waitqueue is totally arbitrary; the open file descriptor can be written to > at any time by any process, which causes the waitqueue to be freed. So it will > cause a use-after-free even for regular blocking poll(). > > To fix this, I think the psi trigger stuff will need to be refactored to have > just one waitqueue per open file. We need to be removing uses of POLLFREE, not > adding new ones. (See Linus' comments on POLLFREE here: > https://lore.kernel.org/lkml/CAHk-=wgvt7PH+AU_29H95tJQZ9FnhS8vVmymbhpZ6NZ7yaAigw@mail.gmail.com/) > > Here are some repros: > > #include <fcntl.h> > #include <sys/epoll.h> > #include <unistd.h> > int main() > { > int fd = open("/proc/pressure/cpu", O_RDWR); > int epfd = epoll_create(1); > const char trigger[] = "some 100000 1000000"; > struct epoll_event event = { .events = EPOLLIN }; > > write(fd, trigger, sizeof(trigger)); > epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event); > write(fd, trigger, sizeof(trigger)); > } > > > #include <fcntl.h> > #include <sys/poll.h> > #include <unistd.h> > int main() > { > int fd = open("/proc/pressure/cpu", O_RDWR); > const char trigger[] = "some 100000 1000000"; > > if (fork()) { > struct pollfd pfd = { .fd = fd, .events = POLLIN }; > > for (;;) > poll(&pfd, 1, -1); > } else { > for (;;) > write(fd, trigger, sizeof(trigger)); > } > } ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-05 15:20 ` psi_trigger_poll() is completely broken Eric Biggers @ 2022-01-05 18:50 ` Linus Torvalds 2022-01-05 19:07 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2022-01-05 18:50 UTC (permalink / raw) To: Eric Biggers Cc: Johannes Weiner, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM On Wed, Jan 5, 2022 at 7:21 AM Eric Biggers <ebiggers@kernel.org> wrote: > > [changed subject line to hopefully get people to stop ignoring this] > > Please see my message below where I explained the problem in detail. Any > response from the maintainers of kernel/sched/psi.c? There are a lot of you: Ok, this one is clearly a kernel/sched/psi.c bug, since the lifetime isn't even maintained by the fiel reference. I think the proper thing to do is to move the whole "get kref to trigger pointer" in the open/close code, and keep the ref around that way. The natural thing to do would be to look up the trigger at open time, save the pointer in seq->private, and release it at close time. Sadly, right now the code actually uses that 'seq->private' as an indirect rcu-pointer to the trigger data, instead of as the trigger data itself. And that seems very much on purpose and inherent to that 'psi_write()' model, where it changes the trigger pointer very much on purpose. So I agree 100% - the PSI code is fundamentally broken. psi_write() seems to be literally _designed_ to do the wrong thing. I don't know who - if anybody - uses this. My preference would be to just disable the completely broken poll support. Another alternative is to just make 'psi_write()' return -EBUSY if there are existing poll waiters (ie t->event_wait not being empty. At least then the open file would keep the kref to the trigger. That would require that 'psi_trigger_replace()' serialize with the waitqueue lock (easy), but all callers would also have to check the return value of it The cgroup code does psi_trigger_replace(&of->priv, NULL); in the release function, but I guess that might work since at release time there shouldn't be any pending polls anyway. But it would also mean that anybody who can open the file for reading (so that they can poll it) would be able to keep people from changing it. But yes, I think that unless we get some reply from the PSI maintainers, we will have to just disable polling entirely. I hope there are no users that would break, but considering that the current code is clearly too broken to live, this may be one of those "we have no choice" cases. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-05 18:50 ` Linus Torvalds @ 2022-01-05 19:07 ` Linus Torvalds 2022-01-05 19:13 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2022-01-05 19:07 UTC (permalink / raw) To: Eric Biggers, Tejun Heo, Zefan Li Cc: Johannes Weiner, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM On Wed, Jan 5, 2022 at 10:50 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > That would require that 'psi_trigger_replace()' serialize with the > waitqueue lock (easy) I take the "easy" back. The other side of that serialization would require that the poll() side also re-lookup the trigger pointer under that same lock. And you can't do that with the waitqueue lock, because 'poll_wait()' does the add_wait_queue() internally, and that will take the waitqueue lock. So you can't take and hold the waitqueue lock in the caller in poll, it would just deadlock. And not holding the lock over the call would mean that you'd have a small race between adding a new poll waiter, and checking that the trigger is still the same one. We could use another lock - the code in kernel/sched/psi.c already does mutex_lock(&seq->lock); psi_trigger_replace(&seq->private, new); mutex_unlock(&seq->lock); and could use that same lock around the poll sequence too. But the cgroup_pressure_write() code doesn't even do that, and concurrent writes aren't serialized at all (much less concurrent poll() calls). Side note: it looks like concurrent writes in the cgroup_pressure_write() is literally broken. Because psi_trigger_replace() is *not* handling concurrency, and does that struct psi_trigger *old = *trigger_ptr; .... if (old) kref_put(&old->refcount, psi_trigger_destroy); assuming that the caller holds some lock that makes '*trigger_ptr' a stable thing. Again, kernel/sched/psi.c itself does that already, but the cgroup code doesn't seem to. So the bugs in this area go deeper than "just" poll(). The whole psi_trigger_replace() thing is literally broken even ignoring the poll() interactions. Whoever came up with that stupid "replace existing trigger with a write()" model should feel bad. It's garbage, and it's actively buggy in multiple ways. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-05 19:07 ` Linus Torvalds @ 2022-01-05 19:13 ` Linus Torvalds 2022-01-06 22:05 ` Tejun Heo 2022-01-10 13:45 ` Johannes Weiner 0 siblings, 2 replies; 16+ messages in thread From: Linus Torvalds @ 2022-01-05 19:13 UTC (permalink / raw) To: Eric Biggers, Tejun Heo, Zefan Li Cc: Johannes Weiner, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM On Wed, Jan 5, 2022 at 11:07 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Whoever came up with that stupid "replace existing trigger with a > write()" model should feel bad. It's garbage, and it's actively buggy > in multiple ways. What are the users? Can we make the rule for -EBUSY simply be that you can _install_ a trigger, but you can't replace an existing one (except with NULL, when you close it). That would fix the poll() lifetime issue, and would make the psi_trigger_replace() races fairly easy to fix - just use if (cmpxchg(trigger_ptr, NULL, new) != NULL) { ... free 'new', return -EBUSY .. to install the new one, instead of rcu_assign_pointer(*trigger_ptr, new); or something like that. No locking necessary. But I assume people actually end up re-writing triggers, because people are perverse and have taken advantage of this completely broken API. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-05 19:13 ` Linus Torvalds @ 2022-01-06 22:05 ` Tejun Heo 2022-01-06 22:59 ` Linus Torvalds 2022-01-10 13:45 ` Johannes Weiner 1 sibling, 1 reply; 16+ messages in thread From: Tejun Heo @ 2022-01-06 22:05 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Biggers, Zefan Li, Johannes Weiner, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM Happy new year, On Wed, Jan 05, 2022 at 11:13:30AM -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2022 at 11:07 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Whoever came up with that stupid "replace existing trigger with a > > write()" model should feel bad. It's garbage, and it's actively buggy > > in multiple ways. > > What are the users? Can we make the rule for -EBUSY simply be that you > can _install_ a trigger, but you can't replace an existing one (except > with NULL, when you close it). I don't have enough context here and Johannes seems offline today. Let's wait for him to chime in. > That would fix the poll() lifetime issue, and would make the > psi_trigger_replace() races fairly easy to fix - just use > > if (cmpxchg(trigger_ptr, NULL, new) != NULL) { > ... free 'new', return -EBUSY .. > > to install the new one, instead of > > rcu_assign_pointer(*trigger_ptr, new); > > or something like that. No locking necessary. > > But I assume people actually end up re-writing triggers, because > people are perverse and have taken advantage of this completely broken > API. IIRC, the rationale for the shared trigger at the time was around the complexities of preventing it from devolving into O(N) trigger checks on every pressure update. If the overriding behavior is something that can be changed, I'd prefer going for per-opener triggers even if that involves adding complexities (maybe a rbtree w/ prev/next links for faster sweeps?). Thanks. -- tejun ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-06 22:05 ` Tejun Heo @ 2022-01-06 22:59 ` Linus Torvalds 2022-01-07 0:13 ` Eric Biggers 0 siblings, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2022-01-06 22:59 UTC (permalink / raw) To: Tejun Heo Cc: Eric Biggers, Zefan Li, Johannes Weiner, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM [-- Attachment #1: Type: text/plain, Size: 3614 bytes --] On Thu, Jan 6, 2022 at 2:05 PM Tejun Heo <tj@kernel.org> wrote: > > On Wed, Jan 05, 2022 at 11:13:30AM -0800, Linus Torvalds wrote: > > > > What are the users? Can we make the rule for -EBUSY simply be that you > > can _install_ a trigger, but you can't replace an existing one (except > > with NULL, when you close it). > > I don't have enough context here and Johannes seems offline today. Let's > wait for him to chime in. Context here: https://lore.kernel.org/all/YdW3WfHURBXRmn%2F6@sol.localdomain/ > IIRC, the rationale for the shared trigger at the time was around the > complexities of preventing it from devolving into O(N) trigger checks on > every pressure update. If the overriding behavior is something that can be > changed, I'd prefer going for per-opener triggers even if that involves > adding complexities (maybe a rbtree w/ prev/next links for faster sweeps?). So here's a COMPLETELY UNTESTED patch to try to fix the lifetime and locking. The locking was completely broken, in that psi_trigger_replace() expected that the caller would hold some exclusive lock so that it would release the correct previous trigger. The cgroup code doesn't seem to have any such exclusion. This (UNTESTED!) patch fixes that breakage by just using a cmpxchg loop. And the lifetime was completely broken (and that's Eric's email) because psi_trigger_replace() would drop the refcount to the old trigger - assuming it got the right one - even though the old trigger could still have active waiters on the waitqueue due to poll() or select(). This (UNTESTED!) patch fixes _that_ breakage by making psi_trigger_replace() instead just put the previous trigger on the "stale_trigger" linked list, and never release it at all. It now gets released by "psi_trigger_release()" instead, which walks the list at file release time. Doing "psi_trigger_replace(.., NULL)" is not valid any more. And because the reference cannot go away, we now can throw away all the incorrect temporary kref_get/put games from psi_trigger_poll(), which didn't actually fix the race at all, only limited it to the poll waitqueue. That also means we can remove the "synchronize_rcu()" from psi_trigger_destroy(), since that was trying to hide all the problems with the "take rcu lock and then do kref_get()" thing not having locking. The locking still doesn't exist, but since we don't release the old one when replacing it, the issue is moot. NOTE NOTE NOTE! Not only is this patch entirely untested, there are optimizations you could do if there was some sane synchronization between psi_trigger_poll() and psi_trigger_replace(). I put comments about it in the code, but right now the code just assumes that replacing a trigger is fairly rare (and since it requires write permissions, it's not something random users can do). I'm not proud of this patch, but I think it might fix the fundamental bugs in the code for now. It's not lovely, it has room for improvement, and I wish we didn't need this kind of thing, but it looks superficially sane as a fix to me. Comments? And once again: this is UNTESTED. I've compiled-tested it, it looks kind of sane to me, but honestly, I don't know the code very well. Also, I'm not super-happy with how that 'psi_disabled' static branch works. If somebody switches it off after it has been on, that will also disable the freeing code, so now you'll be leaking memory. I couldn't find it in myself to care. Eric - you have the test-case, and the eagle-eyes that found this problem in the first place. As such, your opinion and comments count more than most... Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 5084 bytes --] include/linux/psi.h | 1 + include/linux/psi_types.h | 3 ++ kernel/cgroup/cgroup.c | 2 +- kernel/sched/psi.c | 71 ++++++++++++++++++++++++++++++++--------------- 4 files changed, 54 insertions(+), 23 deletions(-) diff --git a/include/linux/psi.h b/include/linux/psi.h index 65eb1476ac70..9ec9468d6068 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -33,6 +33,7 @@ void cgroup_move_task(struct task_struct *p, struct css_set *to); struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf, size_t nbytes, enum psi_res res); void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *t); +void psi_trigger_release(void **trigger_ptr); __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file, poll_table *wait); diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 0a23300d49af..eab79e68bf56 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -132,6 +132,9 @@ struct psi_trigger { /* Refcounting to prevent premature destruction */ struct kref refcount; + + /* Previous trigger that this one replaced */ + struct psi_trigger *stale_trigger; }; struct psi_group { diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 919194de39c8..801d0aec0443 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3684,7 +3684,7 @@ static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, static void cgroup_pressure_release(struct kernfs_open_file *of) { - psi_trigger_replace(&of->priv, NULL); + psi_trigger_release(&of->priv); } bool cgroup_psi_enabled(void) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 1652f2bb54b7..10430f75f21a 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1152,6 +1152,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, t->last_event_time = 0; init_waitqueue_head(&t->event_wait); kref_init(&t->refcount); + t->stale_trigger = NULL; mutex_lock(&group->trigger_lock); @@ -1223,12 +1224,6 @@ static void psi_trigger_destroy(struct kref *ref) mutex_unlock(&group->trigger_lock); - /* - * Wait for both *trigger_ptr from psi_trigger_replace and - * poll_task RCUs to complete their read-side critical sections - * before destroying the trigger and optionally the poll_task - */ - synchronize_rcu(); /* * Stop kthread 'psimon' after releasing trigger_lock to prevent a * deadlock while waiting for psi_poll_work to acquire trigger_lock @@ -1243,16 +1238,48 @@ static void psi_trigger_destroy(struct kref *ref) kfree(t); } +/* + * Replacing a trigger must not throw away the old one, since it + * can still have pending waiters. + * + * Possible optimization: after successfully installing a new + * trigger, we could release the old one from the stale list + * early. Not done here yet - needs locking with psi_trigger_poll. + */ void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *new) { - struct psi_trigger *old = *trigger_ptr; + if (static_branch_likely(&psi_disabled)) + return; + + for (;;) { + struct psi_trigger *old = *trigger_ptr; + + new->stale_trigger = old; + if (try_cmpxchg(trigger_ptr, old, new)) + break; + } + + /* + * Now that the new one has been installed, we could + * check if the stale one has an empty wait-queue + * and release it early... But we'd need some locking + * with enw poll users to be sure. + */ +} + +/* No locking needed for final release */ +void psi_trigger_release(void **trigger_ptr) +{ + struct psi_trigger *trigger; if (static_branch_likely(&psi_disabled)) return; - rcu_assign_pointer(*trigger_ptr, new); - if (old) - kref_put(&old->refcount, psi_trigger_destroy); + while (trigger) { + struct psi_trigger *next = trigger->stale_trigger; + kref_put(&trigger->refcount, psi_trigger_destroy); + trigger = next; + } } __poll_t psi_trigger_poll(void **trigger_ptr, @@ -1264,24 +1291,24 @@ __poll_t psi_trigger_poll(void **trigger_ptr, if (static_branch_likely(&psi_disabled)) return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI; - rcu_read_lock(); - - t = rcu_dereference(*(void __rcu __force **)trigger_ptr); - if (!t) { - rcu_read_unlock(); + /* + * See psi_trigger_replace(): finding a trigger means + * that it is guaranteed to have an elevated refcount + * for the lifetime of this file descriptor. + * + * If we had locking, we could release it early. As it + * is, we'll only release it with psi_trigger_release() + * at the very end. + */ + t = READ_ONCE(*trigger_ptr); + if (!t) return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI; - } - kref_get(&t->refcount); - - rcu_read_unlock(); poll_wait(file, &t->event_wait, wait); if (cmpxchg(&t->event, 1, 0) == 1) ret |= EPOLLPRI; - kref_put(&t->refcount, psi_trigger_destroy); - return ret; } @@ -1347,7 +1374,7 @@ static int psi_fop_release(struct inode *inode, struct file *file) { struct seq_file *seq = file->private_data; - psi_trigger_replace(&seq->private, NULL); + psi_trigger_release(&seq->private); return single_release(inode, file); } ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-06 22:59 ` Linus Torvalds @ 2022-01-07 0:13 ` Eric Biggers 2022-01-07 3:02 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Eric Biggers @ 2022-01-07 0:13 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, Zefan Li, Johannes Weiner, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM On Thu, Jan 06, 2022 at 02:59:36PM -0800, Linus Torvalds wrote: > > So here's a COMPLETELY UNTESTED patch to try to fix the lifetime and locking. > > The locking was completely broken, in that psi_trigger_replace() > expected that the caller would hold some exclusive lock so that it > would release the correct previous trigger. The cgroup code doesn't > seem to have any such exclusion. > > This (UNTESTED!) patch fixes that breakage by just using a cmpxchg loop. > > And the lifetime was completely broken (and that's Eric's email) > because psi_trigger_replace() would drop the refcount to the old > trigger - assuming it got the right one - even though the old trigger > could still have active waiters on the waitqueue due to poll() or > select(). > > This (UNTESTED!) patch fixes _that_ breakage by making > psi_trigger_replace() instead just put the previous trigger on the > "stale_trigger" linked list, and never release it at all. > > It now gets released by "psi_trigger_release()" instead, which walks > the list at file release time. Doing "psi_trigger_replace(.., NULL)" > is not valid any more. > > And because the reference cannot go away, we now can throw away all > the incorrect temporary kref_get/put games from psi_trigger_poll(), > which didn't actually fix the race at all, only limited it to the poll > waitqueue. > > That also means we can remove the "synchronize_rcu()" from > psi_trigger_destroy(), since that was trying to hide all the problems > with the "take rcu lock and then do kref_get()" thing not having > locking. The locking still doesn't exist, but since we don't release > the old one when replacing it, the issue is moot. > > NOTE NOTE NOTE! Not only is this patch entirely untested, there are > optimizations you could do if there was some sane synchronization > between psi_trigger_poll() and psi_trigger_replace(). I put comments > about it in the code, but right now the code just assumes that > replacing a trigger is fairly rare (and since it requires write > permissions, it's not something random users can do). > > I'm not proud of this patch, but I think it might fix the fundamental > bugs in the code for now. > > It's not lovely, it has room for improvement, and I wish we didn't > need this kind of thing, but it looks superficially sane as a fix to > me. > > Comments? > > And once again: this is UNTESTED. I've compiled-tested it, it looks > kind of sane to me, but honestly, I don't know the code very well. > > Also, I'm not super-happy with how that 'psi_disabled' static branch > works. If somebody switches it off after it has been on, that will > also disable the freeing code, so now you'll be leaking memory. > > I couldn't find it in myself to care. I had to make the following changes to Linus's patch: diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 10430f75f21a..7d5afa89db44 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1255,7 +1255,7 @@ void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *new) struct psi_trigger *old = *trigger_ptr; new->stale_trigger = old; - if (try_cmpxchg(trigger_ptr, old, new)) + if (try_cmpxchg(trigger_ptr, &old, new)) break; } @@ -1270,7 +1270,7 @@ void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *new) /* No locking needed for final release */ void psi_trigger_release(void **trigger_ptr) { - struct psi_trigger *trigger; + struct psi_trigger *trigger = *trigger_ptr; if (static_branch_likely(&psi_disabled)) return; After that, the two reproducers I gave in https://lore.kernel.org/r/YbQUSlq76Iv5L4cC@sol.localdomain (the ones at the end of my mail, not the syzbot-generated ones which I didn't try) no longer crash the kernel. This is one way to fix the use-after-free, but the fact that it allows anyone who can write to a /proc/pressure/* file to cause the kernel to allocate an unbounded number of 'struct psi_trigger' structs is still really broken. I think we really need an answer to Linus' question: > What are the users? Can we make the rule for -EBUSY simply be that you > can _install_ a trigger, but you can't replace an existing one (except > with NULL, when you close it). ... since that would be a much better fix. The example in Documentation/accounting/psi.rst only does a single write; that case wouldn't be broken if we made multiple writes not work. - Eric ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-07 0:13 ` Eric Biggers @ 2022-01-07 3:02 ` Linus Torvalds 0 siblings, 0 replies; 16+ messages in thread From: Linus Torvalds @ 2022-01-07 3:02 UTC (permalink / raw) To: Eric Biggers Cc: Tejun Heo, Zefan Li, Johannes Weiner, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM On Thu, Jan 6, 2022 at 4:14 PM Eric Biggers <ebiggers@kernel.org> wrote: > > I had to make the following changes to Linus's patch: Ack. Thanks. > This is one way to fix the use-after-free, but the fact that it allows anyone > who can write to a /proc/pressure/* file to cause the kernel to allocate an > unbounded number of 'struct psi_trigger' structs is still really broken. Yeah, I agree. Very non-optimal - that patch really was trying to just keep the status quo, and fixing the immediate problems. Modifying that patch to only allow a previous NULL value in psi_trigger_replace() would be fairly simple - it would basically just get rid of the "stale_trigger" list (and the loops it creates). You'd still want the psi_trigger_release() model to separate that whole "release" from "new trigger". But that does require that nobody ever does more than a single write to one file. Debian code search finds those "/proc/pressure/xyz" files mentioned at least by systemd and the chromium chrome browser sources. Whether they actually write triggers to them, I can't say. Maybe we just need to try. Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-05 19:13 ` Linus Torvalds 2022-01-06 22:05 ` Tejun Heo @ 2022-01-10 13:45 ` Johannes Weiner 2022-01-10 17:25 ` Suren Baghdasaryan 1 sibling, 1 reply; 16+ messages in thread From: Johannes Weiner @ 2022-01-10 13:45 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Biggers, Tejun Heo, Zefan Li, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM, Suren Baghdasaryan On Wed, Jan 05, 2022 at 11:13:30AM -0800, Linus Torvalds wrote: > On Wed, Jan 5, 2022 at 11:07 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Whoever came up with that stupid "replace existing trigger with a > > write()" model should feel bad. It's garbage, and it's actively buggy > > in multiple ways. > > What are the users? Can we make the rule for -EBUSY simply be that you > can _install_ a trigger, but you can't replace an existing one (except > with NULL, when you close it). Apologies for the delay, I'm traveling right now. The primary user of the poll interface is still Android userspace OOM killing. I'm CCing Suren who is the most familiar with this usecase. Suren, the way the refcounting is written right now assumes that poll_wait() is the actual blocking wait. That's not true, it just queues the waiter and saves &t->event_wait, and the *caller* of psi_trigger_poll() continues to interact with it afterwards. If at all possible, I would also prefer the simplicity of one trigger setup per fd; if you need a new trigger, close the fd and open again. Can you please take a look if that is workable from the Android side? (I'm going to follow up on the static branch issue Linus pointed out, later this week when I'm back home. I also think we should add Suren as additional psi maintainer since the polling code is a good chunk of the codebase and he shouldn't miss threads like these.) > That would fix the poll() lifetime issue, and would make the > psi_trigger_replace() races fairly easy to fix - just use > > if (cmpxchg(trigger_ptr, NULL, new) != NULL) { > ... free 'new', return -EBUSY .. > > to install the new one, instead of > > rcu_assign_pointer(*trigger_ptr, new); > > or something like that. No locking necessary. > > But I assume people actually end up re-writing triggers, because > people are perverse and have taken advantage of this completely broken > API. > > Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-10 13:45 ` Johannes Weiner @ 2022-01-10 17:25 ` Suren Baghdasaryan 2022-01-10 17:42 ` Linus Torvalds 0 siblings, 1 reply; 16+ messages in thread From: Suren Baghdasaryan @ 2022-01-10 17:25 UTC (permalink / raw) To: Johannes Weiner Cc: Linus Torvalds, Eric Biggers, Tejun Heo, Zefan Li, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM On Mon, Jan 10, 2022 at 5:45 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Jan 05, 2022 at 11:13:30AM -0800, Linus Torvalds wrote: > > On Wed, Jan 5, 2022 at 11:07 AM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > Whoever came up with that stupid "replace existing trigger with a > > > write()" model should feel bad. It's garbage, and it's actively buggy > > > in multiple ways. > > > > What are the users? Can we make the rule for -EBUSY simply be that you > > can _install_ a trigger, but you can't replace an existing one (except > > with NULL, when you close it). > > Apologies for the delay, I'm traveling right now. > > The primary user of the poll interface is still Android userspace OOM > killing. I'm CCing Suren who is the most familiar with this usecase. > > Suren, the way the refcounting is written right now assumes that > poll_wait() is the actual blocking wait. That's not true, it just > queues the waiter and saves &t->event_wait, and the *caller* of > psi_trigger_poll() continues to interact with it afterwards. Thanks for adding me, Johannes. I see where I made a mistake. Terribly sorry for the trouble this caused. I do feel bad. > > If at all possible, I would also prefer the simplicity of one trigger > setup per fd; if you need a new trigger, close the fd and open again. > > Can you please take a look if that is workable from the Android side? Yes, one trigger per fd would work fine for Android. That's how we intended to use it. I'm still catching up on this email thread. Once I digest it, will try to fix this with one-trigger-per-fd approach. About the issue of serializing concurrent writes for cgroup_pressure_write() similar to how psi_write() does. Doesn't of->mutex inside kernfs_fop_write_iter() serialize the writes to the same file: https://elixir.bootlin.com/linux/latest/source/fs/kernfs/file.c#L287 ? > > (I'm going to follow up on the static branch issue Linus pointed out, > later this week when I'm back home. I also think we should add Suren > as additional psi maintainer since the polling code is a good chunk of > the codebase and he shouldn't miss threads like these.) That would help me not to miss these emails and respond promptly. Thanks, Suren. > > > That would fix the poll() lifetime issue, and would make the > > psi_trigger_replace() races fairly easy to fix - just use > > > > if (cmpxchg(trigger_ptr, NULL, new) != NULL) { > > ... free 'new', return -EBUSY .. > > > > to install the new one, instead of > > > > rcu_assign_pointer(*trigger_ptr, new); > > > > or something like that. No locking necessary. > > > > But I assume people actually end up re-writing triggers, because > > people are perverse and have taken advantage of this completely broken > > API. > > > > Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-10 17:25 ` Suren Baghdasaryan @ 2022-01-10 17:42 ` Linus Torvalds 2022-01-10 18:19 ` Suren Baghdasaryan 0 siblings, 1 reply; 16+ messages in thread From: Linus Torvalds @ 2022-01-10 17:42 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Johannes Weiner, Eric Biggers, Tejun Heo, Zefan Li, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM On Mon, Jan 10, 2022 at 9:25 AM Suren Baghdasaryan <surenb@google.com> wrote: > > About the issue of serializing concurrent writes for > cgroup_pressure_write() similar to how psi_write() does. Doesn't > of->mutex inside kernfs_fop_write_iter() serialize the writes to the > same file? Ahh, yes, it looks like that does solve the serialization issue. Sorry, I missed that because I'm not actually all that familiar with the kernfs 'of' code. So the only issue is the trigger lifetime one, and if a single trigger is sufficient and returning -EBUSY for trying to replace an existing one is good, then I think that's the proper fix. I'm very busy with the merge window (and some upcoming travel and family events), so I'm hoping somebody will write and test such a patch. Please? Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-10 17:42 ` Linus Torvalds @ 2022-01-10 18:19 ` Suren Baghdasaryan 2022-01-11 3:02 ` Suren Baghdasaryan 0 siblings, 1 reply; 16+ messages in thread From: Suren Baghdasaryan @ 2022-01-10 18:19 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, Eric Biggers, Tejun Heo, Zefan Li, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM On Mon, Jan 10, 2022 at 9:42 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, Jan 10, 2022 at 9:25 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > About the issue of serializing concurrent writes for > > cgroup_pressure_write() similar to how psi_write() does. Doesn't > > of->mutex inside kernfs_fop_write_iter() serialize the writes to the > > same file? > > Ahh, yes, it looks like that does solve the serialization issue. > Sorry, I missed that because I'm not actually all that familiar with > the kernfs 'of' code. > > So the only issue is the trigger lifetime one, and if a single trigger > is sufficient and returning -EBUSY for trying to replace an existing > one is good, then I think that's the proper fix. > > I'm very busy with the merge window (and some upcoming travel and > family events), so I'm hoping somebody will write and test such a > patch. Please? Yes, definitely. I'm on it. Will try posting it later today or tomorrow morning if testing reveals something unexpected. Thanks! > > Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: psi_trigger_poll() is completely broken 2022-01-10 18:19 ` Suren Baghdasaryan @ 2022-01-11 3:02 ` Suren Baghdasaryan 0 siblings, 0 replies; 16+ messages in thread From: Suren Baghdasaryan @ 2022-01-11 3:02 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, Eric Biggers, Tejun Heo, Zefan Li, Peter Zijlstra, Juri Lelli, Vincent Guittot, Ingo Molnar, Hillf Danton, syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs, Linux-MM On Mon, Jan 10, 2022 at 10:19 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Mon, Jan 10, 2022 at 9:42 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Mon, Jan 10, 2022 at 9:25 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > About the issue of serializing concurrent writes for > > > cgroup_pressure_write() similar to how psi_write() does. Doesn't > > > of->mutex inside kernfs_fop_write_iter() serialize the writes to the > > > same file? > > > > Ahh, yes, it looks like that does solve the serialization issue. > > Sorry, I missed that because I'm not actually all that familiar with > > the kernfs 'of' code. > > > > So the only issue is the trigger lifetime one, and if a single trigger > > is sufficient and returning -EBUSY for trying to replace an existing > > one is good, then I think that's the proper fix. > > > > I'm very busy with the merge window (and some upcoming travel and > > family events), so I'm hoping somebody will write and test such a > > patch. Please? > > Yes, definitely. I'm on it. Will try posting it later today or > tomorrow morning if testing reveals something unexpected. My first attempt to fix this issue is posted at: https://lore.kernel.org/all/20220111025138.1071848-1-surenb@google.com/ Couple notes: - I don't think we need psi_trigger::refcount anymore, therefore it's removed. - synchronize_rcu is kept to ensure we do not free group->poll_task while psi_schedule_poll_work is using it. - Documentation needed minimal changes because it did not clearly specify how trigger overwrite should work. Now it does. I ran as many test cases as I could find/create. I'll work on adding some kselftests for psi triggers to test different usage patterns. Thanks, Suren. > Thanks! > > > > > Linus ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2022-01-11 3:02 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-11-16 9:23 [syzbot] KASAN: use-after-free Read in remove_wait_queue (3) syzbot 2021-12-10 22:42 ` syzbot [not found] ` <20211211015620.1793-1-hdanton@sina.com> 2021-12-11 3:00 ` Eric Biggers 2022-01-05 15:20 ` psi_trigger_poll() is completely broken Eric Biggers 2022-01-05 18:50 ` Linus Torvalds 2022-01-05 19:07 ` Linus Torvalds 2022-01-05 19:13 ` Linus Torvalds 2022-01-06 22:05 ` Tejun Heo 2022-01-06 22:59 ` Linus Torvalds 2022-01-07 0:13 ` Eric Biggers 2022-01-07 3:02 ` Linus Torvalds 2022-01-10 13:45 ` Johannes Weiner 2022-01-10 17:25 ` Suren Baghdasaryan 2022-01-10 17:42 ` Linus Torvalds 2022-01-10 18:19 ` Suren Baghdasaryan 2022-01-11 3:02 ` Suren Baghdasaryan
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.