linux-fbdev.vger.kernel.org archive mirror
* [bug-report] task info hung problem in fb_deferred_io_work()
@ 2024-04-18 12:39 Harshit Mogalapalli
  2024-04-18 13:14 ` Patrik Jakobsson
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Harshit Mogalapalli @ 2024-04-18 12:39 UTC (permalink / raw)
  To: jayalk, Daniel Vetter, deller, linux-fbdev, dri-devel
  Cc: tiwai, namcao, bigeasy, patrik.r.jakobsson, LKML, Vegard Nossum,
	George Kennedy, Darren Kenny, chuansheng.liu

Hi,

While fuzzing the 5.15.y kernel with Syzkaller, we noticed an INFO: task hung
bug in fb_deferred_io_work()


This is in the 5.15.149 tag, and it was introduced by the following set of commits:

2655757a3f10 fbdev: flush deferred IO before closing
15492bab7393 fbdev: Fix incorrect page mapping clearance at 
fb_deferred_io_release()
87b9802ca824 fbdev: Fix invalid page access after closing deferred I/O 
devices
9a95fc04261f fbdev: Rename pagelist to pagereflist for deferred I/O
186b89659c4c fbdev: Track deferred-I/O pages in pageref struct
e79b2b2aadef fbdev: defio: fix the pagelist corruption
0616b00a31d6 fbdev: Don't sort deferred-I/O pages by default
5d3aff76a316 fbdev/defio: Early-out if page is already enlisted

I then checked whether it is reproducible on the upstream kernel as well,
and it is. Below is the call trace with the latest 6.9-rc4 based kernel. I
have enabled LOCKDEP to get locking-related information.

  INFO: task kworker/0:3:1077 blocked for more than 122 seconds.
        Not tainted 6.9.0-rc4+ #54
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/0:3     state:D stack:0     pid:1077  tgid:1077  ppid:2 
    flags:0x00004000
  Workqueue: events fb_deferred_io_work
  Call Trace:
   <TASK>
  __schedule (kernel/sched/core.c:5409 kernel/sched/core.c:6746)
  ? __pfx___schedule (kernel/sched/core.c:6617)
  ? schedule (kernel/sched/core.c:6804 kernel/sched/core.c:6837)
  ? __pfx___lock_release (kernel/locking/lockdep.c:5406)
  ? schedule (kernel/sched/core.c:6804 kernel/sched/core.c:6837)
  schedule (./arch/x86/include/asm/preempt.h:84 kernel/sched/core.c:6824 
kernel/sched/core.c:6838)
  io_schedule (kernel/sched/core.c:9019 kernel/sched/core.c:9045)
  folio_wait_bit_common (mm/filemap.c:1272)
  ? __pfx_folio_wait_bit_common (mm/filemap.c:1206)
  ? __pfx___mutex_lock (kernel/locking/mutex.c:751)
  ? lock_acquire.part.0 (kernel/locking/lockdep.c:467 
kernel/locking/lockdep.c:5756)
  ? __pfx_wake_page_function (mm/filemap.c:1091)
  ? __pfx_lock_acquire.part.0 (kernel/locking/lockdep.c:5719)
  fb_deferred_io_work (./include/linux/pagemap.h:1069 
drivers/video/fbdev/core/fb_defio.c:248)
  ? lock_acquire (./include/trace/events/lock.h:24 
kernel/locking/lockdep.c:5725)
  process_one_work (kernel/workqueue.c:3254)
  ? __pfx_lock_acquire.part.0 (kernel/locking/lockdep.c:5719)
  ? __pfx_process_one_work (kernel/workqueue.c:3156)
  ? assign_work (kernel/workqueue.c:1210)
  worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
  ? __kthread_parkme (kernel/kthread.c:293)
  ? __pfx_worker_thread (kernel/workqueue.c:3362)
  kthread (kernel/kthread.c:388)
  ? __pfx_kthread (kernel/kthread.c:341)
  ret_from_fork (arch/x86/kernel/process.c:147)
  ? __pfx_kthread (kernel/kthread.c:341)
  ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
   </TASK>
  INFO: task r:2664 blocked for more than 123 seconds.
        Not tainted 6.9.0-rc4+ #54
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:r               state:D stack:0     pid:2664  tgid:2664 
ppid:2618   flags:0x00000002
  Call Trace:
   <TASK>
  __schedule (kernel/sched/core.c:5409 kernel/sched/core.c:6746)
  ? __pfx___schedule (kernel/sched/core.c:6617)
  ? schedule (kernel/sched/core.c:6804 kernel/sched/core.c:6837)
  ? __pfx___lock_release (kernel/locking/lockdep.c:5406)
  ? schedule (kernel/sched/core.c:6804 kernel/sched/core.c:6837)
  schedule (./arch/x86/include/asm/preempt.h:84 kernel/sched/core.c:6824 
kernel/sched/core.c:6838)
  schedule_timeout (kernel/time/timer.c:2559)
  ? __pfx_schedule_timeout (kernel/time/timer.c:2544)
  ? mark_lock.part.0 (./arch/x86/include/asm/bitops.h:227 
./arch/x86/include/asm/bitops.h:239 
./include/asm-generic/bitops/instrumented-non-atomic.h:142 
kernel/locking/lockdep.c:228 kernel/locking/lockdep.c:4656)
  ? mark_held_locks (kernel/locking/lockdep.c:4274)
  ? lockdep_hardirqs_on_prepare.part.0 (kernel/locking/lockdep.c:4300 
kernel/locking/lockdep.c:4359)
  __wait_for_common (kernel/sched/completion.c:95 
kernel/sched/completion.c:116)
  ? __pfx_schedule_timeout (kernel/time/timer.c:2544)
  ? __pfx___wait_for_common (kernel/sched/completion.c:110)
  ? start_flush_work (kernel/workqueue.c:4192)
  __flush_work (kernel/workqueue.c:4207)
  ? __pfx___flush_work (kernel/workqueue.c:4195)
  ? __pfx_wq_barrier_func (kernel/workqueue.c:3742)
  ? flush_delayed_work (kernel/workqueue.c:4249)
  ? __pfx_fb_release (drivers/video/fbdev/core/fb_chrdev.c:375)
  fb_deferred_io_lastclose (drivers/video/fbdev/core/fb_defio.c:321)
  ? __pfx_fb_release (drivers/video/fbdev/core/fb_chrdev.c:375)
  fb_deferred_io_release (drivers/video/fbdev/core/fb_defio.c:333)
  fb_release (drivers/video/fbdev/core/fb_chrdev.c:383)
  ? __pfx_fb_release (drivers/video/fbdev/core/fb_chrdev.c:375)
  __fput (fs/file_table.c:422)
  task_work_run (kernel/task_work.c:180 (discriminator 1))
  ? __pfx_task_work_run (kernel/task_work.c:148)
  ? do_raw_spin_unlock (./arch/x86/include/asm/paravirt.h:589 
./arch/x86/include/asm/qspinlock.h:57 kernel/locking/spinlock_debug.c:142)
  do_exit (kernel/exit.c:879)
  ? do_raw_spin_lock (./arch/x86/include/asm/atomic.h:115 
./include/linux/atomic/atomic-arch-fallback.h:2170 
./include/linux/atomic/atomic-instrumented.h:1302 
./include/asm-generic/qspinlock.h:111 kernel/locking/spinlock_debug.c:116)
  ? __pfx_do_exit (kernel/exit.c:819)
  ? __pfx_do_raw_spin_lock (kernel/locking/spinlock_debug.c:114)
  do_group_exit (kernel/exit.c:1008)
  __x64_sys_exit_group (kernel/exit.c:1036)
  do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
  RIP: 0033:0x7f9cfe71e5f1
  RSP: 002b:00007ffd636aec08 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
  RAX: ffffffffffffffda RBX: 00007f9cfe7fc510 RCX: 00007f9cfe71e5f1
  RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
  RBP: 0000000000000000 R08: ffffffffffffff80 R09: 0000000000000000
  R10: 0000000000000020 R11: 0000000000000246 R12: 00007f9cfe7fc510
  R13: 0000000000000000 R14: 00007f9cfe7fc9e8 R15: 00007f9cfe7fca00
   </TASK>

  Showing all locks held in the system:
  1 lock held by khungtaskd/50:
  #0: ffffffff950219c0 (rcu_read_lock){....}-{1:2}, at: 
debug_show_all_locks (./include/linux/rcupdate.h:329 (discriminator 1) 
./include/linux/rcupdate.h:781 (discriminator 1) 
kernel/locking/lockdep.c:6614 (discriminator 1))
  3 locks held by kworker/0:3/1077:
  #0: ffff88810005ad48 ((wq_completion)events){+.+.}-{0:0}, at: 
process_one_work (kernel/workqueue.c:3229)
  #1: ffff8881080dfd98 
((work_completion)(&(&info->deferred_work)->work)){+.+.}-{0:0}, at: 
process_one_work (kernel/workqueue.c:3230)
  #2: ffff88811416db38 (&fbdefio->lock){+.+.}-{3:3}, at: 
fb_deferred_io_work (drivers/video/fbdev/core/fb_defio.c:246)
  1 lock held by r/2664:
  #0: ffff8881141fe078 (&fb_info->lock){+.+.}-{3:3}, at: fb_release 
(drivers/video/fbdev/core/fb_chrdev.c:380)



Syzkaller initially generated a long reproducer; Vegard simplified it to
the one below.

C repro:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/fb0", O_RDWR, 0);

	char *ptr = (char *) 0x20000000ul;

	/* map the page at framebuffer offset 0xff000 and dirty it */
	mmap(ptr, 4096, PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0xff000);
	*ptr = 0;

	/* remap the same address to framebuffer offset 0 and dirty it */
	mmap(ptr, 4096, PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0);
	*ptr = 0;

	return 0;
}

This needs to be run as root; the hang is reported after 122 seconds.

238 static void fb_deferred_io_work(struct work_struct *work)
239 {
240         struct fb_info *info = container_of(work, struct fb_info, 
deferred_work.work);
241         struct fb_deferred_io_pageref *pageref, *next;
242         struct fb_deferred_io *fbdefio = info->fbdefio;
243
244         /* here we mkclean the pages, then do all deferred IO */
245         mutex_lock(&fbdefio->lock);
246         list_for_each_entry(pageref, &fbdefio->pagereflist, list) {
247                 struct page *cur = pageref->page;
248                 lock_page(cur);
249                 page_mkclean(cur);
250                 unlock_page(cur);
251         }

It is stalling on lock_page() in fb_deferred_io_work().
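
Putting the two traces together: the kworker is holding &fbdefio->lock
and sleeping in lock_page() (folio_wait_bit_common()) waiting for a page
lock that is never released, while the closing task is holding
&fb_info->lock and waiting in __flush_work() for that same worker to
finish, so both tasks hang.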

I will be happy to try any patches and report back.


Thanks,
Harshit


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-18 12:39 [bug-report] task info hung problem in fb_deferred_io_work() Harshit Mogalapalli
@ 2024-04-18 13:14 ` Patrik Jakobsson
  2024-04-18 14:05   ` Nam Cao
  2024-04-19  7:27   ` Harshit Mogalapalli
  2024-04-18 14:06 ` Nam Cao
  2024-04-19 14:59 ` Nam Cao
  2 siblings, 2 replies; 21+ messages in thread
From: Patrik Jakobsson @ 2024-04-18 13:14 UTC (permalink / raw)
  To: Harshit Mogalapalli
  Cc: jayalk, Daniel Vetter, deller, linux-fbdev, dri-devel, tiwai,
	namcao, bigeasy, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu

On Thu, Apr 18, 2024 at 2:40 PM Harshit Mogalapalli
<harshit.m.mogalapalli@oracle.com> wrote:
>
> Hi,
>
> While fuzzing the 5.15.y kernel with Syzkaller, we noticed an INFO: task hung
> bug in fb_deferred_io_work()
>
> [...]
>
> It is stalling on lock_page() in fb_deferred_io_work().
>
> I will be happy to try any patches and report back.

Hi,
This sounds similar to the SUSE bug [1]. We fixed it by reverting [2]
in the SUSE kernel. The problem seems to be that flush_delayed_work()
kills the timer and re-queues the work but doesn't guarantee that it
is finished when returning. So when the device is closed, the
fb_deferred_io_work() function can still be queued and tries to access
memory that's been freed.

Patch [2] tries to solve the problem of not throwing away pending data
when closing the device. Perhaps calling cancel_delayed_work_sync()
and then following up with a manual call to fb_deferred_io_work() would
be enough to flush the remaining data?
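
Something along these lines, as a rough untested sketch (assuming
fb_deferred_io_lastclose() stays the call site and the fields keep
their current names):

	/* instead of flush_delayed_work(&info->deferred_work): */

	/* stop the timer and wait for a running worker to finish */
	cancel_delayed_work_sync(&info->deferred_work);

	/* then write out whatever is still on the pageref list */
	fb_deferred_io_work(&info->deferred_work.work);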

-Patrik

[1] https://bugzilla.suse.com/show_bug.cgi?id=1221814
[2] 33cd6ea9c067 fbdev: flush deferred IO before closing

>
>
> Thanks,
> Harshit


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-18 13:14 ` Patrik Jakobsson
@ 2024-04-18 14:05   ` Nam Cao
  2024-04-18 14:34     ` Patrik Jakobsson
  2024-04-19  7:27   ` Harshit Mogalapalli
  1 sibling, 1 reply; 21+ messages in thread
From: Nam Cao @ 2024-04-18 14:05 UTC (permalink / raw)
  To: Patrik Jakobsson
  Cc: Harshit Mogalapalli, jayalk, Daniel Vetter, deller, linux-fbdev,
	dri-devel, tiwai, bigeasy, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu

On 2024-04-18 Patrik Jakobsson wrote:
> This sounds similar to the SUSE bug [1]. We fixed it by reverting [2]
> in the SUSE kernel. The problem seems to be that flush_delayed_work()
> kills the timer and re-queues the work but doesn't guarantee that it
> is finished when returning. So when the device is closed, the
> fb_deferred_io_work() function can still be queued and tries to access
> memory that's been freed.

flush_delayed_work() *does* guarantee the work is finished before
returning.

> > [...]



* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-18 12:39 [bug-report] task info hung problem in fb_deferred_io_work() Harshit Mogalapalli
  2024-04-18 13:14 ` Patrik Jakobsson
@ 2024-04-18 14:06 ` Nam Cao
  2024-04-18 14:26   ` Takashi Iwai
  2024-04-19  7:35   ` Harshit Mogalapalli
  2024-04-19 14:59 ` Nam Cao
  2 siblings, 2 replies; 21+ messages in thread
From: Nam Cao @ 2024-04-18 14:06 UTC (permalink / raw)
  To: Harshit Mogalapalli
  Cc: jayalk, Daniel Vetter, deller, linux-fbdev, dri-devel, tiwai,
	bigeasy, patrik.r.jakobsson, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu

On 2024-04-18 Harshit Mogalapalli wrote:
> While fuzzing the 5.15.y kernel with Syzkaller, we noticed an INFO: task hung
> bug in fb_deferred_io_work()

Which framebuffer device are you using exactly? It is possible that
the problem is with the device driver, not core framebuffer.

Best regards,
Nam


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-18 14:06 ` Nam Cao
@ 2024-04-18 14:26   ` Takashi Iwai
  2024-04-18 19:29     ` Helge Deller
  2024-04-19  7:35   ` Harshit Mogalapalli
  1 sibling, 1 reply; 21+ messages in thread
From: Takashi Iwai @ 2024-04-18 14:26 UTC (permalink / raw)
  To: Nam Cao
  Cc: Harshit Mogalapalli, jayalk, Daniel Vetter, deller, linux-fbdev,
	dri-devel, tiwai, bigeasy, patrik.r.jakobsson, LKML,
	Vegard Nossum, George Kennedy, Darren Kenny, chuansheng.liu

On Thu, 18 Apr 2024 16:06:52 +0200,
Nam Cao wrote:
> 
> On 2024-04-18 Harshit Mogalapalli wrote:
> > While fuzzing the 5.15.y kernel with Syzkaller, we noticed an INFO: task hung
> > bug in fb_deferred_io_work()
> 
> Which framebuffer device are you using exactly? It is possible that
> the problem is with the device driver, not core framebuffer.

Note that it was already known that using flush_delayed_work() caused
a problem.  See the thread of the fix patch:
  https://lore.kernel.org/all/20230129082856.22113-1-tiwai@suse.de/

BTW, the problem is seen with bochs drm.


Takashi


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-18 14:05   ` Nam Cao
@ 2024-04-18 14:34     ` Patrik Jakobsson
  2024-04-19 15:18       ` Nam Cao
  0 siblings, 1 reply; 21+ messages in thread
From: Patrik Jakobsson @ 2024-04-18 14:34 UTC (permalink / raw)
  To: Nam Cao
  Cc: Harshit Mogalapalli, jayalk, Daniel Vetter, deller, linux-fbdev,
	dri-devel, tiwai, bigeasy, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu

On Thu, Apr 18, 2024 at 4:05 PM Nam Cao <namcao@linutronix.de> wrote:
>
> On 2024-04-18 Patrik Jakobsson wrote:
> > This sounds similar to the SUSE bug [1]. We fixed it by reverting [2]
> > in the SUSE kernel. The problem seems to be that flush_delayed_work()
> > kills the timer and re-queues the work but doesn't guarantee that it
> > is finished when returning. So when the device is closed, the
> > fb_deferred_io_work() function can still be queued and tries to access
> > memory that's been freed.
>
> flush_delayed_work() *does* guarantee the work is finished before
> returning.

Right, flush_work() does guarantee that the work is finished, but it
doesn't guarantee that the queue is idle if work has been requeued
since the flush started. So fb_deferred_io_work() could still run
after the device is closed. Or am I missing something?
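
Hypothetical interleaving (my guess at the race, not taken from a trace):

    closing task                      page-fault path
    ------------                      ---------------
    flush_delayed_work()
      runs the queued work
                                      mkwrite dirties a page and
                                      re-queues the delayed work
      returns (old work finished)
    teardown / free pages
                                      re-queued work runs
                                      -> touches freed memory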

> > [...]


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-18 14:26   ` Takashi Iwai
@ 2024-04-18 19:29     ` Helge Deller
  2024-04-19  6:44       ` Takashi Iwai
  0 siblings, 1 reply; 21+ messages in thread
From: Helge Deller @ 2024-04-18 19:29 UTC (permalink / raw)
  To: Takashi Iwai, Nam Cao, Thomas Zimmermann
  Cc: Harshit Mogalapalli, Daniel Vetter, linux-fbdev, dri-devel,
	bigeasy, patrik.r.jakobsson, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu

On 4/18/24 16:26, Takashi Iwai wrote:
> On Thu, 18 Apr 2024 16:06:52 +0200,
> Nam Cao wrote:
>>
>> On 2024-04-18 Harshit Mogalapalli wrote:
>>> While fuzzing the 5.15.y kernel with Syzkaller, we noticed an INFO: task hung
>>> bug in fb_deferred_io_work()
>>
>> Which framebuffer device are you using exactly? It is possible that
>> the problem is with the device driver, not core framebuffer.
>
> Note that it was already known that using flush_delayed_work() caused
> a problem.  See the thread of the fix patch:
>    https://lore.kernel.org/all/20230129082856.22113-1-tiwai@suse.de/

Harshit reported the hung tasks with kernel v5.15-stable, and can even reproduce
that issue with kernel v6.9-rc4 although it has all of your patches from
that referenced mail thread applied.
So, what does your statement that "it was already known that it causes problems" exactly mean?
Can it be fixed? Is someone looking into fixing it?

> BTW, the problem is seen with bochs drm.

Helge


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-18 19:29     ` Helge Deller
@ 2024-04-19  6:44       ` Takashi Iwai
  2024-04-19  7:39         ` Harshit Mogalapalli
  0 siblings, 1 reply; 21+ messages in thread
From: Takashi Iwai @ 2024-04-19  6:44 UTC (permalink / raw)
  To: Helge Deller
  Cc: Takashi Iwai, Nam Cao, Thomas Zimmermann, Harshit Mogalapalli,
	Daniel Vetter, linux-fbdev, dri-devel, bigeasy,
	patrik.r.jakobsson, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu

On Thu, 18 Apr 2024 21:29:57 +0200,
Helge Deller wrote:
> 
> On 4/18/24 16:26, Takashi Iwai wrote:
> > On Thu, 18 Apr 2024 16:06:52 +0200,
> > Nam Cao wrote:
> >> 
> >> On 2024-04-18 Harshit Mogalapalli wrote:
> >>>>> While fuzzing the 5.15.y kernel with Syzkaller, we noticed an INFO: task hung
> >>> bug in fb_deferred_io_work()
> >> 
> >> Which framebuffer device are you using exactly? It is possible that
> >> the problem is with the device driver, not core framebuffer.
> > 
> > Note that it was already known that using flush_delayed_work() caused
> > a problem.  See the thread of the fix patch:
> >    https://lore.kernel.org/all/20230129082856.22113-1-tiwai@suse.de/
> 
> Harshit reported the hung tasks with kernel v5.15-stable, and can even reproduce
> that issue with kernel v6.9-rc4 although it has all of your patches from
> that referenced mail thread applied.
> So, what does your statement that "it was already known that it causes problems" exactly mean?
> Can it be fixed? Is someone looking into fixing it?

My original fix was intentionally with cancel_delayed_work_sync()
because flush_delayed_work() didn't work.  We knew that it'd miss some
last-minute queued change, but it's better than a crash, so it was
applied in that way.

Then later on, the commit 33cd6ea9c067 changed cancel_*() to
flush_delayed_work() blindly, and the known problem resurfaced again.


Takashi


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-18 13:14 ` Patrik Jakobsson
  2024-04-18 14:05   ` Nam Cao
@ 2024-04-19  7:27   ` Harshit Mogalapalli
  1 sibling, 0 replies; 21+ messages in thread
From: Harshit Mogalapalli @ 2024-04-19  7:27 UTC (permalink / raw)
  To: Patrik Jakobsson
  Cc: jayalk, Daniel Vetter, deller, linux-fbdev, dri-devel, tiwai,
	namcao, bigeasy, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu

Hi Patrik,

On 18/04/24 18:44, Patrik Jakobsson wrote:
> On Thu, Apr 18, 2024 at 2:40 PM Harshit Mogalapalli
> <harshit.m.mogalapalli@oracle.com> wrote:
>>
>> Hi,
>>
>> While fuzzing the 5.15.y kernel with Syzkaller, we noticed an INFO: task hung
>> bug in fb_deferred_io_work()
>>
>> [...]
>>
>> It is stalling on lock_page() in fb_deferred_io_work().
>>
>> I will be happy to try any patches and report back.
> 
> Hi,


Thanks for the quick response and sharing details.

> This sounds similar to the SUSE bug [1]. We fixed it by reverting [2]
> in the SUSE kernel. The problem seems to be that flush_delayed_work()
> kills the timer and re-queues the work but doesn't guarantee that it
> is finished when returning. So when the device is closed, the
> fb_deferred_io_work() function can still be queued and tries to access
> memory that's been freed.
> 

I tried reverting that commit but saw another similar problem; here it is:

[  618.508726] INFO: task r:2682 blocked for more than 122 seconds.
[  618.513662]       Not tainted 6.9.0-rc4+ #55
[  618.517105] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  618.523204] task:r               state:D stack:0     pid:2682 
tgid:2682  ppid:2636   flags:0x00000004
[  618.530325] Call Trace:
[  618.532470]  <TASK>
[  618.534394]  __schedule+0x787/0x1b10
[  618.537506]  ? __pfx___schedule+0x10/0x10
[  618.540784]  ? schedule+0x1e6/0x290
[  618.543698]  ? __lock_release+0x103/0x460
[  618.546982]  ? folio_wait_bit_common+0x22a/0x8b0
[  618.550697]  ? __pfx___lock_release+0x10/0x10
[  618.554268]  ? mark_lock.part.0+0x77/0x880
[  618.557614]  ? schedule+0x1e6/0x290
[  618.560545]  schedule+0xdd/0x290
[  618.563324]  io_schedule+0xc4/0x140
[  618.566251]  folio_wait_bit_common+0x32a/0x8b0
[  618.570062]  ? __pfx_folio_wait_bit_common+0x10/0x10
[  618.574231]  ? mark_lock.part.0+0x77/0x880
[  618.577575]  ? __pfx_wake_page_function+0x10/0x10
[  618.581430]  __do_fault+0x2fc/0x610
[  618.584329]  ? __thp_vma_allowable_orders+0x25e/0x9f0
[  618.588442]  do_fault+0x22e/0xb60
[  618.591222]  handle_pte_fault+0x55a/0x7e0
[  618.594543]  ? thp_vma_allowable_orders.constprop.0+0x66/0x1e0
[  618.599204]  __handle_mm_fault+0xf93/0x1580
[  618.602652]  ? __pfx___handle_mm_fault+0x10/0x10
[  618.606460]  ? find_vma+0xc5/0x150
[  618.609324]  ? __pfx_find_vma+0x10/0x10
[  618.612497]  ? lock_acquire+0x1bf/0x230
[  618.615639]  handle_mm_fault+0x288/0xa80
[  618.618882]  do_user_addr_fault+0x39a/0x1280
[  618.622453]  exc_page_fault+0xa0/0x1c0
[  618.625530]  asm_exc_page_fault+0x26/0x30
[  618.628807] RIP: 0033:0x40118d
[  618.631429] RSP: 002b:00007ffdae7ffa40 EFLAGS: 00010207
[  618.635586] RAX: 0000000020000000 RBX: 0000000000000000 RCX: 
00007f8c00b4d557
[  618.641281] RDX: 0000000000000002 RSI: 0000000000001000 RDI: 
0000000020000000
[  618.646822] RBP: 00007ffdae7ffa50 R08: 0000000000000003 R09: 
00000000000ff000
[  618.652349] R10: 0000000000000011 R11: 0000000000000246 R12: 
00007ffdae7ffb88
[  618.657931] R13: 0000000000401136 R14: 00007f8c00cedaa0 R15: 
0000000000403e08
[  618.663498]  </TASK>
[  618.665808]
[  618.665808] Showing all locks held in the system:
[  618.671090] 1 lock held by khungtaskd/51:
[  618.674595]  #0: ffffffff950219c0 (rcu_read_lock){....}-{1:2}, at: 
debug_show_all_locks+0x75/0x340
[  618.682063] 1 lock held by r/2682:
[  618.685101]  #0: ffff888100072fe0 (&mm->mmap_lock){++++}-{3:3}, at: 
lock_mm_and_find_vma+0x3a/0x6b0
[  618.692648]
[  618.694194] =============================================
[  618.694194]


Note that reverting all 8 commits makes the issue go away on 5.15.y, but
that does not look like a feasible solution.

I think we can say that the issue is localized to the 8 commits listed
above.

Thanks,
Harshit


> [...]


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-18 14:06 ` Nam Cao
  2024-04-18 14:26   ` Takashi Iwai
@ 2024-04-19  7:35   ` Harshit Mogalapalli
  1 sibling, 0 replies; 21+ messages in thread
From: Harshit Mogalapalli @ 2024-04-19  7:35 UTC (permalink / raw)
  To: Nam Cao
  Cc: Daniel Vetter, deller, linux-fbdev, dri-devel, tiwai, bigeasy,
	patrik.r.jakobsson, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu

Hi Nam,

On 18/04/24 19:36, Nam Cao wrote:
> On 2024-04-18 Harshit Mogalapalli wrote:
>> While fuzzing the 5.15.y kernel with Syzkaller, we noticed an INFO: task hung
>> bug in fb_deferred_io_work()
> 
> Which framebuffer device are you using exactly? It is possible that
> the problem is with the device driver, not core framebuffer.
> 

I am not entirely sure, but the C reproducer opens /dev/fb0:

int fd = open("/dev/fb0", O_RDWR, 0);

Thanks,
Harshit

> Best regards,
> Nam



* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-19  6:44       ` Takashi Iwai
@ 2024-04-19  7:39         ` Harshit Mogalapalli
  2024-04-19  7:45           ` Takashi Iwai
  0 siblings, 1 reply; 21+ messages in thread
From: Harshit Mogalapalli @ 2024-04-19  7:39 UTC (permalink / raw)
  To: Takashi Iwai, Helge Deller
  Cc: Nam Cao, Thomas Zimmermann, Daniel Vetter, linux-fbdev,
	dri-devel, bigeasy, patrik.r.jakobsson, LKML, Vegard Nossum,
	George Kennedy, Darren Kenny, chuansheng.liu

Hi Takashi,

On 19/04/24 12:14, Takashi Iwai wrote:
> On Thu, 18 Apr 2024 21:29:57 +0200,
> Helge Deller wrote:
>>
>>> [...]
>>
>> Harshit reported the hung tasks with kernel v5.15-stable, and can even reproduce
>> that issue with kernel v6.9-rc4 although it has all of your patches from
>> that referenced mail thread applied.
>> So, what does your statement that "it was already known that it causes problems" exactly mean?
>> Can it be fixed? Is someone looking into fixing it?
> 
> My original fix was intentionally with cancel_delayed_work_sync()
> because flush_delayed_work() didn't work.  We knew that it'd miss some
> > last-minute queued change, but it's better than a crash, so it was
> applied in that way.
> 

Thanks for sharing these details.

> Then later on, the commit 33cd6ea9c067 changed cancel_*() to
> flush_delayed_work() blindly, and the known problem resurfaced again.
> 

I have reverted that commit, but could still see another task hung
message, as shared in this other reply:

https://lore.kernel.org/all/d2485cb9-277d-4b8e-9794-02f1efababc9@oracle.com/

Thanks,
Harshit
> 
> Takashi



* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-19  7:39         ` Harshit Mogalapalli
@ 2024-04-19  7:45           ` Takashi Iwai
  2024-04-19  7:50             ` Patrik Jakobsson
  2024-04-19 14:42             ` Harshit Mogalapalli
  0 siblings, 2 replies; 21+ messages in thread
From: Takashi Iwai @ 2024-04-19  7:45 UTC (permalink / raw)
  To: Harshit Mogalapalli
  Cc: Takashi Iwai, Helge Deller, Nam Cao, Thomas Zimmermann,
	Daniel Vetter, linux-fbdev, dri-devel, bigeasy,
	patrik.r.jakobsson, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu

On Fri, 19 Apr 2024 09:39:09 +0200,
Harshit Mogalapalli wrote:
> 
> Hi Takashi,
> 
> On 19/04/24 12:14, Takashi Iwai wrote:
> > [...]
>
> > Then later on, the commit 33cd6ea9c067 changed cancel_*() to
> > flush_delayed_work() blindly, and the known problem resurfaced again.
> > 
> 
> I have reverted that commit, but could still see another task hung
> message, as shared in this other reply:
> 
> https://lore.kernel.org/all/d2485cb9-277d-4b8e-9794-02f1efababc9@oracle.com/

Yes, then it could be a different cause, I suppose.
The crash with flush_delayed_work() was a real crash, no hanging task,
IIRC.

Can you reproduce the issue with the latest Linus upstream, too?


thanks,

Takashi


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-19  7:45           ` Takashi Iwai
@ 2024-04-19  7:50             ` Patrik Jakobsson
  2024-04-19 15:34               ` Nam Cao
  2024-04-19 14:42             ` Harshit Mogalapalli
  1 sibling, 1 reply; 21+ messages in thread
From: Patrik Jakobsson @ 2024-04-19  7:50 UTC (permalink / raw)
  To: Takashi Iwai
  Cc: Harshit Mogalapalli, Helge Deller, Nam Cao, Thomas Zimmermann,
	Daniel Vetter, linux-fbdev, dri-devel, bigeasy, LKML,
	Vegard Nossum, George Kennedy, Darren Kenny, chuansheng.liu

On Fri, Apr 19, 2024 at 9:45 AM Takashi Iwai <tiwai@suse.de> wrote:
>
> On Fri, 19 Apr 2024 09:39:09 +0200,
> Harshit Mogalapalli wrote:
> >
> > Hi Takashi,
> >
> > On 19/04/24 12:14, Takashi Iwai wrote:
> > > [...]
> >
> > > Then later on, the commit 33cd6ea9c067 changed cancel_*() to
> > > flush_delayed_work() blindly, and the known problem resurfaced again.
> > >
> >
> > I have reverted that commit, but could still see another task hung
> > message, as shared in this other reply:
> >
> > https://lore.kernel.org/all/d2485cb9-277d-4b8e-9794-02f1efababc9@oracle.com/
>
> Yes, then it could be a different cause, I suppose.
> The crash with flush_delayed_work() was a real crash, no hanging task,
> IIRC.

Neither cancel_delayed_work_sync() nor flush_delayed_work() prevents new
work from being scheduled after it returns. But
cancel_delayed_work_sync() at least makes sure the queue is empty so
the problem becomes less apparent.

Could this explain what we're seeing?

>
> Can you reproduce the issue with the latest Linus upstream, too?
>
>
> thanks,
>
> Takashi


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-19  7:45           ` Takashi Iwai
  2024-04-19  7:50             ` Patrik Jakobsson
@ 2024-04-19 14:42             ` Harshit Mogalapalli
  1 sibling, 0 replies; 21+ messages in thread
From: Harshit Mogalapalli @ 2024-04-19 14:42 UTC (permalink / raw)
  To: Takashi Iwai, patrik.r.jakobsson
  Cc: Helge Deller, Nam Cao, Thomas Zimmermann, Daniel Vetter,
	linux-fbdev, dri-devel, bigeasy, LKML, Vegard Nossum,
	George Kennedy, Darren Kenny, chuansheng.liu

Hi Takashi,
On 19/04/24 13:15, Takashi Iwai wrote:
> On Fri, 19 Apr 2024 09:39:09 +0200,
>>> Then later on, the commit 33cd6ea9c067 changed cancel_*() to
>>> flush_delayed_work() blindly, and the known problem resurfaced again.
>>>
>>
>> I have reverted that commit, but could still see another task hung
>> message, as shared in this other reply:
>>
>> https://lore.kernel.org/all/d2485cb9-277d-4b8e-9794-02f1efababc9@oracle.com/
> 
> Yes, then it could be a different cause, I suppose.
> The crash with flush_delayed_work() was a real crash, no hanging task,
> IIRC.
> 
> Can you reproduce the issue with the latest Linus upstream, too?
> 

Yes, this was reproduced on the 6.9-rc4 kernel [1]; also, reverting
commit 33cd6ea9c067 on upstream still showed another issue [2].


[1] https://lore.kernel.org/all/271372d6-e665-4e7f-b088-dee5f4ab341a@oracle.com/
[2] https://lore.kernel.org/all/d2485cb9-277d-4b8e-9794-02f1efababc9@oracle.com/


Thanks,
Harshit

> 
> thanks,
> 
> Takashi



* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-18 12:39 [bug-report] task info hung problem in fb_deferred_io_work() Harshit Mogalapalli
  2024-04-18 13:14 ` Patrik Jakobsson
  2024-04-18 14:06 ` Nam Cao
@ 2024-04-19 14:59 ` Nam Cao
  2024-04-19 16:05   ` Harshit Mogalapalli
  2 siblings, 1 reply; 21+ messages in thread
From: Nam Cao @ 2024-04-19 14:59 UTC (permalink / raw)
  To: Harshit Mogalapalli
  Cc: jayalk, Daniel Vetter, deller, linux-fbdev, dri-devel, tiwai,
	bigeasy, patrik.r.jakobsson, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu

On 2024-04-18 Harshit Mogalapalli wrote:
> While fuzzing the 5.15.y kernel with Syzkaller, we noticed an INFO: task hung
> bug in fb_deferred_io_work()

I think the problem is an incorrect offset calculation. The kernel
calculates the address offset with:
	offset = vmf->address - vmf->vma->vm_start

Now the problem is that your C program mmap the framebuffer at 2
different offsets:
	mmap(ptr, 4096, PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0xff000);
	mmap(ptr, 4096, PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0);

but the kernel doesn't take these different offsets into account.
So, 2 different pages are mistakenly recognized as the same page.
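
For example, with the reproducer above, both mappings have
vm_start == 0x20000000 and the faulting address is 0x20000000, so the
current code computes offset == 0 for both writes. But the first
mapping has vm_pgoff == 0xff, i.e. it really points at framebuffer
offset 0xff000, while the second one points at offset 0. With the
vm_pgoff term added, the two faults yield 0xff000 and 0 as expected.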

Can you try the following patch?

Best regards,
Nam

diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c
index dae96c9f61cf..d5d6cd9e8b29 100644
--- a/drivers/video/fbdev/core/fb_defio.c
+++ b/drivers/video/fbdev/core/fb_defio.c
@@ -196,7 +196,8 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
  */
 static vm_fault_t fb_deferred_io_page_mkwrite(struct fb_info *info, struct vm_fault *vmf)
 {
-	unsigned long offset = vmf->address - vmf->vma->vm_start;
+	unsigned long offset = vmf->address - vmf->vma->vm_start
+			+ (vmf->vma->vm_pgoff << PAGE_SHIFT);
 	struct page *page = vmf->page;
 
 	file_update_time(vmf->vma->vm_file);




* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-18 14:34     ` Patrik Jakobsson
@ 2024-04-19 15:18       ` Nam Cao
  0 siblings, 0 replies; 21+ messages in thread
From: Nam Cao @ 2024-04-19 15:18 UTC (permalink / raw)
  To: Patrik Jakobsson
  Cc: Harshit Mogalapalli, jayalk, Daniel Vetter, deller, linux-fbdev,
	dri-devel, tiwai, bigeasy, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu

On 2024-04-18 Patrik Jakobsson wrote:
> On Thu, Apr 18, 2024 at 4:05 PM Nam Cao <namcao@linutronix.de> wrote:
> >
> > On 2024-04-18 Patrik Jakobsson wrote:  
> > > This sounds similar to the SUSE bug [1]. We fixed it by reverting [2]
> > > in the SUSE kernel. The problem seems to be that flush_delayed_work()
> > > kills the timer and re-queues the work but doesn't guarantee that it
> > > is finished when returning. So when the device is closed, the
> > > fb_deferred_io_work() function can still be queued and tries to access
> > > memory that's been freed.  
> >
> > flush_delayed_work() *does* guarantee the work is finished before
> > returning.  
> 
> Right, flush_work() does guarantee that the work is finished, but it
> doesn't guarantee that the queue is idle if work has been requeued
> since the flush started. So fb_deferred_io_work() could still run
> after the device is closed. Or am I missing something?

I'm confused: how is it possible for fb_deferred_io_work() to run after
the device is closed?

The workqueue is flushed during closing of device. If more work is
queued during/after the flushing, then someone must be writing to the
device during/after the closing of the device. How can that happen?

Best regards,
Nam

> > > [...]



* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-19  7:50             ` Patrik Jakobsson
@ 2024-04-19 15:34               ` Nam Cao
  2024-04-30 11:45                 ` Patrik Jakobsson
  0 siblings, 1 reply; 21+ messages in thread
From: Nam Cao @ 2024-04-19 15:34 UTC (permalink / raw)
  To: Patrik Jakobsson
  Cc: Takashi Iwai, Harshit Mogalapalli, Helge Deller,
	Thomas Zimmermann, Daniel Vetter, linux-fbdev, dri-devel,
	bigeasy, LKML, Vegard Nossum, George Kennedy, Darren Kenny,
	chuansheng.liu

On 2024-04-19 Patrik Jakobsson wrote:
> Neither cancel_delayed_work_sync() nor flush_delayed_work() prevents new
> work from being scheduled after it returns.

flush_delayed_work() is called during device closing. And because no
writes are performed after the device has been closed, no new work
should be queued after flush_delayed_work().

> But
> cancel_delayed_work_sync() at least makes sure the queue is empty so
> the problem becomes less apparent.
> 
> Could this explain what we're seeing?

I suspect that cancel_delayed_work_sync() is only treating the symptoms
by preventing the deferred work from running. The real bug is "someone"
giving fb_deferred_io_work() invalid pages to work with. But that's
just a blind guess.

Best regards,
Nam


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-19 14:59 ` Nam Cao
@ 2024-04-19 16:05   ` Harshit Mogalapalli
  2024-04-19 16:23     ` Nam Cao
  0 siblings, 1 reply; 21+ messages in thread
From: Harshit Mogalapalli @ 2024-04-19 16:05 UTC (permalink / raw)
  To: Nam Cao
  Cc: jayalk, Daniel Vetter, deller, linux-fbdev, dri-devel, tiwai,
	bigeasy, patrik.r.jakobsson, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu, stable, Greg Kroah-Hartman,
	Sasha Levin

Hi Nam,


+CC stable (heads up: this is a regression affecting 5.15.y and
probably other stable trees. Greg: this was reproducible upstream, so
everything was reported w.r.t. upstream code, but it was initially
found on 5.15.y)

On 19/04/24 20:29, Nam Cao wrote:
> On 2024-04-18 Harshit Mogalapalli wrote:
>> While fuzzing 5.15.y kernel with Syzkaller, we noticed a INFO: task hung
>> bug in fb_deferred_io_work()
> 
> I think the problem is caused by an incorrect offset calculation.
> The kernel calculates the address offset with:
> 	offset = vmf->address - vmf->vma->vm_start
> 
> Now the problem is that your C program mmap the framebuffer at 2
> different offsets:
> 	mmap(ptr, 4096, PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0xff000);
> 	mmap(ptr, 4096, PROT_WRITE, MAP_FIXED | MAP_SHARED, fd, 0);
> 
> but the kernel doesn't take these different offsets into account.
> So, 2 different pages are mistakenly recognized as the same page.
> 
> Can you try the following patch?

This patch works well against the reproducer: neither this simplified
repro nor the longer repro which syzkaller generated could trigger any
hang with the below patch applied.

Thanks,
Harshit
> 
> Best regards,
> Nam
> 
> diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c
> index dae96c9f61cf..d5d6cd9e8b29 100644
> --- a/drivers/video/fbdev/core/fb_defio.c
> +++ b/drivers/video/fbdev/core/fb_defio.c
> @@ -196,7 +196,8 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
>    */
>   static vm_fault_t fb_deferred_io_page_mkwrite(struct fb_info *info, struct vm_fault *vmf)
>   {
> -	unsigned long offset = vmf->address - vmf->vma->vm_start;
> +	unsigned long offset = vmf->address - vmf->vma->vm_start
> +			+ (vmf->vma->vm_pgoff << PAGE_SHIFT);
>   	struct page *page = vmf->page;
>   
>   	file_update_time(vmf->vma->vm_file);
> 
> 


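To make the difference concrete, a small worked example of the old and
new offset formulas for the reproducer's two mappings (the fault
address is an illustrative assumption; 4 KiB pages are assumed):

#include <stdio.h>

#define PAGE_SHIFT 12	/* assume 4 KiB pages */

static unsigned long defio_offset(unsigned long address,
				  unsigned long vm_start,
				  unsigned long vm_pgoff)
{
	return address - vm_start + (vm_pgoff << PAGE_SHIFT);
}

int main(void)
{
	/* both MAP_FIXED mappings fault at the same address */
	unsigned long address  = 0x7f0000000000UL;
	unsigned long vm_start = 0x7f0000000000UL;

	/* old formula (no vm_pgoff term): both mappings collide at 0 */
	printf("old: %#lx vs %#lx\n", address - vm_start, address - vm_start);

	/* new formula: offsets 0xff000 and 0 are kept apart */
	printf("new: %#lx vs %#lx\n",
	       defio_offset(address, vm_start, 0xff),	/* mmap offset 0xff000 */
	       defio_offset(address, vm_start, 0));	/* mmap offset 0 */
	return 0;
}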

* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-19 16:05   ` Harshit Mogalapalli
@ 2024-04-19 16:23     ` Nam Cao
  2024-04-19 16:39       ` Harshit Mogalapalli
  0 siblings, 1 reply; 21+ messages in thread
From: Nam Cao @ 2024-04-19 16:23 UTC (permalink / raw)
  To: Harshit Mogalapalli
  Cc: jayalk, Daniel Vetter, deller, linux-fbdev, dri-devel, tiwai,
	bigeasy, patrik.r.jakobsson, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu, stable, Greg Kroah-Hartman,
	Sasha Levin

On 2024-04-19 Harshit Mogalapalli wrote:
> +CC stable (heads up: this is a regression affecting 5.15.y and
> probably other stable trees. Greg: this was reproducible upstream, so
> everything was reported w.r.t. upstream code, but it was initially
> found on 5.15.y)

No need to worry about this, I will add a "Cc: <stable@vger.kernel.org>" tag
to the patch, and the stable folks (or their scripts) will pick it up.

> This patch works well against the reproducer, this simplified repro and 
> the longer repro which syzkaller generated couldn't trigger any hang 
> with the below patch applied.

Great, thanks! Can I add
	Reported-and-tested-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
to the patch?

Best regards,
Nam


* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-19 16:23     ` Nam Cao
@ 2024-04-19 16:39       ` Harshit Mogalapalli
  0 siblings, 0 replies; 21+ messages in thread
From: Harshit Mogalapalli @ 2024-04-19 16:39 UTC (permalink / raw)
  To: Nam Cao
  Cc: jayalk, Daniel Vetter, deller, linux-fbdev, dri-devel, tiwai,
	bigeasy, patrik.r.jakobsson, LKML, Vegard Nossum, George Kennedy,
	Darren Kenny, chuansheng.liu, stable, Greg Kroah-Hartman,
	Sasha Levin

On 19/04/24 21:53, Nam Cao wrote:
> On 2024-04-19 Harshit Mogalapalli wrote:
>> +CC stable( heads up as this is a regression affecting 5.15.y and
>> probably others, Greg: this was reproducible upstream so reported
>> everything w.r.t upstream code but initially found on 5.15.y)
> 
> No need to worry about this, I will add a "Cc: <stable@vger.kernel.org>" tag
> to the patch, and the stable folks (or their scripts) will pick it up.
> 

Thanks for that!

> This patch works well against the reproducer: neither this simplified
> repro nor the longer repro which syzkaller generated could trigger any
> hang with the below patch applied.
> 
> Great, thanks! Can I add
> 	Reported-and-tested-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
> to the patch?
> 
Sure, that would be good!

Thanks,
Harshit
> Best regards,
> Nam



* Re: [bug-report] task info hung problem in fb_deferred_io_work()
  2024-04-19 15:34               ` Nam Cao
@ 2024-04-30 11:45                 ` Patrik Jakobsson
  0 siblings, 0 replies; 21+ messages in thread
From: Patrik Jakobsson @ 2024-04-30 11:45 UTC (permalink / raw)
  To: Nam Cao
  Cc: Takashi Iwai, Harshit Mogalapalli, Helge Deller,
	Thomas Zimmermann, Daniel Vetter, linux-fbdev, dri-devel,
	bigeasy, LKML, Vegard Nossum, George Kennedy, Darren Kenny,
	chuansheng.liu

On Fri, Apr 19, 2024 at 5:34 PM Nam Cao <namcao@linutronix.de> wrote:
>
> On 2024-04-19 Patrik Jakobsson wrote:
> > Neither cancel_delayed_work_sync() nor flush_delayed_work() prevents new
> > work from being scheduled after it returns.
>
> flush_delayed_work() is called during device closing. And because no
> writes are performed after the device has been closed, no new work
> should be queued after flush_delayed_work().

Yes, nothing should write after the device is closed, but the events
are asynchronous, so in theory the order is not guaranteed. I also find
it unlikely, but I have no other theory at this point.

>
> > But
> > cancel_delayed_work_sync() at least makes sure the queue is empty so
> > the problem becomes less apparent.
> >
> > Could this explain what we're seeing?
>
> I suspect that cancel_delayed_work_sync() is only treating the symptoms
> by preventing the deferred work from running. The real bug is "someone"
> giving fb_deferred_io_work() invalid pages to work with. But that's
> just a blind guess.

Trying to figure out when the page goes away in relation to when the
work is triggered might be a good place to start.

>
> Best regards,
> Nam

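A hedged sketch of such instrumentation, assuming the pageref fields
used by the patch earlier in this thread (page, offset) and the
pagereflist walk in fb_deferred_io_work(); exact variable names may
differ by kernel version:

/* in fb_deferred_io_track_page(), when a pageref is enlisted */
pr_info("defio: enlist page=%p offset=%#lx\n",
	pageref->page, pageref->offset);

/* in fb_deferred_io_work(), as each page is written back */
pr_info("defio: writeback page=%p offset=%#lx\n",
	pageref->page, pageref->offset);

Correlating these with a third print in the release path would show
whether a writeback can land after the pages are gone.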

end of thread

Thread overview: 21+ messages
2024-04-18 12:39 [bug-report] task info hung problem in fb_deferred_io_work() Harshit Mogalapalli
2024-04-18 13:14 ` Patrik Jakobsson
2024-04-18 14:05   ` Nam Cao
2024-04-18 14:34     ` Patrik Jakobsson
2024-04-19 15:18       ` Nam Cao
2024-04-19  7:27   ` Harshit Mogalapalli
2024-04-18 14:06 ` Nam Cao
2024-04-18 14:26   ` Takashi Iwai
2024-04-18 19:29     ` Helge Deller
2024-04-19  6:44       ` Takashi Iwai
2024-04-19  7:39         ` Harshit Mogalapalli
2024-04-19  7:45           ` Takashi Iwai
2024-04-19  7:50             ` Patrik Jakobsson
2024-04-19 15:34               ` Nam Cao
2024-04-30 11:45                 ` Patrik Jakobsson
2024-04-19 14:42             ` Harshit Mogalapalli
2024-04-19  7:35   ` Harshit Mogalapalli
2024-04-19 14:59 ` Nam Cao
2024-04-19 16:05   ` Harshit Mogalapalli
2024-04-19 16:23     ` Nam Cao
2024-04-19 16:39       ` Harshit Mogalapalli
