All of lore.kernel.org
 help / color / mirror / Atom feed
* general protection fault in wb_workfn (2)
@ 2018-05-26  9:15 syzbot
  2018-05-27  0:47 ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: syzbot @ 2018-05-26  9:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, syzkaller-bugs, viro

Hello,

syzbot found the following crash on:

HEAD commit:    305bb5521282 Merge tag 'selinux-pr-20180516' of git://git...
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=153eb40f800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=f3b4e30da84ec1ed
dashboard link: https://syzkaller.appspot.com/bug?extid=4a7438e774b21ddd8eca
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com

binder: 13169:13171 ioctl 40047459 20000000 returned -22
sock: process `syz-executor6' is using obsolete setsockopt SO_BSDCOMPAT
binder: 13169:13202 Acquire 1 refcount change on invalid ref 0 ret -22
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 88 Comm: kworker/u4:3 Not tainted 4.17.0-rc5+ #55
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: writeback wb_workfn
RIP: 0010:dev_name include/linux/device.h:1008 [inline]
RIP: 0010:wb_workfn+0x195/0x1740 fs/fs-writeback.c:1937
RSP: 0018:ffff8801d964f270 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff814e0f15
RDX: 000000000000000a RSI: ffffffff81cd221d RDI: 0000000000000050
RBP: ffff8801d964f750 R08: ffff8801d97c6700 R09: ffffed003b5e46c2
R10: ffffed003b5e46c2 R11: ffff8801daf23613 R12: 0000000000000001
R13: 1ffff1003b2c9f37 R14: ffff8801d964f728 R15: ffff8801d6836f18
FS:  0000000000000000(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fec3a840db8 CR3: 00000001b49ae000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
  process_scheduled_works kernel/workqueue.c:2205 [inline]
  worker_thread+0xa30/0x1440 kernel/workqueue.c:2284
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 ee 13 00 00 48 8b 9b 08 06 00 00 48  
b8 00 00 00 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 <80> 3c 02 00  
0f 85 3f 14 00 00 4c 8b 63 50 4d 85 e4 0f 84 a9 0e
RIP: dev_name include/linux/device.h:1008 [inline] RSP: ffff8801d964f270
RIP: wb_workfn+0x195/0x1740 fs/fs-writeback.c:1937 RSP: ffff8801d964f270
---[ end trace baf4ced88bb756b8 ]---


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-05-26  9:15 general protection fault in wb_workfn (2) syzbot
@ 2018-05-27  0:47 ` Tetsuo Handa
  2018-05-27  2:21   ` [PATCH] bdi: Fix another oops in wb_workfn() Tetsuo Handa
  2018-05-28 13:35   ` general protection fault in wb_workfn (2) Jan Kara
  0 siblings, 2 replies; 43+ messages in thread
From: Tetsuo Handa @ 2018-05-27  0:47 UTC (permalink / raw)
  To: syzbot, syzkaller-bugs, jack
  Cc: linux-fsdevel, linux-kernel, viro, axboe, tj, david, linux-block

Forwarding http://lkml.kernel.org/r/201805251915.FGH64517.HVFJOOLFFMQStO@I-love.SAKURA.ne.jp .

Jan Kara wrote:
> > void delayed_work_timer_fn(struct timer_list *t)
> > {
> > 	struct delayed_work *dwork = from_timer(dwork, t, timer);
> > 
> > 	/* should have been called from irqsafe timer with irq already off */
> > 	__queue_work(dwork->cpu, dwork->wq, &dwork->work);
> > }
> > 
> > Then, wb_workfn() is after all scheduled even if we check for
> > WB_registered bit, isn't it?
> 
> It can be queued after WB_registered bit is cleared but it cannot be queued
> after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That function
> deletes the pending timer (the timer cannot be armed again because
> WB_registered is cleared) and queues what should be the last round of
> wb_workfn().

mod_delayed_work() deletes the pending timer but does not wait for already
invoked timer handler to complete because it is using del_timer() rather than
del_timer_sync(). Then, what happens if __queue_work() is almost concurrently
executed from two CPUs, one from mod_delayed_work(bdi_wq, &wb->dwork, 0) from
wb_shutdown() path (which is called without spin_lock_bh(&wb->work_lock)) and
the other from delayed_work_timer_fn() path (which is called without checking
WB_registered bit under spin_lock_bh(&wb->work_lock)) ?

wb_wakeup_delayed() {
  spin_lock_bh(&wb->work_lock);
  if (test_bit(WB_registered, &wb->state)) // succeeds
    queue_delayed_work(bdi_wq, &wb->d_work, timeout) {
      queue_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->d_work, timeout) {
         if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&wb->d_work.work))) { // succeeds
           __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->d_work, timeout) {
             add_timer(timer); // schedules for delayed_work_timer_fn()
           }
         }
      }
    }
  spin_unlock_bh(&wb->work_lock);
}

delayed_work_timer_fn() {
  // del_timer() already returns false at this point because this timer
  // is already inside handler. But something took long here enough to
  // wait for __queue_work() from wb_shutdown() path to finish?
  __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->d_work.work) {
    insert_work(pwq, work, worklist, work_flags);
  }
}

wb_shutdown() {
  mod_delayed_work(bdi_wq, &wb->dwork, 0) {
    mod_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
      ret = try_to_grab_pending(&wb->dwork.work, true, &flags) {
        if (likely(del_timer(&wb->dwork.timer))) // fails because already in delayed_work_timer_fn()
          return 1;
        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&wb->dwork.work))) // fails because already set by queue_delayed_work()
          return 0;
        // Returns 1 or -ENOENT after doing something?
      }
      if (ret >= 0)
        __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
          __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork.work) {
            insert_work(pwq, work, worklist, work_flags);
          }
        }
    }
  }
}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] bdi: Fix another oops in wb_workfn()
  2018-05-27  0:47 ` Tetsuo Handa
@ 2018-05-27  2:21   ` Tetsuo Handa
  2018-05-27  2:36     ` Tejun Heo
  2018-05-28 13:35   ` general protection fault in wb_workfn (2) Jan Kara
  1 sibling, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2018-05-27  2:21 UTC (permalink / raw)
  To: syzbot, syzkaller-bugs, jack
  Cc: linux-fsdevel, linux-kernel, viro, axboe, tj, david, linux-block

>From 8a8222698163d1fe180258566e9a3ff43f54fcd9 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 27 May 2018 11:08:20 +0900
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is still hitting NULL pointer dereference at wb_workfn() [1].
This might be because we overlooked that delayed_work_timer_fn() does not
check WB_registered before calling __queue_work() while mod_delayed_work()
does not wait for already started delayed_work_timer_fn() because it uses
del_timer() rather than del_timer_sync().

Make wb_shutdown() be careful about wb_wakeup_delayed() path.

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
---
 mm/backing-dev.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7441bd9..31e1d7e 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -372,11 +372,24 @@ static void wb_shutdown(struct bdi_writeback *wb)
 
 	cgwb_remove_from_bdi_list(wb);
 	/*
+	 * mod_delayed_work() is not appropriate here, for
+	 * delayed_work_timer_fn() from wb_wakeup_delayed() does not check
+	 * WB_registered before calling __queue_work().
+	 */
+	del_timer_sync(&wb->dwork.timer);
+	/*
+	 * Clear WORK_STRUCT_PENDING_BIT in order to make sure that next
+	 * queue_delayed_work() actually enqueues this work to the tail, for
+	 * wb_wakeup_delayed() already set WORK_STRUCT_PENDING_BIT before
+	 * scheduling delayed_work_timer_fn().
+	 */
+	cancel_delayed_work_sync(&wb->dwork);
+	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
 	 * be drained no matter what.
 	 */
-	mod_delayed_work(bdi_wq, &wb->dwork, 0);
+	queue_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
 	/*
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-05-27  2:21   ` [PATCH] bdi: Fix another oops in wb_workfn() Tetsuo Handa
@ 2018-05-27  2:36     ` Tejun Heo
  2018-05-27  4:43       ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2018-05-27  2:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: syzbot, syzkaller-bugs, jack, linux-fsdevel, linux-kernel, viro,
	axboe, david, linux-block

On Sun, May 27, 2018 at 11:21:25AM +0900, Tetsuo Handa wrote:
> From 8a8222698163d1fe180258566e9a3ff43f54fcd9 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 27 May 2018 11:08:20 +0900
> Subject: [PATCH] bdi: Fix another oops in wb_workfn()
> 
> syzbot is still hitting NULL pointer dereference at wb_workfn() [1].
> This might be because we overlooked that delayed_work_timer_fn() does not
> check WB_registered before calling __queue_work() while mod_delayed_work()
> does not wait for already started delayed_work_timer_fn() because it uses
> del_timer() rather than del_timer_sync().

It shouldn't be that as dwork timer is an irq safe timer.  Even if
that's the case, the right thing to do would be fixing workqueue
rather than reaching into workqueue internals from backing-dev code.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-05-27  2:36     ` Tejun Heo
@ 2018-05-27  4:43       ` Tetsuo Handa
  2018-05-29 13:46         ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2018-05-27  4:43 UTC (permalink / raw)
  To: tj
  Cc: syzbot+4a7438e774b21ddd8eca, syzkaller-bugs, jack, viro, axboe,
	david, linux-block

Tejun Heo wrote:
> On Sun, May 27, 2018 at 11:21:25AM +0900, Tetsuo Handa wrote:
> > syzbot is still hitting NULL pointer dereference at wb_workfn() [1].
> > This might be because we overlooked that delayed_work_timer_fn() does not
> > check WB_registered before calling __queue_work() while mod_delayed_work()
> > does not wait for already started delayed_work_timer_fn() because it uses
> > del_timer() rather than del_timer_sync().
> 
> It shouldn't be that as dwork timer is an irq safe timer.  Even if
> that's the case, the right thing to do would be fixing workqueue
> rather than reaching into workqueue internals from backing-dev code.
> 

Do you think that there is possibility that __queue_work() is almost concurrently
executed from two CPUs, one from mod_delayed_work(bdi_wq, &wb->dwork, 0) from
wb_shutdown() path (which is called without spin_lock_bh(&wb->work_lock)) and
the other from delayed_work_timer_fn() path (which is called without checking
WB_registered bit under spin_lock_bh(&wb->work_lock)) ?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-05-27  0:47 ` Tetsuo Handa
  2018-05-27  2:21   ` [PATCH] bdi: Fix another oops in wb_workfn() Tetsuo Handa
@ 2018-05-28 13:35   ` Jan Kara
  2018-05-30 16:00       ` Tetsuo Handa
  1 sibling, 1 reply; 43+ messages in thread
From: Jan Kara @ 2018-05-28 13:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: syzbot, syzkaller-bugs, jack, linux-fsdevel, linux-kernel, viro,
	axboe, tj, david, linux-block

On Sun 27-05-18 09:47:54, Tetsuo Handa wrote:
> Forwarding http://lkml.kernel.org/r/201805251915.FGH64517.HVFJOOLFFMQStO@I-love.SAKURA.ne.jp .
> 
> Jan Kara wrote:
> > > void delayed_work_timer_fn(struct timer_list *t)
> > > {
> > > 	struct delayed_work *dwork = from_timer(dwork, t, timer);
> > > 
> > > 	/* should have been called from irqsafe timer with irq already off */
> > > 	__queue_work(dwork->cpu, dwork->wq, &dwork->work);
> > > }
> > > 
> > > Then, wb_workfn() is after all scheduled even if we check for
> > > WB_registered bit, isn't it?
> > 
> > It can be queued after WB_registered bit is cleared but it cannot be queued
> > after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That function
> > deletes the pending timer (the timer cannot be armed again because
> > WB_registered is cleared) and queues what should be the last round of
> > wb_workfn().
> 
> mod_delayed_work() deletes the pending timer but does not wait for already
> invoked timer handler to complete because it is using del_timer() rather than
> del_timer_sync(). Then, what happens if __queue_work() is almost concurrently
> executed from two CPUs, one from mod_delayed_work(bdi_wq, &wb->dwork, 0) from
> wb_shutdown() path (which is called without spin_lock_bh(&wb->work_lock)) and
> the other from delayed_work_timer_fn() path (which is called without checking
> WB_registered bit under spin_lock_bh(&wb->work_lock)) ?

In this case, work should still be queued only once. The synchronization in
this case should be provided by the WORK_STRUCT_PENDING_BIT. When a delayed
work is queued by mod_delayed_work(), this bit is set, and gets cleared
only once the work is started on some CPU. But admittedly this code is
rather convoluted so I may be missing something.

Also you should note that flush_delayed_work() which follows
mod_delayed_work() in wb_shutdown() does del_timer_sync() so I don't see
how anything could get past that. In fact mod_delayed_work() is in
wb_shutdown() path to make sure wb_workfn() gets executed at least once
before the bdi_writeback structure gets cleaned up so that all queued items
are finished. We do not rely on it to remove pending timers or queued
wb_workfn() executions.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-05-27  4:43       ` Tetsuo Handa
@ 2018-05-29 13:46         ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2018-05-29 13:46 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: syzbot+4a7438e774b21ddd8eca, syzkaller-bugs, jack, viro, axboe,
	david, linux-block

On Sun, May 27, 2018 at 01:43:45PM +0900, Tetsuo Handa wrote:
> Tejun Heo wrote:
> > On Sun, May 27, 2018 at 11:21:25AM +0900, Tetsuo Handa wrote:
> > > syzbot is still hitting NULL pointer dereference at wb_workfn() [1].
> > > This might be because we overlooked that delayed_work_timer_fn() does not
> > > check WB_registered before calling __queue_work() while mod_delayed_work()
> > > does not wait for already started delayed_work_timer_fn() because it uses
> > > del_timer() rather than del_timer_sync().
> > 
> > It shouldn't be that as dwork timer is an irq safe timer.  Even if
> > that's the case, the right thing to do would be fixing workqueue
> > rather than reaching into workqueue internals from backing-dev code.
> > 
> 
> Do you think that there is possibility that __queue_work() is almost concurrently
> executed from two CPUs, one from mod_delayed_work(bdi_wq, &wb->dwork, 0) from
> wb_shutdown() path (which is called without spin_lock_bh(&wb->work_lock)) and
> the other from delayed_work_timer_fn() path (which is called without checking
> WB_registered bit under spin_lock_bh(&wb->work_lock)) ?

__queue_work() is gated by WORK_STRUCT_PENDING_BIT, so I don't see how
multiple instances would execute concurrently for the same work item.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-05-28 13:35   ` general protection fault in wb_workfn (2) Jan Kara
@ 2018-05-30 16:00       ` Tetsuo Handa
  0 siblings, 0 replies; 43+ messages in thread
From: Tetsuo Handa @ 2018-05-30 16:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: syzbot, syzkaller-bugs, linux-fsdevel, linux-kernel, viro, axboe,
	tj, david, linux-block

So, we have no idea what is happening...
Then, what about starting from temporary debug printk() patch shown below?

>>From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 30 May 2018 09:57:10 +0900
Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
 bdi_unregister() race bug.

syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
limitations that syzbot cannot find reproducer for this bug (frequency is
once or twice per a day) nor we can't capture vmcore in the environment
which syzbot is using, for now we need to rely on printk() debugging.

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 block/Kconfig     |  7 +++++++
 fs/fs-writeback.c | 31 +++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 28ec557..fbce13e 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -139,6 +139,13 @@ config BLK_CMDLINE_PARSER
 
 	See Documentation/block/cmdline-partition.txt for more information.
 
+config BLK_DEBUG_WB_WORKFN_RACE
+	bool "Dump upon hitting wb_workfn() versus bdi_unregister() race bug."
+	default n
+	---help---
+	This is a temporary option used for obtaining information for
+	specific bug. This option will be removed after the bug is fixed.
+
 config BLK_WBT
 	bool "Enable support for block device writeback throttling"
 	default n
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 471d863..b4dd078 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1934,6 +1934,37 @@ void wb_workfn(struct work_struct *work)
 						struct bdi_writeback, dwork);
 	long pages_written;
 
+#ifdef CONFIG_BLK_DEBUG_WB_WORKFN_RACE
+	if (!wb->bdi->dev) {
+		pr_warn("WARNING: %s: device is NULL\n", __func__);
+		pr_warn("wb->state=%lx\n", wb->state);
+		pr_warn("list_empty(&wb->work_list)=%u\n",
+			list_empty(&wb->work_list));
+		if (!wb->bdi)
+			pr_warn("wb->bdi == NULL\n");
+		else {
+			pr_warn("list_empty(&wb->bdi->bdi_list)=%u\n",
+				list_empty(&wb->bdi->bdi_list));
+			pr_warn("wb->bdi->wb.state=%lx\n", wb->bdi->wb.state);
+		}
+		if (!wb->congested)
+			pr_warn("wb->congested == NULL\n");
+#ifdef CONFIG_CGROUP_WRITEBACK
+		else if (!wb->congested->__bdi)
+			pr_warn("wb->congested->__bdi == NULL\n");
+		else {
+			pr_warn("(wb->congested->__bdi == wb->bdi)=%u\n",
+				wb->congested->__bdi == wb->bdi);
+			pr_warn("list_empty(&wb->congested->__bdi->bdi_list)=%u\n",
+				list_empty(&wb->congested->__bdi->bdi_list));
+			pr_warn("wb->congested->__bdi->wb.state=%lx\n",
+				wb->congested->__bdi->wb.state);
+		}
+#endif
+		/* Will halt shortly due to NULL pointer dereference... */
+	}
+#endif
+
 	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
 	current->flags |= PF_SWAPWRITE;
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
@ 2018-05-30 16:00       ` Tetsuo Handa
  0 siblings, 0 replies; 43+ messages in thread
From: Tetsuo Handa @ 2018-05-30 16:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: syzbot, syzkaller-bugs, linux-fsdevel, linux-kernel, viro, axboe,
	tj, david, linux-block

So, we have no idea what is happening...
Then, what about starting from temporary debug printk() patch shown below?

>From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 30 May 2018 09:57:10 +0900
Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
 bdi_unregister() race bug.

syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
limitations that syzbot cannot find reproducer for this bug (frequency is
once or twice per a day) nor we can't capture vmcore in the environment
which syzbot is using, for now we need to rely on printk() debugging.

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 block/Kconfig     |  7 +++++++
 fs/fs-writeback.c | 31 +++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 28ec557..fbce13e 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -139,6 +139,13 @@ config BLK_CMDLINE_PARSER
 
 	See Documentation/block/cmdline-partition.txt for more information.
 
+config BLK_DEBUG_WB_WORKFN_RACE
+	bool "Dump upon hitting wb_workfn() versus bdi_unregister() race bug."
+	default n
+	---help---
+	This is a temporary option used for obtaining information for
+	specific bug. This option will be removed after the bug is fixed.
+
 config BLK_WBT
 	bool "Enable support for block device writeback throttling"
 	default n
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 471d863..b4dd078 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1934,6 +1934,37 @@ void wb_workfn(struct work_struct *work)
 						struct bdi_writeback, dwork);
 	long pages_written;
 
+#ifdef CONFIG_BLK_DEBUG_WB_WORKFN_RACE
+	if (!wb->bdi->dev) {
+		pr_warn("WARNING: %s: device is NULL\n", __func__);
+		pr_warn("wb->state=%lx\n", wb->state);
+		pr_warn("list_empty(&wb->work_list)=%u\n",
+			list_empty(&wb->work_list));
+		if (!wb->bdi)
+			pr_warn("wb->bdi == NULL\n");
+		else {
+			pr_warn("list_empty(&wb->bdi->bdi_list)=%u\n",
+				list_empty(&wb->bdi->bdi_list));
+			pr_warn("wb->bdi->wb.state=%lx\n", wb->bdi->wb.state);
+		}
+		if (!wb->congested)
+			pr_warn("wb->congested == NULL\n");
+#ifdef CONFIG_CGROUP_WRITEBACK
+		else if (!wb->congested->__bdi)
+			pr_warn("wb->congested->__bdi == NULL\n");
+		else {
+			pr_warn("(wb->congested->__bdi == wb->bdi)=%u\n",
+				wb->congested->__bdi == wb->bdi);
+			pr_warn("list_empty(&wb->congested->__bdi->bdi_list)=%u\n",
+				list_empty(&wb->congested->__bdi->bdi_list));
+			pr_warn("wb->congested->__bdi->wb.state=%lx\n",
+				wb->congested->__bdi->wb.state);
+		}
+#endif
+		/* Will halt shortly due to NULL pointer dereference... */
+	}
+#endif
+
 	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
 	current->flags |= PF_SWAPWRITE;
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-05-30 16:00       ` Tetsuo Handa
  (?)
@ 2018-05-31 11:42       ` Jan Kara
  2018-05-31 13:19         ` Tetsuo Handa
  -1 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2018-05-31 11:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel, linux-kernel,
	viro, axboe, tj, david, linux-block

On Thu 31-05-18 01:00:08, Tetsuo Handa wrote:
> So, we have no idea what is happening...
> Then, what about starting from temporary debug printk() patch shown below?
> 
> >From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 30 May 2018 09:57:10 +0900
> Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
>  bdi_unregister() race bug.
> 
> syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
> limitations that syzbot cannot find reproducer for this bug (frequency is
> once or twice per a day) nor we can't capture vmcore in the environment
> which syzbot is using, for now we need to rely on printk() debugging.
> 
> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Hum a bit ugly solution but if others are fine with this, I can live with
it for a while as well. Or would it be possible for syzkaller to just test
some git tree where this patch is included? Then we would not even have to
have the extra config option...

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 471d863..b4dd078 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1934,6 +1934,37 @@ void wb_workfn(struct work_struct *work)
>  						struct bdi_writeback, dwork);
>  	long pages_written;
>  
> +#ifdef CONFIG_BLK_DEBUG_WB_WORKFN_RACE
> +	if (!wb->bdi->dev) {
> +		pr_warn("WARNING: %s: device is NULL\n", __func__);
> +		pr_warn("wb->state=%lx\n", wb->state);
> +		pr_warn("list_empty(&wb->work_list)=%u\n",
> +			list_empty(&wb->work_list));
> +		if (!wb->bdi)

This is not possible when we dereferences wb->bdi above...

> +			pr_warn("wb->bdi == NULL\n");
> +		else {
> +			pr_warn("list_empty(&wb->bdi->bdi_list)=%u\n",
> +				list_empty(&wb->bdi->bdi_list));
> +			pr_warn("wb->bdi->wb.state=%lx\n", wb->bdi->wb.state);
> +		}

It would be also good to print whether wb == wb->bdi->wb (i.e. it is the
default writeback structure or one for some cgroup) and also
wb->bdi->wb.state.

								Honza

> +		if (!wb->congested)
> +			pr_warn("wb->congested == NULL\n");
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +		else if (!wb->congested->__bdi)
> +			pr_warn("wb->congested->__bdi == NULL\n");
> +		else {
> +			pr_warn("(wb->congested->__bdi == wb->bdi)=%u\n",
> +				wb->congested->__bdi == wb->bdi);
> +			pr_warn("list_empty(&wb->congested->__bdi->bdi_list)=%u\n",
> +				list_empty(&wb->congested->__bdi->bdi_list));
> +			pr_warn("wb->congested->__bdi->wb.state=%lx\n",
> +				wb->congested->__bdi->wb.state);
> +		}
> +#endif
> +		/* Will halt shortly due to NULL pointer dereference... */
> +	}
> +#endif
> +
>  	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
>  	current->flags |= PF_SWAPWRITE;
>  
> -- 
> 1.8.3.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-05-31 11:42       ` Jan Kara
@ 2018-05-31 13:19         ` Tetsuo Handa
  2018-05-31 13:42           ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2018-05-31 13:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: syzbot, syzkaller-bugs, linux-fsdevel, linux-kernel, viro, axboe,
	tj, david, linux-block, Dmitry Vyukov

On 2018/05/31 20:42, Jan Kara wrote:
> On Thu 31-05-18 01:00:08, Tetsuo Handa wrote:
>> So, we have no idea what is happening...
>> Then, what about starting from temporary debug printk() patch shown below?
>>
>> >From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Date: Wed, 30 May 2018 09:57:10 +0900
>> Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
>>  bdi_unregister() race bug.
>>
>> syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
>> limitations that syzbot cannot find reproducer for this bug (frequency is
>> once or twice per a day) nor we can't capture vmcore in the environment
>> which syzbot is using, for now we need to rely on printk() debugging.
>>
>> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
>>
>> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> Hum a bit ugly solution but if others are fine with this, I can live with
> it for a while as well. Or would it be possible for syzkaller to just test
> some git tree where this patch is included? Then we would not even have to
> have the extra config option...

If syzbot can reproduce this bug that way. While it is possible to add/remove
git trees syzbot tests, frequently adding/removing trees is bothering.

syzbot can enable extra config option. Maybe the config name should be
something like CONFIG_DEBUG_FOR_SYZBOT rather than individual topic.

I think that syzbot is using many VM instances. I don't know how many
instances will be needed for reproducing this bug within reasonable period.
More git trees syzbot tests, (I assume that) longer period will be needed
for reproducing this bug. The most reliable way is to use the shared part
of all trees (i.e. linux.git).

> 
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index 471d863..b4dd078 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -1934,6 +1934,37 @@ void wb_workfn(struct work_struct *work)
>>  						struct bdi_writeback, dwork);
>>  	long pages_written;
>>  
>> +#ifdef CONFIG_BLK_DEBUG_WB_WORKFN_RACE
>> +	if (!wb->bdi->dev) {
>> +		pr_warn("WARNING: %s: device is NULL\n", __func__);
>> +		pr_warn("wb->state=%lx\n", wb->state);
>> +		pr_warn("list_empty(&wb->work_list)=%u\n",
>> +			list_empty(&wb->work_list));
>> +		if (!wb->bdi)
> 
> This is not possible when we dereferences wb->bdi above...

Oops. I missed it.

> 
>> +			pr_warn("wb->bdi == NULL\n");
>> +		else {
>> +			pr_warn("list_empty(&wb->bdi->bdi_list)=%u\n",
>> +				list_empty(&wb->bdi->bdi_list));
>> +			pr_warn("wb->bdi->wb.state=%lx\n", wb->bdi->wb.state);
>> +		}
> 
> It would be also good to print whether wb == wb->bdi->wb (i.e. it is the
> default writeback structure or one for some cgroup) and also
> wb->bdi->wb.state.
> 

wb->bdi->wb.state is already printed. Updated patch is shown below.
Anything else to print?



>From 3f3346d42b804e59d12caaa525365a8482505f08 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 31 May 2018 22:07:20 +0900
Subject: [PATCH v2] bdi: Add temporary config for debugging wb_workfn() versus
 bdi_unregister() race bug.

syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
limitations that syzbot cannot find reproducer for this bug (frequency is
once or twice per a day) nor we can't capture vmcore in the environment
which syzbot is using, for now we need to rely on printk() debugging.

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 block/Kconfig     |  7 +++++++
 fs/fs-writeback.c | 28 ++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 28ec557..fbce13e 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -139,6 +139,13 @@ config BLK_CMDLINE_PARSER
 
 	See Documentation/block/cmdline-partition.txt for more information.
 
+config BLK_DEBUG_WB_WORKFN_RACE
+	bool "Dump upon hitting wb_workfn() versus bdi_unregister() race bug."
+	default n
+	---help---
+	This is a temporary option used for obtaining information for
+	specific bug. This option will be removed after the bug is fixed.
+
 config BLK_WBT
 	bool "Enable support for block device writeback throttling"
 	default n
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 471d863..14ab873 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1934,6 +1934,34 @@ void wb_workfn(struct work_struct *work)
 						struct bdi_writeback, dwork);
 	long pages_written;
 
+#ifdef CONFIG_BLK_DEBUG_WB_WORKFN_RACE
+	if (!wb->bdi->dev) {
+		pr_warn("WARNING: %s: device is NULL\n", __func__);
+		pr_warn("wb->state=%lx\n", wb->state);
+		pr_warn("(wb == &wb->bdi->wb)=%u\n", wb == &wb->bdi->wb);
+		pr_warn("list_empty(&wb->work_list)=%u\n",
+			list_empty(&wb->work_list));
+		pr_warn("list_empty(&wb->bdi->bdi_list)=%u\n",
+			list_empty(&wb->bdi->bdi_list));
+		pr_warn("wb->bdi->wb.state=%lx\n", wb->bdi->wb.state);
+		if (!wb->congested)
+			pr_warn("wb->congested == NULL\n");
+#ifdef CONFIG_CGROUP_WRITEBACK
+		else if (!wb->congested->__bdi)
+			pr_warn("wb->congested->__bdi == NULL\n");
+		else {
+			pr_warn("(wb->congested->__bdi == wb->bdi)=%u\n",
+				wb->congested->__bdi == wb->bdi);
+			pr_warn("list_empty(&wb->congested->__bdi->bdi_list)=%u\n",
+				list_empty(&wb->congested->__bdi->bdi_list));
+			pr_warn("wb->congested->__bdi->wb.state=%lx\n",
+				wb->congested->__bdi->wb.state);
+		}
+#endif
+		/* Will halt shortly due to NULL pointer dereference... */
+	}
+#endif
+
 	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
 	current->flags |= PF_SWAPWRITE;
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-05-31 13:19         ` Tetsuo Handa
@ 2018-05-31 13:42           ` Jan Kara
  2018-05-31 16:56             ` Jens Axboe
  2018-06-01  2:30             ` general protection fault in wb_workfn (2) Dave Chinner
  0 siblings, 2 replies; 43+ messages in thread
From: Jan Kara @ 2018-05-31 13:42 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel, linux-kernel,
	viro, axboe, tj, david, linux-block, Dmitry Vyukov

On Thu 31-05-18 22:19:44, Tetsuo Handa wrote:
> On 2018/05/31 20:42, Jan Kara wrote:
> > On Thu 31-05-18 01:00:08, Tetsuo Handa wrote:
> >> So, we have no idea what is happening...
> >> Then, what about starting from temporary debug printk() patch shown below?
> >>
> >> >From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
> >> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> >> Date: Wed, 30 May 2018 09:57:10 +0900
> >> Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
> >>  bdi_unregister() race bug.
> >>
> >> syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
> >> limitations that syzbot cannot find reproducer for this bug (frequency is
> >> once or twice per a day) nor we can't capture vmcore in the environment
> >> which syzbot is using, for now we need to rely on printk() debugging.
> >>
> >> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
> >>
> >> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > 
> > Hum a bit ugly solution but if others are fine with this, I can live with
> > it for a while as well. Or would it be possible for syzkaller to just test
> > some git tree where this patch is included? Then we would not even have to
> > have the extra config option...
> 
> If syzbot can reproduce this bug that way. While it is possible to add/remove
> git trees syzbot tests, frequently adding/removing trees is bothering.
> 
> syzbot can enable extra config option. Maybe the config name should be
> something like CONFIG_DEBUG_FOR_SYZBOT rather than individual topic.
> 
> I think that syzbot is using many VM instances. I don't know how many
> instances will be needed for reproducing this bug within reasonable period.
> More git trees syzbot tests, (I assume that) longer period will be needed
> for reproducing this bug. The most reliable way is to use the shared part
> of all trees (i.e. linux.git).

I understand this, I'd be just a bit reluctant to merge temporary debug
patches like this to Linus' tree only to revert them later just because
syzkaller... What do others think?

> From 3f3346d42b804e59d12caaa525365a8482505f08 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Thu, 31 May 2018 22:07:20 +0900
> Subject: [PATCH v2] bdi: Add temporary config for debugging wb_workfn() versus
>  bdi_unregister() race bug.
> 
> syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
> limitations that syzbot cannot find reproducer for this bug (frequency is
> once or twice per a day) nor we can't capture vmcore in the environment
> which syzbot is using, for now we need to rely on printk() debugging.
> 
> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

The debug patch looks good to me now. Thanks for writing it.

								Honza

> ---
>  block/Kconfig     |  7 +++++++
>  fs/fs-writeback.c | 28 ++++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index 28ec557..fbce13e 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -139,6 +139,13 @@ config BLK_CMDLINE_PARSER
>  
>  	See Documentation/block/cmdline-partition.txt for more information.
>  
> +config BLK_DEBUG_WB_WORKFN_RACE
> +	bool "Dump upon hitting wb_workfn() versus bdi_unregister() race bug."
> +	default n
> +	---help---
> +	This is a temporary option used for obtaining information for
> +	specific bug. This option will be removed after the bug is fixed.
> +
>  config BLK_WBT
>  	bool "Enable support for block device writeback throttling"
>  	default n
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 471d863..14ab873 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1934,6 +1934,34 @@ void wb_workfn(struct work_struct *work)
>  						struct bdi_writeback, dwork);
>  	long pages_written;
>  
> +#ifdef CONFIG_BLK_DEBUG_WB_WORKFN_RACE
> +	if (!wb->bdi->dev) {
> +		pr_warn("WARNING: %s: device is NULL\n", __func__);
> +		pr_warn("wb->state=%lx\n", wb->state);
> +		pr_warn("(wb == &wb->bdi->wb)=%u\n", wb == &wb->bdi->wb);
> +		pr_warn("list_empty(&wb->work_list)=%u\n",
> +			list_empty(&wb->work_list));
> +		pr_warn("list_empty(&wb->bdi->bdi_list)=%u\n",
> +			list_empty(&wb->bdi->bdi_list));
> +		pr_warn("wb->bdi->wb.state=%lx\n", wb->bdi->wb.state);
> +		if (!wb->congested)
> +			pr_warn("wb->congested == NULL\n");
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +		else if (!wb->congested->__bdi)
> +			pr_warn("wb->congested->__bdi == NULL\n");
> +		else {
> +			pr_warn("(wb->congested->__bdi == wb->bdi)=%u\n",
> +				wb->congested->__bdi == wb->bdi);
> +			pr_warn("list_empty(&wb->congested->__bdi->bdi_list)=%u\n",
> +				list_empty(&wb->congested->__bdi->bdi_list));
> +			pr_warn("wb->congested->__bdi->wb.state=%lx\n",
> +				wb->congested->__bdi->wb.state);
> +		}
> +#endif
> +		/* Will halt shortly due to NULL pointer dereference... */
> +	}
> +#endif
> +
>  	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
>  	current->flags |= PF_SWAPWRITE;
>  
> -- 
> 1.8.3.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-05-31 13:42           ` Jan Kara
@ 2018-05-31 16:56             ` Jens Axboe
  2018-06-05 13:45               ` Tetsuo Handa
  2018-06-01  2:30             ` general protection fault in wb_workfn (2) Dave Chinner
  1 sibling, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2018-05-31 16:56 UTC (permalink / raw)
  To: Jan Kara, Tetsuo Handa
  Cc: syzbot, syzkaller-bugs, linux-fsdevel, linux-kernel, viro, tj,
	david, linux-block, Dmitry Vyukov

On 5/31/18 7:42 AM, Jan Kara wrote:
> On Thu 31-05-18 22:19:44, Tetsuo Handa wrote:
>> On 2018/05/31 20:42, Jan Kara wrote:
>>> On Thu 31-05-18 01:00:08, Tetsuo Handa wrote:
>>>> So, we have no idea what is happening...
>>>> Then, what about starting from temporary debug printk() patch shown below?
>>>>
>>>> >From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
>>>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>>>> Date: Wed, 30 May 2018 09:57:10 +0900
>>>> Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
>>>>  bdi_unregister() race bug.
>>>>
>>>> syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
>>>> limitations that syzbot cannot find reproducer for this bug (frequency is
>>>> once or twice per a day) nor we can't capture vmcore in the environment
>>>> which syzbot is using, for now we need to rely on printk() debugging.
>>>>
>>>> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
>>>>
>>>> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>>>
>>> Hum a bit ugly solution but if others are fine with this, I can live with
>>> it for a while as well. Or would it be possible for syzkaller to just test
>>> some git tree where this patch is included? Then we would not even have to
>>> have the extra config option...
>>
>> If syzbot can reproduce this bug that way. While it is possible to add/remove
>> git trees syzbot tests, frequently adding/removing trees is bothering.
>>
>> syzbot can enable extra config option. Maybe the config name should be
>> something like CONFIG_DEBUG_FOR_SYZBOT rather than individual topic.
>>
>> I think that syzbot is using many VM instances. I don't know how many
>> instances will be needed for reproducing this bug within reasonable period.
>> More git trees syzbot tests, (I assume that) longer period will be needed
>> for reproducing this bug. The most reliable way is to use the shared part
>> of all trees (i.e. linux.git).
> 
> I understand this, I'd be just a bit reluctant to merge temporary debug
> patches like this to Linus' tree only to revert them later just because
> syzkaller... What do others think?

I guess I don't understand why having it in Linus's tree would make
a difference to syzkaller?

If there is a compelling reason why that absolutely has to be done,
I don't think it should be accompanied by a special Kconfig option
for it. It should just be on unconditionally, with the intent to
remove it before release.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-05-31 13:42           ` Jan Kara
  2018-05-31 16:56             ` Jens Axboe
@ 2018-06-01  2:30             ` Dave Chinner
  1 sibling, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2018-06-01  2:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tetsuo Handa, syzbot, syzkaller-bugs, linux-fsdevel,
	linux-kernel, viro, axboe, tj, linux-block, Dmitry Vyukov

On Thu, May 31, 2018 at 03:42:12PM +0200, Jan Kara wrote:
> On Thu 31-05-18 22:19:44, Tetsuo Handa wrote:
> > On 2018/05/31 20:42, Jan Kara wrote:
> > > On Thu 31-05-18 01:00:08, Tetsuo Handa wrote:
> > >> So, we have no idea what is happening...
> > >> Then, what about starting from temporary debug printk() patch shown below?
> > >>
> > >> >From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
> > >> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > >> Date: Wed, 30 May 2018 09:57:10 +0900
> > >> Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
> > >>  bdi_unregister() race bug.
> > >>
> > >> syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
> > >> limitations that syzbot cannot find reproducer for this bug (frequency is
> > >> once or twice per a day) nor we can't capture vmcore in the environment
> > >> which syzbot is using, for now we need to rely on printk() debugging.
> > >>
> > >> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
> > >>
> > >> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > 
> > > Hum a bit ugly solution but if others are fine with this, I can live with
> > > it for a while as well. Or would it be possible for syzkaller to just test
> > > some git tree where this patch is included? Then we would not even have to
> > > have the extra config option...
> > 
> > If syzbot can reproduce this bug that way. While it is possible to add/remove
> > git trees syzbot tests, frequently adding/removing trees is bothering.
> > 
> > syzbot can enable extra config option. Maybe the config name should be
> > something like CONFIG_DEBUG_FOR_SYZBOT rather than individual topic.
> > 
> > I think that syzbot is using many VM instances. I don't know how many
> > instances will be needed for reproducing this bug within reasonable period.
> > More git trees syzbot tests, (I assume that) longer period will be needed
> > for reproducing this bug. The most reliable way is to use the shared part
> > of all trees (i.e. linux.git).
> 
> I understand this, I'd be just a bit reluctant to merge temporary debug
> patches like this to Linus' tree only to revert them later just because
> syzkaller... What do others think?

Don't commit temporary debug crap to the mainline kernel. Enable
syzkaller to run against a user supplied git tree specification if
it needs special debug to track down a problem.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-05-31 16:56             ` Jens Axboe
@ 2018-06-05 13:45               ` Tetsuo Handa
  2018-06-07 18:46                 ` Dmitry Vyukov
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2018-06-05 13:45 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel,
	linux-kernel, viro, tj, david, linux-block, Linus Torvalds

Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...

On 2018/06/01 1:56, Jens Axboe wrote:
> On 5/31/18 7:42 AM, Jan Kara wrote:
>> On Thu 31-05-18 22:19:44, Tetsuo Handa wrote:
>>> On 2018/05/31 20:42, Jan Kara wrote:
>>>> On Thu 31-05-18 01:00:08, Tetsuo Handa wrote:
>>>>> So, we have no idea what is happening...
>>>>> Then, what about starting from temporary debug printk() patch shown below?
>>>>>
>>>>> >From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
>>>>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>>>>> Date: Wed, 30 May 2018 09:57:10 +0900
>>>>> Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
>>>>>  bdi_unregister() race bug.
>>>>>
>>>>> syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
>>>>> limitations that syzbot cannot find reproducer for this bug (frequency is
>>>>> once or twice per a day) nor we can't capture vmcore in the environment
>>>>> which syzbot is using, for now we need to rely on printk() debugging.
>>>>>
>>>>> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
>>>>>
>>>>> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>>>>
>>>> Hum a bit ugly solution but if others are fine with this, I can live with
>>>> it for a while as well. Or would it be possible for syzkaller to just test
>>>> some git tree where this patch is included? Then we would not even have to
>>>> have the extra config option...
>>>
>>> If syzbot can reproduce this bug that way. While it is possible to add/remove
>>> git trees syzbot tests, frequently adding/removing trees is bothering.
>>>
>>> syzbot can enable extra config option. Maybe the config name should be
>>> something like CONFIG_DEBUG_FOR_SYZBOT rather than individual topic.
>>>
>>> I think that syzbot is using many VM instances. I don't know how many
>>> instances will be needed for reproducing this bug within reasonable period.
>>> More git trees syzbot tests, (I assume that) longer period will be needed
>>> for reproducing this bug. The most reliable way is to use the shared part
>>> of all trees (i.e. linux.git).
>>
>> I understand this, I'd be just a bit reluctant to merge temporary debug
>> patches like this to Linus' tree only to revert them later just because
>> syzkaller... What do others think?
> 
> I guess I don't understand why having it in Linus's tree would make
> a difference to syzkaller?
> 
> If there is a compelling reason why that absolutely has to be done,
> I don't think it should be accompanied by a special Kconfig option
> for it. It should just be on unconditionally, with the intent to
> remove it before release.
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-06-05 13:45               ` Tetsuo Handa
@ 2018-06-07 18:46                 ` Dmitry Vyukov
  2018-06-08  2:31                   ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Dmitry Vyukov @ 2018-06-07 18:46 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel,
	LKML, Al Viro, Tejun Heo, Dave Chinner, linux-block,
	Linus Torvalds

On Tue, Jun 5, 2018 at 3:45 PM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
> against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...

Hi Tetsuo,

Most of the reasons for not doing it still stand. A syzkaller instance
will produce not just this bug, it will produce hundreds of different
bugs. Then the question is: what to do with these bugs? Report all to
mailing lists?
I think the solution here is just to run syzkaller instance locally.
It's just a program anybody can run it on any kernel with any custom
patches. Moreover for local instance it's also possible to limit set
of tested syscalls to increase probability of hitting this bug and at
the same time filter out most of other bugs.

Do we have any idea about the guilty subsystem? You mentioned
bdi_unregister, why? What would be the set of syscalls to concentrate
on?
I will do a custom run when I get around to it, if nobody else beats me to it.


> On 2018/06/01 1:56, Jens Axboe wrote:
>> On 5/31/18 7:42 AM, Jan Kara wrote:
>>> On Thu 31-05-18 22:19:44, Tetsuo Handa wrote:
>>>> On 2018/05/31 20:42, Jan Kara wrote:
>>>>> On Thu 31-05-18 01:00:08, Tetsuo Handa wrote:
>>>>>> So, we have no idea what is happening...
>>>>>> Then, what about starting from temporary debug printk() patch shown below?
>>>>>>
>>>>>> >From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
>>>>>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>>>>>> Date: Wed, 30 May 2018 09:57:10 +0900
>>>>>> Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
>>>>>>  bdi_unregister() race bug.
>>>>>>
>>>>>> syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
>>>>>> limitations that syzbot cannot find reproducer for this bug (frequency is
>>>>>> once or twice per a day) nor we can't capture vmcore in the environment
>>>>>> which syzbot is using, for now we need to rely on printk() debugging.
>>>>>>
>>>>>> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
>>>>>>
>>>>>> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>>>>>
>>>>> Hum a bit ugly solution but if others are fine with this, I can live with
>>>>> it for a while as well. Or would it be possible for syzkaller to just test
>>>>> some git tree where this patch is included? Then we would not even have to
>>>>> have the extra config option...
>>>>
>>>> If syzbot can reproduce this bug that way. While it is possible to add/remove
>>>> git trees syzbot tests, frequently adding/removing trees is bothering.
>>>>
>>>> syzbot can enable extra config option. Maybe the config name should be
>>>> something like CONFIG_DEBUG_FOR_SYZBOT rather than individual topic.
>>>>
>>>> I think that syzbot is using many VM instances. I don't know how many
>>>> instances will be needed for reproducing this bug within reasonable period.
>>>> More git trees syzbot tests, (I assume that) longer period will be needed
>>>> for reproducing this bug. The most reliable way is to use the shared part
>>>> of all trees (i.e. linux.git).
>>>
>>> I understand this, I'd be just a bit reluctant to merge temporary debug
>>> patches like this to Linus' tree only to revert them later just because
>>> syzkaller... What do others think?
>>
>> I guess I don't understand why having it in Linus's tree would make
>> a difference to syzkaller?
>>
>> If there is a compelling reason why that absolutely has to be done,
>> I don't think it should be accompanied by a special Kconfig option
>> for it. It should just be on unconditionally, with the intent to
>> remove it before release.
>>
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-06-07 18:46                 ` Dmitry Vyukov
@ 2018-06-08  2:31                   ` Tetsuo Handa
  2018-06-08 14:45                     ` Dmitry Vyukov
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2018-06-08  2:31 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel,
	LKML, Al Viro, Tejun Heo, Dave Chinner, linux-block,
	Linus Torvalds

Dmitry Vyukov wrote:
> On Tue, Jun 5, 2018 at 3:45 PM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
> > against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...
> 
> Hi Tetsuo,
> 
> Most of the reasons for not doing it still stand. A syzkaller instance
> will produce not just this bug, it will produce hundreds of different
> bugs. Then the question is: what to do with these bugs? Report all to
> mailing lists?

Is it possible to add linux-next.git tree as a target for fuzzing? If yes,
we can try debug patches easily, in addition to find bugs earlier than now.

> I think the solution here is just to run syzkaller instance locally.
> It's just a program anybody can run it on any kernel with any custom
> patches. Moreover for local instance it's also possible to limit set
> of tested syscalls to increase probability of hitting this bug and at
> the same time filter out most of other bugs.

If this bug is reproducible with VM resources individual developer can afford...

Since my Linux development environment is VMware guests on a Windows PC, I can't
run VM instance which needs KVM acceleration. Also, due to security policy, I can't
utilize external VM resources available on the Internet, as well as I can't use ssh
and git protocols. Speak of this bug, even with a lot of VM instances, syzbot can
reproduce this bug only once or twice per a day. Thus, the question for me boils
down to, whether I can reproduce this bug using one VMware guest instance with 4GB
of memory. Effectively, I don't have access to environments for running syzkaller
instance...

> 
> Do we have any idea about the guilty subsystem? You mentioned
> bdi_unregister, why? What would be the set of syscalls to concentrate
> on?
> I will do a custom run when I get around to it, if nobody else beats me to it.

Because bdi_unregister() does "bdi->dev = NULL;" which wb_workfn() is hitting
NULL pointer dereference.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-06-08  2:31                   ` Tetsuo Handa
@ 2018-06-08 14:45                     ` Dmitry Vyukov
  2018-06-08 15:16                       ` Dmitry Vyukov
  0 siblings, 1 reply; 43+ messages in thread
From: Dmitry Vyukov @ 2018-06-08 14:45 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel,
	LKML, Al Viro, Tejun Heo, Dave Chinner, linux-block,
	Linus Torvalds

On Fri, Jun 8, 2018 at 4:31 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Dmitry Vyukov wrote:
>> On Tue, Jun 5, 2018 at 3:45 PM, Tetsuo Handa
>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> > Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
>> > against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...
>>
>> Hi Tetsuo,
>>
>> Most of the reasons for not doing it still stand. A syzkaller instance
>> will produce not just this bug, it will produce hundreds of different
>> bugs. Then the question is: what to do with these bugs? Report all to
>> mailing lists?
>
> Is it possible to add linux-next.git tree as a target for fuzzing? If yes,
> we can try debug patches easily, in addition to find bugs earlier than now.

syzbot tested linux-next and mmotm initially, but they were removed at
the request of kernel developers. See:
https://groups.google.com/d/msg/syzkaller/0H0LHW_ayR8/dsK5qGB_AQAJ
and:
https://groups.google.com/d/msg/syzkaller-bugs/FeAgni6Atlk/U0JGoR0AAwAJ
Indeed, linux-next produces around 50 assorted one-off unexplainable
bug reports.


>> I think the solution here is just to run syzkaller instance locally.
>> It's just a program anybody can run it on any kernel with any custom
>> patches. Moreover for local instance it's also possible to limit set
>> of tested syscalls to increase probability of hitting this bug and at
>> the same time filter out most of other bugs.
>
> If this bug is reproducible with VM resources individual developer can afford...
>
> Since my Linux development environment is VMware guests on a Windows PC, I can't
> run VM instance which needs KVM acceleration. Also, due to security policy, I can't
> utilize external VM resources available on the Internet, as well as I can't use ssh
> and git protocols. Speak of this bug, even with a lot of VM instances, syzbot can
> reproduce this bug only once or twice per a day. Thus, the question for me boils
> down to, whether I can reproduce this bug using one VMware guest instance with 4GB
> of memory. Effectively, I don't have access to environments for running syzkaller
> instance...

Well, I don't know what to say, it does require some resources.

>> Do we have any idea about the guilty subsystem? You mentioned
>> bdi_unregister, why? What would be the set of syscalls to concentrate
>> on?
>> I will do a custom run when I get around to it, if nobody else beats me to it.
>
> Because bdi_unregister() does "bdi->dev = NULL;" which wb_workfn() is hitting
> NULL pointer dereference.

Right, wb_workfn is not a generic function, it's fs-specific function.

Trying to reproduce this locally now.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-06-08 14:45                     ` Dmitry Vyukov
@ 2018-06-08 15:16                       ` Dmitry Vyukov
  2018-06-08 16:53                         ` Dmitry Vyukov
  0 siblings, 1 reply; 43+ messages in thread
From: Dmitry Vyukov @ 2018-06-08 15:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel,
	LKML, Al Viro, Tejun Heo, Dave Chinner, linux-block,
	Linus Torvalds

On Fri, Jun 8, 2018 at 4:45 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Fri, Jun 8, 2018 at 4:31 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> Dmitry Vyukov wrote:
>>> On Tue, Jun 5, 2018 at 3:45 PM, Tetsuo Handa
>>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>> > Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
>>> > against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...
>>>
>>> Hi Tetsuo,
>>>
>>> Most of the reasons for not doing it still stand. A syzkaller instance
>>> will produce not just this bug, it will produce hundreds of different
>>> bugs. Then the question is: what to do with these bugs? Report all to
>>> mailing lists?
>>
>> Is it possible to add linux-next.git tree as a target for fuzzing? If yes,
>> we can try debug patches easily, in addition to find bugs earlier than now.
>
> syzbot tested linux-next and mmotm initially, but they were removed at
> the request of kernel developers. See:
> https://groups.google.com/d/msg/syzkaller/0H0LHW_ayR8/dsK5qGB_AQAJ
> and:
> https://groups.google.com/d/msg/syzkaller-bugs/FeAgni6Atlk/U0JGoR0AAwAJ
> Indeed, linux-next produces around 50 assorted one-off unexplainable
> bug reports.
>
>
>>> I think the solution here is just to run syzkaller instance locally.
>>> It's just a program anybody can run it on any kernel with any custom
>>> patches. Moreover for local instance it's also possible to limit set
>>> of tested syscalls to increase probability of hitting this bug and at
>>> the same time filter out most of other bugs.
>>
>> If this bug is reproducible with VM resources individual developer can afford...
>>
>> Since my Linux development environment is VMware guests on a Windows PC, I can't
>> run VM instance which needs KVM acceleration. Also, due to security policy, I can't
>> utilize external VM resources available on the Internet, as well as I can't use ssh
>> and git protocols. Speak of this bug, even with a lot of VM instances, syzbot can
>> reproduce this bug only once or twice per a day. Thus, the question for me boils
>> down to, whether I can reproduce this bug using one VMware guest instance with 4GB
>> of memory. Effectively, I don't have access to environments for running syzkaller
>> instance...
>
> Well, I don't know what to say, it does require some resources.
>
>>> Do we have any idea about the guilty subsystem? You mentioned
>>> bdi_unregister, why? What would be the set of syscalls to concentrate
>>> on?
>>> I will do a custom run when I get around to it, if nobody else beats me to it.
>>
>> Because bdi_unregister() does "bdi->dev = NULL;" which wb_workfn() is hitting
>> NULL pointer dereference.
>
> Right, wb_workfn is not a generic function, it's fs-specific function.
>
> Trying to reproduce this locally now.


No luck so far.

Trying to look from a different angle: is it possible that bdi->dev is
not set yet, rather then already reset?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-06-08 15:16                       ` Dmitry Vyukov
@ 2018-06-08 16:53                         ` Dmitry Vyukov
  2018-06-08 17:14                           ` Dmitry Vyukov
  0 siblings, 1 reply; 43+ messages in thread
From: Dmitry Vyukov @ 2018-06-08 16:53 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel,
	LKML, Al Viro, Tejun Heo, Dave Chinner, linux-block,
	Linus Torvalds

On Fri, Jun 8, 2018 at 5:16 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> On Fri, Jun 8, 2018 at 4:31 AM, Tetsuo Handa
>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>> Dmitry Vyukov wrote:
>>>> On Tue, Jun 5, 2018 at 3:45 PM, Tetsuo Handa
>>>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>>> > Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
>>>> > against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...
>>>>
>>>> Hi Tetsuo,
>>>>
>>>> Most of the reasons for not doing it still stand. A syzkaller instance
>>>> will produce not just this bug, it will produce hundreds of different
>>>> bugs. Then the question is: what to do with these bugs? Report all to
>>>> mailing lists?
>>>
>>> Is it possible to add linux-next.git tree as a target for fuzzing? If yes,
>>> we can try debug patches easily, in addition to find bugs earlier than now.
>>
>> syzbot tested linux-next and mmotm initially, but they were removed at
>> the request of kernel developers. See:
>> https://groups.google.com/d/msg/syzkaller/0H0LHW_ayR8/dsK5qGB_AQAJ
>> and:
>> https://groups.google.com/d/msg/syzkaller-bugs/FeAgni6Atlk/U0JGoR0AAwAJ
>> Indeed, linux-next produces around 50 assorted one-off unexplainable
>> bug reports.
>>
>>
>>>> I think the solution here is just to run syzkaller instance locally.
>>>> It's just a program anybody can run it on any kernel with any custom
>>>> patches. Moreover for local instance it's also possible to limit set
>>>> of tested syscalls to increase probability of hitting this bug and at
>>>> the same time filter out most of other bugs.
>>>
>>> If this bug is reproducible with VM resources individual developer can afford...
>>>
>>> Since my Linux development environment is VMware guests on a Windows PC, I can't
>>> run VM instance which needs KVM acceleration. Also, due to security policy, I can't
>>> utilize external VM resources available on the Internet, as well as I can't use ssh
>>> and git protocols. Speak of this bug, even with a lot of VM instances, syzbot can
>>> reproduce this bug only once or twice per a day. Thus, the question for me boils
>>> down to, whether I can reproduce this bug using one VMware guest instance with 4GB
>>> of memory. Effectively, I don't have access to environments for running syzkaller
>>> instance...
>>
>> Well, I don't know what to say, it does require some resources.
>>
>>>> Do we have any idea about the guilty subsystem? You mentioned
>>>> bdi_unregister, why? What would be the set of syscalls to concentrate
>>>> on?
>>>> I will do a custom run when I get around to it, if nobody else beats me to it.
>>>
>>> Because bdi_unregister() does "bdi->dev = NULL;" which wb_workfn() is hitting
>>> NULL pointer dereference.
>>
>> Right, wb_workfn is not a generic function, it's fs-specific function.
>>
>> Trying to reproduce this locally now.
>
>
> No luck so far.
>
> Trying to look from a different angle: is it possible that bdi->dev is
> not set yet, rather then already reset?


I was able to reproduce this once locally running syz-crush utility
replaying one of the crash logs. Now running with Tetsuo's patch.

I can say we hunting a very subtle race condition with short
inconsistency window, perhaps few instructions.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-06-08 16:53                         ` Dmitry Vyukov
@ 2018-06-08 17:14                           ` Dmitry Vyukov
  2018-06-09  5:30                             ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Dmitry Vyukov @ 2018-06-08 17:14 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel,
	LKML, Al Viro, Tejun Heo, Dave Chinner, linux-block,
	Linus Torvalds

On Fri, Jun 8, 2018 at 6:53 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Fri, Jun 8, 2018 at 5:16 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>> On Fri, Jun 8, 2018 at 4:31 AM, Tetsuo Handa
>>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>>> Dmitry Vyukov wrote:
>>>>> On Tue, Jun 5, 2018 at 3:45 PM, Tetsuo Handa
>>>>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>>>> > Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
>>>>> > against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...
>>>>>
>>>>> Hi Tetsuo,
>>>>>
>>>>> Most of the reasons for not doing it still stand. A syzkaller instance
>>>>> will produce not just this bug, it will produce hundreds of different
>>>>> bugs. Then the question is: what to do with these bugs? Report all to
>>>>> mailing lists?
>>>>
>>>> Is it possible to add linux-next.git tree as a target for fuzzing? If yes,
>>>> we can try debug patches easily, in addition to find bugs earlier than now.
>>>
>>> syzbot tested linux-next and mmotm initially, but they were removed at
>>> the request of kernel developers. See:
>>> https://groups.google.com/d/msg/syzkaller/0H0LHW_ayR8/dsK5qGB_AQAJ
>>> and:
>>> https://groups.google.com/d/msg/syzkaller-bugs/FeAgni6Atlk/U0JGoR0AAwAJ
>>> Indeed, linux-next produces around 50 assorted one-off unexplainable
>>> bug reports.
>>>
>>>
>>>>> I think the solution here is just to run syzkaller instance locally.
>>>>> It's just a program anybody can run it on any kernel with any custom
>>>>> patches. Moreover for local instance it's also possible to limit set
>>>>> of tested syscalls to increase probability of hitting this bug and at
>>>>> the same time filter out most of other bugs.
>>>>
>>>> If this bug is reproducible with VM resources individual developer can afford...
>>>>
>>>> Since my Linux development environment is VMware guests on a Windows PC, I can't
>>>> run VM instance which needs KVM acceleration. Also, due to security policy, I can't
>>>> utilize external VM resources available on the Internet, as well as I can't use ssh
>>>> and git protocols. Speak of this bug, even with a lot of VM instances, syzbot can
>>>> reproduce this bug only once or twice per a day. Thus, the question for me boils
>>>> down to, whether I can reproduce this bug using one VMware guest instance with 4GB
>>>> of memory. Effectively, I don't have access to environments for running syzkaller
>>>> instance...
>>>
>>> Well, I don't know what to say, it does require some resources.
>>>
>>>>> Do we have any idea about the guilty subsystem? You mentioned
>>>>> bdi_unregister, why? What would be the set of syscalls to concentrate
>>>>> on?
>>>>> I will do a custom run when I get around to it, if nobody else beats me to it.
>>>>
>>>> Because bdi_unregister() does "bdi->dev = NULL;" which wb_workfn() is hitting
>>>> NULL pointer dereference.
>>>
>>> Right, wb_workfn is not a generic function, it's fs-specific function.
>>>
>>> Trying to reproduce this locally now.
>>
>>
>> No luck so far.
>>
>> Trying to look from a different angle: is it possible that bdi->dev is
>> not set yet, rather then already reset?
>
>
> I was able to reproduce this once locally running syz-crush utility
> replaying one of the crash logs. Now running with Tetsuo's patch.
>
> I can say we hunting a very subtle race condition with short
> inconsistency window, perhaps few instructions.


Here we go:

[ 2853.033175] WARNING: wb_workfn: device is NULL
[ 2853.034709] wb->state=2
[ 2853.035486] (wb == &wb->bdi->wb)=0
[ 2853.036489] list_empty(&wb->work_list)=1
[ 2853.037603] list_empty(&wb->bdi->bdi_list)=0
[ 2853.038843] wb->bdi->wb.state=0
[ 2853.039819] (wb->congested->__bdi == wb->bdi)=1
[ 2853.041062] list_empty(&wb->congested->__bdi->bdi_list)=0
[ 2853.042609] wb->congested->__bdi->wb.state=0
[ 2853.043793] kasan: CONFIG_KASAN_INLINE enabled
[ 2853.045315] kasan: GPF could be caused by NULL-ptr deref or user
memory access
[ 2853.047376] general protection fault: 0000 [#1] SMP KASAN
[ 2853.048980] CPU: 1 PID: 13971 Comm: kworker/u12:8 Not tainted 4.17.0+ #21
[ 2853.050762] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[ 2853.053034] Workqueue: writeback wb_workfn
[ 2853.054193] RIP: 0010:wb_workfn+0x187/0xab0
[ 2853.055360] Code: 85 70 fd ff ff 48 83 e8 10 48 89 85 60 fd ff ff
e8 5e 38 ab ff 48 8d 7b 50 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
c1 ea 03 <80> 3c 02 00 0f 85 05 08 00 00 4c 8b 63 50 4d 85 e4 0f 84 b5
05 00
[ 2853.060692] RSP: 0018:ffff8800600ef480 EFLAGS: 00010206
[ 2853.062210] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2853.064215] RDX: 000000000000000a RSI: ffffffff81cf0312 RDI: 0000000000000050
[ 2853.066198] RBP: ffff8800600ef750 R08: ffff880061e30400 R09: ffffed000d8ccfc0
[ 2853.068037] R10: ffffed000d8ccfc0 R11: ffff88006c667e07 R12: 1ffff1000c01dead
[ 2853.069970] R13: ffff8800600ef728 R14: ffff8800676bd180 R15: dffffc0000000000
[ 2853.071932] FS:  0000000000000000(0000) GS:ffff88006c640000(0000)
knlGS:0000000000000000
[ 2853.074080] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2853.075699] CR2: 00007ffc8478c000 CR3: 0000000008c6a006 CR4: 00000000001626e0
[ 2853.077633] Call Trace:
[ 2853.078341]  ? inode_wait_for_writeback+0x40/0x40
[ 2853.079642]  ? graph_lock+0x170/0x170
[ 2853.080710]  ? lock_downgrade+0x8e0/0x8e0
[ 2853.081889]  ? find_held_lock+0x36/0x1c0
[ 2853.083047]  ? graph_lock+0x170/0x170
[ 2853.084105]  ? lock_acquire+0x1dc/0x520
[ 2853.085216]  ? process_one_work+0xb8c/0x1b70
[ 2853.086425]  ? kasan_check_read+0x11/0x20
[ 2853.087608]  ? __lock_is_held+0xb5/0x140
[ 2853.088720]  process_one_work+0xc64/0x1b70
[ 2853.089875]  ? finish_task_switch+0x182/0x840
[ 2853.091085]  ? pwq_dec_nr_in_flight+0x490/0x490
[ 2853.092358]  ? __schedule+0x809/0x1e30
[ 2853.093475]  ? retint_kernel+0x10/0x10
[ 2853.094561]  ? retint_kernel+0x10/0x10
[ 2853.095610]  ? graph_lock+0x170/0x170
[ 2853.096623]  ? trace_hardirqs_on_caller+0x421/0x5c0
[ 2853.097981]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 2853.099298]  ? find_held_lock+0x36/0x1c0
[ 2853.100394]  ? lock_acquire+0x1dc/0x520
[ 2853.101472]  ? worker_thread+0x3d4/0x13a0
[ 2853.102558]  ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[ 2853.104060]  ? move_linked_works+0x2f6/0x470
[ 2853.105204]  ? trace_event_raw_event_workqueue_execute_start+0x290/0x290
[ 2853.107013]  ? do_raw_spin_trylock+0x1b0/0x1b0
[ 2853.108243]  worker_thread+0x9e5/0x13a0
[ 2853.109304]  ? process_one_work+0x1b70/0x1b70
[ 2853.110488]  ? graph_lock+0x170/0x170
[ 2853.111514]  ? find_held_lock+0x36/0x1c0
[ 2853.112610]  ? find_held_lock+0x36/0x1c0
[ 2853.113674]  ? __schedule+0x1e30/0x1e30
[ 2853.114714]  ? do_raw_spin_unlock+0x1f9/0x2e0
[ 2853.115885]  ? do_raw_spin_trylock+0x1b0/0x1b0
[ 2853.117060]  ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[ 2853.118492]  ? __kthread_parkme+0x111/0x1d0
[ 2853.119616]  ? parse_args.cold.15+0x1b3/0x1b3
[ 2853.120804]  ? trace_hardirqs_on_caller+0x421/0x5c0
[ 2853.122071]  ? trace_hardirqs_on+0xd/0x10
[ 2853.123117]  kthread+0x345/0x410
[ 2853.123999]  ? process_one_work+0x1b70/0x1b70
[ 2853.125174]  ? kthread_bind+0x40/0x40
[ 2853.126201]  ret_from_fork+0x3a/0x50
[ 2853.127199] Modules linked in:
[ 2853.128022] Dumping ftrace buffer:
[ 2853.128901]    (ftrace buffer empty)
[ 2853.129986] ---[ end trace 3ba28e076cb32fda ]---
[ 2853.131269] RIP: 0010:wb_workfn+0x187/0xab0
[ 2853.132441] Code: 85 70 fd ff ff 48 83 e8 10 48 89 85 60 fd ff ff
e8 5e 38 ab ff 48 8d 7b 50 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
c1 ea 03 <80> 3c 02 00 0f 85 05 08 00 00 4c 8b 63 50 4d 85 e4 0f 84 b5
05 00
[ 2853.137449] RSP: 0018:ffff8800600ef480 EFLAGS: 00010206
[ 2853.138618] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2853.140176] RDX: 000000000000000a RSI: ffffffff81cf0312 RDI: 0000000000000050
[ 2853.141722] RBP: ffff8800600ef750 R08: ffff880061e30400 R09: ffffed000d8ccfc0
[ 2853.143300] R10: ffffed000d8ccfc0 R11: ffff88006c667e07 R12: 1ffff1000c01dead
[ 2853.144841] R13: ffff8800600ef728 R14: ffff8800676bd180 R15: dffffc0000000000
[ 2853.146391] FS:  0000000000000000(0000) GS:ffff88006c640000(0000)
knlGS:0000000000000000
[ 2853.148141] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2853.149406] CR2: 00007ffc8478c000 CR3: 0000000008c6a006 CR4: 00000000001626e0
[ 2853.150968] Kernel panic - not syncing: Fatal exception
[ 2853.152419] Dumping ftrace buffer:
[ 2853.153121]    (ftrace buffer empty)
[ 2853.153786] Kernel Offset: disabled
[ 2853.154442] Rebooting in 86400 seconds..

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: general protection fault in wb_workfn (2)
  2018-06-08 17:14                           ` Dmitry Vyukov
@ 2018-06-09  5:30                             ` Tetsuo Handa
  2018-06-09 14:00                               ` [PATCH] bdi: Fix another oops in wb_workfn() Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2018-06-09  5:30 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel,
	LKML, Al Viro, Tejun Heo, Dave Chinner, linux-block,
	Linus Torvalds

Dmitry Vyukov wrote:
> Here we go:

Great. Thank you.

> 
> [ 2853.033175] WARNING: wb_workfn: device is NULL
> [ 2853.034709] wb->state=2
> 

It is surprising that wb->state == WB_shutting_down .

WB_shutting_down is set by only wb_shutdown() and is always cleared
before leaving wb_shutdown(). This means that someone was calling
wb_shutdown() on this wb object. And bdi->dev == NULL means that
bdi_unregister() already did bdi->dev = NULL while someone was still
inside wb_shutdown().

Since we call wb_shutdown() from bdi_unregister() for each wb object
on this bdi object, this should not happen. But since "INFO: task hung
in wb_shutdown (2)" found that it is possible that wb_shutdown() is
concurrently called on the same wb object, there might be something
complicated concurrency.

Well, is it really true that "we call wb_shutdown() from bdi_unregister()
for each wb object on this bdi object"? It seems it is not always true...

While cgwb_bdi_unregister() from bdi_unregister() calls wb_shutdown() on
each wb object reachable from bdi->wb_list, wb_shutdown() firstly calls
list_del_rcu(&wb->bdi_node) (which was added by
list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list) from cgwb_create()) and
then starts waiting for that wb object by calling
mod_delayed_work()/flush_delayed_work() and then clears WB_shutting_down.

Then, it is possible that cgwb_bdi_unregister() from calls wb_shutdown()
fails to find a wb object which already passed list_del_rcu() from
wb_shutdown(), and cgwb_bdi_unregister() can return without waiting for
somebody who is waiting inside wb_shutdown(). Hence, allows doing
bdi->dev = NULL before a wb object which somebody is waiting inside
wb_shutdown() completes wb_workfn(), and NULL pointer dereference...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-09  5:30                             ` Tetsuo Handa
@ 2018-06-09 14:00                               ` Tetsuo Handa
  2018-06-11  9:12                                 ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2018-06-09 14:00 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel,
	LKML, Al Viro, Tejun Heo, Dave Chinner, linux-block,
	Linus Torvalds

>From 014c4149f2e24cd26b278b32d5dfda056eecf093 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 9 Jun 2018 22:47:52 +0900
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

Since cgwb_bdi_unregister() from bdi_unregister() cannot call wb_shutdown()
on wb objects which have already passed list_del_rcu() in wb_shutdown(),
cgwb_bdi_unregister() from bdi_unregister() can return and set wb->bdi->dev
to NULL before such wb objects enter final round of wb_workfn() via
mod_delayed_work()/flush_delayed_work().

Since WB_registered is already cleared by wb_shutdown(), only wb_shutdown()
can schedule for final round of wb_workfn(). Since concurrent calls to
wb_shutdown() on the same wb object is safe because of WB_shutting_down
state, I think that wb_shutdown() can safely keep a wb object in the
bdi->wb_list until that wb object leaves final round of wb_workfn().
Thus, make wb_shutdown() call list_del_rcu() after flush_delayed_work().

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
---
 mm/backing-dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..bef4b25 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -370,7 +370,6 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
 
-	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -379,6 +378,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Make sure bit gets cleared after shutdown is finished. Matches with
 	 * the barrier provided by test_and_clear_bit() above.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-09 14:00                               ` [PATCH] bdi: Fix another oops in wb_workfn() Tetsuo Handa
@ 2018-06-11  9:12                                 ` Jan Kara
  2018-06-11 16:01                                   ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2018-06-11  9:12 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Dmitry Vyukov, Jens Axboe, Jan Kara, syzbot, syzkaller-bugs,
	linux-fsdevel, LKML, Al Viro, Tejun Heo, Dave Chinner,
	linux-block, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 2782 bytes --]

On Sat 09-06-18 23:00:05, Tetsuo Handa wrote:
> From 014c4149f2e24cd26b278b32d5dfda056eecf093 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sat, 9 Jun 2018 22:47:52 +0900
> Subject: [PATCH] bdi: Fix another oops in wb_workfn()
> 
> syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
> wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
> WB_shutting_down after wb->bdi->dev became NULL. This indicates that
> unregister_bdi() failed to call wb_shutdown() on one of wb objects.
> 
> Since cgwb_bdi_unregister() from bdi_unregister() cannot call wb_shutdown()
> on wb objects which have already passed list_del_rcu() in wb_shutdown(),
> cgwb_bdi_unregister() from bdi_unregister() can return and set wb->bdi->dev
> to NULL before such wb objects enter final round of wb_workfn() via
> mod_delayed_work()/flush_delayed_work().

Thanks a lot for debugging the issue and also thanks a lot to Dmitry for
taking time to reproduce the race by hand with the debug patch! I really
appreciate it!

> Since WB_registered is already cleared by wb_shutdown(), only wb_shutdown()
> can schedule for final round of wb_workfn(). Since concurrent calls to
> wb_shutdown() on the same wb object is safe because of WB_shutting_down
> state, I think that wb_shutdown() can safely keep a wb object in the
> bdi->wb_list until that wb object leaves final round of wb_workfn().
> Thus, make wb_shutdown() call list_del_rcu() after flush_delayed_work().

However this is wrong and so is the patch. The problem is in
cgwb_bdi_unregister() which does cgwb_kill() and thus drops bdi's
reference to wb structures before going through the list of wbs again and
calling wb_shutdown() on each of them. The writeback structures we are
accessing at this point can be already freed in principle like:

CPU1				CPU2
				cgwb_bdi_unregister()
				  cgwb_kill(*slot);

cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
				  wb = list_first_entry(&bdi->wb_list, ...)
				  spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...				
  kfree_rcu(wb, rcu);
				  wb_shutdown(wb); -> oops use-after-free

I'm not 100% sure how to fix this. wb structures can be at various phases of
shutdown (or there may be other external references still existing) when we
enter cgwb_bdi_unregister() so I think adding a way for cgwb_bdi_unregister()
to wait for standard wb shutdown path to finish is the most robust way.
What do you think about attached patch Tejun? So far only compile tested...

Possible problem with it is that now cgwb_bdi_unregister() will wait for
all wb references to be dropped so it adds some implicit dependencies to
bdi shutdown path. 

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-bdi-Fix-another-oops-in-wb_workfn.patch --]
[-- Type: text/x-patch, Size: 4066 bytes --]

>From f5038c6e7a3d1a4a91879187b92ede8c868988ac Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 11 Jun 2018 10:56:04 +0200
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:

CPU1                            CPU2
                                cgwb_bdi_unregister()
                                  cgwb_kill(*slot);

cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
                                  wb = list_first_entry(&bdi->wb_list, ...)
                                  spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...
  kfree_rcu(wb, rcu);
                                  wb_shutdown(wb); -> oops use-after-free

We solve all these issues by making cgwb_bdi_unregister() wait for
shutdown of all wb structures instead of going through them and trying
to actively shut them down.

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Tejun Heo <tj@kernel.org>
Reported-and-analyzed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc834c04a..39fa5f4ddd5c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -370,7 +370,6 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
 
-	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -379,6 +378,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Make sure bit gets cleared after shutdown is finished. Matches with
 	 * the barrier provided by test_and_clear_bit() above.
@@ -541,6 +541,9 @@ static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
 	spin_lock_irq(&cgwb_lock);
 	list_del_rcu(&wb->bdi_node);
 	spin_unlock_irq(&cgwb_lock);
+	/* Last wb of the bdi? Wake up waiters for shutdown of all wbs. */
+	if (list_empty(&wb->bdi->wb_list))
+		wake_up_all(&wb->bdi->wb_waitq);
 }
 
 static int cgwb_create(struct backing_dev_info *bdi,
@@ -710,22 +713,16 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
 	void **slot;
-	struct bdi_writeback *wb;
 
 	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
 	spin_lock_irq(&cgwb_lock);
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
-
-	while (!list_empty(&bdi->wb_list)) {
-		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
-				      bdi_node);
-		spin_unlock_irq(&cgwb_lock);
-		wb_shutdown(wb);
-		spin_lock_irq(&cgwb_lock);
-	}
 	spin_unlock_irq(&cgwb_lock);
+
+	/* Wait for all writeback structures to shutdown */
+	wait_event(bdi->wb_waitq, list_empty(&bdi->wb_list));
 }
 
 /**
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-11  9:12                                 ` Jan Kara
@ 2018-06-11 16:01                                   ` Tejun Heo
  2018-06-11 16:29                                     ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2018-06-11 16:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs,
	linux-fsdevel, LKML, Al Viro, Dave Chinner, linux-block,
	Linus Torvalds

Hello,

On Mon, Jun 11, 2018 at 11:12:48AM +0200, Jan Kara wrote:
> However this is wrong and so is the patch. The problem is in
> cgwb_bdi_unregister() which does cgwb_kill() and thus drops bdi's
> reference to wb structures before going through the list of wbs again and
> calling wb_shutdown() on each of them. The writeback structures we are
> accessing at this point can be already freed in principle like:
> 
> CPU1				CPU2
> 				cgwb_bdi_unregister()
> 				  cgwb_kill(*slot);
> 
> cgwb_release()
>   queue_work(cgwb_release_wq, &wb->release_work);
> cgwb_release_workfn()
> 				  wb = list_first_entry(&bdi->wb_list, ...)
> 				  spin_unlock_irq(&cgwb_lock);
>   wb_shutdown(wb);
>   ...				
>   kfree_rcu(wb, rcu);
> 				  wb_shutdown(wb); -> oops use-after-free
> 
> I'm not 100% sure how to fix this. wb structures can be at various phases of
> shutdown (or there may be other external references still existing) when we
> enter cgwb_bdi_unregister() so I think adding a way for cgwb_bdi_unregister()
> to wait for standard wb shutdown path to finish is the most robust way.
> What do you think about attached patch Tejun? So far only compile tested...
> 
> Possible problem with it is that now cgwb_bdi_unregister() will wait for
> all wb references to be dropped so it adds some implicit dependencies to
> bdi shutdown path. 

Would something like the following work or am I missing the point
entirely?

Thanks.

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..359cacd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -715,14 +715,19 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
 	spin_lock_irq(&cgwb_lock);
-	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
-		cgwb_kill(*slot);
+	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) {
+		struct bdi_writeback *wb = *slot;
+
+		wb_get(wb);
+		cgwb_kill(wb);
+	}
 
 	while (!list_empty(&bdi->wb_list)) {
 		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
 				      bdi_node);
 		spin_unlock_irq(&cgwb_lock);
 		wb_shutdown(wb);
+		wb_put(wb);
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);


-- 
tejun

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-11 16:01                                   ` Tejun Heo
@ 2018-06-11 16:29                                     ` Jan Kara
  2018-06-11 17:20                                       ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2018-06-11 16:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot,
	syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner,
	linux-block, Linus Torvalds

On Mon 11-06-18 09:01:31, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jun 11, 2018 at 11:12:48AM +0200, Jan Kara wrote:
> > However this is wrong and so is the patch. The problem is in
> > cgwb_bdi_unregister() which does cgwb_kill() and thus drops bdi's
> > reference to wb structures before going through the list of wbs again and
> > calling wb_shutdown() on each of them. The writeback structures we are
> > accessing at this point can be already freed in principle like:
> > 
> > CPU1				CPU2
> > 				cgwb_bdi_unregister()
> > 				  cgwb_kill(*slot);
> > 
> > cgwb_release()
> >   queue_work(cgwb_release_wq, &wb->release_work);
> > cgwb_release_workfn()
> > 				  wb = list_first_entry(&bdi->wb_list, ...)
> > 				  spin_unlock_irq(&cgwb_lock);
> >   wb_shutdown(wb);
> >   ...				
> >   kfree_rcu(wb, rcu);
> > 				  wb_shutdown(wb); -> oops use-after-free
> > 
> > I'm not 100% sure how to fix this. wb structures can be at various phases of
> > shutdown (or there may be other external references still existing) when we
> > enter cgwb_bdi_unregister() so I think adding a way for cgwb_bdi_unregister()
> > to wait for standard wb shutdown path to finish is the most robust way.
> > What do you think about attached patch Tejun? So far only compile tested...
> > 
> > Possible problem with it is that now cgwb_bdi_unregister() will wait for
> > all wb references to be dropped so it adds some implicit dependencies to
> > bdi shutdown path. 
> 
> Would something like the following work or am I missing the point
> entirely?

I was pondering the same solution for a while but I think it won't work.
The problem is that e.g. wb_memcg_offline() could have already removed
wb from the radix tree but it is still pending in bdi->wb_list
(wb_shutdown() has not run yet) and so we'd drop reference we didn't get.

								Honza
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 347cc83..359cacd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -715,14 +715,19 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
>  	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
>  
>  	spin_lock_irq(&cgwb_lock);
> -	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
> -		cgwb_kill(*slot);
> +	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) {
> +		struct bdi_writeback *wb = *slot;
> +
> +		wb_get(wb);
> +		cgwb_kill(wb);
> +	}
>  
>  	while (!list_empty(&bdi->wb_list)) {
>  		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
>  				      bdi_node);
>  		spin_unlock_irq(&cgwb_lock);
>  		wb_shutdown(wb);
> +		wb_put(wb);
>  		spin_lock_irq(&cgwb_lock);
>  	}
>  	spin_unlock_irq(&cgwb_lock);
> 
> 
> -- 
> tejun
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-11 16:29                                     ` Jan Kara
@ 2018-06-11 17:20                                       ` Tejun Heo
  2018-06-12 15:57                                         ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2018-06-11 17:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs,
	linux-fsdevel, LKML, Al Viro, Dave Chinner, linux-block,
	Linus Torvalds

Hello,

On Mon, Jun 11, 2018 at 06:29:20PM +0200, Jan Kara wrote:
> > Would something like the following work or am I missing the point
> > entirely?
> 
> I was pondering the same solution for a while but I think it won't work.
> The problem is that e.g. wb_memcg_offline() could have already removed
> wb from the radix tree but it is still pending in bdi->wb_list
> (wb_shutdown() has not run yet) and so we'd drop reference we didn't get.

Yeah, right, so the root cause is that we're walking the wb_list while
holding lock and expecting the object to stay there even after lock is
released.  Hmm... we can use a mutex to synchronize the two
destruction paths.  It's not like they're hot paths anyway.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-11 17:20                                       ` Tejun Heo
@ 2018-06-12 15:57                                         ` Jan Kara
  2018-06-13 10:43                                           ` Tetsuo Handa
  2018-06-13 14:33                                           ` Tejun Heo
  0 siblings, 2 replies; 43+ messages in thread
From: Jan Kara @ 2018-06-12 15:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot,
	syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner,
	linux-block, Linus Torvalds

On Mon 11-06-18 10:20:53, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jun 11, 2018 at 06:29:20PM +0200, Jan Kara wrote:
> > > Would something like the following work or am I missing the point
> > > entirely?
> > 
> > I was pondering the same solution for a while but I think it won't work.
> > The problem is that e.g. wb_memcg_offline() could have already removed
> > wb from the radix tree but it is still pending in bdi->wb_list
> > (wb_shutdown() has not run yet) and so we'd drop reference we didn't get.
> 
> Yeah, right, so the root cause is that we're walking the wb_list while
> holding lock and expecting the object to stay there even after lock is
> released.  Hmm... we can use a mutex to synchronize the two
> destruction paths.  It's not like they're hot paths anyway.

Hmm, do you mean like having a per-bdi or even a global mutex that would
protect whole wb_shutdown()? Yes, that should work and we could get rid of
WB_shutting_down bit as well with that. Just it seems a bit strange to
introduce a mutex only to synchronize these two shutdown paths - usually
locks protect data structures and in this case we have cgwb_lock for
that so it looks like a duplication from a first look.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-12 15:57                                         ` Jan Kara
@ 2018-06-13 10:43                                           ` Tetsuo Handa
  2018-06-13 11:51                                             ` Tetsuo Handa
                                                               ` (2 more replies)
  2018-06-13 14:33                                           ` Tejun Heo
  1 sibling, 3 replies; 43+ messages in thread
From: Tetsuo Handa @ 2018-06-13 10:43 UTC (permalink / raw)
  To: Jan Kara, Tejun Heo
  Cc: Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel,
	LKML, Al Viro, Dave Chinner, linux-block, Linus Torvalds

On 2018/06/13 0:57, Jan Kara wrote:
> On Mon 11-06-18 10:20:53, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Jun 11, 2018 at 06:29:20PM +0200, Jan Kara wrote:
>>>> Would something like the following work or am I missing the point
>>>> entirely?
>>>
>>> I was pondering the same solution for a while but I think it won't work.
>>> The problem is that e.g. wb_memcg_offline() could have already removed
>>> wb from the radix tree but it is still pending in bdi->wb_list
>>> (wb_shutdown() has not run yet) and so we'd drop reference we didn't get.
>>
>> Yeah, right, so the root cause is that we're walking the wb_list while
>> holding lock and expecting the object to stay there even after lock is
>> released.  Hmm... we can use a mutex to synchronize the two
>> destruction paths.  It's not like they're hot paths anyway.
> 
> Hmm, do you mean like having a per-bdi or even a global mutex that would
> protect whole wb_shutdown()? Yes, that should work and we could get rid of
> WB_shutting_down bit as well with that. Just it seems a bit strange to
> introduce a mutex only to synchronize these two shutdown paths - usually
> locks protect data structures and in this case we have cgwb_lock for
> that so it looks like a duplication from a first look.
> 

Can't we utilize RCU grace period (like shown below) ?



If wb_shutdown(wb) by cgwb_release_workfn() was faster than wb_shutdown(wb) by cgwb_bdi_unregister():

  cgwb_bdi_unregister(bdi)                     cgwb_release_workfn(work)

                                                 wb = container_of(work, struct bdi_writeback, release_work);
    spin_lock_irq(&cgwb_lock);
    wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Same wb here */
    rcu_read_lock(); /* Prevent kfree_rcu() from invoking kfree() */
    spin_unlock_irq(&cgwb_lock);
                                                 wb_shutdown(wb);
                                                   spin_lock_bh(&wb->work_lock);
                                                   !test_and_clear_bit(WB_registered, &wb->state) is "false".
                                                   set_bit(WB_shutting_down, &wb->state);
                                                   spin_unlock_bh(&wb->work_lock);
                                                   mod_delayed_work(bdi_wq, &wb->dwork, 0);
                                                   flush_delayed_work(&wb->dwork);
                                                   cgwb_remove_from_bdi_list(wb);
                                                     spin_lock_irq(&cgwb_lock);
                                                     list_del_rcu(&wb->bdi_node);
                                                     spin_unlock_irq(&cgwb_lock);
                                                 wb_exit(wb);
                                                 kfree_rcu(wb, rcu); /* Won't call kfree() because of rcu_read_lock() */
    wb_shutdown(wb);
      spin_lock_bh(&wb->work_lock); /* Safe to access because kfree() cannot be called */
      !test_and_clear_bit(WB_registered, &wb->state) is "true".
      spin_unlock_bh(&wb->work_lock);
      rcu_read_unlock();
                                                   kfree(wb);
      schedule_timeout_uninterruptible(HZ / 10); /* Try to wait in case list_del_rcu() is not yet called */
    spin_lock_irq(&cgwb_lock);
    wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Different wb if list_del_rcu() was already called, same wb otherwise */
    rcu_read_lock(); /* Prevent kfree_rcu() from invoking kfree() if still same wb here */
    spin_unlock_irq(&cgwb_lock);

If wb_shutdown(wb) by cgwb_bdi_unregister() was faster than wb_shutdown(wb) by cgwb_release_workfn():

  cgwb_bdi_unregister(bdi)                     cgwb_release_workfn(work)

                                                 wb = container_of(work, struct bdi_writeback, release_work);
    spin_lock_irq(&cgwb_lock);
    wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Same wb here */
    rcu_read_lock();
    spin_unlock_irq(&cgwb_lock);
    wb_shutdown(wb);
      spin_lock_bh(&wb->work_lock);
      !test_and_clear_bit(WB_registered, &wb->state) is "false".
      set_bit(WB_shutting_down, &wb->state);
      spin_unlock_bh(&wb->work_lock);
      rcu_read_unlock();
      mod_delayed_work(bdi_wq, &wb->dwork, 0);
      flush_delayed_work(&wb->dwork);
      cgwb_remove_from_bdi_list(wb);
        spin_lock_irq(&cgwb_lock);
        list_del_rcu(&wb->bdi_node);
        spin_unlock_irq(&cgwb_lock);
    spin_lock_irq(&cgwb_lock);
    wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Different wb here */
    rcu_read_lock();
    spin_unlock_irq(&cgwb_lock);
                                                 wb_shutdown(wb);
                                                   spin_lock_bh(&wb->work_lock); /* Safe to access because kfree() cannot be called */
                                                   !test_and_clear_bit(WB_registered, &wb->state) is "true".
                                                   spin_unlock_bh(&wb->work_lock);
                                                 wb_exit(wb);
                                                 kfree_rcu(wb, rcu);



 mm/backing-dev.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..6886cea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -353,13 +353,24 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 /*
  * Remove bdi from the global list and shutdown any threads we have running
  */
-static void wb_shutdown(struct bdi_writeback *wb)
+static void wb_shutdown(struct bdi_writeback *wb, const bool in_rcu)
 {
 	/* Make sure nobody queues further work */
 	spin_lock_bh(&wb->work_lock);
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
 		/*
+		 * Try to wait for cgwb_remove_from_bdi_list() without
+		 * using wait_on_bit(), for kfree_rcu(wb, rcu) from
+		 * cgwb_release_workfn() might invoke kfree() before we
+		 * recheck WB_shutting_down bit.
+		 */
+		if (in_rcu) {
+			rcu_read_unlock();
+			schedule_timeout_uninterruptible(HZ / 10);
+			return;
+		}
+		/*
 		 * Wait for wb shutdown to finish if someone else is just
 		 * running wb_shutdown(). Otherwise we could proceed to wb /
 		 * bdi destruction before wb_shutdown() is finished.
@@ -369,8 +380,9 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	}
 	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
+	if (in_rcu)
+		rcu_read_unlock();
 
-	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -379,6 +391,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Make sure bit gets cleared after shutdown is finished. Matches with
 	 * the barrier provided by test_and_clear_bit() above.
@@ -508,7 +521,7 @@ static void cgwb_release_workfn(struct work_struct *work)
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
 
-	wb_shutdown(wb);
+	wb_shutdown(wb, false);
 
 	css_put(wb->memcg_css);
 	css_put(wb->blkcg_css);
@@ -721,8 +734,9 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	while (!list_empty(&bdi->wb_list)) {
 		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
 				      bdi_node);
+		rcu_read_lock();
 		spin_unlock_irq(&cgwb_lock);
-		wb_shutdown(wb);
+		wb_shutdown(wb, true);
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);
@@ -944,7 +958,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
 {
 	/* make sure nobody finds us on the bdi_list anymore */
 	bdi_remove_from_list(bdi);
-	wb_shutdown(&bdi->wb);
+	wb_shutdown(&bdi->wb, false);
 	cgwb_bdi_unregister(bdi);
 
 	if (bdi->dev) {
-- 



Or, equivalent one but easier to read:



 mm/backing-dev.c | 42 +++++++++++++++++++++++++++++++-----------
 1 file changed, 31 insertions(+), 11 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..77088a3 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -350,15 +350,25 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 
 static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
 
-/*
- * Remove bdi from the global list and shutdown any threads we have running
- */
-static void wb_shutdown(struct bdi_writeback *wb)
+static bool wb_start_shutdown(struct bdi_writeback *wb)
 {
 	/* Make sure nobody queues further work */
 	spin_lock_bh(&wb->work_lock);
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
+		return false;
+	}
+	set_bit(WB_shutting_down, &wb->state);
+	spin_unlock_bh(&wb->work_lock);
+	return true;
+}
+
+/*
+ * Remove bdi from the global list and shutdown any threads we have running
+ */
+static void wb_shutdown(struct bdi_writeback *wb, const bool started)
+{
+	if (!started && !wb_start_shutdown(wb)) {
 		/*
 		 * Wait for wb shutdown to finish if someone else is just
 		 * running wb_shutdown(). Otherwise we could proceed to wb /
@@ -367,10 +377,6 @@ static void wb_shutdown(struct bdi_writeback *wb)
 		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
 		return;
 	}
-	set_bit(WB_shutting_down, &wb->state);
-	spin_unlock_bh(&wb->work_lock);
-
-	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -379,6 +385,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Make sure bit gets cleared after shutdown is finished. Matches with
 	 * the barrier provided by test_and_clear_bit() above.
@@ -508,7 +515,7 @@ static void cgwb_release_workfn(struct work_struct *work)
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
 
-	wb_shutdown(wb);
+	wb_shutdown(wb, false);
 
 	css_put(wb->memcg_css);
 	css_put(wb->blkcg_css);
@@ -711,6 +718,7 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	struct radix_tree_iter iter;
 	void **slot;
 	struct bdi_writeback *wb;
+	bool started;
 
 	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
@@ -721,8 +729,20 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	while (!list_empty(&bdi->wb_list)) {
 		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
 				      bdi_node);
+		rcu_read_lock();
 		spin_unlock_irq(&cgwb_lock);
-		wb_shutdown(wb);
+		started = wb_start_shutdown(wb);
+		rcu_read_unlock();
+		if (started)
+			wb_shutdown(wb, true);
+		else
+			/*
+			 * Try to wait for cgwb_remove_from_bdi_list() without
+			 * using wait_on_bit(), for kfree_rcu(wb, rcu) from
+			 * cgwb_release_workfn() might invoke kfree() before we
+			 * recheck WB_shutting_down bit.
+			 */
+			schedule_timeout_uninterruptible(HZ / 10);
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);
@@ -944,7 +964,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
 {
 	/* make sure nobody finds us on the bdi_list anymore */
 	bdi_remove_from_list(bdi);
-	wb_shutdown(&bdi->wb);
+	wb_shutdown(&bdi->wb, false);
 	cgwb_bdi_unregister(bdi);
 
 	if (bdi->dev) {
-- 



Or, toggling a dedicated flag using test_and_change_bit():



 include/linux/backing-dev-defs.h | 1 +
 mm/backing-dev.c                 | 8 +++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 0bd432a..93ff83c 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -26,6 +26,7 @@ enum wb_state {
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 	WB_start_all,		/* nr_pages == 0 (all) work pending */
+	WB_postpone_kfree,      /* cgwb_bdi_unregister() will access later */
 };
 
 enum wb_congested_state {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..422d7a7 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -516,7 +516,10 @@ static void cgwb_release_workfn(struct work_struct *work)
 	fprop_local_destroy_percpu(&wb->memcg_completions);
 	percpu_ref_exit(&wb->refcnt);
 	wb_exit(wb);
-	kfree_rcu(wb, rcu);
+	spin_lock_irq(&cgwb_lock);
+	if (!test_and_change_bit(WB_postpone_kfree, &wb->state))
+		kfree_rcu(wb, rcu);
+	spin_unlock_irq(&cgwb_lock);
 }
 
 static void cgwb_release(struct percpu_ref *refcnt)
@@ -721,9 +724,12 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	while (!list_empty(&bdi->wb_list)) {
 		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
 				      bdi_node);
+		set_bit(WB_postpone_kfree, &wb->state);
 		spin_unlock_irq(&cgwb_lock);
 		wb_shutdown(wb);
 		spin_lock_irq(&cgwb_lock);
+		if (!test_and_change_bit(WB_postpone_kfree, &wb->state))
+			kfree_rcu(wb, rcu);
 	}
 	spin_unlock_irq(&cgwb_lock);
 }
-- 

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-13 10:43                                           ` Tetsuo Handa
@ 2018-06-13 11:51                                             ` Tetsuo Handa
  2018-06-13 14:06                                             ` Linus Torvalds
  2018-06-13 14:46                                               ` Jan Kara
  2 siblings, 0 replies; 43+ messages in thread
From: Tetsuo Handa @ 2018-06-13 11:51 UTC (permalink / raw)
  To: Jan Kara, Tejun Heo
  Cc: Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel,
	LKML, Al Viro, Dave Chinner, linux-block, Linus Torvalds

Tetsuo Handa wrote:
> Or, toggling a dedicated flag using test_and_change_bit():
> 
> 
>  include/linux/backing-dev-defs.h | 1 +
>  mm/backing-dev.c                 | 8 +++++++-
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 0bd432a..93ff83c 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -26,6 +26,7 @@ enum wb_state {
>  	WB_writeback_running,	/* Writeback is in progress */
>  	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
>  	WB_start_all,		/* nr_pages == 0 (all) work pending */
> +	WB_postpone_kfree,      /* cgwb_bdi_unregister() will access later */
>  };
>  
>  enum wb_congested_state {
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 347cc83..422d7a7 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -516,7 +516,10 @@ static void cgwb_release_workfn(struct work_struct *work)
>  	fprop_local_destroy_percpu(&wb->memcg_completions);
>  	percpu_ref_exit(&wb->refcnt);
>  	wb_exit(wb);
> -	kfree_rcu(wb, rcu);
> +	spin_lock_irq(&cgwb_lock);
> +	if (!test_and_change_bit(WB_postpone_kfree, &wb->state))
> +		kfree_rcu(wb, rcu);
> +	spin_unlock_irq(&cgwb_lock);
>  }
>  
>  static void cgwb_release(struct percpu_ref *refcnt)
> @@ -721,9 +724,12 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
>  	while (!list_empty(&bdi->wb_list)) {
>  		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
>  				      bdi_node);
> +		set_bit(WB_postpone_kfree, &wb->state);
>  		spin_unlock_irq(&cgwb_lock);
>  		wb_shutdown(wb);
>  		spin_lock_irq(&cgwb_lock);
> +		if (!test_and_change_bit(WB_postpone_kfree, &wb->state))
> +			kfree_rcu(wb, rcu);
>  	}
>  	spin_unlock_irq(&cgwb_lock);
>  }

Forgot to include below change, but isn't this approach the simplest?

@@ -370,7 +370,6 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
 
-	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -379,6 +378,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Make sure bit gets cleared after shutdown is finished. Matches with
 	 * the barrier provided by test_and_clear_bit() above.
-- 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-13 10:43                                           ` Tetsuo Handa
  2018-06-13 11:51                                             ` Tetsuo Handa
@ 2018-06-13 14:06                                             ` Linus Torvalds
  2018-06-13 14:46                                               ` Jan Kara
  2 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2018-06-13 14:06 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe,
	syzbot+4a7438e774b21ddd8eca, syzkaller-bugs, linux-fsdevel,
	Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block

On Wed, Jun 13, 2018 at 3:44 AM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> Can't we utilize RCU grace period (like shown below) ?

_Please_ don't do conditional locking. Particularly not the kind where
then a function drops the lock that was taken in another function -
and only does it based on some parameter.

Yes, we have cases where we do it, but it's seldom a great idea, and
it _really_ makes for subtle code (and subtle bugs).

                  Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-12 15:57                                         ` Jan Kara
  2018-06-13 10:43                                           ` Tetsuo Handa
@ 2018-06-13 14:33                                           ` Tejun Heo
  2018-06-15 12:06                                               ` Jan Kara
  1 sibling, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2018-06-13 14:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs,
	linux-fsdevel, LKML, Al Viro, Dave Chinner, linux-block,
	Linus Torvalds

Hello, Jan.

On Tue, Jun 12, 2018 at 05:57:54PM +0200, Jan Kara wrote:
> > Yeah, right, so the root cause is that we're walking the wb_list while
> > holding lock and expecting the object to stay there even after lock is
> > released.  Hmm... we can use a mutex to synchronize the two
> > destruction paths.  It's not like they're hot paths anyway.
> 
> Hmm, do you mean like having a per-bdi or even a global mutex that would
> protect whole wb_shutdown()? Yes, that should work and we could get rid of
> WB_shutting_down bit as well with that. Just it seems a bit strange to

Yeap.

> introduce a mutex only to synchronize these two shutdown paths - usually
> locks protect data structures and in this case we have cgwb_lock for
> that so it looks like a duplication from a first look.

Yeah, I feel a bit reluctant too but I think that's the right thing to
do here.  This is an inherently weird case where there are two ways
that an object can go away with the immediate drain requirement from
one side.  It's not a hot path and the dumber the synchronization the
better, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-13 10:43                                           ` Tetsuo Handa
@ 2018-06-13 14:46                                               ` Jan Kara
  2018-06-13 14:06                                             ` Linus Torvalds
  2018-06-13 14:46                                               ` Jan Kara
  2 siblings, 0 replies; 43+ messages in thread
From: Jan Kara @ 2018-06-13 14:46 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot,
	syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner,
	linux-block, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 521 bytes --]

On Wed 13-06-18 19:43:47, Tetsuo Handa wrote:
> Can't we utilize RCU grace period (like shown below) ?

Honestly, the variant 1 looks too ugly to me. However variant 2 looks
mostly OK. We can also avoid the schedule_timeout_uninterruptible(HZ / 10)
from your patch by careful handling of the bit waitqueues. Also I'd avoid
the addition argument to wb_writeback() and split the function instead. The
patch resulting from your and mine ideas is attached. Thoughts?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-bdi-Fix-another-oops-in-wb_workfn.patch --]
[-- Type: text/x-patch, Size: 5898 bytes --]

>From d0fbfad964ae5b0a331f0d814d527150bd0c305c Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Date: Wed, 13 Jun 2018 16:39:00 +0200
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:

CPU1                            CPU2
                                cgwb_bdi_unregister()
                                  cgwb_kill(*slot);

cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
                                  wb = list_first_entry(&bdi->wb_list, ...)
                                  spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...
  kfree_rcu(wb, rcu);
                                  wb_shutdown(wb); -> oops use-after-free

We solve these issues by making wb_shutdown() remove the writeback
structure from the bdi_list only after all the shutdown is done and by
making sure cgwb_bdi_unregister() properly synchronizes with concurrent
shutdowns using WB_shutting_down bit.

Signed-off-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 69 ++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 55 insertions(+), 14 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc834c04a..433807ae364e 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -350,27 +350,21 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 
 static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
 
-/*
- * Remove bdi from the global list and shutdown any threads we have running
- */
-static void wb_shutdown(struct bdi_writeback *wb)
+static bool wb_start_shutdown(struct bdi_writeback *wb)
 {
 	/* Make sure nobody queues further work */
 	spin_lock_bh(&wb->work_lock);
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
-		/*
-		 * Wait for wb shutdown to finish if someone else is just
-		 * running wb_shutdown(). Otherwise we could proceed to wb /
-		 * bdi destruction before wb_shutdown() is finished.
-		 */
-		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
-		return;
+		return false;
 	}
 	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
+	return true;
+}
 
-	cgwb_remove_from_bdi_list(wb);
+static void __wb_shutdown(struct bdi_writeback *wb)
+{
 	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -379,6 +373,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Make sure bit gets cleared after shutdown is finished. Matches with
 	 * the barrier provided by test_and_clear_bit() above.
@@ -387,6 +382,23 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	clear_and_wake_up_bit(WB_shutting_down, &wb->state);
 }
 
+/*
+ * Remove bdi from the global list and shutdown any threads we have running
+ */
+static void wb_shutdown(struct bdi_writeback *wb)
+{
+	if (!wb_start_shutdown(wb)) {
+		/*
+		 * Wait for wb shutdown to finish if someone else is just
+		 * running wb_shutdown(). Otherwise we could proceed to wb /
+		 * bdi destruction before wb_shutdown() is finished.
+		 */
+		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
+		return;
+	}
+	__wb_shutdown(wb);
+}
+
 static void wb_exit(struct bdi_writeback *wb)
 {
 	int i;
@@ -706,6 +718,35 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	return ret;
 }
 
+/*
+ * Mark writeback structure as shutting down. The function returns true if
+ * we successfully marked the structure, false if shutdown was already in
+ * progress (in which case we also wait for it to finish).
+ *
+ * The function requires cgwb_lock to be held and releases it. Note that the
+ * caller need not hold reference to the writeback structure and thus once
+ * cgwb_lock is released, the writeback structure can be freed.
+ */
+static bool cgwb_start_shutdown(struct bdi_writeback *wb)
+	__releases(cgwb_lock)
+{
+	if (!wb_start_shutdown(wb)) {
+		DEFINE_WAIT(wait);
+		wait_queue_head_t *wqh = bit_waitqueue(&wb->state,
+						       WB_shutting_down);
+		bool sleep;
+
+		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		sleep = test_bit(WB_shutting_down, &wb->state);
+		spin_unlock_irq(&cgwb_lock);
+		if (sleep)
+			schedule();
+		return false;
+	}
+	spin_unlock_irq(&cgwb_lock);
+	return true;
+}
+
 static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
@@ -721,8 +762,8 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	while (!list_empty(&bdi->wb_list)) {
 		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
 				      bdi_node);
-		spin_unlock_irq(&cgwb_lock);
-		wb_shutdown(wb);
+		if (cgwb_start_shutdown(wb))
+			__wb_shutdown(wb);
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
@ 2018-06-13 14:46                                               ` Jan Kara
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Kara @ 2018-06-13 14:46 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot,
	syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner,
	linux-block, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 521 bytes --]

On Wed 13-06-18 19:43:47, Tetsuo Handa wrote:
> Can't we utilize RCU grace period (like shown below) ?

Honestly, the variant 1 looks too ugly to me. However variant 2 looks
mostly OK. We can also avoid the schedule_timeout_uninterruptible(HZ / 10)
from your patch by careful handling of the bit waitqueues. Also I'd avoid
the addition argument to wb_writeback() and split the function instead. The
patch resulting from your and mine ideas is attached. Thoughts?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-bdi-Fix-another-oops-in-wb_workfn.patch --]
[-- Type: text/x-patch, Size: 5897 bytes --]

From d0fbfad964ae5b0a331f0d814d527150bd0c305c Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Date: Wed, 13 Jun 2018 16:39:00 +0200
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:

CPU1                            CPU2
                                cgwb_bdi_unregister()
                                  cgwb_kill(*slot);

cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
                                  wb = list_first_entry(&bdi->wb_list, ...)
                                  spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...
  kfree_rcu(wb, rcu);
                                  wb_shutdown(wb); -> oops use-after-free

We solve these issues by making wb_shutdown() remove the writeback
structure from the bdi_list only after all the shutdown is done and by
making sure cgwb_bdi_unregister() properly synchronizes with concurrent
shutdowns using WB_shutting_down bit.

Signed-off-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 69 ++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 55 insertions(+), 14 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc834c04a..433807ae364e 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -350,27 +350,21 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 
 static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
 
-/*
- * Remove bdi from the global list and shutdown any threads we have running
- */
-static void wb_shutdown(struct bdi_writeback *wb)
+static bool wb_start_shutdown(struct bdi_writeback *wb)
 {
 	/* Make sure nobody queues further work */
 	spin_lock_bh(&wb->work_lock);
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
-		/*
-		 * Wait for wb shutdown to finish if someone else is just
-		 * running wb_shutdown(). Otherwise we could proceed to wb /
-		 * bdi destruction before wb_shutdown() is finished.
-		 */
-		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
-		return;
+		return false;
 	}
 	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
+	return true;
+}
 
-	cgwb_remove_from_bdi_list(wb);
+static void __wb_shutdown(struct bdi_writeback *wb)
+{
 	/*
 	 * Drain work list and shutdown the delayed_work.  !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
@@ -379,6 +373,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
+	cgwb_remove_from_bdi_list(wb);
 	/*
 	 * Make sure bit gets cleared after shutdown is finished. Matches with
 	 * the barrier provided by test_and_clear_bit() above.
@@ -387,6 +382,23 @@ static void wb_shutdown(struct bdi_writeback *wb)
 	clear_and_wake_up_bit(WB_shutting_down, &wb->state);
 }
 
+/*
+ * Remove bdi from the global list and shutdown any threads we have running
+ */
+static void wb_shutdown(struct bdi_writeback *wb)
+{
+	if (!wb_start_shutdown(wb)) {
+		/*
+		 * Wait for wb shutdown to finish if someone else is just
+		 * running wb_shutdown(). Otherwise we could proceed to wb /
+		 * bdi destruction before wb_shutdown() is finished.
+		 */
+		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
+		return;
+	}
+	__wb_shutdown(wb);
+}
+
 static void wb_exit(struct bdi_writeback *wb)
 {
 	int i;
@@ -706,6 +718,35 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	return ret;
 }
 
+/*
+ * Mark writeback structure as shutting down. The function returns true if
+ * we successfully marked the structure, false if shutdown was already in
+ * progress (in which case we also wait for it to finish).
+ *
+ * The function requires cgwb_lock to be held and releases it. Note that the
+ * caller need not hold reference to the writeback structure and thus once
+ * cgwb_lock is released, the writeback structure can be freed.
+ */
+static bool cgwb_start_shutdown(struct bdi_writeback *wb)
+	__releases(cgwb_lock)
+{
+	if (!wb_start_shutdown(wb)) {
+		DEFINE_WAIT(wait);
+		wait_queue_head_t *wqh = bit_waitqueue(&wb->state,
+						       WB_shutting_down);
+		bool sleep;
+
+		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		sleep = test_bit(WB_shutting_down, &wb->state);
+		spin_unlock_irq(&cgwb_lock);
+		if (sleep)
+			schedule();
+		return false;
+	}
+	spin_unlock_irq(&cgwb_lock);
+	return true;
+}
+
 static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
@@ -721,8 +762,8 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	while (!list_empty(&bdi->wb_list)) {
 		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
 				      bdi_node);
-		spin_unlock_irq(&cgwb_lock);
-		wb_shutdown(wb);
+		if (cgwb_start_shutdown(wb))
+			__wb_shutdown(wb);
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-13 14:46                                               ` Jan Kara
  (?)
@ 2018-06-13 14:55                                               ` Linus Torvalds
  -1 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2018-06-13 14:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tetsuo Handa, Tejun Heo, Dmitry Vyukov, Jens Axboe,
	syzbot+4a7438e774b21ddd8eca, syzkaller-bugs, linux-fsdevel,
	Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block

On Wed, Jun 13, 2018 at 7:46 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 13-06-18 19:43:47, Tetsuo Handa wrote:
> > Can't we utilize RCU grace period (like shown below) ?
>
> Honestly, the variant 1 looks too ugly to me. However variant 2 looks
> mostly OK.

The versions that don't have that conditional locking look fine to me, yes.

> Also I'd avoid the addition argument to wb_writeback() and split the function instead. The
> patch resulting from your and mine ideas is attached. Thoughts?

Is there a reason for this model:

+               if (cgwb_start_shutdown(wb))
+                       __wb_shutdown(wb);

when there is just one call site of this? Why not just make the
function void, and make it do that __wb_shutdown() itself in the true
case?

IOW, just make it be

+               cgwb_shutdown(wb);

instead?

That's what "wb_shutdown()" does - it does the "wb_start_shutdown()"
test internally, and does __wb_shutdown() all inside itself, instead
of expecting the caller to do it.

I dunno.

              Linus

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-13 14:46                                               ` Jan Kara
  (?)
  (?)
@ 2018-06-13 16:20                                               ` Tetsuo Handa
  2018-06-13 16:25                                                 ` Linus Torvalds
  -1 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2018-06-13 16:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs,
	linux-fsdevel, LKML, Al Viro, Dave Chinner, linux-block,
	Linus Torvalds

On 2018/06/13 23:46, Jan Kara wrote:
> On Wed 13-06-18 19:43:47, Tetsuo Handa wrote:
>> Can't we utilize RCU grace period (like shown below) ?
> 
> Honestly, the variant 1 looks too ugly to me. However variant 2 looks
> mostly OK. We can also avoid the schedule_timeout_uninterruptible(HZ / 10)
> from your patch by careful handling of the bit waitqueues. Also I'd avoid
> the addition argument to wb_writeback() and split the function instead. The
> patch resulting from your and mine ideas is attached. Thoughts?
> 
> 								Honza
> 

+static bool cgwb_start_shutdown(struct bdi_writeback *wb)
+	__releases(cgwb_lock)
+{
+	if (!wb_start_shutdown(wb)) {
+		DEFINE_WAIT(wait);
+		wait_queue_head_t *wqh = bit_waitqueue(&wb->state,
+						       WB_shutting_down);
+		bool sleep;
+
+		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		sleep = test_bit(WB_shutting_down, &wb->state);
+		spin_unlock_irq(&cgwb_lock);
+		if (sleep)
+			schedule();
+		return false;
+	}
+	spin_unlock_irq(&cgwb_lock);
+	return true;
+}

Since multiple addresses share bit_wait_table[256], isn't it possible that
cgwb_start_shutdown() prematurely returns false due to wake_up_bit() by
hash-conflicting addresses (i.e. not limited to clear_and_wake_up_bit() from
wb_shutdown())? I think that we cannot be sure without confirming that
test_bit(WB_shutting_down, &wb->state) == false after returning from schedule().

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-13 16:20                                               ` Tetsuo Handa
@ 2018-06-13 16:25                                                 ` Linus Torvalds
  2018-06-13 16:45                                                   ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2018-06-13 16:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe,
	syzbot+4a7438e774b21ddd8eca, syzkaller-bugs, linux-fsdevel,
	Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block

On Wed, Jun 13, 2018 at 9:21 AM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> Since multiple addresses share bit_wait_table[256], isn't it possible that
> cgwb_start_shutdown() prematurely returns false due to wake_up_bit() by
> hash-conflicting addresses (i.e. not limited to clear_and_wake_up_bit() from
> wb_shutdown())? I think that we cannot be sure without confirming that
> test_bit(WB_shutting_down, &wb->state) == false after returning from schedule().

Right.

That's _always_ true, btw. Something else entirely could have woken
you up. TASK_UNINTERRUPTIBLE does not mean "nothing else wakes me", it
just means "_signals_ don't wake me".

So every single sleep always needs to be in a loop. Always.

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-13 16:25                                                 ` Linus Torvalds
@ 2018-06-13 16:45                                                   ` Jan Kara
  2018-06-13 21:04                                                     ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Kara @ 2018-06-13 16:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe,
	syzbot+4a7438e774b21ddd8eca, syzkaller-bugs, linux-fsdevel,
	Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block

On Wed 13-06-18 09:25:03, Linus Torvalds wrote:
> On Wed, Jun 13, 2018 at 9:21 AM Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > Since multiple addresses share bit_wait_table[256], isn't it possible that
> > cgwb_start_shutdown() prematurely returns false due to wake_up_bit() by
> > hash-conflicting addresses (i.e. not limited to clear_and_wake_up_bit() from
> > wb_shutdown())? I think that we cannot be sure without confirming that
> > test_bit(WB_shutting_down, &wb->state) == false after returning from schedule().
> 
> Right.
> 
> That's _always_ true, btw. Something else entirely could have woken
> you up. TASK_UNINTERRUPTIBLE does not mean "nothing else wakes me", it
> just means "_signals_ don't wake me".
> 
> So every single sleep always needs to be in a loop. Always.

Agreed and in my patch it actually is in a loop - the one iterating the
list of active writeback structures. If we get a false wakeup, we find the
same structure in the list again and wait again...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-13 16:45                                                   ` Jan Kara
@ 2018-06-13 21:04                                                     ` Tetsuo Handa
  2018-06-14 10:11                                                       ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2018-06-13 21:04 UTC (permalink / raw)
  To: Jan Kara, Linus Torvalds
  Cc: Tejun Heo, Dmitry Vyukov, Jens Axboe,
	syzbot+4a7438e774b21ddd8eca, syzkaller-bugs, linux-fsdevel,
	Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block

On 2018/06/14 1:45, Jan Kara wrote:
> On Wed 13-06-18 09:25:03, Linus Torvalds wrote:
>> On Wed, Jun 13, 2018 at 9:21 AM Tetsuo Handa
>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>>
>>> Since multiple addresses share bit_wait_table[256], isn't it possible that
>>> cgwb_start_shutdown() prematurely returns false due to wake_up_bit() by
>>> hash-conflicting addresses (i.e. not limited to clear_and_wake_up_bit() from
>>> wb_shutdown())? I think that we cannot be sure without confirming that
>>> test_bit(WB_shutting_down, &wb->state) == false after returning from schedule().
>>
>> Right.
>>
>> That's _always_ true, btw. Something else entirely could have woken
>> you up. TASK_UNINTERRUPTIBLE does not mean "nothing else wakes me", it
>> just means "_signals_ don't wake me".
>>
>> So every single sleep always needs to be in a loop. Always.
> 
> Agreed and in my patch it actually is in a loop - the one iterating the
> list of active writeback structures. If we get a false wakeup, we find the
> same structure in the list again and wait again...

Indeed. I overlooked that wb = list_first_entry() will select same wb again
if cgwb_remove_from_bdi_list() is not yet called. Well, we could update
"(in which case we also wait for it to finish)" part or move the body of
cgwb_start_shutdown() to cgwb_bdi_unregister() so that it becomes clear
that false wake-up is not a problem in this case.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-13 21:04                                                     ` Tetsuo Handa
@ 2018-06-14 10:11                                                       ` Jan Kara
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Kara @ 2018-06-14 10:11 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Jan Kara, Linus Torvalds, Tejun Heo, Dmitry Vyukov, Jens Axboe,
	syzbot+4a7438e774b21ddd8eca, syzkaller-bugs, linux-fsdevel,
	Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block

On Thu 14-06-18 06:04:04, Tetsuo Handa wrote:
> On 2018/06/14 1:45, Jan Kara wrote:
> > On Wed 13-06-18 09:25:03, Linus Torvalds wrote:
> >> On Wed, Jun 13, 2018 at 9:21 AM Tetsuo Handa
> >> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >>>
> >>> Since multiple addresses share bit_wait_table[256], isn't it possible that
> >>> cgwb_start_shutdown() prematurely returns false due to wake_up_bit() by
> >>> hash-conflicting addresses (i.e. not limited to clear_and_wake_up_bit() from
> >>> wb_shutdown())? I think that we cannot be sure without confirming that
> >>> test_bit(WB_shutting_down, &wb->state) == false after returning from schedule().
> >>
> >> Right.
> >>
> >> That's _always_ true, btw. Something else entirely could have woken
> >> you up. TASK_UNINTERRUPTIBLE does not mean "nothing else wakes me", it
> >> just means "_signals_ don't wake me".
> >>
> >> So every single sleep always needs to be in a loop. Always.
> > 
> > Agreed and in my patch it actually is in a loop - the one iterating the
> > list of active writeback structures. If we get a false wakeup, we find the
> > same structure in the list again and wait again...
> 
> Indeed. I overlooked that wb = list_first_entry() will select same wb again
> if cgwb_remove_from_bdi_list() is not yet called. Well, we could update
> "(in which case we also wait for it to finish)" part or move the body of
> cgwb_start_shutdown() to cgwb_bdi_unregister() so that it becomes clear
> that false wake-up is not a problem in this case.

I prefer to keep the wb shutdown in a separate function but I've added some
comments to explain that.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-13 14:33                                           ` Tejun Heo
@ 2018-06-15 12:06                                               ` Jan Kara
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Kara @ 2018-06-15 12:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot,
	syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner,
	linux-block, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 1488 bytes --]

On Wed 13-06-18 07:33:15, Tejun Heo wrote:
> Hello, Jan.
> 
> On Tue, Jun 12, 2018 at 05:57:54PM +0200, Jan Kara wrote:
> > > Yeah, right, so the root cause is that we're walking the wb_list while
> > > holding lock and expecting the object to stay there even after lock is
> > > released.  Hmm... we can use a mutex to synchronize the two
> > > destruction paths.  It's not like they're hot paths anyway.
> > 
> > Hmm, do you mean like having a per-bdi or even a global mutex that would
> > protect whole wb_shutdown()? Yes, that should work and we could get rid of
> > WB_shutting_down bit as well with that. Just it seems a bit strange to
> 
> Yeap.
> 
> > introduce a mutex only to synchronize these two shutdown paths - usually
> > locks protect data structures and in this case we have cgwb_lock for
> > that so it looks like a duplication from a first look.
> 
> Yeah, I feel a bit reluctant too but I think that's the right thing to
> do here.  This is an inherently weird case where there are two ways
> that an object can go away with the immediate drain requirement from
> one side.  It's not a hot path and the dumber the synchronization the
> better, right?

Yeah, fair enough. Something like attached patch? It is indeed considerably
simpler than fixing synchronization using WB_shutting_down. This one even
got some testing using scsi_debug, I want to do more testing next week with
more cgroup writeback included.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-bdi-Fix-another-oops-in-wb_workfn.patch --]
[-- Type: text/x-patch, Size: 3885 bytes --]

>From 95fa7fe8ebc7dcf4d09a2e097c06abdf24ba81b5 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 14 Jun 2018 14:30:03 +0200
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:

CPU1                            CPU2
                                cgwb_bdi_unregister()
                                  cgwb_kill(*slot);

cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
                                  wb = list_first_entry(&bdi->wb_list, ...)
                                  spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...
  kfree_rcu(wb, rcu);
                                  wb_shutdown(wb); -> oops use-after-free

We solve these issues by synchronizing writeback structure shutdown from
cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h | 1 +
 mm/backing-dev.c                 | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 0bd432a4d7bd..661e3121e580 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -189,6 +189,7 @@ struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct rb_root cgwb_congested_tree; /* their congested states */
+	struct mutex cgwb_release_mutex;  /* protect shutdown of wb structs */
 #else
 	struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc834c04a..078cb4e0fb62 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -508,10 +508,12 @@ static void cgwb_release_workfn(struct work_struct *work)
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
 
+	mutex_lock(&wb->bdi->cgwb_release_mutex);
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
 	css_put(wb->blkcg_css);
+	mutex_unlock(&wb->bdi->cgwb_release_mutex);
 
 	fprop_local_destroy_percpu(&wb->memcg_completions);
 	percpu_ref_exit(&wb->refcnt);
@@ -697,6 +699,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
 	bdi->cgwb_congested_tree = RB_ROOT;
+	mutex_init(&bdi->cgwb_release_mutex);
 
 	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
 	if (!ret) {
@@ -717,7 +720,10 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	spin_lock_irq(&cgwb_lock);
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
+	spin_unlock_irq(&cgwb_lock);
 
+	mutex_lock(&bdi->cgwb_release_mutex);
+	spin_lock_irq(&cgwb_lock);
 	while (!list_empty(&bdi->wb_list)) {
 		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
 				      bdi_node);
@@ -726,6 +732,7 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);
+	mutex_unlock(&bdi->cgwb_release_mutex);
 }
 
 /**
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
@ 2018-06-15 12:06                                               ` Jan Kara
  0 siblings, 0 replies; 43+ messages in thread
From: Jan Kara @ 2018-06-15 12:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot,
	syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner,
	linux-block, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 1488 bytes --]

On Wed 13-06-18 07:33:15, Tejun Heo wrote:
> Hello, Jan.
> 
> On Tue, Jun 12, 2018 at 05:57:54PM +0200, Jan Kara wrote:
> > > Yeah, right, so the root cause is that we're walking the wb_list while
> > > holding lock and expecting the object to stay there even after lock is
> > > released.  Hmm... we can use a mutex to synchronize the two
> > > destruction paths.  It's not like they're hot paths anyway.
> > 
> > Hmm, do you mean like having a per-bdi or even a global mutex that would
> > protect whole wb_shutdown()? Yes, that should work and we could get rid of
> > WB_shutting_down bit as well with that. Just it seems a bit strange to
> 
> Yeap.
> 
> > introduce a mutex only to synchronize these two shutdown paths - usually
> > locks protect data structures and in this case we have cgwb_lock for
> > that so it looks like a duplication from a first look.
> 
> Yeah, I feel a bit reluctant too but I think that's the right thing to
> do here.  This is an inherently weird case where there are two ways
> that an object can go away with the immediate drain requirement from
> one side.  It's not a hot path and the dumber the synchronization the
> better, right?

Yeah, fair enough. Something like attached patch? It is indeed considerably
simpler than fixing synchronization using WB_shutting_down. This one even
got some testing using scsi_debug, I want to do more testing next week with
more cgroup writeback included.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-bdi-Fix-another-oops-in-wb_workfn.patch --]
[-- Type: text/x-patch, Size: 3884 bytes --]

From 95fa7fe8ebc7dcf4d09a2e097c06abdf24ba81b5 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 14 Jun 2018 14:30:03 +0200
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:

CPU1                            CPU2
                                cgwb_bdi_unregister()
                                  cgwb_kill(*slot);

cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
                                  wb = list_first_entry(&bdi->wb_list, ...)
                                  spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...
  kfree_rcu(wb, rcu);
                                  wb_shutdown(wb); -> oops use-after-free

We solve these issues by synchronizing writeback structure shutdown from
cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h | 1 +
 mm/backing-dev.c                 | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 0bd432a4d7bd..661e3121e580 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -189,6 +189,7 @@ struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct rb_root cgwb_congested_tree; /* their congested states */
+	struct mutex cgwb_release_mutex;  /* protect shutdown of wb structs */
 #else
 	struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc834c04a..078cb4e0fb62 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -508,10 +508,12 @@ static void cgwb_release_workfn(struct work_struct *work)
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
 
+	mutex_lock(&wb->bdi->cgwb_release_mutex);
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
 	css_put(wb->blkcg_css);
+	mutex_unlock(&wb->bdi->cgwb_release_mutex);
 
 	fprop_local_destroy_percpu(&wb->memcg_completions);
 	percpu_ref_exit(&wb->refcnt);
@@ -697,6 +699,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
 	bdi->cgwb_congested_tree = RB_ROOT;
+	mutex_init(&bdi->cgwb_release_mutex);
 
 	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
 	if (!ret) {
@@ -717,7 +720,10 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	spin_lock_irq(&cgwb_lock);
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
+	spin_unlock_irq(&cgwb_lock);
 
+	mutex_lock(&bdi->cgwb_release_mutex);
+	spin_lock_irq(&cgwb_lock);
 	while (!list_empty(&bdi->wb_list)) {
 		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
 				      bdi_node);
@@ -726,6 +732,7 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);
+	mutex_unlock(&bdi->cgwb_release_mutex);
 }
 
 /**
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] bdi: Fix another oops in wb_workfn()
  2018-06-15 12:06                                               ` Jan Kara
  (?)
@ 2018-06-18 12:27                                               ` Jan Kara
  -1 siblings, 0 replies; 43+ messages in thread
From: Jan Kara @ 2018-06-18 12:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot,
	syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner,
	linux-block, Linus Torvalds

On Fri 15-06-18 14:06:20, Jan Kara wrote:
> On Wed 13-06-18 07:33:15, Tejun Heo wrote:
> > Hello, Jan.
> > 
> > On Tue, Jun 12, 2018 at 05:57:54PM +0200, Jan Kara wrote:
> > > > Yeah, right, so the root cause is that we're walking the wb_list while
> > > > holding lock and expecting the object to stay there even after lock is
> > > > released.  Hmm... we can use a mutex to synchronize the two
> > > > destruction paths.  It's not like they're hot paths anyway.
> > > 
> > > Hmm, do you mean like having a per-bdi or even a global mutex that would
> > > protect whole wb_shutdown()? Yes, that should work and we could get rid of
> > > WB_shutting_down bit as well with that. Just it seems a bit strange to
> > 
> > Yeap.
> > 
> > > introduce a mutex only to synchronize these two shutdown paths - usually
> > > locks protect data structures and in this case we have cgwb_lock for
> > > that so it looks like a duplication from a first look.
> > 
> > Yeah, I feel a bit reluctant too but I think that's the right thing to
> > do here.  This is an inherently weird case where there are two ways
> > that an object can go away with the immediate drain requirement from
> > one side.  It's not a hot path and the dumber the synchronization the
> > better, right?
> 
> Yeah, fair enough. Something like attached patch? It is indeed considerably
> simpler than fixing synchronization using WB_shutting_down. This one even
> got some testing using scsi_debug, I want to do more testing next week with
> more cgroup writeback included.

OK, the test has passed some beating with cgroup writeback running. I'll do
official posting shortly.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2018-06-18 12:27 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-26  9:15 general protection fault in wb_workfn (2) syzbot
2018-05-27  0:47 ` Tetsuo Handa
2018-05-27  2:21   ` [PATCH] bdi: Fix another oops in wb_workfn() Tetsuo Handa
2018-05-27  2:36     ` Tejun Heo
2018-05-27  4:43       ` Tetsuo Handa
2018-05-29 13:46         ` Tejun Heo
2018-05-28 13:35   ` general protection fault in wb_workfn (2) Jan Kara
2018-05-30 16:00     ` Tetsuo Handa
2018-05-30 16:00       ` Tetsuo Handa
2018-05-31 11:42       ` Jan Kara
2018-05-31 13:19         ` Tetsuo Handa
2018-05-31 13:42           ` Jan Kara
2018-05-31 16:56             ` Jens Axboe
2018-06-05 13:45               ` Tetsuo Handa
2018-06-07 18:46                 ` Dmitry Vyukov
2018-06-08  2:31                   ` Tetsuo Handa
2018-06-08 14:45                     ` Dmitry Vyukov
2018-06-08 15:16                       ` Dmitry Vyukov
2018-06-08 16:53                         ` Dmitry Vyukov
2018-06-08 17:14                           ` Dmitry Vyukov
2018-06-09  5:30                             ` Tetsuo Handa
2018-06-09 14:00                               ` [PATCH] bdi: Fix another oops in wb_workfn() Tetsuo Handa
2018-06-11  9:12                                 ` Jan Kara
2018-06-11 16:01                                   ` Tejun Heo
2018-06-11 16:29                                     ` Jan Kara
2018-06-11 17:20                                       ` Tejun Heo
2018-06-12 15:57                                         ` Jan Kara
2018-06-13 10:43                                           ` Tetsuo Handa
2018-06-13 11:51                                             ` Tetsuo Handa
2018-06-13 14:06                                             ` Linus Torvalds
2018-06-13 14:46                                             ` Jan Kara
2018-06-13 14:46                                               ` Jan Kara
2018-06-13 14:55                                               ` Linus Torvalds
2018-06-13 16:20                                               ` Tetsuo Handa
2018-06-13 16:25                                                 ` Linus Torvalds
2018-06-13 16:45                                                   ` Jan Kara
2018-06-13 21:04                                                     ` Tetsuo Handa
2018-06-14 10:11                                                       ` Jan Kara
2018-06-13 14:33                                           ` Tejun Heo
2018-06-15 12:06                                             ` Jan Kara
2018-06-15 12:06                                               ` Jan Kara
2018-06-18 12:27                                               ` Jan Kara
2018-06-01  2:30             ` general protection fault in wb_workfn (2) Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.