linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* write call hangs in kernel space after virtio hot-remove
@ 2018-04-25 20:07 Fabiano Rosas
  2018-04-26 18:50 ` Tejun Heo
  2018-05-03 14:42 ` Jan Kara
  0 siblings, 2 replies; 8+ messages in thread
From: Fabiano Rosas @ 2018-04-25 20:07 UTC (permalink / raw)
  To: linux-block; +Cc: linux-fsdevel, jack, tj

I'm looking into an issue where removing a virtio disk via sysfs while another
process is issuing write() calls results in the writing task going into a
livelock:


root@guest # cat test.sh
#!/bin/bash

dd if=/dev/zero of=/dev/vda bs=1M count=10000 &
sleep 1
echo 1 > /sys/bus/pci/devices/0000:00:04.0/remove

root@guest # ls /dev/vd*
/dev/vda

root@guest # grep Dirty /proc/meminfo
Dirty:                 0 kB

root@guest # sh test.sh

root@guest # ps aux | grep "[d]d if"
root      1699 38.6  0.0 111424  1216 hvc0     D+   10:48   0:01 dd if=/dev/zero of=/dev/vda bs=1M count=10000

root@guest # ls /dev/vd*
ls: cannot access /dev/vd*: No such file or directory

root@guest # cat /proc/1699/stack
[<0>] 0xc0000000ffe28218
[<0>] __switch_to+0x31c/0x480
[<0>] balance_dirty_pages+0x990/0xb90
[<0>] balance_dirty_pages_ratelimited+0x50c/0x6c0
[<0>] generic_perform_write+0x1b0/0x260
[<0>] __generic_file_write_iter+0x200/0x240
[<0>] blkdev_write_iter+0xa4/0x150
[<0>] __vfs_write+0x14c/0x240
[<0>] vfs_write+0xd0/0x240
[<0>] ksys_write+0x6c/0x110
[<0>] system_call+0x58/0x6c

root@guest # grep Dirty /proc/meminfo
Dirty:           1506816 kB

---

I have done some tracing and I believe this is caused by the clearing of
'WB_registered' in 'wb_shutdown':

<snip>
sh-1697  [000] ....  3994.541664: sysfs_remove_link <-del_gendisk
sh-1697  [000] ....  3994.541671: wb_shutdown <-bdi_unregister
<snip>

Later, when 'balance_dirty_pages' tries to start writeback, it doesn't happen
because 'WB_registered' is not set:

fs/fs-writeback.c

static void wb_wakeup(struct bdi_writeback *wb)
{
        spin_lock_bh(&wb->work_lock);
        if (test_bit(WB_registered, &wb->state))
                mod_delayed_work(bdi_wq, &wb->dwork, 0);
        spin_unlock_bh(&wb->work_lock);
}

So we get stuck in a loop in 'balance_dirty_pages':

root@guest # cat /sys/kernel/debug/tracing/set_ftrace_filter
balance_dirty_pages_ratelimited
balance_dirty_pages
wb_wakeup
wb_workfn
io_schedule_timeout

<snip>
dd-1699  [000] .... 11192.535946: wb_wakeup <-balance_dirty_pages
dd-1699  [000] .... 11192.535950: io_schedule_timeout <-balance_dirty_pages
dd-1699  [000] .... 11192.745968: wb_wakeup <-balance_dirty_pages
dd-1699  [000] .... 11192.745972: io_schedule_timeout <-balance_dirty_pages
<snip>


The test on 'WB_registered' before starting the writeback task is introduced
by: "5acda9 bdi: avoid oops on device removal".

I have made a *naive* attempt at fixing it by allowing writeback to happen even
without 'WB_registered':


diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d4d04fe..050b067 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -982,7 +982,7 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
 	 * writeback as soon as there is no other work to do.
 	 */
 	trace_writeback_wake_background(wb);
-	wb_wakeup(wb);
+	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 }
 
 /*
@@ -1933,7 +1933,7 @@ void wb_workfn(struct work_struct *work)
 						struct bdi_writeback, dwork);
 	long pages_written;
 
-	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
+	set_worker_desc("flush-%s", wb->bdi->dev ? dev_name(wb->bdi->dev) : "?" );
 	current->flags |= PF_SWAPWRITE;
 
 	if (likely(!current_is_workqueue_rescuer() ||
--


The effect of that is that the 'dd' process now finishes successfully and we
get "Buffer I/O error"s for the dirty pages that remain. I believe this to be
in conformance with existing interfaces since dd does not issue any fsync()
calls.


Does my analysis make any sense and would something along these lines be
acceptable as a solution?


Cheers

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: write call hangs in kernel space after virtio hot-remove
  2018-04-25 20:07 write call hangs in kernel space after virtio hot-remove Fabiano Rosas
@ 2018-04-26 18:50 ` Tejun Heo
  2018-04-26 21:10   ` Tetsuo Handa
  2018-05-03 14:42 ` Jan Kara
  1 sibling, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2018-04-26 18:50 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: linux-block, linux-fsdevel, jack, Tetsuo Handa

(cc'ing Tetuso and quoting whole message)

Tetuso, could this be the same problem as the hang in wb_shutdown()
syszbot reported?

On Wed, Apr 25, 2018 at 05:07:48PM -0300, Fabiano Rosas wrote:
> I'm looking into an issue where removing a virtio disk via sysfs while another
> process is issuing write() calls results in the writing task going into a
> livelock:
> 
> 
> root@guest # cat test.sh
> #!/bin/bash
> 
> dd if=/dev/zero of=/dev/vda bs=1M count=10000 &
> sleep 1
> echo 1 > /sys/bus/pci/devices/0000:00:04.0/remove
> 
> root@guest # ls /dev/vd*
> /dev/vda
> 
> root@guest # grep Dirty /proc/meminfo
> Dirty:                 0 kB
> 
> root@guest # sh test.sh
> 
> root@guest # ps aux | grep "[d]d if"
> root      1699 38.6  0.0 111424  1216 hvc0     D+   10:48   0:01 dd if=/dev/zero of=/dev/vda bs=1M count=10000
> 
> root@guest # ls /dev/vd*
> ls: cannot access /dev/vd*: No such file or directory
> 
> root@guest # cat /proc/1699/stack
> [<0>] 0xc0000000ffe28218
> [<0>] __switch_to+0x31c/0x480
> [<0>] balance_dirty_pages+0x990/0xb90
> [<0>] balance_dirty_pages_ratelimited+0x50c/0x6c0
> [<0>] generic_perform_write+0x1b0/0x260
> [<0>] __generic_file_write_iter+0x200/0x240
> [<0>] blkdev_write_iter+0xa4/0x150
> [<0>] __vfs_write+0x14c/0x240
> [<0>] vfs_write+0xd0/0x240
> [<0>] ksys_write+0x6c/0x110
> [<0>] system_call+0x58/0x6c
> 
> root@guest # grep Dirty /proc/meminfo
> Dirty:           1506816 kB
> 
> ---
> 
> I have done some tracing and I believe this is caused by the clearing of
> 'WB_registered' in 'wb_shutdown':
> 
> <snip>
> sh-1697  [000] ....  3994.541664: sysfs_remove_link <-del_gendisk
> sh-1697  [000] ....  3994.541671: wb_shutdown <-bdi_unregister
> <snip>
> 
> Later, when 'balance_dirty_pages' tries to start writeback, it doesn't happen
> because 'WB_registered' is not set:
> 
> fs/fs-writeback.c
> 
> static void wb_wakeup(struct bdi_writeback *wb)
> {
>         spin_lock_bh(&wb->work_lock);
>         if (test_bit(WB_registered, &wb->state))
>                 mod_delayed_work(bdi_wq, &wb->dwork, 0);
>         spin_unlock_bh(&wb->work_lock);
> }
> 
> So we get stuck in a loop in 'balance_dirty_pages':
> 
> root@guest # cat /sys/kernel/debug/tracing/set_ftrace_filter
> balance_dirty_pages_ratelimited
> balance_dirty_pages
> wb_wakeup
> wb_workfn
> io_schedule_timeout
> 
> <snip>
> dd-1699  [000] .... 11192.535946: wb_wakeup <-balance_dirty_pages
> dd-1699  [000] .... 11192.535950: io_schedule_timeout <-balance_dirty_pages
> dd-1699  [000] .... 11192.745968: wb_wakeup <-balance_dirty_pages
> dd-1699  [000] .... 11192.745972: io_schedule_timeout <-balance_dirty_pages
> <snip>
> 
> 
> The test on 'WB_registered' before starting the writeback task is introduced
> by: "5acda9 bdi: avoid oops on device removal".
> 
> I have made a *naive* attempt at fixing it by allowing writeback to happen even
> without 'WB_registered':
> 
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d4d04fe..050b067 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -982,7 +982,7 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
>  	 * writeback as soon as there is no other work to do.
>  	 */
>  	trace_writeback_wake_background(wb);
> -	wb_wakeup(wb);
> +	mod_delayed_work(bdi_wq, &wb->dwork, 0);
>  }
>  
>  /*
> @@ -1933,7 +1933,7 @@ void wb_workfn(struct work_struct *work)
>  						struct bdi_writeback, dwork);
>  	long pages_written;
>  
> -	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
> +	set_worker_desc("flush-%s", wb->bdi->dev ? dev_name(wb->bdi->dev) : "?" );
>  	current->flags |= PF_SWAPWRITE;
>  
>  	if (likely(!current_is_workqueue_rescuer() ||
> --
> 
> 
> The effect of that is that the 'dd' process now finishes successfully and we
> get "Buffer I/O error"s for the dirty pages that remain. I believe this to be
> in conformance with existing interfaces since dd does not issue any fsync()
> calls.
> 
> 
> Does my analysis make any sense and would something along these lines be
> acceptable as a solution?
> 
> 
> Cheers
> 

-- 
tejun

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write call hangs in kernel space after virtio hot-remove
  2018-04-26 18:50 ` Tejun Heo
@ 2018-04-26 21:10   ` Tetsuo Handa
  2018-04-27 20:10     ` Fabiano Rosas
  0 siblings, 1 reply; 8+ messages in thread
From: Tetsuo Handa @ 2018-04-26 21:10 UTC (permalink / raw)
  To: tj, farosas; +Cc: linux-block, linux-fsdevel, jack

Tejun Heo wrote:
> (cc'ing Tetuso and quoting whole message)
> 
> Tetuso, could this be the same problem as the hang in wb_shutdown()
> syszbot reported?

Excuse me, but I'm too unfamiliar to judge it. ;-)

Anyway, since Fabiano has a reproducer, I appreciate trying a patch at
https://groups.google.com/forum/#!msg/syzkaller-bugs/qeR_1VM0UvU/1zcLE0Y4BAAJ .

Also, Fabiano's attempt might be somewhat related to a NULL pointer race condition
at https://groups.google.com/forum/#!msg/syzkaller-bugs/bVx4WExoTDw/G68X4kPcAQAJ .

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write call hangs in kernel space after virtio hot-remove
  2018-04-26 21:10   ` Tetsuo Handa
@ 2018-04-27 20:10     ` Fabiano Rosas
  0 siblings, 0 replies; 8+ messages in thread
From: Fabiano Rosas @ 2018-04-27 20:10 UTC (permalink / raw)
  To: Tetsuo Handa, tj; +Cc: linux-block, linux-fsdevel, jack

On 2018-04-26 18:10, Tetsuo Handa wrote:

> Anyway, since Fabiano has a reproducer, I appreciate trying a patch at
> https://groups.google.com/forum/#!msg/syzkaller-bugs/qeR_1VM0UvU/1zcLE0Y4BAAJ .

Regarding the "INFO: task hung in wb_shutdown (2)" report mentioned above, I
can't find a relation with the issue I'm seeing. I observe the livelock at
'balance_dirty_pages' in a different task and after 'wb_shutdown' has already
finished.

> Also, Fabiano's attempt might be somewhat related to a NULL pointer race condition
> at https://groups.google.com/forum/#!msg/syzkaller-bugs/bVx4WExoTDw/G68X4kPcAQAJ .

About "general protection fault in wb_workfn", I see that 'bdi_unregister' both
clears the 'WB_registered' bit (via wb_shutdown) and sets bdi->dev = NULL:

void bdi_unregister(struct backing_dev_info *bdi)
{
        /* make sure nobody finds us on the bdi_list anymore */
        bdi_remove_from_list(bdi);
->      wb_shutdown(&bdi->wb);
        cgwb_bdi_unregister(bdi);

        if (bdi->dev) {
                bdi_debug_unregister(bdi);
                device_unregister(bdi->dev);
->              bdi->dev = NULL;
        }
(...)
}


so I believe Jan's patch makes use of that correlation to avoid executing
'wb_workfn' when bdi->dev == NULL thus avoiding the null access:

(from "5acda9d bdi: avoid oops on device removal")

static void wb_wakeup(struct bdi_writeback *wb)
{
        spin_lock_bh(&wb->work_lock);
        if (test_bit(WB_registered, &wb->state))
                mod_delayed_work(bdi_wq, &wb->dwork, 0); <-- wb_workfn
        spin_unlock_bh(&wb->work_lock);
}


Therefore looking at my issue and the syzcaller report, I thinking maybe we shouldn't
avoid running 'wb_workfn' after a device removal but instead fix bdi->dev ==
NULL somewhere else (this is what my code snippet does, as a proof of concept).

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write call hangs in kernel space after virtio hot-remove
  2018-04-25 20:07 write call hangs in kernel space after virtio hot-remove Fabiano Rosas
  2018-04-26 18:50 ` Tejun Heo
@ 2018-05-03 14:42 ` Jan Kara
  2018-05-03 16:05   ` Jeff Layton
  1 sibling, 1 reply; 8+ messages in thread
From: Jan Kara @ 2018-05-03 14:42 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: linux-block, linux-fsdevel, jack, tj, Jeff Layton

On Wed 25-04-18 17:07:48, Fabiano Rosas wrote:
> I'm looking into an issue where removing a virtio disk via sysfs while another
> process is issuing write() calls results in the writing task going into a
> livelock:
> 
> 
> root@guest # cat test.sh
> #!/bin/bash
> 
> dd if=/dev/zero of=/dev/vda bs=1M count=10000 &
> sleep 1
> echo 1 > /sys/bus/pci/devices/0000:00:04.0/remove
> 
> root@guest # ls /dev/vd*
> /dev/vda
> 
> root@guest # grep Dirty /proc/meminfo
> Dirty:                 0 kB
> 
> root@guest # sh test.sh
> 
> root@guest # ps aux | grep "[d]d if"
> root      1699 38.6  0.0 111424  1216 hvc0     D+   10:48   0:01 dd if=/dev/zero of=/dev/vda bs=1M count=10000
> 
> root@guest # ls /dev/vd*
> ls: cannot access /dev/vd*: No such file or directory
> 
> root@guest # cat /proc/1699/stack
> [<0>] 0xc0000000ffe28218
> [<0>] __switch_to+0x31c/0x480
> [<0>] balance_dirty_pages+0x990/0xb90
> [<0>] balance_dirty_pages_ratelimited+0x50c/0x6c0
> [<0>] generic_perform_write+0x1b0/0x260
> [<0>] __generic_file_write_iter+0x200/0x240
> [<0>] blkdev_write_iter+0xa4/0x150
> [<0>] __vfs_write+0x14c/0x240
> [<0>] vfs_write+0xd0/0x240
> [<0>] ksys_write+0x6c/0x110
> [<0>] system_call+0x58/0x6c
> 
> root@guest # grep Dirty /proc/meminfo
> Dirty:           1506816 kB
> 
> ---
> 
> I have done some tracing and I believe this is caused by the clearing of
> 'WB_registered' in 'wb_shutdown':
> 
> <snip>
> sh-1697  [000] ....  3994.541664: sysfs_remove_link <-del_gendisk
> sh-1697  [000] ....  3994.541671: wb_shutdown <-bdi_unregister
> <snip>
> 
> Later, when 'balance_dirty_pages' tries to start writeback, it doesn't happen
> because 'WB_registered' is not set:
> 
> fs/fs-writeback.c
> 
> static void wb_wakeup(struct bdi_writeback *wb)
> {
>         spin_lock_bh(&wb->work_lock);
>         if (test_bit(WB_registered, &wb->state))
>                 mod_delayed_work(bdi_wq, &wb->dwork, 0);
>         spin_unlock_bh(&wb->work_lock);
> }
> 
> So we get stuck in a loop in 'balance_dirty_pages':
> 
> root@guest # cat /sys/kernel/debug/tracing/set_ftrace_filter
> balance_dirty_pages_ratelimited
> balance_dirty_pages
> wb_wakeup
> wb_workfn
> io_schedule_timeout
> 
> <snip>
> dd-1699  [000] .... 11192.535946: wb_wakeup <-balance_dirty_pages
> dd-1699  [000] .... 11192.535950: io_schedule_timeout <-balance_dirty_pages
> dd-1699  [000] .... 11192.745968: wb_wakeup <-balance_dirty_pages
> dd-1699  [000] .... 11192.745972: io_schedule_timeout <-balance_dirty_pages
> <snip>
> 
> 
> The test on 'WB_registered' before starting the writeback task is introduced
> by: "5acda9 bdi: avoid oops on device removal".
> 
> I have made a *naive* attempt at fixing it by allowing writeback to happen even
> without 'WB_registered':
> 
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d4d04fe..050b067 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -982,7 +982,7 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
>  	 * writeback as soon as there is no other work to do.
>  	 */
>  	trace_writeback_wake_background(wb);
> -	wb_wakeup(wb);
> +	mod_delayed_work(bdi_wq, &wb->dwork, 0);
>  }
>  
>  /*
> @@ -1933,7 +1933,7 @@ void wb_workfn(struct work_struct *work)
>  						struct bdi_writeback, dwork);
>  	long pages_written;
>  
> -	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
> +	set_worker_desc("flush-%s", wb->bdi->dev ? dev_name(wb->bdi->dev) : "?" );
>  	current->flags |= PF_SWAPWRITE;
>  
>  	if (likely(!current_is_workqueue_rescuer() ||
> --
> 
> 
> The effect of that is that the 'dd' process now finishes successfully and we
> get "Buffer I/O error"s for the dirty pages that remain. I believe this to be
> in conformance with existing interfaces since dd does not issue any fsync()
> calls.
> 
> 
> Does my analysis make any sense and would something along these lines be
> acceptable as a solution?

Thanks for the debugging of the problem. I agree with your analysis however
I don't like your fix. The issue is that when bdi is unregistered we don't
really expect any writeback to happen after that moment. This is what
prevents various use-after-free issues and I'd like that to stay the way it
is.

What I think we should do is that we'll prevent dirtying of new pages when
we know the underlying device is gone. Because that will fix your problem
and also make sure user sees the IO errors directly instead of just in the
kernel log. The question is how to make this happen in the least painful
way. I think we could intercept writes in grab_cache_page_write_begin()
(which however requires that function to return a proper error code and not
just NULL / non-NULL). And we should also intercept write faults to not
allow page dirtying via mmap - probably somewhere in do_shared_fault() and
do_wp_page(). I've added Jeff to CC since he's dealing with IO error
handling a lot these days. Jeff, what do you think?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write call hangs in kernel space after virtio hot-remove
  2018-05-03 14:42 ` Jan Kara
@ 2018-05-03 16:05   ` Jeff Layton
  2018-05-03 17:48     ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2018-05-03 16:05 UTC (permalink / raw)
  To: Jan Kara, Fabiano Rosas; +Cc: linux-block, linux-fsdevel, tj, Matthew Wilcox

On Thu, 2018-05-03 at 16:42 +0200, Jan Kara wrote:
> On Wed 25-04-18 17:07:48, Fabiano Rosas wrote:
> > I'm looking into an issue where removing a virtio disk via sysfs while another
> > process is issuing write() calls results in the writing task going into a
> > livelock:
> > 
> > 
> > root@guest # cat test.sh
> > #!/bin/bash
> > 
> > dd if=/dev/zero of=/dev/vda bs=1M count=10000 &
> > sleep 1
> > echo 1 > /sys/bus/pci/devices/0000:00:04.0/remove
> > 
> > root@guest # ls /dev/vd*
> > /dev/vda
> > 
> > root@guest # grep Dirty /proc/meminfo
> > Dirty:                 0 kB
> > 
> > root@guest # sh test.sh
> > 
> > root@guest # ps aux | grep "[d]d if"
> > root      1699 38.6  0.0 111424  1216 hvc0     D+   10:48   0:01 dd if=/dev/zero of=/dev/vda bs=1M count=10000
> > 
> > root@guest # ls /dev/vd*
> > ls: cannot access /dev/vd*: No such file or directory
> > 
> > root@guest # cat /proc/1699/stack
> > [<0>] 0xc0000000ffe28218
> > [<0>] __switch_to+0x31c/0x480
> > [<0>] balance_dirty_pages+0x990/0xb90
> > [<0>] balance_dirty_pages_ratelimited+0x50c/0x6c0
> > [<0>] generic_perform_write+0x1b0/0x260
> > [<0>] __generic_file_write_iter+0x200/0x240
> > [<0>] blkdev_write_iter+0xa4/0x150
> > [<0>] __vfs_write+0x14c/0x240
> > [<0>] vfs_write+0xd0/0x240
> > [<0>] ksys_write+0x6c/0x110
> > [<0>] system_call+0x58/0x6c
> > 
> > root@guest # grep Dirty /proc/meminfo
> > Dirty:           1506816 kB
> > 
> > ---
> > 
> > I have done some tracing and I believe this is caused by the clearing of
> > 'WB_registered' in 'wb_shutdown':
> > 
> > <snip>
> > sh-1697  [000] ....  3994.541664: sysfs_remove_link <-del_gendisk
> > sh-1697  [000] ....  3994.541671: wb_shutdown <-bdi_unregister
> > <snip>
> > 
> > Later, when 'balance_dirty_pages' tries to start writeback, it doesn't happen
> > because 'WB_registered' is not set:
> > 
> > fs/fs-writeback.c
> > 
> > static void wb_wakeup(struct bdi_writeback *wb)
> > {
> >         spin_lock_bh(&wb->work_lock);
> >         if (test_bit(WB_registered, &wb->state))
> >                 mod_delayed_work(bdi_wq, &wb->dwork, 0);
> >         spin_unlock_bh(&wb->work_lock);
> > }
> > 
> > So we get stuck in a loop in 'balance_dirty_pages':
> > 
> > root@guest # cat /sys/kernel/debug/tracing/set_ftrace_filter
> > balance_dirty_pages_ratelimited
> > balance_dirty_pages
> > wb_wakeup
> > wb_workfn
> > io_schedule_timeout
> > 
> > <snip>
> > dd-1699  [000] .... 11192.535946: wb_wakeup <-balance_dirty_pages
> > dd-1699  [000] .... 11192.535950: io_schedule_timeout <-balance_dirty_pages
> > dd-1699  [000] .... 11192.745968: wb_wakeup <-balance_dirty_pages
> > dd-1699  [000] .... 11192.745972: io_schedule_timeout <-balance_dirty_pages
> > <snip>
> > 
> > 
> > The test on 'WB_registered' before starting the writeback task is introduced
> > by: "5acda9 bdi: avoid oops on device removal".
> > 
> > I have made a *naive* attempt at fixing it by allowing writeback to happen even
> > without 'WB_registered':
> > 
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index d4d04fe..050b067 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -982,7 +982,7 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
> >  	 * writeback as soon as there is no other work to do.
> >  	 */
> >  	trace_writeback_wake_background(wb);
> > -	wb_wakeup(wb);
> > +	mod_delayed_work(bdi_wq, &wb->dwork, 0);
> >  }
> >  
> >  /*
> > @@ -1933,7 +1933,7 @@ void wb_workfn(struct work_struct *work)
> >  						struct bdi_writeback, dwork);
> >  	long pages_written;
> >  
> > -	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
> > +	set_worker_desc("flush-%s", wb->bdi->dev ? dev_name(wb->bdi->dev) : "?" );
> >  	current->flags |= PF_SWAPWRITE;
> >  
> >  	if (likely(!current_is_workqueue_rescuer() ||
> > --
> > 
> > 
> > The effect of that is that the 'dd' process now finishes successfully and we
> > get "Buffer I/O error"s for the dirty pages that remain. I believe this to be
> > in conformance with existing interfaces since dd does not issue any fsync()
> > calls.
> > 
> > 
> > Does my analysis make any sense and would something along these lines be
> > acceptable as a solution?
> 
> Thanks for the debugging of the problem. I agree with your analysis however
> I don't like your fix. The issue is that when bdi is unregistered we don't
> really expect any writeback to happen after that moment. This is what
> prevents various use-after-free issues and I'd like that to stay the way it
> is.
> 
> What I think we should do is that we'll prevent dirtying of new pages when
> we know the underlying device is gone. Because that will fix your problem
> and also make sure user sees the IO errors directly instead of just in the
> kernel log. The question is how to make this happen in the least painful
> way. I think we could intercept writes in grab_cache_page_write_begin()
> (which however requires that function to return a proper error code and not
> just NULL / non-NULL). And we should also intercept write faults to not
> allow page dirtying via mmap - probably somewhere in do_shared_fault() and
> do_wp_page(). I've added Jeff to CC since he's dealing with IO error
> handling a lot these days. Jeff, what do you think?
> 
> 								Honza

(cc'ing Willy too since he's given this more thought than me)

For the record, I've mostly been looking at error _reporting_. Handling
errors at this level is not something I've really considered in great
detail as of yet.

Still, I think the basic idea sounds reasonable. Not allowing pages to
be dirtied when we can't clean them seems like a reasonable thing to
do.

The big question is how we'll report this to userland:

Would your approach have it return an error on write() and such? What
sort of error if so? ENODEV? Would we have to SIGBUS when someone tries
to dirty the page through mmap?
--
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write call hangs in kernel space after virtio hot-remove
  2018-05-03 16:05   ` Jeff Layton
@ 2018-05-03 17:48     ` Matthew Wilcox
  2018-05-09 14:45       ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2018-05-03 17:48 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Jan Kara, Fabiano Rosas, linux-block, linux-fsdevel, tj

On Thu, May 03, 2018 at 12:05:14PM -0400, Jeff Layton wrote:
> On Thu, 2018-05-03 at 16:42 +0200, Jan Kara wrote:
> > On Wed 25-04-18 17:07:48, Fabiano Rosas wrote:
> > > I'm looking into an issue where removing a virtio disk via sysfs while another
> > > process is issuing write() calls results in the writing task going into a
> > > livelock:
>
> > Thanks for the debugging of the problem. I agree with your analysis however
> > I don't like your fix. The issue is that when bdi is unregistered we don't
> > really expect any writeback to happen after that moment. This is what
> > prevents various use-after-free issues and I'd like that to stay the way it
> > is.
> > 
> > What I think we should do is that we'll prevent dirtying of new pages when
> > we know the underlying device is gone. Because that will fix your problem
> > and also make sure user sees the IO errors directly instead of just in the
> > kernel log. The question is how to make this happen in the least painful
> > way. I think we could intercept writes in grab_cache_page_write_begin()
> > (which however requires that function to return a proper error code and not
> > just NULL / non-NULL). And we should also intercept write faults to not
> > allow page dirtying via mmap - probably somewhere in do_shared_fault() and
> > do_wp_page(). I've added Jeff to CC since he's dealing with IO error
> > handling a lot these days. Jeff, what do you think?
> 
> (cc'ing Willy too since he's given this more thought than me)
> 
> For the record, I've mostly been looking at error _reporting_. Handling
> errors at this level is not something I've really considered in great
> detail as of yet.
> 
> Still, I think the basic idea sounds reasonable. Not allowing pages to
> be dirtied when we can't clean them seems like a reasonable thing to
> do.
> 
> The big question is how we'll report this to userland:
> 
> Would your approach have it return an error on write() and such? What
> sort of error if so? ENODEV? Would we have to SIGBUS when someone tries
> to dirty the page through mmap?

I have been having some thoughts in this direction.  They are perhaps
a little more long-term than this particular bug, so they may not be
relevant to the immediate fix.

I want to separate removing hardware from tearing down the block device
that represents something on those hardware devices.  That allows for
better handling of intermittent transport failures, or accidental device
removal, followed by a speedy insert.

In that happy future, the bug described here wouldn't be getting an
-EIO, it'd be getting an -ENODEV because we've decided this device is
permanently gone.  I think that should indeed be a SIGBUS on mapped
writes.

Looking at the current code, filemap_fault() will return VM_FAULT_SIGBUS
already if page_cache_read() returns any error other than -ENOMEM, so
that's fine.  We probably want some code (and this is where I reach the
edge of my knowledge about the current page cache) to rip all the pages
out of the page cache for every file on an -ENODEV filesystem.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write call hangs in kernel space after virtio hot-remove
  2018-05-03 17:48     ` Matthew Wilcox
@ 2018-05-09 14:45       ` Jan Kara
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2018-05-09 14:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Layton, Jan Kara, Fabiano Rosas, linux-block, linux-fsdevel, tj

On Thu 03-05-18 10:48:20, Matthew Wilcox wrote:
> On Thu, May 03, 2018 at 12:05:14PM -0400, Jeff Layton wrote:
> > On Thu, 2018-05-03 at 16:42 +0200, Jan Kara wrote:
> > > On Wed 25-04-18 17:07:48, Fabiano Rosas wrote:
> > > > I'm looking into an issue where removing a virtio disk via sysfs while another
> > > > process is issuing write() calls results in the writing task going into a
> > > > livelock:
> >
> > > Thanks for the debugging of the problem. I agree with your analysis however
> > > I don't like your fix. The issue is that when bdi is unregistered we don't
> > > really expect any writeback to happen after that moment. This is what
> > > prevents various use-after-free issues and I'd like that to stay the way it
> > > is.
> > > 
> > > What I think we should do is that we'll prevent dirtying of new pages when
> > > we know the underlying device is gone. Because that will fix your problem
> > > and also make sure user sees the IO errors directly instead of just in the
> > > kernel log. The question is how to make this happen in the least painful
> > > way. I think we could intercept writes in grab_cache_page_write_begin()
> > > (which however requires that function to return a proper error code and not
> > > just NULL / non-NULL). And we should also intercept write faults to not
> > > allow page dirtying via mmap - probably somewhere in do_shared_fault() and
> > > do_wp_page(). I've added Jeff to CC since he's dealing with IO error
> > > handling a lot these days. Jeff, what do you think?
> > 
> > (cc'ing Willy too since he's given this more thought than me)
> > 
> > For the record, I've mostly been looking at error _reporting_. Handling
> > errors at this level is not something I've really considered in great
> > detail as of yet.
> > 
> > Still, I think the basic idea sounds reasonable. Not allowing pages to
> > be dirtied when we can't clean them seems like a reasonable thing to
> > do.
> > 
> > The big question is how we'll report this to userland:
> > 
> > Would your approach have it return an error on write() and such? What
> > sort of error if so? ENODEV? Would we have to SIGBUS when someone tries
> > to dirty the page through mmap?
> 
> I have been having some thoughts in this direction.  They are perhaps
> a little more long-term than this particular bug, so they may not be
> relevant to the immediate fix.
> 
> I want to separate removing hardware from tearing down the block device
> that represents something on those hardware devices.  That allows for
> better handling of intermittent transport failures, or accidental device
> removal, followed by a speedy insert.
> 
> In that happy future, the bug described here wouldn't be getting an
> -EIO, it'd be getting an -ENODEV because we've decided this device is
> permanently gone.  I think that should indeed be a SIGBUS on mapped
> writes.
> 
> Looking at the current code, filemap_fault() will return VM_FAULT_SIGBUS
> already if page_cache_read() returns any error other than -ENOMEM, so
> that's fine.  We probably want some code (and this is where I reach the
> edge of my knowledge about the current page cache) to rip all the pages
> out of the page cache for every file on an -ENODEV filesystem.

Note that that already (partially) happens through:

del_gendisk() -> invalidate_partition() -> __invalidate_device() ->
  invalidate_inodes()
  invalidate_bdev()

The slight trouble is this currently won't do anything for open files (as
such inodes will have elevated refcount) so that needs fixing. And then
we need to prevent new dirty pages from being created...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-05-09 14:45 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-25 20:07 write call hangs in kernel space after virtio hot-remove Fabiano Rosas
2018-04-26 18:50 ` Tejun Heo
2018-04-26 21:10   ` Tetsuo Handa
2018-04-27 20:10     ` Fabiano Rosas
2018-05-03 14:42 ` Jan Kara
2018-05-03 16:05   ` Jeff Layton
2018-05-03 17:48     ` Matthew Wilcox
2018-05-09 14:45       ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).