Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] direct-io: use GFP_NOIO to avoid deadlock
@ 2019-08-08  9:50 Mikulas Patocka
  2019-08-08 13:53 ` Matthew Wilcox
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Mikulas Patocka @ 2019-08-08  9:50 UTC (permalink / raw)
  To: Alexander Viro, Darrick J. Wong, Mike Snitzer, junxiao.bi
  Cc: dm-devel, Alasdair Kergon, honglei.wang, linux-kernel,
	linux-fsdevel, linux-xfs

A deadlock with this stacktrace was observed.

The obvious problem here is that in the call chain 
xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc 
we do a GFP_KERNEL allocation while we are in a filesystem driver and in a 
block device driver.

This patch changes the direct-io code to use GFP_NOIO.

PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
   #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
   #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
   #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
   #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
   #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
   #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
   #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
   #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
   #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
   #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
  #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
  #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
  #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
  #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
  #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242

  PID: 14127  TASK: ffff881455749c00  CPU: 11  COMMAND: "loop1"
   #0 [ffff88272f5af228] __schedule at ffffffff8173f405
   #1 [ffff88272f5af280] schedule at ffffffff8173fa27
   #2 [ffff88272f5af2a0] schedule_preempt_disabled at ffffffff8173fd5e
   #3 [ffff88272f5af2b0] __mutex_lock_slowpath at ffffffff81741fb5
   #4 [ffff88272f5af330] mutex_lock at ffffffff81742133
   #5 [ffff88272f5af350] dm_bufio_shrink_count at ffffffffa03865f9 [dm_bufio]
   #6 [ffff88272f5af380] shrink_slab at ffffffff811a86bd
   #7 [ffff88272f5af470] shrink_zone at ffffffff811ad778
   #8 [ffff88272f5af500] do_try_to_free_pages at ffffffff811adb34
   #9 [ffff88272f5af590] try_to_free_pages at ffffffff811adef8
  #10 [ffff88272f5af610] __alloc_pages_nodemask at ffffffff811a09c3
  #11 [ffff88272f5af710] alloc_pages_current at ffffffff811e8b71
  #12 [ffff88272f5af760] new_slab at ffffffff811f4523
  #13 [ffff88272f5af7b0] __slab_alloc at ffffffff8173a1b5
  #14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b
  #15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3
  #16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3
  #17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs]
  #18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994
  #19 [ffff88272f5afc50] __dta_xfs_file_read_iter_2398 at ffffffffa020c970 [xfs]
  #20 [ffff88272f5afcc0] lo_rw_aio at ffffffffa0377042 [loop]
  #21 [ffff88272f5afd70] loop_queue_work at ffffffffa0377c3b [loop]
  #22 [ffff88272f5afe60] kthread_worker_fn at ffffffff810a8a0c
  #23 [ffff88272f5afec0] kthread at ffffffff810a8428
  #24 [ffff88272f5aff50] ret_from_fork at ffffffff81745242

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 fs/direct-io.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2019-08-02 08:51:56.000000000 +0200
+++ linux-2.6/fs/direct-io.c	2019-08-08 11:22:18.000000000 +0200
@@ -436,7 +436,7 @@ dio_bio_alloc(struct dio *dio, struct di
 	 * bio_alloc() is guaranteed to return a bio when allowed to sleep and
 	 * we request a valid number of vectors.
 	 */
-	bio = bio_alloc(GFP_KERNEL, nr_vecs);
+	bio = bio_alloc(GFP_NOIO, nr_vecs);
 
 	bio_set_dev(bio, bdev);
 	bio->bi_iter.bi_sector = first_sector;
@@ -1197,7 +1197,7 @@ do_blockdev_direct_IO(struct kiocb *iocb
 	if (iov_iter_rw(iter) == READ && !count)
 		return 0;
 
-	dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
+	dio = kmem_cache_alloc(dio_cache, GFP_NOIO);
 	retval = -ENOMEM;
 	if (!dio)
 		goto out;

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock
  2019-08-08  9:50 [PATCH] direct-io: use GFP_NOIO to avoid deadlock Mikulas Patocka
@ 2019-08-08 13:53 ` Matthew Wilcox
  2019-08-08 15:13   ` Mikulas Patocka
  2019-08-08 15:17   ` [PATCH] loop: set PF_MEMALLOC_NOIO for the worker thread Mikulas Patocka
  2019-08-08 14:39 ` [PATCH] direct-io: use GFP_NOIO to avoid deadlock Junxiao Bi
  2019-08-09  1:34 ` Dave Chinner
  2 siblings, 2 replies; 11+ messages in thread
From: Matthew Wilcox @ 2019-08-08 13:53 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alexander Viro, Darrick J. Wong, Mike Snitzer, junxiao.bi,
	dm-devel, Alasdair Kergon, honglei.wang, linux-kernel,
	linux-fsdevel, linux-xfs

On Thu, Aug 08, 2019 at 05:50:10AM -0400, Mikulas Patocka wrote:
> A deadlock with this stacktrace was observed.
> 
> The obvious problem here is that in the call chain 
> xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc 
> we do a GFP_KERNEL allocation while we are in a filesystem driver and in a 
> block device driver.

But that's not the problem.  The problem is the loop driver calls into the
filesystem without calling memalloc_noio_save() / memalloc_noio_restore().
There are dozens of places in XFS which use GFP_KERNEL allocations and
all can trigger this same problem if called from the loop driver.

>   #14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b
>   #15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3
>   #16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3
>   #17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs]
>   #18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994
>   #19 [ffff88272f5afc50] __dta_xfs_file_read_iter_2398 at ffffffffa020c970 [xfs]
>   #20 [ffff88272f5afcc0] lo_rw_aio at ffffffffa0377042 [loop]
>   #21 [ffff88272f5afd70] loop_queue_work at ffffffffa0377c3b [loop]
>   #22 [ffff88272f5afe60] kthread_worker_fn at ffffffff810a8a0c
>   #23 [ffff88272f5afec0] kthread at ffffffff810a8428
>   #24 [ffff88272f5aff50] ret_from_fork at ffffffff81745242

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock
  2019-08-08  9:50 [PATCH] direct-io: use GFP_NOIO to avoid deadlock Mikulas Patocka
  2019-08-08 13:53 ` Matthew Wilcox
@ 2019-08-08 14:39 ` Junxiao Bi
  2019-08-09  1:34 ` Dave Chinner
  2 siblings, 0 replies; 11+ messages in thread
From: Junxiao Bi @ 2019-08-08 14:39 UTC (permalink / raw)
  To: Mikulas Patocka, Alexander Viro, Darrick J. Wong, Mike Snitzer
  Cc: dm-devel, Alasdair Kergon, honglei.wang, linux-kernel,
	linux-fsdevel, linux-xfs

Hi Mikulas,

This seemed not issue on mainline, mutex in dm_bufio_shrink_count() had 
been removed from mainline.

Thanks,

Junxiao.

On 8/8/19 2:50 AM, Mikulas Patocka wrote:
> A deadlock with this stacktrace was observed.
>
> The obvious problem here is that in the call chain
> xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc
> we do a GFP_KERNEL allocation while we are in a filesystem driver and in a
> block device driver.
>
> This patch changes the direct-io code to use GFP_NOIO.
>
> PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
>     #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
>     #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
>     #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
>     #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
>     #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
>     #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
>     #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
>     #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
>     #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
>     #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
>    #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
>    #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
>    #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
>    #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
>    #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242
>
>    PID: 14127  TASK: ffff881455749c00  CPU: 11  COMMAND: "loop1"
>     #0 [ffff88272f5af228] __schedule at ffffffff8173f405
>     #1 [ffff88272f5af280] schedule at ffffffff8173fa27
>     #2 [ffff88272f5af2a0] schedule_preempt_disabled at ffffffff8173fd5e
>     #3 [ffff88272f5af2b0] __mutex_lock_slowpath at ffffffff81741fb5
>     #4 [ffff88272f5af330] mutex_lock at ffffffff81742133
>     #5 [ffff88272f5af350] dm_bufio_shrink_count at ffffffffa03865f9 [dm_bufio]
>     #6 [ffff88272f5af380] shrink_slab at ffffffff811a86bd
>     #7 [ffff88272f5af470] shrink_zone at ffffffff811ad778
>     #8 [ffff88272f5af500] do_try_to_free_pages at ffffffff811adb34
>     #9 [ffff88272f5af590] try_to_free_pages at ffffffff811adef8
>    #10 [ffff88272f5af610] __alloc_pages_nodemask at ffffffff811a09c3
>    #11 [ffff88272f5af710] alloc_pages_current at ffffffff811e8b71
>    #12 [ffff88272f5af760] new_slab at ffffffff811f4523
>    #13 [ffff88272f5af7b0] __slab_alloc at ffffffff8173a1b5
>    #14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b
>    #15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3
>    #16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3
>    #17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs]
>    #18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994
>    #19 [ffff88272f5afc50] __dta_xfs_file_read_iter_2398 at ffffffffa020c970 [xfs]
>    #20 [ffff88272f5afcc0] lo_rw_aio at ffffffffa0377042 [loop]
>    #21 [ffff88272f5afd70] loop_queue_work at ffffffffa0377c3b [loop]
>    #22 [ffff88272f5afe60] kthread_worker_fn at ffffffff810a8a0c
>    #23 [ffff88272f5afec0] kthread at ffffffff810a8428
>    #24 [ffff88272f5aff50] ret_from_fork at ffffffff81745242
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
>
> ---
>   fs/direct-io.c |    4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/fs/direct-io.c
> ===================================================================
> --- linux-2.6.orig/fs/direct-io.c	2019-08-02 08:51:56.000000000 +0200
> +++ linux-2.6/fs/direct-io.c	2019-08-08 11:22:18.000000000 +0200
> @@ -436,7 +436,7 @@ dio_bio_alloc(struct dio *dio, struct di
>   	 * bio_alloc() is guaranteed to return a bio when allowed to sleep and
>   	 * we request a valid number of vectors.
>   	 */
> -	bio = bio_alloc(GFP_KERNEL, nr_vecs);
> +	bio = bio_alloc(GFP_NOIO, nr_vecs);
>   
>   	bio_set_dev(bio, bdev);
>   	bio->bi_iter.bi_sector = first_sector;
> @@ -1197,7 +1197,7 @@ do_blockdev_direct_IO(struct kiocb *iocb
>   	if (iov_iter_rw(iter) == READ && !count)
>   		return 0;
>   
> -	dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
> +	dio = kmem_cache_alloc(dio_cache, GFP_NOIO);
>   	retval = -ENOMEM;
>   	if (!dio)
>   		goto out;

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock
  2019-08-08 13:53 ` Matthew Wilcox
@ 2019-08-08 15:13   ` Mikulas Patocka
  2019-08-08 15:17   ` [PATCH] loop: set PF_MEMALLOC_NOIO for the worker thread Mikulas Patocka
  1 sibling, 0 replies; 11+ messages in thread
From: Mikulas Patocka @ 2019-08-08 15:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alexander Viro, Darrick J. Wong, Mike Snitzer, junxiao.bi,
	dm-devel, Alasdair Kergon, honglei.wang, linux-kernel,
	linux-fsdevel, linux-xfs



On Thu, 8 Aug 2019, Matthew Wilcox wrote:

> On Thu, Aug 08, 2019 at 05:50:10AM -0400, Mikulas Patocka wrote:
> > A deadlock with this stacktrace was observed.
> > 
> > The obvious problem here is that in the call chain 
> > xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc 
> > we do a GFP_KERNEL allocation while we are in a filesystem driver and in a 
> > block device driver.
> 
> But that's not the problem.  The problem is the loop driver calls into the
> filesystem without calling memalloc_noio_save() / memalloc_noio_restore().
> There are dozens of places in XFS which use GFP_KERNEL allocations and
> all can trigger this same problem if called from the loop driver.

OK. I'll send a new patch that sets PF_MEMALLOC_NOIO in the loop driver.

Mikulas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] loop: set PF_MEMALLOC_NOIO for the worker thread
  2019-08-08 13:53 ` Matthew Wilcox
  2019-08-08 15:13   ` Mikulas Patocka
@ 2019-08-08 15:17   ` Mikulas Patocka
  2019-08-08 16:12     ` Jens Axboe
  1 sibling, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2019-08-08 15:17 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox
  Cc: Alexander Viro, Darrick J. Wong, Mike Snitzer, junxiao.bi,
	dm-devel, Alasdair Kergon, honglei.wang, linux-kernel,
	linux-fsdevel, linux-xfs, linux-block

A deadlock with this stacktrace was observed.

The loop thread does a GFP_KERNEL allocation, it calls into dm-bufio
shrinker and the shrinker depends on I/O completion in the dm-bufio
subsystem.

In order to fix the deadlock (and other similar ones), we set the flag
PF_MEMALLOC_NOIO at loop thread entry.

PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
   #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
   #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
   #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
   #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
   #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
   #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
   #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
   #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
   #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
   #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
  #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
  #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
  #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
  #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
  #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242

  PID: 14127  TASK: ffff881455749c00  CPU: 11  COMMAND: "loop1"
   #0 [ffff88272f5af228] __schedule at ffffffff8173f405
   #1 [ffff88272f5af280] schedule at ffffffff8173fa27
   #2 [ffff88272f5af2a0] schedule_preempt_disabled at ffffffff8173fd5e
   #3 [ffff88272f5af2b0] __mutex_lock_slowpath at ffffffff81741fb5
   #4 [ffff88272f5af330] mutex_lock at ffffffff81742133
   #5 [ffff88272f5af350] dm_bufio_shrink_count at ffffffffa03865f9 [dm_bufio]
   #6 [ffff88272f5af380] shrink_slab at ffffffff811a86bd
   #7 [ffff88272f5af470] shrink_zone at ffffffff811ad778
   #8 [ffff88272f5af500] do_try_to_free_pages at ffffffff811adb34
   #9 [ffff88272f5af590] try_to_free_pages at ffffffff811adef8
  #10 [ffff88272f5af610] __alloc_pages_nodemask at ffffffff811a09c3
  #11 [ffff88272f5af710] alloc_pages_current at ffffffff811e8b71
  #12 [ffff88272f5af760] new_slab at ffffffff811f4523
  #13 [ffff88272f5af7b0] __slab_alloc at ffffffff8173a1b5
  #14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b
  #15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3
  #16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3
  #17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs]
  #18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994
  #19 [ffff88272f5afc50] __dta_xfs_file_read_iter_2398 at ffffffffa020c970 [xfs]
  #20 [ffff88272f5afcc0] lo_rw_aio at ffffffffa0377042 [loop]
  #21 [ffff88272f5afd70] loop_queue_work at ffffffffa0377c3b [loop]
  #22 [ffff88272f5afe60] kthread_worker_fn at ffffffff810a8a0c
  #23 [ffff88272f5afec0] kthread at ffffffff810a8428
  #24 [ffff88272f5aff50] ret_from_fork at ffffffff81745242

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 drivers/block/loop.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/drivers/block/loop.c
===================================================================
--- linux-2.6.orig/drivers/block/loop.c	2019-08-08 17:02:50.000000000 +0200
+++ linux-2.6/drivers/block/loop.c	2019-08-08 17:08:14.000000000 +0200
@@ -885,7 +885,7 @@ static void loop_unprepare_queue(struct
 
 static int loop_kthread_worker_fn(void *worker_ptr)
 {
-	current->flags |= PF_LESS_THROTTLE;
+	current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
 	return kthread_worker_fn(worker_ptr);
 }
 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] loop: set PF_MEMALLOC_NOIO for the worker thread
  2019-08-08 15:17   ` [PATCH] loop: set PF_MEMALLOC_NOIO for the worker thread Mikulas Patocka
@ 2019-08-08 16:12     ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2019-08-08 16:12 UTC (permalink / raw)
  To: Mikulas Patocka, Matthew Wilcox
  Cc: Alexander Viro, Darrick J. Wong, Mike Snitzer, junxiao.bi,
	dm-devel, Alasdair Kergon, honglei.wang, linux-kernel,
	linux-fsdevel, linux-xfs, linux-block

On 8/8/19 8:17 AM, Mikulas Patocka wrote:
> A deadlock with this stacktrace was observed.
> 
> The loop thread does a GFP_KERNEL allocation, it calls into dm-bufio
> shrinker and the shrinker depends on I/O completion in the dm-bufio
> subsystem.
> 
> In order to fix the deadlock (and other similar ones), we set the flag
> PF_MEMALLOC_NOIO at loop thread entry.
> 
> PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
>     #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
>     #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
>     #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
>     #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
>     #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
>     #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
>     #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
>     #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
>     #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
>     #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
>    #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
>    #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
>    #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
>    #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
>    #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242
> 
>    PID: 14127  TASK: ffff881455749c00  CPU: 11  COMMAND: "loop1"
>     #0 [ffff88272f5af228] __schedule at ffffffff8173f405
>     #1 [ffff88272f5af280] schedule at ffffffff8173fa27
>     #2 [ffff88272f5af2a0] schedule_preempt_disabled at ffffffff8173fd5e
>     #3 [ffff88272f5af2b0] __mutex_lock_slowpath at ffffffff81741fb5
>     #4 [ffff88272f5af330] mutex_lock at ffffffff81742133
>     #5 [ffff88272f5af350] dm_bufio_shrink_count at ffffffffa03865f9 [dm_bufio]
>     #6 [ffff88272f5af380] shrink_slab at ffffffff811a86bd
>     #7 [ffff88272f5af470] shrink_zone at ffffffff811ad778
>     #8 [ffff88272f5af500] do_try_to_free_pages at ffffffff811adb34
>     #9 [ffff88272f5af590] try_to_free_pages at ffffffff811adef8
>    #10 [ffff88272f5af610] __alloc_pages_nodemask at ffffffff811a09c3
>    #11 [ffff88272f5af710] alloc_pages_current at ffffffff811e8b71
>    #12 [ffff88272f5af760] new_slab at ffffffff811f4523
>    #13 [ffff88272f5af7b0] __slab_alloc at ffffffff8173a1b5
>    #14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b
>    #15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3
>    #16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3
>    #17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs]
>    #18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994
>    #19 [ffff88272f5afc50] __dta_xfs_file_read_iter_2398 at ffffffffa020c970 [xfs]
>    #20 [ffff88272f5afcc0] lo_rw_aio at ffffffffa0377042 [loop]
>    #21 [ffff88272f5afd70] loop_queue_work at ffffffffa0377c3b [loop]
>    #22 [ffff88272f5afe60] kthread_worker_fn at ffffffff810a8a0c
>    #23 [ffff88272f5afec0] kthread at ffffffff810a8428
>    #24 [ffff88272f5aff50] ret_from_fork at ffffffff81745242

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock
  2019-08-08  9:50 [PATCH] direct-io: use GFP_NOIO to avoid deadlock Mikulas Patocka
  2019-08-08 13:53 ` Matthew Wilcox
  2019-08-08 14:39 ` [PATCH] direct-io: use GFP_NOIO to avoid deadlock Junxiao Bi
@ 2019-08-09  1:34 ` Dave Chinner
  2019-08-09 11:30   ` Mikulas Patocka
  2 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2019-08-09  1:34 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alexander Viro, Darrick J. Wong, Mike Snitzer, junxiao.bi,
	dm-devel, Alasdair Kergon, honglei.wang, linux-kernel,
	linux-fsdevel, linux-xfs

On Thu, Aug 08, 2019 at 05:50:10AM -0400, Mikulas Patocka wrote:
> A deadlock with this stacktrace was observed.
> 
> The obvious problem here is that in the call chain 
> xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc 
> we do a GFP_KERNEL allocation while we are in a filesystem driver and in a 
> block device driver.
> 
> This patch changes the direct-io code to use GFP_NOIO.
> 
> PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
>    #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
>    #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
>    #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
>    #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
>    #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
>    #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
>    #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
>    #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
>    #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
>    #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
>   #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
>   #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
>   #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
>   #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
>   #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242
> 
>   PID: 14127  TASK: ffff881455749c00  CPU: 11  COMMAND: "loop1"
>    #0 [ffff88272f5af228] __schedule at ffffffff8173f405
>    #1 [ffff88272f5af280] schedule at ffffffff8173fa27
>    #2 [ffff88272f5af2a0] schedule_preempt_disabled at ffffffff8173fd5e
>    #3 [ffff88272f5af2b0] __mutex_lock_slowpath at ffffffff81741fb5
>    #4 [ffff88272f5af330] mutex_lock at ffffffff81742133
>    #5 [ffff88272f5af350] dm_bufio_shrink_count at ffffffffa03865f9 [dm_bufio]
>    #6 [ffff88272f5af380] shrink_slab at ffffffff811a86bd
>    #7 [ffff88272f5af470] shrink_zone at ffffffff811ad778
>    #8 [ffff88272f5af500] do_try_to_free_pages at ffffffff811adb34
>    #9 [ffff88272f5af590] try_to_free_pages at ffffffff811adef8
>   #10 [ffff88272f5af610] __alloc_pages_nodemask at ffffffff811a09c3
>   #11 [ffff88272f5af710] alloc_pages_current at ffffffff811e8b71
>   #12 [ffff88272f5af760] new_slab at ffffffff811f4523
>   #13 [ffff88272f5af7b0] __slab_alloc at ffffffff8173a1b5
>   #14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b
>   #15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3
>   #16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3
>   #17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs]
>   #18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994

Um, what kernel is this? XFS stopped using __blockdev_direct_IO some
time around 4.8 or 4.9, IIRC. Perhaps it would be best to reproduce
problems on a TOT kernel first?

And, FWIW, there's an argument to be made here that the underlying
bug is dm_bufio_shrink_scan() blocking kswapd by waiting on IO
completions while holding a mutex that other IO-level reclaim
contexts require to make progress.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock
  2019-08-09  1:34 ` Dave Chinner
@ 2019-08-09 11:30   ` Mikulas Patocka
  2019-08-09 21:57     ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2019-08-09 11:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Darrick J. Wong, Mike Snitzer, junxiao.bi,
	dm-devel, Alasdair Kergon, honglei.wang, linux-kernel,
	linux-fsdevel, linux-xfs



On Fri, 9 Aug 2019, Dave Chinner wrote:

> And, FWIW, there's an argument to be made here that the underlying
> bug is dm_bufio_shrink_scan() blocking kswapd by waiting on IO
> completions while holding a mutex that other IO-level reclaim
> contexts require to make progress.
> 
> Cheers,
> 
> Dave.

The IO-level reclaim contexts should use GFP_NOIO. If the dm-bufio 
shrinker is called with GFP_NOIO, it cannot be blocked by kswapd, because:
* it uses trylock to acquire the mutex
* it doesn't attempt to free buffers that are dirty or under I/O (see 
  __try_evict_buffer)

I think that this logic is sufficient to prevent the deadlock.

Mikulas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock
  2019-08-09 11:30   ` Mikulas Patocka
@ 2019-08-09 21:57     ` Dave Chinner
  2019-08-13 16:35       ` Mikulas Patocka
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2019-08-09 21:57 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alexander Viro, Darrick J. Wong, Mike Snitzer, junxiao.bi,
	dm-devel, Alasdair Kergon, honglei.wang, linux-kernel,
	linux-fsdevel, linux-xfs

On Fri, Aug 09, 2019 at 07:30:00AM -0400, Mikulas Patocka wrote:
> 
> 
> On Fri, 9 Aug 2019, Dave Chinner wrote:
> 
> > And, FWIW, there's an argument to be made here that the underlying
> > bug is dm_bufio_shrink_scan() blocking kswapd by waiting on IO
> > completions while holding a mutex that other IO-level reclaim
> > contexts require to make progress.
> > 
> > Cheers,
> > 
> > Dave.
> 
> The IO-level reclaim contexts should use GFP_NOIO. If the dm-bufio 
> shrinker is called with GFP_NOIO, it cannot be blocked by kswapd, because:

No, you misunderstand. I'm talking about blocking kswapd being
wrong.  i.e. Blocking kswapd in shrinkers causes problems
because th ememory reclaim code does not expect kswapd to be
arbitrarily delayed by waiting on IO. We've had this problem with
the XFS inode cache shrinker for years, and there are many reports
of extremely long reclaim latencies for both direct and kswapd
reclaim that result from kswapd not making progress while waiting
in shrinkers for IO to complete.

The work I'm currently doing to fix this XFS problem can be found
here:

https://lore.kernel.org/linux-fsdevel/20190801021752.4986-1-david@fromorbit.com/


i.e. the point I'm making is that waiting for IO in kswapd reclaim
context is considered harmful - kswapd context shrinker reclaim
should be as non-blocking as possible, and any back-off to wait for
IO to complete should be done by the high level reclaim core once
it's completed an entire reclaim scan cycle of everything....

What follows from that, and is pertinent for in this situation, is
that if you don't block kswapd, then other reclaim contexts are not
going to get stuck waiting for it regardless of the reclaim context
they use.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock
  2019-08-09 21:57     ` Dave Chinner
@ 2019-08-13 16:35       ` Mikulas Patocka
  2019-08-14 10:43         ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2019-08-13 16:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Darrick J. Wong, Mike Snitzer, junxiao.bi,
	dm-devel, Alasdair Kergon, honglei.wang, linux-kernel,
	linux-fsdevel, linux-xfs



On Sat, 10 Aug 2019, Dave Chinner wrote:

> No, you misunderstand. I'm talking about blocking kswapd being
> wrong.  i.e. Blocking kswapd in shrinkers causes problems
> because th ememory reclaim code does not expect kswapd to be
> arbitrarily delayed by waiting on IO. We've had this problem with
> the XFS inode cache shrinker for years, and there are many reports
> of extremely long reclaim latencies for both direct and kswapd
> reclaim that result from kswapd not making progress while waiting
> in shrinkers for IO to complete.
> 
> The work I'm currently doing to fix this XFS problem can be found
> here:
> 
> https://lore.kernel.org/linux-fsdevel/20190801021752.4986-1-david@fromorbit.com/
> 
> 
> i.e. the point I'm making is that waiting for IO in kswapd reclaim
> context is considered harmful - kswapd context shrinker reclaim
> should be as non-blocking as possible, and any back-off to wait for
> IO to complete should be done by the high level reclaim core once
> it's completed an entire reclaim scan cycle of everything....
> 
> What follows from that, and is pertinent for in this situation, is
> that if you don't block kswapd, then other reclaim contexts are not
> going to get stuck waiting for it regardless of the reclaim context
> they use.
> 
> Cheers,
> 
> Dave.

So, what do you think the dm-bufio shrinker should do?

Currently it tries to free buffers on the clean list and if there are not 
enough buffers on the clean list, it goes into the dirty list - it writes 
the buffers back and then frees them.

What should it do? Should it just start writeback of the dirty list 
without waiting for it? What should it do if all the buffers are under 
writeback?

Mikulas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock
  2019-08-13 16:35       ` Mikulas Patocka
@ 2019-08-14 10:43         ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2019-08-14 10:43 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alexander Viro, Darrick J. Wong, Mike Snitzer, junxiao.bi,
	dm-devel, Alasdair Kergon, honglei.wang, linux-kernel,
	linux-fsdevel, linux-xfs

On Tue, Aug 13, 2019 at 12:35:49PM -0400, Mikulas Patocka wrote:
> 
> 
> On Sat, 10 Aug 2019, Dave Chinner wrote:
> 
> > No, you misunderstand. I'm talking about blocking kswapd being
> > wrong.  i.e. Blocking kswapd in shrinkers causes problems
> > because th ememory reclaim code does not expect kswapd to be
> > arbitrarily delayed by waiting on IO. We've had this problem with
> > the XFS inode cache shrinker for years, and there are many reports
> > of extremely long reclaim latencies for both direct and kswapd
> > reclaim that result from kswapd not making progress while waiting
> > in shrinkers for IO to complete.
> > 
> > The work I'm currently doing to fix this XFS problem can be found
> > here:
> > 
> > https://lore.kernel.org/linux-fsdevel/20190801021752.4986-1-david@fromorbit.com/
> > 
> > 
> > i.e. the point I'm making is that waiting for IO in kswapd reclaim
> > context is considered harmful - kswapd context shrinker reclaim
> > should be as non-blocking as possible, and any back-off to wait for
> > IO to complete should be done by the high level reclaim core once
> > it's completed an entire reclaim scan cycle of everything....
> > 
> > What follows from that, and is pertinent for in this situation, is
> > that if you don't block kswapd, then other reclaim contexts are not
> > going to get stuck waiting for it regardless of the reclaim context
> > they use.
> > 
> > Cheers,
> > 
> > Dave.
> 
> So, what do you think the dm-bufio shrinker should do?

I'm not familiar with the constraints the code operates under, so
I can't guarantee that I have an answer for you... :/

> Currently it tries to free buffers on the clean list and if there are not 
> enough buffers on the clean list, it goes into the dirty list - it writes 
> the buffers back and then frees them.
> 
> What should it do? Should it just start writeback of the dirty list 
> without waiting for it? What should it do if all the buffers are under 
> writeback?

For kswapd, it should do what it can without blocking. e.g. kicking
an async writer thread rather than submitting the IO itself. That's
what I changes XFS to do.

And if you look at the patchset in the above link, it also
introduced a mechanism for shrinkers to communicate back to the high
level reclaim code that kswapd needs to back off
(reclaim_state->need_backoff).

With these mechanism, the shrinker can start IO without blocking
kswapd on congested request queues and tell memory reclaim to wait
before calling this shrinker again. This allows kswapd to aggregate
all the waits that shrinkers and page reclaim require to all
progress to be made into a single backoff event. That means kswapd
does other scanning work while background writeback goes on, and
once everythign is scanned it does a single wait for everything that
needs time to make progress...

I think that should also work for the dm-bufio shrinker, and the the
direct reclaim backoff parts of the patchset should work for
non-blocking direct reclaim scanning as well, like it now does for
XFS.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, back to index

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-08  9:50 [PATCH] direct-io: use GFP_NOIO to avoid deadlock Mikulas Patocka
2019-08-08 13:53 ` Matthew Wilcox
2019-08-08 15:13   ` Mikulas Patocka
2019-08-08 15:17   ` [PATCH] loop: set PF_MEMALLOC_NOIO for the worker thread Mikulas Patocka
2019-08-08 16:12     ` Jens Axboe
2019-08-08 14:39 ` [PATCH] direct-io: use GFP_NOIO to avoid deadlock Junxiao Bi
2019-08-09  1:34 ` Dave Chinner
2019-08-09 11:30   ` Mikulas Patocka
2019-08-09 21:57     ` Dave Chinner
2019-08-13 16:35       ` Mikulas Patocka
2019-08-14 10:43         ` Dave Chinner

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org linux-fsdevel@archiver.kernel.org
	public-inbox-index linux-fsdevel


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox