From: Michal Hocko <mhocko@kernel.org>
To: Brian Foster <bfoster@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
linux-xfs@vger.kernel.org, linux-mm@kvack.org
Subject: Re: How to favor memory allocations for WQ_MEM_RECLAIM threads?
Date: Fri, 3 Mar 2017 16:52:58 +0100
Message-ID: <20170303155258.GJ31499@dhcp22.suse.cz>
In-Reply-To: <20170303153720.GC21245@bfoster.bfoster>

On Fri 03-03-17 10:37:21, Brian Foster wrote:
[...]
> That aside, looking through some of the traces in this case...
>
> - kswapd0 is waiting on an inode flush lock. This means somebody else
> flushed the inode and it won't be unlocked until the underlying buffer
> I/O is completed. This context is also holding pag_ici_reclaim_lock
> which is what probably blocks other contexts from getting into inode
> reclaim.
> - xfsaild is in xfs_iflush(), which means it has the inode flush lock.
> It's waiting on reading the underlying inode buffer. The buffer read
> sets b_ioend_wq to the xfs-buf wq, which is ultimately going to be
> queued in xfs_buf_bio_end_io()->xfs_buf_ioend_async(). The associated
> work item is what eventually triggers the I/O completion in
> xfs_buf_ioend().
>
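
(For reference, the handoff described above is essentially just an
INIT_WORK/queue_work pair onto the buffer's b_ioend_wq. A minimal sketch
from memory, so not necessarily line-for-line what the tree under test
has:

static void
xfs_buf_ioend_work(
	struct work_struct	*work)
{
	struct xfs_buf	*bp =
		container_of(work, struct xfs_buf, b_ioend_work);

	/* runs from the xfs-buf workqueue and finishes the I/O */
	xfs_buf_ioend(bp);
}

static void
xfs_buf_ioend_async(
	struct xfs_buf	*bp)
{
	/* defer the completion to the per-mount xfs-buf workqueue */
	INIT_WORK(&bp->b_ioend_work, xfs_buf_ioend_work);
	queue_work(bp->b_ioend_wq, &bp->b_ioend_work);
}

So once the bio completes, nothing further happens until a worker
actually gets to run that item.)
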
> So at this point reclaim is waiting on a read I/O completion. It's not
> clear to me whether the read had completed and the work item was queued
> or not. I do see the following in the workqueue lockup BUG output:
>
> [ 273.412600] workqueue xfs-buf/sda1: flags=0xc
> [ 273.414486] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/1
> [ 273.416415] pending: xfs_buf_ioend_work [xfs]
>
> ... which suggests that it was queued..? I suppose this could be one of
> the workqueues waiting on a kthread, but xfs-buf also has a rescuer that
> appears to be idle:
>
> [ 1041.555227] xfs-buf/sda1 S14904 450 2 0x00000000
> [ 1041.556813] Call Trace:
> [ 1041.557796] __schedule+0x336/0xe00
> [ 1041.558983] schedule+0x3d/0x90
> [ 1041.560085] rescuer_thread+0x322/0x3d0
> [ 1041.561333] kthread+0x10f/0x150
> [ 1041.562464] ? worker_thread+0x4b0/0x4b0
> [ 1041.563732] ? kthread_create_on_node+0x70/0x70
> [ 1041.565123] ret_from_fork+0x31/0x40
>
> So shouldn't that thread pick up the work item if that is the case?
Is it possible that progress is being made, just tediously slowly? Keep in
mind that the test case is doing writes from 1k processes while one
process basically consumes all the memory. So I wouldn't be surprised
if this just made the system crawl on any attempt to do I/O.
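
(FWIW, the rescuer exists because xfs-buf is a WQ_MEM_RECLAIM workqueue.
A sketch of the allocation, with the flags from memory rather than from
the tree under test:

	/* per-mount buffer I/O completion workqueue (xfs_super.c) */
	mp->m_buf_workqueue = alloc_workqueue("xfs-buf/%s",
			WQ_MEM_RECLAIM | WQ_FREEZABLE, 1, mp->m_fsname);

AFAIU the rescuer is only woken through the mayday path, i.e. when the
worker pool fails to create a new worker for the pending item in time,
so the rescuer sitting idle doesn't by itself mean the item cannot be
processed - it may simply still be waiting for a regular worker to get
to it.)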
--
Michal Hocko
SUSE Labs