All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Chandan Rajendra <chandan@linux.ibm.com>
Cc: Brian Foster <bfoster@redhat.com>,
	Chandan Rajendra <chandanrlinux@gmail.com>,
	linux-xfs@vger.kernel.org, darrick.wong@oracle.com,
	hch@infradead.org
Subject: Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
Date: Fri, 30 Aug 2019 09:08:17 +1000	[thread overview]
Message-ID: <20190829230817.GW1119@dread.disaster.area> (raw)
In-Reply-To: <783535067.D5oYYkGoWf@localhost.localdomain>

On Thu, Aug 29, 2019 at 10:51:59AM +0530, Chandan Rajendra wrote:
> On Monday, August 26, 2019 6:02 AM Dave Chinner wrote: 
> > On Sun, Aug 25, 2019 at 08:35:17PM +0530, Chandan Rajendra wrote:
> > > On Friday, August 23, 2019 7:08 PM Chandan Rajendra wrote:
> > > 
> > > Dave, With the above changes made in xfs_trans_reserve(), mount task is
> > > deadlocking due to the following,
> > > 1. With synchronous transactions, __xfs_trans_commit() now causes iclogs to be
> > > flushed to the disk and hence log items to be ultimately moved to AIL.
> > > 2. xfsaild task is woken up, which acts in items on AIL.
> > > 3. After some time, we stop issuing synchronous transactions because AIL has
> > >    log items in its list and hence !xfs_ail_min(tp->t_mountp->m_ail) evaluates to
> > >    false. In xfsaild_push(), "XFS_LSN_CMP(lip->li_lsn, target) <= 0"
> > >    evaluates to false on the first iteration of the while loop. This means we
> > >    have a log item whose LSN is larger than xfs_ail->ail_target at the
> > >    beginning of the AIL.
> > 
> > The push target for xlog_grant_push_ail() is to free 25% of the log
> > space. So if all the items in the AIL are not within 25% of the tail
> > end of the log, there's nothing for the AIL to push. This indicates
> > that there is at least 25% of physical log space free.
> 
> Sorry for the late response. I was trying to understand the code flow.
> 
> Here is a snippet of perf trace explaining what is going on,
> 
> 	 760881:           mount  8654 [002]   216.813041:                         probe:xlog_grant_push_ail: (c000000000765864) comm="xfsaild/loop1" threshold_cycle_s32=3 threshold_block_s32=3970 need_bytes_s32=389328 last_sync_cycle_u32=2 last_sync_block_u32=19330 free_threshold_s32=5120 free_bytes_s32=383756 free_blocks_s32=749 l_logsize=10485760 reserve_cycle_s32=3 reserve_block_s32=9513204(~18580 blocks) tail_cycle_s32=2 tail_block_s32=19330

So this looks like last_sync_lsn is 2/19330, and the transaction
reservation is ~380kB, or close on 3% of the log. The reserve grant
head is at 3/18580, so we're ~700 * 512 = ~350kB of reservation
remaining. Yup, so we are definitely in the "go to sleep and wait"
situation here.

> 	 786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501

200ms later the tail has moved, and last_sync_lsn is now 3/18501.
i.e. the iclog writes have made it to disk, and the items have been
moved into the AIL. I don't know where that came from, but I'm
assuming it's an IO completion based on it being run from a
kworker context that doesn't have an "xfs-" name prefix(*).

As the tail has moved, this should have woken the anything sleeping
on the log tail in xlog_grant_head_wait() via a call to
xfs_log_space_wake(). The first waiter should wake, see that there
still isn't room in the log (only 3 sectors were freed in the log,
we need at least 60). That woken process should then run
xlog_grant_push_ail() again and go back to sleep.

(*) I have a patch that shortens "s/kworker/kw/" so that you can
actually see the name of the kworker in the 16 byte field we have
for the task name. We really should just increase current->comm to
32 bytes.

> 	 786577: kworker/4:1H-kb  1825 [004]   217.041087:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> 	 793653:   xfsaild/loop1  8661 [004]   265.407708:                probe:xfsaild_push_last_pushed_lsn: (c000000000784644) comm="xfsaild/loop1" cycle_lsn_u32=0 block_lsn_u32=0 target_cycle_lsn_u32=2 target_block_lsn_u32=19330
> 	 793654:   xfsaild/loop1  8661 [004]   265.407717:              probe:xfsaild_push_min_lsn_less_than: (c0000000007846a0) comm="xfsaild/loop1" less_than_s32=0 cycle_lsn_u32=2 block_lsn_u32=19333 lip_x64=0xc000000303fb4a48

Ans some 40s later the xfsaild is woken by something, sees there's
nothing to do, and goes back to sleep. I don't see the process
sleeping on the grant head being ever being woken and calling
xlog_grant_push_ail(), which would see the new last_sync_lsn and
move the push target....

From this trace, it looks like the problem here is a missing or
incorrectly processed wakeup when the log tail moves.

Unfortunately, you haven't used the built in trace points for
debugging log space hangs so I can't tell anything more than this.
i.e the trace we need contains these build in tracepoints:

# trace-cmd record -e xfs_log\* -e xfs_ail\* sleep 120 &
# <run workload that hangs within 120s>

<wait for trace-cmd to exit>
# trace-cmd report | gzip > trace.txt.gz

as that will record all transaction reservations, grant head
manipulations, changes to the tail lsn, when processes sleep on the
grant head and are worken, AIL insert/move/delete, etc.

This will generate a -lot- of data. I often generate and analyse
traces in the order of tens of GBs of events to track down issues
like this, because the problem is often only seen in a single trace
event in amongst the millions that are recorded....

And if we need more info, then we add the appropriate tracepoints
into xlog_grant_push_ail, xfsaild_push, etc under those tracepoint
namespaces, so next time we have a problem we don't ahve to write
custom tracepoints.....

> i.e the log size was 2560 * 4096 = 10485760 bytes.

The default minimum size.

> > I suspect that this means the CIL is overruning it's background push
> > target by more than expected probably because the log is so small. That leads
> > to the outstanding CIL pending commit size (the current CIL context
> > and the previous CIL commit that is held off by the uncommited
> > iclog) is greater than the AIL push target, and so nothing will free
> > up more log space and wake up the transaction waiting for grant
> > space.
> > 
> > e.g. the previous CIL context commit might take 15% of the log
> > space, and the current CIL has reserved 11% of the log space.
> > Now new transactions reservations have run out of grant space and we
> > push on the ail, but it's lowest item is at 26%, and so the AIL push
> > does nothing and we're stuck because the CIL has pinned 26% of the
> > log space.
> > 
> > As a test, can you run the test with larger log sizes? I think
> > the default used was about ~3600 blocks, so it you step that up by
> > 500 blocks at a time we should get an idea of the size of the
> > overrun by the size of the log where the hang goes away. A
> > trace of the transaction reservations and AIL pushing would also be
> > insightful.
> 
> After increasing the log size to 4193 blocks (i.e. 4193 * 4k = 17174528
> bytes) and also the patch applied, I don't see the dead lock happening.

Likely because now the 380k transaction reservation is only 2% of the
log instead of close to 4% of the log, and so the overrun isn't
large enough to trigger whatever wakeup issue we have....

> Meanwhile, I am planning to read more code to map the explaination
> provided below.

Can you get a complete trace (as per above) of a hang? we're going
to need that trace to validate any analysis you do yourself,
anyway...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2019-08-29 23:08 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-21 11:04 [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space Chandan Rajendra
2019-08-21 16:42 ` Brian Foster
2019-08-22 12:46   ` Chandan Rajendra
2019-08-21 22:18 ` Dave Chinner
2019-08-22 12:40   ` Chandan Rajendra
2019-08-22 13:19     ` Chandan Rajendra
2019-08-22 16:34   ` Brian Foster
2019-08-23  0:06     ` Dave Chinner
2019-08-23 13:08       ` Brian Foster
2019-08-23 13:38         ` Chandan Rajendra
2019-08-25 15:05           ` Chandan Rajendra
2019-08-26  0:32             ` Dave Chinner
2019-08-26 12:39               ` Brian Foster
2019-08-27 23:10                 ` Dave Chinner
2019-08-28 12:06                   ` Brian Foster
2019-08-29  5:21               ` Chandan Rajendra
2019-08-29 23:08                 ` Dave Chinner [this message]
2019-08-30  0:34                   ` Dave Chinner
2019-08-30  2:13                     ` Dave Chinner
2019-08-30  6:05                       ` Chandan Rajendra
2019-08-30 17:24                       ` Brian Foster
2019-08-30 22:10                         ` Dave Chinner
2019-08-30 23:18                           ` Brian Foster
     [not found] ` <20190830164750.GD26520@bfoster>
     [not found]   ` <2367290.sgLJaTIShl@localhost.localdomain>
2019-08-31  6:58     ` Chandan Rajendra
2019-08-31 13:07       ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190829230817.GW1119@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=bfoster@redhat.com \
    --cc=chandan@linux.ibm.com \
    --cc=chandanrlinux@gmail.com \
    --cc=darrick.wong@oracle.com \
    --cc=hch@infradead.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.