From: Jan Kara <jack@suse.cz>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.cz>, linux-ext4@vger.kernel.org
Subject: Re: JBD2 transaction running out of space
Date: Mon, 26 Aug 2019 18:30:20 +0200 [thread overview]
Message-ID: <20190826163020.GL10614@quack2.suse.cz> (raw)
In-Reply-To: <20190819161705.GB15175@magnolia>
On Mon 19-08-19 09:17:05, Darrick J. Wong wrote:
> On Mon, Aug 19, 2019 at 10:57:59AM +0200, Jan Kara wrote:
> > Hello,
> >
> > I've recently got a bug report where JBD2 assertion failed due to
> > transaction commit running out of journal space. After closer inspection of
> > the crash dump it seems that the problem is that there were too many
> > journal descriptor blocks (more that max_transaction_size >> 5 + 32 we
> > estimate in jbd2_log_space_left()) due to descriptor blocks with revoke
> > records. In fact the estimate on the number of descriptor blocks looks
> > pretty arbitrary and there can be much more descriptor blocks needed for
> > revoke records. We need one revoke record for every metadata block freed.
> > So in the worst case (1k blocksize, 64-bit journal feature enabled,
> > checksumming enabled) we fit 125 revoke record in one descriptor block. In
> > common cases its about 500 revoke records per descriptor block. Now when
> > we free large directories or large file with data journalling enabled, we can
> > have *lots* of blocks to revoke - with extent mapped files easily millions
> > in a single transaction which can mean 10k descriptor blocks - clearly more
> > than the estimate of 128 descriptor blocks per transaction ;)
>
> Can jbd2 make the jbd2_journal_revoke caller wait until it has
> checkpointed the @blocknr block if it has run out of revoke record
> space?
That would be really hard to implement without introducing deadlocks
(checkpoint of a transaction may need to wait for currently committing
transaction to finish commit in some cases). Also as you mention below, it
isn't even guaranteed revoke descriptor blocks fit into a journal if we
don't limit them in some way.
> > Now users clearly don't hit this problem frequently so this is not common
> > case but still it is possible and malicious user could use this to DoS the
> > machine so I think we need to get even the weird corner-cases fixed. The
> > question is how because as sketched above the worst case is too bad to
> > account for in the common case. I have considered three options:
> >
> > 1) Count number of revoke records currently in the transaction and add
> > needed revoke descriptor blocks to the expected transaction size. This is
> > easy enough but does not solve all the corner cases - single handle
> > can add lot of revoke blocks which may overflow the space we reserve for
> > descriptor blocks.
> >
> > 2) Add argument to jbd2_journal_start() telling how many metadata blocks we
> > are going to free and we would account necessary revoke descriptor blocks
> > into reserved credits. This could work, we would generally need to pass
> > inode->i_blocks / blocksize as the estimate of metadata blocks to free (for
> > inodes to which this applies) as we don't have better estimate but I guess
> > that's bearable. It would require some changes on ext4 side but not too
> > intrusive.
>
> What happens if iblocks / blocksize revoke records exceeds the size of
> the journal?
That's a good point. Doing some math this could happen when we have e.g. a
file with journalled data that is couple GB large. However looking into the
code we could use the fact that we actually truncate file
one-extent-at-a-time, thus we in fact know exactly how many blocks we are
going to free and maximum number of blocks in an extent (65535) generates
~524 revoke descriptor blocks in the worst case which still reasonably fits
within a transaction. So this seems fixable. Thanks for input!
Honza
> > 3) Use the fact that we need to revoke only blocks that are currently in
> > the journal. Thus the number of revoke records we really may need to store
> > is well bound (by the journal size). What is a bit painful is tracking of
> > which blocks are journalled. We could use a variant of counting Bloom
> > filters to store that information with low memory consumption (say 64k of
> > memory in common case) and high-enough accuracy but still that will be some
> > work to write. On the plus side it would reduce the amount revoke records
> > we have to store even in common case.
> >
> > Overall I'm probably leaning towards 2) but I'm happy to hear more opinions
> > or ideas :)
> >
> > Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
prev parent reply other threads:[~2019-08-26 16:30 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-19 8:57 JBD2 transaction running out of space Jan Kara
2019-08-19 16:17 ` Darrick J. Wong
2019-08-26 16:30 ` Jan Kara [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190826163020.GL10614@quack2.suse.cz \
--to=jack@suse.cz \
--cc=darrick.wong@oracle.com \
--cc=linux-ext4@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).