linux-ext4.vger.kernel.org archive mirror
* JBD2 transaction running out of space
@ 2019-08-19  8:57 Jan Kara
  2019-08-19 16:17 ` Darrick J. Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Jan Kara @ 2019-08-19  8:57 UTC (permalink / raw)
  To: linux-ext4

Hello,

I've recently received a bug report where a JBD2 assertion failed
because transaction commit ran out of journal space. After closer
inspection of the crash dump it seems the problem is that there were too
many journal descriptor blocks (more than the
max_transaction_size >> 5 + 32 we estimate in jbd2_log_space_left())
due to descriptor blocks with revoke records. In fact the estimate of
the number of descriptor blocks looks pretty arbitrary, and many more
descriptor blocks can be needed for revoke records. We need one revoke
record for every metadata block freed, so in the worst case (1k
blocksize, 64-bit journal feature enabled, checksumming enabled) we fit
125 revoke records in one descriptor block. In common cases it's about
500 revoke records per descriptor block. Now when we free large
directories or large files with data journalling enabled, we can have
*lots* of blocks to revoke - with extent-mapped files easily millions in
a single transaction, which can mean 10k descriptor blocks - clearly
more than the estimate of 128 descriptor blocks per transaction ;)
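
To make the arithmetic concrete, here is a rough user-space sketch (not
kernel code; the on-disk sizes I'm assuming are a 16-byte revoke block
header, a 4-byte checksum tail when checksumming is on, and 8-byte
records with the 64-bit feature, 4-byte otherwise):

```c
/* How many revoke records fit into one descriptor block? */
int revoke_records_per_block(int blocksize, int feature_64bit, int csum)
{
	int space = blocksize - 16;		/* revoke block header */

	if (csum)
		space -= 4;			/* checksum tail */
	return space / (feature_64bit ? 8 : 4);	/* bytes per record */
}
```

So 1k blocks with the 64-bit feature and checksums give
(1024 - 16 - 4) / 8 = 125 records, while 4k blocks with the 64-bit
feature give 510, i.e. roughly the 500 above.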

Now users clearly don't hit this problem frequently, so this is not a
common case, but it is still possible, and a malicious user could use it
to DoS the machine, so I think we need to get even the weird corner
cases fixed. The question is how, because as sketched above the worst
case is too bad to account for in the common case. I have considered
three options:

1) Count the number of revoke records currently in the transaction and
add the needed revoke descriptor blocks to the expected transaction
size. This is easy enough but does not solve all the corner cases - a
single handle can add a lot of revoke records, which may overflow the
space we reserve for descriptor blocks.

2) Add an argument to jbd2_journal_start() telling how many metadata
blocks we are going to free, and account the necessary revoke descriptor
blocks into reserved credits. This could work; we would generally need
to pass inode->i_blocks / blocksize as the estimate of metadata blocks
to free (for inodes to which this applies), as we don't have a better
estimate, but I guess that's bearable. It would require some changes on
the ext4 side but nothing too intrusive.
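
A sketch of the credit accounting this would need (the helper name is
made up for illustration, not an existing jbd2 function; the result
would be added to the handle's reserved credits):

```c
/* Descriptor blocks needed to hold nr_revokes revoke records, given
 * how many records fit per block for this journal's geometry. */
long revoke_descriptor_credits(long nr_revokes, long records_per_block)
{
	/* round up - a partially filled descriptor block still costs
	 * a whole journal block */
	return (nr_revokes + records_per_block - 1) / records_per_block;
}
```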

3) Use the fact that we need to revoke only blocks that are currently in
the journal. Thus the number of revoke records we may really need to
store is well bounded (by the journal size). What is a bit painful is
tracking which blocks are journalled. We could use a variant of counting
Bloom filters to store that information with low memory consumption (say
64k of memory in the common case) and high enough accuracy, but that
will still be some work to write. On the plus side it would reduce the
number of revoke records we have to store even in the common case.
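
For illustration, a minimal user-space sketch of such a counting Bloom
filter (the sizes, hash, and names are made up here; a real
implementation would tune the table to the journal size and pick proper
hash functions):

```c
#include <stdint.h>

#define CBF_COUNTERS	(64 * 1024)	/* 32KB at 4 bits per counter */
#define CBF_HASHES	3

/* Two 4-bit saturating counters per byte, zero-initialized */
uint8_t cbf[CBF_COUNTERS / 2];

uint32_t cbf_hash(uint64_t block, int i)
{
	/* toy multiplicative hash, one variant per hash index */
	uint64_t h = block * 0x9E3779B97F4A7C15ULL +
		     (uint64_t)i * 0xC2B2AE3D27D4EB4FULL;
	h ^= h >> 29;
	return (uint32_t)(h % CBF_COUNTERS);
}

int cbf_get(uint32_t idx)
{
	return (cbf[idx / 2] >> ((idx % 2) * 4)) & 0xF;
}

void cbf_set(uint32_t idx, int val)
{
	int shift = (idx % 2) * 4;

	cbf[idx / 2] = (cbf[idx / 2] & ~(0xF << shift)) | (val << shift);
}

/* Record that a block was journalled */
void cbf_add(uint64_t block)
{
	for (int i = 0; i < CBF_HASHES; i++) {
		uint32_t idx = cbf_hash(block, i);
		int c = cbf_get(idx);

		if (c < 15)		/* saturate, never overflow */
			cbf_set(idx, c + 1);
	}
}

/* Remove a block once it has left the journal */
void cbf_remove(uint64_t block)
{
	for (int i = 0; i < CBF_HASHES; i++) {
		uint32_t idx = cbf_hash(block, i);
		int c = cbf_get(idx);

		if (c > 0 && c < 15)	/* saturated counters stay put */
			cbf_set(idx, c - 1);
	}
}

/* May return false positives (spurious revokes), never false
 * negatives (a missed revoke would corrupt the filesystem) */
int cbf_maybe_contains(uint64_t block)
{
	for (int i = 0; i < CBF_HASHES; i++)
		if (cbf_get(cbf_hash(block, i)) == 0)
			return 0;
	return 1;
}
```

The one-sided error is the point: a false positive only costs a
needless revoke record, while the filter can never claim a journalled
block is absent.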

Overall I'm probably leaning towards 2) but I'm happy to hear more opinions
or ideas :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: JBD2 transaction running out of space
  2019-08-19  8:57 JBD2 transaction running out of space Jan Kara
@ 2019-08-19 16:17 ` Darrick J. Wong
  2019-08-26 16:30   ` Jan Kara
  0 siblings, 1 reply; 3+ messages in thread
From: Darrick J. Wong @ 2019-08-19 16:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Mon, Aug 19, 2019 at 10:57:59AM +0200, Jan Kara wrote:
> Hello,
> 
> I've recently received a bug report where a JBD2 assertion failed
> because transaction commit ran out of journal space. After closer
> inspection of the crash dump it seems the problem is that there were
> too many journal descriptor blocks (more than the
> max_transaction_size >> 5 + 32 we estimate in jbd2_log_space_left())
> due to descriptor blocks with revoke records. In fact the estimate of
> the number of descriptor blocks looks pretty arbitrary, and many more
> descriptor blocks can be needed for revoke records. We need one revoke
> record for every metadata block freed, so in the worst case (1k
> blocksize, 64-bit journal feature enabled, checksumming enabled) we
> fit 125 revoke records in one descriptor block. In common cases it's
> about 500 revoke records per descriptor block. Now when we free large
> directories or large files with data journalling enabled, we can have
> *lots* of blocks to revoke - with extent-mapped files easily millions
> in a single transaction, which can mean 10k descriptor blocks -
> clearly more than the estimate of 128 descriptor blocks per
> transaction ;)

Can jbd2 make the jbd2_journal_revoke caller wait until it has
checkpointed the @blocknr block if it has run out of revoke record
space?

> Now users clearly don't hit this problem frequently, so this is not a
> common case, but it is still possible, and a malicious user could use
> it to DoS the machine, so I think we need to get even the weird corner
> cases fixed. The question is how, because as sketched above the worst
> case is too bad to account for in the common case. I have considered
> three options:
> 
> 1) Count the number of revoke records currently in the transaction
> and add the needed revoke descriptor blocks to the expected
> transaction size. This is easy enough but does not solve all the
> corner cases - a single handle can add a lot of revoke records, which
> may overflow the space we reserve for descriptor blocks.
> 
> 2) Add an argument to jbd2_journal_start() telling how many metadata
> blocks we are going to free, and account the necessary revoke
> descriptor blocks into reserved credits. This could work; we would
> generally need to pass inode->i_blocks / blocksize as the estimate of
> metadata blocks to free (for inodes to which this applies), as we
> don't have a better estimate, but I guess that's bearable. It would
> require some changes on the ext4 side but nothing too intrusive.

What happens if iblocks / blocksize revoke records exceeds the size of
the journal?

--D

> 3) Use the fact that we need to revoke only blocks that are currently
> in the journal. Thus the number of revoke records we may really need
> to store is well bounded (by the journal size). What is a bit painful
> is tracking which blocks are journalled. We could use a variant of
> counting Bloom filters to store that information with low memory
> consumption (say 64k of memory in the common case) and high enough
> accuracy, but that will still be some work to write. On the plus side
> it would reduce the number of revoke records we have to store even in
> the common case.
> 
> Overall I'm probably leaning towards 2) but I'm happy to hear more opinions
> or ideas :)
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: JBD2 transaction running out of space
  2019-08-19 16:17 ` Darrick J. Wong
@ 2019-08-26 16:30   ` Jan Kara
  0 siblings, 0 replies; 3+ messages in thread
From: Jan Kara @ 2019-08-26 16:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Jan Kara, linux-ext4

On Mon 19-08-19 09:17:05, Darrick J. Wong wrote:
> On Mon, Aug 19, 2019 at 10:57:59AM +0200, Jan Kara wrote:
> > Hello,
> > 
> > I've recently received a bug report where a JBD2 assertion failed
> > because transaction commit ran out of journal space. After closer
> > inspection of the crash dump it seems the problem is that there were
> > too many journal descriptor blocks (more than the
> > max_transaction_size >> 5 + 32 we estimate in jbd2_log_space_left())
> > due to descriptor blocks with revoke records. In fact the estimate
> > of the number of descriptor blocks looks pretty arbitrary, and many
> > more descriptor blocks can be needed for revoke records. We need one
> > revoke record for every metadata block freed, so in the worst case
> > (1k blocksize, 64-bit journal feature enabled, checksumming enabled)
> > we fit 125 revoke records in one descriptor block. In common cases
> > it's about 500 revoke records per descriptor block. Now when we free
> > large directories or large files with data journalling enabled, we
> > can have *lots* of blocks to revoke - with extent-mapped files
> > easily millions in a single transaction, which can mean 10k
> > descriptor blocks - clearly more than the estimate of 128 descriptor
> > blocks per transaction ;)
> 
> Can jbd2 make the jbd2_journal_revoke caller wait until it has
> checkpointed the @blocknr block if it has run out of revoke record
> space?

That would be really hard to implement without introducing deadlocks
(checkpoint of a transaction may need to wait for the currently
committing transaction to finish its commit in some cases). Also, as you
mention below, it isn't even guaranteed that the revoke descriptor
blocks fit into the journal if we don't limit them in some way.

> > Now users clearly don't hit this problem frequently, so this is not
> > a common case, but it is still possible, and a malicious user could
> > use it to DoS the machine, so I think we need to get even the weird
> > corner cases fixed. The question is how, because as sketched above
> > the worst case is too bad to account for in the common case. I have
> > considered three options:
> > 
> > 1) Count the number of revoke records currently in the transaction
> > and add the needed revoke descriptor blocks to the expected
> > transaction size. This is easy enough but does not solve all the
> > corner cases - a single handle can add a lot of revoke records,
> > which may overflow the space we reserve for descriptor blocks.
> > 
> > 2) Add an argument to jbd2_journal_start() telling how many metadata
> > blocks we are going to free, and account the necessary revoke
> > descriptor blocks into reserved credits. This could work; we would
> > generally need to pass inode->i_blocks / blocksize as the estimate
> > of metadata blocks to free (for inodes to which this applies), as we
> > don't have a better estimate, but I guess that's bearable. It would
> > require some changes on the ext4 side but nothing too intrusive.
> 
> What happens if iblocks / blocksize revoke records exceeds the size of
> the journal?

That's a good point. Doing some math, this could happen when we have
e.g. a file with journalled data that is a couple of GB large. However,
looking into the code, we could use the fact that we actually truncate a
file one extent at a time, so we in fact know exactly how many blocks we
are going to free, and the maximum number of blocks in an extent (65535)
generates ~524 revoke descriptor blocks in the worst case, which still
reasonably fits within a transaction. So this seems fixable. Thanks for
the input!
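
For the record, the arithmetic behind that estimate (sizes as assumed
in my first mail; rounding up, this comes out at 525 descriptor blocks,
in line with the ~524 above):

```c
/* Worst-case revoke descriptor blocks for freeing one extent */
long extent_revoke_descriptors(void)
{
	const long max_extent_blocks = 65535;	/* 16-bit extent length */
	/* worst case: 1k blocksize, 64-bit feature, checksums ->
	 * (1024 - 16 - 4) / 8 = 125 records per descriptor block */
	const long per_block = (1024 - 16 - 4) / 8;

	return (max_extent_blocks + per_block - 1) / per_block;
}
```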

								Honza

> > 3) Use the fact that we need to revoke only blocks that are
> > currently in the journal. Thus the number of revoke records we may
> > really need to store is well bounded (by the journal size). What is
> > a bit painful is tracking which blocks are journalled. We could use
> > a variant of counting Bloom filters to store that information with
> > low memory consumption (say 64k of memory in the common case) and
> > high enough accuracy, but that will still be some work to write. On
> > the plus side it would reduce the number of revoke records we have
> > to store even in the common case.
> > 
> > Overall I'm probably leaning towards 2) but I'm happy to hear more opinions
> > or ideas :)
> > 
> > 								Honza
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


end of thread, other threads:[~2019-08-26 16:30 UTC | newest]

Thread overview: 3+ messages
2019-08-19  8:57 JBD2 transaction running out of space Jan Kara
2019-08-19 16:17 ` Darrick J. Wong
2019-08-26 16:30   ` Jan Kara
