* jbd2: fix deadlock while checkpoint thread waits commit thread to finish (backport to 4.14)
From: Ivan Zahariev @ 2021-07-07 18:42 UTC (permalink / raw)
  To: linux-ext4

Hello,

We're running Linux kernel 4.14.x and our systems occasionally suffer a 
bug which is already fixed: 
https://github.com/torvalds/linux/commit/53cf978457325d8fb2cdecd7981b31a8229e446e

This bugfix hasn't been ported to Linux kernels 4.14 or 4.19. The patch 
applies cleanly. The two files "fs/jbd2/checkpoint.c" and 
"fs/jbd2/journal.c" seem pretty identical in the affected sections 
compared to kernel 5.4 where we have this bugfix already applied.

Is it on purpose that this bugfix hasn't been ported to 4.14? Is it safe 
that we backport it manually in our kernel 4.14 builds? Or is the "ext4" 
system in 4.14 and 5.4 fundamentally different and this would lead to 
data loss or other problems?

Thank you.

Best regards.
--Ivan



* Re: jbd2: fix deadlock while checkpoint thread waits commit thread to finish (backport to 4.14)
From: Theodore Ts'o @ 2021-07-07 23:52 UTC (permalink / raw)
  To: Ivan Zahariev; +Cc: linux-ext4

On Wed, Jul 07, 2021 at 09:42:25PM +0300, Ivan Zahariev wrote:
> Hello,
> 
> We're running Linux kernel 4.14.x and our systems occasionally suffer a bug
> which is already fixed: https://github.com/torvalds/linux/commit/53cf978457325d8fb2cdecd7981b31a8229e446e
> 
> This bugfix hasn't been ported to Linux kernels 4.14 or 4.19. The patch
> applies cleanly. The two files "fs/jbd2/checkpoint.c" and
> "fs/jbd2/journal.c" seem pretty identical in the affected sections compared
> to kernel 5.4 where we have this bugfix already applied.
> 
> Is it on purpose that this bugfix hasn't been ported to 4.14? Is it safe
> that we backport it manually in our kernel 4.14 builds? Or is the "ext4"
> system in 4.14 and 5.4 fundamentally different and this would lead to data
> loss or other problems?

The commit was over two years ago, so my memory is not going to be
perfect.  However, Jan had made a comment suggesting the approach in
this commit because it should be easier to backport into older stable
kernels[1].

   "Since proper locking change is going to be a bit more involved, can you
    perhaps fix this deadlock by just dropping j_checkpoint_mutex in
    log_do_checkpoint() when we are going to wait for transaction commit. I've
    checked and that should be fine and that is going to be much easier change
    to backport into stable kernels..."

[1] https://marc.info/?l=linux-ext4&m=154212553014669&w=2
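
For readers unfamiliar with the pattern Jan describes, here is a minimal userspace sketch of "drop the mutex before waiting, retake it afterwards". It is not the jbd2 code and not the actual patch. Every name in it (toy_journal, toy_commit_thread, toy_wait_commit, toy_do_checkpoint, run_checkpoint_demo) is hypothetical, pthreads stand in for kernel primitives, and the toy commit thread takes the mutex directly only to keep the dependency cycle short; the real chain of waiters in jbd2 may involve other tasks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for journal state; not the kernel's journal_t. */
struct toy_journal {
    pthread_mutex_t checkpoint_mutex; /* plays the role of j_checkpoint_mutex */
    pthread_mutex_t state_lock;       /* protects commit_done */
    pthread_cond_t  commit_cond;      /* signalled when the commit finishes */
    bool            commit_done;
};

/* Toy commit: assumed to need the checkpoint mutex before it can finish. */
static void *toy_commit_thread(void *arg)
{
    struct toy_journal *j = arg;

    pthread_mutex_lock(&j->checkpoint_mutex);
    /* ... work that requires the checkpoint mutex would go here ... */
    pthread_mutex_unlock(&j->checkpoint_mutex);

    pthread_mutex_lock(&j->state_lock);
    j->commit_done = true;
    pthread_cond_broadcast(&j->commit_cond);
    pthread_mutex_unlock(&j->state_lock);
    return NULL;
}

/* Block until the toy commit has finished. */
static void toy_wait_commit(struct toy_journal *j)
{
    pthread_mutex_lock(&j->state_lock);
    while (!j->commit_done)
        pthread_cond_wait(&j->commit_cond, &j->state_lock);
    pthread_mutex_unlock(&j->state_lock);
}

/* Caller holds checkpoint_mutex on entry and again on return. */
static void toy_do_checkpoint(struct toy_journal *j)
{
    /* Waiting here with the mutex held would deadlock against the commit. */
    pthread_mutex_unlock(&j->checkpoint_mutex);
    toy_wait_commit(j);
    pthread_mutex_lock(&j->checkpoint_mutex);
    /* State may have changed while unlocked; real code must re-check it. */
}

/* Returns 0 if the checkpoint and commit both complete, 1 on setup failure. */
int run_checkpoint_demo(void)
{
    struct toy_journal j;
    pthread_t commit;

    j.commit_done = false;
    if (pthread_mutex_init(&j.checkpoint_mutex, NULL) != 0 ||
        pthread_mutex_init(&j.state_lock, NULL) != 0 ||
        pthread_cond_init(&j.commit_cond, NULL) != 0)
        return 1;

    pthread_mutex_lock(&j.checkpoint_mutex);
    if (pthread_create(&commit, NULL, toy_commit_thread, &j) != 0) {
        pthread_mutex_unlock(&j.checkpoint_mutex);
        return 1;
    }
    toy_do_checkpoint(&j);
    pthread_mutex_unlock(&j.checkpoint_mutex);

    pthread_join(commit, NULL);
    assert(j.commit_done);

    pthread_cond_destroy(&j.commit_cond);
    pthread_mutex_destroy(&j.state_lock);
    pthread_mutex_destroy(&j.checkpoint_mutex);
    return 0;
}
```

Calling run_checkpoint_demo() from a main() and building with "cc -pthread" is enough to try it. If the unlock/lock pair in toy_do_checkpoint() is removed, the toy hangs, which is the shape of the deadlock under discussion. The cost of the approach is that anything protected by the mutex has to be revalidated after it is retaken.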

So I suspect it was just that I failed to remember to add a "Cc:
stable@kernel.org" and so it was never automatically backported into
4.14 or 4.19.

Do you have a reliable reproduction which is triggering the deadlock
on your kernels?  If so, have you tried applying the patch and does it
make the problem go away for you?

Cheers,

						- Ted


* Re: jbd2: fix deadlock while checkpoint thread waits commit thread to finish (backport to 4.14)
From: Ivan Zahariev @ 2021-07-08  3:45 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

Out of a thousand machines, one triggers the problem about every 1 to
10 days. Some machines trigger the problem much more often than others. So
we have a quick way to verify whether applying the patch fixes this for us.

The most important question is: Is it safe to apply the patch on 
production machines with kernel 4.14?

We can't risk data loss. And I lack the expertise to assess what risks
this small patch brings.

Best regards.
--Ivan



* Re: jbd2: fix deadlock while checkpoint thread waits commit thread to finish (backport to 4.14)
From: Jan Kara @ 2021-07-12  9:04 UTC (permalink / raw)
  To: Ivan Zahariev; +Cc: Theodore Ts'o, linux-ext4

On Thu 08-07-21 06:45:35, Ivan Zahariev wrote:
> Out of a thousand machines, one triggers the problem about every 1 to 10
> days. Some machines trigger the problem much more often than others. So we
> have a quick way to verify whether applying the patch fixes this for us.
> 
> The most important question is: Is it safe to apply the patch on production
> machines with kernel 4.14?
> 
> We can't risk data loss. And I lack the expertise to assess what risks this
> small patch brings.

The fix should work correctly even for older kernels. I'm not aware of any
changes in this area in the past that could conflict...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
