All of lore.kernel.org
 help / color / mirror / Atom feed
* Failure with generic/388 test
@ 2018-01-18 17:39 Jan Kara
  2018-01-19 17:40 ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2018-01-18 17:39 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4

Hello,

I've been seeing failures with generic/388 test (with 4.12-based distro
kernel). Now after debugging it for some time it seems it is a problem in
EXT4_GOING_FLAGS_NOLOGFLUSH implementation. What seems to be happening is
that e.g. directory is being created, we set EXT4_FLAGS_SHUTDOWN in the
middle and so the creation fails. We do iput() which should delete the
unused inode but since the filesystem is marked as EXT4_FLAGS_SHUTDOWN,
those changes don't hit the disk. *But* the transaction which has allocated
the inode still manages to commit before we abort the journal (as there is
a window provided by msleep() in EXT4_GOING_FLAGS_NOLOGFLUSH
implementation). So after journal recovery, we have unattached inode and
e2fsck complains.

Now this would not happen if the system just crashed - in such case, either
the allocating transaction won't be able to commit, or iput() would have
finished successfully and logged inode deletion in the same transaction.

So why is journal abort happening after EXT4_FLAGS_SHUTDOWN being set and
why is that window even prolonged by msleep? That would deserve a comment
if nothing else BTW...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Failure with generic/388 test
  2018-01-18 17:39 Failure with generic/388 test Jan Kara
@ 2018-01-19 17:40 ` Theodore Ts'o
  2018-01-24 11:11   ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Theodore Ts'o @ 2018-01-19 17:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Thu, Jan 18, 2018 at 06:39:46PM +0100, Jan Kara wrote:
> 
> I've been seeing failures with generic/388 test (with 4.12-based distro
> kernel). Now after debugging it for some time it seems it is a problem in
> EXT4_GOING_FLAGS_NOLOGFLUSH implementation. What seems to be happening is
> that e.g. directory is being created, we set EXT4_FLAGS_SHUTDOWN in the
> middle and so the creation fails. We do iput() which should delete the
> unused inode but since the filesystem is marked as EXT4_FLAGS_SHUTDOWN,
> those changes don't hit the disk. *But* the transaction which has allocated
> the inode still manages to commit before we abort the journal (as there is
> a window provided by msleep() in EXT4_GOING_FLAGS_NOLOGFLUSH
> implementation). So after journal recovery, we have unattached inode and
> e2fsck complains.

Thanks for looking at the problem!  It's been on my todo list to try
to find and fix.

Right, I see the problem, and issue is that we shouldn't be trying to
abort the handle after we set EXT4_FLAGS_SHUTDOWN.  That flag should
prevent new handles from being started, but we should allow already
running handles to complete, so that we correctly handle the
EXT4_GOING_FLAGS_LOGFLUSH case.

> So why is journal abort happening after EXT4_FLAGS_SHUTDOWN being set and
> why is that window even prolonged by msleep? That would deserve a comment
> if nothing else BTW...

I should drop the msleep().  That was there because when we first
started, there was a bug which has since been fixed by 8e39c6c3d5cc:
"ext4: fix a race in the ext4 shutdown path" (in the ext4 git tree;
due for the next merge window).

The msleep() sleep bug significantly reduced the crash caused by the
race.  This was non-ideal, but it was better than the alternative,
which was when the iSCSI server went down, it would hang the system
badly enough that node's cluster daemon (think Kubernetes daemon)
would go non-responsive for long enough that a watchdog would hammer
down the entire system.  We knew we had a race, but the msleep reduced
the incidence to the point where it rarely happened in production
workloads, and it was better than the alternative (which was
guaranteed server death).

Anyway, that bug has since been fixed and with this other problem
which you've pointed out hopefully we will have fixed all/most of our
shutdown issues.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Failure with generic/388 test
  2018-01-19 17:40 ` Theodore Ts'o
@ 2018-01-24 11:11   ` Jan Kara
  2018-01-24 17:00     ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2018-01-24 11:11 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4

On Fri 19-01-18 12:40:41, Theodore Ts'o wrote:
> On Thu, Jan 18, 2018 at 06:39:46PM +0100, Jan Kara wrote:
> > 
> > I've been seeing failures with generic/388 test (with 4.12-based distro
> > kernel). Now after debugging it for some time it seems it is a problem in
> > EXT4_GOING_FLAGS_NOLOGFLUSH implementation. What seems to be happening is
> > that e.g. directory is being created, we set EXT4_FLAGS_SHUTDOWN in the
> > middle and so the creation fails. We do iput() which should delete the
> > unused inode but since the filesystem is marked as EXT4_FLAGS_SHUTDOWN,
> > those changes don't hit the disk. *But* the transaction which has allocated
> > the inode still manages to commit before we abort the journal (as there is
> > a window provided by msleep() in EXT4_GOING_FLAGS_NOLOGFLUSH
> > implementation). So after journal recovery, we have unattached inode and
> > e2fsck complains.
> 
> Thanks for looking at the problem!  It's been on my todo list to try
> to find and fix.
> 
> Right, I see the problem, and issue is that we shouldn't be trying to
> abort the handle after we set EXT4_FLAGS_SHUTDOWN.  That flag should
> prevent new handles from being started, but we should allow already
> running handles to complete, so that we correctly handle the
> EXT4_GOING_FLAGS_LOGFLUSH case.
> 
> > So why is journal abort happening after EXT4_FLAGS_SHUTDOWN being set and
> > why is that window even prolonged by msleep? That would deserve a comment
> > if nothing else BTW...
> 
> I should drop the msleep().  That was there because when we first
> started, there was a bug which has since been fixed by 8e39c6c3d5cc:
> "ext4: fix a race in the ext4 shutdown path" (in the ext4 git tree;
> due for the next merge window).
> 
> The msleep() sleep bug significantly reduced the crash caused by the
> race.  This was non-ideal, but it was better than the alternative,
> which was when the iSCSI server went down, it would hang the system
> badly enough that node's cluster daemon (think Kubernetes daemon)
> would go non-responsive for long enough that a watchdog would hammer
> down the entire system.  We knew we had a race, but the msleep reduced
> the incidence to the point where it rarely happened in production
> workloads, and it was better than the alternative (which was
> guaranteed server death).

Ouch, someone is running EXT4_IOC_SHUTDOWN in production? I always thought
it is just a testing thing...

> Anyway, that bug has since been fixed and with this other problem
> which you've pointed out hopefully we will have fixed all/most of our
> shutdown issues.

Well, just removing msleep() does not completely fix the race, just makes
it unlikely. I believe in EXT4_GOING_FLAGS_NOLOGFLUSH case we should first
do jbd2_journal_abort() and only after that set EXT4_FLAGS_SHUTDOWN. That
will fix the race completely. Are you aware of anything that depends on the
"flag first, journal later" ordering in the shutdown path?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Failure with generic/388 test
  2018-01-24 11:11   ` Jan Kara
@ 2018-01-24 17:00     ` Theodore Ts'o
  2018-01-24 17:30       ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Theodore Ts'o @ 2018-01-24 17:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Wed, Jan 24, 2018 at 12:11:32PM +0100, Jan Kara wrote:
> > The msleep() sleep bug significantly reduced the crash caused by the
> > race.  This was non-ideal, but it was better than the alternative,
> > which was when the iSCSI server went down, it would hang the system
> > badly enough that node's cluster daemon (think Kubernetes daemon)
> > would go non-responsive for long enough that a watchdog would hammer
> > down the entire system.  We knew we had a race, but the msleep reduced
> > the incidence to the point where it rarely happened in production
> > workloads, and it was better than the alternative (which was
> > guaranteed server death).
> 
> Ouch, someone is running EXT4_IOC_SHUTDOWN in production? I always thought
> it is just a testing thing...

Yes, it's being used in no-journal mode on thousands and thousands
data center servers at work.  By the time we use it though, the iSCSI
device is presumped dead (we use local loopback per my comments on a
LSF/MM thread, and in most common case the entire container is getting
OOM-killed), and we don't care about the data stored on the volume.
So that's why the various test failures that result in a corrupted
file system hasn't worried us a whole lot; we're running in no journal
mode, so fs corruption was always expected --- and in this case, we
don't actually care about the fs contents, post-shutdown, at all.

> > Anyway, that bug has since been fixed and with this other problem
> > which you've pointed out hopefully we will have fixed all/most of our
> > shutdown issues.
> 
> Well, just removing msleep() does not completely fix the race, just makes
> it unlikely. I believe in EXT4_GOING_FLAGS_NOLOGFLUSH case we should first
> do jbd2_journal_abort() and only after that set EXT4_FLAGS_SHUTDOWN. That
> will fix the race completely. Are you aware of anything that depends on the
> "flag first, journal later" ordering in the shutdown path?

I'm going to change things so that the flag will prevent any *new*
handles from starting, but allow existing handles to complete.  That
should fix up the problems for LOGFLUSH case as well.

							- Ted

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Failure with generic/388 test
  2018-01-24 17:00     ` Theodore Ts'o
@ 2018-01-24 17:30       ` Jan Kara
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2018-01-24 17:30 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4

On Wed 24-01-18 12:00:36, Theodore Ts'o wrote:
> On Wed, Jan 24, 2018 at 12:11:32PM +0100, Jan Kara wrote:
> > > The msleep() sleep bug significantly reduced the crash caused by the
> > > race.  This was non-ideal, but it was better than the alternative,
> > > which was when the iSCSI server went down, it would hang the system
> > > badly enough that node's cluster daemon (think Kubernetes daemon)
> > > would go non-responsive for long enough that a watchdog would hammer
> > > down the entire system.  We knew we had a race, but the msleep reduced
> > > the incidence to the point where it rarely happened in production
> > > workloads, and it was better than the alternative (which was
> > > guaranteed server death).
> > 
> > Ouch, someone is running EXT4_IOC_SHUTDOWN in production? I always thought
> > it is just a testing thing...
> 
> Yes, it's being used in no-journal mode on thousands and thousands
> data center servers at work.  By the time we use it though, the iSCSI
> device is presumped dead (we use local loopback per my comments on a
> LSF/MM thread, and in most common case the entire container is getting
> OOM-killed), and we don't care about the data stored on the volume.
> So that's why the various test failures that result in a corrupted
> file system hasn't worried us a whole lot; we're running in no journal
> mode, so fs corruption was always expected --- and in this case, we
> don't actually care about the fs contents, post-shutdown, at all.

Ah, ok :). Thanks for info.

> > > Anyway, that bug has since been fixed and with this other problem
> > > which you've pointed out hopefully we will have fixed all/most of our
> > > shutdown issues.
> > 
> > Well, just removing msleep() does not completely fix the race, just makes
> > it unlikely. I believe in EXT4_GOING_FLAGS_NOLOGFLUSH case we should first
> > do jbd2_journal_abort() and only after that set EXT4_FLAGS_SHUTDOWN. That
> > will fix the race completely. Are you aware of anything that depends on the
> > "flag first, journal later" ordering in the shutdown path?
> 
> I'm going to change things so that the flag will prevent any *new*
> handles from starting, but allow existing handles to complete.  That
> should fix up the problems for LOGFLUSH case as well.

At this point I'm not sure how you want to make that work - once
ext4_forced_shutdown() starts returning errors, we may fail various
operations and as a result cleanup of some operation where this happens in
the middle is incomplete. E.g. if you allocate inode, then flip the
shutdown bit, then we fail to link inode into its parent directory, but
iput() on error cleanup path will also fail to free this inode as various
ext4_forced_shutdown() checks along that path will stop that... So
transaction will complete but proper cleanup will not be logged in it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-01-24 17:30 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-18 17:39 Failure with generic/388 test Jan Kara
2018-01-19 17:40 ` Theodore Ts'o
2018-01-24 11:11   ` Jan Kara
2018-01-24 17:00     ` Theodore Ts'o
2018-01-24 17:30       ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.