* help with bug_on on ext4 mount
@ 2014-07-01  6:44 Dolev Raviv
  2014-07-01 15:39 ` Theodore Ts'o
  0 siblings, 1 reply; 2+ messages in thread
From: Dolev Raviv @ 2014-07-01  6:44 UTC
  To: linux-ext4; +Cc: Tanya Brokhman, Maya Erez, kdorfman, lsusman

Hi All,

I’m working on a crash originating from the ext4 mount path. I’m running a
3.10-based kernel.

Crash description:
I saw a BUG_ON assertion failure in function ext4_clear_journal_err(). The
condition that triggers it is:  !EXT4_HAS_COMPAT_FEATURE(sb,
EXT4_FEATURE_COMPAT_HAS_JOURNAL).
The strange thing is that the same BUG_ON check is made at the
start of the function that calls ext4_clear_journal_err(), which is
ext4_load_journal(). This means that the feature flag is changed inside
ext4_load_journal(), before the call to ext4_clear_journal_err().

Unfortunately, I’m not too familiar with the ext4 code. From analyzing the
journal path I came to the conclusions below:
This scenario is possible if, during journal replay, the superblock is
restored or overwritten from the journal.
I have noticed a case where the sb is marked as dirty and is later
written out through the address_space_operations .writepage = ext4_writepage
callback. This callback uses the journal and can cause the dirty sb to
appear in the journal. If a power cut occurs during the journal write and
the sb copy in the journal is corrupted, it may cause the BUG_ON assertion
failure above.
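
To make the suspected sequence concrete, here is a minimal userspace sketch
of it. It is a toy model, not the kernel code: struct toy_super,
toy_has_journal() and toy_replay() are hypothetical names, endianness is
ignored, and only the flag value 0x0004 is taken from the ext4 headers.

```c
#include <stdint.h>
#include <string.h>

/* Flag value as defined in the ext4 headers. */
#define EXT4_FEATURE_COMPAT_HAS_JOURNAL 0x0004

/* Hypothetical, heavily reduced stand-in for the on-disk superblock. */
struct toy_super {
	uint16_t s_magic;
	uint32_t s_feature_compat;
};

/* Userspace analogue of the feature test that the BUG_ON negates. */
static int toy_has_journal(const struct toy_super *es)
{
	return (es->s_feature_compat & EXT4_FEATURE_COMPAT_HAS_JOURNAL) != 0;
}

/*
 * Models journal replay of the superblock block: whatever copy sits in
 * the journal overwrites the buffer that the later check reads.
 */
static void toy_replay(struct toy_super *es, const struct toy_super *journal_copy)
{
	memcpy(es, journal_copy, sizeof(*es));
}
```

With a good superblock, toy_has_journal() is true on entry; after
toy_replay() with a zeroed (corrupted) journal copy, the same test is false,
which is the state in which the second BUG_ON would fire.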

Is the scenario described above even possible (or am I missing something)?
Has anyone encountered similar issues? Are there any known fixes for this?

Thanks,
Dolev
-- 
QUALCOMM ISRAEL, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: help with bug_on on ext4 mount
  2014-07-01  6:44 help with bug_on on ext4 mount Dolev Raviv
@ 2014-07-01 15:39 ` Theodore Ts'o
  0 siblings, 0 replies; 2+ messages in thread
From: Theodore Ts'o @ 2014-07-01 15:39 UTC
  To: Dolev Raviv; +Cc: linux-ext4, Tanya Brokhman, Maya Erez, kdorfman, lsusman

On Tue, Jul 01, 2014 at 06:44:45AM -0000, Dolev Raviv wrote:
> 
> Crash description:
> I saw a BUG_ON assertion failure in function ext4_clear_journal_err(). The
> condition that triggers it is:  !EXT4_HAS_COMPAT_FEATURE(sb,
> EXT4_FEATURE_COMPAT_HAS_JOURNAL).
> The strange thing is that the same BUG_ON check is made at the
> start of the function that calls ext4_clear_journal_err(), which is
> ext4_load_journal(). This means that the feature flag is changed inside
> ext4_load_journal(), before the call to ext4_clear_journal_err().
> 
> Unfortunately, I’m not too familiar with the ext4 code. From analyzing the
> journal path I came to the conclusions below:
> This scenario is possible if, during journal replay, the superblock is
> restored or overwritten from the journal.
> I have noticed a case where the sb is marked as dirty and is later
> written out through the address_space_operations .writepage = ext4_writepage
> callback. This callback uses the journal and can cause the dirty sb to
> appear in the journal. If a power cut occurs during the journal write and
> the sb copy in the journal is corrupted, it may cause the BUG_ON assertion
> failure above.

Yes, this is possible --- but if the journal has been corrupted,
something pretty disastrous has happened.  Indeed, if that has
happened, it may be that some other portions of the file system will
also have been wiped out.  So I'd ask the question of whether you have
a bigger issue, such as crappy flash that is either not properly
implementing the CACHE FLUSH operation, or which does not have proper
transaction handling for its FTL metadata, so that even if the data
blocks were correctly saved, if power gets removed while the SSD or
eMMC flash is doing a GC operation, some data or metadata blocks
(potentially including blocks written days or months ago) can get
corrupted.  Unfortunately, there does seem to be a huge amount of
crappy flash out there, and there's not much the file system can do
about it.

> Is the scenario described above even possible (or am I missing something)?
> Has anyone encountered similar issues? Are there any known fixes for this?

We do have journal checksums, but the reason why they haven't been
enabled by default is that e2fsck doesn't have good recovery from a
corrupted journal.  So they will detect a bad journal block, but we
don't have good recovery strategies implemented yet.

We could add a sanity check so that, in the absence of journal
checksums, if we are replaying the superblock and the journal copy of
the superblock looks insane, we abort the journal replay.
It's not going to help you recover the bad file system, but it will
prevent the BUG_ON.
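
As a rough illustration of what such a check could look like, here is a
userspace sketch in C.  It is not a patch: what counts as "insane" is an
assumption (wrong magic number, or the has-journal flag missing), the
function and helper names are hypothetical, and the buffer is assumed to
start at the beginning of the superblock structure.  The magic value, the
flag value, and the two field offsets follow the on-disk ext4 layout.

```c
#include <stddef.h>
#include <stdint.h>

#define EXT4_SUPER_MAGIC                0xEF53
#define EXT4_FEATURE_COMPAT_HAS_JOURNAL 0x0004

/* Byte offsets of s_magic and s_feature_compat in the on-disk superblock. */
#define SB_OFF_MAGIC          0x38
#define SB_OFF_FEATURE_COMPAT 0x5C

/* On-disk fields are little-endian; decode them portably. */
static uint16_t get_le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Returns 1 if the journal's copy of the superblock passes the two
 * assumed checks, 0 if replaying it should be refused.
 */
static int journal_sb_copy_looks_sane(const unsigned char *es, size_t len)
{
	if (len < (size_t)SB_OFF_FEATURE_COMPAT + 4)
		return 0;	/* too short to hold both fields */
	if (get_le16(es + SB_OFF_MAGIC) != EXT4_SUPER_MAGIC)
		return 0;	/* not an ext4 superblock at all */
	if (!(get_le32(es + SB_OFF_FEATURE_COMPAT) &
	      EXT4_FEATURE_COMPAT_HAS_JOURNAL))
		return 0;	/* replay would clear the journal flag */
	return 1;
}
```

A caller would run this on the journaled block before copying it over the
live superblock and abort the replay on a 0 return.  A real check would
probably want to compare more fields against the superblock already read
at mount time.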

Personally, I'd focus on why the journal got corrupted in the first
place.  A BUG_ON is transient; you reboot, and move on.  Data
corruption (at least in the absence of backups, and you *have* been
doing backups, right?) is forever....

							- Ted