Linux-ext4 Archive on
 help / color / Atom feed
From: "Theodore Ts'o" <>
To: Haotian Li <>
Cc: Ext4 Developers List <>,
	"liuzhiqiang (I)" <>,
	linfeilong <>
Subject: Re: [PATCH] e2fsck: try write_primary_superblock() again when it failed
Date: Mon, 10 May 2021 16:59:38 -0400
Message-ID: <> (raw)
In-Reply-To: <>

On Tue, Apr 13, 2021 at 11:19:30AM +0800, Haotian Li wrote:
> Function write_primary_superblock() has two ways to flush
> superblock, byte-by-byte as default. It may use
> io_channel_write_byte() many times. If some errors occur
> during these funcs, the superblock may become inconsistent
> and produce checksum error.
> Try write_primary_superblock() with whole-block way again
> when it failed on byte-by-byte way.

So unless you're using Direct I/O, this patch really shouldn't be
making any difference.  (And tune2fs doesn't actually support Direct
I/O access, and that's the only e2fsprogs program that should normally
be making changes to the superblock on a regular basis.)  That's
becuase with buffered I/O, we aren't going to be actually trying to
write to the device.  The byte-by-byte writes will only be to
in-memory buffer caches, and so write errors should be *very* rare.
Are you actually seeing this happen in actual practice?  If so, what
userspace program is trying to write the superblock, and under what

The problem with writing the whole superblock at one go is that this
can race with changes being made by the kernel --- for example, if we
are unlinking or truncating a file, and the kernel is updated the
first inode in the orphaned inode list, this can lead to other types
of inconsistencies / file system corruption if we get unlucky.  That's
why we switched the byte-by-byte update approach (at least for those
I/O managers which supported it; those that didn't were things like
the Windows I/O manager where concurrent update by the kernel wasn't
an issue).

It is true that since we have added metadata checksums, this can lead
to checksum inconsistencies.  In practice, since we don't validate the
superblock while the file system is mounted, this should only happen
in two cases if there is a race between tune2fs modifying the kernel
and the kernel trying to update the superblock, and we crash before a
subsequent attempt by the kernel to update the superblock; for
example, when the orphaned inode list is being modified, or when the
file system is unmounted.

What we should do fix this, at least for the long term, is to add new
generic ioctls for updating the label and UUID.  This is something I
had discussed with Darrick as something that multiple file systems
would find useful.  For the other use cases, what I think we should do
is to implement an ext4-specific ioctl which takes a pointer to an
in-memory 1k superblock, and a 32-bit flag word.  One bit in the flag
word might cause the ioctl to update the following fields:

	__le16	s_mnt_count;		/* Mount count */
	__le16	s_max_mnt_count;	/* Maximal mount count */
	__le32	s_lastcheck;		/* time of last check */
	__le32	s_checkinterval;	/* max. time between checks */

Another bit might mean updating these fields:

	__le16	s_errors;		/* Behaviour when detecting errors */
	__le32	s_r_blocks_count_lo;	/* Reserved blocks count */
	__le32	s_r_blocks_count_hi;	/* Reserved blocks count */

Yet another bit might mean updating these fields: 

	__le32	s_default_mount_opts;
	__u8	s_mount_opts[64];

etc.  If the flag word is 0, then the ioctl will return success, and
the flag word will be updated with the set of flags supported by the
currently running kernel.

In that way, tune2fs could test and see if it is running on a kernel
which supports superblock updates via the ioctl mechanism.  If it
does, it will use this in preference to trying to update the
superblock via direct writes to the block device, and since the actual
commit is being done by the kernel, it can run those changes through
the journal.  The kernel can also make sure the superblock checksum is
updated in a completely consistent way and with appropriate locking
with other changes to the superblock.

Does that make sense?

					- Ted

      parent reply index

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-13  3:19 Haotian Li
2021-05-10  3:50 ` Zhiqiang Liu
2021-05-10 20:59 ` Theodore Ts'o [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-ext4 Archive on

Archives are clonable:
	git clone --mirror linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ \
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone