All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jun Nie <jun.nie@linaro.org>
To: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org, djwong@kernel.org, jack@suse.cz,
	jlayton@kernel.org, lczerner@redhat.com,
	linux-ext4@vger.kernel.org, xuyang2018.jy@fujitsu.com
Subject: Re: [PATCH v1] ext4: Remove deprecated noacl/nouser_xattr options
Date: Tue, 20 Dec 2022 16:39:27 +0800	[thread overview]
Message-ID: <Y6F0vw1ZhuHPTlzQ@niej-dt-7B47> (raw)
In-Reply-To: <Y6CUJ29YOWtLyeVA@mit.edu>

On Mon, Dec 19, 2022 at 11:41:11AM -0500, Theodore Ts'o wrote:
> On Mon, Dec 19, 2022 at 05:23:18PM +0800, Jun Nie wrote:
> > 
> > Do you mean we have a chance to expand ea_inode in place for some
> > cases? If so, a new ea_inode with larger space should be created
> > to hold expanded ea_inode data, thus data have to be copied and written
> > out through memory in my mind. Or anything other than CPU/memory can
> > utilized for this to avoid memory usage, such as DMA?
> 
> There are two inodes in question here.  The first is the base inode,
> which in this case is /file0.  The second is the ea_inode which stores
> the value of one of the extended attributes.  In the syzkaller fuzzed
> file system, there is an ea_inode field which is already created; it
> contains a value which is too large to fit in the inode or the
> extended attribute block; but that's OK, because we can put it in a
> ea_inode.  Unfortunately, we are unnecessarily created and deleting
> the ea_inode (which contains the xattr *value*) when we move the xattr
> from in-inode storage to the external xattr block.
> 
> Extended attributes can be stored either in the on-disk inode, or in
> an extended attribute block.  The storage in the on-disk inode is
> limited, but extended attributes stored don't require a random access
> 4k read as in the case of the extended attribute block.  So we try to
> store extended attributes in the inode if possible --- especially the
> ones which might be accessed frequently, such as a POSIX ACL or a
> SELinux security id.
> 
> The ext4 inode is composed of two portions.  The base 128 byte inode,
> which is always present, and which is what was originally used in
> ext2.  And the "extra inode fields", which are these fields as
> currently defined at the end of struct ext4_inode:
> 
> 	__le16	i_extra_isize;
> 	__le16	i_checksum_hi;	/* crc32c(uuid+inum+inode) BE */
> 	__le32  i_ctime_extra;  /* extra Change time      (nsec << 2 | epoch) */
> 	__le32  i_mtime_extra;  /* extra Modification time(nsec << 2 | epoch) */
> 	__le32  i_atime_extra;  /* extra Access time      (nsec << 2 | epoch) */
> 	__le32  i_crtime;       /* File Creation time */
> 	__le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
> 	__le32  i_version_hi;	/* high 32 bits for 64-bit version */
> 	__le32	i_projid;	/* Project ID */
> 
> The i_extra_isize is the field that is always present for inodes
> larger than 128 bytes and for which the ext4 file system feature
> EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE is enabled.  The i_extra_isize
> field tells us how many of the fields beyond the first 128 are
> present.  These fields are necessary for various "advanced" (newer
> than ext2) ext4 features, including metadata checksums, support for
> dates beyond 2038 and sub-second timestamp granularity, file creation
> time, 64-bit i_version, and the project id for project quotas.
> Everything beyond i_extra_isize is used for in-inode extended
> attributes.
> 
> Now, what if we need to add extra space for new ext4 features?  Well,
> ext4 has a way of expanding the extra inode fields, and one of the
> ways to trigger this is via the debugging mount option,
> debug_want_extra_isize.  In this particular syzbot reproducer, the
> mount option, "debug_want_extra_isize=128" sets the i_extra_isize
> field to maximum allowable size for a 256 byte inode size, and this
> means that all extended attributes should be ejected out from in-inode
> storage to the external extended attribute block.  We do this on a
> best efforts bases, when a modified inode is written back to the disk.
> 
> The lazytime mount option delays inode updates until the very last
> minute.  The reason for this is to avoid multiple writes to the inode
> table blocks.  This improves performance by reducing random 4k writes,
> and for flash based storage, reducing flash wearout for flash-based
> storage.  For hard drives (HDD's), it reducing random 4k writes
> reduces the need to perform Adjacent Track Interference (ATI)
> mitigations.  ATI mitigations can significantly increase the 99.9
> percentile tail latency on file system operations, and decreasing tail
> latency can be worth $$$ for some use cases[1].
> 
> [1] https://research.google/pubs/pub44830
> 
> The downside of using lazytime updates is that on a crash, the inode
> timestamps might not get updated --- but very often, this is not a big
> deal.  And normally, when some other inode in the same inode table
> block is updated, we take that opportunity to update all of the
> timestamps that were deferred.  Or, in the worst case, this will get
> delayed until the file system is unounted.
> 
> Now, back to the extra space expansion.  Eexpanding to allow extra
> inode fields to be used in the future is a "nice to have" sort of
> thing.  It can fail for a number of reasons, including there not being
> enough space in the extended attribute block to evict the extended
> attributes in the inode; or if the file system is full, we might not
> be able to allocate an external block for the extended attribute block
> in the first place.
> 
> So it's OK for us to simply pass on making space for the extra inode
> fields if it turns out we happen to be in the process of unmounting
> file system.  However, that doesn't fix the performance problem of
> unnecessarily deleting and creating the ea_inode when moving the xattr
> from the inode to the exernal xattr block.  So fixing that performance
> issue is the ideal solution.  Simply passing on the extra_isize
> expansion is the second best issue.  Backporting an unrelated fix[2]
> which papers over the problem by disallowing the mount option
> nouser_xattr is the worst option, since it doesn't actually fix the
> underlying file system bug.
> 
> [2] commit 2d544ec923dbe5 ("ext4: remove deprecated noacl/nouser_xattr
> options")
> 
> Backporting [2] will shut up the syzbot reproducer, yes.  But that's
> because the syzbot reproducer was inadequately minimized.  *This*
> reproducer, which is a easier for a human to understand and which is
> appropriately minimized will trigger exact same issue, with or without
> 
> #!/bin/bash -vx
> #
> # This reproduces an ext4 bug caused by an unfortunate interaction
> # between lazytime updates happening when a file system is being
> # unmounted and expand_extra_isize
> #
> # Initially discovered via syzkaller:
> # https://syzkaller.appspot.com/bug?id=3613786cb88c93aa1c6a279b1df6a7b201347d08
> #
> img=/tmp/foo.img
> dir=/mnt 
> file=$dir/file0
> 
> rm -f $img
> mke2fs -Fq -t ext4 -I 256 -O ea_inode -b 1024 $img 200k
> mount $img $dir
> v=$(dd if=/dev/zero bs=2000 count=1 2>/dev/null | tr '\0' =)
> touch $file
> attr -q -s test -V $v $file
> umount $dir
> mount -o debug_want_extra_isize=128,lazytime /tmp/foo.img $dir
> cat $file
> umount $dir
> 
> This is why your proposal to backport commit 2d544ec923dbe5 is not the
> right answer.
> 
> > per general understanding of a subsystem uninitialization, a flag
> > shall be marked to reject further operation on the sub-system and
> > flush the pending operation, then free the resource. In such a
> > general method, current handling to create a new ea_inode should not
> > crash even it is stupid.  sb->s_root seems to be a key global
> > resource in ext4 subsystem per my understanding, and should not be
> > set as NULL until the last step of unmount operation.
> 
> That's true in general.  And yes, simply bypassing the extra_isize
> expansion when the file system is being unmounted is certainly better
> that backporting the unrelated commit[2].  But the true correct fix is
> to optimize how we migrate the xattr from the in-inode storage to the
> external xattr block.
> 
> This is also at *best* P2 bug, since (a) it's not real security issue;
> just a null pointer derference, and there is no way this could be
> leveraged into any kind of denial of service or privilege escalation
> attack, and (b) it requires root access, and use of a debugging option
> to enable a code path which is in practice never used in production.
> It is a syzkaller report, and unfortunately, there seems to be this
> assumption that all syzkaller issues are P0 or P1 issues that must be
> remediated right away.  Which is not the case in this instance.  It's
> a real bug, and so it should be fixed; but it's not a high priority
> bug.
> 
> That being said, if you'd like ot become more experienced in a portion
> of ext4 internals, I'd certainly invite you to try to understand how
> ext4 extended attributes are managed, and try your hand at fixing this
> bug.
> 
> Best regards,
> 
> 						- Ted

Thanks for the elabration of the logic here. I guess below change is similiar
with what we are expecting. There are 2 question in the change I do not have
anwser yet.

But for the crash with NULL sb->s_root, below change does not impact anything
because all functions are still called. So I guess the protection with
rejecting further request during umount is still needed. Or I still missed
something?


 fs/ext4/xattr.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 7decaaf27e82..546808dbbdd6 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -2551,9 +2551,8 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 
 	is = kzalloc(sizeof(struct ext4_xattr_ibody_find), GFP_NOFS);
 	bs = kzalloc(sizeof(struct ext4_xattr_block_find), GFP_NOFS);
-	buffer = kvmalloc(value_size, GFP_NOFS);
 	b_entry_name = kmalloc(entry->e_name_len + 1, GFP_NOFS);
-	if (!is || !bs || !buffer || !b_entry_name) {
+	if (!is || !bs || !b_entry_name) {
 		error = -ENOMEM;
 		goto out;
 	}
@@ -2565,14 +2564,21 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 
 	/* Save the entry name and the entry value */
 	if (entry->e_value_inum) {
+		buffer = kvmalloc(value_size, GFP_NOFS);
+		if (!buffer) {
+			error = -ENOMEM;
+			goto out;
+		}
+
 		error = ext4_xattr_inode_get(inode, entry, buffer, value_size);
 		if (error)
 			goto out;
 	} else {
 		size_t value_offs = le16_to_cpu(entry->e_value_offs);
-		memcpy(buffer, (void *)IFIRST(header) + value_offs, value_size);
+		buffer = (void *)IFIRST(header) + value_offs;
 	}
 
+	/* Can we reuse entry->e_name with assumption of \0 for all e_name? */
 	memcpy(b_entry_name, entry->e_name, entry->e_name_len);
 	b_entry_name[entry->e_name_len] = '\0';
 	i.name = b_entry_name;
@@ -2585,11 +2591,6 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 	if (error)
 		goto out;
 
-	/* Remove the chosen entry from the inode */
-	error = ext4_xattr_ibody_set(handle, inode, &i, is);
-	if (error)
-		goto out;
-
 	i.value = buffer;
 	i.value_len = value_size;
 	error = ext4_xattr_block_find(inode, &i, bs);
@@ -2597,13 +2598,18 @@ static int ext4_xattr_move_to_block(handle_t *handle, struct inode *inode,
 		goto out;
 
 	/* Add entry which was removed from the inode into the block */
+	/* Can this function remove in inode xattr automatically? */
 	error = ext4_xattr_block_set(handle, inode, &i, bs);
 	if (error)
 		goto out;
-	error = 0;
+
+	/* Remove the chosen entry from the inode */
+	error = ext4_xattr_ibody_set(handle, inode, &i, is);
+
 out:
 	kfree(b_entry_name);
-	kvfree(buffer);
+	if (entry->e_value_inum && buffer)
+		kvfree(buffer);
 	if (is)
 		brelse(is->iloc.bh);
 	if (bs)

      reply	other threads:[~2022-12-20  8:39 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-28  3:02 [PATCH v1] ext4: Remove deprecated noacl/nouser_xattr options Yang Xu
2022-07-28  9:35 ` Jan Kara
2022-08-10  8:32 ` [ext4] a70b1077cd: xfstests.ext4.053.fail kernel test robot
2022-08-10  8:32   ` kernel test robot
2022-08-16 10:11   ` Jan Kara
2022-08-16 10:11     ` Jan Kara
2022-08-17  1:58     ` xuyang2018.jy
2022-08-17  1:58       ` xuyang2018.jy
2022-09-27 21:53 ` [PATCH v1] ext4: Remove deprecated noacl/nouser_xattr options Theodore Ts'o
2022-12-16  3:41   ` Jun Nie
2022-12-16  5:47     ` Theodore Ts'o
2022-12-16 13:10       ` Theodore Ts'o
2022-12-19  9:23       ` Jun Nie
2022-12-19 11:46         ` Jan Kara
2022-12-19 16:41         ` Theodore Ts'o
2022-12-20  8:39           ` Jun Nie [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y6F0vw1ZhuHPTlzQ@niej-dt-7B47 \
    --to=jun.nie@linaro.org \
    --cc=djwong@kernel.org \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=lczerner@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=xuyang2018.jy@fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.