linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] ext4: attempt to shrink directory on dentry removal
@ 2019-08-21 18:27 Harshad Shirwadkar
  2019-08-24  2:31 ` Theodore Y. Ts'o
  2019-08-26  5:07 ` Andreas Dilger
  0 siblings, 2 replies; 9+ messages in thread
From: Harshad Shirwadkar @ 2019-08-21 18:27 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

On every dentry deletion, this patch traverses directory inode in
reverse direction and frees up empty dirent blocks until it finds a
non-empty dirent block. We leverage the fact that we never clear
dentry name when we delete dentrys by merging them with the previous
one. So, even though the dirent block has only fake dentry which spans
across the entire block, we can use the name in this dead entry to
perform dx lookup and find intermediate dx node blocks as well as
offset inside these blocks.

As of now, we only support non-indexed directories and indexed
directories with no intermediate dx nodes. This technique can also be
used to remove intermediate dx nodes. But it needs a little more
interesting logic to make that happen since we don't store directory
entry name in intermediate nodes.

Ran kvm-xfstests smoke test-suite and verified that there are no
failures. Also, verified that when all the files are deleted in a
directory, directory shrinks to either 4k for non-indexed directories
or 8k for indexed directories with 1 level.

This patch is an improvement over previous patch that I sent out
earlier this month. So, if this patch looks alright, then we can drop
the other shrinking patch.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>

---
This patch supersedes the other directory shrinking patch sent in Aug
2019 ("ext4: shrink directory when last block is empty").

 fs/ext4/namei.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 176 insertions(+)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 129029534075..00a6602605ab 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -882,6 +882,60 @@ dx_probe(struct ext4_filename *fname, struct inode *dir,
 	return ret_err;
 }
 
+/*
+ * This function tries to remove the entry of a dirent block (which was just
+ * emptied by the caller) from the dx frame. It does so by reducing the count by
+ * 1 and left shifting all the entries after the deleted entry.
+ */
+int
+ext4_remove_dx_entry(handle_t *handle, struct inode *dir,
+		     struct dx_frame *dx_frame)
+{
+	struct dx_entry *entries;
+	unsigned int count;
+	unsigned int limit;
+	int err, i;
+
+	entries = dx_frame->entries;
+	count = dx_get_count(entries);
+	limit = dx_get_limit(entries);
+
+	if (count == 1)
+		return -EINVAL;
+
+	for (i = 0; i < count; i++)
+		if (entries[i].block == cpu_to_le64(dx_get_block(dx_frame->at)))
+			break;
+
+	if (i >= count)
+		return -EINVAL;
+
+	err = ext4_journal_get_write_access(handle, dx_frame->bh);
+	if (err) {
+		ext4_std_error(dir->i_sb, err);
+		return -EINVAL;
+	}
+
+	for (; i < count - 1; i++)
+		entries[i] = entries[i + 1];
+
+	/*
+	 * If i was 0 when we began above loop, we would have overwritten count
+	 * and limit values sin ce those values live in dx_entry->hash of the
+	 * first entry. We need to update count but we should set limit as well.
+	 */
+	dx_set_count(entries, count - 1);
+	dx_set_limit(entries, limit);
+
+	err = ext4_handle_dirty_dx_node(handle, dir, dx_frame->bh);
+	if (err) {
+		ext4_std_error(dir->i_sb, err);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static void dx_release(struct dx_frame *frames)
 {
 	struct dx_root_info *info;
@@ -1409,6 +1463,19 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size,
 	return 0;
 }
 
+static inline bool is_empty_dirent_block(struct inode *dir,
+					 struct buffer_head *bh)
+{
+	struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data;
+	int	csum_size = 0;
+
+	if (ext4_has_metadata_csum(dir->i_sb))
+		csum_size = sizeof(struct ext4_dir_entry_tail);
+
+	return ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize) ==
+			dir->i_sb->s_blocksize - csum_size && de->inode == 0;
+}
+
 static int is_dx_internal_node(struct inode *dir, ext4_lblk_t block,
 			       struct ext4_dir_entry *de)
 {
@@ -2476,6 +2543,113 @@ int ext4_generic_delete_entry(handle_t *handle,
 	return -ENOENT;
 }
 
+int ext4_try_dir_shrink(handle_t *handle, struct inode *dir)
+{
+	struct dx_frame frames[EXT4_HTREE_LEVEL], *frame_ptr;
+	struct dx_root_info *info;
+	struct dx_hash_info hinfo;
+	struct ext4_dir_entry_2 *dead_de;
+	struct buffer_head *bh;
+	int err;
+	ext4_lblk_t lblk, min_lblk;
+
+	hinfo.hash_version = EXT4_SB(dir->i_sb)->s_def_hash_version;
+	if (hinfo.hash_version <= DX_HASH_TEA)
+		hinfo.hash_version +=
+				EXT4_SB(dir->i_sb)->s_hash_unsigned;
+	hinfo.seed = EXT4_SB(dir->i_sb)->s_hash_seed;
+
+	if (is_dx(dir))
+		min_lblk = 2;
+	else
+		min_lblk = 1;
+	/*
+	 * Read blocks from directory in reverse orders and clean them up one by
+	 * one!
+	 */
+	for (lblk = dir->i_size >> dir->i_sb->s_blocksize_bits;
+	     lblk > min_lblk; lblk--) {
+		bh = ext4_bread(handle, dir, lblk - 1, 0);
+		if (IS_ERR(bh))
+			return -EIO;
+
+		if (!is_empty_dirent_block(dir, bh))
+			break;
+
+		if (!is_dx(dir))
+			continue;
+
+		dead_de = (struct ext4_dir_entry_2 *)bh->b_data;
+		ext4fs_dirhash(dir, dead_de->name, dead_de->name_len, &hinfo);
+
+		frame_ptr = dx_probe(NULL, dir, &hinfo, frames);
+		if (IS_ERR(frame_ptr)) {
+			dx_release(frames);
+			break;
+		}
+
+		/*
+		 * Cross-check if the dead de helped us find the block that we
+		 * are looking to delete.
+		 */
+		if (unlikely(dx_get_block(frames[0].at) != lblk - 1)) {
+			dx_release(frames);
+			break;
+		}
+
+		info = &((struct dx_root *)frames[0].bh->b_data)->info;
+		if (info->indirect_levels > 0) {
+			/*
+			 * We don't shrink in this case. That's because the
+			 * block that we just read could very well be an index
+			 * block. If it's an index block, we need to do special
+			 * handling to delete the block. Couple of options:
+			 *
+			 * (1) After deleting any dentry, if the dirent block
+			 *     becomes emtpy, we can remove the entry of the
+			 *     dirent block from its index node. If we do that
+			 *     then after enough deletions, index block could
+			 *     become empty. The problem with this approach is
+			 *     that after removing entry of a particular hash
+			 *     value from an index block, if a new dentry maps
+			 *     to same hash value, we probably will end up
+			 *     allocating one more block. Now our initial dirent
+			 *     block becomes orphan.
+			 *
+			 * (2) If the last block in the directory inode is an
+			 *     index block, then we could check all the leaf
+			 *     dirent blocks that are part of this dirent
+			 *     block. If all of them are empty, then we can
+			 *     remove the entry of this index block from its
+			 *     parent block and truncate this index block off.
+			 *
+			 *  In both the options though, we need a way to look-up
+			 *  the parent of an index block. Since we don't store
+			 *  any dirent name in an index block, it's not easy to
+			 *  lookup the parent of an index block. But if we want
+			 *  to implement this, then we can very well store a
+			 *  dead dirent which has a name and which maps to the
+			 *  index block in question. We could then use that dead
+			 *  entry to lookup parents for the index block.
+			 */
+			dx_release(frames);
+			return -EOPNOTSUPP;
+		}
+
+		err = ext4_remove_dx_entry(handle, dir, &frames[0]);
+		dx_release(frames);
+		if (err)
+			break;
+	}
+
+	if (lblk < dir->i_size >> dir->i_sb->s_blocksize_bits) {
+		dir->i_size = lblk * dir->i_sb->s_blocksize;
+		ext4_truncate(dir);
+	}
+
+	return 0;
+}
+
 static int ext4_delete_entry(handle_t *handle,
 			     struct inode *dir,
 			     struct ext4_dir_entry_2 *de_del,
@@ -2510,6 +2684,8 @@ static int ext4_delete_entry(handle_t *handle,
 	if (unlikely(err))
 		goto out;
 
+	ext4_try_dir_shrink(handle, dir);
+
 	return 0;
 out:
 	if (err != -ENOENT)
-- 
2.23.0.rc1.153.gdeed80330f-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: attempt to shrink directory on dentry removal
  2019-08-21 18:27 [PATCH] ext4: attempt to shrink directory on dentry removal Harshad Shirwadkar
@ 2019-08-24  2:31 ` Theodore Y. Ts'o
  2019-08-26  2:46   ` harshad shirwadkar
  2019-08-26  5:07 ` Andreas Dilger
  1 sibling, 1 reply; 9+ messages in thread
From: Theodore Y. Ts'o @ 2019-08-24  2:31 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Wed, Aug 21, 2019 at 11:27:40AM -0700, Harshad Shirwadkar wrote:
> As of now, we only support non-indexed directories and indexed
> directories with no intermediate dx nodes. 

From my testing, it doen't work on non-indexed directories; the
problem is in the is_empty_dirent_block() function.

> This technique can also be used to remove intermediate dx nodes. But
> it needs a little more interesting logic to make that happen since
> we don't store directory entry name in intermediate nodes.

Actually, that's not the problem.  An empty dx node is not allowed; in
a two-level htree directory, if removing an empty leaf node becomes
causes its parent dx node to become empty, we need to remove the
parent dx node immediately.  Which is fine, because when we did the
dx_probe, we know the path from the root to the leaf node.

But removing the parent dx node means we need to be able to remove an
empty block from the directory, regardless whether it is at the end of
the directory or not.  OK, so how to do that?

First of all, how to do find any directory block's parent?  If it is a
leaf node, we can simply calculate the hash on the first directory
entry (whether it is "dead" or not), and then do a lookup based on
that hash.  For an intermediate dx node, we can do the same thing,
since a dx node is simply a list of hashes and block numbers.  The
first hash in the dx node can be used to do a htree lookup, and that
will give us the full path from the root of the htree to the dx node.

So, suppose we delete a file, and that causes a leaf node to become
empty.  We can actually shrink the directory right then and there, and
not wait until the last block in the directory beomes empty.  How do we do that?

1) We'll call the logical block number of that empty leaf block Empty,
   and the block number of the parent of that empty leaf block Parent(Empty).
   We'll also call the logical block number of the last entry in the
   directory, Last.

2) Remove the pointer to Empty from Parent(Empty).  If that causes
   Parent(Empty) to become empty, we also need to remove the
   reference to Parent(Empty) in Parent(Parent(Empty)).  (And for 3
   level htree directories, which are present in Largedir
   directories, we might also need to do that one more level up.)

3) To free the directory block Empty, if Empty != Last, we copy the
   contents of Last into the directory block Empty, and then determine
   Parent(Last), and find the pointer to Last, and update it to be
   Empty.  At this point, the last directory block is not in use, so
   we can release the last directory block, and shrink the size of the
   directory by one block.

4) If we need to free Parent(Empty) because it was emptied by step 2,
   follow the procedure in step 3.

This has a couple of benefits.  The first is that it will work for
large dx directories, where directory shrinking is most needed.
Secondly, also allows the directory shrinking to take gradually place
as the directory is emptied, instead waiting until last directory
block is empty.  (And for multi-level dx directories that have been
optimized via e2fsck -fD, the above is needed to allow shrinking from
the end won't work at all.  Why that is the case is left as an
exercise to the reader.  Hint: try creating a very large directory and
optimize it using e2fsck -fD, and see where the index nodes end up.)

BTW, for non-indexed directories, we must always shrink from the end.
We can't play games with swapping with the last directory block,
unless we can guarantee that (a) there are no open fd's on the
directories (since there might be telldir/seekdir cookies that have to
stay valid), and (b) the directory isn't exported via NFS.  Given that
establishes these guarantees is tricky, and most of the file systems
we care about will have indexing enabled, I'm not *that* concerned
about making directory shrinking working well for non-indexed
directories.  (Which will tend to only exist from ext2 file systems.)


+static inline bool is_empty_dirent_block(struct inode *dir,
+					 struct buffer_head *bh)
+{
+	struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data;
+	int	csum_size = 0;
+
+	if (ext4_has_metadata_csum(dir->i_sb))
+		csum_size = sizeof(struct ext4_dir_entry_tail);

The dx_tail only exists for indexed directories.  So the if statement
should read:

	if (ext4_has_metadata_csum(dir->i_sb) && is_dx(dir))


> +	/*
> +	 * If i was 0 when we began above loop, we would have overwritten count
> +	 * and limit values sin ce those values live in dx_entry->hash of the

s/sin ce/since/


> +	/*
> +	 * Read blocks from directory in reverse orders and clean them up one by
> +	 * one!
> +	 */

s/reverse orders/reverse order/


> +		info = &((struct dx_root *)frames[0].bh->b_data)->info;
> +		if (info->indirect_levels > 0) {
> +			/*
> +			 * We don't shrink in this case. That's because the
> +			 * block that we just read could very well be an index
> +			 * block. If it's an index block, we need to do special
> +			 * handling to delete the block. 

Please delete this comment, and replace it with "we don't support dx
directories with a depth > 2 for now".  See the above discussion...

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: attempt to shrink directory on dentry removal
  2019-08-24  2:31 ` Theodore Y. Ts'o
@ 2019-08-26  2:46   ` harshad shirwadkar
  2019-08-26  3:19     ` Theodore Y. Ts'o
  0 siblings, 1 reply; 9+ messages in thread
From: harshad shirwadkar @ 2019-08-26  2:46 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Ext4 Developers List

On Fri, Aug 23, 2019 at 7:31 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Wed, Aug 21, 2019 at 11:27:40AM -0700, Harshad Shirwadkar wrote:
> > As of now, we only support non-indexed directories and indexed
> > directories with no intermediate dx nodes.
>
> From my testing, it doen't work on non-indexed directories; the
> problem is in the is_empty_dirent_block() function.
>
> > This technique can also be used to remove intermediate dx nodes. But
> > it needs a little more interesting logic to make that happen since
> > we don't store directory entry name in intermediate nodes.
>
> Actually, that's not the problem.  An empty dx node is not allowed; in
> a two-level htree directory, if removing an empty leaf node becomes
> causes its parent dx node to become empty, we need to remove the
> parent dx node immediately.  Which is fine, because when we did the
> dx_probe, we know the path from the root to the leaf node.
>
> But removing the parent dx node means we need to be able to remove an
> empty block from the directory, regardless whether it is at the end of
> the directory or not.  OK, so how to do that?
>
> First of all, how to do find any directory block's parent?  If it is a
> leaf node, we can simply calculate the hash on the first directory
> entry (whether it is "dead" or not), and then do a lookup based on
> that hash.  For an intermediate dx node, we can do the same thing,
> since a dx node is simply a list of hashes and block numbers.  The
> first hash in the dx node can be used to do a htree lookup, and that
> will give us the full path from the root of the htree to the dx node.
>
> So, suppose we delete a file, and that causes a leaf node to become
> empty.  We can actually shrink the directory right then and there, and
> not wait until the last block in the directory beomes empty.  How do we do that?
>
> 1) We'll call the logical block number of that empty leaf block Empty,
>    and the block number of the parent of that empty leaf block Parent(Empty).
>    We'll also call the logical block number of the last entry in the
>    directory, Last.
>
> 2) Remove the pointer to Empty from Parent(Empty).  If that causes
>    Parent(Empty) to become empty, we also need to remove the
>    reference to Parent(Empty) in Parent(Parent(Empty)).  (And for 3
>    level htree directories, which are present in Largedir
>    directories, we might also need to do that one more level up.)
>
> 3) To free the directory block Empty, if Empty != Last, we copy the
>    contents of Last into the directory block Empty, and then determine
>    Parent(Last), and find the pointer to Last, and update it to be
>    Empty.  At this point, the last directory block is not in use, so
>    we can release the last directory block, and shrink the size of the
>    directory by one block.

If last is an intermediate dx node, is there a way to find out if it
actually is an intermediate dx node? Because an empty dirent block and
an intermediate dx block look the same. Unless we do dx_probe() there
is no way to know if a block is an intermediate dx block. Is that
right or am I missing something?

Looking at your following comment, if metadata_csum feature is
enabled, then we can distinguish if a block is an empty dirent block
or an index block based on dentry->rec_len. If metadata csum is
enabled, then for index blocks, fake_dentry->rec_len is set to
blocksize while for a dirent not dentry->rec_len is set to blocksize -
sizeof(ext4_dir_entry_tail). Is my understanding correct?

>
> 4) If we need to free Parent(Empty) because it was emptied by step 2,
>    follow the procedure in step 3.
>
> This has a couple of benefits.  The first is that it will work for
> large dx directories, where directory shrinking is most needed.
> Secondly, also allows the directory shrinking to take gradually place
> as the directory is emptied, instead waiting until last directory
> block is empty.  (And for multi-level dx directories that have been
> optimized via e2fsck -fD, the above is needed to allow shrinking from
> the end won't work at all.  Why that is the case is left as an
> exercise to the reader.  Hint: try creating a very large directory and
> optimize it using e2fsck -fD, and see where the index nodes end up.)
>
> BTW, for non-indexed directories, we must always shrink from the end.
> We can't play games with swapping with the last directory block,
> unless we can guarantee that (a) there are no open fd's on the
> directories (since there might be telldir/seekdir cookies that have to
> stay valid), and (b) the directory isn't exported via NFS.  Given that
> establishes these guarantees is tricky, and most of the file systems
> we care about will have indexing enabled, I'm not *that* concerned
> about making directory shrinking working well for non-indexed
> directories.  (Which will tend to only exist from ext2 file systems.)
>
>
> +static inline bool is_empty_dirent_block(struct inode *dir,
> +                                        struct buffer_head *bh)
> +{
> +       struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data;
> +       int     csum_size = 0;
> +
> +       if (ext4_has_metadata_csum(dir->i_sb))
> +               csum_size = sizeof(struct ext4_dir_entry_tail);
>
> The dx_tail only exists for indexed directories.  So the if statement
> should read:
>
>         if (ext4_has_metadata_csum(dir->i_sb) && is_dx(dir))
>
Ah I see, thanks. I didn't have metadata csum enabled so it worked for
me. Will fix it.

Thanks,
Harshad

>
> > +     /*
> > +      * If i was 0 when we began above loop, we would have overwritten count
> > +      * and limit values sin ce those values live in dx_entry->hash of the
>
> s/sin ce/since/
>
>
> > +     /*
> > +      * Read blocks from directory in reverse orders and clean them up one by
> > +      * one!
> > +      */
>
> s/reverse orders/reverse order/
>
>
> > +             info = &((struct dx_root *)frames[0].bh->b_data)->info;
> > +             if (info->indirect_levels > 0) {
> > +                     /*
> > +                      * We don't shrink in this case. That's because the
> > +                      * block that we just read could very well be an index
> > +                      * block. If it's an index block, we need to do special
> > +                      * handling to delete the block.
>
> Please delete this comment, and replace it with "we don't support dx
> directories with a depth > 2 for now".  See the above discussion...
>
> Cheers,
>
>                                         - Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: attempt to shrink directory on dentry removal
  2019-08-26  2:46   ` harshad shirwadkar
@ 2019-08-26  3:19     ` Theodore Y. Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Y. Ts'o @ 2019-08-26  3:19 UTC (permalink / raw)
  To: harshad shirwadkar; +Cc: Ext4 Developers List

On Sun, Aug 25, 2019 at 07:46:50PM -0700, harshad shirwadkar wrote:
> If last is an intermediate dx node, is there a way to find out if it
> actually is an intermediate dx node? Because an empty dirent block and
> an intermediate dx block look the same. Unless we do dx_probe() there
> is no way to know if a block is an intermediate dx block. Is that
> right or am I missing something?

You can simply look at the first hash value in the intermediate
dx_node (remember, an empty intermediate node is not allowed), and
then do a dx_probe to search from the root and validate that we find
our way back to the last block.

> Looking at your following comment, if metadata_csum feature is
> enabled, then we can distinguish if a block is an empty dirent block
> or an index block based on dentry->rec_len. If metadata csum is
> enabled, then for index blocks, fake_dentry->rec_len is set to
> blocksize while for a dirent not dentry->rec_len is set to blocksize -
> sizeof(ext4_dir_entry_tail). Is my understanding correct?

Yes.  Although even if the metadata_csum feature is enabled, it's a
good idea to search from the root to make sure this really is the
intermediate dx node block that you are looking for.  You need to do
the dx_probe() to find its parent block anyway, in order to update it.
And if you don't find a pointer to that intermediate node, then it
must not be an a dx node --- or the htree pointers are corrupted.  In
the case of metadata_csum and a dx_tail, that sequence should never
occur normally, so if you don't find an entry for that block when
doing a dx_probe(), it's likely the directory structures have gotten
corrupted.

						- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: attempt to shrink directory on dentry removal
  2019-08-21 18:27 [PATCH] ext4: attempt to shrink directory on dentry removal Harshad Shirwadkar
  2019-08-24  2:31 ` Theodore Y. Ts'o
@ 2019-08-26  5:07 ` Andreas Dilger
  2019-08-26 21:46   ` harshad shirwadkar
  2019-08-27  2:27   ` Theodore Y. Ts'o
  1 sibling, 2 replies; 9+ messages in thread
From: Andreas Dilger @ 2019-08-26  5:07 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

[-- Attachment #1: Type: text/plain, Size: 4949 bytes --]

On Aug 21, 2019, at 12:27 PM, Harshad Shirwadkar <harshadshirwadkar@gmail.com> wrote:
> 
> On every dentry deletion, this patch traverses directory inode in
> reverse direction and frees up empty dirent blocks until it finds a
> non-empty dirent block. We leverage the fact that we never clear
> dentry name when we delete dentrys by merging them with the previous
> one. So, even though the dirent block has only fake dentry which spans
> across the entire block, we can use the name in this dead entry to
> perform dx lookup and find intermediate dx node blocks as well as
> offset inside these blocks.


One high-level limitation with this implementation is that it is unlikely
to remove any directory blocks until the directory is almost entirely
empty since "rm -r" will return entries in hash order, which does not
match the order that the leaf blocks are allocated in the file.  Even
worse, if files in the directory are not deleted in hash order, no leaf
block will be completely empty until about 99% of the files have been
deleted - assume 24-byte filenames in 4096-byte blocks means up to 128
entries per block, typically 3/4 full, or 1/96 entries will be left in
each block before it becomes empty.

One option that was discussed in the past is to use the high 4 bits
of dx_entry->block (i.e. the opposite of dx_get_block()) to store the
"fullness" of each block (in 1/16th of a block, or 256-byte increments
for 4096-byte blocks) and try to merge entries into an adjacent block
if it becomes mostly empty (e.g. if the current block plus the neighbour
are below 50% full).  That allows removing blocks much earlier as the
directory shrinks, rather than waiting until each block is completely
empty.  A fullness of "0" would mean "unset", since we don't set it
yet, and once this feature is available there would never be a block
that is entirely empty.

> As of now, we only support non-indexed directories and indexed
> directories with no intermediate dx nodes. This technique can also be
> used to remove intermediate dx nodes. But it needs a little more
> interesting logic to make that happen since we don't store directory
> entry name in intermediate nodes.
> 
> Ran kvm-xfstests smoke test-suite and verified that there are no
> failures. Also, verified that when all the files are deleted in a
> directory, directory shrinks to either 4k for non-indexed directories
> or 8k for indexed directories with 1 level.
> 
> This patch is an improvement over previous patch that I sent out
> earlier this month. So, if this patch looks alright, then we can drop
> the other shrinking patch.
> 

> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> 
> ---
> This patch supersedes the other directory shrinking patch sent in Aug
> 2019 ("ext4: shrink directory when last block is empty").
> 
> fs/ext4/namei.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 176 insertions(+)
> 
> 
> 
> +static inline bool is_empty_dirent_block(struct inode *dir,
> +					 struct buffer_head *bh)
> +{
> +	struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data;
> +	int	csum_size = 0;
> +
> +	if (ext4_has_metadata_csum(dir->i_sb))
> +		csum_size = sizeof(struct ext4_dir_entry_tail);
> +
> +	return ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize) ==
> +			dir->i_sb->s_blocksize - csum_size && de->inode == 0;
> +}

This may not always detect empty directory blocks properly, because
ext4_generic_delete_entry() will only merge deleted entries with the
previous entry.  It at least appears possible that if entries are not
deleted in the proper order (e.g. in reverse of the order they are
listed in the directory) there may be multiple empty entries in a block,
and the above check will fail.

Instead, this checks should walk all entries in a block and return false
if any one of them has a non-zero de->inode.  In the common case there
may be only a single entry, or the first entry will be used, so it
should be fairly quick to decide that the block cannot be removed.

Another option is to change ext4_generic_delete_entry() to also try
to merge with the immediately following entry to ensure that an empty
block always has rec_len of the full blocksize.  However, I think this
is probably not a worthwhile effort since it would be better to support
removing blocks that are partly empty rather than entirely empty.

> @@ -2510,6 +2684,8 @@ static int ext4_delete_entry(handle_t *handle,
> 	if (unlikely(err))
> 		goto out;
> 
> +	ext4_try_dir_shrink(handle, dir);
> +
> 	return 0;

I think it would be inefficient to try shrinking the directory after
_every_ directory entry is removed.  Instead, there should be some
way to determine here if ext4_generic_delete_entry() removed the last
entry from the directory block, and only shrink in that case.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: attempt to shrink directory on dentry removal
  2019-08-26  5:07 ` Andreas Dilger
@ 2019-08-26 21:46   ` harshad shirwadkar
  2019-08-27  2:38     ` Theodore Y. Ts'o
  2019-08-27 20:53     ` Andreas Dilger
  2019-08-27  2:27   ` Theodore Y. Ts'o
  1 sibling, 2 replies; 9+ messages in thread
From: harshad shirwadkar @ 2019-08-26 21:46 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Ext4 Developers List

I see, this is an interesting approach. I think we can do this without
modifying on-disk format and by bearing the cost of 2 extra reads per
merge. Whenever dentry deletion results in a dirent block that is
_sufficiently_ empty, we can look at its parent dx block and and find
its neighbors that can be merged. We can be aggressive and do this for
every dentry deletion or we could do this whenever the current dirent
block is half empty or something.

By this method we end up reading up to 2 extra blocks (one previous
and one next) that are not going to be merged. That's the trade-off we
have to make in order to avoid any changes to on-disk structure (If we
modify the on-disk structure and store the fullness in the dx block,
we would read only the blocks that need to be merged).

The same logic can also be applied to intermediate dx nodes as well.
After every merge operation, we'll have a set of blocks that need to
be freed. Once we know what blocks we can free, we can use Ted's idea
of swapping them with the last block, one by one.

Since merging approach also requires a way to free up directory
blocks, I think we could first get a patch in that can free up
directory blocks by swapping with the last block. Once we have that
then we could implement merging.

On Sun, Aug 25, 2019 at 10:07 PM Andreas Dilger <adilger@dilger.ca> wrote:
>
> On Aug 21, 2019, at 12:27 PM, Harshad Shirwadkar <harshadshirwadkar@gmail.com> wrote:
> >
> > On every dentry deletion, this patch traverses directory inode in
> > reverse direction and frees up empty dirent blocks until it finds a
> > non-empty dirent block. We leverage the fact that we never clear
> > dentry name when we delete dentrys by merging them with the previous
> > one. So, even though the dirent block has only fake dentry which spans
> > across the entire block, we can use the name in this dead entry to
> > perform dx lookup and find intermediate dx node blocks as well as
> > offset inside these blocks.
>
>
> One high-level limitation with this implementation is that it is unlikely
> to remove any directory blocks until the directory is almost entirely
> empty since "rm -r" will return entries in hash order, which does not
> match the order that the leaf blocks are allocated in the file.  Even
> worse, if files in the directory are not deleted in hash order, no leaf
> block will be completely empty until about 99% of the files have been
> deleted - assume 24-byte filenames in 4096-byte blocks means up to 128
> entries per block, typically 3/4 full, or 1/96 entries will be left in
> each block before it becomes empty.
>
> One option that was discussed in the past is to use the high 4 bits
> of dx_entry->block (i.e. the opposite of dx_get_block()) to store the
> "fullness" of each block (in 1/16th of a block, or 256-byte increments
> for 4096-byte blocks) and try to merge entries into an adjacent block
> if it becomes mostly empty (e.g. if the current block plus the neighbour
> are below 50% full).  That allows removing blocks much earlier as the
> directory shrinks, rather than waiting until each block is completely
> empty.  A fullness of "0" would mean "unset", since we don't set it
> yet, and once this feature is available there would never be a block
> that is entirely empty.
>
> > As of now, we only support non-indexed directories and indexed
> > directories with no intermediate dx nodes. This technique can also be
> > used to remove intermediate dx nodes. But it needs a little more
> > interesting logic to make that happen since we don't store directory
> > entry name in intermediate nodes.
> >
> > Ran kvm-xfstests smoke test-suite and verified that there are no
> > failures. Also, verified that when all the files are deleted in a
> > directory, directory shrinks to either 4k for non-indexed directories
> > or 8k for indexed directories with 1 level.
> >
> > This patch is an improvement over previous patch that I sent out
> > earlier this month. So, if this patch looks alright, then we can drop
> > the other shrinking patch.
> >
>
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> >
> > ---
> > This patch supersedes the other directory shrinking patch sent in Aug
> > 2019 ("ext4: shrink directory when last block is empty").
> >
> > fs/ext4/namei.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 176 insertions(+)
> >
> >
> >
> > +static inline bool is_empty_dirent_block(struct inode *dir,
> > +                                      struct buffer_head *bh)
> > +{
> > +     struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data;
> > +     int     csum_size = 0;
> > +
> > +     if (ext4_has_metadata_csum(dir->i_sb))
> > +             csum_size = sizeof(struct ext4_dir_entry_tail);
> > +
> > +     return ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize) ==
> > +                     dir->i_sb->s_blocksize - csum_size && de->inode == 0;
> > +}
>
> This may not always detect empty directory blocks properly, because
> ext4_generic_delete_entry() will only merge deleted entries with the
> previous entry.  It at least appears possible that if entries are not
> deleted in the proper order (e.g. in reverse of the order they are
> listed in the directory) there may be multiple empty entries in a block,
> and the above check will fail.
>
> Instead, this checks should walk all entries in a block and return false
> if any one of them has a non-zero de->inode.  In the common case there
> may be only a single entry, or the first entry will be used, so it
> should be fairly quick to decide that the block cannot be removed.
>
> Another option is to change ext4_generic_delete_entry() to also try
> to merge with the immediately following entry to ensure that an empty
> block always has rec_len of the full blocksize.  However, I think this
> is probably not a worthwhile effort since it would be better to support
> removing blocks that are partly empty rather than entirely empty.
>
> > @@ -2510,6 +2684,8 @@ static int ext4_delete_entry(handle_t *handle,
> >       if (unlikely(err))
> >               goto out;
> >
> > +     ext4_try_dir_shrink(handle, dir);
> > +
> >       return 0;
>
> I think it would be inefficient to try shrinking the directory after
> _every_ directory entry is removed.  Instead, there should be some
> way to determine here if ext4_generic_delete_entry() removed the last
> entry from the directory block, and only shrink in that case.
>
> Cheers, Andreas
>
>
>
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: attempt to shrink directory on dentry removal
  2019-08-26  5:07 ` Andreas Dilger
  2019-08-26 21:46   ` harshad shirwadkar
@ 2019-08-27  2:27   ` Theodore Y. Ts'o
  1 sibling, 0 replies; 9+ messages in thread
From: Theodore Y. Ts'o @ 2019-08-27  2:27 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Harshad Shirwadkar, linux-ext4

On Sun, Aug 25, 2019 at 11:07:46PM -0600, Andreas Dilger wrote:
> This may not always detect empty directory blocks properly, because
> ext4_generic_delete_entry() will only merge deleted entries with the
> previous entry.  It at least appears possible that if entries are not
> deleted in the proper order (e.g. in reverse of the order they are
> listed in the directory) there may be multiple empty entries in a block,
> and the above check will fail.

I don't think that's a problem.  We always merge with the previous
entry, whether it's an empty/deleted entry or an in-use entry.  So
long as all implementations do this, it works just fine.  If there is
an ext2/3/4 implementation which deletes the entry by simply clearing
the inode number *without* merging with the previous one, it's
possible that we might get confused.

But that's easily fixed, too.  In ext4_generic_delete_entry(), we just
need to add a check so that we check to see if the subsequent entry
(if it exists) has a zero de->inode value.  If so, then we absorb the
current directory entry to include the deleted subsequent entry and
repeat the check.

						- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: attempt to shrink directory on dentry removal
  2019-08-26 21:46   ` harshad shirwadkar
@ 2019-08-27  2:38     ` Theodore Y. Ts'o
  2019-08-27 20:53     ` Andreas Dilger
  1 sibling, 0 replies; 9+ messages in thread
From: Theodore Y. Ts'o @ 2019-08-27  2:38 UTC (permalink / raw)
  To: harshad shirwadkar; +Cc: Andreas Dilger, Ext4 Developers List

On Mon, Aug 26, 2019 at 02:46:01PM -0700, harshad shirwadkar wrote:
> By this method we end up reading up to 2 extra blocks (one previous
> and one next) that are not going to be merged. That's the trade-off we
> have to make in order to avoid any changes to on-disk structure (If we
> modify the on-disk structure and store the fullness in the dx block,
> we would read only the blocks that need to be merged).

We can also adjust the merging strategy depending on whether the
previous and/or next blocks are in memory.  If they are in memory,
that we might try merging if the block is < 50% full.  If they are not
in memory, it might not be worth doing the read until the block is
empty, or maybe, say, 10% full.

> Since merging approach also requires a way to free up directory
> blocks, I think we could first get a patch in that can free up
> directory blocks by swapping with the last block. Once we have that
> then we could implement merging.

I agree; I'd wait to implementing merging until we get directory block
removal working.  Simply trying to shrink the directory when a leaf
block which is *not* the last block in an indexed directory is going
to be substantially better compared to waiting until the last block in
the directory is empty.

						- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ext4: attempt to shrink directory on dentry removal
  2019-08-26 21:46   ` harshad shirwadkar
  2019-08-27  2:38     ` Theodore Y. Ts'o
@ 2019-08-27 20:53     ` Andreas Dilger
  1 sibling, 0 replies; 9+ messages in thread
From: Andreas Dilger @ 2019-08-27 20:53 UTC (permalink / raw)
  To: harshad shirwadkar; +Cc: Ext4 Developers List

[-- Attachment #1: Type: text/plain, Size: 8681 bytes --]

On Aug 26, 2019, at 3:46 PM, harshad shirwadkar <harshadshirwadkar@gmail.com> wrote:
> 
> I see, this is an interesting approach. I think we can do this without
> modifying on-disk format and by bearing the cost of 2 extra reads per
> merge. Whenever dentry deletion results in a dirent block that is
> _sufficiently_ empty, we can look at its parent dx block and and find
> its neighbors that can be merged. We can be aggressive and do this for
> every dentry deletion or we could do this whenever the current dirent
> block is half empty or something.

The point of storing the "leaf block fullness" into the high bits of
dx_entry->block is that this information would always be available
during htree traversal, even if the leaf blocks are not in memory.
That would allow the unlink code to make a decision in advance whether
shrinking is going to happen (e.g. leaf block plus one neighbour are
under 50% full).

The overhead would be to keep the "leaf block fullness" value updated
enough to be useful.  In this regard, a lower-resolution field (maybe
1 or 2 bits is enough, representing { unset, <= 25% full }, or { unset,
<= 20% full, <= 40% full, > 40% full }.  Anything above this threshold
isn't interesting to consider merging, so there is no need to track it.

> By this method we end up reading up to 2 extra blocks (one previous
> and one next) that are not going to be merged. That's the trade-off we
> have to make in order to avoid any changes to on-disk structure (If we
> modify the on-disk structure and store the fullness in the dx block,
> we would read only the blocks that need to be merged).

Right.  As Ted wrote, we could do the shrinking optimistically if the
dx_entry field is "unset", but the blocks are in memory.  However, in
when the leaf block is first read into memory, we could also update the
dx_entry fullness when it is sanity checked, so that we don't need to
keep checking whether the leaf block is empty or not.  The dx_entry
update could be done only in memory and written opportunistically to
disk if the htree index block is modified.  At worst, if the dx_entry
fullness is incorrect the merge could fail and the block is not freed.

> The same logic can also be applied to intermediate dx nodes as well.
> After every merge operation, we'll have a set of blocks that need to
> be freed. Once we know what blocks we can free, we can use Ted's idea
> of swapping them with the last block, one by one.
> 
> Since merging approach also requires a way to free up directory
> blocks, I think we could first get a patch in that can free up
> directory blocks by swapping with the last block. Once we have that
> then we could implement merging.

Yes, definitely.  I'm not suggesting that we don't need the ability
to free leaf/index blocks first.  I just wanted to point out that we
will not see any blocks being freed until the directory is almost
completely (99%) empty.  For testing purposes you could use 1KB blocks
and 250-character names (which means at most 3 entries per block), and
we would start seeing blocks being freed once the directory was below
about 30% full, but this does not represent a common use case.

Cheers, Andreas

> On Sun, Aug 25, 2019 at 10:07 PM Andreas Dilger <adilger@dilger.ca> wrote:
>> 
>> On Aug 21, 2019, at 12:27 PM, Harshad Shirwadkar <harshadshirwadkar@gmail.com> wrote:
>>> 
>>> On every dentry deletion, this patch traverses directory inode in
>>> reverse direction and frees up empty dirent blocks until it finds a
>>> non-empty dirent block. We leverage the fact that we never clear
>>> dentry name when we delete dentrys by merging them with the previous
>>> one. So, even though the dirent block has only fake dentry which spans
>>> across the entire block, we can use the name in this dead entry to
>>> perform dx lookup and find intermediate dx node blocks as well as
>>> offset inside these blocks.
>> 
>> 
>> One high-level limitation with this implementation is that it is unlikely
>> to remove any directory blocks until the directory is almost entirely
>> empty since "rm -r" will return entries in hash order, which does not
>> match the order that the leaf blocks are allocated in the file.  Even
>> worse, if files in the directory are not deleted in hash order, no leaf
>> block will be completely empty until about 99% of the files have been
>> deleted - assume 24-byte filenames in 4096-byte blocks means up to 128
>> entries per block, typically 3/4 full, or 1/96 entries will be left in
>> each block before it becomes empty.
>> 
>> One option that was discussed in the past is to use the high 4 bits
>> of dx_entry->block (i.e. the opposite of dx_get_block()) to store the
>> "fullness" of each block (in 1/16th of a block, or 256-byte increments
>> for 4096-byte blocks) and try to merge entries into an adjacent block
>> if it becomes mostly empty (e.g. if the current block plus the neighbour
>> are below 50% full).  That allows removing blocks much earlier as the
>> directory shrinks, rather than waiting until each block is completely
>> empty.  A fullness of "0" would mean "unset", since we don't set it
>> yet, and once this feature is available there would never be a block
>> that is entirely empty.
>> 
>>> As of now, we only support non-indexed directories and indexed
>>> directories with no intermediate dx nodes. This technique can also be
>>> used to remove intermediate dx nodes. But it needs a little more
>>> interesting logic to make that happen since we don't store directory
>>> entry name in intermediate nodes.
>>> 
>>> Ran kvm-xfstests smoke test-suite and verified that there are no
>>> failures. Also, verified that when all the files are deleted in a
>>> directory, directory shrinks to either 4k for non-indexed directories
>>> or 8k for indexed directories with 1 level.
>>> 
>>> This patch is an improvement over previous patch that I sent out
>>> earlier this month. So, if this patch looks alright, then we can drop
>>> the other shrinking patch.
>>> 
>> 
>>> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
>>> 
>>> ---
>>> This patch supersedes the other directory shrinking patch sent in Aug
>>> 2019 ("ext4: shrink directory when last block is empty").
>>> 
>>> fs/ext4/namei.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 176 insertions(+)
>>> 
>>> 
>>> 
>>> +static inline bool is_empty_dirent_block(struct inode *dir,
>>> +                                      struct buffer_head *bh)
>>> +{
>>> +     struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data;
>>> +     int     csum_size = 0;
>>> +
>>> +     if (ext4_has_metadata_csum(dir->i_sb))
>>> +             csum_size = sizeof(struct ext4_dir_entry_tail);
>>> +
>>> +     return ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize) ==
>>> +                     dir->i_sb->s_blocksize - csum_size && de->inode == 0;
>>> +}
>> 
>> This may not always detect empty directory blocks properly, because
>> ext4_generic_delete_entry() will only merge deleted entries with the
>> previous entry.  It at least appears possible that if entries are not
>> deleted in the proper order (e.g. in reverse of the order they are
>> listed in the directory) there may be multiple empty entries in a block,
>> and the above check will fail.
>> 
>> Instead, this checks should walk all entries in a block and return false
>> if any one of them has a non-zero de->inode.  In the common case there
>> may be only a single entry, or the first entry will be used, so it
>> should be fairly quick to decide that the block cannot be removed.
>> 
>> Another option is to change ext4_generic_delete_entry() to also try
>> to merge with the immediately following entry to ensure that an empty
>> block always has rec_len of the full blocksize.  However, I think this
>> is probably not a worthwhile effort since it would be better to support
>> removing blocks that are partly empty rather than entirely empty.
>> 
>>> @@ -2510,6 +2684,8 @@ static int ext4_delete_entry(handle_t *handle,
>>>      if (unlikely(err))
>>>              goto out;
>>> 
>>> +     ext4_try_dir_shrink(handle, dir);
>>> +
>>>      return 0;
>> 
>> I think it would be inefficient to try shrinking the directory after
>> _every_ directory entry is removed.  Instead, there should be some
>> way to determine here if ext4_generic_delete_entry() removed the last
>> entry from the directory block, and only shrink in that case.
>> 
>> Cheers, Andreas
>> 
>> 
>> 
>> 
>> 


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-08-27 20:53 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-21 18:27 [PATCH] ext4: attempt to shrink directory on dentry removal Harshad Shirwadkar
2019-08-24  2:31 ` Theodore Y. Ts'o
2019-08-26  2:46   ` harshad shirwadkar
2019-08-26  3:19     ` Theodore Y. Ts'o
2019-08-26  5:07 ` Andreas Dilger
2019-08-26 21:46   ` harshad shirwadkar
2019-08-27  2:38     ` Theodore Y. Ts'o
2019-08-27 20:53     ` Andreas Dilger
2019-08-27  2:27   ` Theodore Y. Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).