On Aug 26, 2019, at 3:46 PM, harshad shirwadkar <harshadshirwadkar@gmail.com> wrote:
> 
> I see, this is an interesting approach. I think we can do this without
> modifying on-disk format and by bearing the cost of 2 extra reads per
> merge. Whenever dentry deletion results in a dirent block that is
> _sufficiently_ empty, we can look at its parent dx block and and find
> its neighbors that can be merged. We can be aggressive and do this for
> every dentry deletion or we could do this whenever the current dirent
> block is half empty or something.

The point of storing the "leaf block fullness" into the high bits of
dx_entry->block is that this information would always be available
during htree traversal, even if the leaf blocks are not in memory.
That would allow the unlink code to make a decision in advance whether
shrinking is going to happen (e.g. leaf block plus one neighbour are
under 50% full).

The overhead would be to keep the "leaf block fullness" value updated
enough to be useful.  In this regard, a lower-resolution field (maybe
1 or 2 bits is enough, representing { unset, <= 25% full }, or { unset,
<= 20% full, <= 40% full, > 40% full }.  Anything above this threshold
isn't interesting to consider merging, so there is no need to track it.

> By this method we end up reading up to 2 extra blocks (one previous
> and one next) that are not going to be merged. That's the trade-off we
> have to make in order to avoid any changes to on-disk structure (If we
> modify the on-disk structure and store the fullness in the dx block,
> we would read only the blocks that need to be merged).

Right.  As Ted wrote, we could do the shrinking optimistically if the
dx_entry field is "unset", but the blocks are in memory.  However, in
when the leaf block is first read into memory, we could also update the
dx_entry fullness when it is sanity checked, so that we don't need to
keep checking whether the leaf block is empty or not.  The dx_entry
update could be done only in memory and written opportunistically to
disk if the htree index block is modified.  At worst, if the dx_entry
fullness is incorrect the merge could fail and the block is not freed.

> The same logic can also be applied to intermediate dx nodes as well.
> After every merge operation, we'll have a set of blocks that need to
> be freed. Once we know what blocks we can free, we can use Ted's idea
> of swapping them with the last block, one by one.
> 
> Since merging approach also requires a way to free up directory
> blocks, I think we could first get a patch in that can free up
> directory blocks by swapping with the last block. Once we have that
> then we could implement merging.

Yes, definitely.  I'm not suggesting that we don't need the ability
to free leaf/index blocks first.  I just wanted to point out that we
will not see any blocks being freed until the directory is almost
completely (99%) empty.  For testing purposes you could use 1KB blocks
and 250-character names (which means at most 3 entries per block), and
we would start seeing blocks being freed once the directory was below
about 30% full, but this does not represent a common use case.

Cheers, Andreas

> On Sun, Aug 25, 2019 at 10:07 PM Andreas Dilger <adilger@dilger.ca> wrote:
>> 
>> On Aug 21, 2019, at 12:27 PM, Harshad Shirwadkar <harshadshirwadkar@gmail.com> wrote:
>>> 
>>> On every dentry deletion, this patch traverses directory inode in
>>> reverse direction and frees up empty dirent blocks until it finds a
>>> non-empty dirent block. We leverage the fact that we never clear
>>> dentry name when we delete dentrys by merging them with the previous
>>> one. So, even though the dirent block has only fake dentry which spans
>>> across the entire block, we can use the name in this dead entry to
>>> perform dx lookup and find intermediate dx node blocks as well as
>>> offset inside these blocks.
>> 
>> 
>> One high-level limitation with this implementation is that it is unlikely
>> to remove any directory blocks until the directory is almost entirely
>> empty since "rm -r" will return entries in hash order, which does not
>> match the order that the leaf blocks are allocated in the file.  Even
>> worse, if files in the directory are not deleted in hash order, no leaf
>> block will be completely empty until about 99% of the files have been
>> deleted - assume 24-byte filenames in 4096-byte blocks means up to 128
>> entries per block, typically 3/4 full, or 1/96 entries will be left in
>> each block before it becomes empty.
>> 
>> One option that was discussed in the past is to use the high 4 bits
>> of dx_entry->block (i.e. the opposite of dx_get_block()) to store the
>> "fullness" of each block (in 1/16th of a block, or 256-byte increments
>> for 4096-byte blocks) and try to merge entries into an adjacent block
>> if it becomes mostly empty (e.g. if the current block plus the neighbour
>> are below 50% full).  That allows removing blocks much earlier as the
>> directory shrinks, rather than waiting until each block is completely
>> empty.  A fullness of "0" would mean "unset", since we don't set it
>> yet, and once this feature is available there would never be a block
>> that is entirely empty.
>> 
>>> As of now, we only support non-indexed directories and indexed
>>> directories with no intermediate dx nodes. This technique can also be
>>> used to remove intermediate dx nodes. But it needs a little more
>>> interesting logic to make that happen since we don't store directory
>>> entry name in intermediate nodes.
>>> 
>>> Ran kvm-xfstests smoke test-suite and verified that there are no
>>> failures. Also, verified that when all the files are deleted in a
>>> directory, directory shrinks to either 4k for non-indexed directories
>>> or 8k for indexed directories with 1 level.
>>> 
>>> This patch is an improvement over previous patch that I sent out
>>> earlier this month. So, if this patch looks alright, then we can drop
>>> the other shrinking patch.
>>> 
>> 
>>> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
>>> 
>>> ---
>>> This patch supersedes the other directory shrinking patch sent in Aug
>>> 2019 ("ext4: shrink directory when last block is empty").
>>> 
>>> fs/ext4/namei.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 176 insertions(+)
>>> 
>>> 
>>> 
>>> +static inline bool is_empty_dirent_block(struct inode *dir,
>>> +                                      struct buffer_head *bh)
>>> +{
>>> +     struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data;
>>> +     int     csum_size = 0;
>>> +
>>> +     if (ext4_has_metadata_csum(dir->i_sb))
>>> +             csum_size = sizeof(struct ext4_dir_entry_tail);
>>> +
>>> +     return ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize) ==
>>> +                     dir->i_sb->s_blocksize - csum_size && de->inode == 0;
>>> +}
>> 
>> This may not always detect empty directory blocks properly, because
>> ext4_generic_delete_entry() will only merge deleted entries with the
>> previous entry.  It at least appears possible that if entries are not
>> deleted in the proper order (e.g. in reverse of the order they are
>> listed in the directory) there may be multiple empty entries in a block,
>> and the above check will fail.
>> 
>> Instead, this checks should walk all entries in a block and return false
>> if any one of them has a non-zero de->inode.  In the common case there
>> may be only a single entry, or the first entry will be used, so it
>> should be fairly quick to decide that the block cannot be removed.
>> 
>> Another option is to change ext4_generic_delete_entry() to also try
>> to merge with the immediately following entry to ensure that an empty
>> block always has rec_len of the full blocksize.  However, I think this
>> is probably not a worthwhile effort since it would be better to support
>> removing blocks that are partly empty rather than entirely empty.
>> 
>>> @@ -2510,6 +2684,8 @@ static int ext4_delete_entry(handle_t *handle,
>>>      if (unlikely(err))
>>>              goto out;
>>> 
>>> +     ext4_try_dir_shrink(handle, dir);
>>> +
>>>      return 0;
>> 
>> I think it would be inefficient to try shrinking the directory after
>> _every_ directory entry is removed.  Instead, there should be some
>> way to determine here if ext4_generic_delete_entry() removed the last
>> entry from the directory block, and only shrink in that case.
>> 
>> Cheers, Andreas
>> 
>> 
>> 
>> 
>> 


Cheers, Andreas