On Aug 26, 2019, at 3:46 PM, harshad shirwadkar wrote: > > I see, this is an interesting approach. I think we can do this without > modifying on-disk format and by bearing the cost of 2 extra reads per > merge. Whenever dentry deletion results in a dirent block that is > _sufficiently_ empty, we can look at its parent dx block and and find > its neighbors that can be merged. We can be aggressive and do this for > every dentry deletion or we could do this whenever the current dirent > block is half empty or something. The point of storing the "leaf block fullness" into the high bits of dx_entry->block is that this information would always be available during htree traversal, even if the leaf blocks are not in memory. That would allow the unlink code to make a decision in advance whether shrinking is going to happen (e.g. leaf block plus one neighbour are under 50% full). The overhead would be to keep the "leaf block fullness" value updated enough to be useful. In this regard, a lower-resolution field (maybe 1 or 2 bits is enough, representing { unset, <= 25% full }, or { unset, <= 20% full, <= 40% full, > 40% full }. Anything above this threshold isn't interesting to consider merging, so there is no need to track it. > By this method we end up reading up to 2 extra blocks (one previous > and one next) that are not going to be merged. That's the trade-off we > have to make in order to avoid any changes to on-disk structure (If we > modify the on-disk structure and store the fullness in the dx block, > we would read only the blocks that need to be merged). Right. As Ted wrote, we could do the shrinking optimistically if the dx_entry field is "unset", but the blocks are in memory. However, in when the leaf block is first read into memory, we could also update the dx_entry fullness when it is sanity checked, so that we don't need to keep checking whether the leaf block is empty or not. The dx_entry update could be done only in memory and written opportunistically to disk if the htree index block is modified. At worst, if the dx_entry fullness is incorrect the merge could fail and the block is not freed. > The same logic can also be applied to intermediate dx nodes as well. > After every merge operation, we'll have a set of blocks that need to > be freed. Once we know what blocks we can free, we can use Ted's idea > of swapping them with the last block, one by one. > > Since merging approach also requires a way to free up directory > blocks, I think we could first get a patch in that can free up > directory blocks by swapping with the last block. Once we have that > then we could implement merging. Yes, definitely. I'm not suggesting that we don't need the ability to free leaf/index blocks first. I just wanted to point out that we will not see any blocks being freed until the directory is almost completely (99%) empty. For testing purposes you could use 1KB blocks and 250-character names (which means at most 3 entries per block), and we would start seeing blocks being freed once the directory was below about 30% full, but this does not represent a common use case. Cheers, Andreas > On Sun, Aug 25, 2019 at 10:07 PM Andreas Dilger wrote: >> >> On Aug 21, 2019, at 12:27 PM, Harshad Shirwadkar wrote: >>> >>> On every dentry deletion, this patch traverses directory inode in >>> reverse direction and frees up empty dirent blocks until it finds a >>> non-empty dirent block. We leverage the fact that we never clear >>> dentry name when we delete dentrys by merging them with the previous >>> one. So, even though the dirent block has only fake dentry which spans >>> across the entire block, we can use the name in this dead entry to >>> perform dx lookup and find intermediate dx node blocks as well as >>> offset inside these blocks. >> >> >> One high-level limitation with this implementation is that it is unlikely >> to remove any directory blocks until the directory is almost entirely >> empty since "rm -r" will return entries in hash order, which does not >> match the order that the leaf blocks are allocated in the file. Even >> worse, if files in the directory are not deleted in hash order, no leaf >> block will be completely empty until about 99% of the files have been >> deleted - assume 24-byte filenames in 4096-byte blocks means up to 128 >> entries per block, typically 3/4 full, or 1/96 entries will be left in >> each block before it becomes empty. >> >> One option that was discussed in the past is to use the high 4 bits >> of dx_entry->block (i.e. the opposite of dx_get_block()) to store the >> "fullness" of each block (in 1/16th of a block, or 256-byte increments >> for 4096-byte blocks) and try to merge entries into an adjacent block >> if it becomes mostly empty (e.g. if the current block plus the neighbour >> are below 50% full). That allows removing blocks much earlier as the >> directory shrinks, rather than waiting until each block is completely >> empty. A fullness of "0" would mean "unset", since we don't set it >> yet, and once this feature is available there would never be a block >> that is entirely empty. >> >>> As of now, we only support non-indexed directories and indexed >>> directories with no intermediate dx nodes. This technique can also be >>> used to remove intermediate dx nodes. But it needs a little more >>> interesting logic to make that happen since we don't store directory >>> entry name in intermediate nodes. >>> >>> Ran kvm-xfstests smoke test-suite and verified that there are no >>> failures. Also, verified that when all the files are deleted in a >>> directory, directory shrinks to either 4k for non-indexed directories >>> or 8k for indexed directories with 1 level. >>> >>> This patch is an improvement over previous patch that I sent out >>> earlier this month. So, if this patch looks alright, then we can drop >>> the other shrinking patch. >>> >> >>> Signed-off-by: Harshad Shirwadkar >>> >>> --- >>> This patch supersedes the other directory shrinking patch sent in Aug >>> 2019 ("ext4: shrink directory when last block is empty"). >>> >>> fs/ext4/namei.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 176 insertions(+) >>> >>> >>> >>> +static inline bool is_empty_dirent_block(struct inode *dir, >>> + struct buffer_head *bh) >>> +{ >>> + struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data; >>> + int csum_size = 0; >>> + >>> + if (ext4_has_metadata_csum(dir->i_sb)) >>> + csum_size = sizeof(struct ext4_dir_entry_tail); >>> + >>> + return ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize) == >>> + dir->i_sb->s_blocksize - csum_size && de->inode == 0; >>> +} >> >> This may not always detect empty directory blocks properly, because >> ext4_generic_delete_entry() will only merge deleted entries with the >> previous entry. It at least appears possible that if entries are not >> deleted in the proper order (e.g. in reverse of the order they are >> listed in the directory) there may be multiple empty entries in a block, >> and the above check will fail. >> >> Instead, this checks should walk all entries in a block and return false >> if any one of them has a non-zero de->inode. In the common case there >> may be only a single entry, or the first entry will be used, so it >> should be fairly quick to decide that the block cannot be removed. >> >> Another option is to change ext4_generic_delete_entry() to also try >> to merge with the immediately following entry to ensure that an empty >> block always has rec_len of the full blocksize. However, I think this >> is probably not a worthwhile effort since it would be better to support >> removing blocks that are partly empty rather than entirely empty. >> >>> @@ -2510,6 +2684,8 @@ static int ext4_delete_entry(handle_t *handle, >>> if (unlikely(err)) >>> goto out; >>> >>> + ext4_try_dir_shrink(handle, dir); >>> + >>> return 0; >> >> I think it would be inefficient to try shrinking the directory after >> _every_ directory entry is removed. Instead, there should be some >> way to determine here if ext4_generic_delete_entry() removed the last >> entry from the directory block, and only shrink in that case. >> >> Cheers, Andreas >> >> >> >> >> Cheers, Andreas