On Dec 1, 2019, at 1:21 AM, Daniel Phillips wrote:
>>> Important example: how is atomic directory commit going to work for
>>> Ext4?
>>
>> The same way all metadata updates work in ext4. Which is to say, you
>> need to declare the maximum number of 4k metadata blocks that an
>> operation might need to change when calling ext4_journal_start() to
>> create a handle; and before modifying a 4k block, you need to call
>> ext4_journal_get_write_access(), passing in the handle and the block's
>> buffer_head. After modifying the block, you must call
>> ext4_handle_dirty_metadata() on the buffer_head. And when you are
>> done with the changes in an atomic metadata operation, you call
>> ext4_journal_stop() on the handle.
>>
>> This hasn't changed since the days of ext3 and htree.
>
> OK good. And I presume that directory updates are prevented until
> the journal transaction is at least fully written to buffers. Maybe
> delayed until the journal transaction is actually committed?
>
> In Tux3 we don't block directory updates during backend commit, and I
> just assumed that Ext4 and others also do that now, so thanks for the
> correction. As far as I can see, there will be no new issue with
> Shardmap, as you say. My current plan is that user space mmap will
> become kmap in kernel. I am starting on this part for Tux3 right now.
> My goal is to refine the current Shardmap data access API to hide the
> fact that mmap is used in user space but kmap in kernel. Again, I wish
> we actually had mmap in kernel and maybe we should consider properly
> supporting it in the future, perhaps by improving kmalloc.
>
> One thing we do a bit differently from a traditional fs is that, in
> the common, unfragmented case, mass inserts go into the same block
> until the block is full. So we get a significant speedup by avoiding
> a page cache lookup and kmap per insert. Borrowing a bit of mechanism
> from the persistent memory version of Shardmap, we create the new
> entries in a separate cache page. Then, on commit, we copy this
> "front buffer" to the page cache. I think that will translate pretty
> well to Ext4 also.

One important use case that we have for Lustre that is not yet in
upstream ext4 [*] is the ability to do parallel directory operations.
This means we can create, look up, and/or unlink entries in the same
directory concurrently, to increase parallelism for large directories.

This is implemented by progressively locking the htree root and index
blocks (typically read-only), then the leaf blocks (read-only for
lookup, read-write for insert/delete). This provides improved
parallelism as the directory grows in size.

Will there be some similar ability in Shardmap to do parallel
directory operations?

Also, does Shardmap have the ability to shrink as entries are removed?

Cheers, Andreas

[*] We've tried to submit the pdirops patch a couple of times, but the
main blocker is that the VFS has a single directory mutex and couldn't
use the added functionality without significant VFS changes. Patch at:
https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD
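
To make the journalling sequence quoted above concrete, here is a rough
sketch of the ext4/JBD2 pattern. The helper names are the real ext4
ones, but exact prototypes and credit accounting vary across kernel
versions, so treat this as an illustration of the sequence rather than
working kernel code:

/*
 * Illustrative sketch of the ext4/JBD2 metadata-update pattern
 * described above.  Prototypes differ between kernel versions.
 */
static int example_dir_update(struct inode *dir, struct buffer_head *bh)
{
	handle_t *handle;
	int credits = 4;	/* worst-case number of 4k metadata blocks touched */
	int err;

	/* Start an atomic metadata operation (a handle on the running txn). */
	handle = ext4_journal_start(dir, EXT4_HT_DIR, credits);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	/* Declare intent to modify this block before touching it. */
	err = ext4_journal_get_write_access(handle, bh);
	if (err)
		goto out;

	/* ... modify the directory block through bh->b_data ... */

	/* Hand the dirtied buffer back to the journal. */
	err = ext4_handle_dirty_metadata(handle, dir, bh);
out:
	ext4_journal_stop(handle);	/* end of the atomic operation */
	return err;
}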
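
One way the "mmap in user space, kmap in kernel" split mentioned above
might be hidden behind a single accessor is sketched below; the names
(shard_map, shard_unmap, struct shard_block) are invented for
illustration and are not taken from the Shardmap code:

/*
 * Hypothetical accessor that hides whether a shard block is reached
 * through a user-space mmap() region or a kernel page-cache page.
 */
struct shard_block {
#ifdef __KERNEL__
	struct page *page;	/* page-cache page backing this block */
#else
	void *base;		/* mmap()ed region in user space */
	size_t offset;		/* block offset within the mapping */
#endif
};

static inline void *shard_map(struct shard_block *b)
{
#ifdef __KERNEL__
	return kmap_local_page(b->page);	/* temporary kernel mapping */
#else
	return b->base + b->offset;		/* already mapped via mmap() */
#endif
}

static inline void shard_unmap(struct shard_block *b, void *p)
{
#ifdef __KERNEL__
	kunmap_local(p);
#else
	(void)b; (void)p;	/* nothing to do; munmap() happens elsewhere */
#endif
}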
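
The "front buffer" batching described above might look roughly like
this; all of the names are hypothetical and only illustrate the idea of
accumulating inserts in a private page and copying it into the page
cache in one shot at commit time:

/*
 * Hypothetical sketch: batch directory inserts in a private page so
 * the common unfragmented case pays one page-cache lookup and one kmap
 * per block instead of one per entry.
 */
struct front_buffer {
	void *data;		/* private page holding new entries */
	unsigned used;		/* bytes of entries accumulated so far */
	pgoff_t index;		/* page-cache index of the target block */
};

/* Append one entry; returns false when the block is full. */
static bool front_insert(struct front_buffer *fb, const void *entry,
			 unsigned len, unsigned blocksize)
{
	if (fb->used + len > blocksize)
		return false;		/* caller moves on to a new block */
	memcpy(fb->data + fb->used, entry, len);
	fb->used += len;
	return true;
}

/* At commit: one lookup and one mapping for the whole batch. */
static int front_commit(struct address_space *mapping,
			struct front_buffer *fb)
{
	struct page *page = find_or_create_page(mapping, fb->index, GFP_NOFS);
	void *dst;

	if (!page)
		return -ENOMEM;
	dst = kmap_local_page(page);
	memcpy(dst, fb->data, fb->used);
	kunmap_local(dst);
	set_page_dirty(page);
	unlock_page(page);
	put_page(page);
	fb->used = 0;
	return 0;
}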
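
As a rough illustration of the progressive locking described for
pdirops (the real patch uses a dedicated htree_lock mechanism rather
than plain rwsems), the descent could be sketched as follows, where
is_leaf() and lookup_child() are hypothetical helpers:

/*
 * Hypothetical sketch of progressive htree locking: index levels are
 * taken shared with lock coupling, and only the leaf block is taken
 * exclusive for insert/delete.
 */
enum dir_op { DIR_LOOKUP, DIR_INSERT, DIR_DELETE };

struct htree_block {
	struct rw_semaphore lock;
	/* ... block data ... */
};

static struct htree_block *walk_and_lock(struct htree_block *root,
					 const char *name, enum dir_op op)
{
	struct htree_block *block = root, *child;

	/* Index levels: shared locks, released as soon as the child is
	 * held, so concurrent operations only serialize briefly. */
	down_read(&block->lock);
	while (!is_leaf(block)) {
		child = lookup_child(block, name);	/* hash-based descent */
		down_read(&child->lock);
		up_read(&block->lock);
		block = child;
	}

	/* Leaf: shared for lookup, exclusive for insert/delete. */
	if (op != DIR_LOOKUP) {
		up_read(&block->lock);
		down_write(&block->lock);
	}
	return block;	/* caller operates on the leaf, then unlocks */
}

Dropping the shared leaf lock and re-taking it exclusively, as done
here, is a simplification; a real implementation has to take the right
mode up front or revalidate after a concurrent split.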