Linux-ext4 Archive on
 help / color / Atom feed
From: Andreas Dilger <>
To: Daniel Phillips <>
Cc: "Theodore Y. Ts'o" <>,,,,
	OGAWA Hirofumi <>
Subject: Re: [RFC] Thing 1: Shardmap fox Ext4
Date: Wed, 4 Dec 2019 11:31:50 -0700
Message-ID: <> (raw)
In-Reply-To: <>

[-- Attachment #1: Type: text/plain, Size: 3208 bytes --]

On Dec 1, 2019, at 1:21 AM, Daniel Phillips <> wrote:
>>> Important example: how is atomic directory commit going to work for
>>> Ext4?
>> The same way all metadata updates work in ext4.  Which is to say, you
>> need to declare the maximum number of 4k metadata blocks that an
>> operation might need to change when calling ext4_journal_start() to
>> create a handle; and before modifying a 4k block, you need to call
>> ext4_journal_get_write_access(), passing in the handle and the block's
>> buffer_head.  After modifying the block, you must call
>> ext4_handle_dirty_metadata() on the buffer_head.  And when you are
>> doing with the changes in an atomic metadata operation, you call
>> ext4_journal_stop() on the handle.
>> This hasn't changed since the days of ext3 and htree.
> OK good. And I presume that directory updates are prevented until
> the journal transaction is at least fully written to buffers. Maybe
> delayed until the journal transaction is actually committed?
> In Tux3 we don't block directory updates during backend commit, and I
> just assumed that Ext4 and others also do that now, so thanks for the
> correction. As far I can see, there will be no new issue with Shardmap,
> as you say. My current plan is that user space mmap will become kmap in
> kernel. I am starting on this part for Tux3 right now. My goal is to
> refine the current Shardmap data access api to hide the fact that mmap
> is used in user space but kmap in kernel. Again, I wish we actually had
> mmap in kernel and maybe we should consider properly supporting it in
> the future, perhaps by improving kmalloc.
> One thing we do a bit differently frou our traditional fs is, in the
> common, unfragmented case, mass inserts go into the same block until
> the block is full. So we get a significant speedup by avoiding a page
> cache lookup and kmap per insert. Borrowing a bit of mechanism from
> the persistent memory version of Shardmap, we create the new entries
> in a separate cache page. Then, on commit, copy this "front buffer" to
> the page cache. I think that will translate pretty well to Ext4 also.

One important use case that we have for Lustre that is not yet in the
upstream ext4[*] is the ability to do parallel directory operations.
This means we can create, lookup, and/or unlink entries in the same
directory concurrently, to increase parallelism for large directories.

This is implemented by progressively locking the htree root and index
blocks (typically read-only), then leaf blocks (read-only for lookup,
read-write for insert/delete).  This provides improved parallelism
as the directory grows in size.

Will there be some similar ability in Shardmap to have parallel ops?
Also, does Shardmap have the ability to shrink as entries are removed?

Cheers, Andreas

[*] we've tried to submit the pdirops patch a couple of times, but the
main blocker is that the VFS has a single directory mutex and couldn't
use the added functionality without significant VFS changes.
Patch at;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

  reply index

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-27  1:47 Daniel Phillips
2019-11-27  7:40 ` Vyacheslav Dubeyko
2019-11-27  8:28   ` Daniel Phillips
2019-11-27 19:35     ` Viacheslav Dubeyko
2019-11-28  2:54       ` Daniel Phillips
2019-11-28  9:15         ` Andreas Dilger
2019-11-28 10:03           ` Daniel Phillips
2019-11-27 14:25 ` Theodore Y. Ts'o
2019-11-27 22:27   ` Daniel Phillips
2019-11-28  2:28     ` Theodore Y. Ts'o
2019-11-28  4:27       ` Daniel Phillips
2019-11-30 17:50         ` Theodore Y. Ts'o
2019-12-01  8:21           ` Daniel Phillips
2019-12-04 18:31             ` Andreas Dilger [this message]
2019-12-04 21:44               ` Daniel Phillips
2019-12-05  0:36                 ` Andreas Dilger
2019-12-05  2:27                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-04 23:41               ` [RFC] Thing 1: Shardmap fox Ext4 Theodore Y. Ts'o
2019-12-06  1:16                 ` Dave Chinner
2019-12-06  5:09                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-08 22:42                     ` Dave Chinner
2019-11-28 21:17       ` [RFC] Thing 1: Shardmap fox Ext4 Daniel Phillips
2019-12-08 10:25       ` Daniel Phillips
2019-12-02  1:45   ` Daniel Phillips
2019-12-04 15:55     ` Vyacheslav Dubeyko
2019-12-05  9:46       ` Daniel Phillips
2019-12-06 11:47         ` Vyacheslav Dubeyko
2019-12-07  0:46           ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-04 18:03     ` [RFC] Thing 1: Shardmap fox Ext4 Andreas Dilger
2019-12-04 20:47       ` Daniel Phillips
2019-12-04 20:53         ` Daniel Phillips
2019-12-05  5:59           ` Daniel Phillips

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-ext4 Archive on

Archives are clonable:
	git clone --mirror linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ \
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone