Linux-Fsdevel Archive on lore.kernel.org
From: Daniel Phillips <daniel@phunq.net>
To: Andreas Dilger <adilger@dilger.ca>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>,
	linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Subject: Re: [RFC] Thing 1: Shardmap for Ext4
Date: Wed, 4 Dec 2019 13:44:30 -0800
Message-ID: <f385445b-4941-cc48-e05d-51480a01f4aa@phunq.net>
In-Reply-To: <23F33101-065E-445A-AE5C-D05E35E2B78B@dilger.ca>

On 2019-12-04 10:31 a.m., Andreas Dilger wrote:
> One important use case that we have for Lustre that is not yet in the
> upstream ext4[*] is the ability to do parallel directory operations.
> This means we can create, lookup, and/or unlink entries in the same
> directory concurrently, to increase parallelism for large directories.

This is a requirement for an upcoming transactional version of user space
Shardmap. In the database world they call it "row locking". I am working
on a hash-based scheme with single-record granularity that maps onto the
existing shard buckets, which should be nice and efficient. It may be a
bit tricky with respect to rehash, but it does not look too bad.

Per-shard rw locks are a simpler alternative, but might get a bit fiddly
if you need to lock multiple entries in the same directory at the same
time, which is required for mv, is it not?

> This is implemented by progressively locking the htree root and index
> blocks (typically read-only), then leaf blocks (read-only for lookup,
> read-write for insert/delete).  This provides improved parallelism
> as the directory grows in size.

This will be much easier and more efficient with Shardmap because there
are only three levels: top level shard array; shard hash bucket; record
block. Locking applies only to cache, so there is no need to worry about
a possible upper tier during incremental "reshard".
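
To make the three levels concrete, here is a toy model. It is a sketch
under stated assumptions, not Shardmap's actual layout: the table sizes,
the way hash bits are split between shard and bucket, the chained
buckets, and every name below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SHARD_BITS = 2, BUCKET_BITS = 3 };	/* toy sizes, assumptions */

struct entry {			/* one index entry in a bucket chain */
	uint32_t hash;		/* full name hash */
	uint32_t block;		/* level 3: record block holding the name */
	struct entry *next;
};

struct shard {			/* level 2: hash buckets of one shard */
	struct entry *bucket[1 << BUCKET_BITS];
};

struct shardmap {		/* level 1: top level shard array */
	struct shard shard[1 << SHARD_BITS];
};

/* Top hash bits pick the shard, the next bits pick the bucket. */
static struct entry **bucket_of(struct shardmap *map, uint32_t hash)
{
	unsigned s = hash >> (32 - SHARD_BITS);
	unsigned b = (hash >> (32 - SHARD_BITS - BUCKET_BITS)) &
		     ((1u << BUCKET_BITS) - 1);

	return &map->shard[s].bucket[b];
}

/* Link a caller-allocated entry at the head of its bucket chain. */
static void map_insert(struct shardmap *map, struct entry *e)
{
	struct entry **head = bucket_of(map, e->hash);

	assert(e != NULL);
	e->next = *head;
	*head = e;
}

/* Return the record block to search for the name, or -1 if none. */
static int64_t map_lookup(struct shardmap *map, uint32_t hash)
{
	for (struct entry *e = *bucket_of(map, hash); e; e = e->next)
		if (e->hash == hash)
			return e->block;
	return -1;
}
```

A lookup touches exactly one shard and one bucket before reaching a
record block, so in this model a lock on the shard (or on the bucket)
is the only index-level lock a single-entry operation needs.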

I think Shardmap will also split more cleanly across metadata nodes than
HTree.

> Will there be some similar ability in Shardmap to have parallel ops?

This work is already in progress for user space Shardmap. If there is
also a kernel use case then we can just go forward assuming that this
work or some variation of it applies to both.

We need VFS changes to exploit parallel dirops in general, I think,
confirmed by your comment below. Seems like a good bit of work for
somebody. I bet the benchmarks will show well, suitable grist for a
master's thesis I would think.

Fine-grained directory locking may have a small enough footprint in
the Shardmap kernel port that there is no strong argument for getting
rid of it just because the VFS doesn't support it yet. Really, this has
the smell of a VFS flaw (interested in Al's comments...)

> Also, does Shardmap have the ability to shrink as entries are removed?

No shrink so far. What would you suggest? Keeping in mind that POSIX+NFS
semantics mean that we cannot in general defrag on the fly. I planned to
just hole_punch blocks that happen to become completely empty.
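
From user space, punching out an empty block can be sketched with the
Linux fallocate(2) call, as below. This assumes the directory index
lives in a regular file on a filesystem that supports hole punching;
the 4096-byte block size, the scratch file in /tmp, and the function
names are invented for the example. FALLOC_FL_KEEP_SIZE leaves the file
length and every other block offset unchanged, which is what keeps
existing directory positions stable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

enum { BLOCKSIZE = 4096 };	/* assumed record block size */

/*
 * Deallocate one completely empty block without moving any other block.
 * Returns 0 on success, -1 with errno set (EOPNOTSUPP if the filesystem
 * cannot punch holes).
 */
static int punch_empty_block(int fd, off_t block)
{
	assert(block >= 0);
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 block * BLOCKSIZE, BLOCKSIZE);
}

/*
 * Write three blocks to a scratch file, punch the middle one, and check
 * that the file length is unchanged and the hole reads back as zeros.
 * Returns 1 if so, 0 if not, -1 if setup or the punch itself failed.
 */
static int punch_demo(void)
{
	char path[] = "/tmp/punch-XXXXXX", buf[BLOCKSIZE];
	struct stat st;
	int fd = mkstemp(path), ok = -1;

	if (fd < 0)
		return -1;
	unlink(path);		/* scratch file vanishes on close */
	memset(buf, 'x', sizeof buf);
	for (int i = 0; i < 3; i++)
		if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
			goto out;
	if (punch_empty_block(fd, 1) < 0)
		goto out;	/* e.g. EOPNOTSUPP */
	if (fstat(fd, &st) < 0 ||
	    pread(fd, buf, sizeof buf, BLOCKSIZE) != (ssize_t)sizeof buf)
		goto out;
	ok = st.st_size == 3 * BLOCKSIZE &&
	     buf[0] == 0 && buf[BLOCKSIZE - 1] == 0;
out:
	close(fd);
	return ok;
}
```

Blocks that are only partly empty are left alone here, which matches
the constraint above that entries cannot in general be moved on the
fly.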

This aspect has so far not gotten attention because, historically, we
just never shrink a directory except via fsck/tools. What would you
like to see here? Maybe an ioctl to invoke directory defrag? A mode
bit to indicate we don't care about persistent telldir cookies?

How about automatic defrag that only runs when directory open count is
zero, plus a flag to disable?

> [*] we've tried to submit the pdirops patch a couple of times, but the
> main blocker is that the VFS has a single directory mutex and couldn't
> use the added functionality without significant VFS changes.

How significant would those changes be: really nasty or just somewhat
nasty? I bet the resulting efficiencies would show up in some general
use cases.

> Patch at https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD

This URL gives me git://git.whamcloud.com/fs/lustre-release.git/summary.
Am I missing something?

Regards,

Daniel
