Linux-ext4 Archive on
 help / color / Atom feed
From: Daniel Phillips <>
To: Vyacheslav Dubeyko <>,,,, "Theodore Y. Ts'o" <>,
	OGAWA Hirofumi <>,
	"Darrick J. Wong" <>
Subject: Re: [RFC] Thing 1: Shardmap fox Ext4
Date: Wed, 27 Nov 2019 00:28:26 -0800
Message-ID: <> (raw)
In-Reply-To: <>

On 2019-11-26 11:40 p.m., Vyacheslav Dubeyko wrote:
> As far as I know, usually, a folder contains dozens or hundreds
> files/folders in average. There are many research works that had showed
> this fact. Do you mean some special use-case when folder could contain
> the billion files? Could you share some research work that describes
> some practical use-case with billion files per folder?

You are entirely correct that the vast majority of directories contain
only a handful of files. That is my case (1). A few directories on a
typical server can go into the tens of thousands of files. There was
a time when we could not handle those efficiently, and now thanks to
HTree we can. Some directories go into the millions, ask the Lustre
people about that. If you could have a directory with a billion files
then somebody will have a use for it. For example, you may be able to
avoid a database for a particular application and just use the file
system instead.

Now, scaling to a billion files is just one of several things that
Shardmap does better than HTree. More immediately, Shardmap implements
readdir simply, accurately and efficiently, unlike HTree. See here for
some discussion:
   "Widening ext4's readdir() cookie"

See the recommendation that is sometimes offered to work around
HTree's issues with processing files in hash order. Basically, read
the entire directory into memory, sort by inode number, then process
in that order. As an application writer do you really want to do this,
or would you prefer that the filesystem just take care of for you so
the normal, simple and readable code is also the most efficient code?

> If you are talking about improving the performance then do you mean
> some special open-source implementation?

I mean the same kind of kernel filesystem implementation that HTree
currently has. Open source of course, GPLv2 to be specific.

>> For delete, Shardmap avoids write multiplication by appending tombstone
>> entries to index shards, thereby addressing a well known HTree delete
>> performance issue.
> Do you mean Copy-On-Write policy here or some special technique?

The technique Shardmap uses to reduce write amplication under heavy
load is somewhat similar to the technique used by Google's Bigtable to
achieve a similar result for data files. (However, the resemblance to
Bigtable ends there.)

Each update to a Shardmap index is done twice: once in a highly
optimized hash table shard in cache, then again by appending an
entry to the tail of the shard's media "fifo". Media writes are
therefore mostly linear. I say mostly, because if there is a large
number of shards then a single commit may need to update the tail
block of each one, which still adds up to vastly fewer blocks than
the BTree case, where it is easy to construct cases where every
index block must be updated on every commit, a nasty example of
n**2 performance overhead.

> How could be good Shardmap for the SSD use-case? Do you mean that we
> could reduce write amplification issue for NAND flash case?

Correct. Reducing write amplification is particularly important for
flash based storage. It also has a noticeable beneficial effect on
efficiency under many common and not so common loads.

> Let's imagine that it needs to implement the Shardmap approach. Could
> you estimate the implementation and stabilization time? How expensive
> and long-term efforts could it be?

Shardmap is already implemented and stable, though it does need wider
usage and testing. Code is available here:

Shardmap needs to be ported to kernel, already planned and in progress
for Tux3. Now I am proposing that the Ext4 team should consider porting
Shardmap to Ext4, or at least enter into a serious discussion of the

Added Darrick to cc, as he is already fairly familiar with this subject,
once was an Ext4 developer, and perhaps still is should the need arise.
By the way, is there a reason that Ted's MIT address bounced on my
original post?



  reply index

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-27  1:47 Daniel Phillips
2019-11-27  7:40 ` Vyacheslav Dubeyko
2019-11-27  8:28   ` Daniel Phillips [this message]
2019-11-27 19:35     ` Viacheslav Dubeyko
2019-11-28  2:54       ` Daniel Phillips
2019-11-28  9:15         ` Andreas Dilger
2019-11-28 10:03           ` Daniel Phillips
2019-11-27 14:25 ` Theodore Y. Ts'o
2019-11-27 22:27   ` Daniel Phillips
2019-11-28  2:28     ` Theodore Y. Ts'o
2019-11-28  4:27       ` Daniel Phillips
2019-11-30 17:50         ` Theodore Y. Ts'o
2019-12-01  8:21           ` Daniel Phillips
2019-12-04 18:31             ` Andreas Dilger
2019-12-04 21:44               ` Daniel Phillips
2019-12-05  0:36                 ` Andreas Dilger
2019-12-05  2:27                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-04 23:41               ` [RFC] Thing 1: Shardmap fox Ext4 Theodore Y. Ts'o
2019-12-06  1:16                 ` Dave Chinner
2019-12-06  5:09                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-08 22:42                     ` Dave Chinner
2019-11-28 21:17       ` [RFC] Thing 1: Shardmap fox Ext4 Daniel Phillips
2019-12-08 10:25       ` Daniel Phillips
2019-12-02  1:45   ` Daniel Phillips
2019-12-04 15:55     ` Vyacheslav Dubeyko
2019-12-05  9:46       ` Daniel Phillips
2019-12-06 11:47         ` Vyacheslav Dubeyko
2019-12-07  0:46           ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-04 18:03     ` [RFC] Thing 1: Shardmap fox Ext4 Andreas Dilger
2019-12-04 20:47       ` Daniel Phillips
2019-12-04 20:53         ` Daniel Phillips
2019-12-05  5:59           ` Daniel Phillips

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-ext4 Archive on

Archives are clonable:
	git clone --mirror linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ \
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone