Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: Vyacheslav Dubeyko <slava@dubeyko.com>
To: Daniel Phillips <daniel@phunq.net>,
	linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, "Theodore Y. Ts'o" <tytso@mit.edu>,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Subject: Re: [RFC] Thing 1: Shardmap fox Ext4
Date: Wed, 27 Nov 2019 10:40:59 +0300
Message-ID: <8ece0424ceeeffbc4df5d52bfa270a9522f81cda.camel@dubeyko.com> (raw)
In-Reply-To: <176a1773-f5ea-e686-ec7b-5f0a46c6f731@phunq.net>

On Tue, 2019-11-26 at 17:47 -0800, Daniel Phillips wrote:
> Hi folks,
> 
> Here is my somewhat tardy followup to my Three Things post from
> earlier
> this fall. I give you Thing 1, Shardmap. What I hope to accomplish
> today
> is to initiate a conversation with Ext4 developers, and other
> interested
> observers, which will eventually result in merging the new Shardmap
> directory index code into Ext4, thereby solving a number of
> longstanding
> issues with the venerable and somewhat problematic HTree.
> 
> HTree is the most used directory index in the known universe. HTree
> does
> some things fantastically well, particularly in the most common range
> of
> directory sizes, but also exhibits some well known flaws and at least
> one
> previously unknown flaw, explained below. Shardmap is a new index
> design,
> just seven years old, an O(1) extensible hash table, meant to address
> all
> of HTree's flaws while improving performance and scaling into the
> previously inaccessible billion file per directory range. Subject to
> verifying these claims, it would seem to be logical to move on to the
> logistics of porting Shardmap to Ext4 as an optional index algorithm,
> eventually deprecating HTree.
> 


As far as I know, usually, a folder contains dozens or hundreds
files/folders in average. There are many research works that had showed
this fact. Do you mean some special use-case when folder could contain
the billion files? Could you share some research work that describes
some practical use-case with billion files per folder?

If you are talking about improving the performance then do you mean
some special open-source implementation?


> Shardmap equals or outperforms HTree at all scales, with the single
> exception of one theoretical case with a likewise theoretical
> solution.
> Shardmap is O(1) in all operations - insert, delete, lookup and
> readdir,
> while HTree is O(log N) in the first three and suffers from a
> relatively
> large constant readdir overhead. This is not the only reason Shardmap
> is
> faster than HTree, far from it.
> 
> I will break performance discussion down into four ranges:
>    1) unindexed single block, to about 300 entries
>    2) indexed up to 30 thousand entries
>    3) indexed up to 3 million entries
>    4) indexed up to 1 billion entries.
> 
> In theory, Shardmap and HTree are exactly tied in the common single
> block case, because creating the index is delayed in both cases until
> there are at least two blocks to index. However, Shardmap broke away
> from the traditional Ext2 entry block format in order to improve
> block
> level operation efficiency and to prevent record fragmentation under
> heavy aging, and is therefore significantly faster than HTree even in
> the single block case.
> 
> Shardmap does not function at all in the fourth range, up to 1
> billion
> entries, because its Btree has at most 2 levels. This simple flaw
> could be
> corrected without much difficulty but Shardmap would remain superior
> for
> a number of reasons.
> 
> The most interesting case is the 300 to 30 thousand entry range,
> where
> Htree and Shardmap should theoretically be nearly equal, each
> requiring
> two accesses per lookup. However, HTree does two radix tree lookups
> while
> Shardmap does one, and the other lookup is in a cached memory object.
> Coupled with the faster record level operations, Shardmap is
> significantly
> faster in this range. In the 30 thousand to 3 million range,
> Shardmap's
> performance advantage further widens in accordance with O(1) / O(log
> N).
> 
> For inserts, Shardmap's streaming update strategy is far superior to
> HTree's random index block write approach. HTree will tend to dirty
> every
> index block under heavy insert load, so that every index block must
> be
> written to media per commit, causing serious write multiplication
> issues. In fact, all Btree schemes suffer from this issue, which on
> the
> face of it appears to be enough reason to eliminate the Btree as a
> realistic design alternative. Shardmap dramatically reduces such per
> commit write multiplication by appending new index entries linearly
> to
> the tail blocks of a small number of index shards. For delete,
> Shardmap avoids write multiplication by appending tombstone entries
> to
> index shards, thereby addressing a well known HTree delete
> performance
> issue.


Do you mean Copy-On-Write policy here or some special technique? How
could be good Shardmap for the SSD use-case? Do you mean that we could
reduce write amplification issue for NAND flash case?


> 
> HTree has always suffered from a serious mismatch between name
> storage
> order and inode storage order, greatly exacerbated by the large
> number
> of directory entries and inodes stored per block (false sharing). In
> particular, a typical HTree hash order directory traversal accesses
> the
> inode table randomly, multiplying both the cache footprint and write
> traffic. Historically, this was the cause of many performance
> complaints
> about HTree, though to be sure, such complaints have fallen off with
> the advent of solid state storage. Even so, this issue will continue
> rear
> its ugly head when users push the boundaries of cache and directory
> size
> (google telldir+horror). Shardmap avoids this issue entirely by
> storing
> and traversing directory entries in simple, classic linear order.
> 
> This touches on the single most significant difference between
> Shardmap
> and HTree: Shardmap strictly separates its index from record entry
> blocks,
> while HTree embeds entries directly in the BTree index. The HTree
> strategy performs impressively well at low to medium directory
> scales,
> but at higher scales it causes significantly more cache pressure, due
> to
> the fact that the cache footprint of any randomly accessed index is
> necessarily the entire index. In contrast, Shardmap stores directory
> entries permanently in creation order, so that directory traversal is
> in simple linear order with effectively zero overhead. This avoids
> perhaps HTree's most dubious design feature, its arcane and not
> completely
> reliable hash order readdir traversal, which miraculously has avoided
> serious meltdown over these past two decades due to a legendary hack
> by
> Ted and subsequent careful adaptation to handle various corner cases.
> Nowhere in Shardmap will you find any such arcane and marginally
> supportable code, which should be a great relief to Ext4 maintainers.
> Or to put it another way, if somebody out there wishes to export a
> billion file directory using NFSv2, Shardmap will not be the reason
> why that does not work whereas HTree most probably would be.
> 
> Besides separating out the index from entry records and accessing
> those
> records linearly in most situations, Shardmap also benefits from a
> very
> compact index design. Even if a directory has a billion entries, each
> index entry is only eight bytes in size. This exercise in structure
> compaction proved possible because the number of hash bits needed for
> the
> hash code decreases as the number of index shards increases, freeing
> up
> bits for larger block numbers as the directory expands. Shardmap
> therefore implements an index entry as several variable sized fields.
> This strategy works well up to the billion entry range, above which
> the
> number of hash index collisions (each of which must be resolved by
> accessing an underlying record block) starts to increase noticeably.
> This is really the only factor that limits Shardmap performance above
> a billion entries. Should we wish Shardmap to scale to trillions of
> entries without losing performance, we will need to increase the
> index
> entry size to ten bytes or so, or come up with some as yet unknown
> clever improvement (potential thesis project goes here).
> 
> There are many additional technical details of Shardmap that ought to
> be
> explained, however today my primary purpose is simply to introduce
> what
> I view as a compelling case for obsoleting HTree. To that end, fewer
> words are better and this post is already quite long enough. I would
> love
> to get into some other interesting details, for example, the Bigmap
> free
> space mapping strategy, but that really needs its own post to do it
> justice, as do a number of other subjects.
> 
> Wrapping up, what about that theoretical case where HTree outperforms
> Shardmap? This is theoretical because one needs to operate on a huge
> directory with tiny cache to trigger it. Both Shardmap and HTree will
> exhibit read multiplication under such conditions, due to frequent
> cache evictions, however the Shardmap read multiplication will be
> many
> times higher than HTree because of its coarser cache granularity. In
> the
> unlikely event that we ever need to fix this, one viable solution is
> to
> add paging support for Shardmap's in-memory cache structure, a
> standard
> technique sometimes called "anticache".
> 
> That is it for today. There remains much to explain about Shardmap
> both
> within and beyond the Ext4 context. For example, Shardmap has proved
> to
> work very well as a standalone key value store, particularly with
> persistent memory. In fact, we have benchmarked Shardmap at over
> three
> million atomic, durable database operations per second on an Intel
> Optane server, which might well be a new world record. The details of
> how this was done are fascinating, however this post is far too small
> to
> contain them today. Perhaps this should be thing 1(b) for next week.
> 


Let's imagine that it needs to implement the Shardmap approach. Could
you estimate the implementation and stabilization time? How expensive
and long-term efforts could it be?

Thanks,
Viacheslav Dubeyko.



  reply index

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-27  1:47 Daniel Phillips
2019-11-27  7:40 ` Vyacheslav Dubeyko [this message]
2019-11-27  8:28   ` Daniel Phillips
2019-11-27 19:35     ` Viacheslav Dubeyko
2019-11-28  2:54       ` Daniel Phillips
2019-11-28  9:15         ` Andreas Dilger
2019-11-28 10:03           ` Daniel Phillips
2019-11-27 14:25 ` Theodore Y. Ts'o
2019-11-27 22:27   ` Daniel Phillips
2019-11-28  2:28     ` Theodore Y. Ts'o
2019-11-28  4:27       ` Daniel Phillips
2019-11-30 17:50         ` Theodore Y. Ts'o
2019-12-01  8:21           ` Daniel Phillips
2019-12-04 18:31             ` Andreas Dilger
2019-12-04 21:44               ` Daniel Phillips
2019-12-05  0:36                 ` Andreas Dilger
2019-12-05  2:27                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-04 23:41               ` [RFC] Thing 1: Shardmap fox Ext4 Theodore Y. Ts'o
2019-12-06  1:16                 ` Dave Chinner
2019-12-06  5:09                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-08 22:42                     ` Dave Chinner
2019-11-28 21:17       ` [RFC] Thing 1: Shardmap fox Ext4 Daniel Phillips
2019-12-08 10:25       ` Daniel Phillips
2019-12-02  1:45   ` Daniel Phillips
2019-12-04 15:55     ` Vyacheslav Dubeyko
2019-12-05  9:46       ` Daniel Phillips
2019-12-06 11:47         ` Vyacheslav Dubeyko
2019-12-07  0:46           ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-04 18:03     ` [RFC] Thing 1: Shardmap fox Ext4 Andreas Dilger
2019-12-04 20:47       ` Daniel Phillips
2019-12-04 20:53         ` Daniel Phillips
2019-12-05  5:59           ` Daniel Phillips

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8ece0424ceeeffbc4df5d52bfa270a9522f81cda.camel@dubeyko.com \
    --to=slava@dubeyko.com \
    --cc=daniel@phunq.net \
    --cc=hirofumi@mail.parknet.co.jp \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git