linux-fsdevel.vger.kernel.org archive mirror
* [RFC] Thing 1: Shardmap for Ext4
@ 2019-11-27  1:47 Daniel Phillips
  2019-11-27  7:40 ` Vyacheslav Dubeyko
  2019-11-27 14:25 ` Theodore Y. Ts'o
  0 siblings, 2 replies; 32+ messages in thread
From: Daniel Phillips @ 2019-11-27  1:47 UTC (permalink / raw)
  To: linux-ext4, linux-kernel, linux-fsdevel, Theodore Y. Ts'o,
	OGAWA Hirofumi

Hi folks,

Here is my somewhat tardy followup to my Three Things post from earlier
this fall. I give you Thing 1, Shardmap. What I hope to accomplish today
is to initiate a conversation with Ext4 developers, and other interested
observers, which will eventually result in merging the new Shardmap
directory index code into Ext4, thereby solving a number of longstanding
issues with the venerable and somewhat problematic HTree.

HTree is the most used directory index in the known universe. HTree does
some things fantastically well, particularly in the most common range of
directory sizes, but also exhibits some well known flaws and at least one
previously unknown flaw, explained below. Shardmap is a new index design,
just seven years old, an O(1) extensible hash table, meant to address all
of HTree's flaws while improving performance and scaling into the
previously inaccessible billion file per directory range. Subject to
verifying these claims, it would seem logical to move on to the
logistics of porting Shardmap to Ext4 as an optional index algorithm,
eventually deprecating HTree.

Shardmap equals or outperforms HTree at all scales, with the single
exception of one theoretical case with a likewise theoretical solution.
Shardmap is O(1) in all operations - insert, delete, lookup and readdir,
while HTree is O(log N) in the first three and suffers from a relatively
large constant readdir overhead. This is not the only reason Shardmap is
faster than HTree, far from it.

I will break performance discussion down into four ranges:
   1) unindexed single block, to about 300 entries
   2) indexed up to 30 thousand entries
   3) indexed up to 3 million entries
   4) indexed up to 1 billion entries.

In theory, Shardmap and HTree are exactly tied in the common single
block case, because creating the index is delayed in both cases until
there are at least two blocks to index. However, Shardmap broke away
from the traditional Ext2 entry block format in order to improve block
level operation efficiency and to prevent record fragmentation under
heavy aging, and is therefore significantly faster than HTree even in
the single block case.

HTree does not function at all in the fourth range, up to 1 billion
entries, because its BTree has at most 2 levels. This simple flaw could be
corrected without much difficulty, but Shardmap would remain superior for
a number of reasons.

The most interesting case is the 300 to 30 thousand entry range, where
HTree and Shardmap should theoretically be nearly equal, each requiring
two accesses per lookup. However, HTree does two radix tree lookups while
Shardmap does one, and the other lookup is in a cached memory object.
Coupled with the faster record level operations, Shardmap is significantly
faster in this range. In the 30 thousand to 3 million range, Shardmap's
performance advantage further widens in accordance with O(1) / O(log N).

For inserts, Shardmap's streaming update strategy is far superior to
HTree's random index block write approach. HTree will tend to dirty every
index block under heavy insert load, so that every index block must be
written to media per commit, causing serious write multiplication
issues. In fact, all Btree schemes suffer from this issue, which on the
face of it appears to be enough reason to eliminate the Btree as a
realistic design alternative. Shardmap dramatically reduces such per
commit write multiplication by appending new index entries linearly to
the tail blocks of a small number of index shards. For delete,
Shardmap avoids write multiplication by appending tombstone entries to
index shards, thereby addressing a well known HTree delete performance
issue.
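
To make that concrete, here is a rough sketch of the streaming update idea
in C. The names and field widths are invented for illustration and are not
the actual Shardmap code; the point is only that insert and delete both
reduce to appending a small fixed size entry to a per-shard fifo, so a
commit writes just the dirty tail blocks:

/* Illustrative sketch only (invented names and layout, not Shardmap code):
 * insert and delete both append a small entry to a per-shard fifo, a
 * tombstone in the delete case, so a commit writes only dirty tail blocks. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct fifo_entry {
	uint32_t loshash;	/* low hash bits of the name */
	uint32_t block : 31;	/* record block that holds the name */
	uint32_t dead  : 1;	/* tombstone flag, set on delete */
};

struct shard_fifo {
	struct fifo_entry *entries;
	unsigned count, capacity;	/* tail position, allocated space */
};

static void fifo_append(struct shard_fifo *fifo, uint32_t loshash,
			uint32_t block, bool dead)
{
	if (fifo->count == fifo->capacity) {
		fifo->capacity = fifo->capacity ? 2 * fifo->capacity : 256;
		fifo->entries = realloc(fifo->entries,
					fifo->capacity * sizeof *fifo->entries);
	}
	fifo->entries[fifo->count++] = (struct fifo_entry){
		.loshash = loshash, .block = block, .dead = dead };
	/* in the real thing, only the tail block just touched is marked
	 * dirty and queued for the next commit */
}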

HTree has always suffered from a serious mismatch between name storage
order and inode storage order, greatly exacerbated by the large number
of directory entries and inodes stored per block (false sharing). In
particular, a typical HTree hash order directory traversal accesses the
inode table randomly, multiplying both the cache footprint and write
traffic. Historically, this was the cause of many performance complaints
about HTree, though to be sure, such complaints have fallen off with
the advent of solid state storage. Even so, this issue will continue to
rear its ugly head when users push the boundaries of cache and directory
size (google telldir+horror). Shardmap avoids this issue entirely by
storing and traversing directory entries in simple, classic linear order.

This touches on the single most significant difference between Shardmap
and HTree: Shardmap strictly separates its index from record entry blocks,
while HTree embeds entries directly in the BTree index. The HTree
strategy performs impressively well at low to medium directory scales,
but at higher scales it causes significantly more cache pressure, due to
the fact that the cache footprint of any randomly accessed index is
necessarily the entire index. In contrast, Shardmap stores directory
entries permanently in creation order, so that directory traversal is
in simple linear order with effectively zero overhead. This avoids
perhaps HTree's most dubious design feature, its arcane and not completely
reliable hash order readdir traversal, which miraculously has avoided
serious meltdown over these past two decades due to a legendary hack by
Ted and subsequent careful adaptation to handle various corner cases.
Nowhere in Shardmap will you find any such arcane and marginally
supportable code, which should be a great relief to Ext4 maintainers.
Or to put it another way, if somebody out there wishes to export a
billion file directory using NFSv2, Shardmap will not be the reason
why that does not work whereas HTree most probably would be.

Besides separating out the index from entry records and accessing those
records linearly in most situations, Shardmap also benefits from a very
compact index design. Even if a directory has a billion entries, each
index entry is only eight bytes in size. This exercise in structure
compaction proved possible because the number of hash bits needed for the
hash code decreases as the number of index shards increases, freeing up
bits for larger block numbers as the directory expands. Shardmap
therefore implements an index entry as several variable sized fields.
This strategy works well up to the billion entry range, above which the
number of hash index collisions (each of which must be resolved by
accessing an underlying record block) starts to increase noticeably.
This is really the only factor that limits Shardmap performance above
a billion entries. Should we wish Shardmap to scale to trillions of
entries without losing performance, we will need to increase the index
entry size to ten bytes or so, or come up with some as yet unknown
clever improvement (potential thesis project goes here).
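
To illustrate the variable sized field idea, here is a toy version of the
packing. The widths and names are invented for the example and do not match
the real Shardmap layout; it only shows how bits consumed by shard selection
become available to the block field:

/* Toy packing of an eight byte index entry (not the real Shardmap layout).
 * The high hash bits already used to select the shard need not be stored,
 * so as the shard count grows, hashbits shrinks and the block field grows.
 * Assumes 1 <= hashbits <= 63. */
#include <stdint.h>

static inline uint64_t pack_entry(uint64_t hash, uint64_t block, unsigned hashbits)
{
	uint64_t hashpart = hash & ((1ULL << hashbits) - 1);
	return (hashpart << (64 - hashbits)) | block;
}

static inline uint64_t entry_block(uint64_t entry, unsigned hashbits)
{
	return entry & ((1ULL << (64 - hashbits)) - 1);
}

static inline uint64_t entry_hash(uint64_t entry, unsigned hashbits)
{
	return entry >> (64 - hashbits);
}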

There are many additional technical details of Shardmap that ought to be
explained, however today my primary purpose is simply to introduce what
I view as a compelling case for obsoleting HTree. To that end, fewer
words are better and this post is already quite long enough. I would love
to get into some other interesting details, for example, the Bigmap free
space mapping strategy, but that really needs its own post to do it
justice, as do a number of other subjects.

Wrapping up, what about that theoretical case where HTree outperforms
Shardmap? This is theoretical because one needs to operate on a huge
directory with tiny cache to trigger it. Both Shardmap and HTree will
exhibit read multiplication under such conditions, due to frequent
cache evictions, however the Shardmap read multiplication will be many
times higher than HTree because of its coarser cache granularity. In the
unlikely event that we ever need to fix this, one viable solution is to
add paging support for Shardmap's in-memory cache structure, a standard
technique sometimes called "anticache".

That is it for today. There remains much to explain about Shardmap both
within and beyond the Ext4 context. For example, Shardmap has proved to
work very well as a standalone key value store, particularly with
persistent memory. In fact, we have benchmarked Shardmap at over three
million atomic, durable database operations per second on an Intel
Optane server, which might well be a new world record. The details of
how this was done are fascinating, however this post is far too small to
contain them today. Perhaps this should be thing 1(b) for next week.

Regards,

Daniel


* Re: [RFC] Thing 1: Shardmap for Ext4
  2019-11-27  1:47 [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
@ 2019-11-27  7:40 ` Vyacheslav Dubeyko
  2019-11-27  8:28   ` Daniel Phillips
  2019-11-27 14:25 ` Theodore Y. Ts'o
  1 sibling, 1 reply; 32+ messages in thread
From: Vyacheslav Dubeyko @ 2019-11-27  7:40 UTC (permalink / raw)
  To: Daniel Phillips, linux-ext4, linux-kernel, linux-fsdevel,
	Theodore Y. Ts'o, OGAWA Hirofumi

On Tue, 2019-11-26 at 17:47 -0800, Daniel Phillips wrote:
> Hi folks,
> 
> Here is my somewhat tardy followup to my Three Things post from earlier
> this fall. I give you Thing 1, Shardmap. What I hope to accomplish today
> is to initiate a conversation with Ext4 developers, and other interested
> observers, which will eventually result in merging the new Shardmap
> directory index code into Ext4, thereby solving a number of longstanding
> issues with the venerable and somewhat problematic HTree.
> 
> HTree is the most used directory index in the known universe. HTree does
> some things fantastically well, particularly in the most common range of
> directory sizes, but also exhibits some well known flaws and at least one
> previously unknown flaw, explained below. Shardmap is a new index design,
> just seven years old, an O(1) extensible hash table, meant to address all
> of HTree's flaws while improving performance and scaling into the
> previously inaccessible billion file per directory range. Subject to
> verifying these claims, it would seem logical to move on to the
> logistics of porting Shardmap to Ext4 as an optional index algorithm,
> eventually deprecating HTree.
> 


As far as I know, a folder usually contains dozens or hundreds of
files/folders on average. Many research works have shown this. Do you
mean some special use case in which a folder could contain a billion
files? Could you share some research work that describes a practical
use case with a billion files per folder?

If you are talking about improving performance, do you mean some
particular open-source implementation?


> Shardmap equals or outperforms HTree at all scales, with the single
> exception of one theoretical case with a likewise theoretical solution.
> Shardmap is O(1) in all operations - insert, delete, lookup and readdir,
> while HTree is O(log N) in the first three and suffers from a relatively
> large constant readdir overhead. This is not the only reason Shardmap is
> faster than HTree, far from it.
> 
> I will break performance discussion down into four ranges:
>    1) unindexed single block, to about 300 entries
>    2) indexed up to 30 thousand entries
>    3) indexed up to 3 million entries
>    4) indexed up to 1 billion entries.
> 
> In theory, Shardmap and HTree are exactly tied in the common single
> block case, because creating the index is delayed in both cases until
> there are at least two blocks to index. However, Shardmap broke away
> from the traditional Ext2 entry block format in order to improve block
> level operation efficiency and to prevent record fragmentation under
> heavy aging, and is therefore significantly faster than HTree even in
> the single block case.
> 
> HTree does not function at all in the fourth range, up to 1 billion
> entries, because its BTree has at most 2 levels. This simple flaw could be
> corrected without much difficulty, but Shardmap would remain superior for
> a number of reasons.
> 
> The most interesting case is the 300 to 30 thousand entry range, where
> HTree and Shardmap should theoretically be nearly equal, each requiring
> two accesses per lookup. However, HTree does two radix tree lookups while
> Shardmap does one, and the other lookup is in a cached memory object.
> Coupled with the faster record level operations, Shardmap is significantly
> faster in this range. In the 30 thousand to 3 million range, Shardmap's
> performance advantage further widens in accordance with O(1) / O(log N).
> 
> For inserts, Shardmap's streaming update strategy is far superior to
> HTree's random index block write approach. HTree will tend to dirty every
> index block under heavy insert load, so that every index block must be
> written to media per commit, causing serious write multiplication
> issues. In fact, all Btree schemes suffer from this issue, which on the
> face of it appears to be enough reason to eliminate the Btree as a
> realistic design alternative. Shardmap dramatically reduces such per
> commit write multiplication by appending new index entries linearly to
> the tail blocks of a small number of index shards. For delete,
> Shardmap avoids write multiplication by appending tombstone entries to
> index shards, thereby addressing a well known HTree delete performance
> issue.


Do you mean a copy-on-write policy here, or some special technique? How
well does Shardmap suit the SSD use case? Do you mean that we could
reduce the write amplification issue in the NAND flash case?


> 
> HTree has always suffered from a serious mismatch between name storage
> order and inode storage order, greatly exacerbated by the large number
> of directory entries and inodes stored per block (false sharing). In
> particular, a typical HTree hash order directory traversal accesses the
> inode table randomly, multiplying both the cache footprint and write
> traffic. Historically, this was the cause of many performance complaints
> about HTree, though to be sure, such complaints have fallen off with
> the advent of solid state storage. Even so, this issue will continue to
> rear its ugly head when users push the boundaries of cache and directory
> size (google telldir+horror). Shardmap avoids this issue entirely by
> storing and traversing directory entries in simple, classic linear order.
> 
> This touches on the single most significant difference between Shardmap
> and HTree: Shardmap strictly separates its index from record entry blocks,
> while HTree embeds entries directly in the BTree index. The HTree
> strategy performs impressively well at low to medium directory scales,
> but at higher scales it causes significantly more cache pressure, due to
> the fact that the cache footprint of any randomly accessed index is
> necessarily the entire index. In contrast, Shardmap stores directory
> entries permanently in creation order, so that directory traversal is
> in simple linear order with effectively zero overhead. This avoids
> perhaps HTree's most dubious design feature, its arcane and not completely
> reliable hash order readdir traversal, which miraculously has avoided
> serious meltdown over these past two decades due to a legendary hack by
> Ted and subsequent careful adaptation to handle various corner cases.
> Nowhere in Shardmap will you find any such arcane and marginally
> supportable code, which should be a great relief to Ext4 maintainers.
> Or to put it another way, if somebody out there wishes to export a
> billion file directory using NFSv2, Shardmap will not be the reason
> why that does not work whereas HTree most probably would be.
> 
> Besides separating out the index from entry records and accessing those
> records linearly in most situations, Shardmap also benefits from a very
> compact index design. Even if a directory has a billion entries, each
> index entry is only eight bytes in size. This exercise in structure
> compaction proved possible because the number of hash bits needed for the
> hash code decreases as the number of index shards increases, freeing up
> bits for larger block numbers as the directory expands. Shardmap
> therefore implements an index entry as several variable sized fields.
> This strategy works well up to the billion entry range, above which the
> number of hash index collisions (each of which must be resolved by
> accessing an underlying record block) starts to increase noticeably.
> This is really the only factor that limits Shardmap performance above
> a billion entries. Should we wish Shardmap to scale to trillions of
> entries without losing performance, we will need to increase the index
> entry size to ten bytes or so, or come up with some as yet unknown
> clever improvement (potential thesis project goes here).
> 
> There are many additional technical details of Shardmap that ought to be
> explained, however today my primary purpose is simply to introduce what
> I view as a compelling case for obsoleting HTree. To that end, fewer
> words are better and this post is already quite long enough. I would love
> to get into some other interesting details, for example, the Bigmap free
> space mapping strategy, but that really needs its own post to do it
> justice, as do a number of other subjects.
> 
> Wrapping up, what about that theoretical case where HTree outperforms
> Shardmap? This is theoretical because one needs to operate on a huge
> directory with tiny cache to trigger it. Both Shardmap and HTree will
> exhibit read multiplication under such conditions, due to frequent
> cache evictions, however the Shardmap read multiplication will be many
> times higher than HTree because of its coarser cache granularity. In the
> unlikely event that we ever need to fix this, one viable solution is to
> add paging support for Shardmap's in-memory cache structure, a standard
> technique sometimes called "anticache".
> 
> That is it for today. There remains much to explain about Shardmap both
> within and beyond the Ext4 context. For example, Shardmap has proved to
> work very well as a standalone key value store, particularly with
> persistent memory. In fact, we have benchmarked Shardmap at over three
> million atomic, durable database operations per second on an Intel
> Optane server, which might well be a new world record. The details of
> how this was done are fascinating, however this post is far too small to
> contain them today. Perhaps this should be thing 1(b) for next week.
> 


Let's imagine that we need to implement the Shardmap approach. Could
you estimate the implementation and stabilization time? How expensive
and long-term an effort could it be?

Thanks,
Viacheslav Dubeyko.




* Re: [RFC] Thing 1: Shardmap for Ext4
  2019-11-27  7:40 ` Vyacheslav Dubeyko
@ 2019-11-27  8:28   ` Daniel Phillips
  2019-11-27 19:35     ` Viacheslav Dubeyko
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Phillips @ 2019-11-27  8:28 UTC (permalink / raw)
  To: Vyacheslav Dubeyko, linux-ext4, linux-kernel, linux-fsdevel,
	Theodore Y. Ts'o, OGAWA Hirofumi, Darrick J. Wong

On 2019-11-26 11:40 p.m., Vyacheslav Dubeyko wrote:
> As far as I know, usually, a folder contains dozens or hundreds
> files/folders in average. There are many research works that had showed
> this fact. Do you mean some special use-case when folder could contain
> the billion files? Could you share some research work that describes
> some practical use-case with billion files per folder?

You are entirely correct that the vast majority of directories contain
only a handful of files. That is my case (1). A few directories on a
typical server can go into the tens of thousands of files. There was
a time when we could not handle those efficiently, and now thanks to
HTree we can. Some directories go into the millions, ask the Lustre
people about that. If you could have a directory with a billion files
then somebody will have a use for it. For example, you may be able to
avoid a database for a particular application and just use the file
system instead.

Now, scaling to a billion files is just one of several things that
Shardmap does better than HTree. More immediately, Shardmap implements
readdir simply, accurately and efficiently, unlike HTree. See here for
some discussion:

   https://lwn.net/Articles/544520/
   "Widening ext4's readdir() cookie"

See the recommendation that is sometimes offered to work around
HTree's issues with processing files in hash order. Basically, read
the entire directory into memory, sort by inode number, then process
in that order. As an application writer, do you really want to do this,
or would you prefer that the filesystem just take care of it for you, so
that the normal, simple and readable code is also the most efficient code?
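
For anyone who has not run into it, the workaround amounts to something
like this (plain POSIX calls, error handling omitted):

/* The workaround being described: slurp the directory, sort by inode
 * number, then process in that order to avoid thrashing the inode table. */
#include <sys/types.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

struct ent { ino_t ino; char name[256]; };

static int by_ino(const void *a, const void *b)
{
	ino_t x = ((const struct ent *)a)->ino, y = ((const struct ent *)b)->ino;
	return x < y ? -1 : x > y;
}

int process_sorted(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	struct ent *v = NULL;
	size_t n = 0, max = 0;

	if (!dir)
		return -1;
	while ((de = readdir(dir))) {
		if (n == max)
			v = realloc(v, (max = max ? 2 * max : 1024) * sizeof *v);
		v[n].ino = de->d_ino;
		snprintf(v[n].name, sizeof v[n].name, "%s", de->d_name);
		n++;
	}
	closedir(dir);
	qsort(v, n, sizeof *v, by_ino);
	for (size_t i = 0; i < n; i++)
		puts(v[i].name);	/* stat/process in inode order here */
	free(v);
	return 0;
}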

> If you are talking about improving the performance then do you mean
> some special open-source implementation?

I mean the same kind of kernel filesystem implementation that HTree
currently has. Open source of course, GPLv2 to be specific.

>> For delete, Shardmap avoids write multiplication by appending tombstone
>> entries to index shards, thereby addressing a well known HTree delete
>> performance issue.
> 
> Do you mean Copy-On-Write policy here or some special technique?

The technique Shardmap uses to reduce write amplification under heavy
load is somewhat similar to the technique used by Google's Bigtable to
achieve a similar result for data files. (However, the resemblance to
Bigtable ends there.)

Each update to a Shardmap index is done twice: once in a highly
optimized hash table shard in cache, then again by appending an
entry to the tail of the shard's media "fifo". Media writes are
therefore mostly linear. I say mostly, because if there is a large
number of shards then a single commit may need to update the tail
block of each one, which still adds up to vastly fewer blocks than
the BTree case, where it is easy to construct cases where every
index block must be updated on every commit, a nasty example of
n**2 performance overhead.

> How could be good Shardmap for the SSD use-case? Do you mean that we
> could reduce write amplification issue for NAND flash case?

Correct. Reducing write amplification is particularly important for
flash based storage. It also has a noticeable beneficial effect on
efficiency under many common and not so common loads.

> Let's imagine that it needs to implement the Shardmap approach. Could
> you estimate the implementation and stabilization time? How expensive
> and long-term efforts could it be?

Shardmap is already implemented and stable, though it does need wider
usage and testing. Code is available here:

   https://github.com/danielbot/Shardmap

Shardmap needs to be ported to kernel, already planned and in progress
for Tux3. Now I am proposing that the Ext4 team should consider porting
Shardmap to Ext4, or at least enter into a serious discussion of the
logistics.

Added Darrick to cc, as he is already fairly familiar with this subject,
once was an Ext4 developer, and perhaps still is should the need arise.
By the way, is there a reason that Ted's MIT address bounced on my
original post?

Regards,

Daniel


* Re: [RFC] Thing 1: Shardmap for Ext4
  2019-11-27  1:47 [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
  2019-11-27  7:40 ` Vyacheslav Dubeyko
@ 2019-11-27 14:25 ` Theodore Y. Ts'o
  2019-11-27 22:27   ` Daniel Phillips
  2019-12-02  1:45   ` Daniel Phillips
  1 sibling, 2 replies; 32+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-27 14:25 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

A couple of quick observations about Shardmap.

(1) It's licensed[1] under the GPLv3, so it's not compatible with the
kernel license.  That doesn't matter much for ext4, because...

[1] https://github.com/danielbot/Shardmap/blob/master/LICENSE


(2) It's implemented as userspace code (e.g., it uses open(2),
mmap(2), et al.) and using C++, so it would need to be reimplemented
from scratch for use in the kernel.


(3) It's not particularly well documented, making the above more
challenging, but it appears to be a variation of an extensible hashing
scheme, which was used by dbx and Berkeley DB.


(4) Because of (2), we won't be able to do any actual benchmarks for a
while.  I just checked the latest version of Tux3[2], and it appears
to still be using a linear search scheme for its directory ---
e.g., an O(n) lookup a la ext2.  So I'm guessing Shardmap may have been
*designed* for Tux3, but it has not yet been *implemented* for Tux3?

[2] https://github.com/OGAWAHirofumi/linux-tux3/blob/hirofumi/fs/tux3/dir.c#L283


(5) The claim is made that readdir() accesses files sequentially; but
there is also mention in Shardmap of compressing shards (e.g.,
rewriting them) to squeeze out deleted and tombstone entries.  This
pretty much guarantees that it will not be possible to satisfy POSIX
requirements of telldir(2)/seekdir(3) (using a 32-bit or 64-bit
cookie), NFS (which also requires use of a 32-bit or 64-bit cookie
while doing readdir scan), or readdir() semantics in the face of
directory entries getting inserted or removed from the directory.

(To be specific, POSIX requires readdir returns each entry in a
directory once and only once, and in the case of a directory entry
which is removed or inserted, that directory entry must be returned
exactly zero or one times.  This is true even if telldir(2) or
seekdir(2) is used to memoize a particular location in the directory,
which means you have a 32-bit or 64-bit cookie to define a particular
location in the readdir(2) stream.  If the file system wants to be
exportable via NFS, it must meet similar requirements ---- except the
32-bit or 64-bit cookie MUST survive a reboot.)
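
For concreteness, the application side pattern that has to keep working
looks roughly like this; a minimal sketch, where the long cookie returned
by telldir() is the thing that must stay valid:

/* Minimal sketch of the readdir cookie contract: the cookie must still
 * identify the same position after unrelated creates/unlinks (and, for
 * NFS export, after a server reboot). */
#include <dirent.h>
#include <stdio.h>

static void resume_scan(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	long cookie;

	if (!dir)
		return;
	readdir(dir);			/* consume one entry */
	cookie = telldir(dir);		/* position cookie */

	/* ... other processes may create or unlink entries here ... */

	seekdir(dir, cookie);		/* must resume at the same place: each */
	while ((de = readdir(dir)))	/* surviving entry exactly once */
		printf("%s\n", de->d_name);
	closedir(dir);
}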

Regards,

						- Ted


* Re: [RFC] Thing 1: Shardmap for Ext4
  2019-11-27  8:28   ` Daniel Phillips
@ 2019-11-27 19:35     ` Viacheslav Dubeyko
  2019-11-28  2:54       ` Daniel Phillips
  0 siblings, 1 reply; 32+ messages in thread
From: Viacheslav Dubeyko @ 2019-11-27 19:35 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-ext4, linux-kernel, linux-fsdevel, Theodore Y. Ts'o,
	OGAWA Hirofumi, Darrick J. Wong



> On Nov 27, 2019, at 11:28 AM, Daniel Phillips <daniel@phunq.net> wrote:
> 
> On 2019-11-26 11:40 p.m., Vyacheslav Dubeyko wrote:
>> As far as I know, usually, a folder contains dozens or hundreds
>> files/folders in average. There are many research works that had showed
>> this fact. Do you mean some special use-case when folder could contain
>> the billion files? Could you share some research work that describes
>> some practical use-case with billion files per folder?
> 
> You are entirely correct that the vast majority of directories contain
> only a handful of files. That is my case (1). A few directories on a
> typical server can go into the tens of thousands of files. There was
> a time when we could not handle those efficiently, and now thanks to
> HTree we can. Some directories go into the millions, ask the Lustre
> people about that. If you could have a directory with a billion files
> then somebody will have a use for it. For example, you may be able to
> avoid a database for a particular application and just use the file
> system instead.
> 
> Now, scaling to a billion files is just one of several things that
> Shardmap does better than HTree. More immediately, Shardmap implements
> readdir simply, accurately and efficiently, unlike HTree. See here for
> some discussion:
> 
>   https://lwn.net/Articles/544520/
>   "Widening ext4's readdir() cookie"
> 


So it looks like Shardmap could be better for the case of a billion files in one
folder. But what about the regular case of dozens/hundreds of files in one folder?
Will Shardmap be better than HTree? If the ordinary user sees no visible
performance improvement, then it makes sense to consider Shardmap as an
optional feature. What do you think?

Does it mean that Shardmap is ext4 oriented only? Could it be used for other
file systems?


> See the recommendation that is sometimes offered to work around
> HTree's issues with processing files in hash order. Basically, read
> the entire directory into memory, sort by inode number, then process
> in that order. As an application writer do you really want to do this,
> or would you prefer that the filesystem just take care of for you so
> the normal, simple and readable code is also the most efficient code?
> 


I slightly missed the point here. Reading the whole directory sounds like reading
the dentries tree from the volume. As far as I can see, the dentries are ordered
by names or by hashes. But if we are talking about inode numbers, then we mean
the inodes tree. So I have a misunderstanding here. What do you mean?


>> If you are talking about improving the performance then do you mean
>> some special open-source implementation?
> 
> I mean the same kind of kernel filesystem implementation that HTree
> currently has. Open source of course, GPLv2 to be specific.
> 

I meant the Shardmap implementation. As far as I can see, only the user-space
implementation is available right now. So my question still stands: it's hard to say
how efficient Shardmap could be on the kernel side, as an ext4 subsystem, for example.


>>> For delete, Shardmap avoids write multiplication by appending tombstone
>>> entries to index shards, thereby addressing a well known HTree delete
>>> performance issue.
>> 
>> Do you mean Copy-On-Write policy here or some special technique?
> 
> The technique Shardmap uses to reduce write amplication under heavy
> load is somewhat similar to the technique used by Google's Bigtable to
> achieve a similar result for data files. (However, the resemblance to
> Bigtable ends there.)
> 
> Each update to a Shardmap index is done twice: once in a highly
> optimized hash table shard in cache, then again by appending an
> entry to the tail of the shard's media "fifo". Media writes are
> therefore mostly linear. I say mostly, because if there is a large
> number of shards then a single commit may need to update the tail
> block of each one, which still adds up to vastly fewer blocks than
> the BTree case, where it is easy to construct cases where every
> index block must be updated on every commit, a nasty example of
> n**2 performance overhead.
> 


It sounds like adding updates in a log-structured manner. But what about
the obsolete/invalid blocks? Does it mean that some GC technique is needed
here? I am not completely sure that it would be beneficial for ext4.

By the way, could the old index blocks be used like snapshots in the case
of corruption or other nasty issues?


>> How could be good Shardmap for the SSD use-case? Do you mean that we
>> could reduce write amplification issue for NAND flash case?
> 
> Correct. Reducing write amplification is particularly important for
> flash based storage. It also has a noticeable beneficial effect on
> efficiency under many common and not so common loads.
> 
>> Let's imagine that it needs to implement the Shardmap approach. Could
>> you estimate the implementation and stabilization time? How expensive
>> and long-term efforts could it be?
> 
> Shardmap is already implemented and stable, though it does need wider
> usage and testing. Code is available here:
> 
>   https://github.com/danielbot/Shardmap
> 
> Shardmap needs to be ported to kernel, already planned and in progress
> for Tux3. Now I am proposing that the Ext4 team should consider porting
> Shardmap to Ext4, or at least enter into a serious discussion of the
> logistics.
> 
> Added Darrick to cc, as he is already fairly familiar with this subject,
> once was an Ext4 developer, and perhaps still is should the need arise.
> By the way, is there a reason that Ted's MIT address bounced on my
> original post?
> 

It's hard to talk about stability because we don't have a kernel-side implementation
of Shardmap for ext4. I suppose the porting would take about a year, and the
stabilization twice as long again. Porting user-space code to the kernel can be a
tricky task. Could you estimate how many lines of code the core part of Shardmap
contains? Does this feature require changes to the ext4 on-disk layout? How easily
can ext4 functionality be modified for Shardmap support?

Thanks,
Viacheslav Dubeyko.



* Re: [RFC] Thing 1: Shardmap for Ext4
  2019-11-27 14:25 ` Theodore Y. Ts'o
@ 2019-11-27 22:27   ` Daniel Phillips
  2019-11-28  2:28     ` Theodore Y. Ts'o
  2019-12-02  1:45   ` Daniel Phillips
  1 sibling, 1 reply; 32+ messages in thread
From: Daniel Phillips @ 2019-11-27 22:27 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

Hi Ted,

I trust you will find your initial points satisfactorily addressed below.

On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote:
> A couple of quick observations about Shardmap.
> 
> (1) It's licensed[1] under the GPLv3, so it's not compatible with the
> kernel license.  That doesn't matter much for ext4, because...
> 
> [1] https://github.com/danielbot/Shardmap/blob/master/LICENSE

The kernel port of Shardmap will (necessarily) be licensed under GPLv2.

> (2) It's implemented as userspace code (e.g., it uses open(2),
> mmap(2), et. al) and using C++, so it would need to be reimplemented
> from scratch for use in the kernel.

Right. Some of these details, like open, are obviously trivial, others
less so. Reimplementing from scratch is an overstatement because the
actual intrusions of user space code are just a small portion of the code
and nearly all abstracted behind APIs that can be implemented as needed
for userspace or kernel in out of line helpers, so that the main source
is strictly unaware of the difference. That said, we can just fork off a
kernel version and not worry about keeping compatibility with user space
if you wish, though putting in the extra effort to make it dual mode
would probably be helpful for e2fsck.

Also, most of this work is already being done for Tux3, so the only
Ext4-specific work needing doing may well be the differences in atomic
commit required to accommodate Ext4's ordered journaling, versus Tux3's
(more precise) delta commit. To that end, we could discuss the atomic
commit strategy that we use for the persistent memory implementation of
Shardmap, which may turn out to be largely applicable to Ext4's journal
transaction model.

> (3) It's not particularly well documented, making the above more
> challenging, but it appears to be a variation of an extensible hashing
> scheme, which was used by dbx and Berkley DB.

Sorry about that. There is this post from a few years back:

   https://lkml.org/lkml/2013/6/18/869

And there is a paper in the works. I can also follow up here with a post
on Shardmap internals, a number of which are interesting and unique.

Shardmap (introduced above as "an O(1) extensible hash table") is indeed
an extensible hashing scheme. Fixed size hash tables are impractical for
databases and file system directories because small data sets waste too
much table space and large data sets have too many collisions. Therefore
every such design must incorporate some form of extensibility. Shardmap's
extension scheme is unique, and worthy of note in its own right as a
contribution to hash table technology. We did benchmark against Berkeley
DB and found Shardmap to be markedly faster. I will hunt around for those
numbers.

Very briefly, the Shardmap index has two distinct forms, one optimized
for media and the other for cache. These are bijective, each being
constructable from the other. The media form (the backing store) only
has a single purpose: to reconstruct the cache form on demand, one shard
at a time.

The cache form is the main source of Shardmap's efficiency. This is a
two level hash table with each entry in the top level table being a
pointer to a self contained hash table object. In contrast to other
extensible hashing schemes, these cache shards are not themselves
extensible. Rather, we simply rewrite entire shards into subshards
as needed.

The top level hash table is where the extensibility happens. At some
threshold, the top level table is expanded by duplicating the pointers
to the hash objects so that multiple buckets may reference the same
hash object. When any of those objects passes a threshold number of
entries, it is split into multiple, smaller hash objects, each with a
unique pointer from the top level table. Traversing this two level
table for lookup or existence tests takes just a few nanoseconds.

Extending the hash in cache is mirrored by extending the media form,
by serializing the cache shard into multiple linear regions on media.
Now here is the key idea: even taking the cost of this media rewrite
into account, insert performance remains O(1), just with a slightly
higher constant factor. Shardmap exploits this subtle mathematical
fact to get the best of both worlds: O(1) performance like a hash and
extensibility like a BTree.
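
In rough code terms, the cache form has the following shape. This is a
simplified sketch with invented names and with the splitting details
elided; it is not the real source, just the topology:

/* Simplified shape of the cache form (invented names, details omitted).
 * Assumes 1 <= top_bits <= 63. */
#include <stdint.h>
#include <stdlib.h>

struct shard;				/* self contained little hash table */

struct shardmap_cache {
	struct shard **top;		/* 1 << top_bits pointers, possibly shared */
	unsigned top_bits;		/* high hash bits that select a shard */
};

/* Lookup cost: one shift and one table load, a few nanoseconds. */
static struct shard *select_shard(struct shardmap_cache *map, uint64_t hash)
{
	return map->top[hash >> (64 - map->top_bits)];
}

/* Extending: double the top table by duplicating every pointer, so two
 * buckets now share each shard object. An overfull shard is later split
 * into smaller shards, each regaining a unique top level pointer. */
static int extend_top(struct shardmap_cache *map)
{
	unsigned oldsize = 1u << map->top_bits;
	struct shard **bigger = calloc(2 * (size_t)oldsize, sizeof *bigger);

	if (!bigger)
		return -1;
	for (unsigned i = 0; i < oldsize; i++)
		bigger[2 * i] = bigger[2 * i + 1] = map->top[i];
	free(map->top);
	map->top = bigger;
	map->top_bits++;
	return 0;
}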

In fact, if you wish to avoid that constant media rewrite factor
entirely, Shardmap lets you do it, by allowing you to specify the
number and size of shards at directory creation time. I have not
benchmarked this, but it could improve average create performance by 20%
or so. However, even with the "extra" media copy, Shardmap still has
impressively high insert performance, in fact it is significantly
faster than any of the high performance key value stores we have tried
so far.

> (4) Because of (2), we won't be able to do any actual benchmarks for a
> while.

(2) is not an issue, the copyright is entirely mine and the license can
be retuned as convenient. Just indicate where the GPLv2 version should
be posted and I will make it so. Perhaps a new Github repo, or Gitlab?

> I just checked the latest version of Tux3[2], and it appears
> to be be still using a linear search scheme for its directory ---
> e.g., an O(n) lookup ala ext2.  So I'm guessing Shardmap may have been
> *designed* for Tux3, but it has not yet been *implemented* for Tux3?
> 
> [2] https://github.com/OGAWAHirofumi/linux-tux3/blob/hirofumi/fs/tux3/dir.c#L283

Correct, not yet ported to Tux3, however this work is in progress. There
are some sticky little points to work out such as how to implement the
largish cache shard objects without using virtual memory. The PAGEMAP
compilation option in the current source breaks those objects up into
pages, essentially doing virtual memory by hand, which will add some
small amount of additional overhead to the kernel version versus the
user space version, nothing to worry about. However it does make me wish
that we had better kernel support for virtual memory.

There are various other kernel porting details that are maybe a bit too
fine grained for this post. Example: Shardmap is a memory mapped db but
we don't have mmap in kernel, so must do this by hand also.

> (5) The claim is made that readdir() accesses files sequentially; but
> there is also mention in Shardmap of compressing shards (e.g.,
> rewriting them) to squeeze out deleted and tombstone entries.  This
> pretty much guarantees that it will not be possible to satisfy POSIX
> requirements of telldir(2)/seekdir(3) (using a 32-bit or 64-bitt
> cookie), NFS (which also requires use of a 32-bit or 64-bit cookie
> while doing readdir scan), or readdir() semantics in the face of
> directory entries getting inserted or removed from the directory.

No problem, the data blocks are completely separate from the index so
readdir just walks through them in linear order a la classic UFS/Ext2.
What could possibly be simpler, faster or more POSIX compliant?

> (To be specific, POSIX requires readdir returns each entry in a
> directory once and only once, and in the case of a directory entry
> which is removed or inserted, that directory entry must be returned
> exactly zero or one times.  This is true even if telldir(2) ort
> seekdir(2) is used to memoize a particular location in the directory,
> which means you have a 32-bit or 64-bit cookie to define a particular
> location in the readdir(2) stream.  If the file system wants to be
> exportable via NFS, it must meet similar requirements ---- except the
> 32-bit or 64-bit cookie MUST survive a reboot.)

So we finally get to fix this nagging HTree defect after all these
years. Thank you once again for that sweet hack, but with luck we
will be able to obsolete it by this time next year.

Regards,

Daniel


* Re: [RFC] Thing 1: Shardmap for Ext4
  2019-11-27 22:27   ` Daniel Phillips
@ 2019-11-28  2:28     ` Theodore Y. Ts'o
  2019-11-28  4:27       ` Daniel Phillips
                         ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-28  2:28 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On Wed, Nov 27, 2019 at 02:27:27PM -0800, Daniel Phillips wrote:
> > (2) It's implemented as userspace code (e.g., it uses open(2),
> > mmap(2), et. al) and using C++, so it would need to be reimplemented
> > from scratch for use in the kernel.
> 
> Right. Some of these details, like open, are obviously trivial, others
> less so. Reimplementing from scratch is an overstatement because the
> actual intrusions of user space code are just a small portion of the code
> and nearly all abstracted behind APIs that can be implemented as needed
> for userspace or kernel in out of line helpers, so that the main source
> is strictly unaware of the difference.

The use of C++ with templates is presumably one of the "less so"
parts, and it was that which I had in mind when I said,
"reimplementing from scratch".

> Also, most of this work is already being done for Tux3,

Great, when that work is done, we can take a look at the code and
see....

> > (5) The claim is made that readdir() accesses files sequentially; but
> > there is also mention in Shardmap of compressing shards (e.g.,
> > rewriting them) to squeeze out deleted and tombstone entries.  This
> > pretty much guarantees that it will not be possible to satisfy POSIX
> > requirements of telldir(2)/seekdir(3) (using a 32-bit or 64-bitt
> > cookie), NFS (which also requires use of a 32-bit or 64-bit cookie
> > while doing readdir scan), or readdir() semantics in the face of
> > directory entries getting inserted or removed from the directory.
> 
> No problem, the data blocks are completely separate from the index so
> readdir just walks through them in linear order a la classic UFS/Ext2.
> What could possibly be simpler, faster or more POSIX compliant?

OK, so what you're saying then is that for every single directory entry
addition or removal, (at least) two blocks must be modified: (at least
one) index block, and a data block, no?  That
makes it worse than htree, where most of the time we only need to
modify a single leaf node.  We only have to touch an index block when
a leaf node gets full and it needs to be split.

Anyway, let's wait and see how you and Hirofumi-san work out those
details for Tux3, and we can look at that and consider next steps at
that time.

Cheers,

						- Ted


* Re: [RFC] Thing 1: Shardmap for Ext4
  2019-11-27 19:35     ` Viacheslav Dubeyko
@ 2019-11-28  2:54       ` Daniel Phillips
  2019-11-28  9:15         ` Andreas Dilger
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Phillips @ 2019-11-28  2:54 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: linux-ext4, linux-kernel, linux-fsdevel, Theodore Y. Ts'o,
	OGAWA Hirofumi, Darrick J. Wong, Andreas Dilger

On 2019-11-27 11:35 a.m., Viacheslav Dubeyko wrote:
> So, it looks like that Shardmap could be better for the case of billion files in one folder.
> But what’s about the regular case when it could be dozens/hundreds of files in one
> folder? Will Shardmap be better than HTree?

Yes, Shardmap is also faster than HTree in that case. Both Shardmap and
HTree are unindexed in that range, however Shardmap is faster because of
two things:

 1) Shardmap uses a more efficient record block format that incorporates
 a single byte hash code that avoids 99% of failed string compares
 (see the sketch below, after point 2).

 2) Shardmap "pins" the current target block in which new entries are
 created, avoiding repeated radix tree lookups for insert under load.
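
To illustrate point (1), here is a sketch with an invented record layout
(not the actual Shardmap record block format):

/* Sketch of point (1): each record carries one byte of the name hash, so
 * a block scan rejects nearly every non-matching name with a one byte
 * compare instead of a string compare. Invented layout for illustration. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct record {
	uint64_t inum;		/* inode number */
	uint8_t  lowhash;	/* one byte of the name hash */
	uint8_t  namelen;
	char     name[];
};

static struct record *scan_block(struct record **recs, unsigned count,
				 const char *name, uint8_t lowhash)
{
	size_t len = strlen(name);

	for (unsigned i = 0; i < count; i++) {
		if (recs[i]->lowhash != lowhash)	/* cheap reject, ~99% */
			continue;
		if (recs[i]->namelen == len &&
		    !memcmp(recs[i]->name, name, len))
			return recs[i];
	}
	return NULL;
}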

As soon as you get into thousands of files, the difference between
Shardmap and HTree widens dramatically so that Shardmap ends up faster
by a factor of 2, 3, 4, etc as directory size increases. Not just
faster than HTree, but faster than any tree based scheme, because of
the O(1) / O(log N) equation.

Up in the millions and billions of files, HTree just drops out of the
running, but if it were actually improved to operate in that range then
lookups would be at least 4 times slower due to index block probes, and
heavy insert loads would be orders of magnitude slower due to write
multiplication on commit. Of course I hear you when you say that you
don't have any million file directories to worry about, but some folks
do. (Any comment, Andreas?)

> If the ordinary user hasn’t visible
> performance improvement then it makes sense to consider Shardmap like the
> optional feature. What do you think?

I am confident that a significant number of users will perceive the
performance improvement, and that few if any will perceive a slowdown for
any reason short of an outright bug.

> Does it mean that Shardmap is ext4 oriented only? Could it be used for another
> file systems?

Shardmap is not Ext4-only. In fact, Shardmap is firstly the directory
index for Tux3, and I am now proposing that Shardmap should also be
the new directory index for Ext4.

I also heard a suggestion that Shardmap could/should become a generic
kernel library facility that could be used by any file system or
kernel subsystem that requires a high performance persistent
associative string mapping.

Shardmap is also well on its way to being released as a generic high
performance KVS, including supporting persistent memory, a role it
performs in a highly satisfactory way. There will be a post about
this later, but for today, a spoiler: atomic, durable KVS database
transactions at a rate in excess of three per microsecond(!)

>> See the recommendation that is sometimes offered to work around
>> HTree's issues with processing files in hash order. Basically, read
>> the entire directory into memory, sort by inode number, then process
>> in that order. As an application writer do you really want to do this,
>> or would you prefer that the filesystem just take care of for you so
>> the normal, simple and readable code is also the most efficient code?
> 
> I slightly missed the point here. To read the whole directory sounds like to read
> the dentries tree from the volume. As far as I can see, the dentries are ordered
> by names or by hashes. But if we are talking about inode number then we mean
> the inodes tree. So, I have misunderstanding here. What do you mean?

It's a bit of a horror show. Ted is the expert on it, I was largely a
bystander at the time this was implemented. Basically, the issue is that
HTree is a hash-ordered BTree (the H in HTree) and the only way to
traverse it for readdir that can possibly satisfy POSIX requirements is
by hash order. If you try to traverse in linear block order then a
simultaneous insert could split a block and move some entries to other
blocks, which then may be returned by readdir twice or not at all. So
hash order traversal is necessary, but this is not easy because hashes
can collide. So what Ted did is, create a temporary structure that
persists for some period of time, to utilize a higher resolution hash
code which is used to resolve collisions and provide telldir cookies
for readdir. Basically. This is even more tricky than it sounds for
various horrifying reasons.

If you want the whole story you will have to ask Ted. Suffice to say that
it takes a special kind of mind to even conceive of such a mechanism, let
alone get it working so well that we have not seen a single complaint
about it for years. However, this code is by any standard, scary and only
marginally maintainable. It further seems likely that a sufficiently
determined person could construct a case where it fails, though I cannot
swear to that one way or the other.

Why did we go to all this effort as opposed to just introducing some
additional ordering metadata as XFS does? Because HTree is faster
without that additional metadata to maintain, and more compact. The
user perceives this; the user does not perceive the pain suffered by
implementors to get this working, nor does the user need to confront
the full horror of the source code. The user cares only about how it
works, and it does work pretty darn well. That said, deprecating this
code will still be immensely satisfying from where I sit. It is more
than likely that Ted shares the same view, though I certainly cannot
speak for him.

In summary, we should all just be happy that this readdir hack worked
well enough over the last 15 years or so to run the world's internet
and phones so reliably. Now let's retire it please, and move on to
something with a sounder design basis, and which is far easier to
understand and maintain, and runs faster to boot. Now, with Shardmap,
readdir runs at media transfer speed, with near zero cpu, unlike
HTree.

>>> If you are talking about improving the performance then do you mean
>>> some special open-source implementation?
>>
>> I mean the same kind of kernel filesystem implementation that HTree
>> currently has. Open source of course, GPLv2 to be specific.
> 
> I meant the Shardmap implementation. As far as I can see, the
> user-space implementation is available only now. So, my question is
> still here. It’s hard to say how efficient the Shardmap could be on
> kernel side as ext4 subsystem, for example.

That is actually quite easy to predict. All of our benchmarking so far
has been with user space Shardmap running on top of Ext4, so we already
have a pretty accurate picture of the overheads involved. That said,
there will be two main differences between the user space code and the
kernel code:

   1) We don't have virtual memory in kernel in any practical form, so
   we need to simulate it with explicit lookups in a vector of page
   pointers, costing a tiny but likely measurable amount of cpu compared
   to the hardware implementation that user space enjoys (see the sketch
   after point 2 below).

   2) We don't know the overhead of atomic commit for Ext4 yet. We do
   have a pretty good picture for Tux3: near zero. And we have a very
   clear picture of the atomic commit overhead for persistent memory,
   which is nonzero but much less than annoying. So atomic commit
   overhead for Ext4 should be similar. This is really where the skill
   of Ext4 developers kicks in, and of course I expect great things
   in that regard, as has been the case consistently to date.
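
To make point 1 concrete, the by-hand page lookup amounts to roughly the
following. This is a user space flavored sketch; the kernel version would
naturally use its own page and mapping primitives:

/* "Virtual memory by hand" (sketch): the shard is a vector of page
 * pointers and every access does an explicit shift-and-mask page lookup,
 * instead of letting the MMU resolve one big mapping. That is the small
 * extra cpu cost mentioned in point 1. */
#include <stddef.h>
#include <stdint.h>

#define SHARD_PAGE_SHIFT 12
#define SHARD_PAGE_MASK  ((1u << SHARD_PAGE_SHIFT) - 1)

struct paged_shard {
	uint8_t **pages;		/* one pointer per backing page */
};

static inline uint8_t *shard_ptr(struct paged_shard *shard, size_t offset)
{
	return shard->pages[offset >> SHARD_PAGE_SHIFT]
		+ (offset & SHARD_PAGE_MASK);
}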

>>>> For delete, Shardmap avoids write multiplication by appending tombstone
>>>> entries to index shards, thereby addressing a well known HTree delete
>>>> performance issue.
>>>
>>> Do you mean Copy-On-Write policy here or some special technique?
>>
>> The technique Shardmap uses to reduce write amplication under heavy
>> load is somewhat similar to the technique used by Google's Bigtable to
>> achieve a similar result for data files. (However, the resemblance to
>> Bigtable ends there.)
>>
>> Each update to a Shardmap index is done twice: once in a highly
>> optimized hash table shard in cache, then again by appending an
>> entry to the tail of the shard's media "fifo". Media writes are
>> therefore mostly linear. I say mostly, because if there is a large
>> number of shards then a single commit may need to update the tail
>> block of each one, which still adds up to vastly fewer blocks than
>> the BTree case, where it is easy to construct cases where every
>> index block must be updated on every commit, a nasty example of
>> n**2 performance overhead.
> 
> It sounds like adding updates in log-structured manner. But what’s about
> the obsolete/invalid blocks? Does it mean that it need to use some GC technique
> here? I am not completely sure that it could be beneficial for the ext4.

This is vaguely similar to log structured updating, but then again, it
is more different than similar. A better term might be "streaming
updates". This is a popular theme of modern file system and database
design, and the basis of many recent performance breakthroughs.

Shardmap never garbage collects. Instead, when a shard fifo gets too
many tombstones, it is just compacted by writing out the entire cache
shard on top of the old, excessively fluffy shard fifo. This is both
efficient and rare, for various reasons that require detailed analysis
of the data structures involved. I will get to that eventually, but now
is probably not the best time. The source code makes it clear.
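
In sketch form, with invented names and an invented threshold, the rule
is simply:

/* Sketch of the compaction rule (invented names and threshold): when the
 * fifo has grown well past the live entry count, rewrite it wholesale
 * from the cache shard, one linear media write, and the tombstones are
 * gone. No garbage collection pass, ever. */
struct shard_state {
	unsigned long long *live;	/* cache form: the live entries */
	unsigned live_count;
	unsigned fifo_count;		/* media fifo length, tombstones included */
};

/* stands in for the linear media write of the serialized cache shard */
static void rewrite_fifo(const unsigned long long *live, unsigned count)
{
	(void)live;
	(void)count;
}

static void maybe_compact(struct shard_state *shard)
{
	if (shard->fifo_count <= 2 * shard->live_count)
		return;
	rewrite_fifo(shard->live, shard->live_count);
	shard->fifo_count = shard->live_count;
}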

> By the way, could the old index blocks be used like the snapshots in the case
> of corruptions or other nasty issues?

My feeling is, that is not a natural fit. However, rescuing Shardmap from
index corruption is easy: just throw away the entire index and construct
a new one by walking the entry record blocks. Maybe there are cute ways
to make that incremental, but the simplest easiest way should actually be
enough for the long term.

>>> Let's imagine that it needs to implement the Shardmap approach. Could
>>> you estimate the implementation and stabilization time? How expensive
>>> and long-term efforts could it be?
>>
>> Shardmap is already implemented and stable, though it does need wider
>> usage and testing. Code is available here:
>>
>>    https://github.com/danielbot/Shardmap
>>
>> Shardmap needs to be ported to kernel, already planned and in progress
>> for Tux3. Now I am proposing that the Ext4 team should consider porting
>> Shardmap to Ext4, or at least enter into a serious discussion of the
>> logistics.
>>
>> Added Darrick to cc, as he is already fairly familiar with this subject,
>> once was an Ext4 developer, and perhaps still is should the need arise.
>> By the way, is there a reason that Ted's MIT address bounced on my
>> original post?
> 
> It’s hard to talk about stability because we haven’t kernel-side implementation
> of Shardmap for ext4. I suppose that it needs to spend about a year for the porting
> and twice more time for the stabilization.

Agreed, my best guess is roughly similar.

> To port a user-space code to the kernel could be the tricky task.

Sometimes true, however not in this case. Shardmap is broadly similar to
other code we have ported from user space to kernel in the past, with the
two exceptions I noted above. All just part of a kernel hacker's normal day
in my humble opinion.

> Could you estimate how many lines of code the core
> part of Shardmap contains?

The finished Ext4 code should be somewhere between 2k and 3k lines, about
the same as HTree.

> Does it need to change the ext4 on-disk layout for this feature?

No, this new form of index would be a mount option, and only used for
new directories. The HTree code will necessarily remain part of Ext4
forever, for compatibility with existing volumes. By "deprecate HTree"
I meant, eventually make the Shardmap index the default after it has
gone through its stabilization period of who knows how long? Three
years? Five? It's hard to know the future in that regard.

This would be roughly similar to the transition we did from unindexed
to HTree index some 18 years ago. (Deja vu all over again!) Last time it
went smoothly and this time we additionally have the benefit of having
done it before.

> How easy ext4 functionality can be modified for Shardmap support?

From the user's point of view, completely trivial. Initially just a mount
option, and later, no action at all. From the sysadmin's point of view,
something new to learn about, some new procedures in case things go wrong,
but essentially all in a day's work. From the developer's point of view,
one of the easier major hacks that one could contemplate, I expect.
Technical risk is nearly nil because Shardmap is already quite
mature, being seven years old as it is, and having had the benefit of
considerable earlier design experience.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-11-28  2:28     ` Theodore Y. Ts'o
@ 2019-11-28  4:27       ` Daniel Phillips
  2019-11-30 17:50         ` Theodore Y. Ts'o
  2019-11-28 21:17       ` [RFC] Thing 1: Shardmap fox Ext4 Daniel Phillips
  2019-12-08 10:25       ` Daniel Phillips
  2 siblings, 1 reply; 32+ messages in thread
From: Daniel Phillips @ 2019-11-28  4:27 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On 2019-11-27 6:28 p.m., Theodore Y. Ts'o wrote:
> On Wed, Nov 27, 2019 at 02:27:27PM -0800, Daniel Phillips wrote:
>>> (2) It's implemented as userspace code (e.g., it uses open(2),
>>> mmap(2), et. al) and using C++, so it would need to be reimplemented
>>> from scratch for use in the kernel.
>>
>> Right. Some of these details, like open, are obviously trivial, others
>> less so. Reimplementing from scratch is an overstatement because the
>> actual intrusions of user space code are just a small portion of the code
>> and nearly all abstracted behind APIs that can be implemented as needed
>> for userspace or kernel in out of line helpers, so that the main source
>> is strictly unaware of the difference.
> 
> The use of C++ with templates is presumably one of the "less so"
> parts, and it was that which I had in mind when I said,
> "reimplementing from scratch".

Ah, I see what you mean. To be honest, C++ has now become so natural for
me that I don't even think about it. You and Linus really ought to get
some of that :-)

If you look closely, you will notice that my C++ is largely just "C
compiled by C++", and I cheerfully undertake to convert away the few
places where I have simplified my life and sped up development by using
actual idiomatic C++ constructs.

By way of anecdote, coding the current user space version of Shardmap
in C++ cut my development time to go from the pure C prototype to the
world record persistent memory demonstration by a factor of roughly 3.
I now find it far faster to develop in C++ and then translate mindlessly
back to C as necessary, than to slog through the entire development
process using nothing but classic C, a language that really ought to
have done the right thing and stayed back in the century for which it
was created.

But to each his own. Ask for a pure C version licensed under GPLv2
and you shall receive. Note that we already have one here:

   https://github.com/OGAWAHirofumi/tux3/blob/master/user/devel/shard.c

Perhaps this will be easier on your eyes. It is essentially the same
thing less the persistent memory support and plus a bug or two.

Ah, one more anecdote. Shardmap implements polymorphic record block
operations, so that low level record format can be uniquely tailored
to the kind of data being stored. Overlooking the fact that we can
simply remove that mechanism for the kernel port because Ext4 does
not need more than one kind of record format, I can cheerfully
report that the official C++ way of implementing polymorphism using
virtual functions turned out to suck donkey dung compared to the
classic C/kernel way, where function vectors are handled as first
class data objects.

I actually implemented it both ways, but the virtual function way
turned out to be unspeakably painful for various reasons, hard to
read, and hard to modify without having it regularly blow up into
zillions of itty bitty little insane pieces. One day, after sinking
a couple of weeks into getting it finally working the official C++
way, I just threw it all out and recoded in kernel style, which took
about 3 hours; the result was not only much easier to read and write,
it also generated better machine code.
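
For anyone who has not lived in kernel land, the "function vectors as
first class data" style I mean is just the usual ops table pattern; a
tiny illustration (invented names, not the Shardmap source):

    /* Kernel-style polymorphism: an ops table chosen per record block
     * format, plain data plus one indirect call. Names are invented
     * for illustration. */
    struct recops {
        void *(*lookup)(struct rb *rb, const char *name, unsigned len);
        int (*create)(struct rb *rb, const char *name, unsigned len, u64 loc);
        int (*remove)(struct rb *rb, const char *name, unsigned len);
    };

    static const struct recops dirent_recops = {
        .lookup = dirent_lookup,
        .create = dirent_create,
        .remove = dirent_remove,
    };

    /* call site: one indirect call, the vector itself is ordinary data */
    found = rb->ops->lookup(rb, name, len);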

So there you have it, ammunition to use against C++ if you want it.
But oh wait, it's still C++ isn't it? Why yes it is. C++, just try
it, you'll like it, and nobody is too late to learn it.

But once again, let's be very clear about it: I'm going to remove
*all* the C++ from Shardmap in aid of integrating with Tux3 and
Ext4. So there is no need at all to stay awake at night worrying
about this question.

>> Also, most of this work is already being done for Tux3,
> 
> Great, when that work is done, we can take a look at the code and
> see....

Surely there is much to discuss even before the Tux3 kernel port is
completed. Discussion and planning are cheap compared to leaving
things to the last minute as usual, then rushing them.

>>> (5) The claim is made that readdir() accesses files sequentially; but
>>> there is also mention in Shardmap of compressing shards (e.g.,
>>> rewriting them) to squeeze out deleted and tombstone entries.  This
>>> pretty much guarantees that it will not be possible to satisfy POSIX
>>> requirements of telldir(2)/seekdir(3) (using a 32-bit or 64-bit
>>> cookie), NFS (which also requires use of a 32-bit or 64-bit cookie
>>> while doing readdir scan), or readdir() semantics in the face of
>>> directory entries getting inserted or removed from the directory.
>>
>> No problem, the data blocks are completely separate from the index so
>> readdir just walks through them in linear order a la classic UFS/Ext2.
>> What could possibly be simpler, faster or more POSIX compliant?
> 
> OK, so what you're saying then is for every single directory entry
> addition or removal, there must be (at least) two blocks which must be
> modified, an (at least one) index block, and a data block, no?  That
> makes it worse than htree, where most of the time we only need to
> modify a single leaf node.  We only have to touch an index block when
> a leaf node gets full and it needs to be split.

The operative word above is "single". Usually when we modify a single
entry in a directory we do not care whether the file system touches one
block or two, because a typical minimum commit involves many more than
that.

It may be that you were really thinking about mass instances of single
updates, which Shardmap handles much more efficiently than HTree. Under
mass insert, Shardmap repeatedly updates the same record block whereas
HTree updates some random, usually different leaf block per insert.

You are right that Shardmap also must update the shard fifo tail block,
however there is only one index shard up to 64K entries, so all the new
index entries go into the same tail block(s). Shardmap wins this one by
a mile.

As far as deletes go, of course you know how bad HTree is at that. Sure,
HTree only needs to update a single block to remove an entry, but then
it does unspeakable things to the inode table that tend to cause serious
performance losses in not-so-rare corner cases. Shardmap definitely
fixes that.

For O_SYNC operation you have more of a point, however again I doubt
that one directory block versus two will move the needle, and if it did
to the point of somebody actually caring about it, we can easily finesse
away one or both of those updates using journal techniques. More likely,
nobody will ever notice or care about this single extra block per sync
commit.

> Anyway, let's wait and see how you and Hirofumi-san work out those
> details for Tux3

We need to discuss those details up front in order to avoid duplicating
work or heading off in costly design directions that you may reject
later. And who would dream of depriving the LKML viewing public of their
weekly kernel design discussion entertainment? Not me.

Important example: how is atomic directory commit going to work for
Ext4? What can we do in the immediate future to make that work easier
for Ext4 devs? And many other details. The atomic commit discussion
alone is essential, and is a long lead item as we have all
experienced.

Bottom line: let's keep talking, it's better, and there is much of
interest to discuss. Surely you would at least like to know what
happened to your suggestion back in New Orleans about how to track
free records in huge directories?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-11-28  2:54       ` Daniel Phillips
@ 2019-11-28  9:15         ` Andreas Dilger
  2019-11-28 10:03           ` Daniel Phillips
  0 siblings, 1 reply; 32+ messages in thread
From: Andreas Dilger @ 2019-11-28  9:15 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Viacheslav Dubeyko, Ext4 Developers List, linux-kernel,
	linux-fsdevel, Theodore Y. Ts'o, OGAWA Hirofumi,
	Darrick J. Wong

[-- Attachment #1: Type: text/plain, Size: 14763 bytes --]

On Nov 27, 2019, at 7:54 PM, Daniel Phillips <daniel@phunq.net> wrote:
> 
> On 2019-11-27 11:35 a.m., Viacheslav Dubeyko wrote:
>> So, it looks like that Shardmap could be better for the case of billion files in one folder.  But what’s about the regular case when it could be
>> dozens/hundreds of files in one folder? Will Shardmap be better than HTree?
> 
> Yes, Shardmap is also faster than HTree in that case. Both Shardmap and
> HTree are unindexed in that range, however Shardmap is faster because of
> two things:
> 
> 1) Shardmap uses a more efficient record block format that incorporates
> a single byte hash code that avoids 99% of failed string compares.
> 
> 2) Shardmap "pins" the current target block in which new entries are
> created, avoiding repeated radix tree lookups for insert under load.
> 
> As soon as you get into thousands of files, the difference between
> Shardmap and HTree widens dramatically so that Shardmap ends up faster
> by a factor of 2, 3, 4, etc as directory size increases. Not just
> faster than HTree, but faster than any tree based scheme, because of
> the O(1) / O(log N) equation.
> 
> Up in the millions and billions of files, HTree just drops out of the
> running, but if it were actually improved to operate in that range then

Actually, 3-level htree was implemented for ext4 several years ago, but
was held back because there was no e2fsck support for it.  That was
finished and the 3-level htree support was landed to ext4 in 2017 in
commit v4.12-rc2-15-ge08ac99.  In theory it could allow ~5B entries in
a single directory (the 2GB size limit was also removed at the same time).

The code change for this was relatively straightforward, but as you
wrote elsewhere the big problem is each htree insert/lookup/remove
operation degrades to random 4KB IOPS for every change (for the
directory leaf blocks on insert/remove, and for the itable blocks on
readdir), so has about 4096 / 64 = 64x write inflation or more.

A log-structured directory insert/remove feature is appealing to me
if it can avoid this overhead in practise.

> lookups would be at least 4 times slower due to index block probes, and
> heavy insert loads would be orders of magnitude slower due to write
> multiplication on commit. Of course I hear you when you say that you
> don't have any million file directories to worry about, but some folks
> do. (Any comment, Andreas?)

We regularly have directories in the 1M+ size, because users can easily
run many thousands of processes concurrently creating files in the same
directory.  The 2-level htree topped out around 10-12M entries, which was
hit occasionally.  At the same time, we also put in directory size limits
so that admins could *prevent* users from doing this, because it also can
cause problems for the user/admin when they need to process such large
directories ("ls -l" will of course never finish).

>> If the ordinary user hasn’t visible performance improvement then it makes
>> sense to consider Shardmap like the optional feature. What do you think?
> 
> I am confident that a significant number of users will perceive the
> performance improvement, and that few if any will perceive a slowdown for
> any reason short of an outright bug.
> 
>> Does it mean that Shardmap is ext4 oriented only? Could it be used for
>> another file systems?
> 
> Shardmap is not Ext4-only. In fact, Shardmap is firstly the directory
> index for Tux3, and I am now proposing that Shardmap should also be
> the new directory index for Ext4.
> 
> I also heard a suggestion that Shardmap could/should become a generic
> kernel library facility that could be used by any file system or
> kernel subsystem that requires a high performance persistent
> associative string mapping.
> 
> Shardmap is also well on its way to being released as a generic high
> performance KVS, including supporting persistent memory, a role it
> performs in a highly satisfactory way. There will be a post about
> this later, but for today, a spoiler: atomic, durable KVS database
> transactions at a rate in excess of three per microsecond(!)
> 
>>> See the recommendation that is sometimes offered to work around
>>> HTree's issues with processing files in hash order. Basically, read
>>> the entire directory into memory, sort by inode number, then process
>>> in that order. As an application writer do you really want to do this,
>>> or would you prefer that the filesystem just take care of for you so
>>> the normal, simple and readable code is also the most efficient code?
>> 
>> I slightly missed the point here. To read the whole directory sounds like
>> to read the dentries tree from the volume. As far as I can see, the
>> dentries are ordered by names or by hashes. But if we are talking about
>> inode number then we mean the inodes tree. So, I have misunderstanding
>> here. What do you mean?
> 
> It's a bit of a horror show. Ted is the expert on it, I was largely a
> bystander at the time this was implemented. Basically, the issue is that
> HTree is a hash-ordered BTree (the H in HTree) and the only way to
> traverse it for readdir that can possibly satisfy POSIX requirements is
> by hash order. If you try to traverse in linear block order then a
> simultaneous insert could split a block and move some entries to other
> blocks, which then may be returned by readdir twice or not at all. So
> hash order traversal is necessary, but this is not easy because hashes
> can collide. So what Ted did is, create a temporary structure that
> persists for some period of time, to utilize a higher resolution hash
> code which is used to resolve collisions and provide telldir cookies
> for readdir. Basically. This is even more tricky than it sounds for
> various horrifying reasons.
> 
> If you want the whole story you will have to ask Ted. Suffice to say that
> it takes a special kind of mind to even conceive of such a mechanism, let
> alone get it working so well that we have not seen a single complaint
> about it for years. However, this code is by any standard, scary and only
> marginally maintainable. It further seems likely that a sufficiently
> determined person could construct a case where it fails, though I cannot
> swear to that one way or the other.
> 
> Why did we go to all this effort as opposed to just introducing some
> additional ordering metadata as XFS does? Because HTree is faster
> without that additional metadata to maintain, and more compact. The
> user perceives this; the user does not perceive the pain suffered by
> implementors to get this working, nor does the user need to confront
> the full horror of the source code. The user cares only about how it
> works, and it does work pretty darn well. That said, deprecating this
> code will still be immensely satisfying from where I sit. It is more
> than likely that Ted shares the same view, though I certainly cannot
> speak for him.
> 
> In summary, we should all just be happy that this readdir hack worked
> well enough over the last 15 years or so to run the world's internet
> and phones so reliably. Now let's retire it please, and move on to
> something with a sounder design basis, and which is far easier to
> understand and maintain, and runs faster to boot. Now, with Shardmap,
> readdir runs at media transfer speed, with near zero cpu, unlike
> HTree.
> 
>>>> If you are talking about improving the performance then do you mean
>>>> some special open-source implementation?
>>> 
>>> I mean the same kind of kernel filesystem implementation that HTree
>>> currently has. Open source of course, GPLv2 to be specific.
>> 
>> I meant the Shardmap implementation. As far as I can see, the
>> user-space implementation is available only now. So, my question is
>> still here. It’s hard to say how efficient the Shardmap could be on
>> kernel side as ext4 subsystem, for example.
> 
> That is actually quite easy to predict. All of our benchmarking so far
> has been with user space Shardmap running on top of Ext4, so we already
> have a pretty accurate picture of the overheads involved. That said,
> there will be two main differences between the user space code and the
> kernel code:
> 
>   1) We don't have virtual memory in kernel in any practical form, so
>   we need to simulate it with explicit lookups in a vector of page
>   pointers, costing a tiny but likely measurable amount of cpu compared
>   to the hardware implementation that user space enjoys.
> 
>   2) We don't know the overhead of atomic commit for Ext4 yet. We do
>   have a pretty good picture for Tux3: near zero. And we have a very
>   clear picture of the atomic commit overhead for persistent memory,
>   which is nonzero but much less than annoying. So atomic commit
>   overhead for Ext4 should be similar. This is really where the skill
>   of Ext4 developers kicks in, and of course I expect great things
>   in that regard, as has been the case consistently to date.
> 
>>>>> For delete, Shardmap avoids write multiplication by appending tombstone
>>>>> entries to index shards, thereby addressing a well known HTree delete
>>>>> performance issue.
>>>> 
>>>> Do you mean Copy-On-Write policy here or some special technique?
>>> 
>>> The technique Shardmap uses to reduce write amplification under heavy
>>> load is somewhat similar to the technique used by Google's Bigtable to
>>> achieve a similar result for data files. (However, the resemblance to
>>> Bigtable ends there.)
>>> 
>>> Each update to a Shardmap index is done twice: once in a highly
>>> optimized hash table shard in cache, then again by appending an
>>> entry to the tail of the shard's media "fifo". Media writes are
>>> therefore mostly linear. I say mostly, because if there is a large
>>> number of shards then a single commit may need to update the tail
>>> block of each one, which still adds up to vastly fewer blocks than
>>> the BTree case, where it is easy to construct cases where every
>>> index block must be updated on every commit, a nasty example of
>>> n**2 performance overhead.
>> 
>> It sounds like adding updates in log-structured manner. But what’s about
>> the obsolete/invalid blocks? Does it mean that it need to use some GC
>> here? I am not completely sure that it could be beneficial for the ext4.
> 
> This is vaguely similar to log structured updating, but then again, it
> is more different than similar. A better term might be "streaming
> updates". This is a popular theme of modern file system and database
> design, and the basis of many recent performance breakthroughs.
> 
> Shardmap never garbage collects. Instead, when a shard fifo gets too
> many tombstones, it is just compacted by writing out the entire cache
> shard on top of the old, excessively fluffy shard fifo. This is both
> efficient and rare, for various reasons that require detailed analysis
> of the data structures involved. I will get to that eventually, but now
> is probably not the best time. The source code makes it clear.
> 
>> By the way, could the old index blocks be used like the snapshots in the
>> case of corruptions or other nasty issues?
> 
> My feeling is, that is not a natural fit. However, rescuing Shardmap from
> index corruption is easy: just throw away the entire index and construct
> a new one by walking the entry record blocks. Maybe there are cute ways
> to make that incremental, but the simplest easiest way should actually be
> enough for the long term.
> 
>>>> Let's imagine that it needs to implement the Shardmap approach. Could
>>>> you estimate the implementation and stabilization time? How expensive
>>>> and long-term efforts could it be?
>>> 
>>> Shardmap is already implemented and stable, though it does need wider
>>> usage and testing. Code is available here:
>>> 
>>>   https://github.com/danielbot/Shardmap
>>> 
>>> Shardmap needs to be ported to kernel, already planned and in progress
>>> for Tux3. Now I am proposing that the Ext4 team should consider porting
>>> Shardmap to Ext4, or at least enter into a serious discussion of the
>>> logistics.
>>> 
>>> Added Darrick to cc, as he is already fairly familiar with this subject,
>>> once was an Ext4 developer, and perhaps still is should the need arise.
>>> By the way, is there a reason that Ted's MIT address bounced on my
>>> original post?
>> 
>> It’s hard to talk about stability because we haven’t kernel-side implementation
>> of Shardmap for ext4. I suppose that it needs to spend about a year for the
>> porting and twice more time for the stabilization.
> 
> Agreed, my best guess is roughly similar.
> 
>> To port a user-space code to the kernel could be the tricky task.
> 
> Sometimes true, however not in this case. Shardmap is broadly similar to
> other code we have ported from user space to kernel in the past, with the
> two exceptions I noted above. All just part of a kernel hacker's normal day
> in my humble opinion.
> 
>> Could you estimate how many lines of code the core
>> part of Shardmap contains?
> 
> The finished Ext4 code should be somewhere between 2k and 3k lines, about
> the same as HTree.
> 
>> Does it need to change the ext4 on-disk layout for this feature?
> 
> No, this new form of index would be a mount option, and only used for
> new directories. The HTree code will necessarily remain part of Ext4
> forever, for compatibility with existing volumes. By "deprecate HTree"
> I meant, eventually make the Shardmap index the default after it has
> gone through its stabilization period of who knows how long? Three
> years? Five? It's hard to know the future in that regard.
> 
> This would be roughly similar to the transition we did from unindexed
> to HTree index some 18 years ago. (Deja vu all over again!) Last time it
> went smoothly and this time we additionally have the benefit of having
> done it before.
> 
>> How easy ext4 functionality can be modified for Shardmap support?
> 
> From the user's point of view, completely trivial. Initially just a mount
> option, and later, no action at all. From the sysadmin's point of view,
> something new to learn about, some new procedures in case things go wrong,
> but essentially all in a day's work. From the developer's point of view,
> one of the easier major hacks that one could contemplate, I expect.
> Technical risk is nearly nil because Shardmap is already quite
> mature, being seven years old as it is, and having had the benefit of
> considerable earlier design experience.
> 
> Regards,
> 
> Daniel


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-11-28  9:15         ` Andreas Dilger
@ 2019-11-28 10:03           ` Daniel Phillips
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel Phillips @ 2019-11-28 10:03 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Viacheslav Dubeyko, Ext4 Developers List, linux-kernel,
	linux-fsdevel, Theodore Y. Ts'o, OGAWA Hirofumi,
	Darrick J. Wong



On 2019-11-28 1:15 a.m., Andreas Dilger wrote:
> On Nov 27, 2019, at 7:54 PM, Daniel Phillips <daniel@phunq.net> wrote:
>>
>> On 2019-11-27 11:35 a.m., Viacheslav Dubeyko wrote:
>>> So, it looks like that Shardmap could be better for the case of billion files in one folder.  But what’s about the regular case when it could be
>>> dozens/hundreds of files in one folder? Will Shardmap be better than HTree?
>>
>> Yes, Shardmap is also faster than HTree in that case. Both Shardmap and
>> HTree are unindexed in that range, however Shardmap is faster because of
>> two things:
>>
>> 1) Shardmap uses a more efficient record block format that incorporates
>> a single byte hash code that avoids 99% of failed string compares.
>>
>> 2) Shardmap "pins" the current target block in which new entries are
>> created, avoiding repeated radix tree lookups for insert under load.
>>
>> As soon as you get into thousands of files, the difference between
>> Shardmap and HTree widens dramatically so that Shardmap ends up faster
>> by a factor of 2, 3, 4, etc as directory size increases. Not just
>> faster than HTree, but faster than any tree based scheme, because of
>> the O(1) / O(log N) equation.
>>
>> Up in the millions and billions of files, HTree just drops out of the
>> running, but if it were actually improved to operate in that range then
> 
> Actually, 3-level htree was implemented for ext4 several years ago, but
> was held back because there was no e2fsck support for it.  That was
> finished and the 3-level htree support was landed to ext4 in 2017 in
> commit v4.12-rc2-15-ge08ac99.  In theory it could allow ~5B entries in
> a single directory (the 2GB size limit was also removed at the same time).
> 
> The code change for this was relatively straight forward,

Let's clarify: straightforward for you. The fact that HTree sat for 15
years with my lame, coded-in-a-day two level scheme suggests that this
would not be straightforward for everyone. In my defence, I plead that
at the time I regarded 10 million files as essentially infinite. This
naturally makes me think of "640K ought to be enough for anyone".

Some day somebody will have a use case for billion file directories.
Until that day comes, I am ok with regarding this capability as a mere
stunt, or proof of scalability. Scaling 2 orders of magnitude higher is
just one of the improvements Shardmap offers, and in some sense, the
least important of them.

> but as you
> wrote elsewhere the big problem is each htree insert/lookup/remove
> operation degrades to random 4KB IOPS for every change (for the
> directory leaf blocks on insert/remove, and for the itable blocks on
> readdir), so has about 4096 / 64 = 64x write inflation or more.

Right. Shardmap linearizes pretty much everything, which seems to be
a necessity for achieving optimal performance with the lowest cache
pressure, particularly on machines that are only marginally capable
of the load they are attempting.

> A log-structured directory insert/remove feature is appealing to me
> if it can avoid this overhead in practise.

The main problem with pure log structured updates is reading back the
log, which is randomly interleaved if written out in strict creation
order. This is the issue that LSM (as used by RocksDB and friends)
attempts to address:

   https://en.wikipedia.org/wiki/Log-structured_merge-tree

Unfortunately, these solutions tend to be complex and top heavy on the
read side. In practice, Shardmap won solidly in all of our insert,
lookup and delete benchmarks. I am not sure exactly why, but the LSM
approach seems to be doing a lot of fancy work on every operation.

Shardmap uses a far simpler update technique: each new index entry is
simply appended to the end of the appropriate shard. The fewer the
shards, the more efficient that is. However, even when there are 1K
or up to 8K shards as we use for our 1 billion file benchmark, this
algorithm still works well in practice. The key trick is, include
enough updates in the commit so that each shard receives enough to
reduce the tail block write multiplication to a dull roar.
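
In outline, the per-commit index flush looks something like this (a
sketch with invented names):

    /* Sketch only: per-commit index flush. New entries created during
     * the commit are grouped per shard and appended to that shard's
     * fifo tail block, so each dirty shard costs roughly one block
     * write per commit. Names are invented for illustration. */
    void shardmap_flush(struct shardmap *map)
    {
        for (unsigned i = 0; i < map->nshards; i++) {
            struct shard *shard = &map->shard[i];

            if (!shard->pending)
                continue;
            fifo_append_tail(shard, shard->newent, shard->pending);
            shard->pending = 0;
        }
    }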

On my not very spectacular Ryzen 7 workstation backed by SSD, we
create 1 billion directory entries in a bit less than 12 minutes. Is
this fast enough? On spinning rust it takes two or three minutes
longer. Clearly Ext4 plus the block layer are doing a nice job of
minimizing seeks for this load.

>> lookups would be at least 4 times slower due to index block probes, and
>> heavy insert loads would be orders of magnitude slower due to write
>> multiplication on commit. Of course I hear you when you say that you
>> don't have any million file directories to worry about, but some folks
>> do. (Any comment, Andreas?)
> 
> We regularly have directories in the 1M+ size, because users can easily
> run many thousands of processes concurrently creating files in the same
> directory.  The 2-level htree topped out around 10-12M entries, which was
> hit occasionally.  At the same time, we also put in directory size limits
> so that admins could *prevent* users from doing this, because it also can
> cause problems for the user/admin when they need to process such large
> directories ("ls -l" will of course never finish).

OK, this confirms my impression of the sort of loads you feed to HTree.
In this range, I promise that you will see a solid speedup from Shardmap
for the directory ops alone, plus far better cache coherency with respect
to the inode table, and of course, a readdir that does not verge on
brilliant insanity.

Thanks for the comments Andreas, and I hope it is clear that I am also
offering Shardmap for Lustre. Not that you would even need to ask.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-11-28  2:28     ` Theodore Y. Ts'o
  2019-11-28  4:27       ` Daniel Phillips
@ 2019-11-28 21:17       ` Daniel Phillips
  2019-12-08 10:25       ` Daniel Phillips
  2 siblings, 0 replies; 32+ messages in thread
From: Daniel Phillips @ 2019-11-28 21:17 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On 2019-11-27 6:28 p.m., Theodore Y. Ts'o wrote:
> The use of C++ with templates is presumably one of the "less so"
> parts, and it was that which I had in mind when I said,
> "reimplementing from scratch".

Ah, duopack and tripack. Please consider the C code that was replaced
by those. It had tons of bogus extra parameters, resulting in nonsense
like this:

   set_entry(shard_entry(shard, bucket), key & bitmask(shard->lowbits), loc, next, shard->locbits, nextbits);

which in the C++ version looks like:

   *entry = {trio.pack(next, loc, key & bitmask(lowbits))};

They generate roughly identical machine code, but I know which one I
prefer to maintain. That said, porting back to C (in progress right
now) includes substituting some appropriate macro code for those sweet
little variable field width structure templates, which by the way are
central to Shardmap's scalability and performance: they are what allow
the index entry to stay at just 8 bytes over the entire range from one
entry to one billion.

So right, tripack and duopack will be reimplemented from scratch, using
the template code as a model. Basically just expanding the templates by
hand and adding in some macro sugar for clarity. Shardmap itself does not
need to be rewritten from scratch in order to port to kernel, far from it.
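
For the curious, the hand expansion will come out roughly like this
(illustrative names, argument order and field order, not the final
code):

    /* Sketch only: a tripack expanded by hand into plain C, with field
     * widths carried in ordinary variables instead of template
     * parameters. */
    #define bitmask(bits) ((1ULL << (bits)) - 1)

    static inline u64 trio_pack(unsigned nextbits, unsigned locbits, unsigned hashbits,
                                u64 next, u64 loc, u64 hash)
    {
        return ((next & bitmask(nextbits)) << (locbits + hashbits)) |
               ((loc & bitmask(locbits)) << hashbits) |
               (hash & bitmask(hashbits));
    }

    static inline u64 trio_hash(u64 entry, unsigned hashbits)
    {
        return entry & bitmask(hashbits);
    }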

Regards,

Daniel








^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-11-28  4:27       ` Daniel Phillips
@ 2019-11-30 17:50         ` Theodore Y. Ts'o
  2019-12-01  8:21           ` Daniel Phillips
  0 siblings, 1 reply; 32+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-30 17:50 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On Wed, Nov 27, 2019 at 08:27:59PM -0800, Daniel Phillips wrote:
> You are right that Shardmap also must update the shard fifo tail block,
> however there is only one index shard up to 64K entries, so all the new
> index entries go into the same tail block(s).

So how big is an index shard?  If it is 64k entries, and each entry is
16 bytes (8 bytes hash, 8 bytes block number), then a shard is a
megabyte, right?  Are entries in an index shard stored in sorted or
unsorted manner?  If they are stored in an unsorted manner, then when
trying to do a lookup, you need to search all of the index shard ---
which means for a directory that is being frequently accessed, the
entire index shard has to be kept in memory, no?  (Or paged in as
necessary, if you are using mmap in userspace).

> Important example: how is atomic directory commit going to work for
> Ext4?

The same way all metadata updates work in ext4.  Which is to say, you
need to declare the maximum number of 4k metadata blocks that an
operation might need to change when calling ext4_journal_start() to
create a handle; and before modifying a 4k block, you need to call
ext4_journal_get_write_access(), passing in the handle and the block's
buffer_head.  After modifying the block, you must call
ext4_handle_dirty_metadata() on the buffer_head.  And when you are
doing with the changes in an atomic metadata operation, you call
ext4_journal_stop() on the handle.

This hasn't changed since the days of ext3 and htree.
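
Schematically, for a single 4k directory block already read into bh
(the credits calculation and error unwinding are elided):

    handle_t *handle = ext4_journal_start(dir, EXT4_HT_DIR, credits);
    int err;

    if (IS_ERR(handle))
        return PTR_ERR(handle);
    err = ext4_journal_get_write_access(handle, bh);
    if (!err) {
        /* ... modify the 4k block through bh->b_data ... */
        err = ext4_handle_dirty_metadata(handle, dir, bh);
    }
    ext4_journal_stop(handle);
    return err;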

     	    	    	      	      	   - Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-11-30 17:50         ` Theodore Y. Ts'o
@ 2019-12-01  8:21           ` Daniel Phillips
  2019-12-04 18:31             ` Andreas Dilger
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Phillips @ 2019-12-01  8:21 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi



On 2019-11-30 9:50 a.m., Theodore Y. Ts'o wrote:
> On Wed, Nov 27, 2019 at 08:27:59PM -0800, Daniel Phillips wrote:
>> You are right that Shardmap also must update the shard fifo tail block,
>> however there is only one index shard up to 64K entries, so all the new
>> index entries go into the same tail block(s).
> 
> So how big is an index shard?  If it is 64k entries, and each entry is
> 16 bytes (8 bytes hash, 8 bytes block number), then a shard is a
> megabyte, right?  Are entries in an index shard stored in sorted or
> unsorted manner?  If they are stored in an unsorted manner, then when
> trying to do a lookup, you need to search all of the index shard ---
> which means for a directory that is being frequently accessed, the
> entire index shard has to be kept in memory, no?  (Or paged in as
> necessary, if you are using mmap in userspace).

Exactly correct, except that in cache a shard is a hash table, while
on media it is just an unordered collection that is entered into hash
buckets at shard load time.

This is actually the main defining characteristic of Shardmap, both
giving rise to the theoretical read multiplication issue alluded to above
and on the positive side, acting as a kind of cache read ahead due to
the coarse demand read granularity. In other words, each time we hit
a not present shard, we read multiple blocks of the given shard into
cache all at once, instead of loading blocks piecemeal with lots of
inefficient little reads. This is especially beneficial for spinning
disk, which we Tux3 devs still worry about, and I would think, you
also. Paraphrasing the immortal bard, "it's not dead yet".

Shard size is tunable at directory creation time. A shard entry is
actually 8 bytes, not 16, because block number can initially be very
small, just 10 bits by default, leaving 53 bits for the hash and one
bit to indicate tombstone or not. As the directory size increases,
the block number field size increases to accommodate more record
blocks and the hash field size decreases; however, the number of
shards increases at the same rate (more or less linearly, enforced by
logic) so that, together with the hash bits implied by the shard
number, the hash resolution stays constant. Isn't that nice?
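
To make that concrete, the 8 byte entry amounts to something like the
following; the exact bit placement here is only my illustration, the
code is authoritative:

    /* Sketch: composing one 8 byte media shard entry. With the default
     * 10 block bits there are 53 hash bits plus the tombstone bit; as
     * blockbits grows, hashbits shrinks while the shard count grows to
     * keep total hash resolution constant. */
    static inline u64 make_entry(u64 hash, u64 block, unsigned blockbits, int dead)
    {
        unsigned hashbits = 63 - blockbits;

        return ((u64)!!dead << 63) |
               ((hash & ((1ULL << hashbits) - 1)) << blockbits) |
               (block & ((1ULL << blockbits) - 1));
    }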

The importance of hash resolution is that, at high scale any hash
collision within a bucket must be resolved by accessing the record
block and resolving it there. So high collision rate corresponds to
significant slowdown in operations, getting worse linearly as the
directory expands. This is N**2 behavior in the sense that the time
to perform N operations increases as N**2 (IOW our usual definition
of file system performance.) It turns out that 53 hash bits are
sufficient to hold the collision rate to a few tens in one billion
inserts, insignificant at that scale, even more so at typical scale.

The question of cache footprint is indeed central, as you imply. 8
bytes per entry cuts the cache footprint in half, so that is nice.
But would it be better if we could hold only the pieces of index
in cache that we actually need? This question is far more subtle
than it first seems. Here is a possibly surprising mathematical
fact: when the number of accesses is similar to the number of cache
blocks, the cache footprint of any randomly accessed index is the
entire cache. This is entirely independent of the index algorithm in
use: you see exactly the same behavior with BTrees. The best and
possibly only remedy is to make the index as compact as possible,
hence the impact of 8 byte vs 16 byte entries.

This highlights another significant advantage that Shardmap has
over HTree: HTree embeds its entries directly in the index while
Shardmap separates them out into traditional record entry blocks.
The HTree strategy does save significant CPU by avoiding one
block level dereference; however, as mentioned earlier, this merely
allows HTree to tie Shardmap in block accesses at the lowest
index scale, because Shardmap does one of its accesses into a
cache object, thus avoiding radix tree overhead. The cost of
HTree's strategy at high scale, or with a large number of
directories open, is large, a factor of 2 or more greater cache
pressure depending on average name size.

So Shardmap is significantly more cache friendly than HTree, however,
as you deduced, if cache thrashing does happen then Shardmap with
shard size in the 256K to 1 Mbyte range might have to read a hundred
times as many blocks to reload an evicted shard than HTree does to
load a single btree block. On the other hand, the thrashing knee 
occurs with 3-5 times less cache for Shardmap than HTree, so which
one wins here? I say Shardmap, because Shardmap does more with less
cache. If you are thrashing that badly then your machine must be
grossly misconfigured for its load.

Now suppose you wanted to fix this theoretical read multiplication,
then how? An emerging technique aimed at precisely the kind of dual
format caching scheme that Shardmap uses has been dubbed "anticache".
Instead of evicting and reloading the cache object, page the cache
to swap, or any available backing store (e.g., to the file system
volume itself). Then the cache object can be reloaded at your choice
of granularity, for example, 4K pages, loaded in via the hash table
walk algorithm. This will be one or more steps: 1) look up hash chain
head in table possibly loading a page; 2+) if entry not already found
then look up in next chain entry, possibly loading a page.
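
In sketch form, with hypothetical helpers that fault in 4K pages of the
paged-out shard as needed:

    /* Sketch only: anticache-style lookup against a paged-out cache
     * shard. The cache shard is a hash table whose buckets hold chains
     * linked by "next" indexes, as described earlier; all helper names
     * here are hypothetical. */
    u64 anticache_lookup(struct shard *shard, u64 hash)
    {
        unsigned bucket = hash & bitmask(shard->tablebits);
        u32 at = load_bucket_head(shard, bucket);          /* step 1: may load a page */

        while (at != NOENT) {
            struct shard_entry entry = load_entry(shard, at); /* step 2+: may load a page */

            if (entry_hash(entry) == hash)
                return entry_loc(entry);                   /* candidate record block */
            at = entry_next(entry);
        }
        return NOENT;
    }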

The thing is, I can't imagine any Linux configuration that could hit
this, short of an artificial construction. Take the case of a 1M file
directory. The shardmap media fifos for that will be 8MB, there will
be 16 of them (depending on tuning) and each cache shard will be
somewhere around 640K or 160 pages per shard for a total of 1280
cache pages, or 5 MB cache footprint. If this is going to put your
system into thrash then I submit that the real problem is not
Shardmap.

So this theoretical read multiplication issue is essentially just a
question of how fast can we thrash. If somebody does come up with a
valid use case where we need to thrash faster than now, we can always
implement anticache or something similar, and maybe make that a
generic facility while at it, because again, the real problem is not
Shardmap, it is somebody feeding in an inappropriate load for their
machine configuration.

>> Important example: how is atomic directory commit going to work for
>> Ext4?
> 
> The same way all metadata updates work in ext4.  Which is to say, you
> need to declare the maximum number of 4k metadata blocks that an
> operation might need to change when calling ext4_journal_start() to
> create a handle; and before modifying a 4k block, you need to call
> ext4_journal_get_write_access(), passing in the handle and the block's
> buffer_head.  After modifying the block, you must call
> ext4_handle_dirty_metadata() on the buffer_head.  And when you are
> doing with the changes in an atomic metadata operation, you call
> ext4_journal_stop() on the handle.
> 
> This hasn't changed since the days of ext3 and htree.

OK good. And I presume that directory updates are prevented until
the journal transaction is at least fully written to buffers. Maybe
delayed until the journal transaction is actually committed?

In Tux3 we don't block directory updates during backend commit, and I
just assumed that Ext4 and others also do that now, so thanks for the
correction. As far I can see, there will be no new issue with Shardmap,
as you say. My current plan is that user space mmap will become kmap in
kernel. I am starting on this part for Tux3 right now. My goal is to
refine the current Shardmap data access api to hide the fact that mmap
is used in user space but kmap in kernel. Again, I wish we actually had
mmap in kernel and maybe we should consider properly supporting it in
the future, perhaps by improving kmalloc.

One thing we do a bit differently from our traditional fs is, in the
common, unfragmented case, mass inserts go into the same block until
the block is full. So we get a significant speedup by avoiding a page
cache lookup and kmap per insert. Borrowing a bit of mechanism from
the persistent memory version of Shardmap, we create the new entries
in a separate cache page. Then, on commit, copy this "front buffer" to
the page cache. I think that will translate pretty well to Ext4 also.
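
In sketch form, the commit side of that front buffer might look roughly
like this in kernel terms (structure, names and error handling are
illustrative only):

    /* Sketch only: entries were created in a private buffer; at commit
     * time, copy them into the page cache block they belong to. */
    struct front {
        void *buf;          /* private buffer, not in the page cache */
        pgoff_t index;      /* directory block the entries land in */
        unsigned bytes;     /* bytes of new entries accumulated */
    };

    static int front_commit(struct inode *dir, struct front *front)
    {
        struct page *page = grab_cache_page(dir->i_mapping, front->index);
        void *kaddr;

        if (!page)
            return -ENOMEM;
        kaddr = kmap_atomic(page);
        memcpy(kaddr, front->buf, front->bytes);
        kunmap_atomic(kaddr);
        set_page_dirty(page);
        unlock_page(page);
        put_page(page);
        return 0;
    }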

Regards,

Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-11-27 14:25 ` Theodore Y. Ts'o
  2019-11-27 22:27   ` Daniel Phillips
@ 2019-12-02  1:45   ` Daniel Phillips
  2019-12-04 15:55     ` Vyacheslav Dubeyko
  2019-12-04 18:03     ` [RFC] Thing 1: Shardmap fox Ext4 Andreas Dilger
  1 sibling, 2 replies; 32+ messages in thread
From: Daniel Phillips @ 2019-12-02  1:45 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote:
> (3) It's not particularly well documented...

We regard that as an issue needing attention. Here is a pretty picture
to get started:

   https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format

This needs some explaining. The bottom part of the directory file is
a simple linear range of directory blocks, with a freespace map block
appearing once every 4K blocks or so. This freespace mapping needs a
post of its own, it is somewhat subtle. This will be a couple of posts
in the future.

The Shardmap index appears at a higher logical address, sufficiently
far above the directory base to accommodate a reasonable number of
record entry blocks below it. We try not to place the index at so high
an address that the radix tree gets extra levels, slowing everything
down.

When the index needs to be expanded, either because some shard exceeded
a threshold number of entries, or the record entry blocks ran into
the bottom of the index, then a new index tier with more shards is
created at a higher logical address. The lower index tier is not copied
immediately to the upper tier, but rather, each shard is incrementally
split when it hits the threshold because of an insert. This bounds the
latency of any given insert to the time needed to split one shard, which
we target nominally at less than one millisecond. Thus, Shardmap takes a
modest step in the direction of real time response.
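
In pseudocode, the insert path that triggers the incremental split
looks roughly like this (invented names):

    /* Sketch only: bounded-latency expansion. If the target shard
     * still belongs to the lower tier and is over its threshold, split
     * just that one shard into the upper tier before inserting. */
    int shardmap_insert(struct shardmap *map, u64 hash, u64 loc)
    {
        struct shard *shard = shard_for_hash(map, hash);

        if (shard->tier == map->lower_tier && shard->count >= map->threshold)
            shard = split_shard(map, shard, hash);   /* nominally < 1 ms */
        return shard_insert(shard, hash, loc);
    }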

Each index tier is just a simple array of shards, each of which fills
up with 8 byte entries from bottom to top. The count of entries in each
shard is stored separately in a table just below the shard array. So at
shard load time, we can determine rapidly from the count table which
tier a given shard belongs to. There are other advantages to breaking
the shard counts out separately having to do with the persistent memory
version of Shardmap, interesting details that I will leave for later.

When all lower tier shards have been deleted, the lower tier may be
overwritten by the expanding record entry block region. In practice,
a Shardmap file normally has just one tier most of the time, the other
tier existing only long enough to complete the incremental expansion
of the shard table, insert by insert.

There is a small header in the lowest record entry block, giving the
positions of the one or two index tiers, count of entry blocks, and
various tuning parameters such as maximum shard size and average depth
of cache hash collision lists.
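
As a rough illustration only (field names and widths are mine; the
source and the wiki diagram are authoritative), the header carries
something like:

    /* Sketch only: roughly what the block 0 header holds. */
    struct shardmap_header {
        __le64 lower_tier_at;   /* logical address of lower index tier */
        __le64 upper_tier_at;   /* upper tier, zero when only one tier */
        __le32 entry_blocks;    /* count of record entry blocks */
        __le32 max_shard;       /* tuning: threshold entries per shard */
        __le32 avg_depth;       /* tuning: average collision list depth */
    };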

That is it for media format. Very simple, is it not? My next post
will explain the Shardmap directory block format, with a focus on
deficiencies of the traditional Ext2 format that were addressed.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-02  1:45   ` Daniel Phillips
@ 2019-12-04 15:55     ` Vyacheslav Dubeyko
  2019-12-05  9:46       ` Daniel Phillips
  2019-12-04 18:03     ` [RFC] Thing 1: Shardmap fox Ext4 Andreas Dilger
  1 sibling, 1 reply; 32+ messages in thread
From: Vyacheslav Dubeyko @ 2019-12-04 15:55 UTC (permalink / raw)
  To: Daniel Phillips, Theodore Y. Ts'o
  Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On Sun, 2019-12-01 at 17:45 -0800, Daniel Phillips wrote:
> On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote:
> > (3) It's not particularly well documented...
> 
> We regard that as an issue needing attention. Here is a pretty
> picture
> to get started:
> 
>    https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format
> 
> This needs some explaining. The bottom part of the directory file is
> a simple linear range of directory blocks, with a freespace map block
> appearing once every 4K blocks or so. This freespace mapping needs a
> post of its own, it is somewhat subtle. This will be a couple of
> posts
> in the future.
> 
> The Shardmap index appears at a higher logical address, sufficiently
> far above the directory base to accommodate a reasonable number of
> record entry blocks below it. We try not to place the index at so
> high
> an address that the radix tree gets extra levels, slowing everything
> down.
> 
> When the index needs to be expanded, either because some shard
> exceeded
> a threshold number of entries, or the record entry blocks ran into
> the bottom of the index, then a new index tier with more shards is
> created at a higher logical address. The lower index tier is not
> copied
> immediately to the upper tier, but rather, each shard is
> incrementally
> split when it hits the threshold because of an insert. This bounds
> the
> latency of any given insert to the time needed to split one shard,
> which
> we target nominally at less than one millisecond. Thus, Shardmap
> takes a
> modest step in the direction of real time response.
> 
> Each index tier is just a simple array of shards, each of which fills
> up with 8 byte entries from bottom to top. The count of entries in
> each
> shard is stored separately in a table just below the shard array. So
> at
> shard load time, we can determine rapidly from the count table which
> tier a given shard belongs to. There are other advantages to breaking
> the shard counts out separately having to do with the persistent
> memory
> version of Shardmap, interesting details that I will leave for later.
> 
> When all lower tier shards have been deleted, the lower tier may be
> overwritten by the expanding record entry block region. In practice,
> a Shardmap file normally has just one tier most of the time, the
> other
> tier existing only long enough to complete the incremental expansion
> of the shard table, insert by insert.
> 
> There is a small header in the lowest record entry block, giving the
> positions of the one or two index tiers, count of entry blocks, and
> various tuning parameters such as maximum shard size and average
> depth
> of cache hash collision lists.
> 
> That is it for media format. Very simple, is it not? My next post
> will explain the Shardmap directory block format, with a focus on
> deficiencies of the traditional Ext2 format that were addressed.
> 

I've tried to take a look into the source code, and it was not an easy
read. :) I expected to get a bird's-eye understanding from the shardmap.h
file. My expectation was to find the initial set of structure
declarations with good comments. But, frankly speaking, it's a very
complicated path to understanding the concept. Even from a C++ point of
view, the class declarations look very complicated when field and method
declarations are mixed together. It's tough to read such an
implementation.

So, I believe it makes sense to declare the necessary set of structures
at the beginning of the file, with good comments. It would also be good
to split the structure declarations and the methods into different
files. I believe this would ease the way to understanding the concept.
Otherwise, it will be tough to review such code.

Thanks,
Viacheslav Dubeyko.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-02  1:45   ` Daniel Phillips
  2019-12-04 15:55     ` Vyacheslav Dubeyko
@ 2019-12-04 18:03     ` Andreas Dilger
  2019-12-04 20:47       ` Daniel Phillips
  1 sibling, 1 reply; 32+ messages in thread
From: Andreas Dilger @ 2019-12-04 18:03 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Theodore Y. Ts'o, linux-ext4, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi

[-- Attachment #1: Type: text/plain, Size: 3080 bytes --]

On Dec 1, 2019, at 6:45 PM, Daniel Phillips <daniel@phunq.net> wrote:
> 
> On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote:
>> (3) It's not particularly well documented...
> 
> We regard that as an issue needing attention. Here is a pretty picture
> to get started:
> 
>   https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format

The shardmap diagram is good conceptually, but it would be useful
to add a legend on the empty side of the diagram that shows the on-disk
structures.

> 
> This needs some explaining. The bottom part of the directory file is
> a simple linear range of directory blocks, with a freespace map block
> appearing once every 4K blocks or so. This freespace mapping needs a
> post of its own, it is somewhat subtle. This will be a couple of posts
> in the future.
> 
> The Shardmap index appears at a higher logical address, sufficiently
> far above the directory base to accommodate a reasonable number of
> record entry blocks below it. We try not to place the index at so high
> an address that the radix tree gets extra levels, slowing everything
> down.
> 
> When the index needs to be expanded, either because some shard exceeded
> a threshold number of entries, or the record entry blocks ran into
> the bottom of the index, then a new index tier with more shards is
> created at a higher logical address. The lower index tier is not copied
> immediately to the upper tier, but rather, each shard is incrementally
> split when it hits the threshold because of an insert. This bounds the
> latency of any given insert to the time needed to split one shard, which
> we target nominally at less than one millisecond. Thus, Shardmap takes a
> modest step in the direction of real time response.
> 
> Each index tier is just a simple array of shards, each of which fills
> up with 8 byte entries from bottom to top. The count of entries in each
> shard is stored separately in a table just below the shard array. So at
> shard load time, we can determine rapidly from the count table which
> tier a given shard belongs to. There are other advantages to breaking
> the shard counts out separately having to do with the persistent memory
> version of Shardmap, interesting details that I will leave for later.
> 
> When all lower tier shards have been deleted, the lower tier may be
> overwritten by the expanding record entry block region. In practice,
> a Shardmap file normally has just one tier most of the time, the other
> tier existing only long enough to complete the incremental expansion
> of the shard table, insert by insert.
> 
> There is a small header in the lowest record entry block, giving the
> positions of the one or two index tiers, count of entry blocks, and
> various tuning parameters such as maximum shard size and average depth
> of cache hash collision lists.
> 
> That is it for media format. Very simple, is it not? My next post
> will explain the Shardmap directory block format, with a focus on
> deficiencies of the traditional Ext2 format that were addressed.
> 
> Regards,
> 
> Daniel


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-01  8:21           ` Daniel Phillips
@ 2019-12-04 18:31             ` Andreas Dilger
  2019-12-04 21:44               ` Daniel Phillips
  2019-12-04 23:41               ` [RFC] Thing 1: Shardmap fox Ext4 Theodore Y. Ts'o
  0 siblings, 2 replies; 32+ messages in thread
From: Andreas Dilger @ 2019-12-04 18:31 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Theodore Y. Ts'o, linux-ext4, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi

[-- Attachment #1: Type: text/plain, Size: 3208 bytes --]

On Dec 1, 2019, at 1:21 AM, Daniel Phillips <daniel@phunq.net> wrote:
>>> Important example: how is atomic directory commit going to work for
>>> Ext4?
>> 
>> The same way all metadata updates work in ext4.  Which is to say, you
>> need to declare the maximum number of 4k metadata blocks that an
>> operation might need to change when calling ext4_journal_start() to
>> create a handle; and before modifying a 4k block, you need to call
>> ext4_journal_get_write_access(), passing in the handle and the block's
>> buffer_head.  After modifying the block, you must call
>> ext4_handle_dirty_metadata() on the buffer_head.  And when you are
>> doing with the changes in an atomic metadata operation, you call
>> ext4_journal_stop() on the handle.
>> 
>> This hasn't changed since the days of ext3 and htree.
> 
> OK good. And I presume that directory updates are prevented until
> the journal transaction is at least fully written to buffers. Maybe
> delayed until the journal transaction is actually committed?
> 
> In Tux3 we don't block directory updates during backend commit, and I
> just assumed that Ext4 and others also do that now, so thanks for the
> correction. As far I can see, there will be no new issue with Shardmap,
> as you say. My current plan is that user space mmap will become kmap in
> kernel. I am starting on this part for Tux3 right now. My goal is to
> refine the current Shardmap data access api to hide the fact that mmap
> is used in user space but kmap in kernel. Again, I wish we actually had
> mmap in kernel and maybe we should consider properly supporting it in
> the future, perhaps by improving kmalloc.
> 
>> One thing we do a bit differently from our traditional fs is, in the
> common, unfragmented case, mass inserts go into the same block until
> the block is full. So we get a significant speedup by avoiding a page
> cache lookup and kmap per insert. Borrowing a bit of mechanism from
> the persistent memory version of Shardmap, we create the new entries
> in a separate cache page. Then, on commit, copy this "front buffer" to
> the page cache. I think that will translate pretty well to Ext4 also.

One important use case that we have for Lustre that is not yet in the
upstream ext4[*] is the ability to do parallel directory operations.
This means we can create, lookup, and/or unlink entries in the same
directory concurrently, to increase parallelism for large directories.

This is implemented by progressively locking the htree root and index
blocks (typically read-only), then leaf blocks (read-only for lookup,
read-write for insert/delete).  This provides improved parallelism
as the directory grows in size.

Will there be some similar ability in Shardmap to have parallel ops?
Also, does Shardmap have the ability to shrink as entries are removed?

Cheers, Andreas

[*] we've tried to submit the pdirops patch a couple of times, but the
main blocker is that the VFS has a single directory mutex and couldn't
use the added functionality without significant VFS changes.
Patch at https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-04 18:03     ` [RFC] Thing 1: Shardmap fox Ext4 Andreas Dilger
@ 2019-12-04 20:47       ` Daniel Phillips
  2019-12-04 20:53         ` Daniel Phillips
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Phillips @ 2019-12-04 20:47 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Y. Ts'o, linux-ext4, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi

On 2019-12-04 10:03 a.m., Andreas Dilger wrote:
>> Here is a pretty picture to get started:
>>
>>    https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format
> 
> The shardmap diagram is good conceptually, but it would be useful
> to add a legend on the empty side of the diagram that shows the on-disk
> structures.

Sounds good, but not sure exactly what you had in mind. Fields of a
shard entry? Fields of the block 0 header? The record entry block has
its own diagram, and is polymorphic anyway, so no fixed format.

Regards,

Daniel



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-04 20:47       ` Daniel Phillips
@ 2019-12-04 20:53         ` Daniel Phillips
  2019-12-05  5:59           ` Daniel Phillips
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Phillips @ 2019-12-04 20:53 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Y. Ts'o, linux-ext4, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi



On 2019-12-04 12:47 p.m., Daniel Phillips wrote:
> On 2019-12-04 10:03 a.m., Andreas Dilger wrote:
>>> Here is a pretty picture to get started:
>>>
>>>     https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format
>>
>> The shardmap diagram is good conceptually, but it would be useful
>> to add a legend on the empty side of the diagram that shows the on-disk
>> structures.
> 
> Sounds good, but not sure exactly what you had in mind. Fields of a
> shard entry? Fields of the block 0 header? The record entry block has
> its own diagram, and is polymorphic anyway, so no fixed format.

Ah, the legend can explain that lower tier shard 2 is in the process of
being split into upper tier shards 4 and 5, and label the shards
with offset numbers.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-04 18:31             ` Andreas Dilger
@ 2019-12-04 21:44               ` Daniel Phillips
  2019-12-05  0:36                 ` Andreas Dilger
  2019-12-04 23:41               ` [RFC] Thing 1: Shardmap fox Ext4 Theodore Y. Ts'o
  1 sibling, 1 reply; 32+ messages in thread
From: Daniel Phillips @ 2019-12-04 21:44 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Y. Ts'o, linux-ext4, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi

On 2019-12-04 10:31 a.m., Andreas Dilger wrote:
> One important use case that we have for Lustre that is not yet in the
> upstream ext4[*] is the ability to do parallel directory operations.
> This means we can create, lookup, and/or unlink entries in the same
> directory concurrently, to increase parallelism for large directories.

This is a requirement for an upcoming transactional version of user space
Shardmap. In the database world they call it "row locking". I am working
on a hash based scheme with single record granularity that maps onto the
existing shard buckets, which should be nice and efficient, maybe a bit
tricky with respect to rehash but looks not too bad.

Per-shard rw locks are a simpler alternative, but might get a bit fiddly
if you need to lock multiple entries in the same directory at the same
time, which is required for mv is it not?
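
To make the fiddly part concrete, here is a minimal user space sketch of
the per-shard alternative (hypothetical names and constants, not actual
Shardmap code): one rwlock per shard, and any operation that touches two
entries takes the two shard locks in ascending order so that concurrent
mv-style operations cannot deadlock.

    #include <pthread.h>
    #include <stdint.h>

    enum { NSHARDS = 256 };  /* assumed shard count, power of two */

    static pthread_rwlock_t shard_lock[NSHARDS];

    static void shard_locks_init(void)
    {
            for (int i = 0; i < NSHARDS; i++)
                    pthread_rwlock_init(&shard_lock[i], NULL);
    }

    /* High bits of the name hash select the shard. */
    static unsigned shard_of(uint64_t hash) { return hash >> 56; }

    /* Write-lock the shards covering two names, lowest shard first. */
    static void lock_two(uint64_t hash_a, uint64_t hash_b)
    {
            unsigned a = shard_of(hash_a), b = shard_of(hash_b);
            if (a > b) { unsigned t = a; a = b; b = t; }
            pthread_rwlock_wrlock(&shard_lock[a]);
            if (b != a)
                    pthread_rwlock_wrlock(&shard_lock[b]);
    }

    static void unlock_two(uint64_t hash_a, uint64_t hash_b)
    {
            unsigned a = shard_of(hash_a), b = shard_of(hash_b);
            pthread_rwlock_unlock(&shard_lock[a]);
            if (b != a)
                    pthread_rwlock_unlock(&shard_lock[b]);
    }

Plain lookups would take the same lock in read mode, so they still run
concurrently against each other.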

> This is implemented by progressively locking the htree root and index
> blocks (typically read-only), then leaf blocks (read-only for lookup,
> read-write for insert/delete).  This provides improved parallelism
> as the directory grows in size.

This will be much easier and more efficient with Shardmap because there
are only three levels: top level shard array; shard hash bucket; record
block. Locking applies only to cache, so no need to worry about possible
upper tier during incremental "reshard".

I think Shardmap will also split more cleanly across metadata nodes than
HTree.

> Will there be some similar ability in Shardmap to have parallel ops?

This work is already in progress for user space Shardmap. If there is
also a kernel use case then we can just go forward assuming that this
work or some variation of it applies to both.

We need VFS changes to exploit parallel dirops in general, I think,
confirmed by your comment below. Seems like a good bit of work for
somebody. I bet the benchmarks will show well, suitable grist for a
master's thesis I would think.

Fine-grained directory locking may have a small enough footprint in
the Shardmap kernel port that there is no strong argument for getting
rid of it, just because VFS doesn't support it yet. Really, this has
the smell of a VFS flaw (interested in Al's comments...)

> Also, does Shardmap have the ability to shrink as entries are removed?

No shrink so far. What would you suggest? Keeping in mind that POSIX+NFS
semantics mean that we cannot in general defrag on the fly. I planned to
just hole_punch blocks that happen to become completely empty.
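
In the user space code that could be as simple as something like this
(illustrative sketch, error handling omitted; the kernel port would go
through the filesystem's own punch-hole path instead):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>

    /* Punch out one record entry block that became completely empty,
     * keeping the logical file size (and thus telldir cookies) intact. */
    static int punch_empty_block(int fd, off_t blockpos, off_t blocksize)
    {
            return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             blockpos, blocksize);
    }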

This aspect has so far not gotten attention because, historically, we
just never shrink a directory except via fsck/tools. What would you
like to see here? Maybe an ioctl to invoke directory defrag? A mode
bit to indicate we don't care about persistent telldir cookies?

How about automatic defrag that only runs when directory open count is
zero, plus a flag to disable?

> [*] we've tried to submit the pdirops patch a couple of times, but the
> main blocker is that the VFS has a single directory mutex and couldn't
> use the added functionality without significant VFS changes.

How significant would it be, really nasty or just somewhat nasty? I bet
the resulting efficiencies would show up in some general use cases.

> Patch at https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD

This URL gives me git://git.whamcloud.com/fs/lustre-release.git/summary,
am I missing something?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-04 18:31             ` Andreas Dilger
  2019-12-04 21:44               ` Daniel Phillips
@ 2019-12-04 23:41               ` Theodore Y. Ts'o
  2019-12-06  1:16                 ` Dave Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: Theodore Y. Ts'o @ 2019-12-04 23:41 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Daniel Phillips, linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On Wed, Dec 04, 2019 at 11:31:50AM -0700, Andreas Dilger wrote:
> One important use case that we have for Lustre that is not yet in the
> upstream ext4[*] is the ability to do parallel directory operations.
> This means we can create, lookup, and/or unlink entries in the same
> directory concurrently, to increase parallelism for large directories.
> 
> 
> [*] we've tried to submit the pdirops patch a couple of times, but the
> main blocker is that the VFS has a single directory mutex and couldn't
> use the added functionality without significant VFS changes.
> Patch at https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD
>

The XFS folks recently added support for parallel directory operations
into the VFS, for the benefit of XFS, which has this feature.  So it
should be possible to adjust the patch so it will work with the upstream
kernel now.

Cheers,

					- Ted



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-04 21:44               ` Daniel Phillips
@ 2019-12-05  0:36                 ` Andreas Dilger
  2019-12-05  2:27                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
  0 siblings, 1 reply; 32+ messages in thread
From: Andreas Dilger @ 2019-12-05  0:36 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Theodore Y. Ts'o, Ext4 Developers List,
	Linux Kernel Mailing List, linux-fsdevel, OGAWA Hirofumi

[-- Attachment #1: Type: text/plain, Size: 5834 bytes --]

On Dec 4, 2019, at 2:44 PM, Daniel Phillips <daniel@phunq.net> wrote:
> 
> On 2019-12-04 10:31 a.m., Andreas Dilger wrote:
>> One important use case that we have for Lustre that is not yet in the
>> upstream ext4[*] is the ability to do parallel directory operations.
>> This means we can create, lookup, and/or unlink entries in the same
>> directory concurrently, to increase parallelism for large directories.
> 
> This is a requirement for an upcoming transactional version of user space
> Shardmap. In the database world they call it "row locking". I am working
> on a hash based scheme with single record granularity that maps onto the
> existing shard buckets, which should be nice and efficient, maybe a bit
> tricky with respect to rehash but looks not too bad.
> 
> Per-shard rw locks are a simpler alternative, but might get a bit fiddly
> if you need to lock multiple entries in the same directory at the same
> time, which is required for mv is it not?

We currently have a "big filesystem lock" (BFL) for rename(), as rename
is not an operation whose performance many people care about.  We've
discussed a number of times optimizing this for the common cases of
renaming a regular file within a single directory and renaming a regular
file between directories, but have no plans at all to optimize renaming
directories between parents.

>> This is implemented by progressively locking the htree root and index
>> blocks (typically read-only), then leaf blocks (read-only for lookup,
>> read-write for insert/delete).  This provides improved parallelism
>> as the directory grows in size.
> 
> This will be much easier and more efficient with Shardmap because there
> are only three levels: top level shard array; shard hash bucket; record
> block. Locking applies only to cache, so no need to worry about possible
> upper tier during incremental "reshard".
> 
> I think Shardmap will also split more cleanly across metadata nodes than
> HTree.

We don't really split "htree" across metadata nodes, that is handled by
Lustre at a higher level than the underlying filesystem.  Hash filename
with per-directory hash type, modulo number of directory shards to find
index within that directory, then map index to a directory shard on a
particular server.  The backing filesystem directories are normal from
the POV of the local filesystem.
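
In pseudo-C the mapping is roughly this (illustrative only, made up names,
not the actual Lustre code):

    #include <stdint.h>

    struct dir_layout {
            uint64_t (*hash)(const char *name, int len);  /* per-directory hash type */
            unsigned stripe_count;                        /* number of directory shards */
            unsigned *stripe_to_mdt;                      /* shard index -> metadata server */
    };

    static unsigned name_to_mdt(const struct dir_layout *dl,
                                const char *name, int len)
    {
            unsigned stripe = dl->hash(name, len) % dl->stripe_count;
            return dl->stripe_to_mdt[stripe];
    }

Each stripe is then just an ordinary directory on whichever server it
landed on.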

>> Will there be some similar ability in Shardmap to have parallel ops?
> 
> This work is already in progress for user space Shardmap. If there is
> also a kernel use case then we can just go forward assuming that this
> work or some variation of it applies to both.
> 
> We need VFS changes to exploit parallel dirops in general, I think,
> confirmed by your comment below. Seems like a good bit of work for
> somebody. I bet the benchmarks will show well, suitable grist for a
> master's thesis I would think.
> 
> Fine-grained directory locking may have a small enough footprint in
> the Shardmap kernel port that there is no strong argument for getting
> rid of it, just because VFS doesn't support it yet. Really, this has
> the smell of a VFS flaw (interested in Al's comments...)

I think that the VFS could get 95% of the benefit for 10% of the effort
by allowing only rename of regular files within a directory, with only
a per-directory mutex.  The only workload that I know of which does a
lot of rename is rsync, or parallel versions of it, which create
temporary files during data transfer, then rename the file over the
target atomically after the data is sync'd to disk.

>> Also, does Shardmap have the ability to shrink as entries are removed?
> 
> No shrink so far. What would you suggest? Keeping in mind that POSIX+NFS
> semantics mean that we cannot in general defrag on the fly. I planned to
> just hole_punch blocks that happen to become completely empty.
> 
> This aspect has so far not gotten attention because, historically, we
> just never shrink a directory except via fsck/tools. What would you
> like to see here? Maybe an ioctl to invoke directory defrag? A mode
> bit to indicate we don't care about persistent telldir cookies?

There are a few patches floating around to shrink ext4 directories which
I'd like to see landed at some point.  The current code is sub-optimal,
in that it only tries to shrink "easy" blocks from the end of directory,
but hopefully there can be more aggressive shrinking in later patches.

> How about automatic defrag that only runs when directory open count is
> zero, plus a flag to disable?

As long as the shrinking doesn't break POSIX readdir ordering semantics.
I'm obviously not savvy on the Shardmap details, but I'd think that the
shards need to be garbage collected/packed periodically since they are
log structured (write at end, tombstones for unlinks), so that would be
an opportunity to shrink the shards?

>> [*] we've tried to submit the pdirops patch a couple of times, but the
>> main blocker is that the VFS has a single directory mutex and couldn't
>> use the added functionality without significant VFS changes.
> 
> How significant would it be, really nasty or just somewhat nasty? I bet
> the resulting efficiencies would show up in some general use cases.

As stated above, I think the common case could be implemented relatively
easily (rename within a directory), then maybe rename files across
directories, and maybe never rename subdirectories across directories.

>> Patch at https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD
> 
> This URL gives me git://git.whamcloud.com/fs/lustre-release.git/summary,
> am I missing something?

Just walk down the tree for the "f=ldiskfs/..." pathname...


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap for Ext4
  2019-12-05  0:36                 ` Andreas Dilger
@ 2019-12-05  2:27                   ` Daniel Phillips
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel Phillips @ 2019-12-05  2:27 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Y. Ts'o, Ext4 Developers List,
	Linux Kernel Mailing List, linux-fsdevel, OGAWA Hirofumi

(finally noticed the gross error in the subject line)

On 2019-12-04 4:36 p.m., Andreas Dilger wrote:
> On Dec 4, 2019, at 2:44 PM, Daniel Phillips <daniel@phunq.net> wrote:
>>
>> On 2019-12-04 10:31 a.m., Andreas Dilger wrote:
>>> One important use case that we have for Lustre that is not yet in the
>>> upstream ext4[*] is the ability to do parallel directory operations.
>>> This means we can create, lookup, and/or unlink entries in the same
>>> directory concurrently, to increase parallelism for large directories.
>>
>> This is a requirement for an upcoming transactional version of user space
>> Shardmap. In the database world they call it "row locking". I am working
>> on a hash based scheme with single record granularity that maps onto the
>> existing shard buckets, which should be nice and efficient, maybe a bit
>> tricky with respect to rehash but looks not too bad.
>>
>> Per-shard rw locks are a simpler alternative, but might get a bit fiddly
>> if you need to lock multiple entries in the same directory at the same
>> time, which is required for mv is it not?
> 
> We currently have a "big filesystem lock" (BFL) for rename(), as rename
> is not an operation whose performance many people care about.  We've
> discussed a number of times optimizing this for the common cases of
> renaming a regular file within a single directory and renaming a regular
> file between directories, but have no plans at all to optimize renaming
> directories between parents.
> 
>>> This is implemented by progressively locking the htree root and index
>>> blocks (typically read-only), then leaf blocks (read-only for lookup,
>>> read-write for insert/delete).  This provides improved parallelism
>>> as the directory grows in size.
>>
>> This will be much easier and more efficient with Shardmap because there
>> are only three levels: top level shard array; shard hash bucket; record
>> block. Locking applies only to cache, so no need to worry about possible
>> upper tier during incremental "reshard".
>>
>> I think Shardmap will also split more cleanly across metadata nodes than
>> HTree.
> 
> We don't really split "htree" across metadata nodes, that is handled by
> Lustre at a higher level than the underlying filesystem.  Hash filename
> with per-directory hash type, modulo number of directory shards to find
> index within that directory, then map index to a directory shard on a
> particular server.  The backing filesystem directories are normal from
> the POV of the local filesystem.

OK, Lustre's higher level seems to somewhat resemble Shardmap, though
your extensibility scheme must be quite different. It does lend weight
to the proposition that hash sharding is the technique of choice at high
scale.

>>> Will there be some similar ability in Shardmap to have parallel ops?
>>
>> This work is already in progress for user space Shardmap. If there is
>> also a kernel use case then we can just go forward assuming that this
>> work or some variation of it applies to both.
>>
>> We need VFS changes to exploit parallel dirops in general, I think,
>> confirmed by your comment below. Seems like a good bit of work for
>> somebody. I bet the benchmarks will show well, suitable grist for a
>> master's thesis I would think.
>>
>> Fine-grained directory locking may have a small enough footprint in
>> the Shardmap kernel port that there is no strong argument for getting
>> rid of it, just because VFS doesn't support it yet. Really, this has
>> the smell of a VFS flaw (interested in Al's comments...)
> 
> I think that the VFS could get 95% of the benefit for 10% of the effort
> by allowing only rename of regular files within a directory, with only
> a per-directory mutex.  The only workload that I know of which does a
> lot of rename is rsync, or parallel versions of it, which create
> temporary files during data transfer, then rename the file over the
> target atomically after the data is sync'd to disk.

MTA is another rename-heavy workload and, I seem to recall, so is KDE
config update. I agree that parallel cross-directory rename locking would
be basically nuts.

>>> Also, does Shardmap have the ability to shrink as entries are removed?
>>
>> No shrink so far. What would you suggest? Keeping in mind that POSIX+NFS
>> semantics mean that we cannot in general defrag on the fly. I planned to
>> just hole_punch blocks that happen to become completely empty.
>>
>> This aspect has so far not gotten attention because, historically, we
>> just never shrink a directory except via fsck/tools. What would you
>> like to see here? Maybe an ioctl to invoke directory defrag? A mode
>> bit to indicate we don't care about persistent telldir cookies?
> 
> There are a few patches floating around to shrink ext4 directories which
> I'd like to see landed at some point.  The current code is sub-optimal,
> in that it only tries to shrink "easy" blocks from the end of directory,
> but hopefully there can be more aggressive shrinking in later patches.

I intend to add some easy ones like that to Shardmap, in particular so
deleting every entry leaves the directory with a single block containing
just the header.

BTW, in Tux3 we plan to add a special Shardmap inode attribute to hold
the header information, so that an empty directory will have zero blocks
instead of one. I am still fussing about this detail, because it seems
a bit odd to not have at least the record block count present in the
dir file itself. Maybe the inode attr should just hold the tuning
parameters and the file header can hold the essential geometry details,
like record block count and logical index position.

>> How about automatic defrag that only runs when directory open count is
>> zero, plus a flag to disable?
> 
> As long as the shrinking doesn't break POSIX readdir ordering semantics.
> I'm obviously not savvy on the Shardmap details, but I'd think that the
> shards need to be garbage collected/packed periodically since they are
> log structured (write at end, tombstones for unlinks), so that would be
> an opportunity to shrink the shards?

Shardmap already does that. Every time a new tier is added, all shards
are incrementally compacted. Any shard that fills up because of tombstones
is compacted instead of forcing a new index level.

Shards are not currently compacted if they have tombstones but still have
room for more entries. We could add some more logic there, so that shards
are automatically compacted according to heuristics based on the ratio of
tombstones to shard size. Mass delete creates a lot of tombstones however,
and we probably do not want to spend resources compacting under this load,
except when shards actually fill up. Such logic could be tricky to tune
for all loads.
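
If we do add a trigger it would be something in the spirit of this
(hypothetical heuristic, not current Shardmap behavior):

    /* Compact a shard when dead entries pass a fixed fraction of the
     * total, but not while a mass delete is still hammering it. */
    static int should_compact(unsigned entries, unsigned tombstones,
                              int mass_delete_active)
    {
            if (mass_delete_active)
                    return 0;
            return tombstones * 4 >= entries;  /* 25% or more dead */
    }

with the threshold and the mass delete detection being exactly the parts
that are hard to tune.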

We are always able to compact the index without affecting POSIX semantics,
so a lot of flexibility exists in what can be done there. However, if
there are sparse blocks near the top of the directory, we can't do much
about them without breaking POSIX.

We will hole_punch any completely empty record blocks, which should help
avoid wasting media space for very sparse directories. But we could still
end up with ten bytes in use per directory block, giving a fragmentation
ratio of 400 to one. For this, maybe we had better provide the user with
a way to indicate that compaction should be done irrespective of POSIX
considerations, or at least with them slightly relaxed.

>>> [*] we've tried to submit the pdirops patch a couple of times, but the
>>> main blocker is that the VFS has a single directory mutex and couldn't
>>> use the added functionality without significant VFS changes.
>>
>> How significant would it be, really nasty or just somewhat nasty? I bet
>> the resulting efficiencies would show up in some general use cases.
> 
> As stated above, I think the common case could be implemented relatively
> easily (rename within a directory), then maybe rename files across
> directories, and maybe never rename subdirectories across directories.

It seems parallel directory ops are now in play, according to Ted. Will
you refresh your patch?

>>> Patch at https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD
>>
>> This URL gives me git://git.whamcloud.com/fs/lustre-release.git/summary,
>> am I missing something?
> 
> Just walk down the tree for the "f=ldiskfs/..." pathname...

Got it. Not sure what the issue was before, this time the patch pops up
as expected.

BTW, Ted, Microsoft seems to have implemented a nefarious scheme to
bounce your direct emails so you have no choice but to read them from
the internet:

==========
A message that you sent could not be delivered to one or more of its
recipients. This is a permanent error. The following address(es) failed:

  <you>@mit.edu
    host mit-edu.mail.protection.outlook.com [104.47.41.36]
    SMTP error from remote mail server after RCPT TO:<tytso@mit.edu>:
    550 5.7.606 Access denied, banned sending IP [66.183.183.73]. To request removal from this list please visit https://sender.office.com/ and follow the directions. For more information please go to  http://go.microsoft.com/fwlink/?LinkID=526655 (AS16012609)
==========

And of course requesting removal as suggested fails with some sort of "we
suck and had some kind of internal bug so please just keep trying so we
can waste more of your time" message. Malice or incompetence? We need to
know. And MIT+Outlook? I'm shocked. Shocked, I tell you.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-04 20:53         ` Daniel Phillips
@ 2019-12-05  5:59           ` Daniel Phillips
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel Phillips @ 2019-12-05  5:59 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Y. Ts'o, linux-ext4, linux-kernel, linux-fsdevel,
	OGAWA Hirofumi

Updated

https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-04 15:55     ` Vyacheslav Dubeyko
@ 2019-12-05  9:46       ` Daniel Phillips
  2019-12-06 11:47         ` Vyacheslav Dubeyko
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Phillips @ 2019-12-05  9:46 UTC (permalink / raw)
  To: Vyacheslav Dubeyko, Theodore Y. Ts'o
  Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On 2019-12-04 7:55 a.m., Vyacheslav Dubeyko wrote:
>> That is it for media format. Very simple, is it not? My next post
>> will explain the Shardmap directory block format, with a focus on
>> deficiencies of the traditional Ext2 format that were addressed.
> 
> I've tried to take a look into the source code. And it was not an easy
> try. :)

Let's see what we can do about that, starting with removing the duopack
(media index entry) and tripack (cache index entry) templates. Now that
the design has settled down we don't need that level of generality so
much any more. The replacements are mostly C-style now and, by the time
the Tux3 kernel port is done, they will be officially C.

So far I only described the media format, implemented in define_layout(),
which I hope is self explanatory. You should be able to tie it back to
this diagram pretty easily.

   https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format

> I expected to have the bird-fly understanding from shardmap.h
> file. My expectation was to find the initial set of structure
> declarations with the good comments.

Our wiki is slowly getting populated with design documentation. Most of
what you see in shardmap.h is concerned with the Shardmap cache form,
where all the action happens. I have not said much about that yet, but
there is a post on the way. The main structures are struct shard (a
self contained hash table) and struct keymap (a key value store
populated with shards). Those are obvious I think, please correct me
if I am wrong. A more tricky one is struct tier, which implements our
incremental hash table expansion. You might expect that to be a bit
subtle, and it is.

Before getting into those details, there is an upcoming post about
the record block format, which is pretty non-abstract and, I think,
easy enough to understand from the API declaration in shardmap.h and
the code in recops.c.

There is a diagram here:

   https://github.com/danielbot/Shardmap/wiki/Shardmap-record-block-format

but the post this belongs to is not quite ready to go out yet. That one
will be an interlude before the cache form discussion, which is where
the magic happens: things like rehash and reshard and add_tier, and the
way the hash code gets chopped up as it runs through the access stack.

Here is a diagram of the cache structures, very simple:

   https://github.com/danielbot/Shardmap/wiki/Shardmap-cache-format

And here is a diagram of the Shardmap three level hashing scheme,
which ties everything together:

   https://github.com/danielbot/Shardmap/wiki/Shardmap-hashing-scheme

This needs explanation. It is something new that you won't find in any
textbook, this is the big reveal right here.

> But, frankly speaking, it's a very
> complicated path to understanding the concept. Even from a C++ point of
> view, the class declarations look very complicated when fields and
> method declarations are mixed together.

In each class, fields are declared first, then methods. In the kernel
port of course we will not have classes, and the function names will be
longer as usual.

> So, I believe it makes sense to declare the necessary set of structures
> at the beginning of the file, with good comments. It would even be good
> to split the structure declarations and methods into different files. I
> believe it will ease the way to understanding the concept. Otherwise, it
> will be tough to review such code.

Declaring structures and functions in the same file is totally normal
for kernel code; you don't really want these in separate files unless
they break out naturally that way.

This code is dense, there is a lot going on in not very many lines. So
we need lots of lines of documentation to make up for that, which has
not been a priority until now, so please bear with me. And please do
not hesitate to ask specific questions - the answers may well end up in
the wiki.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-04 23:41               ` [RFC] Thing 1: Shardmap fox Ext4 Theodore Y. Ts'o
@ 2019-12-06  1:16                 ` Dave Chinner
  2019-12-06  5:09                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2019-12-06  1:16 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Andreas Dilger, Daniel Phillips, linux-ext4, linux-kernel,
	linux-fsdevel, OGAWA Hirofumi

On Wed, Dec 04, 2019 at 06:41:06PM -0500, Theodore Y. Ts'o wrote:
> On Wed, Dec 04, 2019 at 11:31:50AM -0700, Andreas Dilger wrote:
> > One important use case that we have for Lustre that is not yet in the
> > upstream ext4[*] is the ability to do parallel directory operations.
> > This means we can create, lookup, and/or unlink entries in the same
> > directory concurrently, to increase parallelism for large directories.
> > 
> > 
> > [*] we've tried to submit the pdirops patch a couple of times, but the
> > main blocker is that the VFS has a single directory mutex and couldn't
> > use the added functionality without significant VFS changes.
> > Patch at https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD
> >
> 
> The XFS folks recently added support for parallel directory operations
> into the VFS, for the benefit of XFS, which has this feature.

The use of shared i_rwsem locking on the directory inode during
lookup/pathwalk allows for concurrent lookup/readdir operations on
a single directory. However, the parent dir i_rwsem is still held
exclusive for directory modifications like create, unlink, etc.

IOWs, the VFS doesn't allow for concurrent directory modification
right now, and that's going to be the limiting factor no matter what
you do with internal filesystem locking.
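
To spell out the asymmetry, a grossly simplified sketch of the behaviour
described above (not the actual fs/namei.c code):

    #include <linux/fs.h>

    /* Pathwalk/readdir side: shared lock, so these run concurrently. */
    static struct dentry *lookup_sketch(struct inode *dir, struct dentry *dentry)
    {
            struct dentry *res;

            inode_lock_shared(dir);
            res = dir->i_op->lookup(dir, dentry, 0);
            inode_unlock_shared(dir);
            return res;
    }

    /* Modification side: exclusive lock, one create/unlink/rename at a time,
     * regardless of what the filesystem's own directory locking allows. */
    static int create_sketch(struct inode *dir, struct dentry *dentry, umode_t mode)
    {
            int err;

            inode_lock(dir);
            err = dir->i_op->create(dir, dentry, mode, true);
            inode_unlock(dir);
            return err;
    }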

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap for Ext4
  2019-12-06  1:16                 ` Dave Chinner
@ 2019-12-06  5:09                   ` Daniel Phillips
  2019-12-08 22:42                     ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Phillips @ 2019-12-06  5:09 UTC (permalink / raw)
  To: Dave Chinner, Theodore Y. Ts'o
  Cc: Andreas Dilger, linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On 2019-12-05 5:16 p.m., Dave Chinner wrote:
> On Wed, Dec 04, 2019 at 06:41:06PM -0500, Theodore Y. Ts'o wrote:
>> On Wed, Dec 04, 2019 at 11:31:50AM -0700, Andreas Dilger wrote:
>>> One important use case that we have for Lustre that is not yet in the
>>> upstream ext4[*] is the ability to do parallel directory operations.
>>> This means we can create, lookup, and/or unlink entries in the same
>>> directory concurrently, to increase parallelism for large directories.
>>>
>>> [*] we've tried to submit the pdirops patch a couple of times, but the
>>> main blocker is that the VFS has a single directory mutex and couldn't
>>> use the added functionality without significant VFS changes.
>>> Patch at https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD
>>>
>>
>> The XFS folks recently added support for parallel directory operations
>> into the VFS, for the benefit of XFS, which has this feature.
> 
> The use of shared i_rwsem locking on the directory inode during
> lookup/pathwalk allows for concurrent lookup/readdir operations on
> a single directory. However, the parent dir i_rwsem is still held
> exclusive for directory modifications like create, unlink, etc.
> 
> IOWs, the VFS doesn't allow for concurrent directory modification
> right now, and that's going to be the limiting factor no matter what
> you do with internal filesystem locking.

On a scale of 0 to 10, how hard do you think that would be to relax
in VFS, given the restriction of no concurrent inter-directory moves?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-12-05  9:46       ` Daniel Phillips
@ 2019-12-06 11:47         ` Vyacheslav Dubeyko
  2019-12-07  0:46           ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
  0 siblings, 1 reply; 32+ messages in thread
From: Vyacheslav Dubeyko @ 2019-12-06 11:47 UTC (permalink / raw)
  To: Daniel Phillips, Theodore Y. Ts'o
  Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On Thu, 2019-12-05 at 01:46 -0800, Daniel Phillips wrote:
> On 2019-12-04 7:55 a.m., Vyacheslav Dubeyko wrote:
> > > 

<snipped and reordered>

> And here is a diagram of the Shardmap three level hashing scheme,
> which ties everything together:
> 
>    https://github.com/danielbot/Shardmap/wiki/Shardmap-hashing-scheme
> 
> This needs explanation. It is something new that you won't find in
> any
> textbook, this is the big reveal right here.
> 

This diagram is pretty good and provides a high-level view of the
whole scheme. But maybe it makes sense to show the granularity of the
hash code. It looks like the low hash is the hash of a name. Am I
correct? But how are the mid and high parts of the hash code defined?
It looks like the cached shard stores LBAs of record entry blocks
associated with the low hash values. But what does it mean that a
shard is cached?

> Here is a diagram of the cache structures, very simple:
> 
>    https://github.com/danielbot/Shardmap/wiki/Shardmap-cache-format
> 

This diagram is not easy to relate to the previous one. So, are the shard
table and the shard array the same entity or not? Or do you mean that the
shard table is stored on the volume but the shard array is constructed in
memory?

> There is a diagram here:
> 
>    https://github.com/danielbot/Shardmap/wiki/Shardmap-record-block-format

I am slightly confused here. Is the header located at the bottom of
the record block? My understanding is that records grow from the top of
the block down toward the header. Am I correct? Why is the header not
located at the top of the block with the entry dictionary? Any special
purpose here?

Thanks,
Viacheslav Dubeyko.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap for Ext4
  2019-12-06 11:47         ` Vyacheslav Dubeyko
@ 2019-12-07  0:46           ` Daniel Phillips
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel Phillips @ 2019-12-07  0:46 UTC (permalink / raw)
  To: Vyacheslav Dubeyko, Theodore Y. Ts'o
  Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On 2019-12-06 3:47 a.m., Vyacheslav Dubeyko wrote:
> On Thu, 2019-12-05 at 01:46 -0800, Daniel Phillips wrote:
>> On 2019-12-04 7:55 a.m., Vyacheslav Dubeyko wrote:
>>>>
> 
> <snipped and reordered>
> 
>> And here is a diagram of the Shardmap three level hashing scheme,
>> which ties everything together:
>>
>>     https://github.com/danielbot/Shardmap/wiki/Shardmap-hashing-scheme
>>
>> This needs explanation. It is something new that you won't find in
>> any
>> textbook, this is the big reveal right here.
>>
> 
> This diagram is pretty good and provides a high-level view of the
> whole scheme. But maybe it makes sense to show the granularity of the
> hash code. It looks like the low hash is the hash of a name. Am I
> correct?

Not quite. A 64 bit hash code is computed per name, then divided up into
three parts as shown in the diagram. Each part of the hash addresses a
different level of the Shardmap index hierarchy: high bits address the
top level shard array, giving a pointer to a shard; middle bits address
a hash bucket within that shard; low bits are used to resolve collisions
within the hash bucket (and collisions still may occur even when the low
bits are considered, forcing a record block access and full string
compare.)

> But how are the mid and high parts of the hash code defined?

Given the above description, does the diagram make sense? If so I will
add this description to the wiki.
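
In code the split looks something like this (field widths invented for
illustration; in Shardmap they depend on the current shard and bucket
counts and change as the table grows):

    #include <stdint.h>

    enum { SHARD_BITS = 8, BUCKET_BITS = 12,
           LOW_BITS = 64 - SHARD_BITS - BUCKET_BITS };

    struct hash_split { unsigned shard, bucket; uint64_t low; };

    static struct hash_split split_hash(uint64_t hash)
    {
            struct hash_split s = {
                    /* high bits: index into the top level shard array */
                    .shard  = hash >> (BUCKET_BITS + LOW_BITS),
                    /* middle bits: hash bucket within that shard */
                    .bucket = (hash >> LOW_BITS) & ((1u << BUCKET_BITS) - 1),
                    /* low bits: kept in the shard entry to resolve collisions */
                    .low    = hash & (((uint64_t)1 << LOW_BITS) - 1),
            };
            return s;
    }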

> It looks like the cached shard stores LBAs of record entry blocks
> associated with the low hash values.

Rather, associated with the entire hash value.

> But what does it mean that a shard is cached?

This is the cache form of the shard, meaning that the unordered hash/lba
index pairs (duopack) were read in from media and loaded into this cache
object (or newly created by recent directory operations).

> Here is a diagram of the cache structures, very simple:
>>
>>     https://github.com/danielbot/Shardmap/wiki/Shardmap-cache-format
> 
> This diagram is not easy to relate to the previous one. So, are the
> shard table and the shard array the same entity or not?

They are, and I have updated the hashing scheme diagram to refer to both
as "array". I will similarly update the code, which currently calls the
shard array field "map".

> Or do you mean that
> the shard table is stored on the volume but the shard array is
> constructed in memory?

Sorry about that, it should be clear now. On the volume, a simple
unordered collection of hash:lba pairs is stored per shard, which is
reorganized into shard cache form (a hash table object) at demand-load
time.
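
So demand load is conceptually just this (sketch with made up helper
names; the real duopack is bit packed with variable width fields rather
than a fixed struct):

    #include <stdint.h>

    /* Media form: unordered hash:lba index pairs, as stored per shard. */
    struct duopack { uint64_t hash, lba; };

    struct shard;  /* cache form: per-shard hash table object (see shardmap.h) */
    void shard_insert(struct shard *shard, uint64_t hash, uint64_t lba);  /* assumed helper */

    /* Reorganize one media shard into its cache object at demand-load time. */
    static void shard_load(struct shard *shard, const struct duopack *media,
                           unsigned count)
    {
            for (unsigned i = 0; i < count; i++)
                    shard_insert(shard, media[i].hash, media[i].lba);
    }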

>> There is a diagram here:
>>
> https://github.com/danielbot/Shardmap/wiki/Shardmap-record-block-format
> 
> I am slightly confused here. Is the header located at the bottom of
> the record block?

The header (just 32 bytes at the moment, possibly to be expanded to 48
or 64) is stored at the top of the zeroth record entry block, which is
therefore a little smaller than any other record entry block.

> My understanding is that records grow from the top of the
> block down toward the header. Am I correct? Why is the header not
> located at the top of the block with the entry dictionary? Any special
> purpose here?

That should be clear now. I will add the above descriptive text to the
wiki.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap fox Ext4
  2019-11-28  2:28     ` Theodore Y. Ts'o
  2019-11-28  4:27       ` Daniel Phillips
  2019-11-28 21:17       ` [RFC] Thing 1: Shardmap fox Ext4 Daniel Phillips
@ 2019-12-08 10:25       ` Daniel Phillips
  2 siblings, 0 replies; 32+ messages in thread
From: Daniel Phillips @ 2019-12-08 10:25 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: linux-ext4, linux-kernel, linux-fsdevel, OGAWA Hirofumi

On 2019-11-27 6:28 p.m., Theodore Y. Ts'o wrote:
> The use of C++ with templates is presumably one of the "less so"
> parts, and it was that which I had in mind when I said,
> "reimplementing from scratch".

The templates were removed without reimplementing from scratch:

   https://github.com/danielbot/Shardmap/blob/master/shardmap.h#L88
   https://github.com/danielbot/Shardmap/blob/master/shardmap.cc#L82

The duopack/tripack facility, central to Shardmap's efficient scalability, is
now just ordinary C code that happens to be compiled by a C++ compiler. I
think the machine code should be identical to what the templates produced,
though I did not verify that.

This was a strictly mechanical conversion, less error prone than
reimplementing from scratch I would think. I expect the rest of the
back conversions to be similarly mechanical.
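
For a feel of what "strictly mechanical" means here, the conversions are
roughly of this shape (invented example, not the actual shardmap.h code):
a width-parameterized template becomes a small struct that carries the
field width at run time.

    #include <stdint.h>

    struct duopack { unsigned lowbits; uint64_t lowmask; };

    static void duopack_init(struct duopack *dp, unsigned lowbits)
    {
            dp->lowbits = lowbits;
            dp->lowmask = ((uint64_t)1 << lowbits) - 1;  /* assumes lowbits < 64 */
    }

    static uint64_t duopack_pack(const struct duopack *dp, uint64_t hi, uint64_t lo)
    {
            return (hi << dp->lowbits) | (lo & dp->lowmask);
    }

    static uint64_t duopack_hi(const struct duopack *dp, uint64_t packed)
    {
            return packed >> dp->lowbits;
    }

    static uint64_t duopack_lo(const struct duopack *dp, uint64_t packed)
    {
            return packed & dp->lowmask;
    }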

Regards,

Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Thing 1: Shardmap for Ext4
  2019-12-06  5:09                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
@ 2019-12-08 22:42                     ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2019-12-08 22:42 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Theodore Y. Ts'o, Andreas Dilger, linux-ext4, linux-kernel,
	linux-fsdevel, OGAWA Hirofumi

On Thu, Dec 05, 2019 at 09:09:28PM -0800, Daniel Phillips wrote:
> On 2019-12-05 5:16 p.m., Dave Chinner wrote:
> > On Wed, Dec 04, 2019 at 06:41:06PM -0500, Theodore Y. Ts'o wrote:
> >> On Wed, Dec 04, 2019 at 11:31:50AM -0700, Andreas Dilger wrote:
> >>> One important use case that we have for Lustre that is not yet in the
> >>> upstream ext4[*] is the ability to do parallel directory operations.
> >>> This means we can create, lookup, and/or unlink entries in the same
> >>> directory concurrently, to increase parallelism for large directories.
> >>>
> >>> [*] we've tried to submit the pdirops patch a couple of times, but the
> >>> main blocker is that the VFS has a single directory mutex and couldn't
> >>> use the added functionality without significant VFS changes.
> >>> Patch at https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD
> >>>
> >>
> >> The XFS folks recently added support for parallel directory operations
> >> into the VFS, for the benefit of XFS has this feature.
> > 
> > The use of shared i_rwsem locking on the directory inode during
> > lookup/pathwalk allows for concurrent lookup/readdir operations on
> > a single directory. However, the parent dir i_rwsem is still held
> > exclusive for directory modifications like create, unlink, etc.
> > 
> > IOWs, the VFS doesn't allow for concurrent directory modification
> > right now, and that's going to be the limiting factor no matter what
> > you do with internal filesystem locking.
> 
> On a scale of 0 to 10, how hard do you think that would be to relax
> in VFS, given the restriction of no concurrent inter-directory moves?

My initial reaction is to run away screaming in horror. Beyond that,
I have no idea what terrible dangers lurk in the dark shadows where
mortals fear to tread...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2019-12-08 22:43 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-27  1:47 [RFC] Thing 1: Shardmap fox Ext4 Daniel Phillips
2019-11-27  7:40 ` Vyacheslav Dubeyko
2019-11-27  8:28   ` Daniel Phillips
2019-11-27 19:35     ` Viacheslav Dubeyko
2019-11-28  2:54       ` Daniel Phillips
2019-11-28  9:15         ` Andreas Dilger
2019-11-28 10:03           ` Daniel Phillips
2019-11-27 14:25 ` Theodore Y. Ts'o
2019-11-27 22:27   ` Daniel Phillips
2019-11-28  2:28     ` Theodore Y. Ts'o
2019-11-28  4:27       ` Daniel Phillips
2019-11-30 17:50         ` Theodore Y. Ts'o
2019-12-01  8:21           ` Daniel Phillips
2019-12-04 18:31             ` Andreas Dilger
2019-12-04 21:44               ` Daniel Phillips
2019-12-05  0:36                 ` Andreas Dilger
2019-12-05  2:27                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-04 23:41               ` [RFC] Thing 1: Shardmap fox Ext4 Theodore Y. Ts'o
2019-12-06  1:16                 ` Dave Chinner
2019-12-06  5:09                   ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-08 22:42                     ` Dave Chinner
2019-11-28 21:17       ` [RFC] Thing 1: Shardmap fox Ext4 Daniel Phillips
2019-12-08 10:25       ` Daniel Phillips
2019-12-02  1:45   ` Daniel Phillips
2019-12-04 15:55     ` Vyacheslav Dubeyko
2019-12-05  9:46       ` Daniel Phillips
2019-12-06 11:47         ` Vyacheslav Dubeyko
2019-12-07  0:46           ` [RFC] Thing 1: Shardmap for Ext4 Daniel Phillips
2019-12-04 18:03     ` [RFC] Thing 1: Shardmap fox Ext4 Andreas Dilger
2019-12-04 20:47       ` Daniel Phillips
2019-12-04 20:53         ` Daniel Phillips
2019-12-05  5:59           ` Daniel Phillips

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).