Re: [RFC] Thing 1: Shardmap fox Ext4

From: Viacheslav Dubeyko <slava@dubeyko.com>
To: Daniel Phillips <daniel@phunq.net>
Cc: linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, "Theodore Y. Ts'o" <tytso@mit.edu>,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
	"Darrick J. Wong" <djwong@kernel.org>
Subject: Re: [RFC] Thing 1: Shardmap fox Ext4
Date: Wed, 27 Nov 2019 22:35:13 +0300	[thread overview]
Message-ID: <B9F8658C-B88F-44A1-BBEF-98A8259E0712@dubeyko.com> (raw)
In-Reply-To: <5c9b5bd3-028a-5211-30a6-a5a8706b373e@phunq.net>

> On Nov 27, 2019, at 11:28 AM, Daniel Phillips <daniel@phunq.net> wrote:
> 
> On 2019-11-26 11:40 p.m., Vyacheslav Dubeyko wrote:
>> As far as I know, usually, a folder contains dozens or hundreds
>> files/folders in average. There are many research works that had showed
>> this fact. Do you mean some special use-case when folder could contain
>> the billion files? Could you share some research work that describes
>> some practical use-case with billion files per folder?
> 
> You are entirely correct that the vast majority of directories contain
> only a handful of files. That is my case (1). A few directories on a
> typical server can go into the tens of thousands of files. There was
> a time when we could not handle those efficiently, and now thanks to
> HTree we can. Some directories go into the millions, ask the Lustre
> people about that. If you could have a directory with a billion files
> then somebody will have a use for it. For example, you may be able to
> avoid a database for a particular application and just use the file
> system instead.
> 
> Now, scaling to a billion files is just one of several things that
> Shardmap does better than HTree. More immediately, Shardmap implements
> readdir simply, accurately and efficiently, unlike HTree. See here for
> some discussion:
> 
>   https://lwn.net/Articles/544520/
>   "Widening ext4's readdir() cookie"
> 

So, it looks like that Shardmap could be better for the case of billion files in one folder.
But what’s about the regular case when it could be dozens/hundreds of files in one
folder? Will Shardmap be better than HTree? If the ordinary user hasn’t visible
performance improvement then it makes sense to consider Shardmap like the
optional feature. What do you think?

Does it mean that Shardmap is ext4 oriented only? Could it be used for another
file systems?

> See the recommendation that is sometimes offered to work around
> HTree's issues with processing files in hash order. Basically, read
> the entire directory into memory, sort by inode number, then process
> in that order. As an application writer do you really want to do this,
> or would you prefer that the filesystem just take care of for you so
> the normal, simple and readable code is also the most efficient code?
> 

I slightly missed the point here. To read the whole directory sounds like to read
the dentries tree from the volume. As far as I can see, the dentries are ordered
by names or by hashes. But if we are talking about inode number then we mean
the inodes tree. So, I have misunderstanding here. What do you mean?

>> If you are talking about improving the performance then do you mean
>> some special open-source implementation?
> 
> I mean the same kind of kernel filesystem implementation that HTree
> currently has. Open source of course, GPLv2 to be specific.
> 

I meant the Shardmap implementation. As far as I can see, the user-space implementation
is available only now. So, my question is still here. It’s hard to say how efficient the Shardmap
could be on kernel side as ext4 subsystem, for example.

>>> For delete, Shardmap avoids write multiplication by appending tombstone
>>> entries to index shards, thereby addressing a well known HTree delete
>>> performance issue.
>> 
>> Do you mean Copy-On-Write policy here or some special technique?
> 
> The technique Shardmap uses to reduce write amplication under heavy
> load is somewhat similar to the technique used by Google's Bigtable to
> achieve a similar result for data files. (However, the resemblance to
> Bigtable ends there.)
> 
> Each update to a Shardmap index is done twice: once in a highly
> optimized hash table shard in cache, then again by appending an
> entry to the tail of the shard's media "fifo". Media writes are
> therefore mostly linear. I say mostly, because if there is a large
> number of shards then a single commit may need to update the tail
> block of each one, which still adds up to vastly fewer blocks than
> the BTree case, where it is easy to construct cases where every
> index block must be updated on every commit, a nasty example of
> n**2 performance overhead.
> 

It sounds like adding updates in log-structured manner. But what’s about
the obsolete/invalid blocks? Does it mean that it need to use some GC technique
here? I am not completely sure that it could be beneficial for the ext4.

By the way, could the old index blocks be used like the snapshots in the case
of corruptions or other nasty issues?

>> How could be good Shardmap for the SSD use-case? Do you mean that we
>> could reduce write amplification issue for NAND flash case?
> 
> Correct. Reducing write amplification is particularly important for
> flash based storage. It also has a noticeable beneficial effect on
> efficiency under many common and not so common loads.
> 
>> Let's imagine that it needs to implement the Shardmap approach. Could
>> you estimate the implementation and stabilization time? How expensive
>> and long-term efforts could it be?
> 
> Shardmap is already implemented and stable, though it does need wider
> usage and testing. Code is available here:
> 
>   https://github.com/danielbot/Shardmap
> 
> Shardmap needs to be ported to kernel, already planned and in progress
> for Tux3. Now I am proposing that the Ext4 team should consider porting
> Shardmap to Ext4, or at least enter into a serious discussion of the
> logistics.
> 
> Added Darrick to cc, as he is already fairly familiar with this subject,
> once was an Ext4 developer, and perhaps still is should the need arise.
> By the way, is there a reason that Ted's MIT address bounced on my
> original post?
> 

It’s hard to talk about stability because we haven’t kernel-side implementation
of Shardmap for ext4. I suppose that it needs to spend about a year for the porting
and twice more time for the stabilization. To port a user-space code to the kernel
could be the tricky task. Could you estimate how many lines of code the core
part of Shardmap contains? Does it need to change the ext4 on-disk layout for
this feature? How easy ext4 functionality can be modified for Shardmap support?

Thanks,
Viacheslav Dubeyko.