From: Alexey Lyashkov <alexey.lyashkov@gmail.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger@dilger.ca>,
	Artem Blagodarenko <artem.blagodarenko@gmail.com>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	Yang Sheng <yang.sheng@intel.com>,
	Zhen Liang <liang.zhen@intel.com>,
	Artem Blagodarenko <artem.blagodarenko@seagate.com>
Subject: Re: [PATCH] Add largedir feature
Date: Sat, 18 Mar 2017 20:17:55 +0300
Message-ID: <EA1BCCB1-AEC7-4193-823B-6230B6860C0C@gmail.com>
In-Reply-To: <20170318162953.ubn3lvglxqq6ux2e@thunk.org>


> On 18 March 2017, at 19:29, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Sat, Mar 18, 2017 at 11:16:26AM +0300, Alexey Lyashkov wrote:
>> Andreas,
>> 
>> It is not about a feature flag. It is about the situation as a whole.
>> Yes, we can increase the directory size, but that opens up a number of large problems.
> 
>> 1) readdir. It tries to read all entries into memory before sending them
>> to the user. Currently it may eat 20*10^6 * 256 bytes, so several gigabytes,
>> so increasing the size further may cause problems for the system.
> 
> That's not true.  We normally only read in one block a time.  If there
> is a hash collision, then we may need to insert into the rbtree in a
> subsequent block's worth of dentries to make sure we have all of the
> directory entries corresponding to a particular hash value.  I think
> you misunderstood the code.
As I see it, this is not about hash collisions but about several blocks being merged into the same hash range in the upper-level hash entry.
So if a large hash range was originally assigned to a single block, that whole range will be read into memory in a single step.
With an "aged" directory, where the hash blocks are already in use, this is easy to hit.
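
For scale, the figure from point 1 quoted above can be worked out directly. This is only a rough sketch: it assumes 20*10^6 entries at 256 bytes each, as in the quoted text, and that every entry is held in memory at once, which is the worst case under dispute here and not a description of what the code always does. The function name is hypothetical.

```python
def readdir_worst_case_bytes(entries: int, bytes_per_entry: int) -> int:
    """Memory needed if every directory entry were cached at the same time."""
    return entries * bytes_per_entry


# Figures taken from the quoted point 1: 20*10^6 entries, 256 bytes each.
total = readdir_worst_case_bytes(20 * 10**6, 256)
print(f"{total} bytes = {total / 2**30:.2f} GiB")  # 5120000000 bytes = 4.77 GiB
```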


> 
>> 2) inode allocation. The current code tries to allocate an inode as near as possible to the directory inode, but one GD can hold only 32k entries, so a larger directory will use up 1k GDs for now, and more after that. This increases seek time during file allocation. That is what I meant by "dramatically decrease a file creation rate".
> 
> But there are also only 32k blocks in a group descriptor, and we try
> to keep the blocks allocated close to the inode.
With the bigalloc feature it is not 32k blocks but 32k clusters; with a 1M cluster size (as an example), that is a very large space.
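
The sizes in this exchange can be checked with a little arithmetic. A minimal sketch, assuming the "32k" per-group figure used above for both inodes and blocks/clusters (the real values depend on mkfs options), a 4 KiB block size for the non-bigalloc case (my assumption, not stated in the thread), and the 1M cluster size given as an example. All names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# The "32k" per-group figure from the thread; actual values depend on mkfs options.
UNITS_PER_GROUP = 32 * 1024


def groups_spanned(entries: int, inodes_per_group: int = UNITS_PER_GROUP) -> int:
    """Number of block groups needed to hold one inode per directory entry."""
    return -(-entries // inodes_per_group)  # integer ceiling division


def group_span_bytes(unit_size: int, units_per_group: int = UNITS_PER_GROUP) -> int:
    """Data space covered by one group for a given block or cluster size."""
    return unit_size * units_per_group


print(groups_spanned(20 * 10**6))                # 611 groups for 20*10^6 inodes
print(group_span_bytes(4 * KIB) // MIB, "MiB")   # 128 MiB with 4 KiB blocks (assumed)
print(group_span_bytes(1 * MIB) // GIB, "GiB")   # 32 GiB with 1M clusters
```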


>  So if you are using
> a huge directory, and you are using a storage device with a
> significant seek penalty, then yes, no matter what as the directory
> grows, the time to iterate over all of the files does grow.  But there
> is more to life than microbenchmarks which creates huge numbers of zero
> length files!  If we assume that the files are going to need to
> contain _data_, and the data blocks should be close to the inodes,
> then there are going to be some performance impacts no matter what.
> 
Yes, I expect some seek penalty. But my testing says it is too large now.
The directory creation rate started at 80k creates/s and dropped to 20k-30k creates/s once the hash tree extended to level 3.
In the same test with hard links, the create rate dropped only slightly.


>> 3) the current limit of 4G inodes - currently 32-128 directories can consume the full inode number space. From this perspective, large directories do not need to be used.
> 
> I can imagine a new feature flag which defines the use of a 64-bit inode
> number, but that's more for people who are creating a file system that
> takes advantage of 64-bit block numbers, and they are intending on
> using all of that space to store small (< 4k or < 8k) files.
> 
> And it's also true that there are huge advantages to using a
> multi-level directory hierarchy --- e.g.:
> 
> .git/objects/03/08e42105258d4e53ffeb81ffb2a4b2480bb8b8
> 
> or even
> 
> .git/objects/03/08/e42105258d4e53ffeb81ffb2a4b2480bb8b8
> 
> instead of:
> 
> .git/objects/0308e42105258d4e53ffeb81ffb2a4b2480bb8b8
> 
> but that's an application level question.  If for some reason some
> silly application programmer wants to have a single gargantuan
> directory, if the patches to support it are fairly simple, even if
> someone is going to give us patches to do something more general,
> burning an extra feature flag doesn't seem like the most terrible
> thing in the world.
On the other hand, applications do not expect a directory to be very slow; they expect access at a constant or near-constant speed.
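
The multi-level layout in the quoted examples can be sketched at application level as follows. This is an illustration only: the function name, the two-character fan-out width, and the default root are assumptions chosen to reproduce the three paths quoted above.

```python
def sharded_path(object_id: str, levels: int, width: int = 2,
                 root: str = ".git/objects") -> str:
    """Split the leading characters of object_id into `levels` subdirectories."""
    if levels < 0 or width < 1:
        raise ValueError("levels must be >= 0 and width must be >= 1")
    if len(object_id) <= levels * width:
        raise ValueError("object_id is too short for the requested fan-out")
    # One directory component per level, then the remainder as the file name.
    parts = [object_id[i * width:(i + 1) * width] for i in range(levels)]
    parts.append(object_id[levels * width:])
    return "/".join([root] + parts)


oid = "0308e42105258d4e53ffeb81ffb2a4b2480bb8b8"
for levels in (1, 2, 0):
    print(sharded_path(oid, levels))
```

With `levels` set to 1, 2 and 0, this prints the three paths from the quoted text in the same order; each extra level divides the entries per directory by up to 256 for hexadecimal names.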



> 
> As for the other optimizations --- things like allowing parallel
> directory modifications, or being able to shrink empty directory
> blocks or shorten the tree are all improvements we can make without
> impacting the on-disk format.  So they aren't an argument for halting
> the submission of the new on-disk format, no?
> 
It is an argument about using this feature. Yes, we can land it, but it decreases the expected speed in some cases.


Alexey
