* Filestore directory splitting (ZFS/FreeBSD)
@ 2017-04-18 10:12 Willem Jan Withagen
  2017-04-18 13:27 ` Sage Weil
  0 siblings, 1 reply; 4+ messages in thread
From: Willem Jan Withagen @ 2017-04-18 10:12 UTC
  To: Ceph Development

Hi,

I'm running some larger tests with ceph-fuse on FreeBSD, and noticed
that I'm starting to run into directory splitting, I think.
I just rsynced my full src and ports trees into it, which is more than
1,000,000 small files.
At least I saw a rather large number of directories with relatively few
files in them.

I know from UFS that large directories used to be a problem, although
some of the issues on FreeBSD have been fixed long ago.

So can anybody explain the rationale behind this process and give a bit
of a feeling for why we start splitting at 320 files?

FreeBSD's filestore runs on ZFS, and from what I've seen thus far, ZFS
has different (better) behaviour with large directories.
(I once ended up with > 1,000,000 security cam pictures in one
directory, and that was still sort of workable.)

So has there been any testing to quantify the settings? And how would I
be able to determine whether ZFS deserves better/larger settings?

--WjW


* Re: Filestore directory splitting (ZFS/FreeBSD)
  2017-04-18 10:12 Filestore directory splitting (ZFS/FreeBSD) Willem Jan Withagen
@ 2017-04-18 13:27 ` Sage Weil
  2017-04-18 13:32   ` Willem Jan Withagen
  0 siblings, 1 reply; 4+ messages in thread
From: Sage Weil @ 2017-04-18 13:27 UTC
  To: Willem Jan Withagen; +Cc: Ceph Development

On Tue, 18 Apr 2017, Willem Jan Withagen wrote:
> Hi,
> 
> I'm running some larger tests with ceph-fuse on FreeBSD, and noticed
> that I'm starting to run into directory splitting, I think.
> I just rsynced my full src and ports trees into it, which is more than
> 1,000,000 small files.
> At least I saw a rather large number of directories with relatively few
> files in them.
> 
> I know from UFS that large directories used to be a problem, although
> some of the issues on FreeBSD have been fixed long ago.
> 
> So can anybody explain the rationale behind this process and give a bit
> of a feeling for why we start splitting at 320 files?

It actually has little to do with the underlying file system's ability to 
handle large directories.  Ceph needs to do object enumeration in sorted 
order (based on the [g]hobject_t sort order), while readdir returns 
entries in a semi-random order that depends on how the fs is implemented.  
We keep directories smallish so that we can list the whole directory, 
sort in memory, and then return the correct result.
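
In rough, made-up C++ terms (list_sorted is not a real FileStore
function, just a stand-in for the enumeration path), the pattern is:

    // Read the entire directory, then sort it in memory.  std::sort on
    // the raw names stands in here for the real [g]hobject_t ordering.
    #include <algorithm>
    #include <dirent.h>
    #include <string>
    #include <vector>

    std::vector<std::string> list_sorted(const char *path) {
      std::vector<std::string> names;
      if (DIR *dir = opendir(path)) {
        while (struct dirent *de = readdir(dir))
          names.emplace_back(de->d_name);  // fs-defined, unhelpful order
        closedir(dir);
      }
      std::sort(names.begin(), names.end());  // whole dir sorted at once
      return names;
    }

The sort step is why directory size matters: the whole directory has to
fit through that in-memory sort on every enumeration.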

sage


> FreeBSD's filestore runs on ZFS, and from what I've seen thus far, ZFS
> has different (better) behaviour with large directories.
> (I once ended up with > 1,000,000 security cam pictures in one
> directory, and that was still sort of workable.)
> 
> So has there been any testing to quantify the settings? And how would I
> be able to determine whether ZFS deserves better/larger settings?
> 
> --WjW


* Re: Filestore directory splitting (ZFS/FreeBSD)
  2017-04-18 13:27 ` Sage Weil
@ 2017-04-18 13:32   ` Willem Jan Withagen
  2017-04-18 13:47     ` Mark Nelson
  0 siblings, 1 reply; 4+ messages in thread
From: Willem Jan Withagen @ 2017-04-18 13:32 UTC
  To: Sage Weil; +Cc: Ceph Development

On 18-4-2017 15:27, Sage Weil wrote:
> On Tue, 18 Apr 2017, Willem Jan Withagen wrote:
>> Hi,
>>
>> I'm running some larger tests with ceph-fuse on FreeBSD, and noticed
>> that I'm starting to run into directory splitting, I think.
>> I just rsynced my full src and ports trees into it, which is more than
>> 1,000,000 small files.
>> At least I saw a rather large number of directories with relatively few
>> files in them.
>>
>> I know from UFS that large directories used to be a problem, although
>> some of the issues on FreeBSD have been fixed long ago.
>>
>> So can anybody explain the rationale behind this process and give a bit
>> of a feeling for why we start splitting at 320 files?
> 
> It actually has little to do with the underlying file system's ability to 
> handle large directories.  Ceph needs to do object enumeration in sorted 
> order (based on the [g]hobject_t sort order), while readdir returns 
> entries in a semi-random order that depends on how the fs is implemented.  
> We keep directories smallish so that we can list the whole directory, 
> sort in memory, and then return the correct result.

OK, I see.
So it has more to do with the efficiency of readdir and friends...?

And then the odd question: ;-)

Why would this (still) be so extensively configurable?
I mean, once it has been "sorted out" it could be made more or less
fixed, either by making it a *.h constant or by marking it as such in
the comments in common_opt.h.
And perhaps it could even be removed from the documentation.
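
Something like this is what I have in mind (hypothetical, not an actual
Ceph declaration):

    // hypothetical: freeze the threshold instead of leaving it tunable
    static constexpr unsigned filestore_split_threshold = 320;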

I was also triggered by some of the remarks on the user list that
splitting is an expensive process that has an impact on performance.

--WjW

>> FreeBSD's filestore runs on ZFS, and from what I've seen thus far, ZFS
>> has different (better) behaviour with large directories.
>> (I once ended up with > 1,000,000 security cam pictures in one
>> directory, and that was still sort of workable.)
>>
>> So has there been any testing to quantify the settings? And how would I
>> be able to determine whether ZFS deserves better/larger settings?
>>
>> --WjW



* Re: Filestore directory splitting (ZFS/FreeBSD)
  2017-04-18 13:32   ` Willem Jan Withagen
@ 2017-04-18 13:47     ` Mark Nelson
  0 siblings, 0 replies; 4+ messages in thread
From: Mark Nelson @ 2017-04-18 13:47 UTC
  To: Willem Jan Withagen, Sage Weil; +Cc: Ceph Development



On 04/18/2017 08:32 AM, Willem Jan Withagen wrote:
> On 18-4-2017 15:27, Sage Weil wrote:
>> On Tue, 18 Apr 2017, Willem Jan Withagen wrote:
>>> Hi,
>>>
>>> I'm running some larger tests with ceph-fuse on FreeBSD, and noticed
>>> that I'm starting to run into directory splitting, I think.
>>> I just rsynced my full src and ports trees into it, which is more than
>>> 1,000,000 small files.
>>> At least I saw a rather large number of directories with relatively few
>>> files in them.
>>>
>>> I know from UFS that large directories used to be a problem, although
>>> some of the issues on FreeBSD have been fixed long ago.
>>>
>>> So can anybody explain the rationale behind this process and give a bit
>>> of a feeling for why we start splitting at 320 files?
>>
>> It actually has little to do with the underlying file system's ability to
>> handle large directories.  Ceph needs to do object enumeration in sorted
>> order (based on the [g]hobject_t sort order), while readdir returns
>> entries in a semi-random order that depends on how the fs is implemented.
>> We keep directories smallish so that we can list the whole directory,
>> sort in memory, and then return the correct result.
>
> OK, I see.
> So it has more to do with the efficiency of readdir and friends...?
>
> And then the odd question: ;-)
>
> Why would this (still) be so extensively configurable?
> I mean, once it has been "sorted out" it could be made more or less
> fixed, either by making it a *.h constant or by marking it as such in
> the comments in common_opt.h.
> And perhaps it could even be removed from the documentation.
>
> I was also triggered by some of the remarks on the user list that
> splitting is an expensive process that has an impact on performance.

It's still a bit of an ongoing debate where those split values should 
be tuned, and frankly whether or not splitting should happen at all. 
This was sort of the motivation for trying to "pre-split" directories if 
you think you're going to end up with lots of objects per PG.  There's a 
lot of nuance here regarding directory fragmentation, cached dentries 
and inodes, the effects of a large dentry/inode cache on other things 
(syncfs), and selinux behavior (or similar behavior, i.e. what actions 
trigger xattr security lookups).  If we could make the things that need 
to scan through the directories asynchronous, that would probably let us 
push the directory limits much higher, but that would be a fairly major 
change for filestore, and we are trying to keep it stable until 
bluestore is ready.
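
For reference (assuming I'm remembering the option names and defaults
right), the split point comes from two filestore options:

    [osd]
    # a subdirectory splits once it holds more than
    #   filestore_split_multiple * abs(filestore_merge_threshold) * 16
    # objects; the defaults below give 2 * 10 * 16 = 320, which is the
    # number Willem is seeing
    filestore split multiple = 2
    filestore merge threshold = 10

Raising both pushes the split point up, at the cost of bigger in-memory
sorts during enumeration; a negative merge threshold disables merging,
which is part of how pre-splitting is made to stick.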

Speaking of which, bluestore doesn't suffer from this particular issue, 
so eventually this whole discussion sort of becomes obsolete. We have 
other performance challenges in bluestore that we are still working on 
(mostly metadata/rocksdb related).

Mark

>
> --WjW
>
>>> FreeBSD's filestore runs on ZFS, and from what I've seen thus far, ZFS
>>> has different (better) behaviour with large directories.
>>> (I once ended up with > 1,000,000 security cam pictures in one
>>> directory, and that was still sort of workable.)
>>>
>>> So has there been any testing to quantify the settings? And how would I
>>> be able to determine whether ZFS deserves better/larger settings?
>>>
>>> --WjW


end of thread

Thread overview: 4+ messages
2017-04-18 10:12 Filestore directory splitting (ZFS/FreeBSD) Willem Jan Withagen
2017-04-18 13:27 ` Sage Weil
2017-04-18 13:32   ` Willem Jan Withagen
2017-04-18 13:47     ` Mark Nelson
