All of lore.kernel.org
 help / color / mirror / Atom feed
* CephFS First product release discussion
@ 2013-03-05 17:03 ` Greg Farnum
  2013-03-05 18:08   ` Wido den Hollander
                     ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Greg Farnum @ 2013-03-05 17:03 UTC (permalink / raw)
  To: ceph-devel, ceph-users

This is a companion discussion to the blog post at http://ceph.com/dev-notes/cephfs-mds-status-discussion/ — go read that!

The short and slightly alternate version: I spent most of about two weeks working on bugs related to snapshots in the MDS, and we started realizing that we could probably do our first supported release of CephFS and the related infrastructure much sooner if we didn't need to support all of the whizbang features. (This isn't to say that the base feature set is stable now, but it's much closer than when you turn on some of the other things.) I'd like to get feedback from you in the community on what minimum supported feature set would prompt or allow you to start using CephFS in real environments — not what you'd *like* to see, but what you *need* to see. This will allow us at Inktank to prioritize more effectively and hopefully get out a supported release much more quickly! :)

The current proposed feature set is basically what's left over after we've trimmed off everything we can think to split off, but if any of the proposed included features are also particularly important or don't matter, be sure to mention them (NFS export in particular — it works right now but isn't in great shape due to NFS filehandle caching).

Thanks,
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com  


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS First product release discussion
  2013-03-05 17:03 ` CephFS First product release discussion Greg Farnum
@ 2013-03-05 18:08   ` Wido den Hollander
  2013-03-05 18:17     ` Greg Farnum
  2013-03-06  5:01   ` [ceph-users] CephFS First product release discussion Neil Levine
       [not found]   ` <E0B1337A572647BA9FCC0CE8CA946F42-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
  2 siblings, 1 reply; 31+ messages in thread
From: Wido den Hollander @ 2013-03-05 18:08 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel

On 03/05/2013 06:03 PM, Greg Farnum wrote:
> This is a companion discussion to the blog post at http://ceph.com/dev-notes/cephfs-mds-status-discussion/ — go read that!
>
> The short and slightly alternate version: I spent most of about two weeks working on bugs related to snapshots in the MDS, and we started realizing that we could probably do our first supported release of CephFS and the related infrastructure much sooner if we didn't need to support all of the whizbang features. (This isn't to say that the base feature set is stable now, but it's much closer than when you turn on some of the other things.) I'd like to get feedback from you in the community on what minimum supported feature set would prompt or allow you to start using CephFS in real environments — not what you'd *like* to see, but what you *need* to see. This will allow us at Inktank to prioritize more effectively and hopefully get out a supported release much more quickly! :)
>
> The current proposed feature set is basically what's left over after we've trimmed off everything we can think to split off, but if any of the proposed included features are also particularly important or don't matter, be sure to mention them (NFS export in particular — it works right now but isn't in great shape due to NFS filehandle caching).
>

Great news! Although RBD and RADOS itself are already great, a lot of 
applications would still require a shared filesystem.

Think about a (Cloud|Open)Stack environment with thousands of instances 
running but also need some form of shared filesystem.

One thing I'm missing though is user-quotas, have they been discussed at 
all and what would the work to implement those involve?

I know it would require a lot more tracking per file so it's not that 
easy and would certainly not make it into a first release, but are they 
on the roadmap at all?

> Thanks,
> -Greg
>
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>

^ awesome title ;)

>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS First product release discussion
  2013-03-05 18:08   ` Wido den Hollander
@ 2013-03-05 18:17     ` Greg Farnum
  2013-03-05 18:28       ` Sage Weil
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Farnum @ 2013-03-05 18:17 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

On Tuesday, March 5, 2013 at 10:08 AM, Wido den Hollander wrote:
> On 03/05/2013 06:03 PM, Greg Farnum wrote:
> > This is a companion discussion to the blog post at http://ceph.com/dev-notes/cephfs-mds-status-discussion/ — go read that!
> >  
> > The short and slightly alternate version: I spent most of about two weeks working on bugs related to snapshots in the MDS, and we started realizing that we could probably do our first supported release of CephFS and the related infrastructure much sooner if we didn't need to support all of the whizbang features. (This isn't to say that the base feature set is stable now, but it's much closer than when you turn on some of the other things.) I'd like to get feedback from you in the community on what minimum supported feature set would prompt or allow you to start using CephFS in real environments — not what you'd *like* to see, but what you *need* to see. This will allow us at Inktank to prioritize more effectively and hopefully get out a supported release much more quickly! :)
> >  
> > The current proposed feature set is basically what's left over after we've trimmed off everything we can think to split off, but if any of the proposed included features are also particularly important or don't matter, be sure to mention them (NFS export in particular — it works right now but isn't in great shape due to NFS filehandle caching).
>  
> Great news! Although RBD and RADOS itself are already great, a lot of  
> applications would still require a shared filesystem.
>  
> Think about a (Cloud|Open)Stack environment with thousands of instances  
> running but also need some form of shared filesystem.
>  
> One thing I'm missing though is user-quotas, have they been discussed at  
> all and what would the work to implement those involve?
>  
> I know it would require a lot more tracking per file so it's not that  
> easy and would certainly not make it into a first release, but are they  
> on the roadmap at all?

Not at present. I think there are some tickets related to this in the tracker as feature requests, but CephFS needs more groundwork about multi-tenancy in general before we can do reasonable planning around a robust user quota feature. (Near-real-time hacks are possible now based around the rstats infrastructure and I believe somebody has built them, though I've never seen them myself.)
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS First product release discussion
  2013-03-05 18:17     ` Greg Farnum
@ 2013-03-05 18:28       ` Sage Weil
  2013-03-05 18:36         ` Wido den Hollander
  0 siblings, 1 reply; 31+ messages in thread
From: Sage Weil @ 2013-03-05 18:28 UTC (permalink / raw)
  To: Greg Farnum; +Cc: Wido den Hollander, ceph-devel

On Tue, 5 Mar 2013, Greg Farnum wrote:
> On Tuesday, March 5, 2013 at 10:08 AM, Wido den Hollander wrote:
> > On 03/05/2013 06:03 PM, Greg Farnum wrote:
> > > This is a companion discussion to the blog post at http://ceph.com/dev-notes/cephfs-mds-status-discussion/ ? go read that!
> > >  
> > > The short and slightly alternate version: I spent most of about two weeks working on bugs related to snapshots in the MDS, and we started realizing that we could probably do our first supported release of CephFS and the related infrastructure much sooner if we didn't need to support all of the whizbang features. (This isn't to say that the base feature set is stable now, but it's much closer than when you turn on some of the other things.) I'd like to get feedback from you in the community on what minimum supported feature set would prompt or allow you to start using CephFS in real environments ? not what you'd *like* to see, but what you *need* to see. This will allow us at Inktank to prioritize more effectively and hopefully get out a supported release much more quickly! :)
> > >  
> > > The current proposed feature set is basically what's left over after we've trimmed off everything we can think to split off, but if any of the proposed included features are also particularly important or don't matter, be sure to mention them (NFS export in particular ? it works right now but isn't in great shape due to NFS filehandle caching).
> >  
> > Great news! Although RBD and RADOS itself are already great, a lot of  
> > applications would still require a shared filesystem.
> >  
> > Think about a (Cloud|Open)Stack environment with thousands of instances  
> > running but also need some form of shared filesystem.
> >  
> > One thing I'm missing though is user-quotas, have they been discussed at  
> > all and what would the work to implement those involve?
> >  
> > I know it would require a lot more tracking per file so it's not that  
> > easy and would certainly not make it into a first release, but are they  
> > on the roadmap at all?
> 
> Not at present. I think there are some tickets related to this in the 
> tracker as feature requests, but CephFS needs more groundwork about 
> multi-tenancy in general before we can do reasonable planning around a 
> robust user quota feature. (Near-real-time hacks are possible now based 
> around the rstats infrastructure and I believe somebody has built them, 
> though I've never seen them myself.)

Wido, by 'user quota' do you mean something that is uid-based, or would 
enforcement on subtree/directory quotas be sufficient for your use cases?  
I've been holding out hope that uid-based usage accounting is a thing of 
the past and that subtrees are sufficient for real users... in which case 
adding enfocement to the existing rstats infrastructure is a very 
manageable task.

sage

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS First product release discussion
  2013-03-05 18:28       ` Sage Weil
@ 2013-03-05 18:36         ` Wido den Hollander
  2013-03-05 18:48           ` Jim Schutt
  2013-03-05 19:33           ` Sage Weil
  0 siblings, 2 replies; 31+ messages in thread
From: Wido den Hollander @ 2013-03-05 18:36 UTC (permalink / raw)
  To: Sage Weil; +Cc: Greg Farnum, ceph-devel

On 03/05/2013 07:28 PM, Sage Weil wrote:
> On Tue, 5 Mar 2013, Greg Farnum wrote:
>> On Tuesday, March 5, 2013 at 10:08 AM, Wido den Hollander wrote:
>>> On 03/05/2013 06:03 PM, Greg Farnum wrote:
>>>> This is a companion discussion to the blog post at http://ceph.com/dev-notes/cephfs-mds-status-discussion/ ? go read that!
>>>>
>>>> The short and slightly alternate version: I spent most of about two weeks working on bugs related to snapshots in the MDS, and we started realizing that we could probably do our first supported release of CephFS and the related infrastructure much sooner if we didn't need to support all of the whizbang features. (This isn't to say that the base feature set is stable now, but it's much closer than when you turn on some of the other things.) I'd like to get feedback from you in the community on what minimum supported feature set would prompt or allow you to start using CephFS in real environments ? not what you'd *like* to see, but what you *need* to see. This will allow us at Inktank to prioritize more effectively and hopefully get out a supported release much more quickly! :)
>>>>
>>>> The current proposed feature set is basically what's left over after we've trimmed off everything we can think to split off, but if any of the proposed included features are also particularly important or don't matter, be sure to mention them (NFS export in particular ? it works right now but isn't in great shape due to NFS filehandle caching).
>>>
>>> Great news! Although RBD and RADOS itself are already great, a lot of
>>> applications would still require a shared filesystem.
>>>
>>> Think about a (Cloud|Open)Stack environment with thousands of instances
>>> running but also need some form of shared filesystem.
>>>
>>> One thing I'm missing though is user-quotas, have they been discussed at
>>> all and what would the work to implement those involve?
>>>
>>> I know it would require a lot more tracking per file so it's not that
>>> easy and would certainly not make it into a first release, but are they
>>> on the roadmap at all?
>>
>> Not at present. I think there are some tickets related to this in the
>> tracker as feature requests, but CephFS needs more groundwork about
>> multi-tenancy in general before we can do reasonable planning around a
>> robust user quota feature. (Near-real-time hacks are possible now based
>> around the rstats infrastructure and I believe somebody has built them,
>> though I've never seen them myself.)
>
> Wido, by 'user quota' do you mean something that is uid-based, or would
> enforcement on subtree/directory quotas be sufficient for your use cases?
> I've been holding out hope that uid-based usage accounting is a thing of
> the past and that subtrees are sufficient for real users... in which case
> adding enfocement to the existing rstats infrastructure is a very
> manageable task.
>

I mean actual uid-based quotas. That still plays nice with shared 
environments like Samba or so where you have all homedirectories on a 
shared filesystems and you set per user quotas. Samba reads out those 
quotas and propagates them to the (Windows) client.

I know this was a problem with ZFS as well. They also said they could do 
"per filesystem quotas" so that would be sufficient, but for example NFS 
doesn't export filesystems mounted in a export, so if you have a bunch 
of homedirectories on the filesystem and you want to account the usage 
of each user it's getting kind of hard.

This could be solved if the clients directly mounted CephFS though.

I'm talking about setups where you have 100k users in a LDAP and they 
all have their data in a single filesystem and you want to track the 
usage of each user, that's not an easy task without uid-based quotas.

Running 'du' on each directory would be much faster with Ceph since it 
accounts tracks the subdirectories and shows their total size with an 
'ls -al'.

Environments with 100k users also tend to be very dynamic with adding 
and removing users all the time, so creating separate filesystems for 
them would be very time consuming.

Now, I'm not talking about enforcing soft or hard quotas, I'm just 
talking about knowing how much space uid X and Y consume on the filesystem.

Wido

> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS First product release discussion
  2013-03-05 18:36         ` Wido den Hollander
@ 2013-03-05 18:48           ` Jim Schutt
  2013-03-05 19:33           ` Sage Weil
  1 sibling, 0 replies; 31+ messages in thread
From: Jim Schutt @ 2013-03-05 18:48 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Sage Weil, Greg Farnum, ceph-devel

On 03/05/2013 11:36 AM, Wido den Hollander wrote:
> 
> Now, I'm not talking about enforcing soft or hard quotas, I'm just
> talking about knowing how much space uid X and Y consume on the
> filesystem.

FWIW, we'd like this capability for our HPC systems - we need
to be able to disable scheduling of new jobs for users that
are consuming too much storage....

-- Jim

> 
> Wido



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS First product release discussion
  2013-03-05 18:36         ` Wido den Hollander
  2013-03-05 18:48           ` Jim Schutt
@ 2013-03-05 19:33           ` Sage Weil
  2013-03-06 17:24             ` Wido den Hollander
  2013-03-06 19:07             ` Jim Schutt
  1 sibling, 2 replies; 31+ messages in thread
From: Sage Weil @ 2013-03-05 19:33 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Greg Farnum, ceph-devel

On Tue, 5 Mar 2013, Wido den Hollander wrote:
> > Wido, by 'user quota' do you mean something that is uid-based, or would
> > enforcement on subtree/directory quotas be sufficient for your use cases?
> > I've been holding out hope that uid-based usage accounting is a thing of
> > the past and that subtrees are sufficient for real users... in which case
> > adding enfocement to the existing rstats infrastructure is a very
> > manageable task.
> > 
> 
> I mean actual uid-based quotas. That still plays nice with shared environments
> like Samba or so where you have all homedirectories on a shared filesystems
> and you set per user quotas. Samba reads out those quotas and propagates them
> to the (Windows) client.

Does samba propagate the quota information (how much space is 
used/available) or do enforcement on the client side?  (Is client 
enforcement even necessary/useful if the backend will stop writes when the 
quota is exceeded?)

> I know this was a problem with ZFS as well. They also said they could do "per
> filesystem quotas" so that would be sufficient, but for example NFS doesn't
> export filesystems mounted in a export, so if you have a bunch of
> homedirectories on the filesystem and you want to account the usage of each
> user it's getting kind of hard.
> 
> This could be solved if the clients directly mounted CephFS though.
> 
> I'm talking about setups where you have 100k users in a LDAP and they all have
> their data in a single filesystem and you want to track the usage of each
> user, that's not an easy task without uid-based quotas.

Wouldn't each user live in a sub- or home directory?  If so, it seems like 
the existing rstats would be sufficient to do the accounting piece; only 
enforcement is missing.

> Running 'du' on each directory would be much faster with Ceph since it
> accounts tracks the subdirectories and shows their total size with an 'ls
> -al'.
> 
> Environments with 100k users also tend to be very dynamic with adding and
> removing users all the time, so creating separate filesystems for them would
> be very time consuming.
> 
> Now, I'm not talking about enforcing soft or hard quotas, I'm just talking
> about knowing how much space uid X and Y consume on the filesystem.

The part I'm most unclear on is what use cases people have where uid X and 
Y are spread around the file system (not in a single or small set of sub 
directories) and per-user (not, say, per-project) quotas are still 
necessary.  In most environments, users get their own home directory and 
everything lives there...

sage


> 
> Wido
> 
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> 
> -- 
> Wido den Hollander
> 42on B.V.
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> 
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [ceph-users] CephFS First product release discussion
  2013-03-05 17:03 ` CephFS First product release discussion Greg Farnum
  2013-03-05 18:08   ` Wido den Hollander
@ 2013-03-06  5:01   ` Neil Levine
       [not found]     ` <CANygib-U_MQi1TMmQuT_Q9MVwPfT+PzJwN=+BMcBK69WuRfu3w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]   ` <E0B1337A572647BA9FCC0CE8CA946F42-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
  2 siblings, 1 reply; 31+ messages in thread
From: Neil Levine @ 2013-03-06  5:01 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel, ceph-users

As an extra request, it would be great if people explained a little
about their use-case for the filesystem so we can better understand
how the features requested map to the type of workloads people are
trying.

Thanks

Neil

On Tue, Mar 5, 2013 at 9:03 AM, Greg Farnum <greg@inktank.com> wrote:
> This is a companion discussion to the blog post at http://ceph.com/dev-notes/cephfs-mds-status-discussion/ — go read that!
>
> The short and slightly alternate version: I spent most of about two weeks working on bugs related to snapshots in the MDS, and we started realizing that we could probably do our first supported release of CephFS and the related infrastructure much sooner if we didn't need to support all of the whizbang features. (This isn't to say that the base feature set is stable now, but it's much closer than when you turn on some of the other things.) I'd like to get feedback from you in the community on what minimum supported feature set would prompt or allow you to start using CephFS in real environments — not what you'd *like* to see, but what you *need* to see. This will allow us at Inktank to prioritize more effectively and hopefully get out a supported release much more quickly! :)
>
> The current proposed feature set is basically what's left over after we've trimmed off everything we can think to split off, but if any of the proposed included features are also particularly important or don't matter, be sure to mention them (NFS export in particular — it works right now but isn't in great shape due to NFS filehandle caching).
>
> Thanks,
> -Greg
>
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS First product release discussion
  2013-03-05 19:33           ` Sage Weil
@ 2013-03-06 17:24             ` Wido den Hollander
  2013-03-06 19:07             ` Jim Schutt
  1 sibling, 0 replies; 31+ messages in thread
From: Wido den Hollander @ 2013-03-06 17:24 UTC (permalink / raw)
  To: Sage Weil; +Cc: Greg Farnum, ceph-devel

On 03/05/2013 08:33 PM, Sage Weil wrote:
> On Tue, 5 Mar 2013, Wido den Hollander wrote:
>>> Wido, by 'user quota' do you mean something that is uid-based, or would
>>> enforcement on subtree/directory quotas be sufficient for your use cases?
>>> I've been holding out hope that uid-based usage accounting is a thing of
>>> the past and that subtrees are sufficient for real users... in which case
>>> adding enfocement to the existing rstats infrastructure is a very
>>> manageable task.
>>>
>>
>> I mean actual uid-based quotas. That still plays nice with shared environments
>> like Samba or so where you have all homedirectories on a shared filesystems
>> and you set per user quotas. Samba reads out those quotas and propagates them
>> to the (Windows) client.
>
> Does samba propagate the quota information (how much space is
> used/available) or do enforcement on the client side?  (Is client
> enforcement even necessary/useful if the backend will stop writes when the
> quota is exceeded?)
>

I'm not sure. It will at least tell the user how much he/she is using on 
that volume and what the quota is. Not sure who enforces, Samba or the 
filesystem.

 From a quick Google it seems like the filesystem has to enforce the 
quota, Samba doesn't.

>> I know this was a problem with ZFS as well. They also said they could do "per
>> filesystem quotas" so that would be sufficient, but for example NFS doesn't
>> export filesystems mounted in a export, so if you have a bunch of
>> homedirectories on the filesystem and you want to account the usage of each
>> user it's getting kind of hard.
>>
>> This could be solved if the clients directly mounted CephFS though.
>>
>> I'm talking about setups where you have 100k users in a LDAP and they all have
>> their data in a single filesystem and you want to track the usage of each
>> user, that's not an easy task without uid-based quotas.
>
> Wouldn't each user live in a sub- or home directory?  If so, it seems like
> the existing rstats would be sufficient to do the accounting piece; only
> enforcement is missing.
>
>> Running 'du' on each directory would be much faster with Ceph since it
>> accounts tracks the subdirectories and shows their total size with an 'ls
>> -al'.
>>
>> Environments with 100k users also tend to be very dynamic with adding and
>> removing users all the time, so creating separate filesystems for them would
>> be very time consuming.
>>
>> Now, I'm not talking about enforcing soft or hard quotas, I'm just talking
>> about knowing how much space uid X and Y consume on the filesystem.
>
> The part I'm most unclear on is what use cases people have where uid X and
> Y are spread around the file system (not in a single or small set of sub
> directories) and per-user (not, say, per-project) quotas are still
> necessary.  In most environments, users get their own home directory and
> everything lives there...
>

I see a POSIX-filesystem as being partially legacy and a part of that 
legacy is user quotas.

If you want existing applications who rely on userquotas to seamlessly 
switch from NFS to CephFS they will need this to work.

We only talked about userquotas, but groupquotas are just as important.

If you have 10 users where 5 of them are in the group "webdev" and you 
wan't to know how much space is being used by the group "webdev" you 
want to probe the group quotas and you are done.

In some setups like we have users have data in different directories 
outside their home directories / NFS exports. On one machine you just 
run "quota -u <uid>" and you know how much user X is using spread out 
over all the filesystems.

With rstats you would be able to achieve the same with some scripting, 
but that doesn't make the migration seamless.

Wido

> sage
>
>
>>
>> Wido
>>
>>> sage
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS First product release discussion
  2013-03-05 19:33           ` Sage Weil
  2013-03-06 17:24             ` Wido den Hollander
@ 2013-03-06 19:07             ` Jim Schutt
  2013-03-06 19:13               ` CephFS Space Accounting and Quotas (was: CephFS First product release discussion) Greg Farnum
  1 sibling, 1 reply; 31+ messages in thread
From: Jim Schutt @ 2013-03-06 19:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: Wido den Hollander, Greg Farnum, ceph-devel

On 03/05/2013 12:33 PM, Sage Weil wrote:
>> > Running 'du' on each directory would be much faster with Ceph since it
>> > accounts tracks the subdirectories and shows their total size with an 'ls
>> > -al'.
>> > 
>> > Environments with 100k users also tend to be very dynamic with adding and
>> > removing users all the time, so creating separate filesystems for them would
>> > be very time consuming.
>> > 
>> > Now, I'm not talking about enforcing soft or hard quotas, I'm just talking
>> > about knowing how much space uid X and Y consume on the filesystem.
> The part I'm most unclear on is what use cases people have where uid X and 
> Y are spread around the file system (not in a single or small set of sub 
> directories) and per-user (not, say, per-project) quotas are still 
> necessary.  In most environments, users get their own home directory and 
> everything lives there...

Hmmm, is there a tool I should be using that will return the space
used by a directory, and all its descendants?

If it's 'du', that tool is definitely not fast for me.

I'm doing an 'strace du -s <path>', where <path> has one
subdirectory which contains ~600 files.  I've got ~200 clients
mounting the file system, and each client wrote 3 files in that
directory.

I'm doing the 'du' from one of those nodes, and the strace is showing
me du is doing a 'newfstat' for each file.  For each file that was
written on a different client from where du is running, that 'newfstat'
takes tens of seconds to return.  Which means my 'du' has been running
for quite some time and hasn't finished yet....

I'm hoping there's another tool I'm supposed to be using that I
don't know about yet.  Our use case includes tens of millions
of files written from thousands of clients, and whatever tool
we use to do space accounting needs to not walk an entire directory
tree, checking each file.

-- Jim


> 
> sage
> 
> 
>> > 
>> > Wido



^ permalink raw reply	[flat|nested] 31+ messages in thread

* CephFS Space Accounting and Quotas (was: CephFS First product release discussion)
  2013-03-06 19:07             ` Jim Schutt
@ 2013-03-06 19:13               ` Greg Farnum
  2013-03-06 19:58                 ` CephFS Space Accounting and Quotas Jim Schutt
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Farnum @ 2013-03-06 19:13 UTC (permalink / raw)
  To: ceph-devel, Jim Schutt; +Cc: Sage Weil, Wido den Hollander

On Wednesday, March 6, 2013 at 11:07 AM, Jim Schutt wrote:
> On 03/05/2013 12:33 PM, Sage Weil wrote:
> > > > Running 'du' on each directory would be much faster with Ceph since it
> > > > accounts tracks the subdirectories and shows their total size with an 'ls
> > > > -al'.
> > > >  
> > > > Environments with 100k users also tend to be very dynamic with adding and
> > > > removing users all the time, so creating separate filesystems for them would
> > > > be very time consuming.
> > > >  
> > > > Now, I'm not talking about enforcing soft or hard quotas, I'm just talking
> > > > about knowing how much space uid X and Y consume on the filesystem.
> > >  
> >  
> >  
> > The part I'm most unclear on is what use cases people have where uid X and  
> > Y are spread around the file system (not in a single or small set of sub  
> > directories) and per-user (not, say, per-project) quotas are still  
> > necessary. In most environments, users get their own home directory and  
> > everything lives there...
>  
>  
>  
> Hmmm, is there a tool I should be using that will return the space
> used by a directory, and all its descendants?
>  
> If it's 'du', that tool is definitely not fast for me.
>  
> I'm doing an 'strace du -s <path>', where <path> has one
> subdirectory which contains ~600 files. I've got ~200 clients
> mounting the file system, and each client wrote 3 files in that
> directory.
>  
> I'm doing the 'du' from one of those nodes, and the strace is showing
> me du is doing a 'newfstat' for each file. For each file that was
> written on a different client from where du is running, that 'newfstat'
> takes tens of seconds to return. Which means my 'du' has been running
> for quite some time and hasn't finished yet....
>  
> I'm hoping there's another tool I'm supposed to be using that I
> don't know about yet. Our use case includes tens of millions
> of files written from thousands of clients, and whatever tool
> we use to do space accounting needs to not walk an entire directory
> tree, checking each file.

Check out the directory sizes with ls -l or whatever — those numbers are semantically meaningful! :)

Unfortunately we can't (currently) use those "recursive statistics" to do proper hard quotas on subdirectories as they're lazily propagated following client ops, not as part of the updates. (Lazily in the technical sense — it's actually quite fast in general). But they'd work fine for soft quotas if somebody wrote the code, or to block writes on a slight time lag.
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-06 19:13               ` CephFS Space Accounting and Quotas (was: CephFS First product release discussion) Greg Farnum
@ 2013-03-06 19:58                 ` Jim Schutt
  2013-03-06 20:21                   ` Greg Farnum
  0 siblings, 1 reply; 31+ messages in thread
From: Jim Schutt @ 2013-03-06 19:58 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On 03/06/2013 12:13 PM, Greg Farnum wrote:
> On Wednesday, March 6, 2013 at 11:07 AM, Jim Schutt wrote:
>> On 03/05/2013 12:33 PM, Sage Weil wrote:
>>>>> Running 'du' on each directory would be much faster with Ceph since it
>>>>> accounts tracks the subdirectories and shows their total size with an 'ls
>>>>> -al'.
>>>>>  
>>>>> Environments with 100k users also tend to be very dynamic with adding and
>>>>> removing users all the time, so creating separate filesystems for them would
>>>>> be very time consuming.
>>>>>  
>>>>> Now, I'm not talking about enforcing soft or hard quotas, I'm just talking
>>>>> about knowing how much space uid X and Y consume on the filesystem.
>>>>  
>>>  
>>>  
>>> The part I'm most unclear on is what use cases people have where uid X and  
>>> Y are spread around the file system (not in a single or small set of sub  
>>> directories) and per-user (not, say, per-project) quotas are still  
>>> necessary. In most environments, users get their own home directory and  
>>> everything lives there...
>>  
>>  
>>  
>> Hmmm, is there a tool I should be using that will return the space
>> used by a directory, and all its descendants?
>>  
>> If it's 'du', that tool is definitely not fast for me.
>>  
>> I'm doing an 'strace du -s <path>', where <path> has one
>> subdirectory which contains ~600 files. I've got ~200 clients
>> mounting the file system, and each client wrote 3 files in that
>> directory.
>>  
>> I'm doing the 'du' from one of those nodes, and the strace is showing
>> me du is doing a 'newfstat' for each file. For each file that was
>> written on a different client from where du is running, that 'newfstat'
>> takes tens of seconds to return. Which means my 'du' has been running
>> for quite some time and hasn't finished yet....
>>  
>> I'm hoping there's another tool I'm supposed to be using that I
>> don't know about yet. Our use case includes tens of millions
>> of files written from thousands of clients, and whatever tool
>> we use to do space accounting needs to not walk an entire directory
>> tree, checking each file.
> 
> Check out the directory sizes with ls -l or whatever — those numbers are semantically meaningful! :)

That is just exceptionally cool!

> 
> Unfortunately we can't (currently) use those "recursive statistics"
> to do proper hard quotas on subdirectories as they're lazily
> propagated following client ops, not as part of the updates. (Lazily
> in the technical sense — it's actually quite fast in general). But
> they'd work fine for soft quotas if somebody wrote the code, or to
> block writes on a slight time lag.

'ls -lh <dir>' seems to be just the thing if you already know <dir>.

And it's perfectly suitable for our use case of not scheduling
new jobs for users consuming too much space.

I was thinking I might need to find a subtree where all the
subdirectories are owned by the same user, on the theory that
all the files in such a subtree would be owned by that same
user.  E.g., we might want such a capability to manage space per
user in shared project directories.

So, I tried 'find <dir> -type d -exec ls -lhd {} \;'

Unfortunately, that ended up doing a 'newfstatat' on each file
under <dir>, evidently to learn if it was a directory.  The
result was that same slowdown for files written on other clients.

Is there some other way I should be looking for directories if I
don't already know what they are?

Also, this issue of stat on files created on other clients seems
like it's going to be problematic for many interactions our users
will have with the files created by their parallel compute jobs -
any suggestion on how to avoid or fix it?

Thanks!

-- Jim

> -Greg
> 
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-06 19:58                 ` CephFS Space Accounting and Quotas Jim Schutt
@ 2013-03-06 20:21                   ` Greg Farnum
  2013-03-06 21:28                     ` Jim Schutt
  2013-03-06 21:42                     ` Sage Weil
  0 siblings, 2 replies; 31+ messages in thread
From: Greg Farnum @ 2013-03-06 20:21 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On Wednesday, March 6, 2013 at 11:58 AM, Jim Schutt wrote:
> On 03/06/2013 12:13 PM, Greg Farnum wrote:
> > Check out the directory sizes with ls -l or whatever — those numbers are semantically meaningful! :)
>  
>  
> That is just exceptionally cool!
>  
> >  
> > Unfortunately we can't (currently) use those "recursive statistics"
> > to do proper hard quotas on subdirectories as they're lazily
> > propagated following client ops, not as part of the updates. (Lazily
> > in the technical sense — it's actually quite fast in general). But
> > they'd work fine for soft quotas if somebody wrote the code, or to
> > block writes on a slight time lag.
>  
>  
>  
> 'ls -lh <dir>' seems to be just the thing if you already know <dir>.
>  
> And it's perfectly suitable for our use case of not scheduling
> new jobs for users consuming too much space.
>  
> I was thinking I might need to find a subtree where all the
> subdirectories are owned by the same user, on the theory that
> all the files in such a subtree would be owned by that same
> user. E.g., we might want such a capability to manage space per
> user in shared project directories.
>  
> So, I tried 'find <dir> -type d -exec ls -lhd {} \;'
>  
> Unfortunately, that ended up doing a 'newfstatat' on each file
> under <dir>, evidently to learn if it was a directory. The
> result was that same slowdown for files written on other clients.
>  
> Is there some other way I should be looking for directories if I
> don't already know what they are?
>  
> Also, this issue of stat on files created on other clients seems
> like it's going to be problematic for many interactions our users
> will have with the files created by their parallel compute jobs -
> any suggestion on how to avoid or fix it?
>  

Brief background: stat is required to provide file size information, and so when you do a stat Ceph needs to find out the actual file size. If the file is currently in use by somebody, that requires gathering up the latest metadata from them.
Separately, while Ceph allows a client and the MDS to proceed with a bunch of operations (ie, mknod) without having it go to disk first, it requires anything which is visible to a third party (another client) be durable on disk for consistency reasons.

These combine to mean that if you do a stat on a file which a client currently has buffered writes for, that buffer must be flushed out to disk before the stat can return. This is the usual cause of the slow stats you're seeing. You should be able to adjust dirty data thresholds to encourage faster writeouts, do fsyncs once a client is done with a file, etc in order to minimize the likelihood of running into this.
Also, I'd have to check but I believe opening a file with LAZY_IO or whatever will weaken those requirements — it's probably not the solution you'd like here but it's an option, and if this turns out to be a serious issue then config options to reduce consistency on certain operations are likely to make their way into the roadmap. :)
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-06 20:21                   ` Greg Farnum
@ 2013-03-06 21:28                     ` Jim Schutt
  2013-03-06 21:39                       ` Greg Farnum
  2013-03-06 21:42                     ` Sage Weil
  1 sibling, 1 reply; 31+ messages in thread
From: Jim Schutt @ 2013-03-06 21:28 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On 03/06/2013 01:21 PM, Greg Farnum wrote:
>> > Also, this issue of stat on files created on other clients seems
>> > like it's going to be problematic for many interactions our users
>> > will have with the files created by their parallel compute jobs -
>> > any suggestion on how to avoid or fix it?
>> >  
> Brief background: stat is required to provide file size information,
> and so when you do a stat Ceph needs to find out the actual file
> size. If the file is currently in use by somebody, that requires
> gathering up the latest metadata from them. Separately, while Ceph
> allows a client and the MDS to proceed with a bunch of operations
> (ie, mknod) without having it go to disk first, it requires anything
> which is visible to a third party (another client) be durable on disk
> for consistency reasons.
> 
> These combine to mean that if you do a stat on a file which a client
> currently has buffered writes for, that buffer must be flushed out to
> disk before the stat can return. This is the usual cause of the slow
> stats you're seeing. You should be able to adjust dirty data
> thresholds to encourage faster writeouts, do fsyncs once a client is
> done with a file, etc in order to minimize the likelihood of running
> into this. Also, I'd have to check but I believe opening a file with
> LAZY_IO or whatever will weaken those requirements — it's probably
> not the solution you'd like here but it's an option, and if this
> turns out to be a serious issue then config options to reduce
> consistency on certain operations are likely to make their way into
> the roadmap. :)

That all makes sense.

But, it turns out the files in question were written yesterday,
and I did the stat operations today.

So, shouldn't the dirty buffer issue not be in play here?

Is there anything else that might be going on?

Thanks -- Jim

> -Greg
> 
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-06 21:28                     ` Jim Schutt
@ 2013-03-06 21:39                       ` Greg Farnum
  2013-03-06 23:14                         ` Jim Schutt
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Farnum @ 2013-03-06 21:39 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On Wednesday, March 6, 2013 at 1:28 PM, Jim Schutt wrote:
> On 03/06/2013 01:21 PM, Greg Farnum wrote:
> > > > Also, this issue of stat on files created on other clients seems
> > > > like it's going to be problematic for many interactions our users
> > > > will have with the files created by their parallel compute jobs -
> > > > any suggestion on how to avoid or fix it?
> > >  
> >  
> >  
> > Brief background: stat is required to provide file size information,
> > and so when you do a stat Ceph needs to find out the actual file
> > size. If the file is currently in use by somebody, that requires
> > gathering up the latest metadata from them. Separately, while Ceph
> > allows a client and the MDS to proceed with a bunch of operations
> > (ie, mknod) without having it go to disk first, it requires anything
> > which is visible to a third party (another client) be durable on disk
> > for consistency reasons.
> >  
> > These combine to mean that if you do a stat on a file which a client
> > currently has buffered writes for, that buffer must be flushed out to
> > disk before the stat can return. This is the usual cause of the slow
> > stats you're seeing. You should be able to adjust dirty data
> > thresholds to encourage faster writeouts, do fsyncs once a client is
> > done with a file, etc in order to minimize the likelihood of running
> > into this. Also, I'd have to check but I believe opening a file with
> > LAZY_IO or whatever will weaken those requirements — it's probably
> > not the solution you'd like here but it's an option, and if this
> > turns out to be a serious issue then config options to reduce
> > consistency on certain operations are likely to make their way into
> > the roadmap. :)
>  
>  
>  
> That all makes sense.
>  
> But, it turns out the files in question were written yesterday,
> and I did the stat operations today.
>  
> So, shouldn't the dirty buffer issue not be in play here?
Probably not. :/


> Is there anything else that might be going on?
In that case it sounds like either there's a slowdown on disk access that is propagating up the chain very bizarrely, there's a serious performance issue on the MDS (ie, swapping for everything), or the clients are still holding onto capabilities for the files in question and you're running into some issues with the capability revocation mechanisms.
Can you describe your setup a bit more? What versions are you running, kernel or userspace clients, etc. What config options are you setting on the MDS? Assuming you're on something semi-recent, getting a perfcounter dump from the MDS might be illuminating as well.

We'll probably want to get a high-debug log of the MDS during these slow stats as well.
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-06 20:21                   ` Greg Farnum
  2013-03-06 21:28                     ` Jim Schutt
@ 2013-03-06 21:42                     ` Sage Weil
  1 sibling, 0 replies; 31+ messages in thread
From: Sage Weil @ 2013-03-06 21:42 UTC (permalink / raw)
  To: Greg Farnum; +Cc: Jim Schutt, ceph-devel, Wido den Hollander

On Wed, 6 Mar 2013, Greg Farnum wrote:
> > 'ls -lh <dir>' seems to be just the thing if you already know <dir>.
> >  
> > And it's perfectly suitable for our use case of not scheduling
> > new jobs for users consuming too much space.
> >  
> > I was thinking I might need to find a subtree where all the
> > subdirectories are owned by the same user, on the theory that
> > all the files in such a subtree would be owned by that same
> > user. E.g., we might want such a capability to manage space per
> > user in shared project directories.
> >  
> > So, I tried 'find <dir> -type d -exec ls -lhd {} \;'
> >  
> > Unfortunately, that ended up doing a 'newfstatat' on each file
> > under <dir>, evidently to learn if it was a directory. The
> > result was that same slowdown for files written on other clients.
> >  
> > Is there some other way I should be looking for directories if I
> > don't already know what they are?

Normally the readdir result as the d_type field filled in to indicate 
whether the dentry is a directory or not, which makes the stat 
unnecessary.  I'm surprised that find isn't doing that properly already!  
It's possible we aren't populating a field we should be in our readdir 
code...

> > Also, this issue of stat on files created on other clients seems
> > like it's going to be problematic for many interactions our users
> > will have with the files created by their parallel compute jobs -
> > any suggestion on how to avoid or fix it?
> >  
> 
> Brief background: stat is required to provide file size information, and 
> so when you do a stat Ceph needs to find out the actual file size. If 
> the file is currently in use by somebody, that requires gathering up the 
> latest metadata from them. Separately, while Ceph allows a client and 
> the MDS to proceed with a bunch of operations (ie, mknod) without having 
> it go to disk first, it requires anything which is visible to a third 
> party (another client) be durable on disk for consistency reasons.
> 
> These combine to mean that if you do a stat on a file which a client 
> currently has buffered writes for, that buffer must be flushed out to 
> disk before the stat can return. This is the usual cause of the slow 
> stats you're seeing. You should be able to adjust dirty data thresholds 
> to encourage faster writeouts, do fsyncs once a client is done with a 
> file, etc in order to minimize the likelihood of running into this.

This is the current behavior.  There is a bug in the tracker to introduce 
a new lock state to optimize the stat case so that writers are paused but 
buffers aren't flushed.  It hasn't been prioritized, but is not terribly 
complex.

sage

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-06 21:39                       ` Greg Farnum
@ 2013-03-06 23:14                         ` Jim Schutt
  2013-03-07  0:18                           ` Greg Farnum
  0 siblings, 1 reply; 31+ messages in thread
From: Jim Schutt @ 2013-03-06 23:14 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On 03/06/2013 02:39 PM, Greg Farnum wrote:
> On Wednesday, March 6, 2013 at 1:28 PM, Jim Schutt wrote:
>> On 03/06/2013 01:21 PM, Greg Farnum wrote:
>>>>> Also, this issue of stat on files created on other clients seems
>>>>> like it's going to be problematic for many interactions our users
>>>>> will have with the files created by their parallel compute jobs -
>>>>> any suggestion on how to avoid or fix it?
>>>>  
>>>  
>>>  
>>> Brief background: stat is required to provide file size information,
>>> and so when you do a stat Ceph needs to find out the actual file
>>> size. If the file is currently in use by somebody, that requires
>>> gathering up the latest metadata from them. Separately, while Ceph
>>> allows a client and the MDS to proceed with a bunch of operations
>>> (ie, mknod) without having it go to disk first, it requires anything
>>> which is visible to a third party (another client) be durable on disk
>>> for consistency reasons.
>>>  
>>> These combine to mean that if you do a stat on a file which a client
>>> currently has buffered writes for, that buffer must be flushed out to
>>> disk before the stat can return. This is the usual cause of the slow
>>> stats you're seeing. You should be able to adjust dirty data
>>> thresholds to encourage faster writeouts, do fsyncs once a client is
>>> done with a file, etc in order to minimize the likelihood of running
>>> into this. Also, I'd have to check but I believe opening a file with
>>> LAZY_IO or whatever will weaken those requirements — it's probably
>>> not the solution you'd like here but it's an option, and if this
>>> turns out to be a serious issue then config options to reduce
>>> consistency on certain operations are likely to make their way into
>>> the roadmap. :)
>>  
>>  
>>  
>> That all makes sense.
>>  
>> But, it turns out the files in question were written yesterday,
>> and I did the stat operations today.
>>  
>> So, shouldn't the dirty buffer issue not be in play here?
> Probably not. :/
> 
> 
>> Is there anything else that might be going on?
> In that case it sounds like either there's a slowdown on disk access
> that is propagating up the chain very bizarrely, there's a serious
> performance issue on the MDS (ie, swapping for everything), or the
> clients are still holding onto capabilities for the files in question
> and you're running into some issues with the capability revocation
> mechanisms.
> Can you describe your setup a bit more? What versions are you
> running, kernel or userspace clients, etc. What config options are
> you setting on the MDS? Assuming you're on something semi-recent,
> getting a perfcounter dump from the MDS might be illuminating as
> well.

When I'm doing these stat operations the file system is otherwise
idle.

What is happening is that once one of these slow stat operations
on a file completes, it never happens again for that file, from
any client.  At least, that's the case if I'm not writing to
the file any more.  I haven't checked if appending to the files
restarts the behavior.

On the client side I'm running with 3.8.2 + the ceph patch queue
that was merged into 3.9-rc1.

On the server side I'm running recent next branch (commit 0f42eddef5),
with the tcp receive socket buffer option patches cherry-picked.
I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num
rather than pg_bits to set initial number of PGs (same for pgp_num),
and a patch that lets me run with just one pool that contains both
data and metadata.  I'm testing data distribution uniformity with 512K PGs.

My MDS tunables are all at default settings.

> 
> We'll probably want to get a high-debug log of the MDS during these slow stats as well.

OK.

Do you want me to try to reproduce with a more standard setup?

Also,  I see Sage just pushed a patch to pgid decoding - I expect
I need that as well, if I'm running the latest client code.

Do you want the MDS log at 10 or 20?

-- Jim


> -Greg
> 
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-06 23:14                         ` Jim Schutt
@ 2013-03-07  0:18                           ` Greg Farnum
  2013-03-07 15:15                             ` Jim Schutt
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Farnum @ 2013-03-07  0:18 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
> When I'm doing these stat operations the file system is otherwise
> idle.

What's the cluster look like? This is just one active MDS and a couple hundred clients?

> What is happening is that once one of these slow stat operations
> on a file completes, it never happens again for that file, from
> any client. At least, that's the case if I'm not writing to
> the file any more. I haven't checked if appending to the files
> restarts the behavior.

I assume it'll come back, but if you could verify that'd be good.

 
> On the client side I'm running with 3.8.2 + the ceph patch queue
> that was merged into 3.9-rc1.
> 
> On the server side I'm running recent next branch (commit 0f42eddef5),
> with the tcp receive socket buffer option patches cherry-picked.
> I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num
> rather than pg_bits to set initial number of PGs (same for pgp_num),
> and a patch that lets me run with just one pool that contains both
> data and metadata. I'm testing data distribution uniformity with 512K PGs.
> 
> My MDS tunables are all at default settings.
> 
> > 
> > We'll probably want to get a high-debug log of the MDS during these slow stats as well.
> 
> OK.
> 
> Do you want me to try to reproduce with a more standard setup?
No, this is fine. 
 
> Also, I see Sage just pushed a patch to pgid decoding - I expect
> I need that as well, if I'm running the latest client code.

Yeah, if you've got the commit it references you'll want it.

> Do you want the MDS log at 10 or 20?
More is better. ;)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS First product release discussion
       [not found]   ` <E0B1337A572647BA9FCC0CE8CA946F42-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
@ 2013-03-07 11:54     ` Jimmy Tang
  0 siblings, 0 replies; 31+ messages in thread
From: Jimmy Tang @ 2013-03-07 11:54 UTC (permalink / raw)
  To: Greg Farnum
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-idqoXFIVOFJgJs9I8MT0rw


On 5 Mar 2013, at 17:03, Greg Farnum wrote:

> This is a companion discussion to the blog post at http://ceph.com/dev-notes/cephfs-mds-status-discussion/ — go read that!
> 
> The short and slightly alternate version: I spent most of about two weeks working on bugs related to snapshots in the MDS, and we started realizing that we could probably do our first supported release of CephFS and the related infrastructure much sooner if we didn't need to support all of the whizbang features. (This isn't to say that the base feature set is stable now, but it's much closer than when you turn on some of the other things.) I'd like to get feedback from you in the community on what minimum supported feature set would prompt or allow you to start using CephFS in real environments — not what you'd *like* to see, but what you *need* to see. This will allow us at Inktank to prioritize more effectively and hopefully get out a supported release much more quickly! :)
> 
> The current proposed feature set is basically what's left over after we've trimmed off everything we can think to split off, but if any of the proposed included features are also particularly important or don't matter, be sure to mention them (NFS export in particular — it works right now but isn't in great shape due to NFS filehandle caching).
> 

fsck would be desirable, even if its just something that tells me that something is 'corrupted' or 'dangling' would be useful. quotas on sub-tree's like how the du feature is currently implemented would be nice. 

some sort of a smarter exporting of sub-tree's that would nice too e.g. if i mounted /ceph/fileset_1 as /myfs1 on a client, I'd like the /myfs1 to report 100gb when i run df instead of 100tb which is the entire system that /ceph/ has, we're currently using rbd's here to limit what the users should have so we can present a subset of the storage managed by ceph to end users so they don't get excited with seeing 100tb available in cephfs (the numbers here are fictional). managing one cephfs is probably easier than managing lots of rbd's in certain cases.

Regards,
Jimmy Tang

--
Senior Software Engineer, Digital Repository of Ireland (DRI)
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | jtang-TdlRit5Z4I6YFDSwBDOiMg@public.gmane.org
Tel: +353-1-896-3847

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS First product release discussion
       [not found]     ` <CANygib-U_MQi1TMmQuT_Q9MVwPfT+PzJwN=+BMcBK69WuRfu3w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-03-07 13:11       ` Félix Ortega Hortigüela
  0 siblings, 0 replies; 31+ messages in thread
From: Félix Ortega Hortigüela @ 2013-03-07 13:11 UTC (permalink / raw)
  To: Neil Levine
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-idqoXFIVOFJgJs9I8MT0rw


[-- Attachment #1.1: Type: text/plain, Size: 3179 bytes --]

I think stable mds daemon and fsck or a way to recover some of the data
once the mds crash is the only thing we need.

We are using ceph as a very big fs for doing nightly backups of our 3000+
servers. We have some front servers doing rsync over slow adsl lines,
saving all data on a very big cephfs mount. We have some kind of versioning
(with rsync --link-dest) and a custom software over all of this that allows
the user to copy his files or program uploads of the data.

We need to scale the storage quickly and to be able to recover one server
or disk malfunction searching minimum downtime. We don't need a lot of
speed (since the data lines we are using are slow).

It seems ceph is the perfect choice, but perhaps a Plan B for recovering
part of our data in case a catastrohpic failure arises is our most needed
feature.

Regards.

--
Félix Ortega Hortigüela


On Wed, Mar 6, 2013 at 6:01 AM, Neil Levine <neil.levine-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:

> As an extra request, it would be great if people explained a little
> about their use-case for the filesystem so we can better understand
> how the features requested map to the type of workloads people are
> trying.
>
> Thanks
>
> Neil
>
> On Tue, Mar 5, 2013 at 9:03 AM, Greg Farnum <greg-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:
> > This is a companion discussion to the blog post at
> http://ceph.com/dev-notes/cephfs-mds-status-discussion/ — go read that!
> >
> > The short and slightly alternate version: I spent most of about two
> weeks working on bugs related to snapshots in the MDS, and we started
> realizing that we could probably do our first supported release of CephFS
> and the related infrastructure much sooner if we didn't need to support all
> of the whizbang features. (This isn't to say that the base feature set is
> stable now, but it's much closer than when you turn on some of the other
> things.) I'd like to get feedback from you in the community on what minimum
> supported feature set would prompt or allow you to start using CephFS in
> real environments — not what you'd *like* to see, but what you *need* to
> see. This will allow us at Inktank to prioritize more effectively and
> hopefully get out a supported release much more quickly! :)
> >
> > The current proposed feature set is basically what's left over after
> we've trimmed off everything we can think to split off, but if any of the
> proposed included features are also particularly important or don't matter,
> be sure to mention them (NFS export in particular — it works right now but
> isn't in great shape due to NFS filehandle caching).
> >
> > Thanks,
> > -Greg
> >
> > Software Engineer #42 @ http://inktank.com | http://ceph.com
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

[-- Attachment #1.2: Type: text/html, Size: 4422 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-07  0:18                           ` Greg Farnum
@ 2013-03-07 15:15                             ` Jim Schutt
  2013-03-08 22:45                               ` Jim Schutt
  0 siblings, 1 reply; 31+ messages in thread
From: Jim Schutt @ 2013-03-07 15:15 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On 03/06/2013 05:18 PM, Greg Farnum wrote:
> On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
>> When I'm doing these stat operations the file system is otherwise
>> idle.
> 
> What's the cluster look like? This is just one active MDS and a couple hundred clients?

1 mds, 1 mon, 576 osds, 198 cephfs clients.

> 
>> What is happening is that once one of these slow stat operations
>> on a file completes, it never happens again for that file, from
>> any client. At least, that's the case if I'm not writing to
>> the file any more. I haven't checked if appending to the files
>> restarts the behavior.
> 
> I assume it'll come back, but if you could verify that'd be good.

OK, I'll check it out.

> 
>  
>> On the client side I'm running with 3.8.2 + the ceph patch queue
>> that was merged into 3.9-rc1.
>>
>> On the server side I'm running recent next branch (commit 0f42eddef5),
>> with the tcp receive socket buffer option patches cherry-picked.
>> I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num
>> rather than pg_bits to set initial number of PGs (same for pgp_num),
>> and a patch that lets me run with just one pool that contains both
>> data and metadata. I'm testing data distribution uniformity with 512K PGs.
>>
>> My MDS tunables are all at default settings.
>>
>>>
>>> We'll probably want to get a high-debug log of the MDS during these slow stats as well.
>>
>> OK.
>>
>> Do you want me to try to reproduce with a more standard setup?
> No, this is fine. 
>  
>> Also, I see Sage just pushed a patch to pgid decoding - I expect
>> I need that as well, if I'm running the latest client code.
> 
> Yeah, if you've got the commit it references you'll want it.
> 
>> Do you want the MDS log at 10 or 20?
> More is better. ;)

OK, thanks.

-- Jim

> 
> 
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-07 15:15                             ` Jim Schutt
@ 2013-03-08 22:45                               ` Jim Schutt
  2013-03-09  2:05                                 ` Greg Farnum
  0 siblings, 1 reply; 31+ messages in thread
From: Jim Schutt @ 2013-03-08 22:45 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Greg Farnum, ceph-devel, Sage Weil, Wido den Hollander

On 03/07/2013 08:15 AM, Jim Schutt wrote:
> On 03/06/2013 05:18 PM, Greg Farnum wrote:
>> On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:

[snip]

>>> Do you want the MDS log at 10 or 20?
>> More is better. ;)
> 
> OK, thanks.

I've sent some mds logs via private email...

-- Jim


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-08 22:45                               ` Jim Schutt
@ 2013-03-09  2:05                                 ` Greg Farnum
  2013-03-11 14:47                                   ` Jim Schutt
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Farnum @ 2013-03-09  2:05 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote:
> On 03/07/2013 08:15 AM, Jim Schutt wrote:
> > On 03/06/2013 05:18 PM, Greg Farnum wrote:
> > > On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
> >  
>  
>  
>  
> [snip]
>  
> > > > Do you want the MDS log at 10 or 20?
> > > More is better. ;)
> >  
> >  
> >  
> > OK, thanks.
>  
> I've sent some mds logs via private email...
>  
> -- Jim  
I'm going to need to probe into this a bit more, but on an initial examination I see that most of your stats are actually happening very quickly — it's just that occasionally they take quite a while. Going through the MDS log for one of those, the inode in question is flagged with "needsrecover" from its first appearance in the log — that really shouldn't happen unless a client had write caps on it and the client disappeared. Any ideas? The slowness is being caused by the MDS going out and looking at every object which could be in the file — there are a lot since the file has a listed size of 8GB.
(There are several other mysteries here that can probably be traced to different varieties of non-optimal and buggy code as well — there is a client which has write caps on the inode in question despite it needing recovery, but the recovery isn't triggered until the stat event occurs, etc).
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-09  2:05                                 ` Greg Farnum
@ 2013-03-11 14:47                                   ` Jim Schutt
  2013-03-11 15:48                                     ` Greg Farnum
  0 siblings, 1 reply; 31+ messages in thread
From: Jim Schutt @ 2013-03-11 14:47 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On 03/08/2013 07:05 PM, Greg Farnum wrote:
> On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote:
>> On 03/07/2013 08:15 AM, Jim Schutt wrote:
>>> On 03/06/2013 05:18 PM, Greg Farnum wrote:
>>>> On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
>>>  
>>  
>>  
>>  
>> [snip]
>>  
>>>>> Do you want the MDS log at 10 or 20?
>>>> More is better. ;)
>>>  
>>>  
>>>  
>>> OK, thanks.
>>  
>> I've sent some mds logs via private email...
>>  
>> -- Jim  
> I'm going to need to probe into this a bit more, but on an initial
> examination I see that most of your stats are actually happening very
> quickly — it's just that occasionally they take quite a while.

Interesting...

> Going
> through the MDS log for one of those, the inode in question is
> flagged with "needsrecover" from its first appearance in the log —
> that really shouldn't happen unless a client had write caps on it and
> the client disappeared. Any ideas? The slowness is being caused by
> the MDS going out and looking at every object which could be in the
> file — there are a lot since the file has a listed size of 8GB.

For this run, the MDS logging slowed it down enough to cause the
client caps to occasionally go stale.  I don't think it's the cause
of the issue, because I was having it before I turned MDS debugging
up.  My client caps never go stale at, e.g., debug mds 5.

Otherwise, there were no signs of trouble while writing the files.

Can you suggest which kernel client debugging I might enable that
would help understand what is happening?  Also, I have the full
MDS log from writing the files, if that will help.  It's big (~10 GiB).

> (There are several other mysteries here that can probably be traced
> to different varieties of non-optimal and buggy code as well — there
> is a client which has write caps on the inode in question despite it
> needing recovery, but the recovery isn't triggered until the stat
> event occurs, etc).

OK, thanks for taking a look.  Let me know if there is other
logging I can enable that will be helpful.

-- Jim

> -Greg
> 
> Software Engineer #42 @ http://inktank.com | http://ceph.com 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-11 14:47                                   ` Jim Schutt
@ 2013-03-11 15:48                                     ` Greg Farnum
  2013-03-11 16:48                                       ` Jim Schutt
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Farnum @ 2013-03-11 15:48 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote:
> On 03/08/2013 07:05 PM, Greg Farnum wrote:
> > On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote:
> > > On 03/07/2013 08:15 AM, Jim Schutt wrote:
> > > > On 03/06/2013 05:18 PM, Greg Farnum wrote:
> > > > > On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
> > > >  
> > >  
> > >  
> > >  
> > >  
> > >  
> > > [snip]
> > >  
> > > > > > Do you want the MDS log at 10 or 20?
> > > > >  
> > > > > More is better. ;)
> > > >  
> > > >  
> > > >  
> > > >  
> > > > OK, thanks.
> > >  
> > >  
> > > I've sent some mds logs via private email...
> > >  
> > > -- Jim  
> >  
> > I'm going to need to probe into this a bit more, but on an initial
> > examination I see that most of your stats are actually happening very
> > quickly — it's just that occasionally they take quite a while.
>  
>  
>  
> Interesting...
>  
> > Going
> > through the MDS log for one of those, the inode in question is
> > flagged with "needsrecover" from its first appearance in the log —
> > that really shouldn't happen unless a client had write caps on it and
> > the client disappeared. Any ideas? The slowness is being caused by
> > the MDS going out and looking at every object which could be in the
> > file — there are a lot since the file has a listed size of 8GB.
>  
>  
>  
> For this run, the MDS logging slowed it down enough to cause the
> client caps to occasionally go stale. I don't think it's the cause
> of the issue, because I was having it before I turned MDS debugging
> up. My client caps never go stale at, e.g., debug mds 5.

Oh, so this might be behaviorally different than you were seeing before? Drat.

You had said before that each newfstatat was taking tens of seconds, whereas in the strace log you sent along most of the individual calls were taking a bit less than 20 milliseconds. Do you have an strace of them individually taking much more than that, or were you just noticing that they took a long time in aggregate?
I suppose if you were going to run it again then just the message logging could also be helpful. That way we could at least check and see the message delays and if the MDS is doing other work in the course of answering a request.

> Otherwise, there were no signs of trouble while writing the files.
>  
> Can you suggest which kernel client debugging I might enable that
> would help understand what is happening? Also, I have the full
> MDS log from writing the files, if that will help. It's big (~10 GiB).
>  
> > (There are several other mysteries here that can probably be traced
> > to different varieties of non-optimal and buggy code as well — there
> > is a client which has write caps on the inode in question despite it
> > needing recovery, but the recovery isn't triggered until the stat
> > event occurs, etc).
>  
>  
>  
> OK, thanks for taking a look. Let me know if there is other
> logging I can enable that will be helpful.

I'm going to want to spend more time with the log I've got, but I'll think about if there's a different set of data we can gather less disruptively.  
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-11 15:48                                     ` Greg Farnum
@ 2013-03-11 16:48                                       ` Jim Schutt
  2013-03-11 16:57                                         ` Greg Farnum
  0 siblings, 1 reply; 31+ messages in thread
From: Jim Schutt @ 2013-03-11 16:48 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On 03/11/2013 09:48 AM, Greg Farnum wrote:
> On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote:
>> On 03/08/2013 07:05 PM, Greg Farnum wrote:
>>> On Friday, March 8, 2013 at 2:45 PM, Jim Schutt wrote:
>>>> On 03/07/2013 08:15 AM, Jim Schutt wrote:
>>>>> On 03/06/2013 05:18 PM, Greg Farnum wrote:
>>>>>> On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
>>>>>  
>>>>  
>>>>  
>>>>  
>>>>  
>>>>  
>>>> [snip]
>>>>  
>>>>>>> Do you want the MDS log at 10 or 20?
>>>>>>  
>>>>>> More is better. ;)
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>> OK, thanks.
>>>>  
>>>>  
>>>> I've sent some mds logs via private email...
>>>>  
>>>> -- Jim  
>>>  
>>> I'm going to need to probe into this a bit more, but on an initial
>>> examination I see that most of your stats are actually happening very
>>> quickly — it's just that occasionally they take quite a while.
>>  
>>  
>>  
>> Interesting...
>>  
>>> Going
>>> through the MDS log for one of those, the inode in question is
>>> flagged with "needsrecover" from its first appearance in the log —
>>> that really shouldn't happen unless a client had write caps on it and
>>> the client disappeared. Any ideas? The slowness is being caused by
>>> the MDS going out and looking at every object which could be in the
>>> file — there are a lot since the file has a listed size of 8GB.
>>  
>>  
>>  
>> For this run, the MDS logging slowed it down enough to cause the
>> client caps to occasionally go stale. I don't think it's the cause
>> of the issue, because I was having it before I turned MDS debugging
>> up. My client caps never go stale at, e.g., debug mds 5.
> 
> Oh, so this might be behaviorally different than you were seeing before? Drat.
> 
> You had said before that each newfstatat was taking tens of seconds,
> whereas in the strace log you sent along most of the individual calls
> were taking a bit less than 20 milliseconds. Do you have an strace of
> them individually taking much more than that, or were you just
> noticing that they took a long time in aggregate?

When I did the first strace, I didn't turn on timestamps, and I was
watching it scroll by.  I saw several stats in a row take ~30 secs,
at which point I got bored, and took a look at the strace man page to
figure out how to get timestamps ;)

Also, another difference is for that test, I was looking at files
I had written the day before, whereas for the strace log I sent,
there was only several minutes between writing and the strace of find.

I thought I had eliminated the page cache issue by using fdatasync
when writing the files.  Perhaps the real issue is affected by that
delay?

> I suppose if you were going to run it again then just the message
> logging could also be helpful. That way we could at least check and
> see the message delays and if the MDS is doing other work in the
> course of answering a request.

I can do as many trials as needed to isolate the issue.

What message debugging level is sufficient on the MDS; 1?

If you want I can attempt to duplicate my memory of the first
test I reported, writing the files today and doing the strace
tomorrow (with timestamps, this time).

Also, would it be helpful to write the files with minimal logging, in
hopes of inducing minimal timing changes, then upping the logging
for the stat phase?

> 
>> Otherwise, there were no signs of trouble while writing the files.
>>  
>> Can you suggest which kernel client debugging I might enable that
>> would help understand what is happening? Also, I have the full
>> MDS log from writing the files, if that will help. It's big (~10 GiB).
>>  
>>> (There are several other mysteries here that can probably be traced
>>> to different varieties of non-optimal and buggy code as well — there
>>> is a client which has write caps on the inode in question despite it
>>> needing recovery, but the recovery isn't triggered until the stat
>>> event occurs, etc).
>>  
>>  
>>  
>> OK, thanks for taking a look. Let me know if there is other
>> logging I can enable that will be helpful.
> 
> I'm going to want to spend more time with the log I've got, but I'll think about if there's a different set of data we can gather less disruptively.  

OK, cool.  Just let me know.

Thanks -- Jim

> -Greg
> 
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-11 16:48                                       ` Jim Schutt
@ 2013-03-11 16:57                                         ` Greg Farnum
  2013-03-11 20:40                                           ` Jim Schutt
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Farnum @ 2013-03-11 16:57 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On Monday, March 11, 2013 at 9:48 AM, Jim Schutt wrote:
> On 03/11/2013 09:48 AM, Greg Farnum wrote:
> > On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote:
> > > 
> > > For this run, the MDS logging slowed it down enough to cause the
> > > client caps to occasionally go stale. I don't think it's the cause
> > > of the issue, because I was having it before I turned MDS debugging
> > > up. My client caps never go stale at, e.g., debug mds 5.
> > 
> > 
> > 
> > Oh, so this might be behaviorally different than you were seeing before? Drat.
> > 
> > You had said before that each newfstatat was taking tens of seconds,
> > whereas in the strace log you sent along most of the individual calls
> > were taking a bit less than 20 milliseconds. Do you have an strace of
> > them individually taking much more than that, or were you just
> > noticing that they took a long time in aggregate?
> 
> 
> 
> When I did the first strace, I didn't turn on timestamps, and I was
> watching it scroll by. I saw several stats in a row take ~30 secs,
> at which point I got bored, and took a look at the strace man page to
> figure out how to get timestamps ;)
> 
> Also, another difference is for that test, I was looking at files
> I had written the day before, whereas for the strace log I sent,
> there was only several minutes between writing and the strace of find.
> 
> I thought I had eliminated the page cache issue by using fdatasync
> when writing the files. Perhaps the real issue is affected by that
> delay?

I'm not sure. I can't think of any mechanism by which waiting longer would increase the time lags, though, so I doubt it.

> > I suppose if you were going to run it again then just the message
> > logging could also be helpful. That way we could at least check and
> > see the message delays and if the MDS is doing other work in the
> > course of answering a request.
> 
> 
> 
> I can do as many trials as needed to isolate the issue.
> 
> What message debugging level is sufficient on the MDS; 1?
Yep, that will capture all incoming and outgoing messages. :) 
> 
> If you want I can attempt to duplicate my memory of the first
> test I reported, writing the files today and doing the strace
> tomorrow (with timestamps, this time).
> 
> Also, would it be helpful to write the files with minimal logging, in
> hopes of inducing minimal timing changes, then upping the logging
> for the stat phase?

Well that would give us better odds of not introducing failures of any kind during the write phase, and then getting accurate information on what's happening during the stats, so it probably would. Basically I'd like as much logging as possible without changing the states they system goes through. ;)
-Greg


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-11 16:57                                         ` Greg Farnum
@ 2013-03-11 20:40                                           ` Jim Schutt
  2013-03-12 22:34                                             ` Jim Schutt
  0 siblings, 1 reply; 31+ messages in thread
From: Jim Schutt @ 2013-03-11 20:40 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel, Sage Weil, Wido den Hollander

On 03/11/2013 10:57 AM, Greg Farnum wrote:
> On Monday, March 11, 2013 at 9:48 AM, Jim Schutt wrote:
>> On 03/11/2013 09:48 AM, Greg Farnum wrote:
>>> On Monday, March 11, 2013 at 7:47 AM, Jim Schutt wrote:
>>>>
>>>> For this run, the MDS logging slowed it down enough to cause the
>>>> client caps to occasionally go stale. I don't think it's the cause
>>>> of the issue, because I was having it before I turned MDS debugging
>>>> up. My client caps never go stale at, e.g., debug mds 5.
>>>
>>>
>>>
>>> Oh, so this might be behaviorally different than you were seeing before? Drat.
>>>
>>> You had said before that each newfstatat was taking tens of seconds,
>>> whereas in the strace log you sent along most of the individual calls
>>> were taking a bit less than 20 milliseconds. Do you have an strace of
>>> them individually taking much more than that, or were you just
>>> noticing that they took a long time in aggregate?
>>
>>
>>
>> When I did the first strace, I didn't turn on timestamps, and I was
>> watching it scroll by. I saw several stats in a row take ~30 secs,
>> at which point I got bored, and took a look at the strace man page to
>> figure out how to get timestamps ;)
>>
>> Also, another difference is for that test, I was looking at files
>> I had written the day before, whereas for the strace log I sent,
>> there was only several minutes between writing and the strace of find.
>>
>> I thought I had eliminated the page cache issue by using fdatasync
>> when writing the files. Perhaps the real issue is affected by that
>> delay?
> 
> I'm not sure. I can't think of any mechanism by which waiting longer
> would increase the time lags, though, so I doubt it.
> 
>>> I suppose if you were going to run it again then just the message
>>> logging could also be helpful. That way we could at least check and
>>> see the message delays and if the MDS is doing other work in the
>>> course of answering a request.
>>
>>
>>
>> I can do as many trials as needed to isolate the issue.
>>
>> What message debugging level is sufficient on the MDS; 1?
> Yep, that will capture all incoming and outgoing messages. :) 
>>
>> If you want I can attempt to duplicate my memory of the first
>> test I reported, writing the files today and doing the strace
>> tomorrow (with timestamps, this time).
>>
>> Also, would it be helpful to write the files with minimal logging, in
>> hopes of inducing minimal timing changes, then upping the logging
>> for the stat phase?
> 
> Well that would give us better odds of not introducing failures of
> any kind during the write phase, and then getting accurate
> information on what's happening during the stats, so it probably
> would. Basically I'd like as much logging as possible without
> changing the states they system goes through. ;)

Hmmm, this is getting more interesting...

I just did two complete trials where I built a file system,
did two sets of writes with minimal MDS logging, then 
turned MDS logging up to 20 with MDS ms logging at 1 for
the stat phase.

In each trial my strace'd find finished in < 10 seconds,
and there were no slow stat calls (they were taking ~19 ms
each).

I'm going to do a third trial where I let things rest
overnight, after I write the files.  That delay is the
only thing I haven't reproduced from my first trial....

-- Jim


> -Greg
> 
> 
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-11 20:40                                           ` Jim Schutt
@ 2013-03-12 22:34                                             ` Jim Schutt
       [not found]                                               ` <513FAE0F.2010608@sandia.gov>
  0 siblings, 1 reply; 31+ messages in thread
From: Jim Schutt @ 2013-03-12 22:34 UTC (permalink / raw)
  To: Greg Farnum; +Cc: Jim Schutt, ceph-devel, Sage Weil, Wido den Hollander

On 03/11/2013 02:40 PM, Jim Schutt wrote:
>>> >> If you want I can attempt to duplicate my memory of the first
>>> >> test I reported, writing the files today and doing the strace
>>> >> tomorrow (with timestamps, this time).
>>> >>
>>> >> Also, would it be helpful to write the files with minimal logging, in
>>> >> hopes of inducing minimal timing changes, then upping the logging
>>> >> for the stat phase?
>> > 
>> > Well that would give us better odds of not introducing failures of
>> > any kind during the write phase, and then getting accurate
>> > information on what's happening during the stats, so it probably
>> > would. Basically I'd like as much logging as possible without
>> > changing the states they system goes through. ;)
> Hmmm, this is getting more interesting...
> 
> I just did two complete trials where I built a file system,
> did two sets of writes with minimal MDS logging, then 
> turned MDS logging up to 20 with MDS ms logging at 1 for
> the stat phase.
> 
> In each trial my strace'd find finished in < 10 seconds,
> and there were no slow stat calls (they were taking ~19 ms
> each).
> 
> I'm going to do a third trial where I let things rest
> overnight, after I write the files.  That delay is the
> only thing I haven't reproduced from my first trial....

As you suspected, that didn't make any difference either...
That trial didn't reproduce the slow stat behavior.

It turns out that the only way I can reliably reproduce
the slow stat behavior right now is to turn MDS debugging
up to 20.

I've got another set of logs with MDS debug ms = 1,
which I'll send via an off-list email.

I still haven't figured out what made that first test
exhibit this behavior, when I was using debug mds = 5.

-- Jim

> 
> -- Jim
> 
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
       [not found]                                                   ` <5143AA84.50409@sandia.gov>
@ 2013-03-15 23:17                                                     ` Greg Farnum
  2013-03-18 14:19                                                       ` Jim Schutt
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Farnum @ 2013-03-15 23:17 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Wido den Hollander, ceph-devel

[Putting list back on cc]

On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote:

> On 03/15/2013 04:23 PM, Greg Farnum wrote:
> > As I come back and look at these again, I'm not sure what the context
> > for these logs is. Which test did they come from, and which behavior
> > (slow or not slow, etc) did you see? :) -Greg
> 
> 
> 
> They come from a test where I had debug mds = 20 and debug ms = 1
> on the MDS while writing files from 198 clients. It turns out that 
> for some reason I need debug mds = 20 during writing to reproduce
> the slow stat behavior later.
> 
> strace.find.dirs.txt.bz2 contains the log of running 
> strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls -lhd {} \;
> 
> From that output, I believe that the stat of at least these files is slow:
> zero0.rc11
> zero0.rc30
> zero0.rc46
> zero0.rc8
> zero0.tc103
> zero0.tc105
> zero0.tc106
> I believe that log shows slow stats on more files, but those are the first few.
> 
> mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the
> find command started, until just after the fifth or sixth slow stat from
> the list above.
> 
> I haven't yet tried to find other ways of reproducing this, but so far
> it appears that something happens during the writing of the files that
> ends up causing the condition that results in slow stat commands.
> 
> I have the full MDS log from the writing of the files, as well, but it's
> big....
> 
> Is that what you were after?
> 
> Thanks for taking a look!
> 
> -- Jim

I just was coming back to these to see what new information was available, but I realized we'd discussed several tests and I wasn't sure what these ones came from. That information is enough, yes.

If in fact you believe you've only seen this with high-level MDS debugging, I believe the cause is as I mentioned last time: the MDS is flapping a bit and so some files get marked as "needsrecover", but they aren't getting recovered asynchronously, and the first thing that pokes them into doing a recover is the stat.
That's definitely not the behavior we want and so I'll be poking around the code a bit and generating bugs, but given that explanation it's a bit less scary than random slow stats are so it's not such a high priority. :) Do let me know if you come across it without the MDS and clients having had connection issues!
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: CephFS Space Accounting and Quotas
  2013-03-15 23:17                                                     ` Greg Farnum
@ 2013-03-18 14:19                                                       ` Jim Schutt
  0 siblings, 0 replies; 31+ messages in thread
From: Jim Schutt @ 2013-03-18 14:19 UTC (permalink / raw)
  To: Greg Farnum; +Cc: Wido den Hollander, ceph-devel

On 03/15/2013 05:17 PM, Greg Farnum wrote:
> [Putting list back on cc]
> 
> On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote:
> 
>> On 03/15/2013 04:23 PM, Greg Farnum wrote:
>>> As I come back and look at these again, I'm not sure what the context
>>> for these logs is. Which test did they come from, and which behavior
>>> (slow or not slow, etc) did you see? :) -Greg
>>
>>
>>
>> They come from a test where I had debug mds = 20 and debug ms = 1
>> on the MDS while writing files from 198 clients. It turns out that 
>> for some reason I need debug mds = 20 during writing to reproduce
>> the slow stat behavior later.
>>
>> strace.find.dirs.txt.bz2 contains the log of running 
>> strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls -lhd {} \;
>>
>> From that output, I believe that the stat of at least these files is slow:
>> zero0.rc11
>> zero0.rc30
>> zero0.rc46
>> zero0.rc8
>> zero0.tc103
>> zero0.tc105
>> zero0.tc106
>> I believe that log shows slow stats on more files, but those are the first few.
>>
>> mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the
>> find command started, until just after the fifth or sixth slow stat from
>> the list above.
>>
>> I haven't yet tried to find other ways of reproducing this, but so far
>> it appears that something happens during the writing of the files that
>> ends up causing the condition that results in slow stat commands.
>>
>> I have the full MDS log from the writing of the files, as well, but it's
>> big....
>>
>> Is that what you were after?
>>
>> Thanks for taking a look!
>>
>> -- Jim
> 
> I just was coming back to these to see what new information was
> available, but I realized we'd discussed several tests and I wasn't
> sure what these ones came from. That information is enough, yes.
> 
> If in fact you believe you've only seen this with high-level MDS
> debugging, I believe the cause is as I mentioned last time: the MDS
> is flapping a bit and so some files get marked as "needsrecover", but
> they aren't getting recovered asynchronously, and the first thing
> that pokes them into doing a recover is the stat.

OK, that makes sense.

> That's definitely not the behavior we want and so I'll be poking
> around the code a bit and generating bugs, but given that explanation
> it's a bit less scary than random slow stats are so it's not such a
> high priority. :) Do let me know if you come across it without the
> MDS and clients having had connection issues!

No problem - thanks!

-- Jim


> -Greg
> 
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2013-03-18 14:19 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <sfid-H20130305-170326-+024.05-1@marduk.tchpc.tcd.ie>
2013-03-05 17:03 ` CephFS First product release discussion Greg Farnum
2013-03-05 18:08   ` Wido den Hollander
2013-03-05 18:17     ` Greg Farnum
2013-03-05 18:28       ` Sage Weil
2013-03-05 18:36         ` Wido den Hollander
2013-03-05 18:48           ` Jim Schutt
2013-03-05 19:33           ` Sage Weil
2013-03-06 17:24             ` Wido den Hollander
2013-03-06 19:07             ` Jim Schutt
2013-03-06 19:13               ` CephFS Space Accounting and Quotas (was: CephFS First product release discussion) Greg Farnum
2013-03-06 19:58                 ` CephFS Space Accounting and Quotas Jim Schutt
2013-03-06 20:21                   ` Greg Farnum
2013-03-06 21:28                     ` Jim Schutt
2013-03-06 21:39                       ` Greg Farnum
2013-03-06 23:14                         ` Jim Schutt
2013-03-07  0:18                           ` Greg Farnum
2013-03-07 15:15                             ` Jim Schutt
2013-03-08 22:45                               ` Jim Schutt
2013-03-09  2:05                                 ` Greg Farnum
2013-03-11 14:47                                   ` Jim Schutt
2013-03-11 15:48                                     ` Greg Farnum
2013-03-11 16:48                                       ` Jim Schutt
2013-03-11 16:57                                         ` Greg Farnum
2013-03-11 20:40                                           ` Jim Schutt
2013-03-12 22:34                                             ` Jim Schutt
     [not found]                                               ` <513FAE0F.2010608@sandia.gov>
     [not found]                                                 ` <BE627BF4B6E74BD49037D07821FC1DB9@inktank.com>
     [not found]                                                   ` <5143AA84.50409@sandia.gov>
2013-03-15 23:17                                                     ` Greg Farnum
2013-03-18 14:19                                                       ` Jim Schutt
2013-03-06 21:42                     ` Sage Weil
2013-03-06  5:01   ` [ceph-users] CephFS First product release discussion Neil Levine
     [not found]     ` <CANygib-U_MQi1TMmQuT_Q9MVwPfT+PzJwN=+BMcBK69WuRfu3w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-03-07 13:11       ` Félix Ortega Hortigüela
     [not found]   ` <E0B1337A572647BA9FCC0CE8CA946F42-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
2013-03-07 11:54     ` Jimmy Tang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.