ceph-devel.vger.kernel.org archive mirror
* Re: [PATCH] ceph-client: fix xattr bugs
       [not found] <w2ibe579e381004290204lb59c124di7a59f71bbf4ede1@mail.gmail.com>
@ 2010-04-29 16:39 ` Sage Weil
       [not found]   ` <l2nbe579e381004300827l4c695cb1g2e6ff032fddee222@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2010-04-29 16:39 UTC (permalink / raw)
  To: Henry C Chang; +Cc: ceph-devel

Hi Henry--

On Thu, 29 Apr 2010, Henry C Chang wrote:
> 1. fill_inode() incorrectly frees the xattr blob at the end of the
>   function, causing a segfault and eventually a kernel crash.

I fixed this slightly differently, by clearing xattr_blob if/when it is 
used.  See 

http://ceph.newdream.net/git/?p=ceph-client.git;a=commitdiff;h=c89f9c6decdbe2427d4d510a949a2d87c5e340dc
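
(For illustration, the shape of that fix is roughly the pattern below.  This
is plain userspace C with made-up names, not the actual fill_inode() code; it
just shows the ownership hand-off.)

#include <stdlib.h>
#include <string.h>

struct xattr_cache { char *blob; };

/* Once the buffer is handed off to the cache, clear the local pointer so
 * the shared exit path only frees buffers that were never consumed. */
static int fill(struct xattr_cache *cache, const char *data, int want_update)
{
        int err = 0;
        char *blob = strdup(data);      /* allocated up front, tentatively */

        if (!blob)
                return -1;

        if (want_update) {
                free(cache->blob);
                cache->blob = blob;     /* cache takes ownership...        */
                blob = NULL;            /* ...so don't free it below       */
        }

        /* ... other work that may set err ... */

        free(blob);                     /* no-op if the blob was consumed  */
        return err;
}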

> 2. ceph_listxattr() should compare index_version and version by '>='.

http://ceph.newdream.net/git/?p=ceph-client.git;a=commitdiff;h=da6eb075f800b64af9e199ffbe9171752647fbaa

Both are pushed to the unstable branch.  Do you mind testing?  Do you have 
a simple test that reproduced the bug before?  If so we should add it to 
the qa suite (which currently does nothing with xattrs).

Thanks!
sage


> 
> Signed-off-by: henry_c_chang <henry_c_chang@tcloudcomputing.com>
> ---
>  inode.c |    2 +-
>  xattr.c |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/inode.c b/inode.c
> index 261f3e6..bfe48ee 100644
> --- a/inode.c
> +++ b/inode.c
> @@ -742,7 +742,7 @@ no_change:
>        err = 0;
> 
>  out:
> -       if (xattr_blob)
> +       if (err && xattr_blob)
>                ceph_buffer_put(xattr_blob);
>        return err;
>  }
> diff --git a/xattr.c b/xattr.c
> index 37d6ce6..8c4ef01 100644
> --- a/xattr.c
> +++ b/xattr.c
> @@ -573,7 +573,7 @@ ssize_t ceph_listxattr(struct dentry *dentry, char
> *names, size_t size)
>             ci->i_xattrs.version, ci->i_xattrs.index_version);
> 
>        if (__ceph_caps_issued_mask(ci, CEPH_CAP_XATTR_SHARED, 1) &&
> -           (ci->i_xattrs.index_version > ci->i_xattrs.version)) {
> +           (ci->i_xattrs.index_version >= ci->i_xattrs.version)) {
>                goto list_xattr;
>        } else {
>                spin_unlock(&inode->i_lock);
> --
> 1.6.3.3
> 


* Re: [PATCH] ceph-client: fix xattr bugs
       [not found]   ` <l2nbe579e381004300827l4c695cb1g2e6ff032fddee222@mail.gmail.com>
@ 2010-04-30 18:21     ` Sage Weil
       [not found]       ` <m2obe579e381005022121z1ace49f3o188240077631f497@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2010-04-30 18:21 UTC (permalink / raw)
  To: Henry C Chang; +Cc: ceph-devel

On Fri, 30 Apr 2010, Henry C Chang wrote:
> Hi Sage,
> 
> Thanks for your reply.
> I don't actually have a test program that reproduces the bug.
> I detected it manually by entering xattr commands from the command prompt.

Do you remember which setfattr/whatever commands?

> In fact, I am trying to add "folder quota support" to ceph right now.
> My rough idea is as below:
> 
> (1) Store the quota limit in the xattr of the folder;
> (2) When the client requests a new max size for writing content to a file,
> MDS authorizes the request according to the quota and the rstat of the
> folder.

One thing to keep in mind is that because the recursive rsize info is 
lazily propagated up the file tree, this won't work perfectly.  If you set 
a limit of 1GB on /foo and are writing data in /foo/bar/baz, it won't stop 
you right at 1GB.  Similarly, if you hit the limit, and delete some stuff, 
it will take time before the MDS notices and lets you start writing again.

If that is acceptable, then in principle this is doable, although 
difficult.  The first step would probably be keeping track of how much 
"max_size - size" slop is currently outstanding in a recursive fashion so 
that the mds can limit itself.  For a single mds that's pretty doable.  
When you consider that the hierarchy can be partitioned across multiple 
nodes, it becomes more difficult (which is partly why rsize is lazily 
propagated).  This would definitely take some planning...

What is your use case?

sage



> 
> I detected the xattr bug when I was doing tests for (1). Everything has
> gone well so far since it was fixed.
> I can write some test programs to exercise it further, since I will be
> writing a program (tool) to set/get folder quotas anyway.
> 
> I would appreciate it if you could give me some advice on implementing
> "folder quota".
> 
> Best Regards,
> Henry
> 
> 
> On Fri, Apr 30, 2010 at 12:39 AM, Sage Weil <sage@newdream.net> wrote:
> 
> > Hi Henry--
> >
> > On Thu, 29 Apr 2010, Henry C Chang wrote:
> > > 1. fill_inode() incorrectly frees the xattr blob in the end of the
> > >   function. It will cause segfault and then kernel will crash.
> >
> > I fixed this slightly differently, by clearing xattr_blob if/when it is
> > used.  See
> >
> >
> > http://ceph.newdream.net/git/?p=ceph-client.git;a=commitdiff;h=c89f9c6decdbe2427d4d510a949a2d87c5e340dc
> >
> > > 2. ceph_listxattr() should compare index_version and version by '>='.
> >
> >
> > http://ceph.newdream.net/git/?p=ceph-client.git;a=commitdiff;h=da6eb075f800b64af9e199ffbe9171752647fbaa
> >
> > Both are pushed to the unstable branch.  Do you mind testing?  Do you have
> > a simple test that reproduced the bug before?  If so we should add it to
> > the qa suite (which currently does nothing with xattrs).
> >
> > Thanks!
> > sage
> >
> >
> > >
> > > Signed-off-by: henry_c_chang <henry_c_chang@tcloudcomputing.com>
> > > ---
> > >  inode.c |    2 +-
> > >  xattr.c |    2 +-
> > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/inode.c b/inode.c
> > > index 261f3e6..bfe48ee 100644
> > > --- a/inode.c
> > > +++ b/inode.c
> > > @@ -742,7 +742,7 @@ no_change:
> > >        err = 0;
> > >
> > >  out:
> > > -       if (xattr_blob)
> > > +       if (err && xattr_blob)
> > >                ceph_buffer_put(xattr_blob);
> > >        return err;
> > >  }
> > > diff --git a/xattr.c b/xattr.c
> > > index 37d6ce6..8c4ef01 100644
> > > --- a/xattr.c
> > > +++ b/xattr.c
> > > @@ -573,7 +573,7 @@ ssize_t ceph_listxattr(struct dentry *dentry, char
> > > *names, size_t size)
> > >             ci->i_xattrs.version, ci->i_xattrs.index_version);
> > >
> > >        if (__ceph_caps_issued_mask(ci, CEPH_CAP_XATTR_SHARED, 1) &&
> > > -           (ci->i_xattrs.index_version > ci->i_xattrs.version)) {
> > > +           (ci->i_xattrs.index_version >= ci->i_xattrs.version)) {
> > >                goto list_xattr;
> > >        } else {
> > >                spin_unlock(&inode->i_lock);
> > > --
> > > 1.6.3.3
> > >
> >
> 


* Re: [PATCH] ceph-client: fix xattr bugs
       [not found]       ` <m2obe579e381005022121z1ace49f3o188240077631f497@mail.gmail.com>
@ 2010-05-03  6:41         ` Henry C Chang
  2010-06-01 19:17           ` subdir quotas Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Henry C Chang @ 2010-05-03  6:41 UTC (permalink / raw)
  To: ceph-devel

Please see my inline reply below.

-Henry

On Sat, May 1, 2010 at 2:21 AM, Sage Weil <sage@newdream.net> wrote:
>
> On Fri, 30 Apr 2010, Henry C Chang wrote:
> > Hi Sage,
> >
> > Thanks for your reply.
> > I don't have the test program that reproduces the bug actually.
> > I detected it manually by entering xattr commands from the command prompt.
>
> Do you remember which setfattr/whatever commands?

I installed the python-xattr package; it contains an 'xattr' script that
can be used directly from the command prompt.

root@ceph-vm2:~# xattr --help
usage: xattr [-l] file [attr_name [attr_value]]
  -l: print long format (attr_name: attr_value) when listing xattrs
  With no optional arguments, lists the xattrs on file
  With attr_name only, lists the contents of attr_name on file
  With attr_value, set the contents of attr_name on file

One of my test sequences is:

$ xattr /mnt/ceph/folder user.quota 20000000
$ xattr /mnt/ceph/folder user.quota
20000000
$ umount /mnt/ceph
$ mount -t ceph 192.168.159.135:/ /mnt/ceph
$ xattr /mnt/ceph/folder user.quota
No such attribute.

Thereafter, the system becomes unstable at some point and eventually crashes.
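
If it would help the qa suite, a minimal C version of that sequence could look
something like the sketch below.  It uses the standard Linux setxattr/getxattr
syscalls rather than the python tool; the attribute name and value just mirror
the commands above.

/* quota_xattr.c: set or read back the user.quota xattr.
 * Build: gcc -o quota_xattr quota_xattr.c
 * Usage: quota_xattr set /mnt/ceph/folder 20000000
 *        quota_xattr get /mnt/ceph/folder
 */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
        const char *name = "user.quota";
        char buf[64];
        ssize_t len;

        if (argc >= 4 && strcmp(argv[1], "set") == 0) {
                if (setxattr(argv[2], name, argv[3], strlen(argv[3]), 0) < 0) {
                        perror("setxattr");
                        return 1;
                }
                return 0;
        }
        if (argc >= 3 && strcmp(argv[1], "get") == 0) {
                len = getxattr(argv[2], name, buf, sizeof(buf) - 1);
                if (len < 0) {
                        perror("getxattr");
                        return 1;
                }
                buf[len] = '\0';
                printf("%s = %s\n", name, buf);
                return 0;
        }
        fprintf(stderr, "usage: %s set <path> <value> | get <path>\n", argv[0]);
        return 2;
}

Running 'set' before the umount and 'get' after the remount reproduces the
sequence above; with the bug present, the post-remount 'get' fails even though
the attribute was written.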

>
> > In fact, I am trying to add "folder quota support" to ceph right now.
> > My rough idea is as below:
> >
> > (1) Store the quota limit in the xattr of the folder;
> > (2) When the client requests a new max size for writing content to a file,
> > MDS authorizes the request according to the quota and the rstat of the
> > folder.
>
> One thing to keep in mind is that because the recursive rsize info is
> lazily propagated up the file tree, this won't work perfectly.  If you set
> a limit of 1GB on /foo and are writing data in /foo/bar/baz, it won't stop
> you right at 1GB.  Similarly, if you hit the limit, and delete some stuff,
> it will take time before the MDS notices and lets you start writing again.
>
Hmm... this would be a problem...
From the perspective of a user, I would be happy if I could write more
than my quota.
However, I would get pissed off if I had deleted some stuff but still
could not write anything and did not know how long I would have to wait.

Is it possible to force the MDS to propagate rsize info when files are deleted?
Or can lazy propagation be bounded by a maximum interval (say 5 seconds)?


>
> If that is acceptable, then in principle this is doable, although
> difficult.  The first step would probably be keeping track of how much
> "max_size - size" slop is currently outstanding in a recursive fashion so
> that the mds can limit itself.  For a single mds that's pretty doable.
> When you consider that the hierarchy can be partitioned across multiple
> nodes, it becomes more difficult (which is partly why rsize is lazily
> propagated).  This would definitely take some planning...
>
> What is your use case?
>
I want to create "depots" inside ceph:
- Each depot has its own quota limit and can be resized as needed.
- Multiple users can read/write the same depot concurrently.

My original plan is to create a first-level folder for each depot
(e.g., /mnt/ceph/depot1, /mnt/ceph/depot2, ...) and set quota on it.
Do you have any suggestion on implementing such a use case?


>
> sage
>
>
>
> >
> > I detected the xattr bug when I was doing tests for (1). Everything goes
> > well so far after it's fixed.
> > I can write some test programs to test it more since I will write a program
> > (tool) to set/get folder quota.
> >
> > I would appreciate if you can give me some advice on implementing "folder
> > quota".
> >
> > Best Regards,
> > Henry
> >
> >
> > On Fri, Apr 30, 2010 at 12:39 AM, Sage Weil <sage@newdream.net> wrote:
> >
> > > Hi Henry--
> > >
> > > On Thu, 29 Apr 2010, Henry C Chang wrote:
> > > > 1. fill_inode() incorrectly frees the xattr blob in the end of the
> > > >   function. It will cause segfault and then kernel will crash.
> > >
> > > I fixed this slightly differently, by clearing xattr_blob if/when it is
> > > used.  See
> > >
> > >
> > > http://ceph.newdream.net/git/?p=ceph-client.git;a=commitdiff;h=c89f9c6decdbe2427d4d510a949a2d87c5e340dc
> > >
> > > > 2. ceph_listxattr() should compare index_version and version by '>='.
> > >
> > >
> > > http://ceph.newdream.net/git/?p=ceph-client.git;a=commitdiff;h=da6eb075f800b64af9e199ffbe9171752647fbaa
> > >
> > > Both are pushed to the unstable branch.  Do you mind testing?  Do you have
> > > a simple test that reproduced the bug before?  If so we should add it to
> > > the qa suite (which currently does nothing with xattrs).
> > >
> > > Thanks!
> > > sage
> > >
> > >
> > > >
> > > > Signed-off-by: henry_c_chang <henry_c_chang@tcloudcomputing.com>
> > > > ---
> > > >  inode.c |    2 +-
> > > >  xattr.c |    2 +-
> > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/inode.c b/inode.c
> > > > index 261f3e6..bfe48ee 100644
> > > > --- a/inode.c
> > > > +++ b/inode.c
> > > > @@ -742,7 +742,7 @@ no_change:
> > > >        err = 0;
> > > >
> > > >  out:
> > > > -       if (xattr_blob)
> > > > +       if (err && xattr_blob)
> > > >                ceph_buffer_put(xattr_blob);
> > > >        return err;
> > > >  }
> > > > diff --git a/xattr.c b/xattr.c
> > > > index 37d6ce6..8c4ef01 100644
> > > > --- a/xattr.c
> > > > +++ b/xattr.c
> > > > @@ -573,7 +573,7 @@ ssize_t ceph_listxattr(struct dentry *dentry, char
> > > > *names, size_t size)
> > > >             ci->i_xattrs.version, ci->i_xattrs.index_version);
> > > >
> > > >        if (__ceph_caps_issued_mask(ci, CEPH_CAP_XATTR_SHARED, 1) &&
> > > > -           (ci->i_xattrs.index_version > ci->i_xattrs.version)) {
> > > > +           (ci->i_xattrs.index_version >= ci->i_xattrs.version)) {
> > > >                goto list_xattr;
> > > >        } else {
> > > >                spin_unlock(&inode->i_lock);
> > > > --
> > > > 1.6.3.3
> > > >
> > >
> >


* subdir quotas
  2010-05-03  6:41         ` Henry C Chang
@ 2010-06-01 19:17           ` Sage Weil
  2010-06-02  9:34             ` Henry C Chang
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2010-06-01 19:17 UTC (permalink / raw)
  To: Henry C Chang; +Cc: ceph-devel

Hi,

The subject of quota enforcement came up in the IRC channel last week so I 
thought I'd resurrect this discussion.

> > On Fri, 30 Apr 2010, Henry C Chang wrote:
> > > In fact, I am trying to add "folder quota support" to ceph right now.
> > > My rough idea is as below:
> > >
> > > (1) Store the quota limit in the xattr of the folder;
> > > (2) When the client requests a new max size for writing content to a file,
> > > MDS authorizes the request according to the quota and the rstat of the
> > > folder.
> >
> > One thing to keep in mind is that because the recursive rsize info is
> > lazily propagated up the file tree, this won't work perfectly.  If you set
> > a limit of 1GB on /foo and are writing data in /foo/bar/baz, it won't stop
> > you right at 1GB.  Similarly, if you hit the limit, and delete some stuff,
> > it will take time before the MDS notices and lets you start writing again.
> >
> Hmm... this would be a problem...
> From the perspective of a user, I would be happy if I could write more 
> than my quota. However, I would get pissed off if I had deleted some 
> stuff but still could not write anything and did not know how long I 
> would have to wait.
> 
> Is it possible to force the MDS to propagate rsize info when files are deleted?
> Or can lazy propagation be bounded by a maximum interval (say 5 seconds)?

The propagation is bounded by a tunable timeout (30 seconds by default, 
but adjustable).  That's per ancestor.. so if you're three levels deep, 
the max is 3x that.  In practice, it's typically less, though, and I think 
we could come up with something that would force propagation to happen 
faster in these situations.  The reason it's there is just to limit the 
overhead of maintaining the recursive stats.  We don't want to update all 
ancestors every time we change something, and because we're distributed 
over multiple nodes we can't.

> > What is your use case?
>
> I want to create "depots" inside ceph:
> - Each depot has its own quota limit and can be resized as needed.
> - Multiple users can read/write the same depot concurrently.
> 
> My original plan is to create a first-level folder for each depot
> (e.g., /mnt/ceph/depot1, /mnt/ceph/depot2, ...) and set quota on it.
> Do you have any suggestion on implementing such a use case?

There's no reason to restrict this to first-level folders (if that's what 
you were suggesting).  We should allow a subdir quota to be set on any 
directory, probably iff you are the owner.  We can make the user interface 
based on xattrs, since that's generally nicer to interact with than an 
ioctl based interface.  That's not to say the quota should necessarily be 
handled/stored internally as an xattr (although it could be).  It might 
make more sense to add a field to the inode and extend the client/mds 
protocol to manipulate it.

Either way, I think a coarse implementation could be done pretty easily, 
where by 'coarse' I mean we don't necessarily stop writes exactly at the 
limit (they can write a bit more before they start getting ENOSPC).

On IRC the subject of soft quotas also came up (where you're allowed over 
the soft limit for some grace period before writes start failing).  That's 
also not terribly difficult to implement (we just need to store some 
timestamp field as well so we know when they initially cross the soft 
threshold).
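
A rough sketch of that check, with invented field names (not Ceph code, just
the idea):

#include <errno.h>
#include <stdint.h>
#include <time.h>

struct quota {
        uint64_t soft_limit;        /* bytes; 0 = no soft limit          */
        uint64_t hard_limit;        /* bytes; 0 = no hard limit          */
        time_t   soft_crossed_at;   /* 0 = not currently over soft limit */
        int      grace_secs;
};

static int quota_check(struct quota *q, uint64_t used, time_t now)
{
        if (q->hard_limit && used >= q->hard_limit)
                return -ENOSPC;             /* hard limit: always refuse   */
        if (!q->soft_limit || used < q->soft_limit) {
                q->soft_crossed_at = 0;     /* back under: reset the grace */
                return 0;
        }
        if (!q->soft_crossed_at)
                q->soft_crossed_at = now;   /* remember the first crossing */
        if (now - q->soft_crossed_at > q->grace_secs)
                return -ENOSPC;             /* grace period expired        */
        return 0;                           /* over soft limit, in grace   */
}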

Are you still interested in working on this?

sage


* Re: subdir quotas
  2010-06-01 19:17           ` subdir quotas Sage Weil
@ 2010-06-02  9:34             ` Henry C Chang
  2010-06-02 19:48               ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Henry C Chang @ 2010-06-02  9:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Yes, I am interested and that's what I am doing right now.
In fact, we have a clone of ceph on github and already have a "quick"
implementation. You can get it from:

http://github.com/tcloud/ceph/tree/folder-quota
http://github.com/tcloud/ceph-client-standalone/tree/folder-quota

To allow switching quota on/off, we added an option/configuration on both
the client and server sides. To enable folder quota, you need to mount ceph
with "-o folder_quota=1" on the client side. On the server side, you need to
add "folder quota = 1" to the global section of the ceph config file. We also
implemented a tool to set/unset/get/list quota limits on folders.

To enforce the quota more precisely, however, our implementation sacrifices
write throughput and introduces more traffic:

1. We modified the max_size request-reply behaviour between client and mds.
   Our client requests a new max_size only when endoff > max_size (i.e., it
   will not pre-request a larger max_size as it approaches the current max_size).

2. Our client requests a constant 4 MB (the object size) every time. This
   degrades the throughput significantly. (It used to request more and more.)
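
In code, the behaviour in (1) and (2) amounts to roughly the following (names
are illustrative, not the actual client code):

#include <stdint.h>

/* Only ask the mds for more max_size when the write would actually exceed
 * it, and then grow by a single 4 MB object at a time. */
static uint64_t wanted_max_size(uint64_t cur_max, uint64_t endoff)
{
        const uint64_t object_size = 4ULL << 20;        /* 4 MB */
        uint64_t want;

        if (endoff <= cur_max)
                return cur_max;                 /* no request needed        */
        want = cur_max + object_size;           /* request one more object  */
        if (want < endoff)
                want = endoff;                  /* always cover the write   */
        return want;
}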

Anyway, this is just the initial implementation. I will take your comments into
consideration and try to revise it. Of course, I will need your help on the
rstat propagation issue because I have no clue right now and have to dig into
the mds source code more to understand the existing implementation. :)

A few questions about ceph testing:
- When will a subtree be fragmented?
- Can I force a subtree to be fragmented to facilitate testing?
- How do I know which mds is authoritative for a particular fragment?

Thanks,
Henry

On Wed, Jun 2, 2010 at 3:17 AM, Sage Weil <sage@newdream.net> wrote:
> Hi,
>
> The subject of quota enforcement came up in the IRC channel last week so I
> thought I'd resurrect this discussion.
>
>> > On Fri, 30 Apr 2010, Henry C Chang wrote:
>> > > In fact, I am trying to add "folder quota support" to ceph right now.
>> > > My rough idea is as below:
>> > >
>> > > (1) Store the quota limit in the xattr of the folder;
>> > > (2) When the client requests a new max size for writing content to a file,
>> > > MDS authorizes the request according to the quota and the rstat of the
>> > > folder.
>> >
>> > One thing to keep in mind is that because the recursive rsize info is
>> > lazily propagated up the file tree, this won't work perfectly.  If you set
>> > a limit of 1GB on /foo and are writing data in /foo/bar/baz, it won't stop
>> > you right at 1GB.  Similarly, if you hit the limit, and delete some stuff,
>> > it will take time before the MDS notices and lets you start writing again.
>> >
>> Hmm... this would be a problem....
>> From the perspective of a user, I would be happy if I can write more
>> than my quota. However, I would get pissed off if I have deleted some
>> stuff but still cannot write anything and don't know how long I have to
>> wait.
>>
>> Is it possible to force the MDS to propagate rsize info when files are deleted?
>> Or, can lazy propagation be bounded to a maximum interval (say 5 seconds)?
>
> The propagation is bounded by a tunable timeout (30 seconds by default,
> but adjustable).  That's per ancestor.. so if you're three levels deep,
> the max is 3x that.  In practice, it's typically less, though, and I think
> we could come up with something that would force propagation to happen
> faster in these situations.  The reason it's there is just to limit the
> overhead of maintaining the recursive stats.  We don't want to update all
> ancestors every time we change something, and because we're distributed
> over multiple nodes we can't.
>
>> > What is your use case?
>>
>> I want to create "depots" inside ceph:
>> - Each depot has its own quota limit and can be resized as needed.
>> - Multiple users can read/write the same depot concurrently.
>>
>> My original plan is to create a first-level folder for each depot
>> (e.g., /mnt/ceph/depot1, /mnt/ceph/depot2, ...) and set quota on it.
>> Do you have any suggestion on implementing such a use case?
>
> There's no reason to restrict this to first-level folders (if that's what
> you were suggesting).  We should allow a subdir quota to be set on any
> directory, probably iff you are the owner.  We can make the user interface
> based on xattrs, since that's generally nicer to interact with than an
> ioctl based interface.  That's not to say the quota should necessarily be
> handled/stored internally as an xattr (although it could be).  It might
> make more sense to add a field to the inode and extend the client/mds
> protocol to manipulate it.
>
> Either way, I think a coarse implementation could be done pretty easily,
> where by 'coarse' I mean we don't necessarily stop writes exactly at the
> limit (they can write a bit more before they start getting ENOSPC).
>
> On IRC the subject of soft quotas also came up (where you're allowed over
> the soft limit for some grace period before writes start failing).  That's
> also not terribly difficult to implement (we just need to store some
> timestamp field as well so we know when they initially cross the soft
> threshold).
>
> Are you still interested in working on this?
>
> sage


* Re: subdir quotas
  2010-06-02  9:34             ` Henry C Chang
@ 2010-06-02 19:48               ` Sage Weil
  2010-06-04  3:19                 ` Henry C Chang
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2010-06-02 19:48 UTC (permalink / raw)
  To: Henry C Chang; +Cc: ceph-devel

On Wed, 2 Jun 2010, Henry C Chang wrote:
> Yes, I am interested and that's what I am doing right now.
> In fact, we have a clone of ceph on github, and have had a "quick"
> implementation already. You can get it from:
> 
> http://github.com/tcloud/ceph/tree/folder-quota
> http://github.com/tcloud/ceph-client-standalone/tree/folder-quota

Oh, cool.  I'll take a look at this today.

> To allow switching quota on/off, we added the option/configuration on both
> client and server sides. To enable folder quota, you need to mount ceph with
> "-o folder_quota=1" on client side. On server side, you need to add
> "folder quota = 1" in the global section of ceph config file. We also
> implemented a tool to set/unset/get/list quota limits on folders.
> 
> To enforce the quota more precisely, however, our implementation sacrifices
> write throughput and introduces more traffic:
> 
> 1. We modified the max_size request-reply behaviour between client and mds.
>    Our client requests a new max_size only when endoff > max_size. (i.e., it
>    will not pre-request a larger max-size when it's approached the max_size.)
> 
> 2. Our client requests a constant 4 MB (the object size) every time. This
>    degrades the throughput significantly. (It used to request more and more.)

Is this just to reduce the amount by which we might overshoot?  I would 
try to make it a tunable, maybe ('max size slop' or something) so that it 
preserves the current doubling logic but caps it at some value, so the 
admin can trade throughput vs quota precision.  And/or we can also have it 
dynamically reduce that window as the user approaches the limit.
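
Something along these lines, as a sketch only (the 'max_size_slop' tunable is
hypothetical, not an existing option):

#include <stdint.h>

/* Keep the doubling behaviour for max_size requests, but cap how far the
 * window may extend past the write so quota overshoot stays bounded. */
static uint64_t next_max_size(uint64_t cur_max, uint64_t endoff,
                              uint64_t max_size_slop)
{
        uint64_t want = cur_max ? cur_max * 2 : (4ULL << 20);  /* start at 4 MB */

        if (want > endoff + max_size_slop)
                want = endoff + max_size_slop;  /* bound the overshoot      */
        if (want < endoff)
                want = endoff;                  /* always cover the write   */
        return want;
}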

> Anyway, it is the initial implementation. I will take your comments into
> consideration and try to revise the current implementation. Of course, I will
> need your help on rstat propagation issue 'cause I have no clue right now
> and have to dig the mds source code more to understand the existing
> implementation. :)

Sure.

> A few questions about ceph testing:
> - When will a subtree be fragmented?
> - Can I force a subtree to be fragmented to facilitate testing?

By default the load balancer runs every 30 seconds.  You can turn on mds 
'thrashing', which will export random directories to random nodes (to stress 
test the migration), but that is probably overkill.

It would probably be best to add something to MDS.cc's handle_command that 
lets the admin explicitly initiate a subtree migration, via something like

 $ ceph mds tell 0 export_dir /foo/bar 2    # send /foo/bar from mds0 to 2

I just pushed something to do that to unstable.. let me know if you run 
into problems with it.

> - How do I know which mds is authoritative for a particular fragment?

In the mds log you'll periodically see a show_subtrees output, but that 
only shows a local view of the partition.  There isn't currently a way to 
query a running mds, though (e.g. via the 'ceph' tool)... let me think 
about that one!

sage


* Re: subdir quotas
  2010-06-02 19:48               ` Sage Weil
@ 2010-06-04  3:19                 ` Henry C Chang
  2010-06-04 16:42                   ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Henry C Chang @ 2010-06-04  3:19 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

On Thu, Jun 3, 2010 at 3:48 AM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 2 Jun 2010, Henry C Chang wrote:
>> Yes, I am interested and that's what I am doing right now.
>> In fact, we have a clone of ceph on github, and have had a "quick"
>> implementation already. You can get it from:
>>
>> http://github.com/tcloud/ceph/tree/folder-quota
>> http://github.com/tcloud/ceph-client-standalone/tree/folder-quota
>
> Oh, cool.  I'll take a look at this today.
>
>> To allow switching quota on/off, we added the option/configuration on both
>> client and server sides. To enable folder quota, you need to mount ceph with
>> "-o folder_quota=1" on client side. On server side, you need to add
>> "folder quota = 1" in the global section of ceph config file. We also
>> implemented a tool to set/unset/get/list quota limits on folders.
>>
>> To enforce the quota more precisely, however, our implementation sacrifices
>> write throughput and introduces more traffic:
>>
>> 1. We modified the max_size request-reply behaviour between client and mds.
>>    Our client requests a new max_size only when endoff > max_size. (i.e., it
>>    will not pre-request a larger max-size when it's approached the max_size.)
>>
>> 2. Our client requests a constant 4 MB (the object size) every time. This
>>    degrades the throughput significantly. (It used to request more and more.)
>
> Is this just to reduce the amount by which we might overshoot?  I would
> try to make it a tunable, maybe ('max size slop' or something) so that it

Great!

> preserves the current doubling logic but caps it at some value, so the
> admin can trade throughput vs quota precision.  And/or we can also have it
> dynamically reduce that window as the user approaches the limit.

Yes. But if there are multiple clients writing to one subtree concurrently, it
is a little bit difficult to say whether we are approaching the limit; we would
need to know how many clients are writing to the same subtree.

>
>> Anyway, it is the initial implementation. I will take your comments into
>> consideration and try to revise the current implementation. Of course, I will
>> need your help on rstat propagation issue 'cause I have no clue right now
>> and have to dig the mds source code more to understand the existing
>> implementation. :)
>
> Sure.
>
>> A few questions about ceph testing:
>> - When will a subtree be fragmented?
>> - Can I force a subtree to be fragmented to facilitate testing?
>
> By default the load balancer goes every 30 seconds.  You can turn on mds
> 'thrashing' that will export random directories to random nodes (to stress
> test the migration), but that is probably overkill.
>
> It would probably be best to add something to MDS.cc's handle_command that
> lets the admin explicit initiate a subtree migration, via something like
>
>  $ ceph mds tell 0 export_dir /foo/bar 2    # send /foo/bar from mds0 to 2
>
> I just pushed something to do that to unstable.. let me know if you run
> into problems with it.
>

The export_dir command works well and gives us a convenient way to test
multi-mds scenarios. Not surprisingly, our current implementation does not
work in a multi-mds environment... :)

My test setup:
Under the mount point, I created /volume, /volume/aaa, and /volume/bbb.
    mds0 is authoritative for /volume, /volume/aaa.
    mds1 is authoritative for /volume/bbb.
Quota is set on /volume: 250M

Test case 0: pass
cp 100M file to /volume/aaa/a0
cp 100M file to /volume/aaa/a1
cp 100M file to /volume/aaa/a2  ==> quota exceeded error is expected here

Test case 1: pass
cp 100M file to /volume/bbb/b0
cp 100M file to /volume/bbb/b1
cp 100M file to /volume/aaa/a1  ==> quota exceeded error is expected here

Test case 2: failed
cp 100M file to /volume/bbb/b0
cp 100M file to /volume/bbb/b1
cp 100M file to /volume/bbb/b2  ==> quota exceeded error is expected here

It seems that rstat can be propagated up (from mds1 to mds0) quickly (case 1);
however, the ancestor replica (/volume) in mds1 is not updated (case 2).
I wonder how and when the replicas get updated. I'm still digging through the
source code to find out where. :(

Henry


* Re: subdir quotas
  2010-06-04  3:19                 ` Henry C Chang
@ 2010-06-04 16:42                   ` Sage Weil
  0 siblings, 0 replies; 8+ messages in thread
From: Sage Weil @ 2010-06-04 16:42 UTC (permalink / raw)
  To: Henry C Chang; +Cc: ceph-devel

On Fri, 4 Jun 2010, Henry C Chang wrote:
> > preserves the current doubling logic but caps it at some value, so the
> > admin can trade throughput vs quota precision.  And/or we can also have it
> > dynamically reduce that window as the user approaches the limit.
> 
> Yes. But if there are multiple clients writing to one subtree concurrently, 
> it is a little bit difficult to say whether we are approaching the limit; 
> we would need to know how many clients are writing to the same subtree.

I would suggest some sort of recursive 'nested_max_size_diff' accounting 
on each mds that works similarly to the nested_* values in CInode etc.  
Basically, fix up the cache link/unlink methods in CDir to adjust the 
recursive counts (like the anchor and auth_pin counters), and define some 
specific rules like:

 - max_size_diff for any given node is max_size - size (if max_size > size)
 - nested_max_size_diff is max_size_diff plus the sum of the children's 
   nested values, skipping any child that has its own recursive quota set

The accounting can initially be done locally on each mds, which means the 
extent to which clients can exceed their quota before getting shut down 
would increase if the quota subtree spans multiple mds's.  (I don't think 
that's a big deal, personally, but it depends on how strict you want to 
be.  Later we could devise some additional mechanism that allocates 
remaining quota space among nodes the region spans or something.)

Maybe a similar counter can simply count how many open files are 
contributing to that sum, so you can tell a bit more about how that 
available space should be distributed...?
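
For illustration only, the bookkeeping those rules describe might look roughly
like this (plain C with invented types; the real code would live in the C++
CInode/CDir classes and be maintained incrementally on link/unlink rather than
recomputed recursively):

#include <stdint.h>

struct node {
        uint64_t size;              /* current rstat size                */
        uint64_t max_size;          /* outstanding client max_size       */
        int      has_own_quota;     /* subtree has its own quota set     */
        struct node **children;
        int nchildren;
};

/* max_size_diff: how much this node's writers may still grow it. */
static uint64_t max_size_diff(const struct node *n)
{
        return n->max_size > n->size ? n->max_size - n->size : 0;
}

/* nested_max_size_diff: own diff plus the children's nested values,
 * skipping children that are covered by their own quota. */
static uint64_t nested_max_size_diff(const struct node *n)
{
        uint64_t sum = max_size_diff(n);
        int i;

        for (i = 0; i < n->nchildren; i++)
                if (!n->children[i]->has_own_quota)
                        sum += nested_max_size_diff(n->children[i]);
        return sum;
}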

> The export_dir command working well, and gives us a convenient way to test
> multi-mds scenarios. Not surprisingly, our current implementation is not
> working in mult-mds environment... :)
> 
> My test setup:
> Under mount point, I created /volume, /volume/aaa, /volume/bbb.
>     mds0 is authoritative for /volume, /volume/aaa.
>     mds1 is authoritative for /volume/bbb.
> Quota is set on /volume: 250M
> 
> Test case 0: pass
> cp 100M file to /volume/aaa/a0
> cp 100M file to /volume/aaa/a1
> cp 100M file to /volume/aaa/a2  ==> quota exceeded error is expected here
> 
> Test case 1: pass
> cp 100M file to /volume/bbb/b0
> cp 100M file to /volume/bbb/b1
> cp 100M file to /volume/aaa/a1  ==> quota exceeded error is expected here
> 
> Test case 2: failed
> cp 100M file to /volume/bbb/b0
> cp 100M file to /volume/bbb/b1
> cp 100M file to /volume/bbb/b2  ==> quota exceeded error is expected here
> 
> It seems that rstat can be propagated up (from mds1 to mds0) quickly (case 1);
> however, the ancestor replica (/volume) in mds1 is not updated (case 2).
> I wonder how/when the replicas get updated. I'm still digging the source code
> to find where. :(

There is a Locker::scatter_nudge() function that periodically twiddles the 
lock state on the subtree boundary so that the rstat information 
propagates between nodes.  There is an interval in g_conf that controls 
how often that happens...

sage

