* libcephfs create file with layout and replication
@ 2012-11-17 20:13 Noah Watkins
  2012-11-17 21:35 ` Josh Durgin
  2012-11-17 23:23 ` Sage Weil
  0 siblings, 2 replies; 11+ messages in thread
From: Noah Watkins @ 2012-11-17 20:13 UTC (permalink / raw)
  To: ceph-devel; +Cc: Sage Weil

The Hadoop VFS layer assumes that block size and replication can be
set on a per-file basis, which is important to users for file
layout/workload optimizations.

The libcephfs interface doesn't make this entirely easy. Here is one
approach, but it isn't thread safe as the default values are global
variables in the client.

  orig_obj_size = ceph_get_default_object_size() //save
  set_default_object_size(new size)
  open(path, O_CREAT)
  set_default_object_size(orig_obj_size) //reset
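
To make the thread-safety problem concrete, here is a small sketch that replays one bad interleaving of two callers in a single thread. It does not use libcephfs at all: `default_object_size`, `mock_create`, and `replay_interleaving` are hypothetical stand-ins, and the 4 MiB starting default is an assumption chosen only for illustration.

```c
/* Hypothetical stand-in for the client's process-wide default object size. */
static int default_object_size = 4 << 20;

/* Mock create: a new file takes whatever the global default is right now. */
static int mock_create(void)
{
    return default_object_size;
}

struct race_result {
    int a_file;        /* object size thread A's file ended up with */
    int b_file;        /* object size thread B's file ended up with */
    int final_default; /* global default left behind afterwards */
};

/*
 * Replays, in a single thread, one possible interleaving of two threads
 * A and B that each run save / set / create / restore without locking.
 */
static struct race_result replay_interleaving(int size_a, int size_b)
{
    struct race_result r;

    int saved_a = default_object_size; /* A: save original default */
    default_object_size = size_a;      /* A: set its size */
    int saved_b = default_object_size; /* B: save (captures A's size) */
    default_object_size = size_b;      /* B: set its size */
    r.a_file = mock_create();          /* A: create, but gets B's size */
    default_object_size = saved_a;     /* A: restore original default */
    r.b_file = mock_create();          /* B: create, gets original default */
    default_object_size = saved_b;     /* B: "restore" installs A's size */

    r.final_default = default_object_size;
    return r;
}
```

In this interleaving neither file gets the size its creator asked for, and the global default is left changed afterwards, which is why passing the values with the create call itself is attractive.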

Something more convenient might be:

  ceph_open_layout(path, flags, mode, layout, replication)

where layout and replication are used with O_CREAT | O_EXCL, or an
interface for setting these values explicitly on newly created files:

  ceph_open(path, O_CREAT|O_EXCL)
  ceph_set_layout(path, layout, replication)

where ceph_set_layout would presumably succeed only on zero-length files.

Any thoughts on how to handle this?

Thanks,
Noah


* Re: libcephfs create file with layout and replication
  2012-11-17 20:13 libcephfs create file with layout and replication Noah Watkins
@ 2012-11-17 21:35 ` Josh Durgin
  2012-11-17 23:23 ` Sage Weil
  1 sibling, 0 replies; 11+ messages in thread
From: Josh Durgin @ 2012-11-17 21:35 UTC (permalink / raw)
  To: Noah Watkins; +Cc: ceph-devel, Sage Weil

On 11/17/2012 12:13 PM, Noah Watkins wrote:
> The Hadoop VFS layer assumes that block size and replication can be
> set on a per-file basis, which is important to users for file
> layout/workload optimizations.
>
> The libcephfs interface doesn't make this entirely easy. Here is one
> approach, but it isn't thread safe as the default values are global
> variables in the client.
>
>    orig_obj_size = ceph_get_default_object_size() //save
>    set_default_object_size(new size)
>    open(path, O_CREAT)
>    set_default_object_size(orig_obj_size) //reset
>
> Something more convenient might be:
>
>    ceph_open_layout(path, flags, mode, layout, replication)

I think this makes the most sense: a file's layout can't be changed
after the file has been created, and this interface makes that
clearest. It also avoids maintaining extra state in libcephfs between
calls.

Since replication count is a per-pool setting, I think the hadoop
bindings would have to translate from a vfs request to a pool
with the requested replication level. So something like this,
where layout is a struct containing stripe unit, stripe count,
and object size (the subset of struct ceph_file_layout related to
objects that's useful currently):

     ceph_open_layout(path, flags, mode, layout, pool_name)
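
As a rough sketch of the translation step described above, the bindings could keep a table of known pools and pick one whose replication level matches the request. Everything here is hypothetical: `struct example_pool`, `pool_for_replication`, and the idea that the bindings already know each pool's replication level are assumptions for illustration, not part of libcephfs.

```c
#include <stddef.h>

/* Hypothetical description of a pool known to the bindings. */
struct example_pool {
    const char *name;  /* pool name to pass to the open call */
    int replication;   /* replication level configured on that pool */
};

/*
 * Returns the name of the first pool whose replication level matches
 * the request, or NULL when no configured pool matches.
 */
static const char *pool_for_replication(const struct example_pool *pools,
                                        size_t npools, int replication)
{
    for (size_t i = 0; i < npools; i++) {
        if (pools[i].replication == replication)
            return pools[i].name;
    }
    return NULL;
}
```

The returned name would then be passed as the pool_name argument; what to do when nothing matches (fail the create, or fall back to a default pool) is a policy choice left open here.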

BTW, for anyone interested, there's a nice description of
the layout parameters here:

http://ceph.com/docs/master/dev/file-striping/

> where layout and replication are used with O_CREAT | O_EXCL, or an
> interface for setting these values explicitly on newly created files:
>
>    ceph_open(path, O_CREAT|O_EXCL)
>    ceph_set_layout(path, layout, replication)
>
> where ceph_set_layout would succeed ostensibly on zero-length files.
>
> Any thoughts on how to handle this?
>
> Thanks,
> Noah



* Re: libcephfs create file with layout and replication
  2012-11-17 20:13 libcephfs create file with layout and replication Noah Watkins
  2012-11-17 21:35 ` Josh Durgin
@ 2012-11-17 23:23 ` Sage Weil
  2012-11-17 23:58   ` Noah Watkins
  1 sibling, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-11-17 23:23 UTC (permalink / raw)
  To: Noah Watkins; +Cc: ceph-devel

On Sat, 17 Nov 2012, Noah Watkins wrote:
> The Hadoop VFS layer assumes that block size and replication can be
> set on a per-file basis, which is important to users for file
> layout/workload optimizations.
> 
> The libcephfs interface doesn't make this entirely easy. Here is one
> approach, but it isn't thread safe as the default values are global
> variables in the client.
> 
>   orig_obj_size = ceph_get_default_object_size() //save
>   set_default_object_size(new size)
>   open(path, O_CREAT)
>   set_default_object_size(orig_obj_size) //reset
> 
> Something more convenient might be:
> 
>   ceph_open_layout(path, flags, mode, layout, replication)
> 
> where layout and replication are used with O_CREAT | O_EXCL, or an
> interface for setting these values explicitly on newly created files:
> 
>   ceph_open(path, O_CREAT|O_EXCL)
>   ceph_set_layout(path, layout, replication)

This is basically what we have now... at least that's how things work for 
the kernel client.  We should make sure there is a clean way via libcephfs 
to do that.

The client/mds protocol also allows you to specify the layout on file 
creation.  This is better since it has one less round trip to the MDS.  
Let's just create a new open call with those additional arguments.

FWIW, the striping parameters are object size, stripe unit, stripe count, 
and data pool.
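
For readers unfamiliar with how the first three parameters interact, the following sketch maps a file offset to an object number and an offset within that object, following the usual Ceph striping arithmetic. The names `struct example_extent` and `file_offset_to_object` are hypothetical; the data pool does not enter the arithmetic, it only selects where the resulting objects are stored. The sketch assumes a valid layout (non-zero values, object size a multiple of stripe unit).

```c
#include <stdint.h>

/* Hypothetical result type: which object, and where inside it. */
struct example_extent {
    uint64_t object_no;   /* index of the object within the file */
    uint64_t object_off;  /* byte offset inside that object */
};

/*
 * Maps a byte offset in the file to an object and an offset in it.
 * su = stripe unit, sc = stripe count, os = object size (all in bytes,
 * except sc which is a count). Caller must pass a valid layout.
 */
static struct example_extent file_offset_to_object(uint64_t off, uint32_t su,
                                                   uint32_t sc, uint32_t os)
{
    uint64_t stripes_per_object = os / su;  /* stripe units per object */
    uint64_t blockno = off / su;            /* which stripe unit overall */
    uint64_t stripeno = blockno / sc;       /* which stripe (row) */
    uint64_t stripepos = blockno % sc;      /* which column in the row */
    uint64_t objectsetno = stripeno / stripes_per_object;

    struct example_extent e;
    e.object_no = objectsetno * sc + stripepos;
    e.object_off = (stripeno % stripes_per_object) * su + off % su;
    return e;
}
```

With a 1 MiB stripe unit, a stripe count of 2, and 4 MiB objects, consecutive 1 MiB blocks alternate between objects 0 and 1 until both are full (8 MiB of file data), after which the file continues in objects 2 and 3.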

sage



> 
> where ceph_set_layout would succeed ostensibly on zero-length files.
> 
> Any thoughts on how to handle this?
> 
> Thanks,
> Noah


* Re: libcephfs create file with layout and replication
  2012-11-17 23:23 ` Sage Weil
@ 2012-11-17 23:58   ` Noah Watkins
  2012-11-18  0:15     ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Noah Watkins @ 2012-11-17 23:58 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Sat, Nov 17, 2012 at 3:23 PM, Sage Weil <sage@inktank.com> wrote:
> On Sat, 17 Nov 2012, Noah Watkins wrote:
>
> FWIW, the striping parameters are object size, stripe unit, stripe count,
> and data pool.

In ceph_mds_request_args.open I see all the striping parameters
except data pool, and I don't see any places that the file_replication
parameter is being used. Should a pg_pool field be added?

-Noah


* Re: libcephfs create file with layout and replication
  2012-11-17 23:58   ` Noah Watkins
@ 2012-11-18  0:15     ` Sage Weil
  2012-11-18  1:20       ` Noah Watkins
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-11-18  0:15 UTC (permalink / raw)
  To: Noah Watkins; +Cc: ceph-devel

On Sat, 17 Nov 2012, Noah Watkins wrote:
> On Sat, Nov 17, 2012 at 3:23 PM, Sage Weil <sage@inktank.com> wrote:
> > On Sat, 17 Nov 2012, Noah Watkins wrote:
> >
> > FWIW, the striping parameters are object size, stripe unit, stripe count,
> > and data pool.
> 
> In ceph_mds_request_args.open I see all the striping parameters
> except data pool, and I don't see any places that the file_replication
> parameter is being used. Should a pg_pool field be added?

Yeah, I think this bit needs to be fixed in the on-wire protocol.  That 
is a delicate fix.

We ignore that for the purposes of getting the libcephfs API correct, 
though...

sage


* Re: libcephfs create file with layout and replication
  2012-11-18  0:15     ` Sage Weil
@ 2012-11-18  1:20       ` Noah Watkins
  2012-11-18 20:05         ` Noah Watkins
  0 siblings, 1 reply; 11+ messages in thread
From: Noah Watkins @ 2012-11-18  1:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Sat, Nov 17, 2012 at 4:15 PM, Sage Weil <sage@inktank.com> wrote:
>
> We ignore that for the purposes of getting the libcephfs API correct,
> though...

OK, makes sense. Thanks.

Noah


* Re: libcephfs create file with layout and replication
  2012-11-18  1:20       ` Noah Watkins
@ 2012-11-18 20:05         ` Noah Watkins
  2012-11-20  1:04           ` Gregory Farnum
  0 siblings, 1 reply; 11+ messages in thread
From: Noah Watkins @ 2012-11-18 20:05 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Wanna have a look at a first pass on this patch?

   wip-client-open-layout

Thanks,
Noah

On Sat, Nov 17, 2012 at 5:20 PM, Noah Watkins <jayhawk@cs.ucsc.edu> wrote:
> On Sat, Nov 17, 2012 at 4:15 PM, Sage Weil <sage@inktank.com> wrote:
>>
>> We ignore that for the purposes of getting the libcephfs API correct,
>> though...
>
> OK, makes sense. Thanks.
>
> Noah


* Re: libcephfs create file with layout and replication
  2012-11-18 20:05         ` Noah Watkins
@ 2012-11-20  1:04           ` Gregory Farnum
  2012-11-20  2:48             ` Noah Watkins
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2012-11-20  1:04 UTC (permalink / raw)
  To: Noah Watkins; +Cc: Sage Weil, ceph-devel

On Sun, Nov 18, 2012 at 12:05 PM, Noah Watkins <jayhawk@cs.ucsc.edu> wrote:
> Wanna have a look at a first pass on this patch?
>
>    wip-client-open-layout
>
> Thanks,
> Noah

Just glanced over this, and I'm curious:
1) Why symlink another reference to your file_layout.h?
2) There's already a ceph_file_layout struct which is used "widely"
(MDS, kernel, userspace client). It also has an accompanying function
that does basic validity checks.
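
For context, a basic validity check on a layout might look roughly like the sketch below. This is an illustration of the kind of checks involved, not a copy of the existing function: the name `example_layout_is_valid`, the plain integer arguments, and the 64 KiB granularity constant are assumptions.

```c
#include <stdint.h>

/* Assumed minimum granularity for stripe unit and object size. */
#define EXAMPLE_MIN_STRIPE_UNIT 65536u

/*
 * Returns 1 if the layout looks usable, 0 otherwise.
 * su = stripe unit, sc = stripe count, os = object size.
 */
static int example_layout_is_valid(uint32_t su, uint32_t sc, uint32_t os)
{
    /* stripe unit must be non-zero and a multiple of the granularity */
    if (su == 0 || su % EXAMPLE_MIN_STRIPE_UNIT != 0)
        return 0;
    /* object size must be non-zero and a multiple of the granularity */
    if (os == 0 || os % EXAMPLE_MIN_STRIPE_UNIT != 0)
        return 0;
    /* an object must hold a whole number of stripe units */
    if (os < su || os % su != 0)
        return 0;
    /* at least one object per stripe */
    if (sc == 0)
        return 0;
    return 1;
}
```

Running a check like this before sending a create request lets the client reject an unusable layout locally.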


> On Sat, Nov 17, 2012 at 5:20 PM, Noah Watkins <jayhawk@cs.ucsc.edu> wrote:
>> On Sat, Nov 17, 2012 at 4:15 PM, Sage Weil <sage@inktank.com> wrote:
>>>
>>> We ignore that for the purposes of getting the libcephfs API correct,
>>> though...
>>
>> OK, makes sense.
>>
>> Noah

FYI, there's an "unused" __le32 in the open struct (used to be for
preferred PG). We should be able to steal that away without too much
pain or massaging! :)
-Greg


* Re: libcephfs create file with layout and replication
  2012-11-20  1:04           ` Gregory Farnum
@ 2012-11-20  2:48             ` Noah Watkins
  2012-11-20  3:28               ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Noah Watkins @ 2012-11-20  2:48 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel

On Mon, Nov 19, 2012 at 5:04 PM, Gregory Farnum <greg@inktank.com> wrote:
>
> Just glanced over this, and I'm curious:
> 1) Why symlink another reference to your file_layout.h?

I followed the same pattern as page.h in librados, but may have
misunderstood its use. When libcephfs.h is installed, it includes

  #include "file_layout.h"

and we assume the user has -Iprefix/cephfs/. In the build tree, however,
include/cephfs isn't on the include path, hence the symlink.

> 2) There's already a ceph_file_layout struct which is used "widely"
> (MDS, kernel, userspace client). It also has an accompanying function
> that does basic validity checks.

I avoided ceph_file_layout because I was under the impression that all
of the __le64 stuff in it was very much Linux-specific. I had run into
a lot of this hacking on an OSX port.
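
As an aside on the portability concern, fixed little-endian fields can be read and written without any Linux-specific types by working on bytes directly. The helpers below are a generic sketch in standard C; `put_le32` and `get_le32` are hypothetical names and are not taken from the Ceph tree.

```c
#include <stdint.h>

/* Stores v into out[0..3] in little-endian order on any host. */
static void put_le32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)((v >> 8) & 0xff);
    out[2] = (uint8_t)((v >> 16) & 0xff);
    out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Reads a little-endian 32-bit value from in[0..3] on any host. */
static uint32_t get_le32(const uint8_t in[4])
{
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}
```

Because the byte order is fixed by the shifts rather than by the host, the same code behaves identically on Linux, OS X, or a big-endian machine.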

> FYI, there's an "unused" __le32 in the open struct (used to be for
> preferred PG). We should be able to steal that away without too much
> pain or massaging! :)

Nice. Do you think I should revert back to using ceph_file_layout?

Thanks,
Noah


* Re: libcephfs create file with layout and replication
  2012-11-20  2:48             ` Noah Watkins
@ 2012-11-20  3:28               ` Sage Weil
  2012-11-20 21:59                 ` Noah Watkins
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-11-20  3:28 UTC (permalink / raw)
  To: Noah Watkins; +Cc: Gregory Farnum, ceph-devel

On Mon, 19 Nov 2012, Noah Watkins wrote:
> On Mon, Nov 19, 2012 at 5:04 PM, Gregory Farnum <greg@inktank.com> wrote:
> >
> > Just glanced over this, and I'm curious:
> > 1) Why symlink another reference to your file_layout.h?
> 
> I followed the same pattern as page.h in librados, but may have
> misunderstood its use. When libcephfs.h is installed, it includes
> 
>   #include "file_layout.h"
> 
> and we assume the user has -Iprefix/cephfs/. In the build tree, however,
> include/cephfs isn't on the include path, hence the symlink.
> 
> > 2) There's already a ceph_file_layout struct which is used "widely"
> > (MDS, kernel, userspace client). It also has an accompanying function
> > that does basic validity checks.
> 
> I avoided ceph_file_layout because I was under the impression that all
> of the __le64 stuff in it was very much Linux-specific. I had run into
> a lot of this hacking on an OSX port.
> 
> > FYI, there's an "unused" __le32 in the open struct (used to be for
> > preferred PG). We should be able to steal that away without too much
> > pain or massaging! :)
> 
> Nice. Do you think I should revert back to using ceph_file_layout?

We could avoid the whole issue by passing 4 arguments to the function...


* Re: libcephfs create file with layout and replication
  2012-11-20  3:28               ` Sage Weil
@ 2012-11-20 21:59                 ` Noah Watkins
  0 siblings, 0 replies; 11+ messages in thread
From: Noah Watkins @ 2012-11-20 21:59 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel

On Mon, Nov 19, 2012 at 7:28 PM, Sage Weil <sage@inktank.com> wrote:
>
> We could avoid the whole issue by passing 4 arguments to the function...

I pushed a new patch that takes each of the 4 new arguments.

  wip-client-open-layout
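
The thread does not show the patch itself, so the following is only a sketch of what an open call taking the four layout values as plain arguments could look like. `example_open_layout` and `struct example_create_request` are hypothetical, and treating 0 (or a NULL pool) as "use the default" is an assumption. The mock just packs everything into one request to illustrate that the layout travels with the create, with no second round trip.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical single request carrying both the open and the layout. */
struct example_create_request {
    const char *path;
    int flags;
    mode_t mode;
    int stripe_unit;       /* 0 assumed to mean "use default" */
    int stripe_count;      /* 0 assumed to mean "use default" */
    int object_size;       /* 0 assumed to mean "use default" */
    const char *data_pool; /* NULL assumed to mean "use default" */
    int applies_layout;    /* layout only matters when creating */
};

/* Fills req from the arguments; returns 0 or a negative errno value. */
static int example_open_layout(struct example_create_request *req,
                               const char *path, int flags, mode_t mode,
                               int stripe_unit, int stripe_count,
                               int object_size, const char *data_pool)
{
    if (req == NULL || path == NULL)
        return -EINVAL;
    if (stripe_unit < 0 || stripe_count < 0 || object_size < 0)
        return -EINVAL;

    req->path = path;
    req->flags = flags;
    req->mode = mode;
    req->stripe_unit = stripe_unit;
    req->stripe_count = stripe_count;
    req->object_size = object_size;
    req->data_pool = data_pool;
    req->applies_layout = (flags & O_CREAT) != 0;
    return 0;
}
```

Passing four scalars sidesteps the question of which layout struct to expose in the public header, at the cost of a longer argument list.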

Thanks,
-Noah
