* deprecating inline_data support for CephFS
From: Jeff Layton @ 2019-08-16 11:15 UTC
  To: ceph-users; +Cc: Ceph Development, dev-a8pt6IJUokc

A couple of weeks ago, I sent a request to the mailing list asking
whether anyone was using the inline_data support in cephfs:

    https://docs.ceph.com/docs/mimic/cephfs/experimental-features/#inline-data

I got exactly zero responses, so I'm going to formally propose that we
start deprecating this feature for Octopus.

Why deprecate this feature?
===========================
While the userland clients have support for both reading and writing,
the kernel client only supports reading, and it aggressively uninlines
everything as soon as it needs to do any writing. That uninlining also
has some rather nasty potential race conditions that could cause data
corruption.

We could work to fix this, and maybe add write support for the kernel,
but it adds a lot of complexity to the read and write codepaths in the
clients, which are already pretty complex. Given that there isn't a lot
of interest in this feature, I think we ought to just pull the plug on
it.

How should we do this?
======================
We should start by disabling this feature in master for Octopus. 

In particular, we should stop allowing users to run "fs set inline_data
true" on filesystems where it's currently disabled, and maybe throw a
loud warning about the feature being deprecated if the MDS is started on
a filesystem that has it enabled.
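
For reference, this is roughly the knob in question (the filesystem name
is just an example, and some releases may also want an extra confirmation
flag when enabling it):

    # enable inline data on a filesystem (the operation we'd start refusing)
    ceph fs set cephfs inline_data true

    # turn it back off; existing inline data presumably stays inline
    # until it's rewritten, hence the crawler idea below
    ceph fs set cephfs inline_data false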

We could also consider creating a utility to crawl an existing
filesystem and uninline anything there, if there is a need for it.
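
Something fairly dumb might even do, given that any write through the
kernel client already forces an uninline. A rough sketch, not an existing
tool; the mount point and size cutoff here are made up:

    # Rewrite the first byte of each small file in place through a kernel
    # mount; the client's write path should then uninline the data out to
    # RADOS objects. Note: this bumps mtime on every file it touches.
    find /mnt/cephfs -type f -size -4k -print0 |
    while IFS= read -r -d '' f; do
        head -c1 -- "$f" | dd of="$f" bs=1 count=1 conv=notrunc status=none
    done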

Then, in a few release cycles, once we're past the point where someone
can upgrade directly from Nautilus (release Q or R?), we'd rip out
support for this feature entirely.

Thoughts, comments, questions welcome.
-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

* Re: deprecating inline_data support for CephFS
From: Jonas Jelten @ 2019-08-16 12:12 UTC
  To: Jeff Layton, ceph-users; +Cc: Ceph Development, dev-a8pt6IJUokc

Hi!

I missed your previous post, but we do have inline_data enabled on our cluster.
We haven't benchmarked it yet, but the filesystem has a wide variety of file sizes, and inlining sounded like a good
way to improve performance. We mount it with the kernel client only, and I had the subjective impression that latency
improved once we enabled the feature. Now that you say the kernel client has no write support for it, that impression
is probably wrong.

I think inline_data is a nice and easy way to improve performance when the CephFS metadata are on SSDs but the bulk
data is on HDDs. So I'd vote against removal and would instead advocate improving this feature :)

If storage on the MDS is a problem, small files could instead be stored in a different (e.g. SSD-backed) pool, with
the size limit and pool selection configured via xattrs. There was also an idea to store small objects not in the
OSD's block device but only in the OSD's DB (more complicated to set up than separate SSD and HDD pools, but faster
when block.db is on an SSD). Maybe all of this could be combined for better small-file performance in CephFS!
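
The pool-selection half of that can already be approximated with file layouts today; the size-based routing would be
the new part. Roughly, assuming the SSD-backed pool already exists (pool and mount names are just examples):

    # add an SSD-backed data pool to the filesystem
    ceph fs add_data_pool cephfs cephfs-data-ssd

    # new files created under this directory then go to that pool
    setfattr -n ceph.dir.layout.pool -v cephfs-data-ssd /mnt/cephfs/small-files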

-- Jonas



* Re: deprecating inline_data support for CephFS
From: Jeff Layton @ 2019-08-16 13:27 UTC
  To: Jonas Jelten, ceph-users; +Cc: Ceph Development, dev-a8pt6IJUokc

On Fri, 2019-08-16 at 14:12 +0200, Jonas Jelten wrote:
> Hi!
> 
> I missed your previous post, but we do have inline_data enabled on our cluster.
> We haven't benchmarked it yet, but the filesystem has a wide variety of file sizes, and inlining sounded like a good
> way to improve performance. We mount it with the kernel client only, and I had the subjective impression that latency
> improved once we enabled the feature. Now that you say the kernel client has no write support for it, that impression
> is probably wrong.
>
> I think inline_data is a nice and easy way to improve performance when the CephFS metadata are on SSDs but the bulk
> data is on HDDs. So I'd vote against removal and would instead advocate improving this feature :)
> 
> If storage on the MDS is a problem, small files could instead be stored in a different (e.g. SSD-backed) pool, with
> the size limit and pool selection configured via xattrs. There was also an idea to store small objects not in the
> OSD's block device but only in the OSD's DB (more complicated to set up than separate SSD and HDD pools, but faster
> when block.db is on an SSD). Maybe all of this could be combined for better small-file performance in CephFS!
> 

The main problem is developer time and the maintenance burden this
feature represents. This is very much a non-trivial thing to implement.
Consider that the read() and write() codepaths in the kernel already
have 3 main branches each:

    - buffered I/O (when Fcb caps are held)
    - synchronous I/O (when Fcb caps are not held)
    - O_DIRECT I/O

We could probably consolidate the O_DIRECT and sync I/O code somewhat,
but buffered is handled entirely differently. Once we mix in inline_data
support, we have to add a completely new branch for each of those cases,
effectively doubling the complexity.

We'd also need to add similar handling for mmap'ed I/O and for things
like copy_file_range.

But, even before that...I have some real concerns about the existing
handling, even with a single client.

While I haven't attempted to roll a testcase for it, I think we can
probably hit races where multiple tasks handling write page faults
compete to uninline the data, potentially clobbering each other's
writes. Again, this is non-trivial to fix.

In summary, I don't see a real future for this feature unless someone
steps up to own it and commits to fixing these problems.


-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>