From: Jeff Layton
Subject: Re: deprecating inline_data support for CephFS
Date: Fri, 16 Aug 2019 09:27:09 -0400
To: Jonas Jelten, ceph-users
Cc: Ceph Development, dev-a8pt6IJUokc@public.gmane.org
List-Id: ceph-devel.vger.kernel.org

On Fri, 2019-08-16 at 14:12 +0200, Jonas Jelten wrote:
> Hi!
>
> I've missed your previous post, but we do have inline_data enabled on
> our cluster. We've not yet benchmarked, but the filesystem has a wide
> variety of file sizes, and it sounded like a good idea to speed up
> performance. We mount it with the kernel client only, and I've had the
> subjective impression that latency was better once we enabled the
> feature. Now that you say the kernel client has no write support for
> it, my impression is probably wrong.
>
> I think inline_data is a nice and easy way to improve performance when
> the CephFS metadata are on SSDs but the bulk data is on HDDs. So I'd
> vote against removal and would instead vouch for improvements to this
> feature :)
>
> If storage on the MDS is a problem, files could be stored on a
> different (e.g. SSD) pool instead, and the file size limit and pool
> selection could be configured via xattrs. And there was some idea to
> store small objects not in the OSD block, but only in the OSD's DB
> (which is more complicated to use than separate SSD and HDD pools, but
> when block.db is on an SSD the speed would be better). Maybe this
> could all be combined to have better small-file performance in CephFS!
>

The main problem is developer time and the maintenance burden this
feature represents. This is very much a non-trivial thing to implement.

Consider that the read() and write() codepaths in the kernel already
have 3 main branches each:

  - buffered I/O (when Fcb caps are held)
  - synchronous I/O (when Fcb caps are not held)
  - O_DIRECT I/O

We could probably consolidate the O_DIRECT and sync I/O code somewhat,
but buffered I/O is handled entirely differently. Once we mix in
inline_data support, we have to add a completely new branch for each of
those cases, effectively doubling the complexity (roughly sketched
below). We'd also need to add similar handling for mmap'ed I/O and for
things like copy_file_range.

But even before that, I have some real concerns about the existing
handling, even with a single client. While I haven't attempted to roll
a testcase for it, I think we can probably hit races where multiple
tasks handling write page faults compete to uninline the data,
potentially clobbering each other's writes (again, roughly sketched
below). This too is non-trivial to fix.

In summary, I don't see a real future for this feature unless someone
wants to step up to own it and commit to fixing up these problems.
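To make the branching a bit more concrete, here's a rough userspace
sketch of the shape of the write-side dispatch. To be clear, this is
not the actual fs/ceph code; the names (choose_write_path, have_Fcb and
so on) are made up for illustration, and the real paths are far
messier:

/*
 * Rough sketch only: not the real client code, just the shape of the
 * three-way branching described above.
 */
#include <stdbool.h>
#include <stdio.h>

enum io_path {
	IO_BUFFERED,	/* Fcb caps held: write through the page cache */
	IO_SYNC,	/* no Fcb caps: synchronous writes to the OSDs */
	IO_DIRECT,	/* O_DIRECT: bypass the page cache entirely */
};

static enum io_path choose_write_path(bool o_direct, bool have_Fcb)
{
	if (o_direct)
		return IO_DIRECT;
	if (have_Fcb)
		return IO_BUFFERED;
	return IO_SYNC;
}

int main(void)
{
	/*
	 * With inline_data in the picture, each of these three arms also
	 * has to ask "is the data still inline?" and either handle the
	 * inline case or uninline first, and the read side, mmap and
	 * copy_file_range all need the same treatment. That's where the
	 * doubling of complexity comes from.
	 */
	printf("O_DIRECT write      -> path %d\n", choose_write_path(true, false));
	printf("write with Fcb caps -> path %d\n", choose_write_path(false, true));
	printf("write without Fcb   -> path %d\n", choose_write_path(false, false));
	return 0;
}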
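And here's a contrived pthreads analogy for the uninlining race I'm
worried about. Nothing below is kernel code; inline_version,
fault_writer and friends are invented for the example, and
CEPH_INLINE_NONE is only being imitated. The point is just the classic
check-then-act pattern: two tasks can both see the file as still
inline, both copy the inline data out to the backing object, and the
later copy can wipe out whatever the earlier task wrote.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define UNINLINED ((unsigned long)-1)	/* plays the CEPH_INLINE_NONE role */

static char object_data[64];		/* stands in for the backing object */
static unsigned long inline_version = 1;	/* nonzero: data still inline */
static char inline_buf[64] = "small inline file contents";
/* static pthread_mutex_t uninline_lock = PTHREAD_MUTEX_INITIALIZER; */

static void *fault_writer(void *arg)
{
	const char *my_write = arg;

	/* pthread_mutex_lock(&uninline_lock); -- without this, what follows races */
	if (inline_version != UNINLINED) {		/* check ... */
		strcpy(object_data, inline_buf);	/* ... then act: uninline */
		inline_version = UNINLINED;
	}
	strncat(object_data, my_write,
		sizeof(object_data) - strlen(object_data) - 1);
	/* pthread_mutex_unlock(&uninline_lock); */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, fault_writer, " A");
	pthread_create(&b, NULL, fault_writer, " B");
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/*
	 * If both threads saw inline_version != UNINLINED, the second
	 * strcpy() can wipe out the first thread's appended write.
	 */
	printf("object ends up as: \"%s\"\n", object_data);
	return 0;
}

In the toy, holding the commented-out mutex across the check-and-copy
closes the window; doing the equivalent properly in the client, and
keeping it correct across all the I/O paths above, is the part that's
hard.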
> On 16/08/2019 13.15, Jeff Layton wrote:
> > A couple of weeks ago, I sent a request to the mailing list asking
> > whether anyone was using the inline_data support in cephfs:
> >
> > https://docs.ceph.com/docs/mimic/cephfs/experimental-features/#inline-data
> >
> > I got exactly zero responses, so I'm going to formally propose that
> > we move to start deprecating this feature for Octopus.
> >
> > Why deprecate this feature?
> > ===========================
> > While the userland clients have support for both reading and writing,
> > the kernel only has support for reading, and aggressively uninlines
> > everything as soon as it needs to do any writing. That uninlining has
> > some rather nasty potential race conditions too that could cause data
> > corruption.
> >
> > We could work to fix this, and maybe add write support for the
> > kernel, but it adds a lot of complexity to the read and write
> > codepaths in the clients, which are already pretty complex. Given
> > that there isn't a lot of interest in this feature, I think we ought
> > to just pull the plug on it.
> >
> > How should we do this?
> > ======================
> > We should start by disabling this feature in master for Octopus.
> >
> > In particular, we should stop allowing users to call "fs set
> > inline_data true" on filesystems where it's disabled, and maybe throw
> > a loud warning about the feature being deprecated if the mds is
> > started on a filesystem that has it enabled.
> >
> > We could also consider creating a utility to crawl an existing
> > filesystem and uninline anything there, if there was need for it.
> >
> > Then, in a few release cycles, once we're past the point where
> > someone can upgrade directly from Nautilus (release Q or R?), we'd
> > rip out support for this feature entirely.
> >
> > Thoughts, comments, questions welcome.

-- 
Jeff Layton