From: "Fu, Rodney"
To: Dave Chinner
CC: Matthew Wilcox, "hch@lst.de", "viro@zeniv.linux.org.uk",
    "linux-fsdevel@vger.kernel.org"
Subject: RE: Provision for filesystem specific open flags
Date: Tue, 14 Nov 2017 17:35:50 +0000
References: <20171110172344.GA15288@lst.de>
    <20171110192902.GA10339@bombadil.infradead.org>
    <20171113004855.GV4094@dastard> <20171113215847.GY4094@dastard>
In-Reply-To: <20171113215847.GY4094@dastard>

> The filesystem can still choose to do that for O_DIRECT if it wants -
> look at all the filesystems that have a "fall back to buffered IO
> because this is too hard to implement in the direct I/O path".

Yes, I agree that the filesystem can still decide to buffer IO even with
O_DIRECT, but the application's intent is that the effects of caching are
minimized. With O_CONCURRENT_WRITE, by contrast, the intent is to maximize
caching.

> IOWs, you've got another set of custom userspace APIs that are needed
> to make proper use of this open flag?

Yes and no. Applications can make ioctls to the filesystem to query or
set layout details, but they don't have to. Directory-level default
layout attributes can be set up by an admin to meet the requirements of
the application.

> > In panfs, a well-behaved CONCURRENT_WRITE application will consider
> > the file's layout on storage. Access from different machines will not
> > overlap within the same RAID stripe, so as not to cause distributed
> > stripe lock contention. Writes to the file that are page aligned can
> > be cached, and the filesystem can aggregate multiple such writes
> > before writing out to storage. Conversely, a CONCURRENT_WRITE
> > application that ends up colliding on the same stripe will see worse
> > performance. Non-page-aligned writes are treated by panfs as
> > write-through and non-cacheable, as the filesystem has to assume that
> > the region of the page untouched by this machine might in fact be
> > written to on another machine. Caching such a page and writing it out
> > later might lead to data corruption.

> That seems to fit the expected behaviour of O_DIRECT pretty damn
> closely - if the app doesn't do correctly aligned and sized IO then
> performance is going to suck, and if the app doesn't serialize access
> to the file correctly it can and will corrupt data in the file....

I make the same case as above: O_DIRECT and O_CONCURRENT_WRITE have
opposite intents with respect to caching. Our filesystem handles them
differently, so we need to distinguish between the two.

> > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the
> > application does not have to implement any caching to see good
> > performance.

> Sure, but it has to be aware of layout and where/how it can write,
> which is exactly the same constraint that local filesystems place on
> O_DIRECT access.
>
> Not convinced. The use case fits pretty neatly into expected O_DIRECT
> semantics and behaviour, IMO.
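To make the expected access pattern concrete, here is a minimal userspace
sketch. O_CONCURRENT_WRITE is not an existing flag, so the value below is
a placeholder, and the stripe geometry, rank, and file path are made up
for illustration:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_CONCURRENT_WRITE
#define O_CONCURRENT_WRITE 040000000	/* placeholder bit, not a real flag */
#endif

enum { PAGE_SZ = 4096, STRIPE_UNIT = 1 << 20 };	/* assumed 1 MiB stripe unit */

int main(void)
{
	int node = 3;	/* this machine's rank in the job; illustrative */
	int fd = open("/panfs/shared.dat",
		      O_WRONLY | O_CREAT | O_CONCURRENT_WRITE, 0644);
	if (fd < 0)
		return 1;

	/* Page-aligned buffer and size: panfs can cache and aggregate
	 * such writes instead of treating them as write-through. */
	void *buf;
	if (posix_memalign(&buf, PAGE_SZ, STRIPE_UNIT))
		return 1;
	memset(buf, 0, STRIPE_UNIT);

	/* Each machine stays inside its own stripe-aligned region, so
	 * no two machines contend for the same RAID stripe lock. */
	ssize_t n = pwrite(fd, buf, STRIPE_UNIT, (off_t)node * STRIPE_UNIT);

	free(buf);
	close(fd);
	return n == (ssize_t)STRIPE_UNIT ? 0 : 1;
}

The point is that the application only has to respect alignment and
layout; the caching itself stays in the filesystem, which is the opposite
of what O_DIRECT asks for.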
I'd like to make a slight adjustment to my proposal. The HPC community
has discussed POSIX extensions that include O_LAZY as a way for
filesystems to relax data coherency requirements. There is code in the
ceph filesystem that uses that flag if it is defined. Can we get O_LAZY
defined?

HEC POSIX extension:
http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf

Ceph usage of O_LAZY:
https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78

Regards,
Rodney
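P.S. For reference, ceph's handling in ceph_flags_to_mode() is compiled
in only when the platform defines the flag; roughly (paraphrasing the
file linked above, so treat this as approximate):

#ifdef O_LAZY
	if (flags & O_LAZY)
		mode |= CEPH_FILE_MODE_LAZY;
#endif

so defining O_LAZY costs nothing on filesystems that never look at it.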