From: NeilBrown
To: "Fu, Rodney", Dave Chinner
Cc: Matthew Wilcox, hch@lst.de, viro@zeniv.linux.org.uk, linux-fsdevel@vger.kernel.org
Date: Mon, 04 Dec 2017 16:29:51 +1100
Subject: RE: Provision for filesystem specific open flags
Message-ID: <87h8t7t60g.fsf@notabene.neil.brown.name>
References: <20171110172344.GA15288@lst.de> <20171110192902.GA10339@bombadil.infradead.org> <20171113004855.GV4094@dastard> <20171113215847.GY4094@dastard>

On Tue, Nov 14 2017, Fu, Rodney wrote:

>> The filesystem can still choose to do that for O_DIRECT if it wants - look at
>> all the filesystems that have a "fall back to buffered IO because this is too
>> hard to implement in the direct IO path".
>
> Yes, I agree that the filesystem can still decide to buffer IO even with
> O_DIRECT, but the application's intent is that the effects of caching are
> minimized.  Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching.
>
>> IOWs, you've got another set of custom userspace APIs that are needed to make
>> proper use of this open flag?
>
> Yes and no.  Applications can make ioctls to the filesystem to query or set
> layout details but don't have to.  Directory level default layout attributes can
> be set up by an admin to meet the requirements of the application.
>
>> > In panfs, a well behaved CONCURRENT_WRITE application will consider
>> > the file's layout on storage.
>> > Access from different machines will not
>> > overlap within the same RAID stripe so as not to cause distributed
>> > stripe lock contention.  Writes to the file that are page aligned can
>> > be cached and the filesystem can aggregate multiple such writes before
>> > writing out to storage.  Conversely, a CONCURRENT_WRITE application
>> > that ends up colliding on the same stripe will see worse performance.
>> > Non page aligned writes are treated by panfs as write-through and
>> > non-cachable, as the filesystem will have to assume that the region of
>> > the page that is untouched by this machine might in fact be written to
>> > on another machine.  Caching such a page and writing it out later might
>> > lead to data corruption.
>
>> That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if
>> the app doesn't do correctly aligned and sized IO then performance is going to
>> suck, and if the app doesn't serialize access to the file correctly it can and
>> will corrupt data in the file....
>
> I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have
> opposite intents with respect to caching.  Our filesystem handles them
> differently, so we need to distinguish between the two.
>
>> > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the
>> > application does not have to implement any caching to see good performance.
>
>> Sure, but it has to be aware of layout and where/how it can write, which is
>> exactly the same constraints that local filesystems place on O_DIRECT access.
>
>> Not convinced.  The use case fits pretty neatly into expected O_DIRECT semantics
>> and behaviour, IMO.
>
> I'd like to make a slight adjustment to my proposal.  The HPC community had
> talked about extensions to POSIX to include O_LAZY as a way for filesystems to
> relax data coherency requirements.  There is code in the ceph filesystem that
> uses that flag if defined.
> Can we get O_LAZY defined?

This O_LAZY sounds exactly like what NFS has always done.

If different clients do page aligned writes and have their own protocol
to keep track of who owns which page, then everything is fine and
write-back caching does good things.

If different clients use byte-range locks, then write-back caching is
curtailed a bit, but clients don't need to be so careful.

If clients do non-aligned writes without locking, then corruption can
result.

So:

  #define O_LAZY 0

and NFS already has it implemented :-)

For NFS, we have O_SYNC which tries to provide cache-coherency as strong
as other filesystems provide without it.

Do we really want O_LAZY?  Or are other filesystems trying too hard to
provide coherency when apps don't use locks?

NeilBrown

>
> HEC POSIX extension:
> http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf
>
> Ceph usage of O_LAZY:
> https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78
>
> Regards,
> Rodney