From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christoph Hellwig
Subject: Re: [RFC] relaxed barrier semantics
Date: Thu, 29 Jul 2010 10:31:42 +0200
Message-ID: <20100729083142.GA30077@lst.de>
References: <20100727183546.GG7347@redhat.com> <4C4FE58C.8080403@kernel.org> <20100728082447.GA7668@lst.de> <4C4FECFE.9040509@kernel.org> <20100728085048.GA8884@lst.de> <4C4FF136.5000205@kernel.org> <20100728090025.GA9252@lst.de> <4C4FF592.9090800@kernel.org> <20100728092859.GA11096@lst.de> <20100729014431.GD4506@thunk.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Received: from verein.lst.de ([213.95.11.210]:37744 "EHLO verein.lst.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750987Ab0G2IcK (ORCPT ); Thu, 29 Jul 2010 04:32:10 -0400
Content-Disposition: inline
In-Reply-To: <20100729014431.GD4506@thunk.org>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Ted Ts'o, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe@fusionio.com, James.Bottomley@s

On Wed, Jul 28, 2010 at 09:44:31PM -0400, Ted Ts'o wrote:
> Define "are safe" --- what interface we planning on using for the
> non-draining barrier? At least for ext3, when we write the commit
> record using set_buffer_ordered(bh), it assumes that this will do a
> flush of all previous writes and that the commit will hit the disk
> before any subsequent writes are sent to the disk. So turning the
> write of a buffer head marked with set_buffered_ordered() into a FUA
> write would _not_ be safe for ext3.

Please be careful with your wording. Do you really mean "all previous
writes" or "all previous writes that were completed"? My reading of
the ext3/jbd code is that we explicitly wait on I/O completion of
dependent writes, and only require those to actually be stable by
issuing a flush.
If that wasn't the case, the default ext3 barriers-off behaviour would
be dangerous not only on devices with volatile write caches, but also
on devices that do not have them. That matches neither my reading of
the code nor what we've seen in actual power fail testing, where ext3
does well as long as there is no volatile write cache.

Anyway, the pre-flush semantics are what the relaxed barriers will
preserve. REQ_FUA is a separate interface, which we actually have
already inside the block layer; we'll just need to emulate it for
devices without the FUA bit and handle it in dm and md.

> For ext4, if we don't use journal checksums, then we have the same
> requirements as ext3, and the same method of requesting it. If we do
> use journal checksums, what ext4 needs is a way of assuring that no
> writes after the commit are reordered with respect to the disk platter
> before the commit record --- but any of the writes before that,
> including the commit, and be reordered because we rely on the checksum
> in the commit record to know at replay time whether the last commit is
> valid or not. We do that right now by calling blkdev_issue_flush()
> with BLKDEF_IFL_WAIT after submitting the write of the commit block.

blkdev_issue_flush is just an empty barrier, and the current barriers
prevent any kind of reordering. I'd rather avoid adding a one-way
reordering prevention. Given that we don't appear to actually need the
full reordering prevention even without the journal checksums, why do
you have stricter requirements when they are enabled?
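
To make the sequence discussed above concrete (wait for the dependent
writes to complete, flush so they are stable, then write the commit
record with FUA or an emulation of it), here is a rough user-space toy
model. It is only a sketch: the `Disk` class and the `commit` function
are hypothetical names made up for illustration, not block layer or jbd
interfaces. It assumes that a completed write sits in a volatile cache
until a flush, and that FUA can be emulated as a write followed by a
flush on devices without the FUA bit.

```python
class Disk:
    """Toy disk with a volatile write cache (hypothetical model, not kernel code)."""

    def __init__(self, supports_fua):
        self.supports_fua = supports_fua
        self.cache = {}    # completed writes that a power failure would lose
        self.media = {}    # writes that are stable
        self.flushes = 0   # number of cache flushes issued

    def write(self, block, data, fua=False):
        """Complete a write; with fua=True it is stable when this returns."""
        if fua and self.supports_fua:
            # Native FUA: this one write bypasses the volatile cache.
            self.media[block] = data
            self.cache.pop(block, None)
            return
        self.cache[block] = data
        if fua:
            # Assumed emulation for devices without the FUA bit:
            # follow the write with a cache flush.
            self.flush()

    def flush(self):
        """Make every completed (cached) write stable."""
        self.media.update(self.cache)
        self.cache.clear()
        self.flushes += 1

    def power_fail(self):
        """Drop the volatile cache and return what survived."""
        self.cache.clear()
        return dict(self.media)


def commit(disk, journal_blocks, commit_block):
    """Sketch of a journal commit with pre-flush semantics."""
    # 1. Submit the dependent writes and wait for completion.
    #    (write() is synchronous in this model, so returning means completed.)
    for block, data in journal_blocks.items():
        disk.write(block, data)
    # 2. Pre-flush: only the already completed writes need to become stable.
    disk.flush()
    # 3. Write the commit record so that it is stable on completion.
    disk.write(commit_block, "commit", fua=True)
```

In this model a commit costs one flush on a FUA-capable device and two
on a device where FUA is emulated, and a write issued after the commit
is not ordered against it in any way; it is simply lost on power
failure unless something flushes it later.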