* [RFC] relaxed barrier semantics
From: Christoph Hellwig @ 2010-07-27 16:56 UTC (permalink / raw)
To: jaxboe, tj, James.Bottomley
Cc: linux-fsdevel, linux-scsi, jack, tytso, chris.mason, swhiteho, konishi.ryusuke

I've been dealing with reports of massive slowdowns due to the barrier
option when used with storage arrays that do not actually have a
volatile write cache.

The reason for that is that sd.c by default sets the ordered mode to
QUEUE_ORDERED_DRAIN when the WCE bit is not set.  This is in accordance
with Documentation/block/barriers.txt but misses an important point:
most filesystems (at least all mainstream ones) couldn't care less
about the ordering semantics barrier operations provide.  In fact those
semantics are actively harmful, as they cause us to stall the whole I/O
queue while otherwise we'd only have to wait for a rather limited
amount of I/O.

The simplest fix is to not use write barriers for devices that do not
have a volatile write cache, by specifying the nobarrier option.  This
has the huge disadvantage that it requires manual user interaction
instead of simply working out of the box.  There are three better
automatic options:

 (1) if a filesystem detects the QUEUE_ORDERED_DRAIN mode, but doesn't
     actually need the barrier semantics, it simply disables all calls
     to blockdev_issue_flush and never sets the REQ_HARDBARRIER flag
     on writes.  This is a relatively safe option, but it requires code
     in all filesystems, as well as in the raid / device mapper modules
     so that they can cope with it.
 (2) never set QUEUE_ORDERED_DRAIN, and remove the code related to it
     after auditing that no filesystem actually relies on this
     behaviour.  Currently the block layer fails REQ_HARDBARRIER
     requests if QUEUE_ORDERED_NONE is set, so we'd have to fix that
     as well.
 (3) introduce a new QUEUE_ORDERED_REALLY_NONE which is set by drivers
     that know no barrier handling is needed.  It's equivalent to
     QUEUE_ORDERED_NONE except for not failing barrier requests.

I'm tempted to go for variant (2) above, and could use some help
auditing the filesystems for their use of the barrier semantics.

So far I've only found an explicit dependency on this behaviour in
reiserfs, and there it is guarded by the barrier mount option, so we
could easily disable it when we know we don't have the full barrier
semantics.

^ permalink raw reply  [flat|nested]  155+ messages in thread
* Re: [RFC] relaxed barrier semantics
From: Jan Kara @ 2010-07-27 17:54 UTC (permalink / raw)
To: Christoph Hellwig
Cc: jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi, jack, tytso, chris.mason, swhiteho, konishi.ryusuke

  Hi,

On Tue 27-07-10 18:56:27, Christoph Hellwig wrote:
> I've been dealing with reports of massive slowdowns due to the barrier
> option when used with storage arrays that do not actually have a
> volatile write cache.
>
> The reason for that is that sd.c by default sets the ordered mode to
> QUEUE_ORDERED_DRAIN when the WCE bit is not set.  This is in accordance
> with Documentation/block/barriers.txt but misses an important point:
> most filesystems (at least all mainstream ones) couldn't care less
> about the ordering semantics barrier operations provide.  In fact those
> semantics are actively harmful, as they cause us to stall the whole I/O
> queue while otherwise we'd only have to wait for a rather limited
> amount of I/O.
  OK, let me understand one thing.  So the storage arrays have some
caches and queues of requests, and QUEUE_ORDERED_DRAIN forces them to
flush all this to the platter, right?  So can it happen that they
somehow lose the requests that were already issued to them (e.g.
because of power failure)?

> The simplest fix is to not use write barriers for devices that do not
> have a volatile write cache, by specifying the nobarrier option.  This
> has the huge disadvantage that it requires manual user interaction
> instead of simply working out of the box.  There are three better
> automatic options:
>
>  (1) if a filesystem detects the QUEUE_ORDERED_DRAIN mode, but doesn't
>      actually need the barrier semantics, it simply disables all calls
>      to blockdev_issue_flush and never sets the REQ_HARDBARRIER flag
>      on writes.  This is a relatively safe option, but it requires code
>      in all filesystems, as well as in the raid / device mapper modules
>      so that they can cope with it.
>  (2) never set QUEUE_ORDERED_DRAIN, and remove the code related to it
>      after auditing that no filesystem actually relies on this
>      behaviour.  Currently the block layer fails REQ_HARDBARRIER
>      requests if QUEUE_ORDERED_NONE is set, so we'd have to fix that
>      as well.
>  (3) introduce a new QUEUE_ORDERED_REALLY_NONE which is set by drivers
>      that know no barrier handling is needed.  It's equivalent to
>      QUEUE_ORDERED_NONE except for not failing barrier requests.
>
> I'm tempted to go for variant (2) above, and could use some help
> auditing the filesystems for their use of the barrier semantics.
>
> So far I've only found an explicit dependency on this behaviour in
> reiserfs, and there it is guarded by the barrier mount option, so we
> could easily disable it when we know we don't have the full barrier
> semantics.
  Also JBD2 relies on the ordering semantics if
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT is set (it's used by ext4 if asked
to).

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [RFC] relaxed barrier semantics
From: Vivek Goyal @ 2010-07-27 18:35 UTC (permalink / raw)
To: Jan Kara
Cc: Christoph Hellwig, jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On Tue, Jul 27, 2010 at 07:54:19PM +0200, Jan Kara wrote:
> Hi,
>
> On Tue 27-07-10 18:56:27, Christoph Hellwig wrote:
> > I've been dealing with reports of massive slowdowns due to the barrier
> > option when used with storage arrays that do not actually have a
> > volatile write cache.
> >
> > The reason for that is that sd.c by default sets the ordered mode to
> > QUEUE_ORDERED_DRAIN when the WCE bit is not set.  This is in accordance
> > with Documentation/block/barriers.txt but misses an important point:
> > most filesystems (at least all mainstream ones) couldn't care less
> > about the ordering semantics barrier operations provide.  In fact those
> > semantics are actively harmful, as they cause us to stall the whole I/O
> > queue while otherwise we'd only have to wait for a rather limited
> > amount of I/O.
> OK, let me understand one thing.  So the storage arrays have some
> caches and queues of requests, and QUEUE_ORDERED_DRAIN forces them to
> flush all this to the platter, right?

IIUC, QUEUE_ORDERED_DRAIN will be set only for storage which either
does not support write caches or which advertises itself as having no
write cache (it has a write cache, but it is battery backed and the
device is capable of flushing requests upon power failure).

IIUC, what Christoph is trying to address is that if the write cache is
not enabled then we don't need flushing semantics.  We can get rid of
the need for request ordering semantics by waiting on dependent
requests to finish instead of issuing a barrier.  That way we issue no
barriers and no request queue drains, which should help with
throughput.

Vivek

> So can it happen that they somehow lose the requests that were already
> issued to them (e.g. because of power failure)?
>
> [...]
* Re: [RFC] relaxed barrier semantics
From: James Bottomley @ 2010-07-27 18:42 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jan Kara, Christoph Hellwig, jaxboe, tj, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On Tue, 2010-07-27 at 14:35 -0400, Vivek Goyal wrote:
> On Tue, Jul 27, 2010 at 07:54:19PM +0200, Jan Kara wrote:
> > [...]
>
> IIUC, QUEUE_ORDERED_DRAIN will be set only for storage which either
> does not support write caches or which advertises itself as having no
> write cache (it has a write cache, but it is battery backed and the
> device is capable of flushing requests upon power failure).
>
> IIUC, what Christoph is trying to address is that if the write cache
> is not enabled then we don't need flushing semantics.  We can get rid
> of the need for request ordering semantics by waiting on dependent
> requests to finish instead of issuing a barrier.  That way we issue no
> barriers and no request queue drains, which should help with
> throughput.

I hope not ... I hope that if the drive reports write through or no
cache, we don't enable (flush) barriers by default.

The problem case is NV cache arrays (usually an array with a battery
backed cache).  There's no consistency issue since the array will
destage the cache on power fail, but it reports a write back cache and
we try to use barriers.  This is wrong because we don't need barriers
for consistency and they really damage throughput.

James
* Re: [RFC] relaxed barrier semantics
From: Ric Wheeler @ 2010-07-27 18:51 UTC (permalink / raw)
To: James Bottomley
Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, tj, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On 07/27/2010 02:42 PM, James Bottomley wrote:
> On Tue, 2010-07-27 at 14:35 -0400, Vivek Goyal wrote:
>> [...]
>
> I hope not ... I hope that if the drive reports write through or no
> cache, we don't enable (flush) barriers by default.
>
> The problem case is NV cache arrays (usually an array with a battery
> backed cache).  There's no consistency issue since the array will
> destage the cache on power fail, but it reports a write back cache
> and we try to use barriers.  This is wrong because we don't need
> barriers for consistency and they really damage throughput.
>
> James

This is the case we are trying to address.  Some (most?) of these NV
cache arrays hopefully advertise write through caches, and we can
automate disabling the unneeded bits here....

ric
* Re: [RFC] relaxed barrier semantics
From: Christoph Hellwig @ 2010-07-27 19:43 UTC (permalink / raw)
To: James Bottomley
Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, tj, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On Tue, Jul 27, 2010 at 01:42:45PM -0500, James Bottomley wrote:
> I hope not ... I hope that if the drive reports write through or no
> cache, we don't enable (flush) barriers by default.

drivers/scsi/sd.c:sd_revalidate_disk()

	if (sdkp->WCE)
		ordered = sdkp->DPOFUA
			? QUEUE_ORDERED_DRAIN_FUA
			: QUEUE_ORDERED_DRAIN_FLUSH;
	else
		ordered = QUEUE_ORDERED_DRAIN;

	blk_queue_ordered(sdkp->disk->queue, ordered);

Documentation/block/barrier.txt:

	QUEUE_ORDERED_DRAIN
		Requests are ordered by draining the request queue and
		cache flushing isn't needed.

		Sequence: drain => barrier

> The problem case is NV cache arrays (usually an array with a battery
> backed cache).  There's no consistency issue since the array will
> destage the cache on power fail, but it reports a write back cache
> and we try to use barriers.  This is wrong because we don't need
> barriers for consistency and they really damage throughput.

The arrays I have access to (various Netapp, IBM and LSI ones) never
report the write cache as enabled.  I've only heard about the above
issue from historic tales.
* Re: [RFC] relaxed barrier semantics
From: Christoph Hellwig @ 2010-07-27 19:38 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jan Kara, Christoph Hellwig, jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On Tue, Jul 27, 2010 at 02:35:46PM -0400, Vivek Goyal wrote:
> IIUC, QUEUE_ORDERED_DRAIN will be set only for storage which either
> does not support write caches or which advertises itself as having no
> write cache (it has a write cache, but it is battery backed and the
> device is capable of flushing requests upon power failure).

More or less.  We set it for SCSI devices without the write cache
enable (WCE) bit, which is only set if there is a volatile write cache
that needs flushing.  Some historic arrays used to set it despite
having a non-volatile write cache, but that doesn't happen anymore
with any of the modern ones I have access to.
* Re: [RFC] relaxed barrier semantics
From: Tejun Heo @ 2010-07-28 8:08 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jan Kara, Christoph Hellwig, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

Hello,

On 07/27/2010 08:35 PM, Vivek Goyal wrote:
> IIUC, what Christoph is trying to address is that if the write cache
> is not enabled then we don't need flushing semantics.  We can get rid
> of the need for request ordering semantics by waiting on dependent
> requests to finish instead of issuing a barrier.  That way we issue no
> barriers and no request queue drains, which should help with
> throughput.

What I don't get here is: if filesystems already order requests by
waiting for completions, why do they use barriers at all?  All they
need is a flush request after all the preceding requests are known to
be complete.  Having a writeback cache or not doesn't make any
difference w.r.t. request ordering requirements.  If a filesystem
doesn't need the heavy-handed ordering provided by barriers, it should
just use flush instead of barrier.  If a filesystem needs the barrier
ordering, whether the device in question is battery backed and costs
more than a house doesn't make any difference.

Thanks.

--
tejun
* Re: [RFC] relaxed barrier semantics
From: Tejun Heo @ 2010-07-28 8:20 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jan Kara, Christoph Hellwig, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On 07/28/2010 10:08 AM, Tejun Heo wrote:
> Having a writeback cache or not doesn't make any difference
> w.r.t. request ordering requirements.  If a filesystem doesn't need
> the heavy-handed ordering provided by barriers, it should just use
> flush instead of barrier.  If a filesystem needs the barrier ordering,
> whether the device in question is battery backed and costs more than
> a house doesn't make any difference.

BTW, if filesystems already have code to order the requests they're
issuing, it would be *great* to phase out barriers and replace them
with a simple in-stream, non-ordering flush request.  There have been
several different suggestions about how to improve barriers, and most
revolved around how to transfer more information from the filesystem
to the block layer so that the block layer can use more relaxed
ordering, but the more I think about it, the clearer it becomes that
this doesn't belong in the block layer at all.

The only benefit of doing it in the block layer, and probably the
reason why it was done this way at all, is making use of the advanced
ordering features of some devices - ordered tags and linked commands.
The latter is deprecated and the former is fundamentally broken in its
error handling anyway.  Furthermore, although they do relax ordering
requirements on the device queue side, the level of flexibility is
significantly lower compared to what filesystems can do themselves.

So, yeah, let's phase it out if it isn't too difficult.

Thanks.

--
tejun
* Re: [RFC] relaxed barrier semantics
From: Vladislav Bolkhovitin @ 2010-07-28 13:55 UTC (permalink / raw)
To: Tejun Heo
Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

Tejun Heo, on 07/28/2010 12:20 PM wrote:
> On 07/28/2010 10:08 AM, Tejun Heo wrote:
>> [...]
>
> BTW, if filesystems already have code to order the requests they're
> issuing, it would be *great* to phase out barriers and replace them
> with a simple in-stream, non-ordering flush request.  There have been
> several different suggestions about how to improve barriers, and most
> revolved around how to transfer more information from the filesystem
> to the block layer so that the block layer can use more relaxed
> ordering, but the more I think about it, the clearer it becomes that
> this doesn't belong in the block layer at all.
>
> The only benefit of doing it in the block layer, and probably the
> reason why it was done this way at all, is making use of the advanced
> ordering features of some devices - ordered tags and linked commands.
> The latter is deprecated and the former is fundamentally broken in
> its error handling anyway.

Why?  SCSI provides ACA and UA_INTLCK, which provide all needed
facilities for error handling in deep ordered queues.

> Furthermore, although they do relax ordering
> requirements on the device queue side, the level of flexibility is
> significantly lower compared to what filesystems can do themselves.

Can you elaborate on what is not sufficiently flexible in SCSI ordered
commands, please?

Vlad
* Re: [RFC] relaxed barrier semantics
From: Tejun Heo @ 2010-07-28 14:23 UTC (permalink / raw)
To: Vladislav Bolkhovitin
Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

Hello,

On 07/28/2010 03:55 PM, Vladislav Bolkhovitin wrote:
>> The only benefit of doing it in the block layer, and probably the
>> reason why it was done this way at all, is making use of the advanced
>> ordering features of some devices - ordered tags and linked commands.
>> The latter is deprecated and the former is fundamentally broken in
>> its error handling anyway.
>
> Why?  SCSI provides ACA and UA_INTLCK, which provide all needed
> facilities for error handling in deep ordered queues.

I don't remember all the details now, but IIRC what was necessary was
an earlier write failure failing all commands scheduled as ordered.
Does ACA / UA_INTLCK or whatever allow that?

>> Furthermore, although they do relax ordering
>> requirements on the device queue side, the level of flexibility is
>> significantly lower compared to what filesystems can do themselves.
>
> Can you elaborate on what is not sufficiently flexible in SCSI ordered
> commands, please?

Filesystems are not communicating enough ordering information to the
block layer in the first place, so we lose a lot of ordering
information there, and SCSI ordered queueing is also pretty restricted
in what kind of ordering it can represent.  The end result is that we
don't gain much by using ordered queueing.  It may cut down command
latencies among the commands used for a barrier sequence, but compare
that to the level of parallelism filesystem code can exploit by
ordering requests themselves...

Another thing is coverage.  We have had ordered queueing for quite
some time now, but there are only a couple of drivers which actually
support it.

Thanks.

--
tejun
* Re: [RFC] relaxed barrier semantics
From: James Bottomley @ 2010-07-28 14:37 UTC (permalink / raw)
To: Tejun Heo
Cc: Vladislav Bolkhovitin, Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On Wed, 2010-07-28 at 16:23 +0200, Tejun Heo wrote:
> Hello,
>
> On 07/28/2010 03:55 PM, Vladislav Bolkhovitin wrote:
> >> The only benefit of doing it in the block layer, and probably the
> >> reason why it was done this way at all, is making use of the
> >> advanced ordering features of some devices - ordered tags and
> >> linked commands.  The latter is deprecated and the former is
> >> fundamentally broken in its error handling anyway.
> >
> > Why?  SCSI provides ACA and UA_INTLCK, which provide all needed
> > facilities for error handling in deep ordered queues.
>
> I don't remember all the details now, but IIRC what was necessary was
> an earlier write failure failing all commands scheduled as ordered.
> Does ACA / UA_INTLCK or whatever allow that?

No.  That requires support for QErr ... which is in the same mode page.

The real reason we have difficulty is that BUSY/QUEUE_FULL can cause
reordering in the issue queue, which is a driver problem and not in
the SCSI standards.

James
* Re: [RFC] relaxed barrier semantics
From: Tejun Heo @ 2010-07-28 14:44 UTC (permalink / raw)
To: James Bottomley
Cc: Vladislav Bolkhovitin, Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

Hello,

On 07/28/2010 04:37 PM, James Bottomley wrote:
>> I don't remember all the details now, but IIRC what was necessary was
>> an earlier write failure failing all commands scheduled as ordered.
>> Does ACA / UA_INTLCK or whatever allow that?
>
> No.  That requires support for QErr ... which is in the same mode page.

I see.

> The real reason we have difficulty is that BUSY/QUEUE_FULL can cause
> reordering in the issue queue, which is a driver problem and not in
> the SCSI standards.

Ah yeah, right.  ISTR discussions about this years ago.  But one way
or the other, given the limited amount of ordering information
available below the block layer, I doubt the benefit of doing so would
be anything significant.  If it can be done without too much
complexity, sure, but otherwise...

Thanks.

--
tejun
* Re: [RFC] relaxed barrier semantics
From: Vladislav Bolkhovitin @ 2010-07-28 16:17 UTC (permalink / raw)
To: Tejun Heo
Cc: James Bottomley, Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

Tejun Heo, on 07/28/2010 06:44 PM wrote:
> Hello,
>
> On 07/28/2010 04:37 PM, James Bottomley wrote:
>>> I don't remember all the details now, but IIRC what was necessary
>>> was an earlier write failure failing all commands scheduled as
>>> ordered.  Does ACA / UA_INTLCK or whatever allow that?
>>
>> No.  That requires support for QErr ... which is in the same mode page.
>
> I see.
>
>> The real reason we have difficulty is that BUSY/QUEUE_FULL can cause
>> reordering in the issue queue, which is a driver problem and not in
>> the SCSI standards.
>
> Ah yeah, right.  ISTR discussions about this years ago.  But one way
> or the other, given the limited amount of ordering information
> available below the block layer, I doubt the benefit of doing so would
> be anything significant.  If it can be done without too much
> complexity, sure, but otherwise...

Hmm, this thread was started from the need to avoid queue draining,
because it is a big performance hit.  The use of ordered commands
allows us to _completely_ eliminate queue draining.  That looks like a
significant benefit, worth some additional complexity.

Vlad
* Re: [RFC] relaxed barrier semantics 2010-07-28 14:37 ` James Bottomley 2010-07-28 14:44 ` Tejun Heo @ 2010-07-28 16:17 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-28 16:17 UTC (permalink / raw) To: James Bottomley Cc: Tejun Heo, Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke James Bottomley, on 07/28/2010 06:37 PM wrote: > On Wed, 2010-07-28 at 16:23 +0200, Tejun Heo wrote: >> Hello, >> >> On 07/28/2010 03:55 PM, Vladislav Bolkhovitin wrote: >>>> The only benefit of doing it in the block layer, and probably the >>>> reason why it was done this way at all, is making use of advanced >>>> ordering features of some devices - ordered tag and linked commands. >>>> The latter is deprecated and the former is fundamentally broken in >>>> error handling anyway. >>> >>> Why? SCSI provides ACA and UA_INTLCK which provide all needed >>> facilities for errors handling in deep ordered queues. >> >> I don't remember all the details now but IIRC what was necessary was >> earlier write failure failing all commands scheduled as ordered. Does >> ACA / UA_INTLCK or whatever allow that? > > No. That requires support for QErr ... which is in the same mode page. > > The real reason we have difficulty is that BUSY/QUEUE_FULL can cause > reordering in the issue queue, which is a driver problem and not in the > SCSI standards. BTW, I for long time wandering why low level drivers should process BUSY/QUEUE_FULL and perform adjusting the queue depth. Isn't it common for the drivers, so should be performed on the higher (SCSI) level? This level would provide facility to prevent reordering, if needed, and the driver would communicate with it in a transparent level. I mean the following. A driver always deals with a single command at time. It either sends the command to the device, or sends command's status/sense from the device to the SCSI level. 
Then the SCSI level decides whether to send another command to the driver or to perform the necessary recovery, e.g., adjusting the queue depth or using ACA to restart the QUEUE_FULL'ed command. In this architecture there would be no need to update all the drivers to provide ordering guarantees and ACA-based recovery, as seems to be needed now. Or am I missing something? Thanks, Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 14:23 ` Tejun Heo 2010-07-28 14:37 ` James Bottomley @ 2010-07-28 16:16 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-28 16:16 UTC (permalink / raw) To: Tejun Heo Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke Tejun Heo, on 07/28/2010 06:23 PM wrote: > Hello, > > On 07/28/2010 03:55 PM, Vladislav Bolkhovitin wrote: >>> The only benefit of doing it in the block layer, and probably the >>> reason why it was done this way at all, is making use of advanced >>> ordering features of some devices - ordered tag and linked commands. >>> The latter is deprecated and the former is fundamentally broken in >>> error handling anyway. >> >> Why? SCSI provides ACA and UA_INTLCK which provide all needed >> facilities for error handling in deep ordered queues. > > I don't remember all the details now but IIRC what was necessary was > earlier write failure failing all commands scheduled as ordered. Does > ACA / UA_INTLCK or whatever allow that? Basically, ACA suspends the whole queue if a command at the head finishes with CHECK CONDITION status. The queue is resumed later by the CLEAR ACA task management function. During ACA one or more new commands can be sent to the head of the queue, which allows, e.g., restarting the failed command. UA_INTLCK allows a Unit Attention to be established if a command at the head finishes with an error other than CHECK CONDITION status. The next command will then finish with CHECK CONDITION, and ACA comes into action. Overall, they look like a complete facility for effective error recovery in ordered queues. >>> Furthermore, although they do relax ordering >>> requirements from the device queue side, the level of flexibility is >>> significantly lower compared to what filesystems can do themselves.
>> >> Can you elaborate more what is not sufficiently flexible in SCSI >> ordered commands, please? > > File systems are not communicating enough ordering info to the block layer > already so we already lose a lot of ordering information there and > SCSI ordered queueing is also pretty restricted in what kind of > ordering it can represent. What restrictions do you mean? > The end result is that we don't gain much > by using ordered queueing. It may cut down command latencies among > commands used for barrier sequence but if you compare it to the level > of parallelism filesystem code can exploit by ordering requests > themselves... Another thing is coverage. We have ordered queueing > for quite some time now but there are only a couple of drivers which > actually support them. Agreed, file systems should provide full ordering info to the block level. The block level should then do its best to provide the needed ordering using the available hardware facilities. Thanks, Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 8:08 ` Tejun Heo 2010-07-28 8:20 ` Tejun Heo @ 2010-07-28 8:24 ` Christoph Hellwig 2010-07-28 8:40 ` Tejun Heo 1 sibling, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-28 8:24 UTC (permalink / raw) To: Tejun Heo Cc: Vivek Goyal, Jan Kara, Christoph Hellwig, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On Wed, Jul 28, 2010 at 10:08:44AM +0200, Tejun Heo wrote: > What I don't get here is if filesystems order requests already by > waiting for completions why do they use barriers at all? All they > need is flush request after all the preceding requests are known to be > complete. In fact for XFS I'm working on doing some bit of that, too, but it's not actually that easy. For one, we don't actually have a non-barrier cache flush primitive currently, although the conversion of cache flushes to FS requests and the addition of REQ_FLUSH helps greatly with it. Second, the usual primitive for log writes actually is a WRITE_FUA, that is a WRITE that needs to go to disk, without consequences to the rest of the cache. I've started implementing that, including proper emulation for devices only supporting cache flushes, but got stuck with the barrier machinery. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 8:24 ` Christoph Hellwig @ 2010-07-28 8:40 ` Tejun Heo 2010-07-28 8:50 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Tejun Heo @ 2010-07-28 8:40 UTC (permalink / raw) To: Christoph Hellwig Cc: Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke Hello, On 07/28/2010 10:24 AM, Christoph Hellwig wrote: > In fact for XFS I'm working on doing some bit of that, too, but it's not > actually that easy. For one we don't actually have a non-barrier cache > flush primitive currently, although the conversion of cache flushes > to FS requests and the addition of REQ_FLUSH helps greatly with it. > Second the usual primitive for log writes actually is a WRITE_FUA, > that is a WRITE that needs to go to disk, without consequences to > the rest of the cache. I've started implementing that, including > proper emulation for devices only supporting cache flushes but got > stuck with the barrier machinery. The barrier machinery can be easily changed to drop the DRAIN and ordering stages, so all we need to do is an interface for the filesystem to tell the barrier implementation that it will take care of ordering itself and barriers (a bit of a misnomer but well it isn't too bad) can be handled as FUA writes which get executed after all previous commands are committed to NV media. On a write-through device w/ FUA support, it will simply become a FUA write. On a device w/ write-back cache and w/o FUA support, it will become a flush, write, flush sequence. On a device in between, flush, FUA write. Would that be enough for filesystems? If so, the transition would be pretty painless, md already splits barriers correctly and the modification is confined to the barrier implementation itself and filesystems which want to use more relaxed ordering. Thanks. -- tejun ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 8:40 ` Tejun Heo @ 2010-07-28 8:50 ` Christoph Hellwig 2010-07-28 8:58 ` Tejun Heo 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-28 8:50 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Hellwig, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On Wed, Jul 28, 2010 at 10:40:30AM +0200, Tejun Heo wrote: > The barrier machinery can be easily changed to drop the DRAIN and > ordering stages, Maybe you're smarter than me, but so far I had real trouble with that. The problem is that we actually still need the drain colouring to keep out other "barrier" requests given that we have the state for the pre- and post-flush requests in struct request. This is where I'm still struggling with the even more relaxed barriers I had been working on for a while. They work perfectly on devices supporting the FUA bit and nothing in between. > so all we need to do is an interface for the > filesystem to tell the barrier implementation that it will take care > of ordering itself and barriers (a bit of a misnomer but well it isn't > too bad) can be handled as FUA writes which get executed after all > previous commands are committed to NV media. On write-through device > w/ FUA support, it will simply become a FUA write. If the device is write-through there is no need for the FUA bit to start with. > On a device w/ > write back cache and w/o FUA support, it will become flush, write, > flush sequence. On a device inbetween, flush, FUA write. Would that > be enough for filesystems? If so, the transition would be pretty > painless, md already splits barriers correctly and the modification is > confined to barrier implementation itself and filesystem which want to > use more relaxed ordering. The above is a good start. But at least for XFS we'll eventually want writes without the pre flush, too.
We'll only need the pre-flush for a specific class of log writes (when we have an extending write or need to push the log tail), otherwise plain FUA semantics are enough. Just going for the pre-flush / FUA semantics as a start has the big advantage of making the transition a lot simpler, though. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 8:50 ` Christoph Hellwig @ 2010-07-28 8:58 ` Tejun Heo 2010-07-28 9:00 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Tejun Heo @ 2010-07-28 8:58 UTC (permalink / raw) To: Christoph Hellwig Cc: Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke Hello, On 07/28/2010 10:50 AM, Christoph Hellwig wrote: > On Wed, Jul 28, 2010 at 10:40:30AM +0200, Tejun Heo wrote: >> The barrier machinery can be easily changed to drop the DRAIN and >> ordering stages, > > Maybe you're smarter than me, but so far I had real trouble with that. It's more likely that I was just blowing out hot air as I haven't looked at the code for a couple of years now. So, well, yeah, let's drop "easily" from the original sentence. :-) > The problem is that we actually still need the drain colouring to > keep out other "barrier" requests given that we have the state for > the pre- and post-flush requests in struct request. This is > where I'm still struggling with the even more relaxed barriers > I had been working on for a while. They work perfectly on devices > supporting the FUA bit and nothing in between. > >> so all we need to do is an interface for the >> filesystem to tell the barrier implementation that it will take care >> of ordering itself and barriers (a bit of a misnomer but well it isn't >> too bad) can be handled as FUA writes which get executed after all >> previous commands are committed to NV media. On write-through device >> w/ FUA support, it will simply become a FUA write. > > If the device is write-through there is no need for the FUA bit to > start with. Oh, right. >> On a device w/ >> write back cache and w/o FUA support, it will become flush, write, >> flush sequence. On a device in between, flush, FUA write. Would that >> be enough for filesystems?
If so, the transition would be pretty >> painless, md already splits barriers correctly and the modification is >> confined to barrier implementation itself and filesystem which want to >> use more relaxed ordering. > > The above is a good start. But at least for XFS we'll eventually > want writes without the pre flush, too. We'll only need the pre-flush > for a specific class of log writes (when we have an extending write or > need to push the log tail), otherwise plain FUA semantics are enough. > Just going for the pre-flush / FUA semantics as a start has the > big advantage of making the transition a lot simpler, though. I see. It probably would be good to have ordering requirements carried in the bio / request, so that filesystems can mix and match barriers of different strengths as necessary. As you seem to be already working on it, are you interested in pursuing that direction? Thanks. -- tejun ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 8:58 ` Tejun Heo @ 2010-07-28 9:00 ` Christoph Hellwig 2010-07-28 9:11 ` Hannes Reinecke ` (2 more replies) 0 siblings, 3 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-28 9:00 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Hellwig, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote: > I see. It probably would be good to have ordering requirements > carried in the bio / request, so that filesystems can mix and match > barriers of different strengths as necessary. As you seem to be > already working on it, are you interested in pursuing that direction? I've been working on that for a while, but it got a lot more urgent as there's been an application hit particularly hard by the barrier semantics on cache-less devices and people started getting angry about it. That's why fixing this for cache-less devices has become a higher priority than solving the big picture. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:00 ` Christoph Hellwig @ 2010-07-28 9:11 ` Hannes Reinecke 2010-07-28 9:16 ` Christoph Hellwig 2010-07-28 9:28 ` Steven Whitehouse 2010-07-28 9:17 ` Tejun Heo 2010-07-28 14:42 ` Vivek Goyal 2 siblings, 2 replies; 155+ messages in thread From: Hannes Reinecke @ 2010-07-28 9:11 UTC (permalink / raw) To: Christoph Hellwig Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke Christoph Hellwig wrote: > On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote: >> I see. It probably would be good to have ordering requirements >> carried in the bio / request, so that filesystems can mix and match >> barriers of different strengths as necessary. As you seem to be >> already working on it, are you interested in pursuing that direction? > > I've been working on that for a while, but it got a lot more urgent > as there's been an application hit particularly hard by the barrier > semantics on cache-less devices and people started getting angry > about it. That's why fixing this for cache-less devices has become > a higher priority than solving the big picture. > My idea here is to use the 'META' request tag to emulate FUA. From what I've seen, the META request tag is only ever used on gfs2, and even there it is used for tagging journal requests on write. Once you've tagged all bios/requests correctly, it is trivial to set the FUA bit. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:11 ` Hannes Reinecke @ 2010-07-28 9:16 ` Christoph Hellwig 2010-07-28 9:24 ` Tejun Heo 2010-07-28 9:28 ` Steven Whitehouse 1 sibling, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-28 9:16 UTC (permalink / raw) To: Hannes Reinecke Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On Wed, Jul 28, 2010 at 11:11:08AM +0200, Hannes Reinecke wrote: > My idea here is to use the 'META' request tag to emulate FUA. > From what I've seen, the META request tag is only ever used on gfs2, > and even there it is used for tagging journal requests on write. Please don't overload META even more, it's already overloaded with at least two meanings. We do in fact already have a REQ_FUA flag, and now that I have unified the bio and request flags we can easily set it from filesystems. The problem is to emulate it properly on devices that do not actually support the FUA bit, of which we unfortunately have a lot, given that libata by default disables FUA support even if the device supports it. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:16 ` Christoph Hellwig @ 2010-07-28 9:24 ` Tejun Heo 2010-07-28 9:38 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Tejun Heo @ 2010-07-28 9:24 UTC (permalink / raw) To: Christoph Hellwig Cc: Hannes Reinecke, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On 07/28/2010 11:16 AM, Christoph Hellwig wrote: > The problem is to emulate it properly on devices that do not actually > support the FUA bit, of which we unfortunately have a lot, given > that libata by default disables FUA support even if the device > supports it. These were the reasons. * Some controllers puke on FUA commands whether the device supports it or not. * With the traditional strong barriers, it doesn't make much difference whether FUA is used or not. The full queue has already been stalled and flushed by the time the barrier write is issued, and all that we save is overhead for a single command, which doesn't make any difference to the actual timing of completion. * Low confidence in drives reporting FUA support. New features in the ATA world seldom work well and I'm fairly sure there are devices which report FUA support and handle FUA writes exactly the same way as regular writes. :-( So, until now, it just wasn't worth the effort / risk. If filesystems can make use of more relaxed ordering including avoiding a full flush completely, it might make sense to revisit it. But, in general, I think most barriers, even when relaxed, would at least involve a single flush before the FUA write, and in that case I'm pretty skeptical how useful a FUA write for the barrier itself would be. Thanks. -- tejun ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:24 ` Tejun Heo @ 2010-07-28 9:38 ` Christoph Hellwig 0 siblings, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-28 9:38 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Hellwig, Hannes Reinecke, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On Wed, Jul 28, 2010 at 11:24:11AM +0200, Tejun Heo wrote: > On 07/28/2010 11:16 AM, Christoph Hellwig wrote: > > The problem is to emulate it properly on devices that do not actually > > support the FUA bit, of which we unfortunately have a lot, given > > that libata by default disables FUA support even if the device > > supports it. > > These were the reasons. > > * Some controllers puke on FUA commands whether the device supports > it or not. > > * Low confidence in drives reporting FUA support. New features in the ATA > world seldom work well and I'm fairly sure there are devices which > report FUA support and handle FUA writes exactly the same way as > regular writes. :-( Jens recently said that Windows seems to send lots of FUA requests these days, which should really have helped shake it out. > completely, it might make sense to revisit it. But, in general, I > think most barriers, even when relaxed, would at least involve single > flush before the FUA write and in that case I'm pretty skeptical how > useful FUA write for the barrier itself would be. At least for XFS we should be able to get away with almost no full flush at all for special workloads (no fsyncs/syncs, no appending file writes). With more normal workloads that get a fsync/sync once in a while we'd almost always do a full flush for every log write, though. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:11 ` Hannes Reinecke 2010-07-28 9:16 ` Christoph Hellwig @ 2010-07-28 9:28 ` Steven Whitehouse 2010-07-28 9:35 ` READ_META semantics, was " Christoph Hellwig 1 sibling, 1 reply; 155+ messages in thread From: Steven Whitehouse @ 2010-07-28 9:28 UTC (permalink / raw) To: Hannes Reinecke Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, konishi.ryusuke Hi, On Wed, 2010-07-28 at 11:11 +0200, Hannes Reinecke wrote: > Christoph Hellwig wrote: > > On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote: > >> I see. It probably would be good to have ordering requirements > >> carried in the bio / request, so that filesystems can mix and match > >> barriers of different strengths as necessary. As you seem to be > >> already working on it, are you interested in pursuing that direction? > > > > I've been working on that for a while, but it got a lot more urgent > > as there's been an application hit particularly hard by the barrier > > semantics on cache-less devices and people started getting angry > > about it. That's why fixing this for cache-less devices has become > > a higher priority than solving the big picture. > > > My idea here is to use the 'META' request tag to emulate FUA. > From what I've seen, the META request tag is only ever used on gfs2, > and even there it is used for tagging journal requests on write. > > Once you've tagged all bios/requests correctly, it is trivial to > set the FUA bit. > > Cheers, > > Hannes The META tag is used in GFS2 for tagging all metadata, whether to the journal or otherwise. Is there some reason why this isn't correct? My understanding was that it was more or less an informational hint to those watching blktrace, Steve. ^ permalink raw reply [flat|nested] 155+ messages in thread
* READ_META semantics, was Re: [RFC] relaxed barrier semantics 2010-07-28 9:28 ` Steven Whitehouse @ 2010-07-28 9:35 ` Christoph Hellwig 2010-07-28 13:52 ` Jeff Moyer 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-28 9:35 UTC (permalink / raw) To: Steven Whitehouse Cc: Hannes Reinecke, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, konishi.ryusuke On Wed, Jul 28, 2010 at 10:28:55AM +0100, Steven Whitehouse wrote: > The META tag is used in GFS2 for tagging all metadata whether to the > journal or otherwise. Is there some reason why this isn't correct? My > understanding was that it was more or less an informational hint to > those watching blktrace, Unfortunately the META flag is overloaded in the CFQ I/O scheduler. It gives META requests a boost over others, including synchronous requests. From all I could gather so far it's intended to give desktops better interactivity by boosting some metadata reads, while it should in that form never be used for writes. So far I failed badly in getting a clarification of which read requests need to be tagged, and whether we should stop applying this boost to write requests marked META so that the flag can be used for blktrace tagging. Unless we really want to boost all reads, separating the META flag from a BOOST flag might be a good option, but I really need to understand better how it's supposed to be used. Except for gfs2's big-hammer tagging, it's used in ext3/ext4 for all reads on directories, the quota file and for reading the actual inode structure. It's not used for indirect blocks, symlinks, the superblock and allocation bitmaps. XFS appears to set the META flag for both reads and writes, but that code is unreachable currently. I haven't removed it yet as I'm still wondering if it could be used correctly instead. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: READ_META semantics, was Re: [RFC] relaxed barrier semantics 2010-07-28 9:35 ` READ_META semantics, was " Christoph Hellwig @ 2010-07-28 13:52 ` Jeff Moyer 0 siblings, 0 replies; 155+ messages in thread From: Jeff Moyer @ 2010-07-28 13:52 UTC (permalink / raw) To: Christoph Hellwig Cc: Steven Whitehouse, Hannes Reinecke, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, konishi.ryusuke Christoph Hellwig <hch@lst.de> writes: > On Wed, Jul 28, 2010 at 10:28:55AM +0100, Steven Whitehouse wrote: >> The META tag is used in GFS2 for tagging all metadata whether to the >> journal or otherwise. Is there some reason why this isn't correct? My >> understanding was that it was more or less an informational hint to >> those watching blktrace, > > Unfortunately the META flag is overloaded in the CFQ I/O scheduler. > It gives META requests a boost over others, including synchronous > requests. Within a single process, when choosing the next request to be serviced, if both requests are synchronous and one is tagged as metadata, then the metadata request is chosen. Also, as you mention, a request tagged as metadata will also allow the issuing process to preempt another process that currently has the I/O scheduler. Note that this isn't the intention of the code; it's actually a bug, I think:

	/*
	 * So both queues are sync. Let the new request get disk time if
	 * it's a metadata request and the current queue is doing regular IO.
	 */
	if (rq_is_meta(rq) && !cfqq->meta_pending)
		return true;

But, it seems to me that there is no guarantee that both cfq_queues are synchronous at this point! Probably some code reshuffling has caused this to happen. > From all I could gather so far it's intended to give desktops better > interactivity by boosting some metadata reads, while it should in that > form never be used for writes. Unfortunately, I don't know the history of this code.
The commit messages are too vague to be useful:

	cfq-iosched: fix bad return value cfq_should_preempt()

	Commit a6151c3a5c8e1ff5a28450bc8d6a99a2a0add0a7 inadvertently reversed
	a preempt condition check, potentially causing a performance regression.
	Make the meta check correct again.

It's anyone's guess as to what the performance regression "potentially" was.

	commit 374f84ac39ec7829a57a66efd5125d3561ff0e00
	Author: Jens Axboe <axboe@suse.de>
	Date: Sun Jul 23 01:42:19 2006 +0200

	[PATCH] cfq-iosched: use metadata read flag

	Give meta data reads preference over regular reads, as the process
	often needs to get that out of the way to do the io it was actually
	interested in.

	Signed-off-by: Jens Axboe <axboe@suse.de>

Again, no idea what the affected workloads are. I have to admit, though it sounds like a good idea. ;-) Jens, if you know what types of workloads are affected, then I can put together some tests and submit a patch to fix the above logic. > So far I failed badly in getting a clarification of which read requests > need to be tagged and if we should not apply this boost to write request > marked META so that they can be used for blktrace tagging. Unless > we really want to boost all reads separating the META from the BOOST > flag might be a good option, but I really need to understand better > how it's supposed to use. I think it makes sense to split out the flag into two: one for blktrace annotation and the other for boosted I/O priority. Hopefully we can come up with some real-world use cases that show the benefits of the latter. Cheers, Jeff > Except for gfs2 big hammer tagging it's used in ext3/ext4 for all > reads on directories, the quota file and for reading the actual inode > structure. It's not used for indirect blocks, symlinks, the superblock > and allocation bitmaps. > > XFS appears to set the META flag for both reads and writes, but that > code is unreachable currently.
I haven't removed it yet as I'm still > wondering if it could be used correctly instead. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:00 ` Christoph Hellwig 2010-07-28 9:11 ` Hannes Reinecke @ 2010-07-28 9:17 ` Tejun Heo 2010-07-28 9:28 ` Christoph Hellwig 2010-07-28 13:56 ` Vladislav Bolkhovitin 2010-07-28 14:42 ` Vivek Goyal 2 siblings, 2 replies; 155+ messages in thread From: Tejun Heo @ 2010-07-28 9:17 UTC (permalink / raw) To: Christoph Hellwig Cc: Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On 07/28/2010 11:00 AM, Christoph Hellwig wrote: > On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote: >> I see. It probably would be good to have ordering requirements >> carried in the bio / request, so that filesystems can mix and match >> barriers of different strengths as necessary. As you seem to be >> already working on it, are you interested in pursuing that direction? > > I've been working on that for a while, but it got a lot more urgent > as there's been an application hit particularly hard by the barrier > semantics on cache-less devices and people started getting angry > about it. That's why fixing this for cache-less devices has become > a higher priority than solving the big picture. Well, if disabling barriers works around the problem for them (which is basically what was suggested in the first message), that's not too bad for the short term, I think. At least, there's a handy workaround. I'll re-read the barrier code and see how hard it would be to implement a proper solution. Thanks. -- tejun ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:17 ` Tejun Heo @ 2010-07-28 9:28 ` Christoph Hellwig 2010-07-28 9:48 ` Tejun Heo ` (5 more replies) 2010-07-28 13:56 ` Vladislav Bolkhovitin 1 sibling, 6 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-28 9:28 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Hellwig, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On Wed, Jul 28, 2010 at 11:17:06AM +0200, Tejun Heo wrote: > Well, if disabling barriers works around the problem for them (which is > basically what was suggested in the first message), that's not too > bad for the short term, I think. It's a pretty horrible workaround. Requiring manual mount options to get performance out of a setup which could trivially work out of the box is a bad workaround. > I'll re-read the barrier code and see how hard it would be to implement a > proper solution. If we move all filesystems to non-draining barriers with pre- and post-flushes, that might actually be a relatively easy first step. We don't have the complications of dealing with multiple types of barriers to start with, and it'll fix the issue for devices without volatile write caches completely. I just need some help from the filesystem folks to determine if they are safe with them. I know for sure that ext3 and xfs are, from looking through them. And I know reiserfs is, if we make sure it doesn't hit the code path that relies on it, which is currently enabled by the barrier option. I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks. That already ends our small list of barrier-supporting filesystems, and possibly ocfs2, too - although the barrier implementation there seems incomplete as it doesn't seem to flush caches in fsync. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:28 ` Christoph Hellwig @ 2010-07-28 9:48 ` Tejun Heo 2010-07-28 10:19 ` Steven Whitehouse ` (4 subsequent siblings) 5 siblings, 0 replies; 155+ messages in thread From: Tejun Heo @ 2010-07-28 9:48 UTC (permalink / raw) To: Christoph Hellwig Cc: Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On 07/28/2010 11:28 AM, Christoph Hellwig wrote: > If we move all filesystems to non-draining barriers with pre- and post- > flushes that might actually be a relatively easy first step. We don't > have the complications to deal with multiple types of barriers to > start with, and it'll fix the issue for devices without volatile write > caches completely. > > I just need some help from the filesystem folks to determine if they > are safe with them. Agreed, if all filesystems can agree on the relaxed semantics, things would be much simpler. Thanks. -- tejun ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:28 ` Christoph Hellwig 2010-07-28 9:48 ` Tejun Heo @ 2010-07-28 10:19 ` Steven Whitehouse 2010-07-28 11:45 ` Christoph Hellwig 2010-07-28 12:47 ` Jan Kara ` (3 subsequent siblings) 5 siblings, 1 reply; 155+ messages in thread From: Steven Whitehouse @ 2010-07-28 10:19 UTC (permalink / raw) To: Christoph Hellwig Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, konishi.ryusuke Hi, On Wed, 2010-07-28 at 11:28 +0200, Christoph Hellwig wrote: > On Wed, Jul 28, 2010 at 11:17:06AM +0200, Tejun Heo wrote: > > Well, if disabling barrier works around the problem for them (which is > > basically what was suggeseted in the first message), that's not too > > bad for short term, I think. > > It's a pretty horrible workaround. Requiring manual mount options to > get performance out of a setup which could trivially work out of the > box is a bad workaround. > > > I'll re-read barrier code and see how hard it would be to implement a > > proper solution. > > If we move all filesystems to non-draining barriers with pre- and post- > flushes that might actually be a relatively easy first step. We don't > have the complications to deal with multiple types of barriers to > start with, and it'll fix the issue for devices without volatile write > caches completely. > > I just need some help from the filesystem folks to determine if they > are safe with them. > > I know for sure that ext3 and xfs are from looking through them. And > I know reiserfs is if we make sure it doesn't hit the code path that > relies on it that is currently enabled by the barrier option. > > I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks. > That already ends our small list of barrier supporting filesystems, and > possibly ocfs2, too - although the barrier implementation there seems > incomplete as it doesn't seem to flush caches in fsync. GFS2 uses barriers only on journal flushing. 
There are three reasons for flushing the journal:

1. It's full and we need more space (or the periodic timer has expired, and there is at least one transaction to flush)
2. We are doing fsync or a full fs sync
3. We need to release a glock to another node, and that glock has some journaled blocks associated with it

In case #1, I don't think there is any need to actually issue a flush along with the barrier - the fs will always be correct in case of a (for example) power failure and it is only the amount of data which might be lost which depends on the write cache size. This is basically the same for any local filesystem.

In case #2 we must always flush.

In case #3 we need to be certain that all I/O up to and including the barrier (and subsequently written back in-place metadata, if any) has reached the storage device (and is not still lurking in the I/O elevator) before we release the lock, but there is no actual need to flush the write cache of the device itself. In other words, we need to flush the non-shared bit of the stack, but not the shared bit on the device itself. The same caveats about the amount of data which may be lost on power failure apply as per case #1.

I have also made the assumption that a barrier issued from one node to the shared device will affect I/O from all nodes equally. If that is not the case, then the above will not apply and we must always flush in case #3.

Currently the code is also waiting for I/O to drain in cases #1 and #3 as well as case #2, since it was simpler to implement all cases the same, at least to start with.

Also in case #3, if we were to implement a non-flushing barrier, then we would need to add a barrier after the in-place metadata writeback of the inode that is being released, I think, in order to be sure cross-node ordering was correct. Hmmm. Maybe we should be doing that anyway....

Steve. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 10:19 ` Steven Whitehouse @ 2010-07-28 11:45 ` Christoph Hellwig 0 siblings, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-28 11:45 UTC (permalink / raw) To: Steven Whitehouse Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, konishi.ryusuke On Wed, Jul 28, 2010 at 11:19:57AM +0100, Steven Whitehouse wrote:

> In case #1, I don't think there is any need to actually issue a flush
> along with the barrier - the fs will always be correct in case of a (for
> example) power failure and it is only the amount of data which might be
> lost which depends on the write cache size. This is basically the same
> for any local filesystem.

For now we're mostly talking about removing the _ordering_, not the flushing. Eventually I'd like to relax some of the flushing requirements, too - but that is a secondary priority. So for now I'm mostly interested in whether gfs2 relies on the ordering semantics from barriers. Given that it's been around for a while and primarily used on devices without any kind of barrier support, I'm inclined to think it is, but I'd really prefer to get this from the horse's mouth.

> I have also made the assumption that a barrier issued from one node to
> the shared device will affect I/O from all nodes equally. If that is not
> the case, then the above will not apply and we must always flush in case
> #3.

There is absolutely no ordering vs other nodes. The volatile write cache, if present, is per-target state, so it will be flushed for all nodes.

> Currently the code is also waiting for I/O to drain in cases #1 and #3
> as well as case #2 since it was simpler to implement all cases the same,
> at least to start with.

Aka gfs2 waits for the I/O completion by itself. That sounds like it is the answer to my original question. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:28 ` Christoph Hellwig 2010-07-28 9:48 ` Tejun Heo 2010-07-28 10:19 ` Steven Whitehouse @ 2010-07-28 12:47 ` Jan Kara 2010-07-28 23:00 ` Christoph Hellwig 2010-07-29 1:44 ` Ted Ts'o ` (2 subsequent siblings) 5 siblings, 1 reply; 155+ messages in thread From: Jan Kara @ 2010-07-28 12:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On Wed 28-07-10 11:28:59, Christoph Hellwig wrote: > On Wed, Jul 28, 2010 at 11:17:06AM +0200, Tejun Heo wrote: > > Well, if disabling barrier works around the problem for them (which is > > basically what was suggeseted in the first message), that's not too > > bad for short term, I think. > > It's a pretty horrible workaround. Requiring manual mount options to > get performance out of a setup which could trivially work out of the > box is a bad workaround. > > > I'll re-read barrier code and see how hard it would be to implement a > > proper solution. > > If we move all filesystems to non-draining barriers with pre- and post- > flushes that might actually be a relatively easy first step. We don't > have the complications to deal with multiple types of barriers to > start with, and it'll fix the issue for devices without volatile write > caches completely. > > I just need some help from the filesystem folks to determine if they > are safe with them. > > I know for sure that ext3 and xfs are from looking through them. And Yes, ext3 is safe. > I know reiserfs is if we make sure it doesn't hit the code path that > relies on it that is currently enabled by the barrier option. Yes, just always writing the commit buffer at the place where we currently do it in !barrier case should be enough for reiserfs. > I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks. 
As I wrote in some other email, ext4/jbd2 is OK, unless you mount the filesystem with the async_commit mount option. With that option it does the same thing as reiserfs does in the barrier case - i.e., it needs the ordering.

> That already ends our small list of barrier supporting filesystems, and
> possibly ocfs2, too - although the barrier implementation there seems
> incomplete as it doesn't seem to flush caches in fsync.

Well, ocfs2 uses jbd2 for journaling, so it supports barriers out of the box and does not need the ordering. ocfs2_sync_file is actually correct (although maybe slightly inefficient) because it does jbd2_journal_force_commit(), which creates and immediately commits a transaction, and that implies a barrier.

Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 12:47 ` Jan Kara @ 2010-07-28 23:00 ` Christoph Hellwig 2010-07-29 10:45 ` Jan Kara 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-28 23:00 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On Wed, Jul 28, 2010 at 02:47:20PM +0200, Jan Kara wrote:

> Well, ocfs2 uses jbd2 for journaling so it supports barriers out of the
> box and does not need the ordering. ocfs2_sync_file is actually correct
> (although maybe slightly inefficient) because it does
> jbd2_journal_force_commit() which creates and immediately commits a
> transaction and that implies a barrier.

I don't think that's correct. ocfs2_sync_file first does ocfs2_sync_inode, which does a completely superfluous filemap_fdatawrite, and from what I can see a just as superfluous sync_mapping_buffers (given that ocfs2 doesn't use mark_buffer_dirty_inode), and then might return early in case we do fdatasync but the inode isn't marked I_DIRTY_DATASYNC. In that case we might need a cache flush given that the data might still be dirty. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 23:00 ` Christoph Hellwig @ 2010-07-29 10:45 ` Jan Kara 2010-07-29 16:54 ` Joel Becker 0 siblings, 1 reply; 155+ messages in thread From: Jan Kara @ 2010-07-29 10:45 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On Thu 29-07-10 01:00:10, Christoph Hellwig wrote: > On Wed, Jul 28, 2010 at 02:47:20PM +0200, Jan Kara wrote: > > Well, ocfs2 uses jbd2 for journaling so it supports barriers out of the > > box and does not need the ordering. ocfs2_sync_file is actually correct > > (although maybe slightly inefficient) because it does > > jbd2_journal_force_commit() which creates and immediately commits a > > transaction and that implies a barrier. > > I don't think that's correct. ocfs2_sync_file first does > ocfs2_sync_inode, which does a completely superflous filemap_fdatawrite, > and from what I can see a just as superflous sync_mapping_buffers (given > that ocfs doesn't use mark_buffer_dirty_inode) and then might return > early in case we do fdatasync but the inode isn't marked > I_DIRTY_DATASYNC. In that case we might need a cache flush given > that the data might still be dirty. Ah, I see. You're right, fdatasync case is buggy. I'll send Joel a fix. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 10:45 ` Jan Kara @ 2010-07-29 16:54 ` Joel Becker 2010-07-29 17:02 ` Christoph Hellwig 2010-07-29 17:02 ` Christoph Hellwig 0 siblings, 2 replies; 155+ messages in thread From: Joel Becker @ 2010-07-29 16:54 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On Thu, Jul 29, 2010 at 12:45:30PM +0200, Jan Kara wrote: > On Thu 29-07-10 01:00:10, Christoph Hellwig wrote: > > On Wed, Jul 28, 2010 at 02:47:20PM +0200, Jan Kara wrote: > > > Well, ocfs2 uses jbd2 for journaling so it supports barriers out of the > > > box and does not need the ordering. ocfs2_sync_file is actually correct > > > (although maybe slightly inefficient) because it does > > > jbd2_journal_force_commit() which creates and immediately commits a > > > transaction and that implies a barrier. > > > > I don't think that's correct. ocfs2_sync_file first does > > ocfs2_sync_inode, which does a completely superflous filemap_fdatawrite, > > and from what I can see a just as superflous sync_mapping_buffers (given > > that ocfs doesn't use mark_buffer_dirty_inode) and then might return > > early in case we do fdatasync but the inode isn't marked > > I_DIRTY_DATASYNC. In that case we might need a cache flush given > > that the data might still be dirty. > Ah, I see. You're right, fdatasync case is buggy. I'll send Joel a fix. I can certainly see our code being inefficient if the handled-for-us behaviors of sync have changed. If the VFS is already doing some work for us, maybe we don't need to do it. But we have to be sure that these calls are always going through those paths. We sync our files to disk when we drop cluster locks, regardless of whether there is a userspace fsync(). I guess I never knew that data could be dirty without the I_DIRTY_DATASYNC bit. Joel -- "Copy from one, it's plagiarism; copy from two, it's research." 
- Wilson Mizner Joel Becker Consulting Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 16:54 ` Joel Becker @ 2010-07-29 17:02 ` Christoph Hellwig 2010-07-29 17:02 ` Christoph Hellwig 1 sibling, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-29 17:02 UTC (permalink / raw) To: Jan Kara, Christoph Hellwig, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel On Thu, Jul 29, 2010 at 09:54:50AM -0700, Joel Becker wrote:

> handled-for-us behaviors of sync have changed. If the VFS is already
> doing some work for us, maybe we don't need to do it. But we have to be
> sure that these calls are always going through those paths. We sync our
> files to disk when we drop cluster locks, regardless of whether there is
> a userspace fsync().

ocfs2_sync_file only gets called through the fsync inode operation, so that doesn't happen here. And if it did, the filemap_fdatawrite would not help at all, given that it only starts writeout, but never waits for it to finish. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:28 ` Christoph Hellwig ` (2 preceding siblings ...) 2010-07-28 12:47 ` Jan Kara @ 2010-07-29 1:44 ` Ted Ts'o 2010-07-29 2:43 ` Vivek Goyal ` (4 more replies) 2010-08-02 16:47 ` Ryusuke Konishi 2010-08-02 17:39 ` Chris Mason 5 siblings, 5 replies; 155+ messages in thread From: Ted Ts'o @ 2010-07-29 1:44 UTC (permalink / raw) To: Christoph Hellwig Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Wed, Jul 28, 2010 at 11:28:59AM +0200, Christoph Hellwig wrote:

> If we move all filesystems to non-draining barriers with pre- and post-
> flushes that might actually be a relatively easy first step. We don't
> have the complications to deal with multiple types of barriers to
> start with, and it'll fix the issue for devices without volatile write
> caches completely.
>
> I just need some help from the filesystem folks to determine if they
> are safe with them.
>
> I know for sure that ext3 and xfs are from looking through them. And
> I know reiserfs is if we make sure it doesn't hit the code path that
> relies on it that is currently enabled by the barrier option.
>
> I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks.
> That already ends our small list of barrier supporting filesystems, and
> possibly ocfs2, too - although the barrier implementation there seems
> incomplete as it doesn't seem to flush caches in fsync.

Define "are safe" --- what interface are we planning on using for the non-draining barrier? At least for ext3, when we write the commit record using set_buffer_ordered(bh), it assumes that this will do a flush of all previous writes and that the commit will hit the disk before any subsequent writes are sent to the disk. So turning the write of a buffer head marked with set_buffer_ordered() into a FUA write would _not_ be safe for ext3.
For ext4, if we don't use journal checksums, then we have the same requirements as ext3, and the same method of requesting it. If we do use journal checksums, what ext4 needs is a way of assuring that no writes after the commit are reordered with respect to the disk platter before the commit record --- but any of the writes before that, including the commit, can be reordered, because we rely on the checksum in the commit record to know at replay time whether the last commit is valid or not. We do that right now by calling blkdev_issue_flush() with BLKDEV_IFL_WAIT after submitting the write of the commit block. - Ted ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 1:44 ` Ted Ts'o @ 2010-07-29 2:43 ` Vivek Goyal 2010-07-29 2:43 ` Vivek Goyal ` (3 subsequent siblings) 4 siblings, 0 replies; 155+ messages in thread From: Vivek Goyal @ 2010-07-29 2:43 UTC (permalink / raw) To: Ted Ts'o, Christoph Hellwig, Tejun Heo, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel On Wed, Jul 28, 2010 at 09:44:31PM -0400, Ted Ts'o wrote: > On Wed, Jul 28, 2010 at 11:28:59AM +0200, Christoph Hellwig wrote: > > If we move all filesystems to non-draining barriers with pre- and post- > > flushes that might actually be a relatively easy first step. We don't > > have the complications to deal with multiple types of barriers to > > start with, and it'll fix the issue for devices without volatile write > > caches completely. > > > > I just need some help from the filesystem folks to determine if they > > are safe with them. > > > > I know for sure that ext3 and xfs are from looking through them. And > > I know reiserfs is if we make sure it doesn't hit the code path that > > relies on it that is currently enabled by the barrier option. > > > > I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks. > > That already ends our small list of barrier supporting filesystems, and > > possibly ocfs2, too - although the barrier implementation there seems > > incomplete as it doesn't seem to flush caches in fsync. > > Define "are safe" --- what interface we planning on using for the > non-draining barrier? At least for ext3, when we write the commit > record using set_buffer_ordered(bh), it assumes that this will do a > flush of all previous writes and that the commit will hit the disk > before any subsequent writes are sent to the disk. So turning the > write of a buffer head marked with set_buffered_ordered() into a FUA > write would _not_ be safe for ext3. 
> I guess we will require something like set_buffer_preflush_fua() kind of operation so that we preflush the cache to make sure everything before commit block is on platter and then do commit block write with FUA to make sure commit block is on platter. This is assuming that before issuing commit block request we have waited for completion of rest of the journal data. This will make sure none of that journal data is in request queue. Then if we issue commit with preflush and FUA, it should make sure all the journal blocks are on disk and then commit block is on disk. So as long as we wait in filesystem for completion of the requests commit block is dependent on, before we issue commit request, we should not require request queue drain and preflush and FUA write probably should be fine. > For ext4, if we don't use journal checksums, then we have the same > requirements as ext3, and the same method of requesting it. If we do > use journal checksums, what ext4 needs is a way of assuring that no > writes after the commit are reordered with respect to the disk platter > before the commit record --- but any of the writes before that, > including the commit, and be reordered because we rely on the checksum > in the commit record to know at replay time whether the last commit is > valid or not. We do that right now by calling blkdev_issue_flush() > with BLKDEF_IFL_WAIT after submitting the write of the commit block. IIUC, blkdev_issue_flush() is just a hard barrier and will drain queue and flush the cache. I guess what we need is only flush and not drain after we have waited for completion of commit record as well as requests issued before commit record. That should make sure any WRITE after commit record does not get reordered w.r.t previous commit. So we probably need blkdev_issue_flush_only() which will just flush caches and not drain request queue. This is all based on my very primitive knowledge. Please ignore if it is all rubbish. 
Thanks Vivek ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 2:43 ` Vivek Goyal @ 2010-07-29 8:42 ` Christoph Hellwig 2010-07-29 20:02 ` Vivek Goyal 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-29 8:42 UTC (permalink / raw) To: Vivek Goyal Cc: Ted Ts'o, Christoph Hellwig, Tejun Heo, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Wed, Jul 28, 2010 at 10:43:34PM -0400, Vivek Goyal wrote: > I guess we will require something like set_buffer_preflush_fua() kind of > operation so that we preflush the cache to make sure everything before > commit block is on platter and then do commit block write with FUA > to make sure commit block is on platter. No more messing with buffer flags for barriers / cache flush options please. It's a flag for the I/O submission, not buffer state. See my patch from June to remove BH_Ordered if you're interested. > This is assuming that before issuing commit block request we have waited > for completion of rest of the journal data. This will make sure none of > that journal data is in request queue. Then if we issue commit with > preflush and FUA, it should make sure all the journal blocks are on > disk and then commit block is on disk. > > So as long as we wait in filesystem for completion of the requests commit > block is dependent on, before we issue commit request, we should not > require request queue drain and preflush and FUA write probably should > be fine. We do not require the drain for that case. The flush is more difficult, because it's entirely possible that we have state that we require to be on disk before writing out a log buffer. 
For XFS that's two cases:

(1) we require the actual file data to be on disk before logging the file size update, to avoid stale data exposure in case the log buffer hits the disk before the data
(2) we require that the buffers writing back metadata actually made it to disk before pushing the log tail

(1) means we'll always need a pre-flush when a log buffer contains a size update from an appending write.
(2) means we need more complicated tracking of the tail lsn, e.g. by caching it somewhere and only updating the cached value after a cache flush happened, with a way to force one if needed.

All that is at least as complicated as it sounds. While I have a working prototype, just going with the relaxed barriers as a first step is probably best.

> IIUC, blkdev_issue_flush() is just a hard barrier and will drain queue
> and flush the cache.

Exactly. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 8:42 ` Christoph Hellwig @ 2010-07-29 20:02 ` Vivek Goyal 2010-07-29 20:06 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Vivek Goyal @ 2010-07-29 20:02 UTC (permalink / raw) To: Christoph Hellwig Cc: Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, Jul 29, 2010 at 10:42:25AM +0200, Christoph Hellwig wrote: > On Wed, Jul 28, 2010 at 10:43:34PM -0400, Vivek Goyal wrote: > > I guess we will require something like set_buffer_preflush_fua() kind of > > operation so that we preflush the cache to make sure everything before > > commit block is on platter and then do commit block write with FUA > > to make sure commit block is on platter. > > No more messing with buffer flags for barriers / cache flush options > please. It's a flag for the I/O submission, not buffer state. See > my patch from June to remove BH_Ordered if you're interested. > > > This is assuming that before issuing commit block request we have waited > > for completion of rest of the journal data. This will make sure none of > > that journal data is in request queue. Then if we issue commit with > > preflush and FUA, it should make sure all the journal blocks are on > > disk and then commit block is on disk. > > > > So as long as we wait in filesystem for completion of the requests commit > > block is dependent on, before we issue commit request, we should not > > require request queue drain and preflush and FUA write probably should > > be fine. > > We do not require the drain for that case. The flush is more difficult, > because it's entirely possible that we have state that we require to be > on disk before writing out a log buffer. 
> For XFS that's two cases:
>
> (1) we require the actual file data to be on disk before logging the file size update to avoid stale data exposure in case the log buffer hits the disk before the data
> (2) we require that the buffers writing back metadata actually made it to disk before pushing the log tail
>
> (1) means we'll always need a pre-flush when a log buffer contains a size update from an appending write.
> (2) means we need more complicated tracking of the tail lsn, e.g. by caching it somewhere and only updating the cached value after a cache flush happened, with a way to force one if needed.
>
> All that is at least as complicated as it sounds. While I have a working prototype, just going with the relaxed barriers as a first step is probably best.

There are so many mails on this topic now that I am kind of lost. I guess this has already been asked, but I will ask one more time.

Looks like you still want to go with option 2, where you will scan the filesystem code for requirements on DRAIN semantics and, if everything is fine, then for devices not supporting volatile caches you will mark the request queue as NONE.

This solves the problem on devices with WCE=0, but what about devices with WCE=1? If filesystems don't require DRAIN semantics anyway, then we should not require it on devices with WCE=1 either. If so, then why not go with another variant of barriers which doesn't perform DRAIN and just does PREFLUSH + FUA (or a post-flush for devices not supporting FUA)? File systems could then slowly move to using this non-draining barrier wherever appropriate. The advantage here is that it should save us the request queue DRAIN even on devices with WCE=1.

Am I missing something very obvious here?

Vivek ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 20:02 ` Vivek Goyal @ 2010-07-29 20:06 ` Christoph Hellwig 2010-07-30 3:17 ` Vivek Goyal 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-29 20:06 UTC (permalink / raw) To: Vivek Goyal Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, Jul 29, 2010 at 04:02:17PM -0400, Vivek Goyal wrote: > Looks like you still want to go with option 2 where you will scan the file > system code for requirement of DRAIN semantics and everything is fine then for > devices no supporting volatile caches, you will mark request queue as NONE. The filesystem can't simply change the request queue settings. A request queue is often shared by multiple filesystems that can have very different requirements. > This solves the problem on devices with WCE=0 but what about devices with > WCE=1. If file systems anyway don't require DRAIN semantics, then we > should not require it on devices with WCE=1 also? Yes. > If yes, then why not go with another variant of barriers which don't > perform DRAIN and just do PREFLUSH + FUA (or post flush for devices not > supporting FUA). I've been trying to prototype it, but it's in fact rather hard to get this right. Tejun has done a really good job at the current barrier implementation and coming up with something just half as clever for the relaxed barriers has been driving me mad. > And then file systems can slowly move to using this non > draining barrier usage wherever appropriate. Actually supporting different kind of barriers at the same time is even harder. We'll need two different state machines for them, including the actual state in the request_queue. And then make sure when different filesystems on the same queue use different types work well together. If at all possible switching the semantics on a flag day would make life a lot simpler. 
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 20:06 ` Christoph Hellwig @ 2010-07-30 3:17 ` Vivek Goyal 2010-07-30 7:07 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Vivek Goyal @ 2010-07-30 3:17 UTC (permalink / raw) To: Christoph Hellwig Cc: Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, Jul 29, 2010 at 10:06:55PM +0200, Christoph Hellwig wrote: > On Thu, Jul 29, 2010 at 04:02:17PM -0400, Vivek Goyal wrote: > > Looks like you still want to go with option 2 where you will scan the file > > system code for requirement of DRAIN semantics and everything is fine then for > > devices no supporting volatile caches, you will mark request queue as NONE. > > The filesystem can't simply change the request queue settings. A request > queue is often shared by multiple filesystems that can have very > different requirements. > > > This solves the problem on devices with WCE=0 but what about devices with > > WCE=1. If file systems anyway don't require DRAIN semantics, then we > > should not require it on devices with WCE=1 also? > > Yes. > > > If yes, then why not go with another variant of barriers which don't > > perform DRAIN and just do PREFLUSH + FUA (or post flush for devices not > > supporting FUA). > > I've been trying to prototype it, but it's in fact rather hard to > get this right. Tejun has done a really good job at the current > barrier implementation and coming up with something just half as > clever for the relaxed barriers has been driving me mad. > > > And then file systems can slowly move to using this non > > draining barrier usage wherever appropriate. > > Actually supporting different kind of barriers at the same time > is even harder. We'll need two different state machines for them, > including the actual state in the request_queue. And then make > sure when different filesystems on the same queue use different > types work well together. 
If at all possible switching the semantics
> on a flag day would make life a lot simpler.

Hi Christoph,

I was looking at the barrier code and was trying to work out how hard it would be to support a new barrier type which does not implement DRAIN but only does PREFLUSH + FUA for devices with WCE=1.

To me it looked as if everything is there and it is just a matter of skipping elevator draining and request queue draining. Can you please have a look at the attached patch? This is not a complete patch but just a part of it, if we were to implement another barrier type, say FLUSHBARRIER.

Do you think this will work, or am I blissfully unaware of the complexity here and oversimplifying things?

Thanks
Vivek

---
 block/blk-barrier.c    |   14 +++++++++++++-
 block/elevator.c       |    3 ++-
 include/linux/blkdev.h |    5 ++++-
 3 files changed, 19 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2010-06-19 09:54:32.000000000 -0400
+++ linux-2.6/include/linux/blkdev.h	2010-07-29 22:36:52.000000000 -0400
@@ -97,6 +97,7 @@ enum rq_flag_bits {
 	__REQ_SORTED,		/* elevator knows about this request */
 	__REQ_SOFTBARRIER,	/* may not be passed by ioscheduler */
 	__REQ_HARDBARRIER,	/* may not be passed by drive either */
+	__REQ_FLUSHBARRIER,	/* only flush barrier. no drains required */
 	__REQ_FUA,		/* forced unit access */
 	__REQ_NOMERGE,		/* don't touch this for merging */
 	__REQ_STARTED,		/* drive already may have started this one */
@@ -126,6 +127,7 @@ enum rq_flag_bits {
 #define REQ_SORTED		(1 << __REQ_SORTED)
 #define REQ_SOFTBARRIER		(1 << __REQ_SOFTBARRIER)
 #define REQ_HARDBARRIER		(1 << __REQ_HARDBARRIER)
+#define REQ_FLUSHBARRIER	(1 << __REQ_FLUSHBARRIER)
 #define REQ_FUA			(1 << __REQ_FUA)
 #define REQ_NOMERGE		(1 << __REQ_NOMERGE)
 #define REQ_STARTED		(1 << __REQ_STARTED)
@@ -626,6 +628,7 @@ enum {
 #define blk_rq_cpu_valid(rq)	((rq)->cpu != -1)
 #define blk_sorted_rq(rq)	((rq)->cmd_flags & REQ_SORTED)
 #define blk_barrier_rq(rq)	((rq)->cmd_flags & REQ_HARDBARRIER)
+#define blk_flush_barrier_rq(rq)	((rq)->cmd_flags & REQ_FLUSHBARRIER)
 #define blk_fua_rq(rq)		((rq)->cmd_flags & REQ_FUA)
 #define blk_discard_rq(rq)	((rq)->cmd_flags & REQ_DISCARD)
 #define blk_bidi_rq(rq)		((rq)->next_rq != NULL)
@@ -681,7 +684,7 @@ static inline void blk_clear_queue_full(
  * it already be started by driver.
  */
 #define RQ_NOMERGE_FLAGS	\
-	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER)
+	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER | REQ_FLUSHBARRIER)
 #define rq_mergeable(rq)	\
 	(!((rq)->cmd_flags & RQ_NOMERGE_FLAGS) && \
 	 (blk_discard_rq(rq) || blk_fs_request((rq))))

Index: linux-2.6/block/blk-barrier.c
===================================================================
--- linux-2.6.orig/block/blk-barrier.c	2010-06-19 09:54:29.000000000 -0400
+++ linux-2.6/block/blk-barrier.c	2010-07-29 23:02:05.000000000 -0400
@@ -219,7 +219,8 @@ static inline bool start_ordered(struct
 	} else
 		skip |= QUEUE_ORDSEQ_PREFLUSH;

-	if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q))
+	if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q)
+	    && !blk_flush_barrier_rq(rq))
 		rq = NULL;
 	else
 		skip |= QUEUE_ORDSEQ_DRAIN;
@@ -241,6 +242,17 @@ bool blk_do_ordered(struct request_queue
 	if (!q->ordseq) {
 		if (!is_barrier)
 			return true;
+		/*
+		 * For flush-only barriers, nothing has to be done if there
+		 * is no caching happening on the device. The barrier request
+		 * still has to be written to disk, but it can be written as
+		 * a normal rq.
+		 */
+		if (blk_flush_barrier_rq(rq)
+		    && (q->ordered & QUEUE_ORDERED_BY_DRAIN
+			|| q->ordered & QUEUE_ORDERED_BY_TAG))
+			return true;

 		if (q->next_ordered != QUEUE_ORDERED_NONE)
 			return start_ordered(q, rqp);

Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c	2010-06-19 09:54:29.000000000 -0400
+++ linux-2.6/block/elevator.c	2010-07-29 23:06:21.000000000 -0400
@@ -628,7 +628,8 @@ void elv_insert(struct request_queue *q,
 	case ELEVATOR_INSERT_BACK:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
-		elv_drain_elevator(q);
+		if (!blk_flush_barrier_rq(rq))
+			elv_drain_elevator(q);
 		list_add_tail(&rq->queuelist, &q->queue_head);
 		/*
 		 * We kick the queue here for the following reasons.

^ permalink raw reply	[flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 3:17 ` Vivek Goyal @ 2010-07-30 7:07 ` Christoph Hellwig 2010-07-30 7:41 ` Vivek Goyal 2010-08-02 18:28 ` [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics) Vivek Goyal 0 siblings, 2 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-30 7:07 UTC (permalink / raw) To: Vivek Goyal Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke

On Thu, Jul 29, 2010 at 11:17:21PM -0400, Vivek Goyal wrote:
> To me it looked as if everything is there and it is just a matter
> of skipping elevator draining and request queue draining.

The problem is that it just appears to be so. The code blocking only the next barrier for tagged writes is there, but in that form it doesn't work and probably never did. When I try to use it and debug it I always get my post-flush request issued before the barrier request has finished.

^ permalink raw reply	[flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 7:07 ` Christoph Hellwig @ 2010-07-30 7:41 ` Vivek Goyal 2010-08-02 18:28 ` [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics) Vivek Goyal 1 sibling, 0 replies; 155+ messages in thread From: Vivek Goyal @ 2010-07-30 7:41 UTC (permalink / raw) To: Christoph Hellwig Cc: Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke

On Fri, Jul 30, 2010 at 09:07:32AM +0200, Christoph Hellwig wrote:
> On Thu, Jul 29, 2010 at 11:17:21PM -0400, Vivek Goyal wrote:
> > To me it looked as if everything is there and it is just a matter
> > of skipping elevator draining and request queue draining.
>
> The problem is that it just appears to be so. The code blocking only
> the next barrier for tagged writes is there, but in that form it doesn't
> work and probably never did. When I try to use it and debug it I always
> get my post-flush request issued before the barrier request has
> finished.

Are you referring to the following piece of code?

	if (q->ordered & QUEUE_ORDERED_BY_TAG) {
		/* Ordered by tag.  Blocking the next barrier is enough. */
		if (is_barrier && rq != &q->bar_rq)
			*rqp = NULL;

If the request queue is ordered by TAG, then isn't it okay to issue the post-flush immediately after the barrier (without waiting for the barrier request to finish)? We just need to block the next barrier (a new barrier, not the post-flush request of the current barrier). I thought that for a tagged queue the controller will take care of making sure commands finish in order.

If the queue is ordered by DRAIN, then I need to wait for the barrier to finish first and then issue the post-flush, and I thought the following should take care of it.

	} else {
		/* Ordered by draining.  Wait for turn. */
		WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
		if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
			*rqp = NULL;
	}

Maybe there is a bug somewhere. I will do some debugging.
Thanks Vivek ^ permalink raw reply [flat|nested] 155+ messages in thread
* [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics) 2010-07-30 7:07 ` Christoph Hellwig 2010-07-30 7:41 ` Vivek Goyal @ 2010-08-02 18:28 ` Vivek Goyal 2010-08-03 13:03 ` Christoph Hellwig 1 sibling, 1 reply; 155+ messages in thread From: Vivek Goyal @ 2010-08-02 18:28 UTC (permalink / raw) To: Christoph Hellwig Cc: Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke

On Fri, Jul 30, 2010 at 09:07:32AM +0200, Christoph Hellwig wrote:
> On Thu, Jul 29, 2010 at 11:17:21PM -0400, Vivek Goyal wrote:
> > To me it looked as if everything is there and it is just a matter
> > of skipping elevator draining and request queue draining.
>
> The problem is that it just appears to be so. The code blocking only
> the next barrier for tagged writes is there, but in that form it doesn't
> work and probably never did. When I try to use it and debug it I always
> get my post-flush request issued before the barrier request has
> finished.

Hi Christoph,

Please find attached a new version of the patch, where I am trying to implement flush-only barriers. Why do that? I was thinking that it would be nice to avoid elevator drains with WCE=1.

Here I have a DRAIN queue, and I seem to be issuing the post-flush only after the barrier has finished. I still need to find some device with a TAG queue to test.

This is still a very crude patch and I need to do a lot of testing to see if things are working. For the time being I have just hooked up ext3 to use the flush barrier and verified that in the WCE=0 case we don't issue a barrier, and in the WCE=1 case we do issue a barrier with pre-flush and post-flush.

I have not yet found a device with FUA and tagging support to verify that functionality.

I looked at your BH_ordered kill patch. For the time being I have introduced another flag, BH_Flush_Ordered, along the lines of BH_Ordered, but it can easily be replaced once your kill patch is in.

Thanks
Vivek

o Implement flush-only barriers. These do not implement any drain semantics.
  The file system needs to wait for completion of all the dependent IO.

o On storage with no write cache, these barriers should just do nothing. An
  empty barrier request returns immediately, and a write request with a
  barrier is processed as a normal request. No drains, no flushing.

o On storage with a write cache, for an empty barrier only a pre-flush is
  done. For a barrier request with some data, one of the following should
  happen depending on queue capability.

	Draining queue
	--------------
	preflush ==> barrier (FUA)
	preflush ==> barrier ===> postflush

	Ordered Queue
	-------------
	preflush-->barrier (FUA)
	preflush --> barrier ---> postflush

	===> Wait for previous request to finish
	---> Issue an ordered request in a Tagged queue

o For the write-cache-enabled case, we are not completely drain free.

  - I don't try to drain the request queue for dispatching the pre-flush
    request.

  - But after dispatching the pre-flush, I wait for it to finish before the
    actual barrier request goes in. So if the controller re-orders the
    pre-flush and executes it ahead of other requests, full draining will
    be avoided; otherwise it will take place.

  - Similarly, the post-flush will wait for the previous barrier request to
    finish, and this will ultimately lead to draining the queue if the
    drive is not re-ordering the requests.

  - So what did we gain by this patch in the WCE=1 case? I think primarily
    we avoided elevator draining, which can be useful for the IO controller
    where we provide service differentiation in the elevator.

  - Not sure how to avoid this drain. Trying to allow other non-barrier
    requests to dispatch while we wait for the pre-flush/flush barrier to
    finish will make the code more complicated.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 Makefile                    |    2 -
 block/blk-barrier.c         |   67 ++++++++++++++++++++++++++++++++++++++++----
 block/blk-core.c            |    9 +++--
 block/elevator.c            |    9 +++--
 fs/buffer.c                 |    3 +
 fs/ext3/fsync.c             |    2 -
 fs/jbd/commit.c             |    2 -
 include/linux/bio.h         |    7 +++-
 include/linux/blkdev.h      |    9 ++++-
 include/linux/buffer_head.h |    3 +
 include/linux/fs.h          |    1
 kernel/trace/blktrace.c     |    2 -
 12 files changed, 97 insertions(+), 19 deletions(-)

Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/include/linux/blkdev.h	2010-08-02 14:01:17.000000000 -0400
@@ -97,6 +97,7 @@ enum rq_flag_bits {
 	__REQ_SORTED,		/* elevator knows about this request */
 	__REQ_SOFTBARRIER,	/* may not be passed by ioscheduler */
 	__REQ_HARDBARRIER,	/* may not be passed by drive either */
+	__REQ_FLUSHBARRIER,	/* only flush barrier. no drains required */
 	__REQ_FUA,		/* forced unit access */
 	__REQ_NOMERGE,		/* don't touch this for merging */
 	__REQ_STARTED,		/* drive already may have started this one */
@@ -126,6 +127,7 @@ enum rq_flag_bits {
 #define REQ_SORTED		(1 << __REQ_SORTED)
 #define REQ_SOFTBARRIER		(1 << __REQ_SOFTBARRIER)
 #define REQ_HARDBARRIER		(1 << __REQ_HARDBARRIER)
+#define REQ_FLUSHBARRIER	(1 << __REQ_FLUSHBARRIER)
 #define REQ_FUA			(1 << __REQ_FUA)
 #define REQ_NOMERGE		(1 << __REQ_NOMERGE)
 #define REQ_STARTED		(1 << __REQ_STARTED)
@@ -625,7 +627,8 @@ enum {
 #define blk_rq_cpu_valid(rq)	((rq)->cpu != -1)
 #define blk_sorted_rq(rq)	((rq)->cmd_flags & REQ_SORTED)
-#define blk_barrier_rq(rq)	((rq)->cmd_flags & REQ_HARDBARRIER)
+#define blk_barrier_rq(rq)	((rq)->cmd_flags & REQ_HARDBARRIER || (rq)->cmd_flags & REQ_FLUSHBARRIER)
+#define blk_flush_barrier_rq(rq)	((rq)->cmd_flags & REQ_FLUSHBARRIER)
 #define blk_fua_rq(rq)		((rq)->cmd_flags & REQ_FUA)
 #define blk_discard_rq(rq)	((rq)->cmd_flags & REQ_DISCARD)
 #define blk_bidi_rq(rq)		((rq)->next_rq != NULL)
@@ -681,7 +684,7 @@ static inline void blk_clear_queue_full(
  * it already be started by driver.
  */
 #define RQ_NOMERGE_FLAGS	\
-	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER)
+	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER | REQ_FLUSHBARRIER)
 #define rq_mergeable(rq)	\
 	(!((rq)->cmd_flags & RQ_NOMERGE_FLAGS) && \
 	 (blk_discard_rq(rq) || blk_fs_request((rq))))
@@ -1006,9 +1009,11 @@ static inline struct request *blk_map_qu
 enum{
 	BLKDEV_WAIT,		/* wait for completion */
 	BLKDEV_BARRIER,		/* issue request with barrier */
+	BLKDEV_FLUSHBARRIER,	/* issue request with flush barrier. no drains */
 };
 #define BLKDEV_IFL_WAIT		(1 << BLKDEV_WAIT)
 #define BLKDEV_IFL_BARRIER	(1 << BLKDEV_BARRIER)
+#define BLKDEV_IFL_FLUSHBARRIER	(1 << BLKDEV_FLUSHBARRIER)
 extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *,
 	unsigned long);
 extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,

Index: linux-2.6/block/blk-barrier.c
===================================================================
--- linux-2.6.orig/block/blk-barrier.c	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/block/blk-barrier.c	2010-08-02 14:01:17.000000000 -0400
@@ -129,7 +129,7 @@ static void post_flush_end_io(struct req
 	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
 }

-static void queue_flush(struct request_queue *q, unsigned which)
+static void queue_flush(struct request_queue *q, unsigned which, bool ordered)
 {
 	struct request *rq;
 	rq_end_io_fn *end_io;
@@ -143,7 +143,17 @@ static void queue_flush(struct request_q
 	}

 	blk_rq_init(q, rq);
-	rq->cmd_flags = REQ_HARDBARRIER;
+
+	/*
+	 * Does this flush request have to be ordered? In case of
+	 * FLUSHBARRIERS we don't need PREFLUSH to be ordered. POSTFLUSH
+	 * needs to be ordered if the device does not support FUA.
+	 */
+	if (ordered)
+		rq->cmd_flags = REQ_HARDBARRIER;
+	else
+		rq->cmd_flags = REQ_FLUSHBARRIER;
+
 	rq->rq_disk = q->bar_rq.rq_disk;
 	rq->end_io = end_io;
 	q->prepare_flush_fn(q, rq);
@@ -192,7 +202,7 @@ static inline bool start_ordered(struct
 	 * request gets inbetween ordered sequence.
 	 */
 	if (q->ordered & QUEUE_ORDERED_DO_POSTFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH);
+		queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH, 1);
 		rq = &q->post_flush_rq;
 	} else
 		skip |= QUEUE_ORDSEQ_POSTFLUSH;
@@ -207,6 +217,17 @@ static inline bool start_ordered(struct
 	if (q->ordered & QUEUE_ORDERED_DO_FUA)
 		rq->cmd_flags |= REQ_FUA;
 	init_request_from_bio(rq, q->orig_bar_rq->bio);
+
+	/*
+	 * For flush barriers, we want these to be ordered w.r.t. the
+	 * preflush, hence mark them as HARDBARRIER here.
+	 *
+	 * Note: the init_request_from_bio() call above will mark it
+	 * as FLUSHBARRIER
+	 */
+	if (blk_flush_barrier_rq(q->orig_bar_rq))
+		rq->cmd_flags |= REQ_HARDBARRIER;
+
 	rq->end_io = bar_end_io;
 	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
@@ -214,12 +235,21 @@ static inline bool start_ordered(struct
 	skip |= QUEUE_ORDSEQ_BAR;

 	if (q->ordered & QUEUE_ORDERED_DO_PREFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH);
+		/*
+		 * For a flush-only barrier, we don't care to order the
+		 * preflush request w.r.t. other requests in the controller
+		 * queue.
+		 */
+		if (blk_flush_barrier_rq(q->orig_bar_rq))
+			queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH, 0);
+		else
+			queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH, 1);
+
 		rq = &q->pre_flush_rq;
 	} else
 		skip |= QUEUE_ORDSEQ_PREFLUSH;

-	if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q))
+	if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q)
+	    && !blk_flush_barrier_rq(q->orig_bar_rq))
 		rq = NULL;
 	else
 		skip |= QUEUE_ORDSEQ_DRAIN;
@@ -241,6 +271,29 @@ bool blk_do_ordered(struct request_queue
 	if (!q->ordseq) {
 		if (!is_barrier)
 			return true;
+		/*
+		 * For flush-only barriers, nothing has to be done if there
+		 * is no caching happening on the device. The barrier request
+		 * still has to be written to disk, but it can be written as
+		 * a normal rq.
+		 */
+		if (blk_flush_barrier_rq(rq)
+		    && (q->ordered == QUEUE_ORDERED_DRAIN
+			|| q->ordered == QUEUE_ORDERED_TAG)) {
+			if (!blk_rq_sectors(rq)) {
+				/*
+				 * Empty barrier. Device is write through.
+				 * Nothing has to be done. Return success.
+				 */
+				blk_dequeue_request(rq);
+				__blk_end_request_all(rq, 0);
+				*rqp = NULL;
+				return false;
+			} else
+				/* Process as normal rq. */
+				return true;
+		}

 		if (q->next_ordered != QUEUE_ORDERED_NONE)
 			return start_ordered(q, rqp);
@@ -311,6 +364,8 @@ int blkdev_issue_flush(struct block_devi
 	struct request_queue *q;
 	struct bio *bio;
 	int ret = 0;
+	int type = flags & BLKDEV_IFL_FLUSHBARRIER ? WRITE_FLUSHBARRIER
+							: WRITE_BARRIER;

 	if (bdev->bd_disk == NULL)
 		return -ENXIO;
@@ -326,7 +381,7 @@ int blkdev_issue_flush(struct block_devi
 	bio->bi_private = &wait;

 	bio_get(bio);
-	submit_bio(WRITE_BARRIER, bio);
+	submit_bio(type, bio);
 	if (test_bit(BLKDEV_WAIT, &flags)) {
 		wait_for_completion(&wait);
 		/*

Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/block/elevator.c	2010-08-02 13:19:02.000000000 -0400
@@ -424,7 +424,8 @@ void elv_dispatch_sort(struct request_qu
 	q->nr_sorted--;

 	boundary = q->end_sector;
-	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
+	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED
+			| REQ_FLUSHBARRIER;
 	list_for_each_prev(entry, &q->queue_head) {
 		struct request *pos = list_entry_rq(entry);
@@ -628,7 +629,8 @@ void elv_insert(struct request_queue *q,
 	case ELEVATOR_INSERT_BACK:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
-		elv_drain_elevator(q);
+		if (!blk_flush_barrier_rq(rq))
+			elv_drain_elevator(q);
 		list_add_tail(&rq->queuelist, &q->queue_head);
 		/*
 		 * We kick the queue here for the following reasons.
@@ -712,7 +714,8 @@ void __elv_add_request(struct request_qu
 	if (q->ordcolor)
 		rq->cmd_flags |= REQ_ORDERED_COLOR;

-	if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) {
+	if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER |
+				REQ_FLUSHBARRIER)) {
 		/*
 		 * toggle ordered color
 		 */

Index: linux-2.6/include/linux/bio.h
===================================================================
--- linux-2.6.orig/include/linux/bio.h	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/include/linux/bio.h	2010-08-02 14:01:17.000000000 -0400
@@ -161,6 +161,10 @@ struct bio {
  *	Don't want driver retries for any fast fail whatever the reason.
  * bit 10 -- Tell the IO scheduler not to wait for more requests after this
 	one has been submitted, even if it is a SYNC request.
+ * bit 11 -- This is a flush-only barrier and does not perform drain
+ *	operations. A user using this should make sure all the requests one
+ *	is dependent on have completed, and then use this barrier to flush
+ *	the cache and also do a FUA write if it is a non-empty barrier.
  */
 enum bio_rw_flags {
 	BIO_RW,
@@ -175,6 +179,7 @@ enum bio_rw_flags {
 	BIO_RW_META,
 	BIO_RW_DISCARD,
 	BIO_RW_NOIDLE,
+	BIO_RW_FLUSHBARRIER,
 };

 /*
@@ -211,7 +216,7 @@ static inline bool bio_rw_flagged(struct
 #define bio_offset(bio)		bio_iovec((bio))->bv_offset
 #define bio_segments(bio)	((bio)->bi_vcnt - (bio)->bi_idx)
 #define bio_sectors(bio)	((bio)->bi_size >> 9)
-#define bio_empty_barrier(bio)	(bio_rw_flagged(bio, BIO_RW_BARRIER) && !bio_has_data(bio) && !bio_rw_flagged(bio, BIO_RW_DISCARD))
+#define bio_empty_barrier(bio)	((bio_rw_flagged(bio, BIO_RW_BARRIER) || bio_rw_flagged(bio, BIO_RW_FLUSHBARRIER)) && !bio_has_data(bio) && !bio_rw_flagged(bio, BIO_RW_DISCARD))

 static inline unsigned int bio_cur_bytes(struct bio *bio)
 {

Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/block/blk-core.c	2010-08-02 14:01:17.000000000 -0400
@@ -1153,6 +1153,8 @@ void init_request_from_bio(struct reques
 		req->cmd_flags |= REQ_DISCARD;
 	if (bio_rw_flagged(bio, BIO_RW_BARRIER))
 		req->cmd_flags |= REQ_HARDBARRIER;
+	if (bio_rw_flagged(bio, BIO_RW_FLUSHBARRIER))
+		req->cmd_flags |= REQ_FLUSHBARRIER;
 	if (bio_rw_flagged(bio, BIO_RW_SYNCIO))
 		req->cmd_flags |= REQ_RW_SYNC;
 	if (bio_rw_flagged(bio, BIO_RW_META))
@@ -1185,9 +1187,10 @@ static int __make_request(struct request
 	const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG);
 	const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
 	int rw_flags;
+	const bool is_barrier = (bio_rw_flagged(bio, BIO_RW_BARRIER)
+				|| bio_rw_flagged(bio, BIO_RW_FLUSHBARRIER));

-	if (bio_rw_flagged(bio, BIO_RW_BARRIER) &&
-	    (q->next_ordered == QUEUE_ORDERED_NONE)) {
+	if (is_barrier && (q->next_ordered == QUEUE_ORDERED_NONE)) {
 		bio_endio(bio, -EOPNOTSUPP);
 		return 0;
 	}
@@ -1200,7 +1203,7 @@ static int __make_request(struct request

 	spin_lock_irq(q->queue_lock);

-	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER)) || elv_queue_empty(q))
+	if (unlikely(is_barrier) || elv_queue_empty(q))
 		goto get_rq;

 	el_ret = elv_merge(q, &req, bio);

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/include/linux/fs.h	2010-08-02 13:19:02.000000000 -0400
@@ -160,6 +160,7 @@ struct inodes_stat_t {
 	(SWRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_NOIDLE))
 #define SWRITE_SYNC	(SWRITE_SYNC_PLUG | (1 << BIO_RW_UNPLUG))
 #define WRITE_BARRIER	(WRITE | (1 << BIO_RW_BARRIER))
+#define WRITE_FLUSHBARRIER	(WRITE | (1 << BIO_RW_FLUSHBARRIER))

 /*
  * These aren't really reads or writes, they pass down information about

Index: linux-2.6/Makefile
===================================================================
--- linux-2.6.orig/Makefile	2010-08-02 13:17:35.000000000 -0400
+++ linux-2.6/Makefile	2010-08-02 13:19:02.000000000 -0400
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 35
-EXTRAVERSION = -rc6
+EXTRAVERSION = -rc6-flush-barriers
 NAME = Sheep on Meth

 # *DOCUMENTATION*

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2010-08-02 14:01:08.000000000 -0400
+++ linux-2.6/fs/buffer.c	2010-08-02 14:01:17.000000000 -0400
@@ -3026,6 +3026,9 @@ int submit_bh(int rw, struct buffer_head
 	if (buffer_ordered(bh) && (rw & WRITE))
 		rw |= WRITE_BARRIER;

+	if (buffer_flush_ordered(bh) && (rw & WRITE))
+		rw |= WRITE_FLUSHBARRIER;
+
 	/*
 	 * Only clear out a write error when rewriting
 	 */

Index: linux-2.6/fs/ext3/fsync.c
===================================================================
--- linux-2.6.orig/fs/ext3/fsync.c	2010-08-02 14:01:08.000000000 -0400
+++ linux-2.6/fs/ext3/fsync.c	2010-08-02 14:01:17.000000000 -0400
@@ -91,6 +91,6 @@ int ext3_sync_file(struct file *file, in
 	 */
 	if (needs_barrier)
 		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL,
-				BLKDEV_IFL_WAIT);
+				BLKDEV_IFL_WAIT | BLKDEV_IFL_FLUSHBARRIER);
 	return ret;
 }

Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c	2010-08-02 14:01:08.000000000 -0400
+++ linux-2.6/fs/jbd/commit.c	2010-08-02 14:01:17.000000000 -0400
@@ -138,7 +138,7 @@ static int journal_write_commit_record(j
 	JBUFFER_TRACE(descriptor, "write commit block");
 	set_buffer_dirty(bh);
 	if (journal->j_flags & JFS_BARRIER) {
-		set_buffer_ordered(bh);
+		set_buffer_flush_ordered(bh);
 		barrier_done = 1;
 	}
 	ret = sync_dirty_buffer(bh);

Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h	2010-08-02 14:01:08.000000000 -0400
+++ linux-2.6/include/linux/buffer_head.h	2010-08-02 14:01:17.000000000 -0400
@@ -33,6 +33,8 @@ enum bh_state_bits {
 	BH_Boundary,	/* Block is followed by a discontiguity */
 	BH_Write_EIO,	/* I/O error on write */
 	BH_Ordered,	/* ordered write */
+	BH_Flush_Ordered,/* ordered write. Ordered w.r.t contents in write
+			    cache */
 	BH_Eopnotsupp,	/* operation not supported (barrier) */
 	BH_Unwritten,	/* Buffer is allocated on disk but not written */
 	BH_Quiet,	/* Buffer Error Prinks to be quiet */
@@ -126,6 +128,7 @@ BUFFER_FNS(Delay, delay)
 BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Ordered, ordered)
+BUFFER_FNS(Flush_Ordered, flush_ordered)
 BUFFER_FNS(Eopnotsupp, eopnotsupp)
 BUFFER_FNS(Unwritten, unwritten)

Index: linux-2.6/kernel/trace/blktrace.c
===================================================================
--- linux-2.6.orig/kernel/trace/blktrace.c	2010-08-02 14:01:08.000000000 -0400
+++ linux-2.6/kernel/trace/blktrace.c	2010-08-02 14:01:17.000000000 -0400
@@ -1764,7 +1764,7 @@ void blk_fill_rwbs(char *rwbs, u32 rw, i

 	if (rw & 1 << BIO_RW_AHEAD)
 		rwbs[i++] = 'A';
-	if (rw & 1 << BIO_RW_BARRIER)
+	if (rw & 1 << BIO_RW_BARRIER || rw & 1 << BIO_RW_FLUSHBARRIER)
 		rwbs[i++] = 'B';
 	if (rw & 1 << BIO_RW_SYNCIO)
 		rwbs[i++] = 'S';

^ permalink raw reply	[flat|nested] 155+ messages in thread
* Re: [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics) 2010-08-02 18:28 ` [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics) Vivek Goyal @ 2010-08-03 13:03 ` Christoph Hellwig 2010-08-04 15:29 ` Vivek Goyal 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-08-03 13:03 UTC (permalink / raw) To: Vivek Goyal Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke

On Mon, Aug 02, 2010 at 02:28:04PM -0400, Vivek Goyal wrote:
> Hi Christoph,
>
> Please find attached a new version of the patch where I am trying to
> implement flush-only barriers. Why do that? I was thinking that it would
> be nice to avoid elevator drains with WCE=1.
>
> Here I have a DRAIN queue and I seem to be issuing the post-flush only
> after the barrier has finished. I still need to find some device with a
> TAG queue to test.
>
> This is still a very crude patch and I need to do a lot of testing to see
> if things are working. For the time being I have just hooked up ext3 to
> use the flush barrier and verified that in the WCE=0 case we don't issue
> a barrier, and in the WCE=1 case we do issue a barrier with pre-flush and
> post-flush.
>
> I have not yet found a device with FUA and tagging support to verify
> that functionality.

There are no devices that use the tagging support. Only brd and virtio ever use the QUEUE_ORDERED_TAG type. For brd Nick chose it at random, and it really doesn't matter when we're dealing with a ramdisk. For virtio-blk it's only used by lguest, which only allows a single outstanding command anyway. In short, we can just remove it once we stop draining for the other modes.

> o On storage with a write cache, for an empty barrier only a pre-flush is
>   done. For a barrier request with some data, one of the following should
>   happen depending on queue capability.
>
>	Draining queue
>	--------------
>	preflush ==> barrier (FUA)
>	preflush ==> barrier ===> postflush
>
>	Ordered Queue
>	-------------
>	preflush-->barrier (FUA)
>	preflush --> barrier ---> postflush
>
>	===> Wait for previous request to finish
>	---> Issue an ordered request in a Tagged queue

With "ordered" you mean the unused _TAG mode?

> - Not sure how to avoid this drain. Trying to allow other non-barrier
>   requests to dispatch while we wait for the pre-flush/flush barrier to
>   finish will make the code more complicated.

That's pretty much where I got stuck, too. Thanks for doing this, but I'd be surprised if it really gives us all that much benefit for real-life workloads.

^ permalink raw reply	[flat|nested] 155+ messages in thread
* Re: [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics)
  2010-08-03 13:03 ` Christoph Hellwig
@ 2010-08-04 15:29 ` Vivek Goyal
  2010-08-04 16:21 ` Christoph Hellwig
  0 siblings, 1 reply; 155+ messages in thread
From: Vivek Goyal @ 2010-08-04 15:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Tejun Heo, Jan Kara, jaxboe, James.Bottomley,
      linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke

On Tue, Aug 03, 2010 at 03:03:47PM +0200, Christoph Hellwig wrote:
> On Mon, Aug 02, 2010 at 02:28:04PM -0400, Vivek Goyal wrote:
> > Hi Christoph,
> >
> > Please find attached a new version of the patch where I am trying to
> > implement flush-only barriers. Why do that? I was thinking that it
> > would be nice to avoid elevator drains with WCE=1.
> >
> > Here I have a DRAIN queue and I seem to be issuing the post-flush only
> > after the barrier has finished. I need to find some device with a TAG
> > queue to test as well.
> >
> > This is still a very crude patch and I need to do a lot of testing to
> > see if things are working. For the time being I have just hooked up
> > ext3 to use flush barriers and verified that with WCE=0 we don't issue
> > a barrier, and with WCE=1 we do issue a barrier with a pre-flush and
> > post-flush.
> >
> > I haven't yet found a device with FUA and tagging support to verify
> > that functionality.
>
> There are no devices that use the tagging support. Only brd and virtio
> ever use the QUEUE_ORDERED_TAG type. For brd Nick chose it at random,
> and it really doesn't matter when we're dealing with a ramdisk. For
> virtio-blk it's only used by lguest, which only allows a single
> outstanding command anyway.

What about qemu-kvm? Who imposes this single-request-in-queue limitation?
A quick look at the virtio-blk driver code did not suggest anything like
that.

> In short, we can just remove it once we stop draining for the other
> modes.
>
> > o On storage with a write cache, for an empty barrier, only a
> >   pre-flush is done.
> > For a barrier request with some data, one of the following should
> > happen depending on queue capability.
> >
> > Draining queue
> > --------------
> > preflush ==> barrier (FUA)
> > preflush ==> barrier ===> postflush
> >
> > Ordered queue
> > -------------
> > preflush --> barrier (FUA)
> > preflush --> barrier ---> postflush
> >
> > ===> wait for the previous request to finish
> > ---> issue an ordered request in a tagged queue
>
> with ordered you mean the unused _TAG mode?

Yes. If nobody is using it, then we can probably drop it, but some of the
mails in the thread suggested scsi controllers can support tagged/ordered
queues very well. If so, then the whole barrier problem is simplified a
lot without losing performance. That would suggest that instead of
dropping the TAG queue support we should move in the direction of
figuring out how to enable it for scsi devices.

> > - Not sure how to avoid this drain. Trying to allow other non-barrier
> >   requests to dispatch while we wait for the pre-flush/flush barrier
> >   to finish will make the code more complicated.
>
> That's pretty much where I got stuck, too. Thanks for doing this, but
> I'd be surprised if it really gives us all that much benefit for
> real-life workloads.

True. Without getting rid of draining completely, the performance
benefits might not be there.

Maybe file systems can take care of ordering completely. You already
modified blkdev_issue_flush() to convert it to just a flush request, and
it is no longer a barrier. So file systems can always issue the flush
first and then issue the dependent commit request with FUA.

That will bring us back to the question of FUA emulation. Can the queue
capability be exposed to file systems so that they issue a post-flush
after the commit block if the device does not support FUA?

Vivek
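The FUA-emulation idea raised above can be illustrated with a toy device model. This is a hedged userspace sketch, not kernel code: `struct dev`, `issue_flush()`, `issue_write()`, and `commit()` are invented names. The point is only the sequence: flush first so already-completed writes are stable, write the commit record with the FUA bit if the device honors it, and otherwise emulate FUA with a post-flush.

```c
/* Toy device model; all names here are illustrative, not kernel APIs. */
struct dev {
    int has_fua;     /* device honors the FUA bit on writes */
    int flushes;     /* cache flushes issued so far */
    int fua_writes;  /* writes sent with FUA set */
    int writes;      /* plain (cacheable) writes */
};

static void issue_flush(struct dev *d) { d->flushes++; }

static void issue_write(struct dev *d, int want_fua)
{
    if (want_fua && d->has_fua)
        d->fua_writes++;
    else
        d->writes++;
}

/*
 * Commit as sketched in the mail: pre-flush stabilizes prior completed
 * I/O, then the commit record is written durably.  Without FUA support
 * the post-flush provides the same guarantee at the cost of a second
 * cache flush.
 */
static void commit(struct dev *d)
{
    issue_flush(d);          /* pre-flush: prior writes hit the media */
    issue_write(d, 1);       /* commit record, FUA if available */
    if (!d->has_fua)
        issue_flush(d);      /* post-flush emulates FUA */
}
```

On a FUA-capable device the commit costs one flush and one FUA write; without FUA it costs two flushes and a plain write, which is the emulation the block layer would otherwise do once for everyone.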
* Re: [RFC PATCH] Flush only barriers (Was: Re: [RFC] relaxed barrier semantics)
  2010-08-04 15:29 ` Vivek Goyal
@ 2010-08-04 16:21 ` Christoph Hellwig
  0 siblings, 0 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-08-04 16:21 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Jan Kara, jaxboe,
      James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho,
      konishi.ryusuke

On Wed, Aug 04, 2010 at 11:29:16AM -0400, Vivek Goyal wrote:
> > There are no devices that use the tagging support. Only brd and virtio
> > ever use the QUEUE_ORDERED_TAG type. For brd Nick chose it at random,
> > and it really doesn't matter when we're dealing with a ramdisk. For
> > virtio-blk it's only used by lguest, which only allows a single
> > outstanding command anyway.
>
> What about qemu-kvm? Who imposes this single-request-in-queue
> limitation? A quick look at the virtio-blk driver code did not suggest
> anything like that.

qemu never used that mode, exactly because it's buggy. It has no way to
actually send a cache flush request (aka an empty barrier), and to
implement the ordering by tag properly in a Unix userspace program we
would just have to do, inside qemu/lguest, the drain we currently do in
the host kernel.

> > with ordered you mean the unused _TAG mode?
>
> Yes. If nobody is using it, then we can probably drop it, but some of
> the mails in the thread suggested scsi controllers can support
> tagged/ordered queues very well. If so, then the whole barrier problem
> is simplified a lot without losing performance. That would suggest that
> instead of dropping the TAG queue support we should move in the
> direction of figuring out how to enable it for scsi devices.

scsi controllers can in theory, but the scsi layer can't without major
work. I don't mind using ordering by tag, but I'd rather see an actually
working implementation instead of code that doesn't actually get used,
which almost by definition gets buggy sooner or later.
> That will bring us back to the question of FUA emulation. Can the queue
> capability be exposed to file systems so that they issue a post-flush
> after the commit block if the device does not support FUA?

Doing the pre- and post-flushes from the filesystem does mean that:

 a) we add a lot of complexity to every single filesystem instead of
    doing it once, and
 b) we get much higher latency, as we need to go through a lot more
    layers compared to the current implementation. E.g. for XFS, moving
    the log state machine means first waking up a per-CPU kernel thread.
* Re: [RFC] relaxed barrier semantics
  2010-07-29 1:44 ` Ted Ts'o
  2010-07-29 2:43 ` Vivek Goyal
  2010-07-29 2:43 ` Vivek Goyal
@ 2010-07-29 8:31 ` Christoph Hellwig
  2010-07-29 11:16 ` Jan Kara
  2010-07-29 13:00 ` extfs reliability Vladislav Bolkhovitin
  2010-07-29 19:44 ` [RFC] relaxed barrier semantics Ric Wheeler
  2010-07-29 19:44 ` Ric Wheeler
  4 siblings, 2 replies; 155+ messages in thread
From: Christoph Hellwig @ 2010-07-29 8:31 UTC (permalink / raw)
  To: Ted Ts'o, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara,
      jaxboe, James.Bottomley

On Wed, Jul 28, 2010 at 09:44:31PM -0400, Ted Ts'o wrote:
> Define "are safe" --- what interface are we planning on using for the
> non-draining barrier? At least for ext3, when we write the commit
> record using set_buffer_ordered(bh), it assumes that this will do a
> flush of all previous writes and that the commit will hit the disk
> before any subsequent writes are sent to the disk. So turning the
> write of a buffer head marked with set_buffer_ordered() into a FUA
> write would _not_ be safe for ext3.

Please be careful with your wording. Do you really mean "all previous
writes" or "all previous writes that were completed"?

My reading of the ext3/jbd code is that we explicitly wait on I/O
completion of dependent writes, and only require those to actually be
stable by issuing a flush. If that weren't the case, the default ext3
barriers-off behaviour would be dangerous not only on devices with
volatile write caches, but also on devices that do not have them, which
in addition to the reading of the code is not what we've seen in actual
power-fail testing, where ext3 does well as long as there is no volatile
write cache.

Anyway, the pre-flush semantics are what the relaxed barriers will
preserve. REQ_FUA is a separate interface, which we actually already
have inside the block layer; we'll just need to emulate it for devices
without the FUA bit and handle it in dm and md.
> For ext4, if we don't use journal checksums, then we have the same
> requirements as ext3, and the same method of requesting it. If we do
> use journal checksums, what ext4 needs is a way of assuring that no
> writes after the commit are reordered with respect to the disk platter
> before the commit record --- but any of the writes before that,
> including the commit, can be reordered, because we rely on the checksum
> in the commit record to know at replay time whether the last commit is
> valid or not. We do that right now by calling blkdev_issue_flush()
> with BLKDEV_IFL_WAIT after submitting the write of the commit block.

blkdev_issue_flush is just an empty barrier, and the current barriers
prevent any kind of reordering. I'd rather avoid adding a one-way
reordering prevention.

Given that we don't appear to actually need the full reordering
prevention even without the journal checksums, why do you have stricter
requirements when they are enabled?
* Re: [RFC] relaxed barrier semantics
  2010-07-29 8:31 ` [RFC] relaxed barrier semantics Christoph Hellwig
@ 2010-07-29 11:16 ` Jan Kara
  2010-07-29 13:00 ` extfs reliability Vladislav Bolkhovitin
  1 sibling, 0 replies; 155+ messages in thread
From: Jan Kara @ 2010-07-29 11:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
      James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho,
      konishi.ryusuke

On Thu 29-07-10 10:31:42, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 09:44:31PM -0400, Ted Ts'o wrote:
> > Define "are safe" --- what interface are we planning on using for the
> > non-draining barrier? At least for ext3, when we write the commit
> > record using set_buffer_ordered(bh), it assumes that this will do a
> > flush of all previous writes and that the commit will hit the disk
> > before any subsequent writes are sent to the disk. So turning the
> > write of a buffer head marked with set_buffer_ordered() into a FUA
> > write would _not_ be safe for ext3.
>
> Please be careful with your wording. Do you really mean "all previous
> writes" or "all previous writes that were completed"?
>
> My reading of the ext3/jbd code is that we explicitly wait on I/O
> completion of dependent writes, and only require those to actually be
> stable by issuing a flush. If that weren't the case, the default ext3
> barriers-off behaviour would be dangerous not only on devices with
> volatile write caches, but also on devices that do not have them, which
> in addition to the reading of the code is not what we've seen in actual
> power-fail testing, where ext3 does well as long as there is no
> volatile write cache.

Yes, ext3 waits for all the buffers it needs before writing the commit
block with the ordered flag to disk. So a preflush + FUA write of the
commit block is OK for ext3.
Note: We really rely on the commit block being on disk before the
transaction commit finishes, because at that moment we allow
reallocation of blocks freed by the committed transaction. And if they
are reallocated for data, they can get overwritten as soon as they are
reallocated, so we have to be sure they are perceived as free even after
journal replay.

> Anyway, the pre-flush semantics are what the relaxed barriers will
> preserve. REQ_FUA is a separate interface, which we actually already
> have inside the block layer; we'll just need to emulate it for devices
> without the FUA bit and handle it in dm and md.
>
> > For ext4, if we don't use journal checksums, then we have the same
> > requirements as ext3, and the same method of requesting it. If we do
> > use journal checksums, what ext4 needs is a way of assuring that no
> > writes after the commit are reordered with respect to the disk platter
> > before the commit record --- but any of the writes before that,
> > including the commit, can be reordered, because we rely on the checksum
> > in the commit record to know at replay time whether the last commit is
> > valid or not. We do that right now by calling blkdev_issue_flush()
> > with BLKDEV_IFL_WAIT after submitting the write of the commit block.
>
> blkdev_issue_flush is just an empty barrier, and the current barriers
> prevent any kind of reordering. I'd rather avoid adding a one-way
> reordering prevention.
>
> Given that we don't appear to actually need the full reordering
> prevention even without the journal checksums, why do you have stricter
> requirements when they are enabled?

Because Ted found out it actually improves performance - see the message
of commit 0e3d2a6313d03413d93327202a60256d1d726fdc. At that time we
thought it was because the latency of forcing the commit block to the
platter after flushing caches is still noticeable. But maybe it's
something else.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
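The ordering Jan describes can be shown as a simple event trace: jbd submits the dependent buffers (which may complete in any order), waits for all of them, and only then makes the commit record durable. The sketch below is a toy model, not jbd code; `journal_commit()`, `ev()`, and the event names are invented for illustration.

```c
#include <string.h>

/* Toy event log for the commit sequence; names are illustrative. */
static char log_buf[256];

static void ev(const char *s)
{
    strcat(log_buf, s);
    strcat(log_buf, ";");
}

static void submit_data(int n)
{
    /* Dependent journal buffers; the device may complete these in
     * any order, which is why no barrier is needed between them. */
    for (int i = 0; i < n; i++)
        ev("data");
}

static const char *journal_commit(int nblocks)
{
    log_buf[0] = '\0';
    submit_data(nblocks);
    ev("wait");          /* ext3/jbd waits for completion here */
    ev("preflush");      /* make the completed data stable */
    ev("commit(FUA)");   /* then the commit record itself, durably */
    return log_buf;
}
```

Nothing after the wait can overtake the data blocks, which is why the pre-flush plus FUA commit record is sufficient without the full queue drain.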
* extfs reliability
  2010-07-29 8:31 ` [RFC] relaxed barrier semantics Christoph Hellwig
  2010-07-29 11:16 ` Jan Kara
@ 2010-07-29 13:00 ` Vladislav Bolkhovitin
  2010-07-29 13:08 ` Christoph Hellwig
  ` (2 more replies)
  1 sibling, 3 replies; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-29 13:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe,
      James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho,
      konishi.ryusuke, linux-kernel, kernel-bugs

[-- Attachment #1: Type: text/plain, Size: 31306 bytes --]

Christoph Hellwig, on 07/29/2010 12:31 PM wrote:
> My reading of the ext3/jbd code is that we explicitly wait on I/O
> completion of dependent writes, and only require those to actually be
> stable by issuing a flush. If that weren't the case, the default ext3
> barriers-off behaviour would be dangerous not only on devices with
> volatile write caches, but also on devices that do not have them, which
> in addition to the reading of the code is not what we've seen in actual
> power-fail testing, where ext3 does well as long as there is no
> volatile write cache.

Basically that is so, but unfortunately not absolutely. I've just tried
two tests on ext4 over iSCSI:

# uname -a
Linux ini 2.6.32-22-386 #36-Ubuntu SMP Fri Jun 4 00:27:09 UTC 2010 i686 GNU/Linux
# e2fsck -f -y /dev/sdb
e2fsck 1.41.11 (14-Mar-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sdb: 49/640000 files (0.0% non-contiguous), 56496/1280000 blocks
root@ini:~# mount -t ext4 -o barrier=1 /dev/sdb /mnt
root@ini:~# cd /mnt/dbench-mod/
root@ini:/mnt/dbench-mod# ./dbench 50
50 clients started ...
<-- Pull cable <-- After sometime a lot of warnings like: (22002) open CLIENTS/CLIENT44/~DMTMP/COREL/CDRBARS.CFG failed for handle 4235 (Read-only file system) (22004) open CLIENTS/CLIENT44/~DMTMP/COREL/ARTISTIC.ACL failed for handle 4236 (Read-only file system) (22010) open CLIENTS/CLIENT44/~DMTMP/COREL/@@@CDRW.TMP failed for handle 4237 (Read-only file system) (22011) nb_close: handle 4237 was not open (22014) unlink CLIENTS/CLIENT44/~DMTMP/COREL/@@@CDRW.TMP failed (Read-only file system) (22018) open CLIENTS/CLIENT44/~DMTMP/COREL/CORELDRW.CDT failed for handle 4238 (Read-only file system) (22021) nb_close: handle 4218 was not open (22032) open CLIENTS/CLIENT44/~DMTMP/COREL/GRAPHIC1.CDR failed for handle 4239 (Read-only file system) (22050) open CLIENTS/CLIENT44/~DMTMP/COREL/@@@CDRW.TMP failed for handle 4240 (Read-only file system) (22051) nb_close: handle 4240 was not open (22054) unlink CLIENTS/CLIENT44/~DMTMP/COREL/@@@CDRW.TMP failed (Read-only file system) (22057) nb_close: handle 4228 was not open (22061) nb_close: handle 4182 was not open (22065) nb_close: handle 4234 was not open (22078) open CLIENTS/CLIENT44/~DMTMP/COREL/GRAPH1.CDR failed for handle 4242 (Read-only file system)^C^C^C^C^C^C root@ini:/mnt/dbench-mod# ^C root@ini:/mnt/dbench-mod# ^C root@ini:~# umount /mnt Segmentation fault Kernel log: Jul 29 19:55:35 ini kernel: [ 3044.722313] c2c28e40: 00023740 00023741 00023742 00023743 @7..A7..B7..C7.. Jul 29 19:55:35 ini kernel: [ 3044.722320] c2c28e50: 00023744 00023745 00023746 00023747 D7..E7..F7..G7.. Jul 29 19:55:35 ini kernel: [ 3044.722327] c2c28e60: 00023748 00023749 0002374a 0002374b H7..I7..J7..K7.. Jul 29 19:55:35 ini kernel: [ 3044.722334] c2c28e70: 0002372c 00000000 00000000 00000000 ,7.............. Jul 29 19:55:35 ini kernel: [ 3044.722341] c2c28e80: 00000000 00000000 00000000 00000002 ................ Jul 29 19:55:35 ini kernel: [ 3044.722346] c2c28e90: 00000000 00000000 00000000 00000000 ................ 
Jul 29 19:55:35 ini kernel: [ 3044.722354] c2c28ea0: c2c28ea0 c2c28ea0 c307f138 c307f138 ........8...8... Jul 29 19:55:35 ini kernel: [ 3044.722360] c2c28eb0: 0003f800 00000000 00000000 00000000 ................ Jul 29 19:55:35 ini kernel: [ 3044.722366] c2c28ec0: c2c28ec0 c2c28ec0 00000000 00000000 ................ Jul 29 19:55:35 ini kernel: [ 3044.722373] c2c28ed0: 00100100 00200200 c2c28ed8 c2c28ed8 ...... ......... Jul 29 19:55:35 ini kernel: [ 3044.722379] c2c28ee0: c2c28ee0 c2c28ee0 0000800b 00000000 ................ Jul 29 19:55:35 ini kernel: [ 3044.722384] c2c28ef0: 00000001 00000000 00000000 00000000 ................ Jul 29 19:55:35 ini kernel: [ 3044.722391] c2c28f00: 00000001 00000000 0003f800 00000000 ................ Jul 29 19:55:35 ini kernel: [ 3044.722398] c2c28f10: 00000002 4c51a3cc 00000000 4c51a3cc ......QL......QL Jul 29 19:55:35 ini kernel: [ 3044.722404] c2c28f20: 00000000 4c51a3cc 00000000 00000208 ......QL........ Jul 29 19:55:35 ini kernel: [ 3044.722410] c2c28f30: 00000000 0000000c 81800000 00000101 ................ Jul 29 19:55:35 ini kernel: [ 3044.722416] c2c28f40: 00000001 00000000 c2c28f48 c2c28f48 ........H...H... Jul 29 19:55:35 ini kernel: [ 3044.722422] c2c28f50: 00000000 00000000 00000000 c2c28f5c ............\... Jul 29 19:55:35 ini kernel: [ 3044.722428] c2c28f60: c2c28f5c c0593440 c05933c0 ca228a00 \...@4Y..3Y...". Jul 29 19:55:35 ini kernel: [ 3044.722434] c2c28f70: 00000000 c2c28f78 c2c28ec8 00000000 ....x........... Jul 29 19:55:35 ini kernel: [ 3044.722440] c2c28f80: 00000020 00000000 00000505 00000000 ............... Jul 29 19:55:35 ini kernel: [ 3044.722446] c2c28f90: 00000000 00010001 c2c28f98 c2c28f98 ................ Jul 29 19:55:35 ini kernel: [ 3044.722451] c2c28fa0: 00000000 00000000 00000000 00000000 ................ Jul 29 19:55:35 ini kernel: [ 3044.722457] c2c28fb0: c0593680 000200da cdcc104c 00000202 .6Y.....L....... 
Jul 29 19:55:35 ini kernel: [ 3044.722463] c2c28fc0: c2c28fc0 c2c28fc0 00000000 00000000 ................ Jul 29 19:55:35 ini kernel: [ 3044.722469] c2c28fd0: 00000000 c2c28fd4 c2c28fd4 00000000 ................ Jul 29 19:55:35 ini kernel: [ 3044.722475] c2c28fe0: 0623225b 00000000 00000000 c2c28fec ["#............. Jul 29 19:55:35 ini kernel: [ 3044.722481] c2c28ff0: c2c28fec 00000001 00000000 c2c28ffc ................ Jul 29 19:55:35 ini kernel: [ 3044.722487] c2c29000: c2c28ffc 00000000 00000040 00000000 ........@....... Jul 29 19:55:35 ini kernel: [ 3044.722493] c2c29010: 00000000 00000000 00000000 ffffffff ................ Jul 29 19:55:35 ini kernel: [ 3044.722499] c2c29020: ffffffff 00000000 00000000 00000000 ................ Jul 29 19:55:35 ini kernel: [ 3044.722505] c2c29030: c2c29030 c2c29030 c2c28ec8 00000000 0...0........... Jul 29 19:55:35 ini kernel: [ 3044.722510] c2c29040: 00000000 00000000 00000000 00000000 ................ Jul 29 19:55:35 ini kernel: [ 3044.722516] c2c29050: 00000000 4c51a3d8 00000000 c2c2905c ......QL....\... Jul 29 19:55:35 ini kernel: [ 3044.722522] c2c29060: c2c2905c 00000101 ffffffff 00000000 \............... Jul 29 19:55:35 ini kernel: [ 3044.722528] c2c29070: 00000000 00000000 00000000 00000101 ................ Jul 29 19:55:35 ini kernel: [ 3044.722534] c2c29080: 00000000 00000000 c2c29088 c2c29088 ................ Jul 29 19:55:35 ini kernel: [ 3044.722540] c2c29090: 00000000 00005be2 00005be2 .....[...[.. 
Jul 29 19:55:35 ini kernel: [ 3044.722546] Pid: 1299, comm: umount Not tainted 2.6.32-22-386 #36-Ubuntu Jul 29 19:55:35 ini kernel: [ 3044.722550] Call Trace: Jul 29 19:55:35 ini kernel: [ 3044.722567] [<c0291731>] ext4_destroy_inode+0x91/0xa0 Jul 29 19:55:35 ini kernel: [ 3044.722577] [<c020ecb4>] destroy_inode+0x24/0x40 Jul 29 19:55:35 ini kernel: [ 3044.722583] [<c020f11e>] dispose_list+0x8e/0x100 Jul 29 19:55:35 ini kernel: [ 3044.722588] [<c020f534>] invalidate_inodes+0xf4/0x120 Jul 29 19:55:35 ini kernel: [ 3044.722598] [<c023b310>] ? vfs_quota_off+0x0/0x20 Jul 29 19:55:35 ini kernel: [ 3044.722606] [<c01fc602>] generic_shutdown_super+0x42/0xe0 Jul 29 19:55:35 ini kernel: [ 3044.722612] [<c01fc6ca>] kill_block_super+0x2a/0x50 Jul 29 19:55:35 ini kernel: [ 3044.722618] [<c01fd4e4>] deactivate_super+0x64/0x90 Jul 29 19:55:35 ini kernel: [ 3044.722625] [<c021282f>] mntput_no_expire+0x8f/0xe0 Jul 29 19:55:35 ini kernel: [ 3044.722631] [<c0212e47>] sys_umount+0x47/0xa0 Jul 29 19:55:35 ini kernel: [ 3044.722636] [<c0212ebe>] sys_oldumount+0x1e/0x20 Jul 29 19:55:35 ini kernel: [ 3044.722643] [<c01033ec>] syscall_call+0x7/0xb Jul 29 19:55:35 ini kernel: [ 3044.731043] sd 6:0:0:0: [sdb] Unhandled error code Jul 29 19:55:35 ini kernel: [ 3044.731049] sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK Jul 29 19:55:35 ini kernel: [ 3044.731056] sd 6:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 00 00 00 00 01 00 Jul 29 19:55:35 ini kernel: [ 3044.743469] __ratelimit: 37 callbacks suppressed Jul 29 19:55:35 ini kernel: [ 3044.755695] lost page write due to I/O error on sdb Jul 29 19:55:36 ini kernel: [ 3044.823044] Modules linked in: crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi w83627hf hwmon_vid fbcon tileblit font bitblit softcursor ppdev adm1021 i2c_i801 vga16fb vgastate e7xxx_edac psmouse serio_raw parport_pc shpchp edac_core lp parport qla2xxx ohci1394 scsi_transport_fc 
r8169 sata_via ieee1394 mii scsi_tgt e1000 floppy Jul 29 19:55:36 ini kernel: [ 3044.823044] Jul 29 19:55:36 ini kernel: [ 3044.823044] Pid: 1299, comm: umount Not tainted (2.6.32-22-386 #36-Ubuntu) X5DPA Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP: 0060:[<c0293c2a>] EFLAGS: 00010206 CPU: 0 Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP is at ext4_put_super+0x2ea/0x350 Jul 29 19:55:36 ini kernel: [ 3044.823044] EAX: c2c28ea8 EBX: c307f000 ECX: ffffff52 EDX: c307f138 Jul 29 19:55:36 ini kernel: [ 3044.823044] ESI: ca228a00 EDI: c307f0fc EBP: cec6ff30 ESP: cec6fefc Jul 29 19:55:36 ini kernel: [ 3044.823044] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Jul 29 19:55:36 ini kernel: [ 3044.823044] c06bb054 ca228b64 0000800b c2c28ec8 00008180 00000001 00000000 c307f138 Jul 29 19:55:36 ini kernel: [ 3044.823044] <0> c307f138 c307f138 ca228a00 c0593c80 c023b310 cec6ff48 c01fc60d ca228ac0 Jul 29 19:55:36 ini kernel: [ 3044.823044] <0> cec6ff44 cf328400 00000003 cec6ff58 c01fc6ca ca228a00 c0759d80 cec6ff6c Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c023b310>] ? vfs_quota_off+0x0/0x20 Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01fc60d>] ? generic_shutdown_super+0x4d/0xe0 Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01fc6ca>] ? kill_block_super+0x2a/0x50 Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01fd4e4>] ? deactivate_super+0x64/0x90 Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c021282f>] ? mntput_no_expire+0x8f/0xe0 Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c0212e47>] ? sys_umount+0x47/0xa0 Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c0212ebe>] ? sys_oldumount+0x1e/0x20 Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01033ec>] ? 
syscall_call+0x7/0xb Jul 29 19:55:36 ini kernel: [ 3045.299442] ---[ end trace 426db011a0289db3 ]--- Jul 29 19:55:36 ini kernel: [ 3045.310429] ------------[ cut here ]------------ Jul 29 19:55:36 ini kernel: [ 3045.321086] WARNING: at /build/buildd/linux-2.6.32/kernel/exit.c:895 do_exit+0x2f9/0x300() Jul 29 19:55:36 ini kernel: [ 3045.342153] Hardware name: X5DPA Jul 29 19:55:36 ini kernel: [ 3045.352697] Modules linked in: crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi w83627hf hwmon_vid fbcon tileblit font bitblit softcursor ppdev adm1021 i2c_i801 vga16fb vgastate e7xxx_edac psmouse serio_raw parport_pc shpchp edac_core lp parport qla2xxx ohci1394 scsi_transport_fc r8169 sata_via ieee1394 mii scsi_tgt e1000 floppy Jul 29 19:55:36 ini kernel: [ 3045.422317] Pid: 1299, comm: umount Tainted: G D 2.6.32-22-386 #36-Ubuntu Jul 29 19:55:36 ini kernel: [ 3045.444158] Call Trace: Jul 29 19:55:36 ini kernel: [ 3045.454755] [<c01487a2>] warn_slowpath_common+0x72/0xa0 Jul 29 19:55:36 ini kernel: [ 3045.465152] [<c014ca49>] ? do_exit+0x2f9/0x300 Jul 29 19:55:36 ini kernel: [ 3045.475281] [<c014ca49>] ? do_exit+0x2f9/0x300 Jul 29 19:55:36 ini kernel: [ 3045.485296] [<c01487ea>] warn_slowpath_null+0x1a/0x20 Jul 29 19:55:36 ini kernel: [ 3045.495432] [<c014ca49>] do_exit+0x2f9/0x300 Jul 29 19:55:36 ini kernel: [ 3045.505640] [<c014856f>] ? print_oops_end_marker+0x2f/0x40 Jul 29 19:55:36 ini kernel: [ 3045.516012] [<c0579fc5>] oops_end+0x95/0xd0 Jul 29 19:55:36 ini kernel: [ 3045.526394] [<c01068a4>] die+0x54/0x80 Jul 29 19:55:36 ini kernel: [ 3045.536808] [<c0579716>] do_trap+0x96/0xc0 Jul 29 19:55:36 ini kernel: [ 3045.547268] [<c0104980>] ? do_invalid_op+0x0/0xa0 Jul 29 19:55:36 ini kernel: [ 3045.557756] [<c0104a0b>] do_invalid_op+0x8b/0xa0 Jul 29 19:55:36 ini kernel: [ 3045.568296] [<c0293c2a>] ? ext4_put_super+0x2ea/0x350 Jul 29 19:55:36 ini kernel: [ 3045.578561] [<c0149291>] ? 
vprintk+0x191/0x3f0 Jul 29 19:55:36 ini kernel: [ 3045.588708] [<c0579493>] error_code+0x73/0x80 Jul 29 19:55:36 ini kernel: [ 3045.598076] [<c0293c2a>] ? ext4_put_super+0x2ea/0x350 Jul 29 19:55:36 ini kernel: [ 3045.607381] [<c023b310>] ? vfs_quota_off+0x0/0x20 Jul 29 19:55:36 ini kernel: [ 3045.616499] [<c01fc60d>] generic_shutdown_super+0x4d/0xe0 Jul 29 19:55:36 ini kernel: [ 3045.625688] [<c01fc6ca>] kill_block_super+0x2a/0x50 Jul 29 19:55:36 ini kernel: [ 3045.634777] [<c01fd4e4>] deactivate_super+0x64/0x90 Jul 29 19:55:36 ini kernel: [ 3045.643744] [<c021282f>] mntput_no_expire+0x8f/0xe0 Jul 29 19:55:36 ini kernel: [ 3045.652782] [<c0212e47>] sys_umount+0x47/0xa0 Jul 29 19:55:36 ini kernel: [ 3045.661514] [<c0212ebe>] sys_oldumount+0x1e/0x20 Jul 29 19:55:36 ini kernel: [ 3045.670139] [<c01033ec>] syscall_call+0x7/0xb Jul 29 19:55:36 ini kernel: [ 3045.678566] ---[ end trace 426db011a0289db4 ]--- Another test. Everything is as before, only I did not pull the cable, but deleted the corresponding LUN on the target, so all the command starting from this moment failed. Then on umount system rebooted. Kernel log: Jul 29 20:20:42 ini kernel: [ 1320.251393] umount D 00478e55 0 1234 924 0x00000000 Jul 29 20:20:42 ini kernel: [ 1320.251403] ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330 Jul 29 20:20:42 ini kernel: [ 1320.251415] c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000 Jul 29 20:20:42 ini kernel: [ 1320.251425] 0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54 Jul 29 20:20:42 ini kernel: [ 1320.251436] Call Trace: Jul 29 20:20:42 ini kernel: [ 1320.251452] [<c057745a>] io_schedule+0x3a/0x60 Jul 29 20:20:42 ini kernel: [ 1320.251463] [<c01bd95d>] sync_page+0x3d/0x50 Jul 29 20:20:42 ini kernel: [ 1320.251470] [<c0577aa7>] __wait_on_bit_lock+0x47/0x90 Jul 29 20:20:42 ini kernel: [ 1320.251476] [<c01bd920>] ? 
sync_page+0x0/0x50 Jul 29 20:20:42 ini kernel: [ 1320.251483] [<c01bd8ee>] __lock_page+0x7e/0x90 Jul 29 20:20:42 ini kernel: [ 1320.251491] [<c01624d0>] ? wake_bit_function+0x0/0x50 Jul 29 20:20:42 ini kernel: [ 1320.251499] [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0 Jul 29 20:20:42 ini kernel: [ 1320.251510] [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0 Jul 29 20:20:42 ini kernel: [ 1320.251517] [<c01c724f>] truncate_inode_pages+0x1f/0x30 Jul 29 20:20:42 ini kernel: [ 1320.251523] [<c020f15c>] dispose_list+0xcc/0x100 Jul 29 20:20:42 ini kernel: [ 1320.251529] [<c020f534>] invalidate_inodes+0xf4/0x120 Jul 29 20:20:42 ini kernel: [ 1320.251538] [<c023b310>] ? vfs_quota_off+0x0/0x20 Jul 29 20:20:42 ini kernel: [ 1320.251546] [<c01fc602>] generic_shutdown_super+0x42/0xe0 Jul 29 20:20:42 ini kernel: [ 1320.251553] [<c01fc6ca>] kill_block_super+0x2a/0x50 Jul 29 20:20:42 ini kernel: [ 1320.251559] [<c01fd4e4>] deactivate_super+0x64/0x90 Jul 29 20:20:42 ini kernel: [ 1320.251566] [<c021282f>] mntput_no_expire+0x8f/0xe0 Jul 29 20:20:42 ini kernel: [ 1320.251573] [<c0212e47>] sys_umount+0x47/0xa0 Jul 29 20:20:42 ini kernel: [ 1320.251579] [<c0212ebe>] sys_oldumount+0x1e/0x20 Jul 29 20:20:42 ini kernel: [ 1320.251586] [<c01033ec>] syscall_call+0x7/0xb Jul 29 20:22:42 ini kernel: [ 1440.285910] umount D 00478e55 0 1234 924 0x00000004 Jul 29 20:22:42 ini kernel: [ 1440.285919] ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330 Jul 29 20:22:42 ini kernel: [ 1440.285931] c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000 Jul 29 20:22:42 ini kernel: [ 1440.285942] 0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54 Jul 29 20:22:42 ini kernel: [ 1440.285953] Call Trace: Jul 29 20:22:42 ini kernel: [ 1440.285969] [<c057745a>] io_schedule+0x3a/0x60 Jul 29 20:22:42 ini kernel: [ 1440.285980] [<c01bd95d>] sync_page+0x3d/0x50 Jul 29 20:22:42 ini kernel: [ 1440.285987] [<c0577aa7>] __wait_on_bit_lock+0x47/0x90 Jul 29 
20:22:42 ini kernel: [ 1440.285994] [<c01bd920>] ? sync_page+0x0/0x50 Jul 29 20:22:42 ini kernel: [ 1440.286001] [<c01bd8ee>] __lock_page+0x7e/0x90 Jul 29 20:22:42 ini kernel: [ 1440.286010] [<c01624d0>] ? wake_bit_function+0x0/0x50 Jul 29 20:22:42 ini kernel: [ 1440.286018] [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0 Jul 29 20:22:42 ini kernel: [ 1440.286028] [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0 Jul 29 20:22:42 ini kernel: [ 1440.286035] [<c01c724f>] truncate_inode_pages+0x1f/0x30 Jul 29 20:22:42 ini kernel: [ 1440.286041] [<c020f15c>] dispose_list+0xcc/0x100 Jul 29 20:22:42 ini kernel: [ 1440.286047] [<c020f534>] invalidate_inodes+0xf4/0x120 Jul 29 20:22:42 ini kernel: [ 1440.286056] [<c023b310>] ? vfs_quota_off+0x0/0x20 Jul 29 20:22:42 ini kernel: [ 1440.286064] [<c01fc602>] generic_shutdown_super+0x42/0xe0 Jul 29 20:22:42 ini kernel: [ 1440.286071] [<c01fc6ca>] kill_block_super+0x2a/0x50 Jul 29 20:22:42 ini kernel: [ 1440.286077] [<c01fd4e4>] deactivate_super+0x64/0x90 Jul 29 20:22:42 ini kernel: [ 1440.286084] [<c021282f>] mntput_no_expire+0x8f/0xe0 Jul 29 20:22:42 ini kernel: [ 1440.286091] [<c0212e47>] sys_umount+0x47/0xa0 Jul 29 20:22:42 ini kernel: [ 1440.286097] [<c0212ebe>] sys_oldumount+0x1e/0x20 Jul 29 20:22:42 ini kernel: [ 1440.286104] [<c01033ec>] syscall_call+0x7/0xb Jul 29 20:24:42 ini kernel: [ 1560.321709] umount D 00478e55 0 1234 924 0x00000004 Jul 29 20:24:42 ini kernel: [ 1560.321718] ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330 Jul 29 20:24:42 ini kernel: [ 1560.321730] c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000 Jul 29 20:24:42 ini kernel: [ 1560.321741] 0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54 Jul 29 20:24:42 ini kernel: [ 1560.321751] Call Trace: Jul 29 20:24:42 ini kernel: [ 1560.321767] [<c057745a>] io_schedule+0x3a/0x60 Jul 29 20:24:42 ini kernel: [ 1560.321777] [<c01bd95d>] sync_page+0x3d/0x50 Jul 29 20:24:42 ini kernel: [ 1560.321784] 
[<c0577aa7>] __wait_on_bit_lock+0x47/0x90
Jul 29 20:24:42 ini kernel: [ 1560.321791] [<c01bd920>] ? sync_page+0x0/0x50
Jul 29 20:24:42 ini kernel: [ 1560.321797] [<c01bd8ee>] __lock_page+0x7e/0x90
Jul 29 20:24:42 ini kernel: [ 1560.321805] [<c01624d0>] ? wake_bit_function+0x0/0x50
Jul 29 20:24:42 ini kernel: [ 1560.321814] [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0
Jul 29 20:24:42 ini kernel: [ 1560.321824] [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0
Jul 29 20:24:42 ini kernel: [ 1560.321831] [<c01c724f>] truncate_inode_pages+0x1f/0x30
Jul 29 20:24:42 ini kernel: [ 1560.321837] [<c020f15c>] dispose_list+0xcc/0x100
Jul 29 20:24:42 ini kernel: [ 1560.321845] [<c020f534>] invalidate_inodes+0xf4/0x120
Jul 29 20:24:42 ini kernel: [ 1560.321855] [<c023b310>] ? vfs_quota_off+0x0/0x20
Jul 29 20:24:42 ini kernel: [ 1560.321864] [<c01fc602>] generic_shutdown_super+0x42/0xe0
Jul 29 20:24:42 ini kernel: [ 1560.321870] [<c01fc6ca>] kill_block_super+0x2a/0x50
Jul 29 20:24:42 ini kernel: [ 1560.321877] [<c01fd4e4>] deactivate_super+0x64/0x90
Jul 29 20:24:42 ini kernel: [ 1560.321885] [<c021282f>] mntput_no_expire+0x8f/0xe0
Jul 29 20:24:42 ini kernel: [ 1560.321892] [<c0212e47>] sys_umount+0x47/0xa0
Jul 29 20:24:42 ini kernel: [ 1560.321898] [<c0212ebe>] sys_oldumount+0x1e/0x20
Jul 29 20:24:42 ini kernel: [ 1560.321905] [<c01033ec>] syscall_call+0x7/0xb
Jul 29 20:24:42 ini kernel: [ 1560.358795] sync D 0004beb0 0 1265 1255 0x00000004
Jul 29 20:24:42 ini kernel: [ 1560.358803] cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330
Jul 29 20:24:42 ini kernel: [ 1560.358815] c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200
Jul 29 20:24:42 ini kernel: [ 1560.358826] 00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff
Jul 29 20:24:42 ini kernel: [ 1560.358837] Call Trace:
Jul 29 20:24:42 ini kernel: [ 1560.358845] [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0
Jul 29 20:24:42 ini kernel: [ 1560.358852] [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30
Jul 29 20:24:42 ini kernel: [ 1560.358858] [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10
Jul 29 20:24:42 ini kernel: [ 1560.358863] [<c057850c>] ? down_read+0x1c/0x20
Jul 29 20:24:42 ini kernel: [ 1560.358870] [<c021cb6d>] sync_filesystems+0xbd/0x110
Jul 29 20:24:42 ini kernel: [ 1560.358876] [<c021cc16>] sys_sync+0x16/0x40
Jul 29 20:24:42 ini kernel: [ 1560.358881] [<c01033ec>] syscall_call+0x7/0xb

[... the same umount and sync hung-task traces repeat at 1680, 1800 and 1920 seconds ...]

Although in both cases the FS remained consistent:

root@ini:~# mount -t ext4 /dev/sdb /mnt
root@ini:~# umount /mnt
root@ini:~# e2fsck -f -y /dev/sdb
e2fsck 1.41.11 (14-Mar-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sdb: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sdb: 4194/640000 files (74.2% non-contiguous), 334774/1280000 blocks

You can find full kernel logs starting from iSCSI load in the attachments.

I already reported such issues some time ago, but my reports were not too much welcomed, so I gave up. Anyway, anybody can easily do my tests at any time. They don't need any special hardware, just 2 Linux boxes: one for iSCSI target and one for iSCSI initiator (the test box itself). But they are generic for other transports as well. You can see there's nothing iSCSI specific in the traces.

Vlad

[-- Attachment #2: m.bz2 --] [-- Type: application/x-bzip, Size: 24364 bytes --]
[-- Attachment #3: m1.bz2 --] [-- Type: application/x-bzip, Size: 45322 bytes --]

^ permalink raw reply [flat|nested] 155+ messages in thread
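Vlad's two-box test procedure can be sketched as a short script. Everything in it (device name, mount point, dbench invocation) is an assumption about his setup, and it defaults to a dry run that only prints the commands:

```shell
#!/bin/sh
# Hedged sketch of the failure-injection test described above.
# DEV, MNT and the dbench invocation are assumptions; run with
# DRY_RUN=0 (as root, against a disposable device) to execute for real.
: "${DRY_RUN:=1}"
run() { if [ "$DRY_RUN" = 0 ]; then "$@"; else echo "+ $*"; fi; }

DEV=/dev/sdb        # the iSCSI-attached disk on the initiator box
MNT=/mnt

run e2fsck -f -y "$DEV"                       # start from a known-good filesystem
run mount -t ext4 -o barrier=1 "$DEV" "$MNT"
run dbench -D "$MNT/dbench-mod" 50            # generate load
echo ">>> inject the failure here: pull the cable, or delete the LUN on the target <<<"
run umount "$MNT"                             # where the hangs/oopses above appeared
run e2fsck -f -y "$DEV"                       # check on-disk consistency afterwards
```

With DRY_RUN=0 the umount step is where the reported hangs would show up; the final e2fsck is what Vlad uses to judge on-disk consistency.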
* Re: extfs reliability 2010-07-29 13:00 ` extfs reliability Vladislav Bolkhovitin @ 2010-07-29 13:08 ` Christoph Hellwig 2010-07-29 14:12 ` Vladislav Bolkhovitin 2010-07-29 14:26 ` Jan Kara 2010-07-29 18:58 ` Ted Ts'o 2 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-29 13:08 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke, linux-kernel, kernel-bugs On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote: > You can find full kernel logs starting from iSCSI load in the attachments. > > I already reported such issues some time ago, but my reports were not too much welcomed, so I gave up. Anyway, anybody can easily do my tests at any time. They don't need any special hardware, just 2 Linux boxes: one for iSCSI target and one for iSCSI initiator (the test box itself). But they are generic for other transports as well. You can see there's nothing iSCSI specific in the traces. I was only talking about ext3. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: extfs reliability 2010-07-29 13:08 ` Christoph Hellwig @ 2010-07-29 14:12 ` Vladislav Bolkhovitin 2010-07-29 14:34 ` Jan Kara 0 siblings, 1 reply; 155+ messages in thread
From: Vladislav Bolkhovitin @ 2010-07-29 14:12 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke, linux-kernel, kernel-bugs

Christoph Hellwig, on 07/29/2010 05:08 PM wrote:
> On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote:
>> You can find full kernel logs starting from iSCSI load in the attachments.
>>
>> I already reported such issues some time ago, but my reports were not too much welcomed, so I gave up. Anyway, anybody can easily do my tests at any time. They don't need any special hardware, just 2 Linux boxes: one for iSCSI target and one for iSCSI initiator (the test box itself). But they are generic for other transports as well. You can see there's nothing iSCSI specific in the traces.
>
> I was only talking about ext3.

Yes, now ext3 is a lot more reliable. The only way I was able to confuse it was:

...
(2197) nb_write: handle 4272 was not open size=65475 ofs=0
(2199) nb_write: handle 4272 was not open size=65475 ofs=65534
(2201) nb_write: handle 4272 was not open size=65475 ofs=131068
(2203) nb_write: handle 4272 was not open size=65475 ofs=196602
(2205) nb_write: handle 4272 was not open size=65475 ofs=262136^C
^C
root@ini:/mnt/dbench-mod# ^C
root@ini:/mnt/dbench-mod# ^C
root@ini:/mnt/dbench-mod# cd
root@ini:~# umount /mnt

<- recover device

root@ini:~# mount -t ext3 -o barrier=1 /dev/sdb /mnt
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so

Kernel log: "Jul 29 22:05:32 ini kernel: [ 2905.423092] JBD: recovery failed"

root@ini:~# mount -t ext3 -o barrier=1 /dev/sdb /mnt
root@ini:~#

Kernel log:
Jul 29 22:05:54 ini kernel: [ 2927.832893] kjournald starting. Commit interval 5 seconds
Jul 29 22:05:54 ini kernel: [ 2927.833430] EXT3 FS on sdb, internal journal
Jul 29 22:05:54 ini kernel: [ 2927.833499] EXT3-fs: sdb: 1 orphan inode deleted
Jul 29 22:05:54 ini kernel: [ 2927.833503] EXT3-fs: recovery complete.
Jul 29 22:05:54 ini kernel: [ 2927.838122] EXT3-fs: mounted filesystem with ordered data mode.

But it still remained consistent:

root@ini:~# umount /mnt
root@ini:~# e2fsck -f -y /dev/sdb
e2fsck 1.41.11 (14-Mar-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sdb: 3504/320000 files (21.1% non-contiguous), 307034/1280000 blocks

Good progress since my original reports for kernels around 2.6.27!

Vlad
^ permalink raw reply [flat|nested] 155+ messages in thread
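For reference on the "JBD: recovery failed" / "EXT3-fs: recovery complete." sequence above: whether mount attempts journal replay at all is governed by the INCOMPAT_RECOVER ("needs_recovery") superblock flag, which stays set while the journal holds unreplayed transactions. A minimal sketch of inspecting that flag directly (byte offsets from the ext2/ext3 superblock layout; assumes a little-endian host):

```shell
# Succeeds iff the ext2/ext3 image or device in $1 still needs journal
# recovery.  The superblock starts at byte 1024; s_feature_incompat is a
# little-endian u32 at offset 96 within it, and the RECOVER bit is 0x4.
needs_recovery() {
    flags=$(od -An -t u4 -j $((1024 + 96)) -N 4 "$1" | tr -d ' ')
    [ $(( flags & 4 )) -ne 0 ]
}
```

The same information is visible as a "needs_recovery" entry in the features line of `dumpe2fs -h`.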
* Re: extfs reliability 2010-07-29 14:12 ` Vladislav Bolkhovitin @ 2010-07-29 14:34 ` Jan Kara 2010-07-29 18:20 ` Vladislav Bolkhovitin 2010-07-29 18:49 ` Vladislav Bolkhovitin 0 siblings, 2 replies; 155+ messages in thread From: Jan Kara @ 2010-07-29 14:34 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke, linux-kernel, kernel-bugs On Thu 29-07-10 18:12:29, Vladislav Bolkhovitin wrote: > > Christoph Hellwig, on 07/29/2010 05:08 PM wrote: > > On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote: > >> You can find full kernel logs starting from iSCSI load in the attachments. > >> > >> I already reported such issues some time ago, but my reports were not too much welcomed, so I gave up. Anyway, anybody can easily do my tests at any time. They don't need any special hardware, just 2 Linux boxes: one for iSCSI target and one for iSCSI initiator (the test box itself). But they are generic for other transports as well. You can see there's nothing iSCSI specific in the traces. > > > > I was only talking about ext3. > > Yes, now ext3 is a lot more reliable. The only how I was able to confuse it was: > > ... 
> (2197) nb_write: handle 4272 was not open size=65475 ofs=0
> (2199) nb_write: handle 4272 was not open size=65475 ofs=65534
> (2201) nb_write: handle 4272 was not open size=65475 ofs=131068
> (2203) nb_write: handle 4272 was not open size=65475 ofs=196602
> (2205) nb_write: handle 4272 was not open size=65475 ofs=262136^C
> ^C
> root@ini:/mnt/dbench-mod# ^C
> root@ini:/mnt/dbench-mod# ^C
> root@ini:/mnt/dbench-mod# cd
> root@ini:~# umount /mnt
>
> <- recover device
>
> root@ini:~# mount -t ext3 -o barrier=1 /dev/sdb /mnt
> mount: wrong fs type, bad option, bad superblock on /dev/sdb,
> missing codepage or helper program, or other error
> In some cases useful info is found in syslog - try
> dmesg | tail or so
>
> Kernel log: "Jul 29 22:05:32 ini kernel: [ 2905.423092] JBD: recovery failed"

Hmm, this is strange. Are there more messages around this one?

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: extfs reliability 2010-07-29 14:34 ` Jan Kara @ 2010-07-29 18:20 ` Vladislav Bolkhovitin 2010-07-29 18:49 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-29 18:20 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke, linux-kernel Jan Kara, on 07/29/2010 06:34 PM wrote: > On Thu 29-07-10 18:12:29, Vladislav Bolkhovitin wrote: >> >> Christoph Hellwig, on 07/29/2010 05:08 PM wrote: >>> On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote: >>>> You can find full kernel logs starting from iSCSI load in the attachments. >>>> >>>> I already reported such issues some time ago, but my reports were not too much welcomed, so I gave up. Anyway, anybody can easily do my tests at any time. They don't need any special hardware, just 2 Linux boxes: one for iSCSI target and one for iSCSI initiator (the test box itself). But they are generic for other transports as well. You can see there's nothing iSCSI specific in the traces. >>> >>> I was only talking about ext3. >> >> Yes, now ext3 is a lot more reliable. The only how I was able to confuse it was: >> >> ... 
>> (2197) nb_write: handle 4272 was not open size=65475 ofs=0
>> (2199) nb_write: handle 4272 was not open size=65475 ofs=65534
>> (2201) nb_write: handle 4272 was not open size=65475 ofs=131068
>> (2203) nb_write: handle 4272 was not open size=65475 ofs=196602
>> (2205) nb_write: handle 4272 was not open size=65475 ofs=262136^C
>> ^C
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:/mnt/dbench-mod# cd
>> root@ini:~# umount /mnt
>>
>> <- recover device
>>
>> root@ini:~# mount -t ext3 -o barrier=1 /dev/sdb /mnt
>> mount: wrong fs type, bad option, bad superblock on /dev/sdb,
>> missing codepage or helper program, or other error
>> In some cases useful info is found in syslog - try
>> dmesg | tail or so
>>
>> Kernel log: "Jul 29 22:05:32 ini kernel: [ 2905.423092] JBD: recovery failed"
> Hmm, this is strange. Are there more messages around this one?

Rather none:

Jul 29 22:02:05 ini kernel: [ 2698.488446] sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00 07 88 69 00 00 01 00
Jul 29 22:02:05 ini kernel: [ 2698.505470] sd 7:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 29 22:02:05 ini kernel: [ 2698.505480] sd 7:0:0:0: [sdb] Sense Key : Illegal Request [current]
Jul 29 22:02:05 ini kernel: [ 2698.505488] sd 7:0:0:0: [sdb] Add. Sense: Logical unit not supported
Jul 29 22:02:05 ini kernel: [ 2698.505497] sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00 07 88 69 00 00 01 00
Jul 29 22:02:05 ini kernel: [ 2698.555147] sd 7:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 29 22:02:05 ini kernel: [ 2698.555157] sd 7:0:0:0: [sdb] Sense Key : Illegal Request [current]
Jul 29 22:02:05 ini kernel: [ 2698.555165] sd 7:0:0:0: [sdb] Add. Sense: Logical unit not supported
Jul 29 22:02:05 ini kernel: [ 2698.555175] sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00 07 88 69 00 00 01 00
Jul 29 22:02:05 ini kernel: [ 2698.582241] sd 7:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 29 22:02:05 ini kernel: [ 2698.582251] sd 7:0:0:0: [sdb] Sense Key : Illegal Request [current]
Jul 29 22:02:05 ini kernel: [ 2698.582259] sd 7:0:0:0: [sdb] Add. Sense: Logical unit not supported
Jul 29 22:02:05 ini kernel: [ 2698.582268] sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00 07 88 69 00 00 01 00
Jul 29 22:02:05 ini kernel: [ 2698.614789] sd 7:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 29 22:02:05 ini kernel: [ 2698.614799] sd 7:0:0:0: [sdb] Sense Key : Illegal Request [current]
Jul 29 22:02:05 ini kernel: [ 2698.614807] sd 7:0:0:0: [sdb] Add. Sense: Logical unit not supported
Jul 29 22:02:05 ini kernel: [ 2698.614817] sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00 07 88 69 00 00 01 00
Jul 29 22:02:45 ini kernel: [ 2738.474386] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474529] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474536] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474570] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474583] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474603] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474615] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474621] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474633] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:02:45 ini kernel: [ 2738.474659] __journal_remove_journal_head: freeing b_committed_data
Jul 29 22:05:32 ini kernel: [ 2905.423092] JBD: recovery failed
Jul 29 22:05:54 ini kernel: [ 2927.832893] kjournald starting. Commit interval 5 seconds
Jul 29 22:05:54 ini kernel: [ 2927.833430] EXT3 FS on sdb, internal journal
Jul 29 22:05:54 ini kernel: [ 2927.833499] EXT3-fs: sdb: 1 orphan inode deleted
Jul 29 22:05:54 ini kernel: [ 2927.833503] EXT3-fs: recovery complete.
Jul 29 22:05:54 ini kernel: [ 2927.838122] EXT3-fs: mounted filesystem with ordered data mode.

^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: extfs reliability 2010-07-29 14:34 ` Jan Kara 2010-07-29 18:20 ` Vladislav Bolkhovitin @ 2010-07-29 18:49 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-29 18:49 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke, linux-kernel, kernel-bugs Jan Kara, on 07/29/2010 06:34 PM wrote: > On Thu 29-07-10 18:12:29, Vladislav Bolkhovitin wrote: >> >> Christoph Hellwig, on 07/29/2010 05:08 PM wrote: >>> On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote: >>>> You can find full kernel logs starting from iSCSI load in the attachments. >>>> >>>> I already reported such issues some time ago, but my reports were not too much welcomed, so I gave up. Anyway, anybody can easily do my tests at any time. They don't need any special hardware, just 2 Linux boxes: one for iSCSI target and one for iSCSI initiator (the test box itself). But they are generic for other transports as well. You can see there's nothing iSCSI specific in the traces. >>> >>> I was only talking about ext3. >> >> Yes, now ext3 is a lot more reliable. The only how I was able to confuse it was: >> >> ... 
>> (2197) nb_write: handle 4272 was not open size=65475 ofs=0
>> (2199) nb_write: handle 4272 was not open size=65475 ofs=65534
>> (2201) nb_write: handle 4272 was not open size=65475 ofs=131068
>> (2203) nb_write: handle 4272 was not open size=65475 ofs=196602
>> (2205) nb_write: handle 4272 was not open size=65475 ofs=262136^C
>> ^C
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:/mnt/dbench-mod# cd
>> root@ini:~# umount /mnt
>>
>> <- recover device
>>
>> root@ini:~# mount -t ext3 -o barrier=1 /dev/sdb /mnt
>> mount: wrong fs type, bad option, bad superblock on /dev/sdb,
>> missing codepage or helper program, or other error
>> In some cases useful info is found in syslog - try
>> dmesg | tail or so
>>
>> Kernel log: "Jul 29 22:05:32 ini kernel: [ 2905.423092] JBD: recovery failed"
> Hmm, this is strange. Are there more messages around this one?

I'd encourage you to reproduce a similar setup and perform various failure-injection tests. I promise you, you'll find a lot of strange and interesting things ;).

Software devices give unique opportunities for that.

Vlad
^ permalink raw reply [flat|nested] 155+ messages in thread
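One way to follow this suggestion without any iSCSI at all is device-mapper: back the filesystem with a linear dm device, then swap its table to the "error" target so that every subsequent bio fails, much like the deleted-LUN case above. A hedged sketch (device name and size are assumptions; dry run by default, printing the commands instead of executing them):

```shell
#!/bin/sh
# Sketch: simulate disappearing storage with device-mapper's "error" target.
# DEV and SIZE are assumptions; with DRY_RUN=0 this needs root and a
# disposable device.
: "${DRY_RUN:=1}"
run() { if [ "$DRY_RUN" = 0 ]; then "$@"; else echo "+ $*"; fi; }

DEV=/dev/sdb
SIZE=512000        # length in 512-byte sectors; for real use: blockdev --getsz $DEV

run dmsetup create victim --table "0 $SIZE linear $DEV 0"
run mount -t ext3 -o barrier=1 /dev/mapper/victim /mnt
# ... generate load here, e.g. with dbench ...
run dmsetup suspend victim
run dmsetup reload victim --table "0 $SIZE error"
run dmsetup resume victim          # from now on every bio fails with -EIO
```

After the resume, umount and the subsequent mount/e2fsck should behave much as in the pulled-cable logs, so the journal recovery paths can be exercised repeatedly without touching hardware.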
* Re: extfs reliability 2010-07-29 13:00 ` extfs reliability Vladislav Bolkhovitin 2010-07-29 13:08 ` Christoph Hellwig @ 2010-07-29 14:26 ` Jan Kara 2010-07-29 18:20 ` Vladislav Bolkhovitin 2010-07-29 18:58 ` Ted Ts'o 2 siblings, 1 reply; 155+ messages in thread From: Jan Kara @ 2010-07-29 14:26 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke, linux-kernel, kernel-bugs On Thu 29-07-10 17:00:10, Vladislav Bolkhovitin wrote: > Christoph Hellwig, on 07/29/2010 12:31 PM wrote: > > My reading of the ext3/jbd code we explicitly wait on I/O completion > > of dependent writes, and only require those to actually be stable > > by issueing a flush. If that wasn't the case the default ext3 > > barriers off behaviour would not only be dangerous on devices with > > volatile write caches, but also on devices that do not have them, > > which in addition to the reading of the code is not what we've seen > > in actual power fail testing, where ext3 does well as long as there > > is no volatile write cache. > > Basically, it is so, but, unfortunately, not absolutely. I've just tried > 2 tests on ext4 with iSCSI: > > # uname -a > Linux ini 2.6.32-22-386 #36-Ubuntu SMP Fri Jun 4 00:27:09 UTC 2010 i686 GNU/Linux > > # e2fsck -f -y /dev/sdb > e2fsck 1.41.11 (14-Mar-2010) > Pass 1: Checking inodes, blocks, and sizes > Pass 2: Checking directory structure > Pass 3: Checking directory connectivity > Pass 4: Checking reference counts > Pass 5: Checking group summary information > /dev/sdb: 49/640000 files (0.0% non-contiguous), 56496/1280000 blocks > root@ini:~# mount -t ext4 -o barrier=1 /dev/sdb /mnt > root@ini:~# cd /mnt/dbench-mod/ > root@ini:/mnt/dbench-mod# ./dbench 50 > 50 clients started > ... 
> <-- Pull cable > <-- After sometime a lot of warnings like: > (22002) open CLIENTS/CLIENT44/~DMTMP/COREL/CDRBARS.CFG failed for handle 4235 (Read-only file system) > (22004) open CLIENTS/CLIENT44/~DMTMP/COREL/ARTISTIC.ACL failed for handle 4236 (Read-only file system) ... These are OK. You pulled a cable and now you start getting EIO from the kernel. > root@ini:/mnt/dbench-mod# ^C > root@ini:/mnt/dbench-mod# ^C > root@ini:~# umount /mnt > Segmentation fault This isn't OK of course ;) > Kernel log: > > Jul 29 19:55:35 ini kernel: [ 3044.722313] c2c28e40: 00023740 00023741 00023742 00023743 @7..A7..B7..C7.. > Jul 29 19:55:35 ini kernel: [ 3044.722320] c2c28e50: 00023744 00023745 00023746 00023747 D7..E7..F7..G7.. > Jul 29 19:55:35 ini kernel: [ 3044.722327] c2c28e60: 00023748 00023749 0002374a 0002374b H7..I7..J7..K7.. > Jul 29 19:55:35 ini kernel: [ 3044.722334] c2c28e70: 0002372c 00000000 00000000 00000000 ,7.............. > Jul 29 19:55:35 ini kernel: [ 3044.722341] c2c28e80: 00000000 00000000 00000000 00000002 ................ ... Sadly these messages above seem to have overwritten beginning of the message below. Hmm, but maybe it's just a warning message about inode still being on orphan list because the next oops still shows untainted kernel. > Jul 29 19:55:35 ini kernel: [ 3044.722546] Pid: 1299, comm: umount Not tainted 2.6.32-22-386 #36-Ubuntu > Jul 29 19:55:35 ini kernel: [ 3044.722550] Call Trace: > Jul 29 19:55:35 ini kernel: [ 3044.722567] [<c0291731>] ext4_destroy_inode+0x91/0xa0 > Jul 29 19:55:35 ini kernel: [ 3044.722577] [<c020ecb4>] destroy_inode+0x24/0x40 > Jul 29 19:55:35 ini kernel: [ 3044.722583] [<c020f11e>] dispose_list+0x8e/0x100 > Jul 29 19:55:35 ini kernel: [ 3044.722588] [<c020f534>] invalidate_inodes+0xf4/0x120 > Jul 29 19:55:35 ini kernel: [ 3044.722598] [<c023b310>] ? 
vfs_quota_off+0x0/0x20 > Jul 29 19:55:35 ini kernel: [ 3044.722606] [<c01fc602>] generic_shutdown_super+0x42/0xe0 > Jul 29 19:55:35 ini kernel: [ 3044.722612] [<c01fc6ca>] kill_block_super+0x2a/0x50 > Jul 29 19:55:35 ini kernel: [ 3044.722618] [<c01fd4e4>] deactivate_super+0x64/0x90 > Jul 29 19:55:35 ini kernel: [ 3044.722625] [<c021282f>] mntput_no_expire+0x8f/0xe0 > Jul 29 19:55:35 ini kernel: [ 3044.722631] [<c0212e47>] sys_umount+0x47/0xa0 > Jul 29 19:55:35 ini kernel: [ 3044.722636] [<c0212ebe>] sys_oldumount+0x1e/0x20 > Jul 29 19:55:35 ini kernel: [ 3044.722643] [<c01033ec>] syscall_call+0x7/0xb > Jul 29 19:55:35 ini kernel: [ 3044.731043] sd 6:0:0:0: [sdb] Unhandled error code > Jul 29 19:55:35 ini kernel: [ 3044.731049] sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK > Jul 29 19:55:35 ini kernel: [ 3044.731056] sd 6:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 00 00 00 00 01 00 > Jul 29 19:55:35 ini kernel: [ 3044.743469] __ratelimit: 37 callbacks suppressed > Jul 29 19:55:35 ini kernel: [ 3044.755695] lost page write due to I/O error on sdb > Jul 29 19:55:36 ini kernel: [ 3044.823044] Modules linked in: crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi w83627hf hwmon_vid fbcon tileblit font bitblit softcursor ppdev adm1021 i2c_i801 vga16fb vgastate e7xxx_edac psmouse serio_raw parport_pc shpchp edac_core lp parport qla2xxx ohci1394 scsi_transport_fc r8169 sata_via ieee1394 mii scsi_tgt e1000 floppy So here probably starts the real oops. But sadly we are missing the beginning as well. Can you send me disassembly of your ext4_put_super? 
> Jul 29 19:55:36 ini kernel: [ 3044.823044] > Jul 29 19:55:36 ini kernel: [ 3044.823044] Pid: 1299, comm: umount Not tainted (2.6.32-22-386 #36-Ubuntu) X5DPA > Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP: 0060:[<c0293c2a>] EFLAGS: 00010206 CPU: 0 > Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP is at ext4_put_super+0x2ea/0x350 > Jul 29 19:55:36 ini kernel: [ 3044.823044] EAX: c2c28ea8 EBX: c307f000 ECX: ffffff52 EDX: c307f138 > Jul 29 19:55:36 ini kernel: [ 3044.823044] ESI: ca228a00 EDI: c307f0fc EBP: cec6ff30 ESP: cec6fefc > Jul 29 19:55:36 ini kernel: [ 3044.823044] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 > Jul 29 19:55:36 ini kernel: [ 3044.823044] c06bb054 ca228b64 0000800b c2c28ec8 00008180 00000001 00000000 c307f138 > Jul 29 19:55:36 ini kernel: [ 3044.823044] <0> c307f138 c307f138 ca228a00 c0593c80 c023b310 cec6ff48 c01fc60d ca228ac0 > Jul 29 19:55:36 ini kernel: [ 3044.823044] <0> cec6ff44 cf328400 00000003 cec6ff58 c01fc6ca ca228a00 c0759d80 cec6ff6c > Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c023b310>] ? vfs_quota_off+0x0/0x20 > Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01fc60d>] ? generic_shutdown_super+0x4d/0xe0 > Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01fc6ca>] ? kill_block_super+0x2a/0x50 > Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01fd4e4>] ? deactivate_super+0x64/0x90 > Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c021282f>] ? mntput_no_expire+0x8f/0xe0 > Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c0212e47>] ? sys_umount+0x47/0xa0 > Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c0212ebe>] ? sys_oldumount+0x1e/0x20 > Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01033ec>] ? syscall_call+0x7/0xb > Jul 29 19:55:36 ini kernel: [ 3045.299442] ---[ end trace 426db011a0289db3 ]--- ... > Another test. Everything is as before, only I did not pull the cable, but > deleted the corresponding LUN on the target, so all the command starting > from this moment failed. Then on umount system rebooted. Kernel log: Nasty. 
But the log actually contains only traces of processes in D state (generally waiting for a page to be unlocked). Do you have any sort of watchdog which might have rebooted the machine? > Jul 29 20:20:42 ini kernel: [ 1320.251393] umount D 00478e55 0 1234 924 0x00000000 > Jul 29 20:20:42 ini kernel: [ 1320.251403] ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330 > Jul 29 20:20:42 ini kernel: [ 1320.251415] c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000 > Jul 29 20:20:42 ini kernel: [ 1320.251425] 0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54 > Jul 29 20:20:42 ini kernel: [ 1320.251436] Call Trace: > Jul 29 20:20:42 ini kernel: [ 1320.251452] [<c057745a>] io_schedule+0x3a/0x60 > Jul 29 20:20:42 ini kernel: [ 1320.251463] [<c01bd95d>] sync_page+0x3d/0x50 > Jul 29 20:20:42 ini kernel: [ 1320.251470] [<c0577aa7>] __wait_on_bit_lock+0x47/0x90 > Jul 29 20:20:42 ini kernel: [ 1320.251476] [<c01bd920>] ? sync_page+0x0/0x50 > Jul 29 20:20:42 ini kernel: [ 1320.251483] [<c01bd8ee>] __lock_page+0x7e/0x90 > Jul 29 20:20:42 ini kernel: [ 1320.251491] [<c01624d0>] ? wake_bit_function+0x0/0x50 > Jul 29 20:20:42 ini kernel: [ 1320.251499] [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0 > Jul 29 20:20:42 ini kernel: [ 1320.251510] [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0 > Jul 29 20:20:42 ini kernel: [ 1320.251517] [<c01c724f>] truncate_inode_pages+0x1f/0x30 > Jul 29 20:20:42 ini kernel: [ 1320.251523] [<c020f15c>] dispose_list+0xcc/0x100 > Jul 29 20:20:42 ini kernel: [ 1320.251529] [<c020f534>] invalidate_inodes+0xf4/0x120 > Jul 29 20:20:42 ini kernel: [ 1320.251538] [<c023b310>] ? 
vfs_quota_off+0x0/0x20 > Jul 29 20:20:42 ini kernel: [ 1320.251546] [<c01fc602>] generic_shutdown_super+0x42/0xe0 > Jul 29 20:20:42 ini kernel: [ 1320.251553] [<c01fc6ca>] kill_block_super+0x2a/0x50 > Jul 29 20:20:42 ini kernel: [ 1320.251559] [<c01fd4e4>] deactivate_super+0x64/0x90 > Jul 29 20:20:42 ini kernel: [ 1320.251566] [<c021282f>] mntput_no_expire+0x8f/0xe0 > Jul 29 20:20:42 ini kernel: [ 1320.251573] [<c0212e47>] sys_umount+0x47/0xa0 > Jul 29 20:20:42 ini kernel: [ 1320.251579] [<c0212ebe>] sys_oldumount+0x1e/0x20 > Jul 29 20:20:42 ini kernel: [ 1320.251586] [<c01033ec>] syscall_call+0x7/0xb > Jul 29 20:22:42 ini kernel: [ 1440.285910] umount D 00478e55 0 1234 924 0x00000004 > Jul 29 20:22:42 ini kernel: [ 1440.285919] ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330 > Jul 29 20:22:42 ini kernel: [ 1440.285931] c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000 > Jul 29 20:22:42 ini kernel: [ 1440.285942] 0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54 > Jul 29 20:22:42 ini kernel: [ 1440.285953] Call Trace: > Jul 29 20:22:42 ini kernel: [ 1440.285969] [<c057745a>] io_schedule+0x3a/0x60 > Jul 29 20:22:42 ini kernel: [ 1440.285980] [<c01bd95d>] sync_page+0x3d/0x50 > Jul 29 20:22:42 ini kernel: [ 1440.285987] [<c0577aa7>] __wait_on_bit_lock+0x47/0x90 > Jul 29 20:22:42 ini kernel: [ 1440.285994] [<c01bd920>] ? sync_page+0x0/0x50 > Jul 29 20:22:42 ini kernel: [ 1440.286001] [<c01bd8ee>] __lock_page+0x7e/0x90 > Jul 29 20:22:42 ini kernel: [ 1440.286010] [<c01624d0>] ? wake_bit_function+0x0/0x50 > Jul 29 20:22:42 ini kernel: [ 1440.286018] [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0 > Jul 29 20:22:42 ini kernel: [ 1440.286028] [<c02916c6>] ? 
ext4_destroy_inode+0x26/0xa0 > Jul 29 20:22:42 ini kernel: [ 1440.286035] [<c01c724f>] truncate_inode_pages+0x1f/0x30 > Jul 29 20:22:42 ini kernel: [ 1440.286041] [<c020f15c>] dispose_list+0xcc/0x100 > Jul 29 20:22:42 ini kernel: [ 1440.286047] [<c020f534>] invalidate_inodes+0xf4/0x120 > Jul 29 20:22:42 ini kernel: [ 1440.286056] [<c023b310>] ? vfs_quota_off+0x0/0x20 > Jul 29 20:22:42 ini kernel: [ 1440.286064] [<c01fc602>] generic_shutdown_super+0x42/0xe0 > Jul 29 20:22:42 ini kernel: [ 1440.286071] [<c01fc6ca>] kill_block_super+0x2a/0x50 > Jul 29 20:22:42 ini kernel: [ 1440.286077] [<c01fd4e4>] deactivate_super+0x64/0x90 > Jul 29 20:22:42 ini kernel: [ 1440.286084] [<c021282f>] mntput_no_expire+0x8f/0xe0 > Jul 29 20:22:42 ini kernel: [ 1440.286091] [<c0212e47>] sys_umount+0x47/0xa0 > Jul 29 20:22:42 ini kernel: [ 1440.286097] [<c0212ebe>] sys_oldumount+0x1e/0x20 > Jul 29 20:22:42 ini kernel: [ 1440.286104] [<c01033ec>] syscall_call+0x7/0xb > Jul 29 20:24:42 ini kernel: [ 1560.321709] umount D 00478e55 0 1234 924 0x00000004 > Jul 29 20:24:42 ini kernel: [ 1560.321718] ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330 > Jul 29 20:24:42 ini kernel: [ 1560.321730] c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000 > Jul 29 20:24:42 ini kernel: [ 1560.321741] 0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54 > Jul 29 20:24:42 ini kernel: [ 1560.321751] Call Trace: > Jul 29 20:24:42 ini kernel: [ 1560.321767] [<c057745a>] io_schedule+0x3a/0x60 > Jul 29 20:24:42 ini kernel: [ 1560.321777] [<c01bd95d>] sync_page+0x3d/0x50 > Jul 29 20:24:42 ini kernel: [ 1560.321784] [<c0577aa7>] __wait_on_bit_lock+0x47/0x90 > Jul 29 20:24:42 ini kernel: [ 1560.321791] [<c01bd920>] ? sync_page+0x0/0x50 > Jul 29 20:24:42 ini kernel: [ 1560.321797] [<c01bd8ee>] __lock_page+0x7e/0x90 > Jul 29 20:24:42 ini kernel: [ 1560.321805] [<c01624d0>] ? 
wake_bit_function+0x0/0x50 > Jul 29 20:24:42 ini kernel: [ 1560.321814] [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0 > Jul 29 20:24:42 ini kernel: [ 1560.321824] [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0 > Jul 29 20:24:42 ini kernel: [ 1560.321831] [<c01c724f>] truncate_inode_pages+0x1f/0x30 > Jul 29 20:24:42 ini kernel: [ 1560.321837] [<c020f15c>] dispose_list+0xcc/0x100 > Jul 29 20:24:42 ini kernel: [ 1560.321845] [<c020f534>] invalidate_inodes+0xf4/0x120 > Jul 29 20:24:42 ini kernel: [ 1560.321855] [<c023b310>] ? vfs_quota_off+0x0/0x20 > Jul 29 20:24:42 ini kernel: [ 1560.321864] [<c01fc602>] generic_shutdown_super+0x42/0xe0 > Jul 29 20:24:42 ini kernel: [ 1560.321870] [<c01fc6ca>] kill_block_super+0x2a/0x50 > Jul 29 20:24:42 ini kernel: [ 1560.321877] [<c01fd4e4>] deactivate_super+0x64/0x90 > Jul 29 20:24:42 ini kernel: [ 1560.321885] [<c021282f>] mntput_no_expire+0x8f/0xe0 > Jul 29 20:24:42 ini kernel: [ 1560.321892] [<c0212e47>] sys_umount+0x47/0xa0 > Jul 29 20:24:42 ini kernel: [ 1560.321898] [<c0212ebe>] sys_oldumount+0x1e/0x20 > Jul 29 20:24:42 ini kernel: [ 1560.321905] [<c01033ec>] syscall_call+0x7/0xb > Jul 29 20:24:42 ini kernel: [ 1560.358795] sync D 0004beb0 0 1265 1255 0x00000004 > Jul 29 20:24:42 ini kernel: [ 1560.358803] cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330 > Jul 29 20:24:42 ini kernel: [ 1560.358815] c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200 > Jul 29 20:24:42 ini kernel: [ 1560.358826] 00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff > Jul 29 20:24:42 ini kernel: [ 1560.358837] Call Trace: > Jul 29 20:24:42 ini kernel: [ 1560.358845] [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0 > Jul 29 20:24:42 ini kernel: [ 1560.358852] [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30 > Jul 29 20:24:42 ini kernel: [ 1560.358858] [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10 > Jul 29 20:24:42 ini kernel: [ 1560.358863] [<c057850c>] ? 
down_read+0x1c/0x20 > Jul 29 20:24:42 ini kernel: [ 1560.358870] [<c021cb6d>] sync_filesystems+0xbd/0x110 > Jul 29 20:24:42 ini kernel: [ 1560.358876] [<c021cc16>] sys_sync+0x16/0x40 > Jul 29 20:24:42 ini kernel: [ 1560.358881] [<c01033ec>] syscall_call+0x7/0xb > Jul 29 20:26:42 ini kernel: [ 1680.392190] umount D 00478e55 0 1234 924 0x00000004 > Jul 29 20:26:42 ini kernel: [ 1680.392200] ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330 > Jul 29 20:26:42 ini kernel: [ 1680.392212] c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000 > Jul 29 20:26:42 ini kernel: [ 1680.392223] 0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54 > Jul 29 20:26:42 ini kernel: [ 1680.392233] Call Trace: > Jul 29 20:26:42 ini kernel: [ 1680.392250] [<c057745a>] io_schedule+0x3a/0x60 > Jul 29 20:26:42 ini kernel: [ 1680.392260] [<c01bd95d>] sync_page+0x3d/0x50 > Jul 29 20:26:42 ini kernel: [ 1680.392267] [<c0577aa7>] __wait_on_bit_lock+0x47/0x90 > Jul 29 20:26:42 ini kernel: [ 1680.392274] [<c01bd920>] ? sync_page+0x0/0x50 > Jul 29 20:26:42 ini kernel: [ 1680.392280] [<c01bd8ee>] __lock_page+0x7e/0x90 > Jul 29 20:26:42 ini kernel: [ 1680.392289] [<c01624d0>] ? wake_bit_function+0x0/0x50 > Jul 29 20:26:42 ini kernel: [ 1680.392298] [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0 > Jul 29 20:26:42 ini kernel: [ 1680.392308] [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0 > Jul 29 20:26:42 ini kernel: [ 1680.392314] [<c01c724f>] truncate_inode_pages+0x1f/0x30 > Jul 29 20:26:42 ini kernel: [ 1680.392321] [<c020f15c>] dispose_list+0xcc/0x100 > Jul 29 20:26:42 ini kernel: [ 1680.392327] [<c020f534>] invalidate_inodes+0xf4/0x120 > Jul 29 20:26:42 ini kernel: [ 1680.392336] [<c023b310>] ? 
vfs_quota_off+0x0/0x20 > Jul 29 20:26:42 ini kernel: [ 1680.392344] [<c01fc602>] generic_shutdown_super+0x42/0xe0 > Jul 29 20:26:42 ini kernel: [ 1680.392351] [<c01fc6ca>] kill_block_super+0x2a/0x50 > Jul 29 20:26:42 ini kernel: [ 1680.392357] [<c01fd4e4>] deactivate_super+0x64/0x90 > Jul 29 20:26:42 ini kernel: [ 1680.392364] [<c021282f>] mntput_no_expire+0x8f/0xe0 > Jul 29 20:26:42 ini kernel: [ 1680.392371] [<c0212e47>] sys_umount+0x47/0xa0 > Jul 29 20:26:42 ini kernel: [ 1680.392378] [<c0212ebe>] sys_oldumount+0x1e/0x20 > Jul 29 20:26:42 ini kernel: [ 1680.392384] [<c01033ec>] syscall_call+0x7/0xb > Jul 29 20:26:42 ini kernel: [ 1680.427874] sync D 0004beb0 0 1265 1255 0x00000004 > Jul 29 20:26:42 ini kernel: [ 1680.427883] cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330 > Jul 29 20:26:42 ini kernel: [ 1680.427894] c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200 > Jul 29 20:26:42 ini kernel: [ 1680.427904] 00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff > Jul 29 20:26:42 ini kernel: [ 1680.427915] Call Trace: > Jul 29 20:26:42 ini kernel: [ 1680.427922] [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0 > Jul 29 20:26:42 ini kernel: [ 1680.427929] [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30 > Jul 29 20:26:42 ini kernel: [ 1680.427935] [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10 > Jul 29 20:26:42 ini kernel: [ 1680.427940] [<c057850c>] ? 
down_read+0x1c/0x20 > Jul 29 20:26:42 ini kernel: [ 1680.427947] [<c021cb6d>] sync_filesystems+0xbd/0x110 > Jul 29 20:26:42 ini kernel: [ 1680.427953] [<c021cc16>] sys_sync+0x16/0x40 > Jul 29 20:26:42 ini kernel: [ 1680.427958] [<c01033ec>] syscall_call+0x7/0xb > Jul 29 20:28:42 ini kernel: [ 1800.458856] umount D 00478e55 0 1234 924 0x00000004 > Jul 29 20:28:42 ini kernel: [ 1800.458866] ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330 > Jul 29 20:28:42 ini kernel: [ 1800.458877] c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000 > Jul 29 20:28:42 ini kernel: [ 1800.458888] 0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54 > Jul 29 20:28:42 ini kernel: [ 1800.458899] Call Trace: > Jul 29 20:28:42 ini kernel: [ 1800.458915] [<c057745a>] io_schedule+0x3a/0x60 > Jul 29 20:28:42 ini kernel: [ 1800.458925] [<c01bd95d>] sync_page+0x3d/0x50 > Jul 29 20:28:42 ini kernel: [ 1800.458932] [<c0577aa7>] __wait_on_bit_lock+0x47/0x90 > Jul 29 20:28:42 ini kernel: [ 1800.458938] [<c01bd920>] ? sync_page+0x0/0x50 > Jul 29 20:28:42 ini kernel: [ 1800.458945] [<c01bd8ee>] __lock_page+0x7e/0x90 > Jul 29 20:28:42 ini kernel: [ 1800.458953] [<c01624d0>] ? wake_bit_function+0x0/0x50 > Jul 29 20:28:42 ini kernel: [ 1800.458961] [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0 > Jul 29 20:28:42 ini kernel: [ 1800.458971] [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0 > Jul 29 20:28:42 ini kernel: [ 1800.458978] [<c01c724f>] truncate_inode_pages+0x1f/0x30 > Jul 29 20:28:42 ini kernel: [ 1800.458984] [<c020f15c>] dispose_list+0xcc/0x100 > Jul 29 20:28:42 ini kernel: [ 1800.458991] [<c020f534>] invalidate_inodes+0xf4/0x120 > Jul 29 20:28:42 ini kernel: [ 1800.458999] [<c023b310>] ? 
vfs_quota_off+0x0/0x20 > Jul 29 20:28:42 ini kernel: [ 1800.459007] [<c01fc602>] generic_shutdown_super+0x42/0xe0 > Jul 29 20:28:42 ini kernel: [ 1800.459013] [<c01fc6ca>] kill_block_super+0x2a/0x50 > Jul 29 20:28:42 ini kernel: [ 1800.459020] [<c01fd4e4>] deactivate_super+0x64/0x90 > Jul 29 20:28:42 ini kernel: [ 1800.459027] [<c021282f>] mntput_no_expire+0x8f/0xe0 > Jul 29 20:28:42 ini kernel: [ 1800.459033] [<c0212e47>] sys_umount+0x47/0xa0 > Jul 29 20:28:42 ini kernel: [ 1800.459039] [<c0212ebe>] sys_oldumount+0x1e/0x20 > Jul 29 20:28:42 ini kernel: [ 1800.459046] [<c01033ec>] syscall_call+0x7/0xb > Jul 29 20:28:42 ini kernel: [ 1800.493768] sync D 0004beb0 0 1265 1255 0x00000004 > Jul 29 20:28:42 ini kernel: [ 1800.493777] cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330 > Jul 29 20:28:42 ini kernel: [ 1800.493788] c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200 > Jul 29 20:28:42 ini kernel: [ 1800.493798] 00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff > Jul 29 20:28:42 ini kernel: [ 1800.493809] Call Trace: > Jul 29 20:28:42 ini kernel: [ 1800.493816] [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0 > Jul 29 20:28:42 ini kernel: [ 1800.493823] [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30 > Jul 29 20:28:42 ini kernel: [ 1800.493828] [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10 > Jul 29 20:28:42 ini kernel: [ 1800.493834] [<c057850c>] ? 
down_read+0x1c/0x20 > Jul 29 20:28:42 ini kernel: [ 1800.493841] [<c021cb6d>] sync_filesystems+0xbd/0x110 > Jul 29 20:28:42 ini kernel: [ 1800.493847] [<c021cc16>] sys_sync+0x16/0x40 > Jul 29 20:28:42 ini kernel: [ 1800.493853] [<c01033ec>] syscall_call+0x7/0xb > Jul 29 20:30:42 ini kernel: [ 1920.526729] umount D 00478e55 0 1234 924 0x00000004 > Jul 29 20:30:42 ini kernel: [ 1920.526739] ce579e10 00000086 00000082 00478e55 00000000 c082b330 cef574dc c082b330 > Jul 29 20:30:42 ini kernel: [ 1920.526750] c082b330 c082b330 cef574dc 0355459b 0000010d c082b330 c082b330 ce652000 > Jul 29 20:30:42 ini kernel: [ 1920.526761] 0000010d cef57230 c1407330 cef57230 ce579e5c ce579e20 c057745a ce579e54 > Jul 29 20:30:42 ini kernel: [ 1920.526772] Call Trace: > Jul 29 20:30:42 ini kernel: [ 1920.526788] [<c057745a>] io_schedule+0x3a/0x60 > Jul 29 20:30:42 ini kernel: [ 1920.526798] [<c01bd95d>] sync_page+0x3d/0x50 > Jul 29 20:30:42 ini kernel: [ 1920.526805] [<c0577aa7>] __wait_on_bit_lock+0x47/0x90 > Jul 29 20:30:42 ini kernel: [ 1920.526813] [<c01bd920>] ? sync_page+0x0/0x50 > Jul 29 20:30:42 ini kernel: [ 1920.526819] [<c01bd8ee>] __lock_page+0x7e/0x90 > Jul 29 20:30:42 ini kernel: [ 1920.526827] [<c01624d0>] ? wake_bit_function+0x0/0x50 > Jul 29 20:30:42 ini kernel: [ 1920.526836] [<c01c7219>] truncate_inode_pages_range+0x2a9/0x2c0 > Jul 29 20:30:42 ini kernel: [ 1920.526845] [<c02916c6>] ? ext4_destroy_inode+0x26/0xa0 > Jul 29 20:30:42 ini kernel: [ 1920.526853] [<c01c724f>] truncate_inode_pages+0x1f/0x30 > Jul 29 20:30:42 ini kernel: [ 1920.526859] [<c020f15c>] dispose_list+0xcc/0x100 > Jul 29 20:30:42 ini kernel: [ 1920.526866] [<c020f534>] invalidate_inodes+0xf4/0x120 > Jul 29 20:30:42 ini kernel: [ 1920.526874] [<c023b310>] ? 
vfs_quota_off+0x0/0x20 > Jul 29 20:30:42 ini kernel: [ 1920.526882] [<c01fc602>] generic_shutdown_super+0x42/0xe0 > Jul 29 20:30:42 ini kernel: [ 1920.526889] [<c01fc6ca>] kill_block_super+0x2a/0x50 > Jul 29 20:30:42 ini kernel: [ 1920.526895] [<c01fd4e4>] deactivate_super+0x64/0x90 > Jul 29 20:30:42 ini kernel: [ 1920.526902] [<c021282f>] mntput_no_expire+0x8f/0xe0 > Jul 29 20:30:42 ini kernel: [ 1920.526908] [<c0212e47>] sys_umount+0x47/0xa0 > Jul 29 20:30:42 ini kernel: [ 1920.526915] [<c0212ebe>] sys_oldumount+0x1e/0x20 > Jul 29 20:30:42 ini kernel: [ 1920.526922] [<c01033ec>] syscall_call+0x7/0xb > Jul 29 20:30:42 ini kernel: [ 1920.563739] sync D 0004beb0 0 1265 1255 0x00000004 > Jul 29 20:30:42 ini kernel: [ 1920.563747] cea6ff2c 00000086 00000001 0004beb0 00000000 c082b330 cde3db7c c082b330 > Jul 29 20:30:42 ini kernel: [ 1920.563758] c082b330 c082b330 cde3db7c 15c42f3f 00000140 c082b330 c082b330 ce653200 > Jul 29 20:30:42 ini kernel: [ 1920.563768] 00000140 cde3d8d0 cea6ff60 cde3d8d0 cefef23c cea6ff58 c0578de5 fffeffff > Jul 29 20:30:42 ini kernel: [ 1920.563779] Call Trace: > Jul 29 20:30:42 ini kernel: [ 1920.563787] [<c0578de5>] rwsem_down_failed_common+0x75/0x1a0 > Jul 29 20:30:42 ini kernel: [ 1920.563793] [<c0578f5d>] rwsem_down_read_failed+0x1d/0x30 > Jul 29 20:30:42 ini kernel: [ 1920.563799] [<c0578fb7>] call_rwsem_down_read_failed+0x7/0x10 > Jul 29 20:30:42 ini kernel: [ 1920.563804] [<c057850c>] ? down_read+0x1c/0x20 > Jul 29 20:30:42 ini kernel: [ 1920.563812] [<c021cb6d>] sync_filesystems+0xbd/0x110 > Jul 29 20:30:42 ini kernel: [ 1920.563817] [<c021cc16>] sys_sync+0x16/0x40 > Jul 29 20:30:42 ini kernel: [ 1920.563823] [<c01033ec>] syscall_call+0x7/0xb > > Although in both cases the FS remained consistent: Yes, at least something positive in the end ;). 
> root@ini:~# mount -t ext4 /dev/sdb /mnt
> root@ini:~# umount /mnt
> root@ini:~# e2fsck -f -y /dev/sdb
> e2fsck 1.41.11 (14-Mar-2010)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
>
> /dev/sdb: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/sdb: 4194/640000 files (74.2% non-contiguous), 334774/1280000 blocks
>
> You can find full kernel logs starting from iSCSI load in the attachments.
>
> I already reported such issues some time ago, but my reports were not well received, so I gave up. Anyway, anybody can easily repeat my tests at any time. They don't need any special hardware, just 2 Linux boxes: one for the iSCSI target and one for the iSCSI initiator (the test box itself). The problems are generic to other transports as well; you can see there's nothing iSCSI-specific in the traces.

Thanks for running the test.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: extfs reliability 2010-07-29 14:26 ` Jan Kara @ 2010-07-29 18:20 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-29 18:20 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke, linux-kernel

Jan Kara, on 07/29/2010 06:26 PM wrote:
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:/mnt/dbench-mod# ^C
>> root@ini:~# umount /mnt
>> Segmentation fault
> This isn't OK of course ;)
>
>> Kernel log:
>>
>> Jul 29 19:55:35 ini kernel: [ 3044.722313] c2c28e40: 00023740 00023741 00023742 00023743 @7..A7..B7..C7..
>> Jul 29 19:55:35 ini kernel: [ 3044.722320] c2c28e50: 00023744 00023745 00023746 00023747 D7..E7..F7..G7..
>> Jul 29 19:55:35 ini kernel: [ 3044.722327] c2c28e60: 00023748 00023749 0002374a 0002374b H7..I7..J7..K7..
>> Jul 29 19:55:35 ini kernel: [ 3044.722334] c2c28e70: 0002372c 00000000 00000000 00000000 ,7..............
>> Jul 29 19:55:35 ini kernel: [ 3044.722341] c2c28e80: 00000000 00000000 00000000 00000002 ................
> ...
> Sadly these messages above seem to have overwritten beginning of the
> message below. Hmm, but maybe it's just a warning message about inode still
> being on orphan list because the next oops still shows untainted kernel.

You can find the previous messages in the files attached to the report. They are big (500K and 1M), so I compressed them before attaching.
>> Jul 29 19:55:35 ini kernel: [ 3044.722546] Pid: 1299, comm: umount Not tainted 2.6.32-22-386 #36-Ubuntu >> Jul 29 19:55:35 ini kernel: [ 3044.722550] Call Trace: >> Jul 29 19:55:35 ini kernel: [ 3044.722567] [<c0291731>] ext4_destroy_inode+0x91/0xa0 >> Jul 29 19:55:35 ini kernel: [ 3044.722577] [<c020ecb4>] destroy_inode+0x24/0x40 >> Jul 29 19:55:35 ini kernel: [ 3044.722583] [<c020f11e>] dispose_list+0x8e/0x100 >> Jul 29 19:55:35 ini kernel: [ 3044.722588] [<c020f534>] invalidate_inodes+0xf4/0x120 >> Jul 29 19:55:35 ini kernel: [ 3044.722598] [<c023b310>] ? vfs_quota_off+0x0/0x20 >> Jul 29 19:55:35 ini kernel: [ 3044.722606] [<c01fc602>] generic_shutdown_super+0x42/0xe0 >> Jul 29 19:55:35 ini kernel: [ 3044.722612] [<c01fc6ca>] kill_block_super+0x2a/0x50 >> Jul 29 19:55:35 ini kernel: [ 3044.722618] [<c01fd4e4>] deactivate_super+0x64/0x90 >> Jul 29 19:55:35 ini kernel: [ 3044.722625] [<c021282f>] mntput_no_expire+0x8f/0xe0 >> Jul 29 19:55:35 ini kernel: [ 3044.722631] [<c0212e47>] sys_umount+0x47/0xa0 >> Jul 29 19:55:35 ini kernel: [ 3044.722636] [<c0212ebe>] sys_oldumount+0x1e/0x20 >> Jul 29 19:55:35 ini kernel: [ 3044.722643] [<c01033ec>] syscall_call+0x7/0xb >> Jul 29 19:55:35 ini kernel: [ 3044.731043] sd 6:0:0:0: [sdb] Unhandled error code >> Jul 29 19:55:35 ini kernel: [ 3044.731049] sd 6:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK >> Jul 29 19:55:35 ini kernel: [ 3044.731056] sd 6:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 00 00 00 00 01 00 >> Jul 29 19:55:35 ini kernel: [ 3044.743469] __ratelimit: 37 callbacks suppressed >> Jul 29 19:55:35 ini kernel: [ 3044.755695] lost page write due to I/O error on sdb >> Jul 29 19:55:36 ini kernel: [ 3044.823044] Modules linked in: crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi w83627hf hwmon_vid fbcon tileblit font bitblit softcursor ppdev adm1021 i2c_i801 vga16fb vgastate e7xxx_edac psmouse serio_raw parport_pc shpchp 
edac_core lp parport qla2xxx ohci1394 scsi_transport_fc r8169 sata_via ieee1394 mii scsi_tgt e1000 floppy

> So here probably starts the real oops.

It isn't yet an oops, it's dump_stack() from ext4_destroy_inode() together with the hex dump:

static void ext4_destroy_inode(struct inode *inode)
{
	if (!list_empty(&(EXT4_I(inode)->i_orphan))) {
		ext4_msg(inode->i_sb, KERN_ERR,
			 "Inode %lu (%p): orphan list check failed!",
			 inode->i_ino, EXT4_I(inode));
		print_hex_dump(KERN_INFO, "", DUMP_PREFIX_ADDRESS, 16, 4,
			       EXT4_I(inode), sizeof(struct ext4_inode_info),
			       true);
		dump_stack();
	}
	kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
}

> But sadly we are missing the beginning as well.

It was also in the attached file.

> Can you send me disassembly of your ext4_put_super?

In System.map-2.6.32-22-386:

c0293940 t ext4_put_super
c0293c90 t ext4_quota_write

$ objdump -d --start-address=0xc0293940 vmlinux >ext4_put_super
^C
$ cat ext4_put_super

vmlinux:     file format elf32-i386

Disassembly of section .text:

c0293940 <.text+0x193940>: c0293940: 55 push %ebp c0293941: 89 e5 mov %esp,%ebp c0293943: 57 push %edi c0293944: 56 push %esi c0293945: 53 push %ebx c0293946: 83 ec 28 sub $0x28,%esp c0293949: e8 02 07 e7 ff call 0xc0104050 c029394e: 8b 98 84 01 00 00 mov 0x184(%eax),%ebx c0293954: 89 c6 mov %eax,%esi c0293956: 8b 83 2c 02 00 00 mov 0x22c(%ebx),%eax c029395c: 8b 7b 38 mov 0x38(%ebx),%edi c029395f: e8 ac b5 ec ff call 0xc015ef10 c0293964: 8b 83 2c 02 00 00 mov 0x22c(%ebx),%eax c029396a: e8 41 af ec ff call 0xc015e8b0 c029396f: 89 f0 mov %esi,%eax c0293971: e8 5a 86 f6 ff call 0xc01fbfd0 c0293976: e8 05 5a 2e 00 call 0xc0579380 c029397b: 80 7e 11 00 cmpb $0x0,0x11(%esi) c029397f: 0f 85 0b 02 00 00 jne 0xc0293b90 c0293985: 8b 83 34 01 00 00 mov 0x134(%ebx),%eax c029398b: 85 c0 test %eax,%eax c029398d: 74 17 je 0xc02939a6 c029398f: e8 ec 64 02 00 call 0xc02b9e80 c0293994: c7 83 34 01 00 00 00 movl $0x0,0x134(%ebx) c029399b: 00 00 00 c029399e: 85 c0 test %eax,%eax c02939a0: 0f 88 ff 01 00
00 js 0xc0293ba5 c02939a6: 89 f0 mov %esi,%eax c02939a8: e8 c3 22 01 00 call 0xc02a5c70 c02939ad: 89 f0 mov %esi,%eax c02939af: e8 0c dc 00 00 call 0xc02a15c0 c02939b4: 89 f0 mov %esi,%eax c02939b6: e8 c5 51 00 00 call 0xc0298b80 c02939bb: 89 f0 mov %esi,%eax c02939bd: e8 de 45 01 00 call 0xc02a7fa0 c02939c2: f6 46 30 01 testb $0x1,0x30(%esi) c02939c6: 0f 84 9c 01 00 00 je 0xc0293b68 c02939cc: 8b 93 f8 00 00 00 mov 0xf8(%ebx),%edx c02939d2: 85 d2 test %edx,%edx c02939d4: 74 11 je 0xc02939e7 c02939d6: 8b 15 c8 8a 8a c0 mov 0xc08a8ac8,%edx c02939dc: 8d 86 64 01 00 00 lea 0x164(%esi),%eax c02939e2: e8 59 fd fa ff call 0xc0243740 c02939e7: 8d bb fc 00 00 00 lea 0xfc(%ebx),%edi c02939ed: 89 f8 mov %edi,%eax c02939ef: e8 7c 5f 0a 00 call 0xc0339970 c02939f4: 8b 43 14 mov 0x14(%ebx),%eax c02939f7: 85 c0 test %eax,%eax c02939f9: 0f 84 c3 01 00 00 je 0xc0293bc2 c02939ff: 31 d2 xor %edx,%edx c0293a01: 8b 4b 3c mov 0x3c(%ebx),%ecx c0293a04: 31 c0 xor %eax,%eax c0293a06: 89 75 f0 mov %esi,-0x10(%ebp) c0293a09: 89 de mov %ebx,%esi c0293a0b: 89 d3 mov %edx,%ebx c0293a0d: 8d 76 00 lea 0x0(%esi),%esi c0293a10: 8b 04 81 mov (%ecx,%eax,4),%eax c0293a13: 85 c0 test %eax,%eax c0293a15: 74 08 je 0xc0293a1f c0293a17: e8 54 ab f8 ff call 0xc021e570 c0293a1c: 8b 4e 3c mov 0x3c(%esi),%ecx c0293a1f: 83 c3 01 add $0x1,%ebx c0293a22: 39 5e 14 cmp %ebx,0x14(%esi) c0293a25: 89 d8 mov %ebx,%eax c0293a27: 77 e7 ja 0xc0293a10 c0293a29: 89 f3 mov %esi,%ebx c0293a2b: 8b 75 f0 mov -0x10(%ebp),%esi c0293a2e: 89 c8 mov %ecx,%eax c0293a30: e8 fb bb f5 ff call 0xc01ef630 c0293a35: 8b 15 2c 53 8a c0 mov 0xc08a532c,%edx c0293a3b: 8b 83 28 02 00 00 mov 0x228(%ebx),%eax c0293a41: 81 c2 00 00 80 00 add $0x800000,%edx c0293a47: 39 d0 cmp %edx,%eax c0293a49: 72 20 jb 0xc0293a6b c0293a4b: 8b 15 c0 17 75 c0 mov 0xc07517c0,%edx c0293a51: 81 ea 00 20 60 00 sub $0x602000,%edx c0293a57: 81 e2 00 00 c0 ff and $0xffc00000,%edx c0293a5d: 81 ea 00 20 00 00 sub $0x2000,%edx c0293a63: 39 d0 cmp %edx,%eax c0293a65: 0f 82 ed 
00 00 00 jb 0xc0293b58 c0293a6b: e8 c0 bb f5 ff call 0xc01ef630 c0293a70: 8d 83 94 00 00 00 lea 0x94(%ebx),%eax c0293a76: e8 c5 42 0b 00 call 0xc0347d40 c0293a7b: 8d 83 ac 00 00 00 lea 0xac(%ebx),%eax c0293a81: e8 ba 42 0b 00 call 0xc0347d40 c0293a86: 8d 83 c4 00 00 00 lea 0xc4(%ebx),%eax c0293a8c: e8 af 42 0b 00 call 0xc0347d40 c0293a91: 8d 83 dc 00 00 00 lea 0xdc(%ebx),%eax c0293a97: e8 a4 42 0b 00 call 0xc0347d40 c0293a9c: 8b 43 34 mov 0x34(%ebx),%eax c0293a9f: 85 c0 test %eax,%eax c0293aa1: 74 05 je 0xc0293aa8 c0293aa3: e8 c8 aa f8 ff call 0xc021e570 c0293aa8: 8b 83 78 01 00 00 mov 0x178(%ebx),%eax c0293aae: e8 7d bb f5 ff call 0xc01ef630 c0293ab3: 8b 83 7c 01 00 00 mov 0x17c(%ebx),%eax c0293ab9: e8 72 bb f5 ff call 0xc01ef630 c0293abe: 8d 93 38 01 00 00 lea 0x138(%ebx),%edx c0293ac4: 3b 93 38 01 00 00 cmp 0x138(%ebx),%edx c0293aca: 0f 85 fa 00 00 00 jne 0xc0293bca c0293ad0: 8b 86 94 00 00 00 mov 0x94(%esi),%eax c0293ad6: e8 65 b4 f8 ff call 0xc021ef40 c0293adb: 8b 83 74 01 00 00 mov 0x174(%ebx),%eax c0293ae1: 85 c0 test %eax,%eax c0293ae3: 74 31 je 0xc0293b16 c0293ae5: 3b 86 94 00 00 00 cmp 0x94(%esi),%eax c0293aeb: 74 29 je 0xc0293b16 c0293aed: e8 4e 0f f9 ff call 0xc0224a40 c0293af2: 8b 83 74 01 00 00 mov 0x174(%ebx),%eax c0293af8: e8 43 b4 f8 ff call 0xc021ef40 c0293afd: 8b 83 74 01 00 00 mov 0x174(%ebx),%eax c0293b03: 85 c0 test %eax,%eax c0293b05: 74 0f je 0xc0293b16 c0293b07: e8 64 d7 ff ff call 0xc0291270 c0293b0c: c7 83 74 01 00 00 00 movl $0x0,0x174(%ebx) c0293b13: 00 00 00 c0293b16: c7 86 84 01 00 00 00 movl $0x0,0x184(%esi) c0293b1d: 00 00 00 c0293b20: e8 2b 58 2e 00 call 0xc0579350 c0293b25: 89 f0 mov %esi,%eax c0293b27: e8 c4 84 f6 ff call 0xc01fbff0 c0293b2c: 89 f8 mov %edi,%eax c0293b2e: e8 7d 5d 0a 00 call 0xc03398b0 c0293b33: 8d 83 20 01 00 00 lea 0x120(%ebx),%eax c0293b39: e8 d2 3b 2e 00 call 0xc0577710 c0293b3e: 8b 83 f4 00 00 00 mov 0xf4(%ebx),%eax c0293b44: e8 e7 ba f5 ff call 0xc01ef630 c0293b49: 89 d8 mov %ebx,%eax c0293b4b: e8 e0 ba f5 
ff call 0xc01ef630 c0293b50: 83 c4 28 add $0x28,%esp c0293b53: 5b pop %ebx c0293b54: 5e pop %esi c0293b55: 5f pop %edi c0293b56: 5d pop %ebp c0293b57: c3 ret c0293b58: e8 e3 db f4 ff call 0xc01e1740 c0293b5d: 8d 76 00 lea 0x0(%esi),%esi c0293b60: e9 0b ff ff ff jmp 0xc0293a70 c0293b65: 8d 76 00 lea 0x0(%esi),%esi c0293b68: 8b 86 84 01 00 00 mov 0x184(%esi),%eax c0293b6e: ba 01 00 00 00 mov $0x1,%edx c0293b73: 8b 40 38 mov 0x38(%eax),%eax c0293b76: 83 60 60 fb andl $0xfffffffb,0x60(%eax) c0293b7a: 0f b7 43 58 movzwl 0x58(%ebx),%eax c0293b7e: 66 89 47 3a mov %ax,0x3a(%edi) c0293b82: 89 f0 mov %esi,%eax c0293b84: e8 67 e5 ff ff call 0xc02920f0 c0293b89: e9 3e fe ff ff jmp 0xc02939cc c0293b8e: 66 90 xchg %ax,%ax c0293b90: ba 01 00 00 00 mov $0x1,%edx c0293b95: 89 f0 mov %esi,%eax c0293b97: e8 54 e5 ff ff call 0xc02920f0 c0293b9c: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi c0293ba0: e9 e0 fd ff ff jmp 0xc0293985 c0293ba5: 89 34 24 mov %esi,(%esp) c0293ba8: c7 44 24 08 bf 9c 6c movl $0xc06c9cbf,0x8(%esp) c0293baf: c0 c0293bb0: c7 44 24 04 64 3e 59 movl $0xc0593e64,0x4(%esp) c0293bb7: c0 c0293bb8: e8 d3 f1 ff ff call 0xc0292d90 c0293bbd: e9 e4 fd ff ff jmp 0xc02939a6 c0293bc2: 8b 4b 3c mov 0x3c(%ebx),%ecx c0293bc5: e9 64 fe ff ff jmp 0xc0293a2e c0293bca: 8b 43 38 mov 0x38(%ebx),%eax c0293bcd: 8b 80 e8 00 00 00 mov 0xe8(%eax),%eax c0293bd3: 89 55 e8 mov %edx,-0x18(%ebp) c0293bd6: c7 44 24 08 be a9 6c movl $0xc06ca9be,0x8(%esp) c0293bdd: c0 c0293bde: c7 44 24 04 b9 0c 6a movl $0xc06a0cb9,0x4(%esp) c0293be5: c0 c0293be6: 89 44 24 0c mov %eax,0xc(%esp) c0293bea: 89 34 24 mov %esi,(%esp) c0293bed: e8 be d8 ff ff call 0xc02914b0 c0293bf2: c7 04 24 f6 9c 6c c0 movl $0xc06c9cf6,(%esp) c0293bf9: e8 4d 2e 2e 00 call 0xc0576a4b c0293bfe: 8b 83 38 01 00 00 mov 0x138(%ebx),%eax c0293c04: 8b 55 e8 mov -0x18(%ebp),%edx c0293c07: 89 45 f0 mov %eax,-0x10(%ebp) c0293c0a: 89 55 ec mov %edx,-0x14(%ebp) c0293c0d: 8b 55 f0 mov -0x10(%ebp),%edx c0293c10: 8b 02 mov (%edx),%eax c0293c12: 8d 74 26 00 
lea 0x0(%esi,%eiz,1),%esi c0293c16: 39 55 ec cmp %edx,-0x14(%ebp) c0293c19: 75 13 jne 0xc0293c2e c0293c1b: 8b 55 ec mov -0x14(%ebp),%edx c0293c1e: 3b 93 38 01 00 00 cmp 0x138(%ebx),%edx c0293c24: 0f 84 a6 fe ff ff je 0xc0293ad0 c0293c2a: 0f 0b ud2a c0293c2c: eb fe jmp 0xc0293c2c c0293c2e: 8b 55 f0 mov -0x10(%ebp),%edx c0293c31: 8b 45 f0 mov -0x10(%ebp),%eax c0293c34: 83 c2 20 add $0x20,%edx c0293c37: 8b 4a c0 mov -0x40(%edx),%ecx c0293c3a: 83 e8 68 sub $0x68,%eax c0293c3d: 89 4c 24 18 mov %ecx,0x18(%esp) c0293c41: 8b 88 b0 00 00 00 mov 0xb0(%eax),%ecx c0293c47: 89 4c 24 14 mov %ecx,0x14(%esp) c0293c4b: 0f b7 88 fa 00 00 00 movzwl 0xfa(%eax),%ecx c0293c52: 89 54 24 0c mov %edx,0xc(%esp) c0293c56: 89 4c 24 10 mov %ecx,0x10(%esp) c0293c5a: 8b 90 a8 00 00 00 mov 0xa8(%eax),%edx c0293c60: 89 54 24 08 mov %edx,0x8(%esp) c0293c64: 8b 80 2c 01 00 00 mov 0x12c(%eax),%eax c0293c6a: c7 04 24 54 b0 6b c0 movl $0xc06bb054,(%esp) c0293c71: 05 64 01 00 00 add $0x164,%eax c0293c76: 89 44 24 04 mov %eax,0x4(%esp) c0293c7a: e8 cc 2d 2e 00 call 0xc0576a4b c0293c7f: 8b 55 f0 mov -0x10(%ebp),%edx c0293c82: 8b 12 mov (%edx),%edx c0293c84: 89 55 f0 mov %edx,-0x10(%ebp) c0293c87: eb 84 jmp 0xc0293c0d c0293c89: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi The rest snipped. 
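For reference, the "+0x2ea" reported in the oops below is simply the distance from the System.map start of ext4_put_super to the faulting EIP, which lands exactly on the ud2a (BUG) instruction at c0293c2a in the disassembly above. A small sketch of the arithmetic:

```shell
# The oops reports "EIP is at ext4_put_super+0x2ea"; with the symbol
# start from System.map this pinpoints the faulting instruction.
start=$((0xc0293940))   # c0293940 t ext4_put_super  (System.map)
eip=$((0xc0293c2a))     # EIP from the oops
printf 'ext4_put_super+0x%x\n' $((eip - start))
# prints: ext4_put_super+0x2ea
```

objdump can also be told to stop at the next symbol, so only this function is disassembled: `objdump -d --start-address=0xc0293940 --stop-address=0xc0293c90 vmlinux`.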
>> Jul 29 19:55:36 ini kernel: [ 3044.823044] >> Jul 29 19:55:36 ini kernel: [ 3044.823044] Pid: 1299, comm: umount Not tainted (2.6.32-22-386 #36-Ubuntu) X5DPA >> Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP: 0060:[<c0293c2a>] EFLAGS: 00010206 CPU: 0 >> Jul 29 19:55:36 ini kernel: [ 3044.823044] EIP is at ext4_put_super+0x2ea/0x350 >> Jul 29 19:55:36 ini kernel: [ 3044.823044] EAX: c2c28ea8 EBX: c307f000 ECX: ffffff52 EDX: c307f138 >> Jul 29 19:55:36 ini kernel: [ 3044.823044] ESI: ca228a00 EDI: c307f0fc EBP: cec6ff30 ESP: cec6fefc >> Jul 29 19:55:36 ini kernel: [ 3044.823044] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 >> Jul 29 19:55:36 ini kernel: [ 3044.823044] c06bb054 ca228b64 0000800b c2c28ec8 00008180 00000001 00000000 c307f138 >> Jul 29 19:55:36 ini kernel: [ 3044.823044]<0> c307f138 c307f138 ca228a00 c0593c80 c023b310 cec6ff48 c01fc60d ca228ac0 >> Jul 29 19:55:36 ini kernel: [ 3044.823044]<0> cec6ff44 cf328400 00000003 cec6ff58 c01fc6ca ca228a00 c0759d80 cec6ff6c >> Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c023b310>] ? vfs_quota_off+0x0/0x20 >> Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01fc60d>] ? generic_shutdown_super+0x4d/0xe0 >> Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01fc6ca>] ? kill_block_super+0x2a/0x50 >> Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01fd4e4>] ? deactivate_super+0x64/0x90 >> Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c021282f>] ? mntput_no_expire+0x8f/0xe0 >> Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c0212e47>] ? sys_umount+0x47/0xa0 >> Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c0212ebe>] ? sys_oldumount+0x1e/0x20 >> Jul 29 19:55:36 ini kernel: [ 3044.823044] [<c01033ec>] ? syscall_call+0x7/0xb >> Jul 29 19:55:36 ini kernel: [ 3045.299442] ---[ end trace 426db011a0289db3 ]--- > ... >> Another test. Everything is as before, only I did not pull the cable, but >> deleted the corresponding LUN on the target, so all the command starting >> from this moment failed. Then on umount system rebooted. 
Kernel log: > > Nasty. But the log actually contains only traces of processes in D state > (generally waiting for a page to be unlocked). Do you have any sort of > watchdog which might have rebooted the machine? I didn't configure it ;). This is an unmodified Ubuntu server 10.04, only with a non-PAE kernel. The reboot wasn't immediate; I even had time to try to check something over another ssh session. Again, you can find more logs attached to the original message. > Thanks for running the test. Thanks for looking at the results! Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: extfs reliability 2010-07-29 13:00 ` extfs reliability Vladislav Bolkhovitin 2010-07-29 13:08 ` Christoph Hellwig 2010-07-29 14:26 ` Jan Kara @ 2010-07-29 18:58 ` Ted Ts'o 2 siblings, 0 replies; 155+ messages in thread From: Ted Ts'o @ 2010-07-29 18:58 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke, linux-kernel, kernel-bugs On Thu, Jul 29, 2010 at 05:00:10PM +0400, Vladislav Bolkhovitin wrote: > Christoph Hellwig, on 07/29/2010 12:31 PM wrote: > > My reading of the ext3/jbd code is that we explicitly wait on I/O completion > > of dependent writes, and only require those to actually be stable > > by issuing a flush. If that wasn't the case the default ext3 > > barriers off behaviour would not only be dangerous on devices with > > volatile write caches, but also on devices that do not have them, > > which in addition to the reading of the code is not what we've seen > > in actual power fail testing, where ext3 does well as long as there > > is no volatile write cache. > > Basically, it is so, but, unfortunately, not absolutely. I've just tried 2 tests on ext4 with iSCSI: Well, this thread was talking about something else (which is how various file systems handle barriers), and not bugs about what happens when a disk disappears from a system due to attachment failure --- but that's fine, we can deal with that here. > Segmentation fault OK, I've looked at your kernel messages, and it looks like the problem comes from this:

	/* Debugging code just in case the in-memory inode orphan list
	 * isn't empty.  The on-disk one can be non-empty if we've
	 * detected an error and taken the fs readonly, but the
	 * in-memory list had better be clean by this point.
	 */
	if (!list_empty(&sbi->s_orphan))
		dump_orphan_list(sb, sbi);
	J_ASSERT(list_empty(&sbi->s_orphan));    <====

This is a "should never happen" situation, and we crash so we can figure out how we got there. For production kernels, it would arguably be better to print a message and a WARN_ON(1), and then not force a crash from a BUG_ON (which is what J_ASSERT is defined to use). Looking at your messages and the ext4_delete_inode() warning, I think I know what caused it. Can you try this patch (attached below) and see if it fixes things for you? > I already reported such issues some time ago, but my reports were > not too much welcomed, so I gave up. Anyway, anybody can easily do > my tests at any time. My apologies. I've gone through the linux-ext4 mailing list logs, and I can't find any mention of this problem from any username @vlnb.net. I'm not sure where you reported it, and I'm sorry we dropped your bug report. All I can say is that we do the best that we can, and our team is relatively small and short-handed. - Ted

From a190d0386e601d58db6d2a6cbf00dc1c17d02136 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Thu, 29 Jul 2010 14:54:48 -0400
Subject: [PATCH] patch explicitly-drop-inode-from-orphan-list-on-ext4_delete_inode-failure

---
 fs/ext4/inode.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a52d5af..533b607 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -221,6 +221,7 @@ void ext4_delete_inode(struct inode *inode)
 				 "couldn't extend journal (err %d)", err);
 		stop_handle:
 			ext4_journal_stop(handle);
+			ext4_orphan_del(NULL, inode);
 			goto no_delete;
 	}
 }
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 155+ messages in thread
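The failure mode Ted describes, and what the one-line patch changes, can be sketched as a toy model (plain Python for illustration; `FS`, `delete_inode` and `put_super` are hypothetical stand-ins for the ext4 code paths, not the kernel API):

```python
class FS:
    """Toy model of the ext4 orphan-list invariant at unmount."""

    def __init__(self):
        self.s_orphan = []          # in-memory orphan inode list

    def delete_inode(self, inode, journal_fails=False, patched=True):
        self.s_orphan.append(inode)          # like ext4_orphan_add()
        if journal_fails:
            # Error path: the journal handle is stopped and deletion is
            # abandoned.  Without the patch the inode is left on the
            # in-memory orphan list.
            if patched:
                self.s_orphan.remove(inode)  # the added ext4_orphan_del()
            return
        self.s_orphan.remove(inode)          # normal completion

    def put_super(self):
        # The J_ASSERT that fired in the reported oops at umount time.
        assert not self.s_orphan, "in-memory orphan list not empty"

# Unpatched: a failed deletion leaves the orphan list dirty and the
# unmount-time assertion fires.
fs = FS()
fs.delete_inode("inode-1", journal_fails=True, patched=False)
crashed = False
try:
    fs.put_super()
except AssertionError:
    crashed = True
assert crashed

# Patched: the error path drops the inode from the list, so unmount
# completes cleanly.
fs = FS()
fs.delete_inode("inode-1", journal_fails=True, patched=True)
fs.put_super()
```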
* Re: [RFC] relaxed barrier semantics 2010-07-29 1:44 ` Ted Ts'o ` (2 preceding siblings ...) 2010-07-29 8:31 ` [RFC] relaxed barrier semantics Christoph Hellwig @ 2010-07-29 19:44 ` Ric Wheeler 2010-07-29 19:49 ` Christoph Hellwig 2010-07-31 0:35 ` Jan Kara 2010-07-29 19:44 ` Ric Wheeler 4 siblings, 2 replies; 155+ messages in thread From: Ric Wheeler @ 2010-07-29 19:44 UTC (permalink / raw) To: Ted Ts'o, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley On 07/28/2010 09:44 PM, Ted Ts'o wrote: > On Wed, Jul 28, 2010 at 11:28:59AM +0200, Christoph Hellwig wrote: > >> If we move all filesystems to non-draining barriers with pre- and post- >> flushes that might actually be a relatively easy first step. We don't >> have the complications to deal with multiple types of barriers to >> start with, and it'll fix the issue for devices without volatile write >> caches completely. >> >> I just need some help from the filesystem folks to determine if they >> are safe with them. >> >> I know for sure that ext3 and xfs are from looking through them. And >> I know reiserfs is if we make sure it doesn't hit the code path that >> relies on it that is currently enabled by the barrier option. >> >> I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks. >> That already ends our small list of barrier supporting filesystems, and >> possibly ocfs2, too - although the barrier implementation there seems >> incomplete as it doesn't seem to flush caches in fsync. >> > Define "are safe" --- what interface are we planning on using for the > non-draining barrier? At least for ext3, when we write the commit > record using set_buffer_ordered(bh), it assumes that this will do a > flush of all previous writes and that the commit will hit the disk > before any subsequent writes are sent to the disk. So turning the > write of a buffer head marked with set_buffer_ordered() into a FUA > write would _not_ be safe for ext3.
I confess that I am a bit fuzzy on FUA, but think that it means that any FUA tagged IO will go down to persistent store before returning. If so, then all order dependent IO would need to be issued in order and tagged with FUA. It would not suffice to tag just the commit record as FUA, or do I misunderstand what FUA does? (Looking for a record in the how many times can I use FUA in an email). ric > For ext4, if we don't use journal checksums, then we have the same > requirements as ext3, and the same method of requesting it. If we do > use journal checksums, what ext4 needs is a way of assuring that no > writes after the commit are reordered with respect to the disk platter > before the commit record --- but any of the writes before that, > including the commit, can be reordered because we rely on the checksum > in the commit record to know at replay time whether the last commit is > valid or not. We do that right now by calling blkdev_issue_flush() > with BLKDEV_IFL_WAIT after submitting the write of the commit block. > > - Ted > > ^ permalink raw reply [flat|nested] 155+ messages in thread
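Ric's reading of FUA here is the crux: a FUA write persists only itself and says nothing about earlier cached writes. A toy model (Python for illustration; `VolatileCacheDisk` is a hypothetical stand-in for a disk with a volatile write cache, not any real interface) shows why tagging only the commit record with FUA is not enough without a preceding flush:

```python
class VolatileCacheDisk:
    """Toy model of a disk with a volatile write cache.

    write()          -> lands in the volatile cache only
    write(fua=True)  -> goes straight to media (FUA), prior cached
                        writes are NOT flushed
    flush()          -> SYNCHRONIZE_CACHE: everything cached hits media
    power_loss()     -> volatile cache contents are lost
    """

    def __init__(self):
        self.cache = {}
        self.media = {}

    def write(self, block, data, fua=False):
        if fua:
            self.media[block] = data   # FUA persists this write only
        else:
            self.cache[block] = data

    def flush(self):
        self.media.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        self.cache.clear()

# Tagging only the commit record FUA: the journal blocks it describes
# may still sit in the volatile cache at power loss -- a "valid" commit
# pointing at garbage.
d = VolatileCacheDisk()
d.write(1, "journal-data")
d.write(2, "commit", fua=True)
d.power_loss()
assert 2 in d.media and 1 not in d.media

# A pre-flush before the FUA commit restores the ordering guarantee.
d = VolatileCacheDisk()
d.write(1, "journal-data")
d.flush()
d.write(2, "commit", fua=True)
d.power_loss()
assert 1 in d.media and 2 in d.media
```

This matches Ric's conclusion: either every order-dependent write is itself FUA, or the commit record needs a flush in front of it.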
* Re: [RFC] relaxed barrier semantics 2010-07-29 19:44 ` [RFC] relaxed barrier semantics Ric Wheeler @ 2010-07-29 19:49 ` Christoph Hellwig 2010-07-29 19:56 ` Ric Wheeler 2010-07-31 0:35 ` Jan Kara 1 sibling, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-29 19:49 UTC (permalink / raw) To: Ric Wheeler Cc: Ted Ts'o, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, Jul 29, 2010 at 03:44:31PM -0400, Ric Wheeler wrote: > I confess that I am a bit fuzzy on FUA, but think that it means that any > FUA tagged IO will go down to persistent store before returning. Exactly. > If so, then all order dependent IO would need to be issued in order and > tagged with FUA. It would not suffice to tag just the commit record as > FUA, or do I misunderstand what FUA does? The commit record is ext3/4-specific terminology. In xfs we just have one type of log buffers, and we could tag that as FUA. There is very little other dependent I/O, but if that is present we need a pre-flush for it anyway. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 19:49 ` Christoph Hellwig @ 2010-07-29 19:56 ` Ric Wheeler 2010-07-29 19:59 ` James Bottomley 2010-07-29 22:30 ` Andreas Dilger 0 siblings, 2 replies; 155+ messages in thread From: Ric Wheeler @ 2010-07-29 19:56 UTC (permalink / raw) To: Christoph Hellwig Cc: Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On 07/29/2010 03:49 PM, Christoph Hellwig wrote: > On Thu, Jul 29, 2010 at 03:44:31PM -0400, Ric Wheeler wrote: > >> I confess that I am a bit fuzzy on FUA, but think that it means that any >> FUA tagged IO will go down to persistent store before returning. >> > Exactly. > > >> If so, then all order dependent IO would need to be issued in order and >> tagged with FUA. It would not suffice to tag just the commit record as >> FUA, or do I misunderstand what FUA does? >> > The commit record is ext3/4 specific terminalogy. In xfs we just have > one type of log buffers, and we could tag that as FUA. There is very > little other depenent I/O, but if that is present we need a pre-flush > for it anyway. > > I assume that for ext3 it would get more complicated depending on the journal mode. In ordered or data journal mode, we would have to write the dependent non-journal data tagged with FUA, then the FUA tagged transaction and finally the FUA tagged commit block. Not sure how FUA performs, but writing lots of small tagged writes is probably not good for performance... Ric ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 19:56 ` Ric Wheeler @ 2010-07-29 19:59 ` James Bottomley 2010-07-29 20:03 ` Christoph Hellwig 2010-07-29 20:58 ` Ric Wheeler 2010-07-29 22:30 ` Andreas Dilger 1 sibling, 2 replies; 155+ messages in thread From: James Bottomley @ 2010-07-29 19:59 UTC (permalink / raw) To: Ric Wheeler Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, 2010-07-29 at 15:56 -0400, Ric Wheeler wrote: > On 07/29/2010 03:49 PM, Christoph Hellwig wrote: > > On Thu, Jul 29, 2010 at 03:44:31PM -0400, Ric Wheeler wrote: > > > >> I confess that I am a bit fuzzy on FUA, but think that it means that any > >> FUA tagged IO will go down to persistent store before returning. > >> > > Exactly. > > > > > >> If so, then all order dependent IO would need to be issued in order and > >> tagged with FUA. It would not suffice to tag just the commit record as > >> FUA, or do I misunderstand what FUA does? > >> > > The commit record is ext3/4 specific terminalogy. In xfs we just have > > one type of log buffers, and we could tag that as FUA. There is very > > little other depenent I/O, but if that is present we need a pre-flush > > for it anyway. > > > > > > I assume that for ext3 it would get more complicated depending on the > journal mode. In ordered or data journal mode, we would have to write > the dependent non-journal data tagged with FUA, then the FUA tagged > transaction and finally the FUA tagged commit block. > > Not sure how FUA performs, but writing lots of small tagged writes is > probably not good for performance... That's basically everything FUA ... you might just as well switch your cache to write through and have done. This, by the way, is one area I'm hoping to have researched on SCSI (where most devices do obey the caching directives). Actually see if write through without flush barriers is faster than writeback with flush barriers. 
I really suspect it is. James ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 19:59 ` James Bottomley @ 2010-07-29 20:03 ` Christoph Hellwig 2010-07-29 20:07 ` James Bottomley 2010-07-30 12:46 ` Vladislav Bolkhovitin 2010-07-29 20:58 ` Ric Wheeler 1 sibling, 2 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-29 20:03 UTC (permalink / raw) To: James Bottomley Cc: Ric Wheeler, Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, Jul 29, 2010 at 02:59:51PM -0500, James Bottomley wrote: > That's basically everything FUA ... you might just as well switch your > cache to write through and have done. > > This, by the way, is one area I'm hoping to have researched on SCSI > (where most devices do obey the caching directives). Actually see if > write through without flush barriers is faster than writeback with flush > barriers. I really suspect it is. We have done the research and at least for XFS a write through cache actually is faster for many workloads. Ric always has workloads where the cache is faster, though - mostly doing lots of small file write kind of setups. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 20:03 ` Christoph Hellwig @ 2010-07-29 20:07 ` James Bottomley 2010-07-29 20:11 ` Christoph Hellwig 1 sibling, 1 reply; 155+ messages in thread From: James Bottomley @ 2010-07-29 20:07 UTC (permalink / raw) To: Christoph Hellwig Cc: Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, 2010-07-29 at 22:03 +0200, Christoph Hellwig wrote: > On Thu, Jul 29, 2010 at 02:59:51PM -0500, James Bottomley wrote: > > That's basically everything FUA ... you might just as well switch your > > cache to write through and have done. > > > > This, by the way, is one area I'm hoping to have researched on SCSI > > (where most devices do obey the caching directives). Actually see if > > write through without flush barriers is faster than writeback with flush > > barriers. I really suspect it is. > > We have done the research and at least for XFS a write through cache > actually is faster for many workloads. Ric always has workloads where > the cache is faster, though - mostly doing lots of small file write > kind of setups. There's lies, damned lies and benchmarks .. but what I was thinking is could we just do the right thing? SCSI exposes (in sd) the interfaces to change the cache setting, so if the customer *doesn't* specify barriers on mount, could we just flip the device to write through? It would be more performant in most use cases. James ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 20:07 ` James Bottomley @ 2010-07-29 20:11 ` Christoph Hellwig 2010-07-30 12:45 ` Vladislav Bolkhovitin 2010-08-04 1:58 ` Jamie Lokier 0 siblings, 2 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-29 20:11 UTC (permalink / raw) To: James Bottomley Cc: Christoph Hellwig, Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, Jul 29, 2010 at 03:07:17PM -0500, James Bottomley wrote: > There's lies, damned lies and benchmarks .. but what I was thinking is > could we just do the right thing? SCSI exposes (in sd) the interfaces > to change the cache setting, so if the customer *doesn't* specify > barriers on mount, could we just flip the device to write through it > would be more performant in most use cases. We could for SCSI and ATA, but probably not easily for other kinds of storage. Except that it's not that simple as we have partitions and volume managers inbetween - different filesystems sitting on the same device might have very different ideas of what they want. For SCSI we can at least permanently disable the cache, but ATA devices keep coming up again with the volatile write cache enabled after a reboot, or even worse a suspend to ram / resume cycle. The latter is what keeps me from just disabling the volatile cache on my laptop, despite that option giving significantly better performance for typical kernel developer workloads. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 20:11 ` Christoph Hellwig @ 2010-07-30 12:45 ` Vladislav Bolkhovitin 2010-07-30 12:56 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-30 12:45 UTC (permalink / raw) To: Christoph Hellwig Cc: James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Christoph Hellwig, on 07/30/2010 12:11 AM wrote: > On Thu, Jul 29, 2010 at 03:07:17PM -0500, James Bottomley wrote: >> There's lies, damned lies and benchmarks .. but what I was thinking is >> could we just do the right thing? SCSI exposes (in sd) the interfaces >> to change the cache setting, so if the customer *doesn't* specify >> barriers on mount, could we just flip the device to write through it >> would be more performant in most use cases. > > We could for SCSI and ATA, but probably not easily for other kind of > storage. Except that it's not that simple as we have partitions and > volume managers inbetween - different filesystems sitting on the same > device might have very different ideas of what they want. > > For SCSI we can at least permanently disable the cache, but ATA devices > keep coming up again with the volatile write cache enabled after a > reboot, or even worse a suspend to ram / resume cycle. The latter is > what keeps me from just disabling the volatile cache on my laptop, > despite that option giving significanly better performance for typical > kernel developer workloads. There are also SCSI devices which keep changed settings only until the next reset/restart. (The devices might be shared, so other initiators can reset them at any time.) So, to keep the changed settings from being reset, there must be a procedure which catches the corresponding notification event (a RESET Unit Attention for SCSI) and sets the affected settings back to the desired values.
Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 12:45 ` Vladislav Bolkhovitin @ 2010-07-30 12:56 ` Christoph Hellwig 0 siblings, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-30 12:56 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Fri, Jul 30, 2010 at 04:45:00PM +0400, Vladislav Bolkhovitin wrote: > There are also SCSI devices which keep changed settings only until the > next reset/restart. (The devices might be shared, so other initiators > can at any time reset them.) I haven't seen a scsi device without support for the saved values mode pages for years. But yes, in a shared environment every initator could change the settings. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 20:11 ` Christoph Hellwig 2010-07-30 12:45 ` Vladislav Bolkhovitin @ 2010-08-04 1:58 ` Jamie Lokier 1 sibling, 0 replies; 155+ messages in thread From: Jamie Lokier @ 2010-08-04 1:58 UTC (permalink / raw) To: Christoph Hellwig Cc: James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Christoph Hellwig wrote: > On Thu, Jul 29, 2010 at 03:07:17PM -0500, James Bottomley wrote: > > There's lies, damned lies and benchmarks .. but what I was thinking is > > could we just do the right thing? SCSI exposes (in sd) the interfaces > > to change the cache setting, so if the customer *doesn't* specify > > barriers on mount, could we just flip the device to write through it > > would be more performant in most use cases. > > We could for SCSI and ATA, but probably not easily for other kind of > storage. Except that it's not that simple as we have partitions and > volume managers inbetween - different filesystems sitting on the same > device might have very different ideas of what they want. > > For SCSI we can at least permanently disable the cache, but ATA devices > keep coming up again with the volatile write cache enabled after a > reboot, or even worse a suspend to ram / resume cycle. The latter is > what keeps me from just disabling the volatile cache on my laptop, > despite that option giving significanly better performance for typical > kernel developer workloads. I have workloads where enabling volatile write cache + barriers is much faster than disabling the cache. It is admittedly a 2.4.ancient kernel and PATA on an embedded system, but still, it's enough of a difference (about 3x speedup for large file writes) that it was worth porting SuSE's barrier patches to that kernel so that I could enable the write cache to get a huge speedup while remaining powerfail safe with ext3. 
-- Jamie ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 20:03 ` Christoph Hellwig 2010-07-29 20:07 ` James Bottomley @ 2010-07-30 12:46 ` Vladislav Bolkhovitin 2010-07-30 12:57 ` Christoph Hellwig 1 sibling, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-30 12:46 UTC (permalink / raw) To: Christoph Hellwig Cc: James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Christoph Hellwig, on 07/30/2010 12:03 AM wrote: > On Thu, Jul 29, 2010 at 02:59:51PM -0500, James Bottomley wrote: >> That's basically everything FUA ... you might just as well switch your >> cache to write through and have done. >> >> This, by the way, is one area I'm hoping to have researched on SCSI >> (where most devices do obey the caching directives). Actually see if >> write through without flush barriers is faster than writeback with flush >> barriers. I really suspect it is. > > We have done the research and at least for XFS a write through cache > actually is faster for many workloads. Ric always has workloads where > the cache is faster, though - mostly doing lots of small file write > kind of setups. I suppose that with the write-back cache you did the queue drain after request(s) with ordered requirements, correct? Did you also do the queue drain in the same places with write-through caching? Just in case, to be sure the comparison was fair. I can't see why a sequence of [(write command/internal cache sync) .. (write command/internal cache sync)] for write-through caching should be faster than a sequence of [(write command) .. (write command) (cache sync) .. (write command) .. (write command) (cache sync)], unless there is additional queue flushing (draining) in the latter case. I think we need to explain that before doing the next step. Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 12:46 ` Vladislav Bolkhovitin @ 2010-07-30 12:57 ` Christoph Hellwig 2010-07-30 13:09 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-30 12:57 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Fri, Jul 30, 2010 at 04:46:12PM +0400, Vladislav Bolkhovitin wrote: > I supposed, with write back cache you did the queue drain after > request(s) with ordered requirements, correct? Did you also do the queue > drain in the same places with write through caching? Using the queue drains in both cases. I can only imagine keeping the queue drained over the cache flush instead of just a few small I/Os has nasty side effects. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 12:57 ` Christoph Hellwig @ 2010-07-30 13:09 ` Vladislav Bolkhovitin 2010-07-30 13:12 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-30 13:09 UTC (permalink / raw) To: Christoph Hellwig Cc: James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Christoph Hellwig, on 07/30/2010 04:57 PM wrote: > On Fri, Jul 30, 2010 at 04:46:12PM +0400, Vladislav Bolkhovitin wrote: >> I supposed, with write back cache you did the queue drain after >> request(s) with ordered requirements, correct? Did you also do the queue >> drain in the same places with write through caching? > > Using the queue drains in both cases. I can only imagine keeping the > queue drained over the cache flush instead of just a few small I/Os > has nasty side effects. Sorry, I can't follow you here. What was the load pattern difference between the tests in the way how the backend device saw it? I thought, it was only in absence of the cache flush commands (SYNCHRONIZE_CACHE?) in the write through case, but looks like there is something more different? Thanks, Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 13:09 ` Vladislav Bolkhovitin @ 2010-07-30 13:12 ` Christoph Hellwig 2010-07-30 17:40 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-30 13:12 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Fri, Jul 30, 2010 at 05:09:52PM +0400, Vladislav Bolkhovitin wrote: > Sorry, I can't follow you here. What was the load pattern difference > between the tests in the way how the backend device saw it? I thought, > it was only in absence of the cache flush commands (SYNCHRONIZE_CACHE?) > in the write through case, but looks like there is something more different? The only difference in commands is that we see no SYNCHRONIZE_CACHE. The big picture difference is that we also only drain the queue just to undrain it ASAP, instead of keeping it drained over a sequence of SYNCHRONIZE_CACHE + WRITE + SYNCHRONIZE_CACHE, which can make a huge difference for a device with very low latencies like the SSD in my laptop. ^ permalink raw reply [flat|nested] 155+ messages in thread
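Christoph's point about the drain being held over the whole flush sequence can be put in rough numbers with a crude latency model (Python sketch; the per-command latencies are assumed, illustrative values, not measurements):

```python
def barrier_stall(seq_latencies_us):
    """Time the queue sits drained while an ordered barrier sequence runs.

    In this simplification the drain lasts for the whole sequence, so
    the stall is just the sum of the per-command latencies; nothing else
    can be dispatched meanwhile.
    """
    return sum(seq_latencies_us)

# Illustrative numbers for a low-latency SSD (assumptions): a single
# write ~100us, a full cache flush ~2ms.
WRITE_US, FLUSH_US = 100, 2000

# Write-through device: the drain only has to cover the commit write.
wt_stall = barrier_stall([WRITE_US])

# Write-back device with barriers: the drain is held over
# SYNCHRONIZE_CACHE + WRITE + SYNCHRONIZE_CACHE.
wb_stall = barrier_stall([FLUSH_US, WRITE_US, FLUSH_US])

# On a fast device the held drain dominates by more than an order of
# magnitude in this model.
assert wb_stall > 40 * wt_stall
```

The model is deliberately naive (no queue depth, no overlap with unrelated I/O), but it captures why draining across the flush sequence, rather than briefly, hurts most on devices with very low write latency.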
* Re: [RFC] relaxed barrier semantics 2010-07-30 13:12 ` Christoph Hellwig @ 2010-07-30 17:40 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-30 17:40 UTC (permalink / raw) To: Christoph Hellwig Cc: James Bottomley, Ric Wheeler, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Christoph Hellwig, on 07/30/2010 05:12 PM wrote: > On Fri, Jul 30, 2010 at 05:09:52PM +0400, Vladislav Bolkhovitin wrote: >> Sorry, I can't follow you here. What was the load pattern difference >> between the tests in the way how the backend device saw it? I thought, >> it was only in absence of the cache flush commands (SYNCHRONIZE_CACHE?) >> in the write through case, but looks like there is something more different? > > The only difference in commands is that we see no SYNCHRONIZE_CACHE. > The big picture difference is that we also only drain the queue just > to undrain it ASAP, instead of keeping it drained over a sequence > of SYNCHRONIZE_CACHE + WRITE + SYNCHRONIZE_CACHE, which can make > a huge difference for a device with very low latencies like the SSD > in my laptop. It's weird. I can only explain it if: 1. The device fully or partially lies about write-through mode. By "partially" I mean something like the response being returned when the writes are only "almost" on the media. 2. The device has a very ineffective SYNCHRONIZE_CACHE implementation. For instance, it has a relatively slow internal cache scan (you do a complete cache flush, not only of the blocks affected by the previous writes, correct?). It would be good if you performed your test on some software SCSI target device, where we can fully control and see what's going on inside. Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 19:59 ` James Bottomley 2010-07-29 20:03 ` Christoph Hellwig @ 2010-07-29 20:58 ` Ric Wheeler 1 sibling, 0 replies; 155+ messages in thread From: Ric Wheeler @ 2010-07-29 20:58 UTC (permalink / raw) To: James Bottomley Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On 07/29/2010 03:59 PM, James Bottomley wrote: > On Thu, 2010-07-29 at 15:56 -0400, Ric Wheeler wrote: > >> On 07/29/2010 03:49 PM, Christoph Hellwig wrote: >> >>> On Thu, Jul 29, 2010 at 03:44:31PM -0400, Ric Wheeler wrote: >>> >>> >>>> I confess that I am a bit fuzzy on FUA, but think that it means that any >>>> FUA tagged IO will go down to persistent store before returning. >>>> >>>> >>> Exactly. >>> >>> >>> >>>> If so, then all order dependent IO would need to be issued in order and >>>> tagged with FUA. It would not suffice to tag just the commit record as >>>> FUA, or do I misunderstand what FUA does? >>>> >>>> >>> The commit record is ext3/4 specific terminalogy. In xfs we just have >>> one type of log buffers, and we could tag that as FUA. There is very >>> little other depenent I/O, but if that is present we need a pre-flush >>> for it anyway. >>> >>> >>> >> I assume that for ext3 it would get more complicated depending on the >> journal mode. In ordered or data journal mode, we would have to write >> the dependent non-journal data tagged with FUA, then the FUA tagged >> transaction and finally the FUA tagged commit block. >> >> Not sure how FUA performs, but writing lots of small tagged writes is >> probably not good for performance... >> > That's basically everything FUA ... you might just as well switch your > cache to write through and have done. > I think that for data=ordered mode, more or less all of the data would get tagged. For data=journal, would we have to send 2x the write workload down with tags?
I agree that this would be dubious at best. Note that using the non-FUA cache flush commands, while brute force, does have a clear win on slower devices (S-ATA specifically). Each time I have looked, using the write cache enabled on S-ATA was a win (big win on streaming write performance, not sure why). On SAS drives, the flush barriers were not as large a delta (do not remember which won out). > This, by the way, is one area I'm hoping to have researched on SCSI > (where most devices do obey the caching directives). Actually see if > write through without flush barriers is faster than writeback with flush > barriers. I really suspect it is. > > James > There are clearly much better ways to do this. Even the flushes, if we could flush ranges that matched the partition under the file system, would be better than today where we flush the entire physical device. Ric ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 19:56 ` Ric Wheeler 2010-07-29 19:59 ` James Bottomley @ 2010-07-29 22:30 ` Andreas Dilger 2010-07-29 23:04 ` Ted Ts'o 1 sibling, 1 reply; 155+ messages in thread From: Andreas Dilger @ 2010-07-29 22:30 UTC (permalink / raw) To: Ric Wheeler Cc: Christoph Hellwig, Ted Ts'o, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On 2010-07-29, at 13:56, Ric Wheeler wrote: > I assume that for ext3 it would get more complicated depending on the journal mode. In ordered or data journal mode, we would have to write the dependent non-journal data tagged with FUA, then the FUA tagged transaction and finally the FUA tagged commit block. Like James wrote, this is basically everything FUA. It is OK for ordered mode to allow the device to aggregate the normal filesystem and journal IO, but when the commit block is written it should flush all of the previously written data to disk. This still allows request re-ordering and merging inside the device, but orders the data vs. the commit block. Having the proposed "flush ranges" interface to the disk would be ideal, since there would be no wasted time flushing data that does not need it (i.e. other partitions). There is no need to prevent new data from being written during a cache flush, since ext*/jbd will already manage any required data/metadata ordering internally. There was some proposal (maybe from Eric Sandeen?) about having a device-level IO request counter that numbers every request submitted, and if there are multiple partitions per device, or fsync operations that flush the device cache, it is possible to determine from the request number whether there has already been a cache flush after that request on that device. This avoids extra cache flushes if it was just done for another file or partition on the same device. Cheers, Andreas ^ permalink raw reply [flat|nested] 155+ messages in thread
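The per-device request-counter idea Andreas mentions (attributed in the thread to Eric Sandeen) can be sketched roughly as follows (Python for illustration; `FlushTracker` and its method names are hypothetical, not a proposed kernel interface):

```python
class FlushTracker:
    """Sketch of the request-counter idea: number every write submitted
    to a device, remember the highest number already covered by a
    completed cache flush, and skip flushes that would persist nothing
    new -- e.g. when another partition or file on the same device just
    flushed the cache."""

    def __init__(self):
        self.seq = 0          # last request number issued
        self.flushed_seq = 0  # highest seq covered by a completed flush

    def submit_write(self):
        self.seq += 1
        return self.seq

    def flush_needed(self, req_seq):
        # A flush issued on behalf of someone else may already cover
        # this request.
        return req_seq > self.flushed_seq

    def flush(self):
        self.flushed_seq = self.seq

t = FlushTracker()
a = t.submit_write()          # say, data for an fsync of file A
b = t.submit_write()          # data for an fsync of file B, same device
assert t.flush_needed(a)
t.flush()                     # A's fsync flushes the device cache
assert not t.flush_needed(b)  # B's fsync can now skip the flush
```

The real mechanism would of course have to handle in-flight flushes and completion ordering, but the bookkeeping above is the core of how redundant device-wide flushes get elided.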
* Re: [RFC] relaxed barrier semantics 2010-07-29 22:30 ` Andreas Dilger @ 2010-07-29 23:04 ` Ted Ts'o 2010-07-29 23:08 ` Ric Wheeler ` (6 more replies) 0 siblings, 7 replies; 155+ messages in thread From: Ted Ts'o @ 2010-07-29 23:04 UTC (permalink / raw) To: Andreas Dilger Cc: Ric Wheeler, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote: > Like James wrote, this is basically everything FUA. It is OK for > ordered mode to allow the device to aggregate the normal filesystem > and journal IO, but when the commit block is written it should flush > all of the previously written data to disk. This still allows > request re-ordering and merging inside the device, but orders the > data vs. the commit block. Having the proposed "flush ranges" > interface to the disk would be ideal, since there would be no wasted > time flushing data that does not need it (i.e. other partitions). My understanding is that "everything FUA" can be a performance disaster. That's because it bypasses the track buffer, and things get written directly to disk. So there is no possibility to reorder buffers so that they get written in one disk rotation. Depending on the disk, it might even be that if you send N sequential sectors all tagged with FUA, it could be slower than sending the N sectors followed by a cache flush or SYNCHRONIZE_CACHE command. It may be worth doing some experiments to see how big N is for various disks, but I'm pretty sure that FUA will probably turn out to not be such a great idea for ext3/ext4. - Ted ^ permalink raw reply [flat|nested] 155+ messages in thread
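[Editor's note] Ted's worry can be made concrete with a back-of-the-envelope timing model. The numbers below are illustrative assumptions only (a 7200 rpm disk, an arbitrary per-sector transfer cost); real firmware behaviour varies, which is why he proposes measuring:

```python
ROT_MS = 8.33    # one rotation of a 7200 rpm disk, in milliseconds
XFER_MS = 0.02   # assumed time to hand one sector to the track buffer

def fua_each_sector(n):
    # Pessimistic untagged case: every FUA write must reach the media
    # before being acked, so each sector can miss its rotational slot
    # and cost up to a full rotation.
    return n * ROT_MS

def cache_then_flush(n):
    # Writes land in the track buffer; one SYNCHRONIZE_CACHE then lays
    # the whole sequential run down in roughly a single rotation.
    return n * XFER_MS + ROT_MS
```

Even at N = 2 the cached variant wins in this model; the experiment Ted suggests would measure where (and whether) the crossover actually occurs on real drives, and how tagged queuing changes the picture.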
* Re: [RFC] relaxed barrier semantics 2010-07-29 23:04 ` Ted Ts'o @ 2010-07-29 23:08 ` Ric Wheeler 2010-07-29 23:08 ` Ric Wheeler ` (5 subsequent siblings) 6 siblings, 0 replies; 155+ messages in thread From: Ric Wheeler @ 2010-07-29 23:08 UTC (permalink / raw) To: Ted Ts'o, Andreas Dilger, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara On 07/29/2010 07:04 PM, Ted Ts'o wrote: > On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote: > >> Like James wrote, this is basically everything FUA. It is OK for >> ordered mode to allow the device to aggregate the normal filesystem >> and journal IO, but when the commit block is written it should flush >> all of the previously written data to disk. This still allows >> request re-ordering and merging inside the device, but orders the >> data vs. the commit block. Having the proposed "flush ranges" >> interface to the disk would be ideal, since there would be no wasted >> time flushing data that does not need it (i.e. other partitions). >> > My understanding is that "everything FUA" can be a performance > disaster. That's because it bypasses the track buffer, and things get > written directly to disk. So there is no possibility to reorder > buffers so that they get written in one disk rotation. Depending on > the disk, it might even be that if you send N sequential sectors all > tagged with FUA, it could be slower than sending the N sectors > followed by a cache flush or SYNCHRONIZE_CACHE command. > You certainly can reorder in a drive with FUA, you just cannot ACK the write until the tagged request is on disk. That clearly depends on the firmware of the device and, if it is an uncommon request, firmware people are unlikely to have spent too much thought and time doing it right :-) > It may be worth doing some experiments to see how big N is for various > disks, but I'm pretty sure that FUA will probably turn out to not be > such a great idea for ext3/ext4. 
> > - Ted > I am also sceptical and would expect a lot of variability in the results, Ric ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 23:04 ` Ted Ts'o 2010-07-29 23:08 ` Ric Wheeler 2010-07-29 23:08 ` Ric Wheeler @ 2010-07-29 23:28 ` James Bottomley 2010-07-29 23:37 ` James Bottomley 2010-07-30 12:56 ` Vladislav Bolkhovitin 2010-07-30 7:11 ` Christoph Hellwig ` (3 subsequent siblings) 6 siblings, 2 replies; 155+ messages in thread From: James Bottomley @ 2010-07-29 23:28 UTC (permalink / raw) To: Ted Ts'o Cc: Andreas Dilger, Ric Wheeler, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, 2010-07-29 at 19:04 -0400, Ted Ts'o wrote: > On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote: > > Like James wrote, this is basically everything FUA. It is OK for > > ordered mode to allow the device to aggregate the normal filesystem > > and journal IO, but when the commit block is written it should flush > > all of the previously written data to disk. This still allows > > request re-ordering and merging inside the device, but orders the > > data vs. the commit block. Having the proposed "flush ranges" > > interface to the disk would be ideal, since there would be no wasted > > time flushing data that does not need it (i.e. other partitions). > > My understanding is that "everything FUA" can be a performance > disaster. That's because it bypasses the track buffer, and things get > written directly to disk. So there is no possibility to reorder > buffers so that they get written in one disk rotation. Depending on > the disk, it might even be that if you send N sequential sectors all > tagged with FUA, it could be slower than sending the N sectors > followed by a cache flush or SYNCHRONIZE_CACHE command. I think we're getting into disk differences here. This certainly isn't correct for SCSI disks. The standard enterprise configuration for a SCSI disk is actually cache set to write through ... so FUA is a nop. 
Even for Write Back cache SCSI devices, FUA is just a wait until I/O is on media, which is pretty much equivalent to the write through case for the given cache lines. I can see the problems you describe possibly affecting ATA devices with less sophisticated caches ... but, realistically, SATA and SAS devices come from virtually the same manufacturing process ... I'd be really surprised if they didn't share caching technologies. > It may be worth doing some experiments to see how big N is for various > disks, but I'm pretty sure that FUA will probably turn out to not be > such a great idea for ext3/ext4. I think we should definitely run the benchmarks. James ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 23:28 ` James Bottomley @ 2010-07-29 23:37 ` James Bottomley 2010-07-30 0:19 ` Ted Ts'o 2010-07-30 12:56 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 155+ messages in thread From: James Bottomley @ 2010-07-29 23:37 UTC (permalink / raw) To: Ted Ts'o Cc: Andreas Dilger, Ric Wheeler, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, 2010-07-29 at 18:28 -0500, James Bottomley wrote: > On Thu, 2010-07-29 at 19:04 -0400, Ted Ts'o wrote: > > On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote: > > > Like James wrote, this is basically everything FUA. It is OK for > > > ordered mode to allow the device to aggregate the normal filesystem > > > and journal IO, but when the commit block is written it should flush > > > all of the previously written data to disk. This still allows > > > request re-ordering and merging inside the device, but orders the > > > data vs. the commit block. Having the proposed "flush ranges" > > > interface to the disk would be ideal, since there would be no wasted > > > time flushing data that does not need it (i.e. other partitions). > > > > My understanding is that "everything FUA" can be a performance > > disaster. That's because it bypasses the track buffer, and things get > > written directly to disk. So there is no possibility to reorder > > buffers so that they get written in one disk rotation. Depending on > > the disk, it might even be that if you send N sequential sectors all > > tagged with FUA, it could be slower than sending the N sectors > > followed by a cache flush or SYNCHRONIZE_CACHE command. > > I think we're getting into disk differences here. This certainly isn't > correct for SCSI disks. The standard enterprise configuration for a > SCSI disk is actually cache set to write through ... so FUA is a nop. 
> Even for Write Back cache SCSI devices, FUA is just a wait until I/O is > on media, which is pretty much equivalent to the write through case for > the given cache lines. > > I can see the problems you describe possibly affecting ATA devices with > less sophisticated caches ... but, realistically, SATA and SAS devices > come from virtually the same manufacturing process ... I'd be really > surprised if they didn't share caching technologies. Actually, just an update on this now that I've taken my SCSI glasses off. Anything that does tagging properly ... like SCSI or SATA NCQ shouldn't have this problem because the multiple outstanding tags hide the media access latency. For untagged devices, yes, it will be painful. James ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 23:37 ` James Bottomley @ 2010-07-30 0:19 ` Ted Ts'o 0 siblings, 0 replies; 155+ messages in thread From: Ted Ts'o @ 2010-07-30 0:19 UTC (permalink / raw) To: James Bottomley Cc: Andreas Dilger, Ric Wheeler, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu, Jul 29, 2010 at 06:37:35PM -0500, James Bottomley wrote: > Actually, just an update on this now that I've taken my SCSI glasses > off. Anything that does tagging properly ... like SCSI or SATA NCQ > shouldn't have this problem because the multiple outstanding tags hide > the media access latency. For untagged devices, yes, it will be > painful. > Maybe I'm just being too paranoid and not trusting enough of the competence of firmware authors, but let's do a lot of testing on this first. Or let's have some options so we can turn off FUA if it turns out to be a disaster on a particular device. I'll have to do some searching, but I distinctly remember reading an article in Ars Technica or AnandTech about how FUA wasn't all that useful, based on what the writer had seen when testing some specific devices. Maybe that was a while ago and devices have gotten better, and maybe that writer was on crack, but given that FUA doesn't get used a lot, I'm nervous.... - Ted ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 23:28 ` James Bottomley 2010-07-29 23:37 ` James Bottomley @ 2010-07-30 12:56 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-30 12:56 UTC (permalink / raw) To: James Bottomley Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke James Bottomley, on 07/30/2010 03:28 AM wrote: > On Thu, 2010-07-29 at 19:04 -0400, Ted Ts'o wrote: >> On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote: >>> Like James wrote, this is basically everything FUA. It is OK for >>> ordered mode to allow the device to aggregate the normal filesystem >>> and journal IO, but when the commit block is written it should flush >>> all of the previously written data to disk. This still allows >>> request re-ordering and merging inside the device, but orders the >>> data vs. the commit block. Having the proposed "flush ranges" >>> interface to the disk would be ideal, since there would be no wasted >>> time flushing data that does not need it (i.e. other partitions). >> >> My understanding is that "everything FUA" can be a performance >> disaster. That's because it bypasses the track buffer, and things get >> written directly to disk. So there is no possibility to reorder >> buffers so that they get written in one disk rotation. Depending on >> the disk, it might even be that if you send N sequential sectors all >> tagged with FUA, it could be slower than sending the N sectors >> followed by a cache flush or SYNCHRONIZE_CACHE command. > > I think we're getting into disk differences here. This certainly isn't > correct for SCSI disks. The standard enterprise configuration for a > SCSI disk is actually cache set to write through ... so FUA is a nop. 
> Even for Write Back cache SCSI devices, FUA is just a wait until I/O is > on media, which is pretty much equivalent to the write through case for > the given cache lines. > > I can see the problems you describe possibly affecting ATA devices with > less sophisticated caches ... but, realistically, SATA and SAS devices > come from virtually the same manufacturing process ... I'd be really > surprised if they didn't share caching technologies. Please, don't limit consideration to local disks only! Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 23:04 ` Ted Ts'o ` (2 preceding siblings ...) 2010-07-29 23:28 ` James Bottomley @ 2010-07-30 7:11 ` Christoph Hellwig 2010-07-30 7:11 ` Christoph Hellwig ` (2 subsequent siblings) 6 siblings, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-30 7:11 UTC (permalink / raw) To: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig, Tejun Heo, Vivek Goyal On Thu, Jul 29, 2010 at 07:04:06PM -0400, Ted Ts'o wrote: > My understanding is that "everything FUA" can be a performance > disaster. That's because it bypasses the track buffer, and things get > written directly to disk. So there is no possibility to reorder > buffers so that they get written in one disk rotation. Depending on > the disk, it might even be that if you send N sequential sectors all > tagged with FUA, it could be slower than sending the N sectors > followed by a cache flush or SYNCHRONIZE_CACHE command. Not sure why the discussion is drifting in this direction again, but no one suggested switching everyone to forcefully use a FUA-only primitive. If we offer a WRITE_FUA primitive to those who can make use of it, it won't mean that the cache flush primitive will go away - we will need it to implement fsync anyway. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 23:04 ` Ted Ts'o ` (4 preceding siblings ...) 2010-07-30 7:11 ` Christoph Hellwig @ 2010-07-30 12:56 ` Vladislav Bolkhovitin 2010-07-30 13:07 ` Tejun Heo 2010-07-30 13:09 ` Christoph Hellwig 2010-07-30 12:56 ` Vladislav Bolkhovitin 6 siblings, 2 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-30 12:56 UTC (permalink / raw) To: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig, Tejun Heo, Vivek Goyal Ted Ts'o, on 07/30/2010 03:04 AM wrote: > On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote: >> Like James wrote, this is basically everything FUA. It is OK for >> ordered mode to allow the device to aggregate the normal filesystem >> and journal IO, but when the commit block is written it should flush >> all of the previously written data to disk. This still allows >> request re-ordering and merging inside the device, but orders the >> data vs. the commit block. Having the proposed "flush ranges" >> interface to the disk would be ideal, since there would be no wasted >> time flushing data that does not need it (i.e. other partitions). > > My understanding is that "everything FUA" can be a performance > disaster. That's because it bypasses the track buffer, and things get > written directly to disk. So there is no possibility to reorder > buffers so that they get written in one disk rotation. Depending on > the disk, it might even be that if you send N sequential sectors all > tagged with FUA, it could be slower than sending the N sectors > followed by a cache flush or SYNCHRONIZE_CACHE command. It should be, because it gives the drive the opportunity to better load its internal resources and provide data transfer pipelining. Although, of course, it's possible to imagine a stupid drive with nearly broken caching which would work faster in write-through mode. I used the word "drive", not "disk", above because I think this discussion is not only about disks. 
Storage is not only disks, but also external arrays and even clusters of arrays. They all look to the system like single "disks", but are much more advanced and sophisticated in all internal capabilities than dumb (S)ATA disks. Now such arrays and clusters are getting more and more commonly used. Anybody can make such an array using just a Linux box with any OSS SCSI target software and use it over a variety of interfaces: iSCSI, Fibre Channel, SAS, InfiniBand and even familiar parallel SCSI (Funny, 2 Linux boxes connected by Wide SCSI :) ). So, why limit the discussion to low-end disks only? I believe it would be more productive if we first determine the set of capabilities which should be used for the best performance and which advanced storage devices can provide, and then go down to the lower end, eliminating the use of the advanced features and sacrificing performance. Otherwise, ignoring the "hardware offload" which advanced devices provide, we would never achieve the best performance they could give. I'd start the analysis of the best-performance facilities from the following: 1. Full set of SCSI queuing and task management control facilities. Namely: - SIMPLE, ORDERED, ACA and, maybe, HEAD OF QUEUE command attributes - Never draining the queue to wait for completion of one or more commands, except in some rare error recovery cases. - ACA and UA_INTRCK for protecting the queue order in case one or more commands in it finished abnormally. - Use of write-back caching by default and a switch to write-through only for "blacklisted" drives. - FUA for sequences of a few write commands, where either a SYNCHRONIZE_CACHE command is overkill, or there is an internal order dependency between the commands, so they must be written to the media exactly in the required order. So, for instance, a naive sequence of meta-data updates with the corresponding journal writes would be a chain of commands: 1. 1st journal write command (SIMPLE) 2. 
2d journal write command (SIMPLE) 3. 3d journal write command (SIMPLE) 4. SYNCHRONIZE_CACHE for blocks written by those 3 commands (ORDERED) 5. Necessary amount of meta-data update commands (all SIMPLE) 6. SYNCHRONIZE_CACHE for blocks written in 5 (ORDERED) 7. Command marking the transaction committed in the journal (ORDERED) That's all. No queue draining anywhere. Plus, sending commands without internal order requirements as SIMPLE would allow the drive to better schedule execution of them among internal storage (actual disks). For an error recovery case, consider command (4) abnormally finished because of some external event, like a Unit Attention. Then the drive would establish an ACA condition and suspend the command queue with the commands from (5) at the head. Then the system would retry this command with the ACA attribute. Then, when it finished, it would clear the ACA condition. Then the drive would resume the queue and the commands at the head (from (5)) would start being processed. For a simpler device (a disk without support for ORDERED queuing) the same meta-data updates would be: 1. 1st journal write command 2. 2d journal write command 3. 3d journal write command 4. The queue draining. 5. SYNCHRONIZE_CACHE 6. The queue draining. 7. Necessary amount of meta-data update commands 8. The queue draining. 9. SYNCHRONIZE_CACHE for blocks written in 7 10. The queue draining. 11. Command marking the transaction committed in the journal Then we would need to figure out an interface for file systems to let them specify the necessary ordering and cache flushing requirements in a generic way. The current interface looks almost good, but: 1. In it, the semantics of "barrier" are quite overloaded, hence confusing and hard to implement. 2. It doesn't allow binding several requests into an ordered chain. Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 12:56 ` Vladislav Bolkhovitin @ 2010-07-30 13:07 ` Tejun Heo 2010-07-30 13:22 ` Vladislav Bolkhovitin 2010-07-30 13:09 ` Christoph Hellwig 1 sibling, 1 reply; 155+ messages in thread From: Tejun Heo @ 2010-07-30 13:07 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On 07/30/2010 02:56 PM, Vladislav Bolkhovitin wrote: > 1. 1st journal write command (SIMPLE) > > 2. 2d journal write command (SIMPLE) > > 3. 3d journal write command (SIMPLE) > > 4. SYNCHRONIZE_CACHE for blocks written by those 3 commands (ORDERED) > > 5. Necessary amount of meta-data update commands (all SIMPLE) > > 6. SYNCHRONIZE_CACHE for blocks written in 5 (ORDERED) > > 7. Command marking the transaction committed in the journal (ORDERED) > > That's all. No queue draining anywhere. Plus, sending commands > without internal order requirements as SIMPLE would allow the drive > to better schedule execution of them among internal storage (actual > disks). Are SIMPLE commands ordered against ORDERED commands? Aren't ORDERED ordered among themselves only? -- tejun ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 13:07 ` Tejun Heo @ 2010-07-30 13:22 ` Vladislav Bolkhovitin 2010-07-30 13:27 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-30 13:22 UTC (permalink / raw) To: Tejun Heo Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Tejun Heo, on 07/30/2010 05:07 PM wrote: > On 07/30/2010 02:56 PM, Vladislav Bolkhovitin wrote: >> 1. 1st journal write command (SIMPLE) >> >> 2. 2d journal write command (SIMPLE) >> >> 3. 3d journal write command (SIMPLE) >> >> 4. SYNCHRONIZE_CACHE for blocks written by those 3 commands (ORDERED) >> >> 5. Necessary amount of meta-data update commands (all SIMPLE) >> >> 6. SYNCHRONIZE_CACHE for blocks written in 5 (ORDERED) >> >> 7. Command marking the transaction committed in the journal (ORDERED) >> >> That's all. No queue draining anywhere. Plus, sending commands >> without internal order requirements as SIMPLE would allow the drive >> to better schedule execution of them among internal storage (actual >> disks). > > Are SIMPLE commands ordered against ORDERED commands? Aren't ORDERED > ordered among themselves only? About SIMPLE commands SAM says: "The command shall not enter the enabled command state until all commands having a HEAD OF QUEUE task attribute and older commands having an ORDERED task attribute in the task set have completed" About ORDERED commands: "The command shall not enter the enabled command state until all commands having a HEAD OF QUEUE task attribute and all older commands in the task set have completed". In a normal language it means that ORDERED commands are ordered against all other commands: no SIMPLE command can be executed before ORDERED commands ahead of it completed and no ORDERED command can be executed before all SIMPLE and ORDERED commands ahead of it completed. 
(I excluded HEAD OF QUEUE commands from the consideration for simplicity.) Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 13:22 ` Vladislav Bolkhovitin @ 2010-07-30 13:27 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-30 13:27 UTC (permalink / raw) To: Tejun Heo Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Vladislav Bolkhovitin, on 07/30/2010 05:22 PM wrote: > Tejun Heo, on 07/30/2010 05:07 PM wrote: >> On 07/30/2010 02:56 PM, Vladislav Bolkhovitin wrote: >>> 1. 1st journal write command (SIMPLE) >>> >>> 2. 2d journal write command (SIMPLE) >>> >>> 3. 3d journal write command (SIMPLE) >>> >>> 4. SYNCHRONIZE_CACHE for blocks written by those 3 commands (ORDERED) >>> >>> 5. Necessary amount of meta-data update commands (all SIMPLE) >>> >>> 6. SYNCHRONIZE_CACHE for blocks written in 5 (ORDERED) >>> >>> 7. Command marking the transaction committed in the journal (ORDERED) >>> >>> That's all. No queue draining anywhere. Plus, sending commands >>> without internal order requirements as SIMPLE would allow the drive >>> to better schedule execution of them among internal storage (actual >>> disks). >> >> Are SIMPLE commands ordered against ORDERED commands? Aren't ORDERED >> ordered among themselves only? > > About SIMPLE commands SAM says: "The command shall not enter the enabled > command state until all commands having a HEAD OF QUEUE task attribute > and older commands having an ORDERED task attribute in the task set have > completed" > > About ORDERED commands: "The command shall not enter the enabled command > state until all commands having a HEAD OF QUEUE task attribute and all > older commands in the task set have completed". 
> > In a normal language it means that ORDERED commands are ordered against > all other commands: no SIMPLE command can be executed before ORDERED > commands ahead of it completed and no ORDERED command can be executed > before all SIMPLE and ORDERED commands ahead of it completed. ...and, of course, SIMPLE commands can be freely reordered against neighbor SIMPLE commands. > Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 12:56 ` Vladislav Bolkhovitin 2010-07-30 13:07 ` Tejun Heo @ 2010-07-30 13:09 ` Christoph Hellwig 2010-07-30 13:25 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-30 13:09 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Fri, Jul 30, 2010 at 04:56:31PM +0400, Vladislav Bolkhovitin wrote: > For a simpler device (a disk without support for ORDERED queuing) the > same meta-data updates would be: > > 1. 1st journal write command > > 2. 2d journal write command > > 3. 3d journal write command > > 4. The queue draining. Which is complete overkill. We have state machines for everything we do block I/O on (both data and the journal), which allows us to just wait on the I/O requests we need inside the filesystem instead of draining the queue, or enforce global ordering using ordered tags. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 13:09 ` Christoph Hellwig @ 2010-07-30 13:25 ` Vladislav Bolkhovitin 2010-07-30 13:34 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-30 13:25 UTC (permalink / raw) To: Christoph Hellwig Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Christoph Hellwig, on 07/30/2010 05:09 PM wrote: > On Fri, Jul 30, 2010 at 04:56:31PM +0400, Vladislav Bolkhovitin wrote: >> For a simpler device (a disk without support for ORDERED queuing) the >> same meta-data updates would be: >> >> 1. 1st journal write command >> >> 2. 2d journal write command >> >> 3. 3d journal write command >> >> 4. The queue draining. > > Which is complete overkill. We have state machines for everything we do > block I/O on (both data and the journal), which allows us to just wait > on the I/O requests we need inside the filesystem instead of draining > the queue, or enforce global ordering using ordered tags. Sure. It was only a naive example to illustrate my points. But the FS is still waiting for the requests, so "draining" its "local queue"? Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 13:25 ` Vladislav Bolkhovitin @ 2010-07-30 13:34 ` Christoph Hellwig 2010-07-30 13:44 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-30 13:34 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Fri, Jul 30, 2010 at 05:25:52PM +0400, Vladislav Bolkhovitin wrote: > Sure. It was only a naive example to illustrate my points. But the FS is > still waiting for the requests, so "draining" its "local queue"? Yes, just a much smaller queue in general. To present a typical case, fsync() on a regular file that has a few dirty pages on it using XFS. We use filemap_write_and_wait to write out those few pages and wait for it. And after that we only need to issue a SYNCHRONIZE_CACHE and we'd be done. Right now the draining semantics of the (empty) barrier means we also need to wait for all other I/O in the system to finish, which is rather suboptimal. ^ permalink raw reply [flat|nested] 155+ messages in thread
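[Editor's note] The fsync() path Christoph describes can be sketched as a toy model. The helper names (`Disk`, `wait_for`, `flush_cache`) are invented for illustration; the real XFS path uses filemap_write_and_wait() followed by a block-layer cache flush:

```python
class Disk:
    """Toy device: tracks in-flight writes from all files, counts flushes."""

    def __init__(self):
        self.in_flight = []   # writes from *every* file on the device
        self.flushes = 0

    def write(self, page):
        self.in_flight.append(page)

    def wait_for(self, pages):
        # Wait only for our own pages; unrelated I/O keeps flowing.
        self.in_flight = [p for p in self.in_flight if p not in pages]

    def flush_cache(self):
        self.flushes += 1     # one SYNCHRONIZE_CACHE, no queue drain


def fsync(dirty_pages, disk):
    for page in dirty_pages:          # write out this file's dirty pages
        disk.write(page)
    disk.wait_for(set(dirty_pages))   # ...and wait for *these* pages only
    disk.flush_cache()                # then a single cache flush
```

The point of the sketch is what fsync() does not do: it never touches the other in-flight I/O on the device, unlike the drain forced by the old barrier semantics.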
* Re: [RFC] relaxed barrier semantics 2010-07-30 13:34 ` Christoph Hellwig @ 2010-07-30 13:44 ` Vladislav Bolkhovitin 2010-07-30 14:20 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-30 13:44 UTC (permalink / raw) To: Christoph Hellwig Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Christoph Hellwig, on 07/30/2010 05:34 PM wrote: > On Fri, Jul 30, 2010 at 05:25:52PM +0400, Vladislav Bolkhovitin wrote: >> Sure. It was only a naive example to illustrate my points. But the FS is >> still waiting for the requests, so "draining" its "local queue"? > > Yes, just a much smaller queue in general. > > To present a typical case, fsync() on a regular file that has a few > dirty pages on it using XFS. > > We use filemap_write_and_wait to write out those few pages and wait > for it. And after that we only need to issue a SYNCHRONIZE_CACHE > and we'd be done. Right now the draining semantics of the (empty) > barrier means we also need to wait for all other I/O in the system > to finish, which is rather suboptimal. Yes, but why not take a step further and completely eliminate the waiting/draining using ORDERED requests? Current advanced storage hardware allows that. Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 13:44 ` Vladislav Bolkhovitin @ 2010-07-30 14:20 ` Christoph Hellwig 2010-07-31 0:47 ` Jan Kara 2010-08-02 19:01 ` [RFC] relaxed barrier semantics Vladislav Bolkhovitin 0 siblings, 2 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-30 14:20 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Fri, Jul 30, 2010 at 05:44:08PM +0400, Vladislav Bolkhovitin wrote: > Yes, but why not take a step further and completely eliminate > the waiting/draining using ORDERED requests? Current advanced storage > hardware allows that. There are a few cases where we could do that - the fsync without metadata changes above would be the prime example. But there's a lot lower hanging fruit until we get to the point where it's worth trying. But in most cases we don't just drain an imaginary queue but actually need to modify software state before finishing one class of I/O and submitting the next. Again, take the example of fsync, but this time we have actually extended the file and need to log an inode size update, as well as a modification to the btree blocks.
Now the fsync in XFS looks like this: 1) write out all the data blocks using WRITE 2) wait for these to finish 3) propagate any I/O error to the inode so we can pick them up 4) update the inode size in the shadow in-memory structure 5) start a transaction to log the inode size 6) flush the write cache to make sure the data really is on disk 7) write out a log buffer containing the inode and btree updates 8) if the FUA bit is not supported flush the cache again and yes, the flush in 6) is important so that we don't happen to log the inode size update before all data has made it to disk in case the cache flush in 8) is interrupted ^ permalink raw reply [flat|nested] 155+ messages in thread
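[Editorial note] The eight-step sequence above can be sketched as a toy model (plain Python, not kernel code; the Disk class and block names are invented for illustration). It shows why the flush in step 6 must complete before the log write in step 7: with a volatile write cache, the log record could otherwise become durable before the data it describes.

```python
class Disk:
    """Toy drive with a volatile write cache and FUA support (hypothetical)."""
    def __init__(self):
        self.cache = []   # blocks accepted but not yet durable
        self.media = []   # blocks durable on the platter

    def write(self, block):
        self.cache.append(block)

    def write_fua(self, block):
        self.media.append(block)   # FUA bypasses the volatile cache

    def flush(self):               # SYNCHRONIZE_CACHE: make the cache durable
        self.media.extend(self.cache)
        self.cache.clear()

def xfs_fsync(disk, data_blocks, log_record):
    # steps 1-2: write out the dirty data blocks and wait for them
    for b in data_blocks:
        disk.write(b)
    # steps 3-5 are pure software state (error propagation, shadow inode
    # update, transaction start) and involve no I/O
    # step 6: flush, so the data is durable before the size update is logged
    disk.flush()
    # steps 7-8: write the log buffer; FUA stands in for the second flush
    disk.write_fua(log_record)

disk = Disk()
xfs_fsync(disk, ["data0", "data1"], "log:isize")
# the log record may only become durable after every data block has
assert disk.media.index("log:isize") > disk.media.index("data1")
```

In this model, skipping the `disk.flush()` call would leave the data blocks in the volatile cache while the FUA log write lands on media first, which is exactly the corruption window the mail describes.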
* Re: [RFC] relaxed barrier semantics 2010-07-30 14:20 ` Christoph Hellwig @ 2010-07-31 0:47 ` Jan Kara 2010-07-31 9:12 ` Christoph Hellwig 2010-08-02 10:38 ` Vladislav Bolkhovitin 2010-08-02 19:01 ` [RFC] relaxed barrier semantics Vladislav Bolkhovitin 1 sibling, 2 replies; 155+ messages in thread From: Jan Kara @ 2010-07-31 0:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Vladislav Bolkhovitin, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Fri 30-07-10 16:20:25, Christoph Hellwig wrote: > On Fri, Jul 30, 2010 at 05:44:08PM +0400, Vladislav Bolkhovitin wrote: > > Yes, but why not take a step further and completely eliminate > > the waiting/draining using ORDERED requests? Current advanced storage > > hardware allows that. > > There are a few cases where we could do that - the fsync without metadata > changes above would be the prime example. But there's a lot lower > hanging fruit until we get to the point where it's worth trying. Umm, I don't understand you. I think that fsync in particular is an example where you have to wait and issue a cache flush if the drive has a volatile write cache. Otherwise you cannot promise the user that data will really be on disk in case of a crash. So no ordering helps you. And if you are speaking about a drive without a volatile write cache, then fsync without metadata changes is just trivial and you don't need any ordering. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-31 0:47 ` Jan Kara @ 2010-07-31 9:12 ` Christoph Hellwig 2010-08-02 13:14 ` Jan Kara 2010-08-02 10:38 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-07-31 9:12 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Vladislav Bolkhovitin, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Sat, Jul 31, 2010 at 02:47:57AM +0200, Jan Kara wrote: > > There are a few cases where we could do that - the fsync without metadata > > changes above would be the prime example. But there's a lot lower > > hanging fruit until we get to the point where it's worth trying. > Umm, I don't understand you. I think that fsync in particular is an > example where you have to wait and issue a cache flush if the drive has > a volatile write cache. Of course. What makes you believe anyone said something else? ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-31 9:12 ` Christoph Hellwig @ 2010-08-02 13:14 ` Jan Kara 0 siblings, 0 replies; 155+ messages in thread From: Jan Kara @ 2010-08-02 13:14 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Vladislav Bolkhovitin, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Sat 31-07-10 11:12:46, Christoph Hellwig wrote: > On Sat, Jul 31, 2010 at 02:47:57AM +0200, Jan Kara wrote: > > > There are a few cases where we could do that - the fsync without metadata > > > changes above would be the prime example. But there's a lot lower > > > hanging fruit until we get to the point where it's worth trying. > > Umm, I don't understand you. I think that fsync in particular is an > > example where you have to wait and issue a cache flush if the drive has > > a volatile write cache. > > Of course. What makes you believe anyone said something else? Ok, then I just misunderstood which requests you wanted to send ORDERED. Never mind, I think we agree on what needs to / can be done. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-31 0:47 ` Jan Kara 2010-07-31 9:12 ` Christoph Hellwig @ 2010-08-02 10:38 ` Vladislav Bolkhovitin 2010-08-02 12:48 ` Christoph Hellwig 1 sibling, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-08-02 10:38 UTC (permalink / raw) To: Jan Kara, Christoph Hellwig Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Jan Kara, on 07/31/2010 04:47 AM wrote: > On Fri 30-07-10 16:20:25, Christoph Hellwig wrote: >> On Fri, Jul 30, 2010 at 05:44:08PM +0400, Vladislav Bolkhovitin wrote: >>> Yes, but why not take a step further and completely eliminate >>> the waiting/draining using ORDERED requests? Current advanced storage >>> hardware allows that. >> >> There are a few cases where we could do that - the fsync without metadata >> changes above would be the prime example. But there's a lot lower >> hanging fruit until we get to the point where it's worth trying. > Umm, I don't understand you. I think that fsync in particular is an > example where you have to wait and issue a cache flush if the drive has > a volatile write cache. Otherwise you cannot promise the user that data will > really be on disk in case of a crash. So no ordering helps you. Isn't there a second wait for the journal update? > And if you are speaking about a drive without a volatile write cache, then > fsync without metadata changes is just trivial and you don't need any > ordering. A drive can reorder queued SIMPLE requests at any time, no matter whether it has a volatile write cache or not. So, if you expect in-order request execution (and with journal updates you do), you need to enforce that order either by ORDERED requests or by (local) queue draining. Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-02 10:38 ` Vladislav Bolkhovitin @ 2010-08-02 12:48 ` Christoph Hellwig 2010-08-02 19:03 ` xfs rm performance Vladislav Bolkhovitin 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-08-02 12:48 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Jan Kara, Christoph Hellwig, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Mon, Aug 02, 2010 at 02:38:18PM +0400, Vladislav Bolkhovitin wrote: > > Umm, I don't understand you. I think that fsync in particular is an > >example where you have to wait and issue a cache flush if the drive has > >a volatile write cache. Otherwise you cannot promise the user that data will > >really be on disk in case of a crash. So no ordering helps you. > > Isn't there a second wait for the journal update? Yes. > A drive can reorder queued SIMPLE requests at any time, no matter whether > it has a volatile write cache or not. I know. > So, if you expect in-order request > execution (and with journal updates you do), you need to enforce that order > either by ORDERED requests or by (local) queue draining. Yes, exactly what I say. ^ permalink raw reply [flat|nested] 155+ messages in thread
* xfs rm performance 2010-08-02 12:48 ` Christoph Hellwig @ 2010-08-02 19:03 ` Vladislav Bolkhovitin 2010-08-02 19:18 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-08-02 19:03 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke This is somewhat related to the discussion, so I think it is relevant to share some of my observations here. One of the tests I use to verify SCST performance is the io_thrash utility, which emulates DB-like access. For more details see http://lkml.org/lkml/2008/11/17/444. In particular, I'm running io_thrash with the following parameters: "2 2 ./ 500000000 50000000 10 4096 4096 300000 10 90 0 10" over a 5GB XFS iSCSI drive. The backend for this drive is a 5GB file on a 15K RPM Wide SCSI HDD. The initiator has 256MB of memory, the target 2GB. The kernel on the initiator is Ubuntu 2.6.32-22-386. In this mode io_thrash creates sparse files and fills them in a transactional DB-like manner. After it finishes it leaves 4 files: # ls -l total 1448548 -rw-r--r-- 1 root root 2048000000000 2010-08-03 01:13 _0.db -rw-r--r-- 1 root root 124596224 2010-08-03 01:13 _0.jnl -rw-r--r-- 1 root root 2048000000000 2010-08-03 01:13 _1.db -rw-r--r-- 1 root root 124592128 2010-08-03 01:13 _1.jnl -rwxr-xr-x 1 root root 24141 2008-11-19 19:29 io_thrash The problem is: # time rm _* real 4m3.769s user 0m0.000s sys 0m25.594s 4(!) minutes to delete 4 files! For comparison, ext4 does it in a few seconds. I traced what XFS is doing during that time. The initiator is sending the following pattern, a _single command at a time_: kernel: [12703.146464] [4021]: scst_cmd_init_done:286:Receiving CDB: kernel: [12703.146477] (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F kernel: [12703.146490] 0: 2a 00 00 09 cc ee 00 00 08 00 00 00 00 00 00 00 *...............
kernel: [12703.146513] [4021]: scst: scst_parse_cmd:601:op_name <WRITE(10)> (cmd d6b4a000), direction=1 (expected 1, set yes), bufflen=32768, out_bufflen=0, (expected len 32768, out expected len 0), flags=111 kernel: [12703.148201] [4112]: scst: scst_cmd_done_local:1598:cmd d6b4a000, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0 kernel: [12703.149195] [4021]: scst: scst_cmd_init_done:284:tag=112, lun=0, CDB len=16, queue_type=1 (cmd d6b4a000) kernel: [12703.149216] [4021]: scst_cmd_init_done:286:Receiving CDB: kernel: [12703.149228] (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F kernel: [12703.149242] 0: 2a 00 00 09 cc f6 00 00 08 00 00 00 00 00 00 00 *............... kernel: [12703.149266] [4021]: scst: scst_parse_cmd:601:op_name <WRITE(10)> (cmd d6b4a000), direction=1 (expected 1, set yes), bufflen=32768, out_bufflen=0, (expected len 32768, out expected len 0), flags=111 kernel: [12703.150852] [4112]: scst: scst_cmd_done_local:1598:cmd d6b4a000, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0 kernel: [12703.151887] [4021]: scst: scst_cmd_init_done:284:tag=12, lun=0, CDB len=16, queue_type=1 (cmd d6b4a000) kernel: [12703.151908] [4021]: scst_cmd_init_done:286:Receiving CDB: kernel: [12703.151920] (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F kernel: [12703.151934] 0: 2a 00 00 09 cc fe 00 00 08 00 00 00 00 00 00 00 *............... kernel: [12703.151955] [4021]: scst: scst_parse_cmd:601:op_name <WRITE(10)> (cmd d6b4a000), direction=1 (expected 1, set yes), bufflen=32768, out_bufflen=0, (expected len 32768, out expected len 0), flags=111 kernel: [12703.153622] [4112]: scst: scst_cmd_done_local:1598:cmd d6b4a000, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0 kernel: [12703.154655] [4021]: scst: scst_cmd_init_done:284:tag=15, lun=0, CDB len=16, queue_type=1 (cmd d6b4a000) "scst_cmd_init_done" marks a new incoming command, "scst_cmd_done_local" marks its completion.
Note the 1ms gap between one command finishing and the next one arriving. If XFS were sending many commands at a time, it would finish the job several (5-10) times faster. Is it possible to improve that and make XFS fully fill the device's queue during rm'ing? Thanks, Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
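[Editorial note] For rough intuition, the elapsed time scales inversely with the achievable queue depth. The numbers below are assumptions read off the trace (about 1 ms per command round trip) and the reported 4-minute rm time, not measured values:

```python
# Hypothetical back-of-the-envelope model: n commands, each with a fixed
# round-trip latency, with queue_depth commands overlapped in flight.
def elapsed_ms(n_commands, per_cmd_ms, queue_depth):
    # overlapping queue_depth commands divides the serialized time
    return n_commands * per_cmd_ms / queue_depth

serial = elapsed_ms(240_000, 1, 1)   # queue depth 1, as seen in the trace
qd8 = elapsed_ms(240_000, 1, 8)      # device queue fully filled
assert serial == 240_000.0           # ~240 s, about the observed 4 minutes
assert qd8 == 30_000.0               # ~30 s, the hoped-for 5-10x speedup
```

This matches the mail's "several (5-10) times faster" estimate: the speedup is bounded by the sustained queue depth.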
* Re: xfs rm performance 2010-08-02 19:03 ` xfs rm performance Vladislav Bolkhovitin @ 2010-08-02 19:18 ` Christoph Hellwig 2010-08-05 19:31 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-08-02 19:18 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, Jan Kara, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Mon, Aug 02, 2010 at 11:03:00PM +0400, Vladislav Bolkhovitin wrote: > I traced what XFS is doing during that time. The initiator is sending the following pattern, a _single command at a time_: That's exactly the queue draining we're talking about here. To see how the pattern gets better, use the nobarrier option. Even with that, XFS traditionally has a bad I/O pattern for metadata intensive workloads due to the amount of log I/O needed for it. Starting from Linux 2.6.35 the delayed logging code fixes this, and we hope to enable it by default after about 10 to 12 months of extensive testing. Try to re-run your test with -o delaylog,logbsize=262144 to see a better log I/O pattern. If your target doesn't present a volatile write cache, also add the nobarrier option mentioned above. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: xfs rm performance 2010-08-02 19:18 ` Christoph Hellwig @ 2010-08-05 19:31 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-08-05 19:31 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Christoph Hellwig, on 08/02/2010 11:18 PM wrote: > On Mon, Aug 02, 2010 at 11:03:00PM +0400, Vladislav Bolkhovitin wrote: >> I traced what XFS is doing during that time. The initiator is sending the following pattern, a _single command at a time_: > > That's exactly the queue draining we're talking about here. To see > how the pattern gets better, use the nobarrier option. Yes, with this option it's almost 2 times faster and I see a small queue depth (1-3 entries on average, max 8), but the performance is still bad: # time rm _* real 3m31.385s user 0m0.004s sys 0m26.674s > Even with that, XFS traditionally has a bad I/O pattern for metadata > intensive workloads due to the amount of log I/O needed for it. > Starting from Linux 2.6.35 the delayed logging code fixes this, and > we hope to enable it by default after about 10 to 12 months of > extensive testing. > > Try to re-run your test with > > -o delaylog,logbsize=262144 > > to see a better log I/O pattern. If your target doesn't present a volatile > write cache, also add the nobarrier option mentioned above. Unfortunately, at the moment I can't run 2.6.35 on that machine, but I will try as soon as I can. Thanks, Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-30 14:20 ` Christoph Hellwig 2010-07-31 0:47 ` Jan Kara @ 2010-08-02 19:01 ` Vladislav Bolkhovitin 2010-08-02 19:26 ` Christoph Hellwig 1 sibling, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-08-02 19:01 UTC (permalink / raw) To: Christoph Hellwig Cc: Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke Christoph Hellwig, on 07/30/2010 06:20 PM wrote: > On Fri, Jul 30, 2010 at 05:44:08PM +0400, Vladislav Bolkhovitin wrote: >> Yes, but why not take a step further and completely eliminate >> the waiting/draining using ORDERED requests? Current advanced storage >> hardware allows that. > > There are a few cases where we could do that - the fsync without metadata > changes above would be the prime example. But there's a lot lower > hanging fruit until we get to the point where it's worth trying. Yes, but since an interface and file system update is coming anyway, why not design the interface now and then gradually fill it with implementation? Barrier discussions are always very heated. That definitely means the current approach doesn't satisfy many people, from FS developers to storage vendors and users. I believe this is because the whole barrier ideology is not natural, hence there is too much trouble fitting it to real life. Apparently, this approach needs some redesign to get into a more acceptable form. IMHO, all that is needed is: 1. Allow requests to be optionally combined into groups, and let groups have optional properties: caching and ordering modes (see below). Each group would reflect a higher level operation. 2. Allow request groups to be chained. Each chain would reflect an order dependency between groups, i.e. between higher level operations. This interface is a natural extension of the current interface. Natural for storage too.
In the extreme, when a group is empty, it could be implemented as a barrier, although, since there would be no dependencies between unchained groups, those would be freely reordered with respect to each other. We would need request grouping sooner or later anyway, because otherwise it is impossible to implement selective cache flushing instead of flushing the cache for the whole device as we do currently. This is a highly demanded feature, especially for shared and distributed devices. The caching properties would be: - None (default) - no cache flushing needed. - "Flush after each request". It would be translated to FUA on write back devices with FUA, a (write, sync_cache) sequence on write back devices without FUA, and to nothing on write through devices. - "Flush at once after all finished". It would be translated to one or more SYNC_CACHE commands, executed after all requests are done and syncing _only_ what was modified in the group, not the whole device as now. The order properties would be: - None (default) - there is no order dependency between requests in the group. - ORDERED - all requests in the group must be executed in order. Additionally, if the backend device supported ORDERED commands, this facility would be used to eliminate extra queue draining. For instance, "flush after each request" on WB devices without FUA would be a sequence of ORDERED commands: [(write, sync_cache) ... (write, sync_cache) wait]. Compare to [(write, wait, sync_cache, wait) ... (write, wait, sync_cache, wait)] needed to achieve the same without ORDERED command support. For instance, your example of the fsync in XFS would be: 1) Write out all the data blocks as a group with no caching and ordering properties.
2) Wait for that group to finish 3) Propagate any I/O error to the inode so we can pick them up 4) Update the inode size in the shadow in-memory structure 5) Start a transaction to log the inode size in a new group with the property "Flush at once after all finished" and no ordering (or, if necessary, ORDERED; it isn't clear from your text). 6) Write out a log buffer containing the inode and btree updates in a new group chained after the group from (5), with the necessary cache flushing and ordering properties. I believe it can be implemented acceptably simply and effectively, including at the I/O scheduler level, and I have some ideas for that. Just my 5c from the storage vendor's side. > But in most cases we don't just drain an imaginary queue but actually > need to modify software state before finishing one class of I/O and > submitting the next. > > Again, take the example of fsync, but this time we have actually > extended the file and need to log an inode size update, as well > as a modification to the btree blocks. > > Now the fsync in XFS looks like this: > > 1) write out all the data blocks using WRITE > 2) wait for these to finish > 3) propagate any I/O error to the inode so we can pick them up > 4) update the inode size in the shadow in-memory structure > 5) start a transaction to log the inode size > 6) flush the write cache to make sure the data really is on disk Here there should be a "6.1) wait for it to finish", which can be eliminated if requests are sent ordered, correct? > 7) write out a log buffer containing the inode and btree updates > 8) if the FUA bit is not supported flush the cache again > > and yes, the flush in 6) is important so that we don't happen > to log the inode size update before all data has made it to disk > in case the cache flush in 8) is interrupted ^ permalink raw reply [flat|nested] 155+ messages in thread
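[Editorial note] The request-group proposal above could be expressed roughly like this (a hypothetical sketch in plain Python, mirroring the mail's wording; none of these names exist in the kernel). The two groups at the end model the fsync example, with the chain expressing the order dependency between data and log:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Cache(Enum):
    NONE = auto()             # no cache flushing needed (default)
    FLUSH_EACH = auto()       # "flush after each request": FUA or write+sync
    FLUSH_AFTER_ALL = auto()  # "flush at once after all finished"

class Order(Enum):
    NONE = auto()             # requests in the group may be reordered
    ORDERED = auto()          # requests must execute in order

@dataclass
class RequestGroup:
    requests: list
    cache: Cache = Cache.NONE
    order: Order = Order.NONE
    after: "RequestGroup | None" = None  # chaining: runs after that group

# The fsync example from the mail: data blocks first, then the log
# group chained behind them with selective cache flushing.
data = RequestGroup(["data0", "data1"])
log = RequestGroup(["inode", "btree"], cache=Cache.FLUSH_AFTER_ALL,
                   order=Order.ORDERED, after=data)
```

The point of the sketch is that the flush scope (`FLUSH_AFTER_ALL` over only the group's blocks) and the ordering scope (the `after` chain) become explicit properties the block layer could map to FUA, SYNCHRONIZE_CACHE, or ORDERED tags per device.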
* Re: [RFC] relaxed barrier semantics 2010-08-02 19:01 ` [RFC] relaxed barrier semantics Vladislav Bolkhovitin @ 2010-08-02 19:26 ` Christoph Hellwig 0 siblings, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-08-02 19:26 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, Ted Ts'o, Andreas Dilger, Ric Wheeler, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Mon, Aug 02, 2010 at 11:01:53PM +0400, Vladislav Bolkhovitin wrote: > IMHO, all is needed are: What we need first is a simple interface that a) guarantees data integrity b) doesn't cause massive slowdowns and then we can optimize it later. What we absolutely don't need is a large number of different interfaces that no one understands and that all are buggy in some way. > >Now the fsync in XFS looks like this: > > > >1) write out all the data blocks using WRITE > >2) wait for these to finish > >3) propagate any I/O error to the inode so we can pick them up > >4) update the inode size in the shadow in-memory structure > >5) start a transaction to log the inode size > >6) flush the write cache to make sure the data really is on disk > > Here should be "6.1) wait for it to finish" yes > which can be eliminated if > requests sent ordered, correct? not really - if the cache flush returns we shouldn't even send the log update. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-29 23:04 ` Ted Ts'o ` (5 preceding siblings ...) 2010-07-30 12:56 ` Vladislav Bolkhovitin @ 2010-07-30 12:56 ` Vladislav Bolkhovitin 6 siblings, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-30 12:56 UTC (permalink / raw) To: Ted Ts'o, Andreas Dilger, Ric Wheeler, Christoph Hellwig, Tejun Heo, Vivek Goyal Ted Ts'o, on 07/30/2010 03:04 AM wrote: > On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote: >> Like James wrote, this is basically everything FUA. It is OK for >> ordered mode to allow the device to aggregate the normal filesystem >> and journal IO, but when the commit block is written it should flush >> all of the previously written data to disk. This still allows >> request re-ordering and merging inside the device, but orders the >> data vs. the commit block. Having the proposed "flush ranges" >> interface to the disk would be ideal, since there would be no wasted >> time flushing data that does not need it (i.e. other partitions). > > My understanding is that "everything FUA" can be a performance > disaster. That's because it bypasses the track buffer, and things get > written directly to disk. So there is no possibility to reorder > buffers so that they get written in one disk rotation. Depending on > the disk, it might even be that if you send N sequential sectors all > tagged with FUA, it could be slower than sending the N sectors > followed by a cache flush or SYNCHRONIZE_CACHE command. It should be, because caching gives the drive an opportunity to better load its internal resources and provide data transfer pipelining. Although, of course, it's possible to imagine a stupid drive with nearly broken caching which would work faster in write through mode. I used the word "drive", not "disk", above, because I think this discussion is not only about disks. Storage might be not only disks, but also external arrays and even clusters of arrays.
They all look to the system like single "disks", but are much more advanced and sophisticated in their internal capabilities than dumb (S)ATA disks. Now such arrays and clusters are getting more and more commonly used. Anybody can build such an array using just a Linux box with any OSS SCSI target software and use it with a variety of interfaces: iSCSI, Fibre Channel, SAS, InfiniBand and even familiar parallel SCSI (Funny, 2 Linux boxes connected by Wide SCSI :) ). So why limit the discussion to low end disks only? I believe it would be more productive if we first determine the set of capabilities which should be used for the best performance and which advanced storage devices can provide, and then go down to the lower end, eliminating the use of the advanced features at the cost of performance. Otherwise, ignoring the "hardware offload" which advanced devices provide, we would never achieve the best performance they could give. I'd start the analysis of the best-performance facilities with the following: 1. Full set of SCSI queuing and task management control facilities. Namely: - SIMPLE, ORDERED, ACA and, maybe, HEAD OF QUEUE command attributes - Never draining the queue to wait for completion of one or more commands, except in some rare error recovery cases. - ACA and UA_INTRCK for protecting the queue order in case one or more commands in it finish abnormally. - Use of write back caching by default, switching to write through only for "blacklisted" drives. - FUA for sequences of a few write commands, where either a SYNCHRONIZE_CACHE command is overkill, or there is an internal order dependency between the commands, so they must be written to the media in exactly the required order. So, for instance, a naive sequence of meta-data updates with the corresponding journal writes would be a chain of commands: 1. 1st journal write command (SIMPLE) 2. 2nd journal write command (SIMPLE) 3. 3rd journal write command (SIMPLE) 4.
SYNCHRONIZE_CACHE for blocks written by those 3 commands (ORDERED) 5. Necessary amount of meta-data update commands (all SIMPLE) 6. SYNCHRONIZE_CACHE for blocks written in 5 (ORDERED) 7. Command marking the transaction committed in the journal (ORDERED) That's all. No queue draining anywhere. Plus, sending commands without internal order requirements as SIMPLE would allow the drive to better schedule their execution among the internal storage (actual disks). For an error recovery case, consider command (4) finishing abnormally because of some external event, like a Unit Attention. Then the drive would establish an ACA condition and suspend the command queue with the commands from (5) at the head. The system would then retry the failed command with the ACA attribute and, when it finished, clear the ACA condition. The drive would then resume the queue, and the commands at the head ((5)) would start being processed. For a simpler device (a disk without support for ORDERED queuing) the same meta-data updates would be: 1. 1st journal write command 2. 2nd journal write command 3. 3rd journal write command 4. The queue draining. 5. SYNCHRONIZE_CACHE 6. The queue draining. 7. Necessary amount of meta-data update commands 8. The queue draining. 9. SYNCHRONIZE_CACHE for blocks written in 7 10. The queue draining. 11. Command marking the transaction committed in the journal Then we would need to figure out an interface for file systems to let them specify the necessary ordering and cache flushing requirements in a generic way. The current interface looks almost good, but: 1. In it, the semantics of "barrier" are quite overloaded, hence confusing and hard to implement. 2. It doesn't allow binding several requests into an ordered chain. Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
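[Editorial note] Counting the synchronization points in the two command chains above makes the difference concrete (a sketch; the sequences are transcribed from the lists in the mail):

```python
# Each "drain" is a full host-side stall waiting for the whole queue to
# empty; the ORDERED chain pushes the ordering into the drive and never
# stalls between submissions.
def stalls(sequence):
    return sequence.count("drain")

ordered_chain = ["jnl1", "jnl2", "jnl3", "sync_cache(ORDERED)",
                 "metadata", "sync_cache(ORDERED)", "commit(ORDERED)"]
simple_chain = ["jnl1", "jnl2", "jnl3", "drain", "sync_cache", "drain",
                "metadata", "drain", "sync_cache", "drain", "commit"]
assert stalls(ordered_chain) == 0
assert stalls(simple_chain) == 4
```

The four drains are the cost being debated in the thread: each one serializes the whole device queue, not just the commands belonging to this transaction.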
* Re: [RFC] relaxed barrier semantics 2010-07-29 19:44 ` [RFC] relaxed barrier semantics Ric Wheeler 2010-07-29 19:49 ` Christoph Hellwig @ 2010-07-31 0:35 ` Jan Kara 1 sibling, 0 replies; 155+ messages in thread From: Jan Kara @ 2010-07-31 0:35 UTC (permalink / raw) To: Ric Wheeler Cc: Ted Ts'o, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, chris.mason, swhiteho, konishi.ryusuke On Thu 29-07-10 15:44:31, Ric Wheeler wrote: > On 07/28/2010 09:44 PM, Ted Ts'o wrote: > >On Wed, Jul 28, 2010 at 11:28:59AM +0200, Christoph Hellwig wrote: > >>If we move all filesystems to non-draining barriers with pre- and post- > >>flushes that might actually be a relatively easy first step. We don't > >>have the complications to deal with multiple types of barriers to > >>start with, and it'll fix the issue for devices without volatile write > >>caches completely. > >> > >>I just need some help from the filesystem folks to determine if they > >>are safe with them. > >> > >>I know for sure that ext3 and xfs are from looking through them. And > >>I know reiserfs is if we make sure it doesn't hit the code path that > >>relies on it that is currently enabled by the barrier option. > >> > >>I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks. > >>That already ends our small list of barrier supporting filesystems, and > >>possibly ocfs2, too - although the barrier implementation there seems > >>incomplete as it doesn't seem to flush caches in fsync. > >Define "are safe" --- what interface we planning on using for the > >non-draining barrier? At least for ext3, when we write the commit > >record using set_buffer_ordered(bh), it assumes that this will do a > >flush of all previous writes and that the commit will hit the disk > >before any subsequent writes are sent to the disk. So turning the > >write of a buffer head marked with set_buffered_ordered() into a FUA > >write would _not_ be safe for ext3. 
> > I confess that I am a bit fuzzy on FUA, but think that it means that > any FUA tagged IO will go down to persistent store before returning. > > If so, then all order dependent IO would need to be issued in order > and tagged with FUA. It would not suffice to tag just the commit > record as FUA, or do I misunderstand what FUA does? Ric, I think you misunderstood it a bit. I think the proposal for ext3 was to write ordered data + metadata to the journal except for the transaction commit block, then issue SYNCHRONIZE_CACHE, and then write the transaction commit block either with the FUA bit set, or without it and with a SYNCHRONIZE_CACHE call after it as well. The difference from the current behavior would be that we save the queue draining we do these days... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 155+ messages in thread
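Jan's proposed sequence can be sketched as a toy model (plain Python with hypothetical helper names, not the actual jbd code): the only ordering points are cache flushes, with no queue draining anywhere.

```python
def ext3_commit(journal_blocks, commit_block, have_fua):
    """Sketch of the proposed ext3 commit: write the journal, flush the
    cache (no queue draining), then make the commit record durable with
    either a FUA write or a plain write plus a second flush."""
    ops = [("WRITE", blk) for blk in journal_blocks]   # ordered data + metadata
    ops.append(("SYNCHRONIZE_CACHE", None))            # pre-flush: journal is durable
    if have_fua:
        ops.append(("WRITE_FUA", commit_block))        # commit bypasses the cache
    else:
        ops.append(("WRITE", commit_block))
        ops.append(("SYNCHRONIZE_CACHE", None))        # post-flush makes it durable
    return ops

# Both variants end with a durable commit record; neither drains the queue.
assert ext3_commit(["J1", "J2"], "C", True)[-1] == ("WRITE_FUA", "C")
assert ext3_commit(["J1", "J2"], "C", False)[-1] == ("SYNCHRONIZE_CACHE", None)
```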
* Re: [RFC] relaxed barrier semantics 2010-07-29 1:44 ` Ted Ts'o ` (3 preceding siblings ...) 2010-07-29 19:44 ` [RFC] relaxed barrier semantics Ric Wheeler @ 2010-07-29 19:44 ` Ric Wheeler 4 siblings, 0 replies; 155+ messages in thread From: Ric Wheeler @ 2010-07-29 19:44 UTC (permalink / raw) To: Ted Ts'o, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley On 07/28/2010 09:44 PM, Ted Ts'o wrote: > On Wed, Jul 28, 2010 at 11:28:59AM +0200, Christoph Hellwig wrote: > >> If we move all filesystems to non-draining barriers with pre- and post- >> flushes that might actually be a relatively easy first step. We don't >> have the complications to deal with multiple types of barriers to >> start with, and it'll fix the issue for devices without volatile write >> caches completely. >> >> I just need some help from the filesystem folks to determine if they >> are safe with them. >> >> I know for sure that ext3 and xfs are from looking through them. And >> I know reiserfs is if we make sure it doesn't hit the code path that >> relies on it that is currently enabled by the barrier option. >> >> I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks. >> That already ends our small list of barrier supporting filesystems, and >> possibly ocfs2, too - although the barrier implementation there seems >> incomplete as it doesn't seem to flush caches in fsync. >> > Define "are safe" --- what interface we planning on using for the > non-draining barrier? At least for ext3, when we write the commit > record using set_buffer_ordered(bh), it assumes that this will do a > flush of all previous writes and that the commit will hit the disk > before any subsequent writes are sent to the disk. So turning the > write of a buffer head marked with set_buffered_ordered() into a FUA > write would _not_ be safe for ext3. > I confess that I am a bit fuzzy on FUA, but think that it means that any FUA tagged IO will go down to persistent store before returning. 
If so, then all order dependent IO would need to be issued in order and tagged with FUA. It would not suffice to tag just the commit record as FUA, or do I misunderstand what FUA does? (Looking for a record in the how many times can I use FUA in an email). ric > For ext4, if we don't use journal checksums, then we have the same > requirements as ext3, and the same method of requesting it. If we do > use journal checksums, what ext4 needs is a way of assuring that no > writes after the commit are reordered with respect to the disk platter > before the commit record --- but any of the writes before that, > including the commit, can be reordered because we rely on the checksum > in the commit record to know at replay time whether the last commit is > valid or not. We do that right now by calling blkdev_issue_flush() > with BLKDEV_IFL_WAIT after submitting the write of the commit block. > > - Ted > > ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:28 ` Christoph Hellwig ` (3 preceding siblings ...) 2010-07-29 1:44 ` Ted Ts'o @ 2010-08-02 16:47 ` Ryusuke Konishi 2010-08-02 17:39 ` Chris Mason 5 siblings, 0 replies; 155+ messages in thread From: Ryusuke Konishi @ 2010-08-02 16:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke On Wed, 28 Jul 2010 11:28:59 +0200, Christoph Hellwig <hch@lst.de> wrote: > On Wed, Jul 28, 2010 at 11:17:06AM +0200, Tejun Heo wrote: > > I'll re-read barrier code and see how hard it would be to implement a > > proper solution. > > If we move all filesystems to non-draining barriers with pre- and post- > flushes that might actually be a relatively easy first step. We don't > have the complications to deal with multiple types of barriers to > start with, and it'll fix the issue for devices without volatile write > caches completely. > > I just need some help from the filesystem folks to determine if they > are safe with them. > > I know for sure that ext3 and xfs are from looking through them. And > I know reiserfs is if we make sure it doesn't hit the code path that > relies on it that is currently enabled by the barrier option. > > I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks. With regard to nilfs, the barrier is applied to writeback of the super block, since it stores the position of the most recent log, and this log needs to be written to the platter prior to the super block. So I think a pre-flush + a FUA write can be used instead of draining for the barrier use in nilfs. Thanks, Ryusuke Konishi ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:28 ` Christoph Hellwig ` (4 preceding siblings ...) 2010-08-02 16:47 ` Ryusuke Konishi @ 2010-08-02 17:39 ` Chris Mason 2010-08-05 13:11 ` Vladislav Bolkhovitin 2010-08-05 13:11 ` Vladislav Bolkhovitin 5 siblings, 2 replies; 155+ messages in thread From: Chris Mason @ 2010-08-02 17:39 UTC (permalink / raw) To: Christoph Hellwig Cc: Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke On Wed, Jul 28, 2010 at 11:28:59AM +0200, Christoph Hellwig wrote: > On Wed, Jul 28, 2010 at 11:17:06AM +0200, Tejun Heo wrote: > > Well, if disabling barrier works around the problem for them (which is > > basically what was suggeseted in the first message), that's not too > > bad for short term, I think. > > It's a pretty horrible workaround. Requiring manual mount options to > get performance out of a setup which could trivially work out of the > box is a bad workaround. > > > I'll re-read barrier code and see how hard it would be to implement a > > proper solution. > > If we move all filesystems to non-draining barriers with pre- and post- > flushes that might actually be a relatively easy first step. We don't > have the complications to deal with multiple types of barriers to > start with, and it'll fix the issue for devices without volatile write > caches completely. > > I just need some help from the filesystem folks to determine if they > are safe with them. > > I know for sure that ext3 and xfs are from looking through them. And > I know reiserfs is if we make sure it doesn't hit the code path that > relies on it that is currently enabled by the barrier option. > > I'll just need more feedback from ext4, gfs2, btrfs and nilfs folks. > That already ends our small list of barrier supporting filesystems, and > possibly ocfs2, too - although the barrier implementation there seems > incomplete as it doesn't seem to flush caches in fsync. 
Btrfs is going to be similar to xfs, except that because of COW we have to always pretend someone is extending the file (or filling a hole). The short answer is that a preflush of the disk cache, followed by FUA for commits, is fine. Btrfs explicitly waits for all the bios it sends down without trusting other layers for silent ordering.

The long answer is that the btrfs commit is basically:

wait for bio completion of a bunch of different things
write new super block pointing to new tree roots with barrier

Everything we waited for must be fully on disk before the new super block, and the new super must be fully on disk after we wait for the bh.

I regret putting the ordering into the original barrier code...it definitely did help reiserfs back in the day, but it stinks of magic and voodoo. When it goes wrong, we'll only notice .000000001% of the time, and even then it'll only be when people report some random corruption which we'll blindly blame on either axboe or the drive.

-chris ^ permalink raw reply [flat|nested] 155+ messages in thread
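A rough sketch of that commit shape (plain Python with hypothetical names, not the real btrfs code), where the filesystem does all the waiting itself rather than trusting lower layers to order anything:

```python
def btrfs_style_commit(pending, submit_super):
    """Toy model: wait explicitly for every bio we issued ourselves, then
    write the new super behind a preflush + FUA. The only invariant is
    'everything we waited for completes before the super is written'."""
    for complete in pending:      # wait for bio completion of each thing
        complete()
    submit_super(preflush=True, fua=True)

log = []
btrfs_style_commit(
    [lambda: log.append("extent"), lambda: log.append("checksum")],
    lambda preflush, fua: log.append(("super", preflush, fua)),
)
# The super write always comes last, after every waited-on completion.
assert log == ["extent", "checksum", ("super", True, True)]
```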
* Re: [RFC] relaxed barrier semantics 2010-08-02 17:39 ` Chris Mason @ 2010-08-05 13:11 ` Vladislav Bolkhovitin 2010-08-05 13:32 ` Chris Mason 2010-08-05 17:09 ` Christoph Hellwig 2010-08-05 13:11 ` Vladislav Bolkhovitin 1 sibling, 2 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-08-05 13:11 UTC (permalink / raw) To: Chris Mason, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.B Chris Mason, on 08/02/2010 09:39 PM wrote: > I regret putting the ordering into the original barrier code...it > definitely did help reiserfs back in the day but it stinks of magic and > voodoo. But if the ordering isn't in the common (block) code, how do we implement the "hardware offload" for ordering, i.e. ORDERED commands, in an acceptable way? I believe the decision was right, but the flags- and magic-requests-based interface (and, hence, implementation) was wrong. That's what stinks of magic and voodoo. Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-05 13:11 ` Vladislav Bolkhovitin @ 2010-08-05 13:32 ` Chris Mason 2010-08-05 14:52 ` Hannes Reinecke ` (3 more replies) 2010-08-05 17:09 ` Christoph Hellwig 1 sibling, 4 replies; 155+ messages in thread From: Chris Mason @ 2010-08-05 13:32 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote: > Chris Mason, on 08/02/2010 09:39 PM wrote: > >I regret putting the ordering into the original barrier code...it > >definitely did help reiserfs back in the day but it stinks of magic and > >voodoo. > > But if the ordering isn't in the common (block) code, how to > implement the "hardware offload" for ordering, i.e. ORDERED > commands, in an acceptable way? > > I believe, the decision was right, but the flags and magic requests > based interface (and, hence, implementation) was wrong. That's it > which stinks of magic and voodoo. The interface definitely has flaws. We didn't expand it because James popped up with a long list of error handling problems. Basically how do the hardware and the kernel deal with a failed request at the start of the chain. Somehow the easy way of failing them all turned out to be extremely difficult. Even if that part had been refined, I think trusting the ordering down to the lower layers was a doomed idea. The list of ways it could go wrong is much much longer (and harder to debug) than the list of benefits. With all of that said, I did go ahead and benchmark real ordered tags extensively on a scsi drive in the initial implementation. There was very little performance difference. -chris ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-05 13:32 ` Chris Mason @ 2010-08-05 14:52 ` Hannes Reinecke 2010-08-05 14:52 ` Hannes Reinecke ` (2 subsequent siblings) 3 siblings, 0 replies; 155+ messages in thread From: Hannes Reinecke @ 2010-08-05 14:52 UTC (permalink / raw) To: Chris Mason, Vladislav Bolkhovitin, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara Chris Mason wrote: > On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote: >> Chris Mason, on 08/02/2010 09:39 PM wrote: >>> I regret putting the ordering into the original barrier code...it >>> definitely did help reiserfs back in the day but it stinks of magic and >>> voodoo. >> But if the ordering isn't in the common (block) code, how to >> implement the "hardware offload" for ordering, i.e. ORDERED >> commands, in an acceptable way? >> >> I believe, the decision was right, but the flags and magic requests >> based interface (and, hence, implementation) was wrong. That's it >> which stinks of magic and voodoo. > > The interface definitely has flaws. We didn't expand it because James > popped up with a long list of error handling problems. Basically how > do the hardware and the kernel deal with a failed request at the start > of the chain. Somehow the easy way of failing them all turned out to be > extremely difficult. > > Even if that part had been refined, I think trusting the ordering down > to the lower layers was a doomed idea. The list of ways it could go > wrong is much much longer (and harder to debug) than the list of > benefits. > > With all of that said, I did go ahead and benchmark real ordered tags > extensively on a scsi drive in the initial implementation. There was > very little performance difference. > Care to dig it up? I'd wanted to give it a try, and if someone already did some work in that area it'll make things easier here. I still think that implementing ordered tags is the correct way of doing things, implementation details notwithstanding. 
It looks better conceptually than using FUA, and would be easier from the request-queue side of things. (Of course, as the entire logic is pushed down to the SCSI layer :-) Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-05 14:52 ` Hannes Reinecke @ 2010-08-05 15:17 ` Chris Mason 2010-08-05 17:07 ` Christoph Hellwig 1 sibling, 0 replies; 155+ messages in thread From: Chris Mason @ 2010-08-05 15:17 UTC (permalink / raw) To: Hannes Reinecke Cc: Vladislav Bolkhovitin, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke On Thu, Aug 05, 2010 at 04:52:15PM +0200, Hannes Reinecke wrote: > Chris Mason wrote: > > On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote: > >> Chris Mason, on 08/02/2010 09:39 PM wrote: > >>> I regret putting the ordering into the original barrier code...it > >>> definitely did help reiserfs back in the day but it stinks of magic and > >>> voodoo. > >> But if the ordering isn't in the common (block) code, how to > >> implement the "hardware offload" for ordering, i.e. ORDERED > >> commands, in an acceptable way? > >> > >> I believe, the decision was right, but the flags and magic requests > >> based interface (and, hence, implementation) was wrong. That's it > >> which stinks of magic and voodoo. > > > > The interface definitely has flaws. We didn't expand it because James > > popped up with a long list of error handling problems. Basically how > > do the hardware and the kernel deal with a failed request at the start > > of the chain. Somehow the easy way of failing them all turned out to be > > extremely difficult. > > > > Even if that part had been refined, I think trusting the ordering down > > to the lower layers was a doomed idea. The list of ways it could go > > wrong is much much longer (and harder to debug) than the list of > > benefits. > > > > With all of that said, I did go ahead and benchmark real ordered tags > > extensively on a scsi drive in the initial implementation. There was > > very little performance difference. > > > Care to dig it up? 
> I'd wanted to give it a try, and if someone already did some work in > that area it'll make things easier here. > > I still think that implementing ordered tags is the correct way of > doing things, implementation details notwithstanding. > > It looks better conceptually than using FUA, and would be easier > from the request-queue side of things. > (Or course, as the entire logic is pushed down to the SCSI layer :-) You see, I'm torn between the dread of giving scsi such great responsibility and the joy of sending a link for a bitkeeper patch series from 2.4.x. http://lwn.net/2002/0214/a/queue-barrier.php3 Have a lot of fun ;) -chris ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-05 14:52 ` Hannes Reinecke 2010-08-05 15:17 ` Chris Mason @ 2010-08-05 17:07 ` Christoph Hellwig 1 sibling, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-08-05 17:07 UTC (permalink / raw) To: Hannes Reinecke Cc: Chris Mason, Vladislav Bolkhovitin, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke On Thu, Aug 05, 2010 at 04:52:15PM +0200, Hannes Reinecke wrote: > I still think that implementing ordered tags is the correct way of > doing things, implementation details notwithstanding. > > It looks better conceptually than using FUA, and would be easier > from the request-queue side of things. Sorry, but ordered tags are in no way a replacement for the FUA bit. Admittedly the current barrier semantics are confusing because they mix up two only minimally related things:

a) cache flushing
b) ordering

a) is what we really need from the filesystem's point of view. b) is something all our filesystems can do themselves. We could use ordered tags to offload it, and I'd be happy if someone could prove that we're getting speedups from it, but it certainly does not replace a). With enough outstanding tags, be that using ordered tags or software-managed ordering, we could keep the disk busy enough that we don't need the write cache, but again that'll need a lot of benchmarking. ^ permalink raw reply [flat|nested] 155+ messages in thread
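The distinction between a) and b) can be made concrete with a toy Python model of a disk with a volatile write cache (an illustration, not driver code): ordering alone keeps writes in sequence, but only a flush makes them durable.

```python
class VolatileCacheDisk:
    """Toy disk: completed writes sit in a volatile cache until a flush
    moves them to media; a power cut discards the cache. This is why
    ordering (b) is not a substitute for cache flushing (a)."""
    def __init__(self):
        self.cache, self.media = [], []
    def write(self, blk):
        self.cache.append(blk)       # "completes" into the volatile cache
    def flush(self):
        self.media += self.cache     # SYNCHRONIZE_CACHE: cache -> platter
        self.cache = []
    def power_cut(self):
        self.cache = []              # volatile cache contents are lost

d = VolatileCacheDisk()
d.write("journal"); d.write("commit")   # perfectly ordered...
d.power_cut()
assert d.media == []                    # ...but nothing survived

d = VolatileCacheDisk()
d.write("journal"); d.flush(); d.write("commit"); d.flush()
d.power_cut()
assert d.media == ["journal", "commit"]  # flushes made the writes durable
```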
* Re: [RFC] relaxed barrier semantics 2010-08-05 13:32 ` Chris Mason 2010-08-05 14:52 ` Hannes Reinecke 2010-08-05 14:52 ` Hannes Reinecke @ 2010-08-05 19:48 ` Vladislav Bolkhovitin 2010-08-05 19:48 ` Vladislav Bolkhovitin 3 siblings, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-08-05 19:48 UTC (permalink / raw) To: Chris Mason, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.B Chris Mason, on 08/05/2010 05:32 PM wrote: > On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote: >> Chris Mason, on 08/02/2010 09:39 PM wrote: >>> I regret putting the ordering into the original barrier code...it >>> definitely did help reiserfs back in the day but it stinks of magic and >>> voodoo. >> >> But if the ordering isn't in the common (block) code, how to >> implement the "hardware offload" for ordering, i.e. ORDERED >> commands, in an acceptable way? >> >> I believe, the decision was right, but the flags and magic requests >> based interface (and, hence, implementation) was wrong. That's it >> which stinks of magic and voodoo. > > The interface definitely has flaws. We didn't expand it because James > popped up with a long list of error handling problems. Could you point on the corresponding message, please? I can't find it in my archive. > Basically how > do the hardware and the kernel deal with a failed request at the start > of the chain. Somehow the easy way of failing them all turned out to be > extremely difficult. Have you considered to not fail them all, but using ACA SCSI facility just suspend the queue, then requeue the failed request, then restart processing? I might be missing something, but using this approach the failed requests recovery should look quite simple and, most important, compact, hence easily audited. Something like below. Sorry, since it's a low level recovery, it requires some deep SCSI knowledge to follow. We need: 1. 
A low level driver without internal queue and masking returned status and sense. At first look, many of the existing drivers more or less satisfy this requirement, including drivers in my direct interest: qla2xxx, iscsi and ib_srp. 2. A device with support of ORDERED commands as well as ACA and UA_INTLCK facilities in QERR mode 0. Assume we have N ORDERED requests queued to a device and one of them failed. Then submitting new requests to the device would be suspended and recovery thread woken up. Let's we have a list of queued to the device requests in order as they queued. Then the recovery thread would need to deal with the following cases: 1. The failed command failed with CHECK_CONDITION and from the head of the queue. (The device now established ACA and suspended its internal queue.) Then the command should be sent to the device as ACA task and, after it's finished, ACA should be cleared. (The device now would restart its queue.) Then submitting new requests to the device would also be resumed. 2. The failed command failed with CHECK_CONDITION and isn't from the head of the queue. 2.1. The failed command in the last in the queue. ACA should be cleared and the failed command should simply be restarted. Then submitting new requests to the device would also be resumed. 2.2. The failed command isn't last in the queue. Then the recovery thread would send ACA command TEST UNIT READY to be sure all in-flight commands reached the device. Then it would abort all the commands after the failed one using ABORT TASK Task Management function. Then ACA should be cleared and the failed command as well as all the aborted commands would be resend to the device. Then submitting new requests to the device would also be resumed. 3. The failed command failed with other status than CHECK_CONDITION and from the head of the queue. 3.1. The failed command is the only queued command. 
Then a TEST UNIT READY command should be sent to the device to get the post-UA_INTLCK CHECK CONDITION and trigger ACA. Then ACA should be cleared and the failed command restarted. Then submitting new requests to the device would also be resumed.

3.2. There are other queued commands. Then the recovery thread should remember the failed command and exit. The next command would get the post-UA_INTLCK CHECK CONDITION and trigger ACA. Then recovery would proceed as in (1), except that the 2 failed commands would be restarted as ACA commands before clearing ACA.

4. The failed command isn't at the head of the queue and failed with a status other than CHECK_CONDITION. It might happen in case of a TASK SET FULL condition. This case would proceed similarly to cases (3.x), then (2.2).

That's all. Simple, compact and clear for auditing.

> Even if that part had been refined, I think trusting the ordering down > to the lower layers was a doomed idea. The list of ways it could go > wrong is much much longer (and harder to debug) than the list of > benefits.

It's hard to debug because it's currently an overloaded-flags nightmare. It isn't the idea of trusting lower levels that is doomed; everybody trusts lower levels everywhere in the kernel. What is doomed is the idea of providing the requested functionality via a set of flags and artificial barrier requests with obscure side effects. Linux just needs a clear and _natural_ interface for that, like the one I proposed in http://marc.info/?l=linux-scsi&m=128077574815881&w=2. Yes, I am proposing to slowly start thinking about moving to a new interface and implementation, out of the current hell. It's obvious that what Linux has now in this area is a dead end. The new flag Christoph is going to add makes it even worse.

> With all of that said, I did go ahead and benchmark real ordered tags > extensively on a scsi drive in the initial implementation. There was > very little performance difference.

It's no surprise that you didn't see much difference with a local (Wide?) SCSI drive. Such drives sit on a low latency link, are simple enough to have small internal latencies, and are dumb enough not to gain much from internal reordering. But how about external arrays? Or even clusters? Nowadays everybody can build such arrays and clusters from any Linux (or other *nix) box using any OSS SCSI target implementation, starting with SCST, which I have been developing. Such array/cluster devices use links with an order of magnitude higher latency, and they are very sophisticated inside, so they have much bigger internal latencies as well as much bigger opportunities to optimize the I/O pattern by internal reordering. All the record numbers I've seen so far were reached with a deep queue. For instance, the last SCST record (>500K 4K IOPS from a single target) was achieved with queue depth 128! So, I believe, Linux must use that possibility to get full storage performance and to finally simplify its storage stack. Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
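For illustration, the recovery cases Vlad enumerates can be encoded as a small decision function (a Python sketch with a hypothetical encoding of statuses and steps, not SCST or kernel code), which makes it easy to audit that every (status, position) combination is covered:

```python
def aca_recovery_plan(status, index, queue_len):
    """Map the enumerated recovery cases to an ordered list of steps.
    index is the failed command's position in the queue (0 = head)."""
    head = index == 0
    last = index == queue_len - 1
    if status == "CHECK_CONDITION":
        if head:                               # case 1
            return ["resend as ACA task", "clear ACA", "resume queue"]
        if last:                               # case 2.1
            return ["clear ACA", "restart failed command", "resume queue"]
        return ["ACA TEST UNIT READY",         # case 2.2: fence in-flight cmds
                "ABORT TASK for commands after the failed one",
                "clear ACA",
                "resend failed and aborted commands",
                "resume queue"]
    if head:
        if queue_len == 1:                     # case 3.1
            return ["TEST UNIT READY to trigger ACA", "clear ACA",
                    "restart failed command", "resume queue"]
        return ["remember failed command",     # case 3.2
                "next command triggers ACA",
                "resend both as ACA tasks",
                "clear ACA", "resume queue"]
    return ["proceed as in case 3.x, then 2.2"]  # case 4 (e.g. TASK SET FULL)

assert aca_recovery_plan("CHECK_CONDITION", 0, 4)[0] == "resend as ACA task"
```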
* Re: [RFC] relaxed barrier semantics 2010-08-05 13:32 ` Chris Mason ` (2 preceding siblings ...) 2010-08-05 19:48 ` Vladislav Bolkhovitin @ 2010-08-05 19:48 ` Vladislav Bolkhovitin 2010-08-05 19:50 ` Christoph Hellwig 3 siblings, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-08-05 19:48 UTC (permalink / raw) To: Chris Mason, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.B Chris Mason, on 08/05/2010 05:32 PM wrote: > On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote: >> Chris Mason, on 08/02/2010 09:39 PM wrote: >>> I regret putting the ordering into the original barrier code...it >>> definitely did help reiserfs back in the day but it stinks of magic and >>> voodoo. >> >> But if the ordering isn't in the common (block) code, how to >> implement the "hardware offload" for ordering, i.e. ORDERED >> commands, in an acceptable way? >> >> I believe, the decision was right, but the flags and magic requests >> based interface (and, hence, implementation) was wrong. That's it >> which stinks of magic and voodoo. > > The interface definitely has flaws. We didn't expand it because James > popped up with a long list of error handling problems. Could you point on the corresponding message, please? I can't find it in my archive. > Basically how > do the hardware and the kernel deal with a failed request at the start > of the chain. Somehow the easy way of failing them all turned out to be > extremely difficult. Have you considered to not fail them all, but using ACA SCSI facility just suspend the queue, then requeue the failed request, then restart processing? I might be missing something, but using this approach the failed requests recovery should look quite simple and, most important, compact, hence easily audited. Something like below. Sorry, since it's a low level recovery, it requires some deep SCSI knowledge to follow. We need: 1. 
A low level driver without internal queue and masking returned status and sense. At first look, many of the existing drivers more or less satisfy this requirement, including drivers in my direct interest: qla2xxx, iscsi and ib_srp. 2. A device with support of ORDERED commands as well as ACA and UA_INTLCK facilities in QERR mode 0. Assume we have N ORDERED requests queued to a device and one of them failed. Then submitting new requests to the device would be suspended and recovery thread woken up. Let's we have a list of queued to the device requests in order as they queued. Then the recovery thread would need to deal with the following cases: 1. The failed command failed with CHECK_CONDITION and from the head of the queue. (The device now established ACA and suspended its internal queue.) Then the command should be sent to the device as ACA task and, after it's finished, ACA should be cleared. (The device now would restart its queue.) Then submitting new requests to the device would also be resumed. 2. The failed command failed with CHECK_CONDITION and isn't from the head of the queue. 2.1. The failed command in the last in the queue. ACA should be cleared and the failed command should simply be restarted. Then submitting new requests to the device would also be resumed. 2.2. The failed command isn't last in the queue. Then the recovery thread would send ACA command TEST UNIT READY to be sure all in-flight commands reached the device. Then it would abort all the commands after the failed one using ABORT TASK Task Management function. Then ACA should be cleared and the failed command as well as all the aborted commands would be resend to the device. Then submitting new requests to the device would also be resumed. 3. The failed command failed with other status than CHECK_CONDITION and from the head of the queue. 3.1. The failed command is the only queued command. 
Then TEST UNIT READY command should be sent to the device to get the post UA_INTLCK CHECK CONDITION and trigger ACA. Then ACA should be cleared and the failed command restarted. Then submitting new requests to the device would also be resumed. 3.2. There are other queued commands. Then the recovery thread should remember the failed command and exit. The next command would get the post UA_INTLCK CHECK CONDITION and trigger ACA. Then recovery would proceed as in (1), except that 2 failed commands would be restarted as ACA commands before clearing ACA. 4. The failed command isn't from the head of the queue and failed with other status than CHECK_CONDITION. It might happen in case of TASK QUEUE FULL condition. This case would be proceed similarly as cases (3.x), then (2.2). That's all. Simple, compact and clear for auditing. > Even if that part had been refined, I think trusting the ordering down > to the lower layers was a doomed idea. The list of ways it could go > wrong is much much longer (and harder to debug) than the list of > benefits. It's hard to debug, because it's currently a overloaded flags nightmare. It isn't the idea to trust lower level doomed, everybody trust lower levels everywhere in the kernel. Doomed the idea to provide requested functionality via a set of flags and artificial barrier requests with obscured side effects. Linux just needs a clear and _natural_ interface for that. Like one I proposed in http://marc.info/?l=linux-scsi&m=128077574815881&w=2. Yes, I am proposing to slowly start thinking to move to a new interface and implementation out from the current hell. It's obvious that what Linux has now in this area is a dead end. The new flag Christoph is going to add makes it even worse. > With all of that said, I did go ahead and benchmark real ordered tags > extensively on a scsi drive in the initial implementation. There was > very little performance difference. It isn't surprise that you didn't see much difference with a local (Wide?) 
SCSI drive. Such drives sit on a low-latency link, are simple enough to have small internal latencies, and are dumb enough not to gain much benefit from internal reordering. But what about external arrays? Or even clusters? Nowadays anybody can build such arrays and clusters from any Linux (or other *nix) box using any OSS SCSI target implementation, starting with SCST, which I have been developing. Such array/cluster devices use links with an order of magnitude higher latency, and they are very sophisticated inside, so they have much bigger internal latencies as well as much bigger opportunities to optimize the I/O pattern by internal reordering. All the record numbers I've seen so far were reached with a deep queue. For instance, the last SCST record (>500K 4K IOPS from a single target) was achieved with queue depth 128! So, I believe, Linux must use that possibility to get full storage performance and to finally simplify its storage stack. Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-05 19:48 ` Vladislav Bolkhovitin @ 2010-08-05 19:50 ` Christoph Hellwig 2010-08-05 20:05 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-08-05 19:50 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Chris Mason, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke On Thu, Aug 05, 2010 at 11:48:19PM +0400, Vladislav Bolkhovitin wrote: > So, I believe, Linux must use that possibility to get full storage > performance and to finally simplify its storage stack. So instead of talking what about doing a prototype and show us what improvement it gives? ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-05 19:50 ` Christoph Hellwig @ 2010-08-05 20:05 ` Vladislav Bolkhovitin 2010-08-06 14:56 ` Hannes Reinecke 0 siblings, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-08-05 20:05 UTC (permalink / raw) To: Christoph Hellwig Cc: Chris Mason, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke

Christoph Hellwig, on 08/05/2010 11:50 PM wrote:
> On Thu, Aug 05, 2010 at 11:48:19PM +0400, Vladislav Bolkhovitin wrote:
>> So, I believe, Linux must use that possibility to get full storage
>> performance and to finally simplify its storage stack.
>
> So instead of talking what about doing a prototype and show us what
> improvement it gives?

Sure, I'd love to. But, unfortunately, I can't clone myself, so I'm trying to help with the best of what I have: my storage and SCSI expertise. This area is quite special, so I'm trying to clear up some misunderstandings I see and illustrate my points with some possible workflows and interfaces. But I can shut up if you'd like.

Thanks, Vlad

^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-05 20:05 ` Vladislav Bolkhovitin @ 2010-08-06 14:56 ` Hannes Reinecke 2010-08-06 18:38 ` Vladislav Bolkhovitin 2010-08-06 23:34 ` Christoph Hellwig 0 siblings, 2 replies; 155+ messages in thread From: Hannes Reinecke @ 2010-08-06 14:56 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, Chris Mason, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke

Vladislav Bolkhovitin wrote:
> Christoph Hellwig, on 08/05/2010 11:50 PM wrote:
>> On Thu, Aug 05, 2010 at 11:48:19PM +0400, Vladislav Bolkhovitin wrote:
>>> So, I believe, Linux must use that possibility to get full storage
>>> performance and to finally simplify its storage stack.
>>
>> So instead of talking what about doing a prototype and show us what
>> improvement it gives?
>
> Sure, I'd love to. But, unfortunately, I can't clone myself, so I'm
> trying to help the best of what I could: my level of storage and SCSI
> expertise. This area is quite special, so I'm trying to explain some
> misunderstandings I see and illustrate my points by some possible work
> flows and interfaces.
>

I can't either. But I can do bonnie runs in no time. I have done some preliminary benchmarks by just enabling ordered queueing in sd.c and no other changes. Bonnie says:

Writing intelligently: 115208 vs. 82739
Reading intelligently: 134133 vs. 129395

putc() performance suffers, though: I get 52M vs 90M writing and 50M vs. 65M reading. No idea why; it shouldn't be that harmful here.

But in any case there is some speed improvement to be had from using ordered tags.

Oh, and that was against an EVA 6400.

Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr.
5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-06 14:56 ` Hannes Reinecke @ 2010-08-06 18:38 ` Vladislav Bolkhovitin 2010-08-06 23:38 ` Christoph Hellwig 2010-08-06 23:34 ` Christoph Hellwig 1 sibling, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-08-06 18:38 UTC (permalink / raw) To: Hannes Reinecke, Tejun Heo Cc: Christoph Hellwig, Chris Mason, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke

Hannes Reinecke, on 08/06/2010 06:56 PM wrote:
> But I can do bonnie runs in no time.
> I have done some preliminary benchmarks by just enable ordered
> queueing in sd.c and no other changes.
> Bonnie says:
>
> Writing intelligently: 115208 vs. 82739
> Reading intelligently: 134133 vs. 129395
>
> putc() performance suffers, though:
> I get 52M vs 90M writing and 50M vs. 65M reading.
> No idea why; shouldn't be that harmful here.
>
> But in any case there is some speed improvement
> to be had from using ordered tags.
>
> Oh, and that was against an EVA 6400.

Here are my numbers. They are taken using:

fio --bs=X --ioengine=aio --buffered=0 --size=128M --rw=read --thread --numjobs=1 --loops=100 --group_reporting --gtod_reduce=1 --name=AAA --filename=/dev/sdc --iodepth=Y

/dev/sdc is a 1GbE iSCSI device with, on the other side, iSCSI-SCST with a single 15K RPM Wide SCSI HDD. All values are in MB/s. The system (initiator) is a pretty old 1.7GHz Xeon.

   Y |   1    2    4    8   32
------------------------------
X
  4K |  16   25   32   34   34  (initiator CPU overloaded)
 16K |  25   57   72   85   85  (initiator CPU overloaded)
 32K |  44   72   97  106  106  (initiator CPU overloaded)
 64K |  65   95  114  115  115  (max of 1GbE)
128K |  80  112  115  115  115  (max of 1GbE)

Are there still any people thinking that tagged queuing doesn't have any meaningful use? Or that a 350% performance increase doesn't matter? (If the system were more powerful, the difference would be even bigger.)
As you can see, on external storage even with 128K commands the queue needs at least 2 entries queued to run at full performance. Vlad ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-06 18:38 ` Vladislav Bolkhovitin @ 2010-08-06 23:38 ` Christoph Hellwig 0 siblings, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-08-06 23:38 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Hannes Reinecke, Tejun Heo, Christoph Hellwig, Chris Mason, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke

On Fri, Aug 06, 2010 at 10:38:46PM +0400, Vladislav Bolkhovitin wrote:
> Are there still any people thinking that tagged queuing doesn't have any
> meaningful use?
>
> Or 350% performance increase doesn't matter? (If the system was more
> powerful, the difference would be even bigger.)
>
> As you can see on external storage even with 128K commands the queue
> should have at least 2 entries queued to go with full performance.

Vlad, no one disagrees that draining the queue is really bad for performance. That's in fact what started the whole thread. The question is whether it's worth it to deal with the complexities of using tagged queueing all the way through the I/O and filesystem stack, or whether to keep the existing perfectly working code to wait on individual I/O requests in the filesystem. The latter won't be able to keep the queue filled for the case where we try to max out the I/O subsystem with a single synchronous writer thread, so tagged queueing would be a clear win for that. It's not exactly the typical use case for high end storage, though - and once you have multiple threads keeping the queue busy the advantage of the tagging shrinks. Of course all this is just talking, and someone would need to actually do the work of using tagged queueing in a useful (and non-buggy) way and benchmark it.

^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-06 14:56 ` Hannes Reinecke 2010-08-06 18:38 ` Vladislav Bolkhovitin @ 2010-08-06 23:34 ` Christoph Hellwig 1 sibling, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-08-06 23:34 UTC (permalink / raw) To: Hannes Reinecke Cc: Vladislav Bolkhovitin, Christoph Hellwig, Chris Mason, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke On Fri, Aug 06, 2010 at 04:56:56PM +0200, Hannes Reinecke wrote: > But I can do bonnie runs in no time. > I have done some preliminary benchmarks by just enable ordered > queueing in sd.c and no other changes. Enabled what exactly? ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-05 13:11 ` Vladislav Bolkhovitin 2010-08-05 13:32 ` Chris Mason @ 2010-08-05 17:09 ` Christoph Hellwig 2010-08-05 19:32 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-08-05 17:09 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Chris Mason, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke

On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote:
> Chris Mason, on 08/02/2010 09:39 PM wrote:
> >I regret putting the ordering into the original barrier code...it
> >definitely did help reiserfs back in the day but it stinks of magic and
> >voodoo.
>
> But if the ordering isn't in the common (block) code, how to implement
> the "hardware offload" for ordering, i.e. ORDERED commands, in an
> acceptable way?

Right now we have no working implementation of actually using ordered tags for a storage device in Linux. There's very little need for common code in that implementation - basically we just need a flag in the bio / request to make this one an ordered tag, in addition to the existing reordering prevention in the block queue.

^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-05 17:09 ` Christoph Hellwig @ 2010-08-05 19:32 ` Vladislav Bolkhovitin 2010-08-05 19:40 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-08-05 19:32 UTC (permalink / raw) To: Christoph Hellwig Cc: Chris Mason, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke

Christoph Hellwig, on 08/05/2010 09:09 PM wrote:
> On Thu, Aug 05, 2010 at 05:11:56PM +0400, Vladislav Bolkhovitin wrote:
>> Chris Mason, on 08/02/2010 09:39 PM wrote:
>>> I regret putting the ordering into the original barrier code...it
>>> definitely did help reiserfs back in the day but it stinks of magic and
>>> voodoo.
>>
>> But if the ordering isn't in the common (block) code, how to implement
>> the "hardware offload" for ordering, i.e. ORDERED commands, in an
>> acceptable way?
>
> Right now we have no working implementation of actually using ordered
> tags for a storage device in Linux. There's very little need for common
> code in that implementation - basically we just need flag in the bio /
> request to make this one an ordered tag in addition to the existing
> reordering preventing in the block queue.

A new flag... Easy to add, hard to live with. Aren't you already tired of the existing flags hell?

Vlad

^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-05 19:32 ` Vladislav Bolkhovitin @ 2010-08-05 19:40 ` Christoph Hellwig 0 siblings, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-08-05 19:40 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: Christoph Hellwig, Chris Mason, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, swhiteho, konishi.ryusuke

On Thu, Aug 05, 2010 at 11:32:04PM +0400, Vladislav Bolkhovitin wrote:
> New flag.. Easy to add, hard to live with. Aren't you already tied of
> the existing flags hell?

I'm tired of flags without a very well defined meaning. For example I'm really tired of the current REQ_HARDBARRIER because it means so many different things. A must-do-pre-flush or must-do-FUA flag is very different from a must-not-reorder flag. Overloading the meaning is what got us into this mess.

^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-08-02 17:39 ` Chris Mason 2010-08-05 13:11 ` Vladislav Bolkhovitin @ 2010-08-05 13:11 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-08-05 13:11 UTC (permalink / raw) To: Chris Mason, Christoph Hellwig, Tejun Heo, Vivek Goyal, Jan Kara, jaxboe, James.B

Chris Mason, on 08/02/2010 09:39 PM wrote:
> I regret putting the ordering into the original barrier code...it
> definitely did help reiserfs back in the day but it stinks of magic and
> voodoo.

But if the ordering isn't in the common (block) code, how to implement the "hardware offload" for ordering, i.e. ORDERED commands, in an acceptable way? I believe the decision was right, but the flags-and-magic-requests based interface (and, hence, implementation) was wrong. That is what stinks of magic and voodoo.

Vlad

^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:17 ` Tejun Heo 2010-07-28 9:28 ` Christoph Hellwig @ 2010-07-28 13:56 ` Vladislav Bolkhovitin 1 sibling, 0 replies; 155+ messages in thread From: Vladislav Bolkhovitin @ 2010-07-28 13:56 UTC (permalink / raw) To: Tejun Heo, Christoph Hellwig Cc: Vivek Goyal, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

Tejun Heo, on 07/28/2010 01:17 PM wrote:
> On 07/28/2010 11:00 AM, Christoph Hellwig wrote:
>> On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote:
>>> I see. It probably would be good to have ordering requirements
>>> carried in the bio / request, so that filesystems can mix and match
>>> barriers of different strengths as necesasry. As you seem to be
>>> already working on it, are you interested in pursuing that direction?
>>
>> I've been working on that for a while, but it got a lot more urgent
>> as there's been an application hit particularly hard by the barrier
>> semantics on cache less devices and people started getting angry
>> about it. That's why fixing this for cache less devices has become
>> a higher priority than solving the big picture.
>
> Well, if disabling barrier works around the problem for them (which is
> basically what was suggeseted in the first message), that's not too
> bad for short term, I think. At least, there's a handy workaround.
> I'll re-read barrier code and see how hard it would be to implement a
> proper solution.

For all the people working on barriers, I'd recommend using a Linux-based software SCSI device implemented with the SCST framework (http://scst.sourceforge.net). This isn't an advertisement; SCST is really handy for such tasks. With it you can make your device write through/write back/FUA/NV cache/etc., you can fully see the flow of commands sent by your Linux initiator, you can insert filters on some of them, perform various failure injections to check how robust your implementation is, etc.

SCST fully processes ORDERED commands as required by SAM. You can start with the iSCSI target and the vdisk backend dev handler. For example, to see the full flow of commands you should run (with the proc interface) "echo "add scsi" >/proc/scsi_tgt/trace_level"; to see FUA/sync cache commands only: "echo "add order" >/proc/scsi_tgt/vdisk/trace_level". The output will be in the kernel log, so you may need to increase CONFIG_LOG_BUF_SHIFT.

For 1.0.1.x I have a patch implementing ACA, developed by a company using SCST, which is going to be integrated into the trunk in v2.1. This patch was needed for AIX to work at full performance and is now used in production. With it, implementing UA_INTLCK is trivial and I can do it upon request.

Vlad

^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-28 9:00 ` Christoph Hellwig 2010-07-28 9:11 ` Hannes Reinecke 2010-07-28 9:17 ` Tejun Heo @ 2010-07-28 14:42 ` Vivek Goyal 2 siblings, 0 replies; 155+ messages in thread From: Vivek Goyal @ 2010-07-28 14:42 UTC (permalink / raw) To: Christoph Hellwig Cc: Tejun Heo, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On Wed, Jul 28, 2010 at 11:00:25AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 28, 2010 at 10:58:30AM +0200, Tejun Heo wrote:
> > I see. It probably would be good to have ordering requirements
> > carried in the bio / request, so that filesystems can mix and match
> > barriers of different strengths as necesasry. As you seem to be
> > already working on it, are you interested in pursuing that direction?
>
> I've been working on that for a while, but it got a lot more urgent
> as there's been an application hit particularly hard by the barrier
> semantics on cache less devices and people started getting angry
> about it. That's why fixing this for cache less devices has become
> a higher priority than solving the big picture.

And in the process the IO controller cgroup stuff will also benefit; otherwise excessive draining on the request queue takes away any service differentiation CFQ provides among groups.

Vivek

^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [RFC] relaxed barrier semantics 2010-07-27 17:54 ` Jan Kara 2010-07-27 18:35 ` Vivek Goyal @ 2010-07-27 19:37 ` Christoph Hellwig 2010-08-03 18:49 ` [PATCH, RFC 1/2] relaxed cache flushes Christoph Hellwig 2 siblings, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-07-27 19:37 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke

On Tue, Jul 27, 2010 at 07:54:19PM +0200, Jan Kara wrote:
> OK, let me understand one thing. So the storage arrays have some caches
> and queues of requests and QUEUE_ORDERED_DRAIN forces them flush all this
> to the platter, right?

Not quite. QUEUE_ORDERED_DRAIN does not interact with the target at all; it's entirely initiator (Linux) side. What it does is make sure we drain the whole queue in the I/O scheduler (elevator) and everything in flight to the device (command queueing) by waiting for all I/O before the barrier to finish, then issue the barrier command, and only then allow any newly arriving requests to proceed.

> So can it happen that they somehow lose the requests that were already
> issued to them (e.g. because of power failure)?

We can lose the requests already on the wire but not completed yet. That's why log writes wait for all preceding log writes (or things like the I/Os required to push the tail) and fsync waits for all I/O completions manually.

^ permalink raw reply [flat|nested] 155+ messages in thread
* [PATCH, RFC 1/2] relaxed cache flushes 2010-07-27 17:54 ` Jan Kara 2010-07-27 18:35 ` Vivek Goyal 2010-07-27 19:37 ` Christoph Hellwig @ 2010-08-03 18:49 ` Christoph Hellwig 2010-08-03 18:51 ` [PATCH, RFC 2/2] dm: support REQ_FLUSH directly Christoph Hellwig 2010-08-06 16:04 ` [PATCH, RFC] relaxed barriers Tejun Heo 2 siblings, 2 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-08-03 18:49 UTC (permalink / raw) To: Jan Kara, jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.maso

So instead of cracking my head on the relaxed barriers I've decided to do the easiest part first. That is relaxing the explicit cache flushes done by blkdev_issue_flush. These days these are handled as an empty barrier, which is completely overkill. Instead take advantage of the way we now handle flushes, that is as REQ_FLUSH FS requests. Do a few updates to the block layer so that we handle REQ_FLUSH correctly and we can make blkdev_issue_flush submit them directly.

All request based block drivers should just work with it, but bio based remappers will need some additional work. The next patch will do this for DM, but I haven't quite grasped the barrier code in MD yet. Despite doing a lot of REQ_HARDBARRIER tests, DRBD doesn't actually advertise any ordered mode so it's not affected. The barrier handling in the loop driver is currently broken anyway, and I'm still undecided if I want to fix it before or after this conversion.

Index: linux-2.6/block/blk-barrier.c =================================================================== --- linux-2.6.orig/block/blk-barrier.c 2010-08-03 20:26:50.259005954 +0200 +++ linux-2.6/block/blk-barrier.c 2010-08-03 20:33:39.580266216 +0200 @@ -151,25 +151,7 @@ static inline bool start_ordered(struct q->ordered = q->next_ordered; q->ordseq |= QUEUE_ORDSEQ_STARTED; - /* - * For an empty barrier, there's no actual BAR request, which - * in turn makes POSTFLUSH unnecessary. Mask them off.
- */ - if (!blk_rq_sectors(rq)) { - q->ordered &= ~(QUEUE_ORDERED_DO_BAR | - QUEUE_ORDERED_DO_POSTFLUSH); - /* - * Empty barrier on a write-through device w/ ordered - * tag has no command to issue and without any command - * to issue, ordering by tag can't be used. Drain - * instead. - */ - if ((q->ordered & QUEUE_ORDERED_BY_TAG) && - !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) { - q->ordered &= ~QUEUE_ORDERED_BY_TAG; - q->ordered |= QUEUE_ORDERED_BY_DRAIN; - } - } + BUG_ON(!blk_rq_sectors(rq)); /* stash away the original request */ blk_dequeue_request(rq); @@ -311,6 +293,9 @@ int blkdev_issue_flush(struct block_devi if (!q) return -ENXIO; + if (!(q->next_ordered & QUEUE_ORDERED_DO_PREFLUSH)) + return 0; + /* * some block devices may not have their queue correctly set up here * (e.g. loop device without a backing file) and so issuing a flush @@ -327,7 +312,7 @@ int blkdev_issue_flush(struct block_devi bio->bi_private = &wait; bio_get(bio); - submit_bio(WRITE_BARRIER, bio); + submit_bio(WRITE_SYNC | REQ_FLUSH, bio); if (test_bit(BLKDEV_WAIT, &flags)) { wait_for_completion(&wait); /* Index: linux-2.6/block/elevator.c =================================================================== --- linux-2.6.orig/block/elevator.c 2010-08-03 20:26:50.268024322 +0200 +++ linux-2.6/block/elevator.c 2010-08-03 20:32:11.949256478 +0200 @@ -423,7 +423,8 @@ void elv_dispatch_sort(struct request_qu q->nr_sorted--; boundary = q->end_sector; - stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED; + stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED | \ + REQ_FLUSH; list_for_each_prev(entry, &q->queue_head) { struct request *pos = list_entry_rq(entry); Index: linux-2.6/include/linux/bio.h =================================================================== --- linux-2.6.orig/include/linux/bio.h 2010-08-03 20:26:50.298255570 +0200 +++ linux-2.6/include/linux/bio.h 2010-08-03 20:46:48.367257736 +0200 @@ -153,6 +153,7 @@ enum rq_flag_bits { __REQ_META, /* metadata io request 
*/ __REQ_DISCARD, /* request to discard sectors */ __REQ_NOIDLE, /* don't anticipate more IO after this one */ + __REQ_FLUSH, /* request for cache flush */ /* bio only flags */ __REQ_UNPLUG, /* unplug the immediately after submission */ @@ -174,7 +175,6 @@ enum rq_flag_bits { __REQ_ALLOCED, /* request came from our alloc pool */ __REQ_COPY_USER, /* contains copies of user pages */ __REQ_INTEGRITY, /* integrity metadata has been remapped */ - __REQ_FLUSH, /* request for cache flush */ __REQ_IO_STAT, /* account I/O stat */ __REQ_MIXED_MERGE, /* merge of different types, fail separately */ __REQ_NR_BITS, /* stops here */ @@ -189,12 +189,13 @@ enum rq_flag_bits { #define REQ_META (1 << __REQ_META) #define REQ_DISCARD (1 << __REQ_DISCARD) #define REQ_NOIDLE (1 << __REQ_NOIDLE) +#define REQ_FLUSH (1 << __REQ_FLUSH) #define REQ_FAILFAST_MASK \ (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER) #define REQ_COMMON_MASK \ (REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \ - REQ_META| REQ_DISCARD | REQ_NOIDLE) + REQ_META| REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH) #define REQ_UNPLUG (1 << __REQ_UNPLUG) #define REQ_RAHEAD (1 << __REQ_RAHEAD) @@ -214,7 +215,6 @@ enum rq_flag_bits { #define REQ_ALLOCED (1 << __REQ_ALLOCED) #define REQ_COPY_USER (1 << __REQ_COPY_USER) #define REQ_INTEGRITY (1 << __REQ_INTEGRITY) -#define REQ_FLUSH (1 << __REQ_FLUSH) #define REQ_IO_STAT (1 << __REQ_IO_STAT) #define REQ_MIXED_MERGE (1 << __REQ_MIXED_MERGE) Index: linux-2.6/include/linux/blkdev.h =================================================================== --- linux-2.6.orig/include/linux/blkdev.h 2010-08-03 20:26:50.311003929 +0200 +++ linux-2.6/include/linux/blkdev.h 2010-08-03 20:32:11.956036684 +0200 @@ -589,7 +589,8 @@ static inline void blk_clear_queue_full( * it already be started by driver. 
*/ #define RQ_NOMERGE_FLAGS \ - (REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER) + (REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER | \ + REQ_FLUSH) #define rq_mergeable(rq) \ (!((rq)->cmd_flags & RQ_NOMERGE_FLAGS) && \ (((rq)->cmd_flags & REQ_DISCARD) || \ Index: linux-2.6/block/blk-core.c =================================================================== --- linux-2.6.orig/block/blk-core.c 2010-08-03 20:26:50.275003649 +0200 +++ linux-2.6/block/blk-core.c 2010-08-03 20:32:11.960004138 +0200 @@ -1203,7 +1203,7 @@ static int __make_request(struct request const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK; int rw_flags; - if ((bio->bi_rw & REQ_HARDBARRIER) && + if ((bio->bi_rw & (REQ_HARDBARRIER|REQ_FLUSH)) && (q->next_ordered == QUEUE_ORDERED_NONE)) { bio_endio(bio, -EOPNOTSUPP); return 0; @@ -1217,7 +1217,7 @@ static int __make_request(struct request spin_lock_irq(q->queue_lock); - if (unlikely((bio->bi_rw & REQ_HARDBARRIER)) || elv_queue_empty(q)) + if ((bio->bi_rw & (REQ_HARDBARRIER|REQ_FLUSH)) || elv_queue_empty(q)) goto get_rq; el_ret = elv_merge(q, &req, bio); ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-03 18:49 ` [PATCH, RFC 1/2] relaxed cache flushes Christoph Hellwig @ 2010-08-03 18:51 ` Christoph Hellwig 2010-08-04 4:57 ` Kiyoshi Ueda 2010-08-06 16:04 ` [PATCH, RFC] relaxed barriers Tejun Heo 1 sibling, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-08-03 18:51 UTC (permalink / raw) To: Jan Kara, jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.maso

Adapt device-mapper to the new world order where even bio based devices get simple REQ_FLUSH requests for cache flushes, and need to submit them downwards for implementing barriers.

Note that I've removed the unlikely statements around the REQ_FLUSH checks. While these generally aren't as common as normal read/writes, they are common enough that statically mispredicting them is a really bad idea.

Tested with simple linear LVM volumes only so far.

Index: linux-2.6/drivers/md/dm-crypt.c =================================================================== --- linux-2.6.orig/drivers/md/dm-crypt.c 2010-08-03 20:26:49.629254174 +0200 +++ linux-2.6/drivers/md/dm-crypt.c 2010-08-03 20:36:59.279003929 +0200 @@ -1249,7 +1249,7 @@ static int crypt_map(struct dm_target *t struct dm_crypt_io *io; struct crypt_config *cc; - if (unlikely(bio_empty_barrier(bio))) { + if (bio->bi_rw & REQ_FLUSH) { cc = ti->private; bio->bi_bdev = cc->dev->bdev; return DM_MAPIO_REMAPPED; Index: linux-2.6/drivers/md/dm-raid1.c =================================================================== --- linux-2.6.orig/drivers/md/dm-raid1.c 2010-08-03 20:26:49.641003999 +0200 +++ linux-2.6/drivers/md/dm-raid1.c 2010-08-03 20:36:59.280003649 +0200 @@ -629,7 +629,7 @@ static void do_write(struct mirror_set * struct dm_io_region io[ms->nr_mirrors], *dest = io; struct mirror *m; struct dm_io_request io_req = { - .bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER), + .bi_rw = WRITE | (bio->bi_rw & (WRITE_BARRIER|REQ_FLUSH)), .mem.type = DM_IO_BVEC, .mem.ptr.bvec =
bio->bi_io_vec + bio->bi_idx, .notify.fn = write_callback, @@ -670,7 +670,7 @@ static void do_writes(struct mirror_set bio_list_init(&requeue); while ((bio = bio_list_pop(writes))) { - if (unlikely(bio_empty_barrier(bio))) { + if (bio->bi_rw & REQ_FLUSH) { bio_list_add(&sync, bio); continue; } @@ -1199,12 +1199,14 @@ static int mirror_end_io(struct dm_targe struct dm_bio_details *bd = NULL; struct dm_raid1_read_record *read_record = map_context->ptr; + if (bio->bi_rw & REQ_FLUSH) + return error; + /* * We need to dec pending if this was a write. */ if (rw == WRITE) { - if (likely(!bio_empty_barrier(bio))) - dm_rh_dec(ms->rh, map_context->ll); + dm_rh_dec(ms->rh, map_context->ll); return error; } Index: linux-2.6/drivers/md/dm-region-hash.c =================================================================== --- linux-2.6.orig/drivers/md/dm-region-hash.c 2010-08-03 20:26:49.650023346 +0200 +++ linux-2.6/drivers/md/dm-region-hash.c 2010-08-03 20:36:59.285025649 +0200 @@ -399,7 +399,7 @@ void dm_rh_mark_nosync(struct dm_region_ region_t region = dm_rh_bio_to_region(rh, bio); int recovering = 0; - if (bio_empty_barrier(bio)) { + if (bio->bi_rw & REQ_FLUSH) { rh->barrier_failure = 1; return; } @@ -524,7 +524,7 @@ void dm_rh_inc_pending(struct dm_region_ struct bio *bio; for (bio = bios->head; bio; bio = bio->bi_next) { - if (bio_empty_barrier(bio)) + if (bio->bi_rw & REQ_FLUSH) continue; rh_inc(rh, dm_rh_bio_to_region(rh, bio)); } Index: linux-2.6/drivers/md/dm-snap.c =================================================================== --- linux-2.6.orig/drivers/md/dm-snap.c 2010-08-03 20:26:49.656003091 +0200 +++ linux-2.6/drivers/md/dm-snap.c 2010-08-03 20:36:59.290023135 +0200 @@ -1581,7 +1581,7 @@ static int snapshot_map(struct dm_target chunk_t chunk; struct dm_snap_pending_exception *pe = NULL; - if (unlikely(bio_empty_barrier(bio))) { + if (bio->bi_rw & REQ_FLUSH) { bio->bi_bdev = s->cow->bdev; return DM_MAPIO_REMAPPED; } @@ -1685,7 +1685,7 @@ static int 
snapshot_merge_map(struct dm_ int r = DM_MAPIO_REMAPPED; chunk_t chunk; - if (unlikely(bio_empty_barrier(bio))) { + if (bio->bi_rw & REQ_FLUSH) { if (!map_context->flush_request) bio->bi_bdev = s->origin->bdev; else @@ -2123,7 +2123,7 @@ static int origin_map(struct dm_target * struct dm_dev *dev = ti->private; bio->bi_bdev = dev->bdev; - if (unlikely(bio_empty_barrier(bio))) + if (bio->bi_rw & REQ_FLUSH) return DM_MAPIO_REMAPPED; /* Only tell snapshots if this is a write */ Index: linux-2.6/drivers/md/dm-stripe.c =================================================================== --- linux-2.6.orig/drivers/md/dm-stripe.c 2010-08-03 20:26:49.663003301 +0200 +++ linux-2.6/drivers/md/dm-stripe.c 2010-08-03 20:36:59.295005744 +0200 @@ -214,7 +214,7 @@ static int stripe_map(struct dm_target * sector_t offset, chunk; uint32_t stripe; - if (unlikely(bio_empty_barrier(bio))) { + if (bio->bi_rw & REQ_FLUSH) { BUG_ON(map_context->flush_request >= sc->stripes); bio->bi_bdev = sc->stripe[map_context->flush_request].dev->bdev; return DM_MAPIO_REMAPPED; Index: linux-2.6/drivers/md/dm.c =================================================================== --- linux-2.6.orig/drivers/md/dm.c 2010-08-03 20:26:49.676004139 +0200 +++ linux-2.6/drivers/md/dm.c 2010-08-03 20:36:59.301005325 +0200 @@ -633,7 +633,7 @@ static void dec_pending(struct dm_io *io io_error = io->error; bio = io->bio; - if (bio->bi_rw & REQ_HARDBARRIER) { + if (bio == &md->barrier_bio) { /* * There can be just one barrier request so we use * a per-device variable for error reporting. @@ -851,7 +851,7 @@ void dm_requeue_unmapped_request(struct struct request_queue *q = rq->q; unsigned long flags; - if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) { + if (clone->cmd_flags & REQ_HARDBARRIER) { /* * Barrier clones share an original request. 
* Leave it to dm_end_request(), which handles this special @@ -950,7 +950,7 @@ static void dm_complete_request(struct r struct dm_rq_target_io *tio = clone->end_io_data; struct request *rq = tio->orig; - if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) { + if (clone->cmd_flags & REQ_HARDBARRIER) { /* * Barrier clones share an original request. So can't use * softirq_done with the original. @@ -979,7 +979,7 @@ void dm_kill_unmapped_request(struct req struct dm_rq_target_io *tio = clone->end_io_data; struct request *rq = tio->orig; - if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) { + if (clone->cmd_flags & REQ_HARDBARRIER) { /* * Barrier clones share an original request. * Leave it to dm_end_request(), which handles this special @@ -1208,7 +1208,7 @@ static int __clone_and_map(struct clone_ sector_t len = 0, max; struct dm_target_io *tio; - if (unlikely(bio_empty_barrier(bio))) + if (bio->bi_rw & REQ_FLUSH) return __clone_and_map_empty_barrier(ci); ti = dm_table_find_target(ci->map, ci->sector); @@ -1308,7 +1308,7 @@ static void __split_and_process_bio(stru ci.map = dm_get_live_table(md); if (unlikely(!ci.map)) { - if (!(bio->bi_rw & REQ_HARDBARRIER)) + if (bio != &md->barrier_bio) bio_io_error(bio); else if (!md->barrier_error) @@ -1326,7 +1326,7 @@ static void __split_and_process_bio(stru spin_lock_init(&ci.io->endio_lock); ci.sector = bio->bi_sector; ci.sector_count = bio_sectors(bio); - if (unlikely(bio_empty_barrier(bio))) + if (bio->bi_rw & REQ_FLUSH) ci.sector_count = 1; ci.idx = bio->bi_idx; @@ -1421,7 +1421,7 @@ static int _dm_request(struct request_qu * we have to queue this io for later. 
*/ if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) || - unlikely(bio->bi_rw & REQ_HARDBARRIER)) { + unlikely(bio->bi_rw & (REQ_HARDBARRIER|REQ_FLUSH))) { up_read(&md->io_lock); if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) && @@ -1462,14 +1462,6 @@ static int dm_request(struct request_que return _dm_request(q, bio); } -static bool dm_rq_is_flush_request(struct request *rq) -{ - if (rq->cmd_flags & REQ_FLUSH) - return true; - else - return false; -} - void dm_dispatch_request(struct request *rq) { int r; @@ -1517,10 +1509,10 @@ static int setup_clone(struct request *c { int r; - if (dm_rq_is_flush_request(rq)) { + if (rq->cmd_flags & REQ_FLUSH) { blk_rq_init(NULL, clone); clone->cmd_type = REQ_TYPE_FS; - clone->cmd_flags |= (REQ_HARDBARRIER | WRITE); + clone->cmd_flags |= (WRITE_SYNC | REQ_FLUSH); } else { r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC, dm_rq_bio_constructor, tio); @@ -1573,7 +1565,7 @@ static int dm_prep_fn(struct request_que struct mapped_device *md = q->queuedata; struct request *clone; - if (unlikely(dm_rq_is_flush_request(rq))) + if (rq->cmd_flags & REQ_FLUSH) return BLKPREP_OK; if (unlikely(rq->special)) { @@ -1664,7 +1656,7 @@ static void dm_request_fn(struct request if (!rq) goto plug_and_out; - if (unlikely(dm_rq_is_flush_request(rq))) { + if (rq->cmd_flags & REQ_FLUSH) { BUG_ON(md->flush_request); md->flush_request = rq; blk_start_request(rq); @@ -2239,7 +2231,7 @@ static void dm_flush(struct mapped_devic bio_init(&md->barrier_bio); md->barrier_bio.bi_bdev = md->bdev; - md->barrier_bio.bi_rw = WRITE_BARRIER; + md->barrier_bio.bi_rw = WRITE_SYNC | REQ_FLUSH; __split_and_process_bio(md, &md->barrier_bio); dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE); @@ -2250,19 +2242,8 @@ static void process_barrier(struct mappe md->barrier_error = 0; dm_flush(md); - - if (!bio_empty_barrier(bio)) { - __split_and_process_bio(md, bio); - dm_flush(md); - } - - if (md->barrier_error != DM_ENDIO_REQUEUE) - bio_endio(bio, 
md->barrier_error); - else { - spin_lock_irq(&md->deferred_lock); - bio_list_add_head(&md->deferred, bio); - spin_unlock_irq(&md->deferred_lock); - } + __split_and_process_bio(md, bio); + dm_flush(md); } /* Index: linux-2.6/include/linux/bio.h =================================================================== --- linux-2.6.orig/include/linux/bio.h 2010-08-03 20:32:11.951274008 +0200 +++ linux-2.6/include/linux/bio.h 2010-08-03 20:36:59.303005325 +0200 @@ -241,10 +241,6 @@ enum rq_flag_bits { #define bio_offset(bio) bio_iovec((bio))->bv_offset #define bio_segments(bio) ((bio)->bi_vcnt - (bio)->bi_idx) #define bio_sectors(bio) ((bio)->bi_size >> 9) -#define bio_empty_barrier(bio) \ - ((bio->bi_rw & REQ_HARDBARRIER) && \ - !bio_has_data(bio) && \ - !(bio->bi_rw & REQ_DISCARD)) static inline unsigned int bio_cur_bytes(struct bio *bio) { Index: linux-2.6/drivers/md/dm-io.c =================================================================== --- linux-2.6.orig/drivers/md/dm-io.c 2010-08-03 20:26:49.685023485 +0200 +++ linux-2.6/drivers/md/dm-io.c 2010-08-03 20:36:59.308004417 +0200 @@ -364,7 +364,7 @@ static void dispatch_io(int rw, unsigned */ for (i = 0; i < num_regions; i++) { *dp = old_pages; - if (where[i].count || (rw & REQ_HARDBARRIER)) + if (where[i].count || (rw & REQ_FLUSH)) do_region(rw, i, where + i, dp, io); }
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-03 18:51 ` [PATCH, RFC 2/2] dm: support REQ_FLUSH directly Christoph Hellwig @ 2010-08-04 4:57 ` Kiyoshi Ueda 2010-08-04 8:54 ` Christoph Hellwig 0 siblings, 1 reply; 155+ messages in thread From: Kiyoshi Ueda @ 2010-08-04 4:57 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel, linux-raid Hi Christoph, On 08/04/2010 03:51 AM +0900, Christoph Hellwig wrote: > Adapt device-mapper to the new world order where even bio based devices > get simple REQ_FLUSH requests for cache flushes, and need to submit > them downwards for implementing barriers. <snip> > Index: linux-2.6/drivers/md/dm.c > =================================================================== > --- linux-2.6.orig/drivers/md/dm.c 2010-08-03 20:26:49.676004139 +0200 > +++ linux-2.6/drivers/md/dm.c 2010-08-03 20:36:59.301005325 +0200 <snip> > @@ -1573,7 +1565,7 @@ static int dm_prep_fn(struct request_que > struct mapped_device *md = q->queuedata; > struct request *clone; > > - if (unlikely(dm_rq_is_flush_request(rq))) > + if (rq->cmd_flags & REQ_FLUSH) > return BLKPREP_OK; > > if (unlikely(rq->special)) { > @@ -1664,7 +1656,7 @@ static void dm_request_fn(struct request > if (!rq) > goto plug_and_out; > > - if (unlikely(dm_rq_is_flush_request(rq))) { > + if (rq->cmd_flags & REQ_FLUSH) { > BUG_ON(md->flush_request); > md->flush_request = rq; > blk_start_request(rq); Current request-based device-mapper's flush code depends on the block-layer's barrier behavior which dispatches only one request at a time when flush is needed. In other words, current request-based device-mapper can't handle other requests while a flush request is in progress. I'll take a look at how I can fix the request-based device-mapper to cope with it. I think it'll take time for careful investigation.
Thanks, Kiyoshi Ueda
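Kiyoshi's point — that the old request-based barrier path assumes only one flush is ever in flight, enforced by the BUG_ON(md->flush_request) in the hunk quoted above — can be modelled in a few lines of user-space C. This is an illustrative sketch, not kernel code; the struct layouts and function names here are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REQ_FLUSH (1u << 0)	/* stand-in for the kernel flag */

struct request { unsigned cmd_flags; };

/* Hypothetical stand-in for the flush bookkeeping in struct mapped_device. */
struct mapped_device { struct request *flush_request; };

/*
 * The old barrier code relied on the block layer dispatching only one
 * flush at a time, so a single md->flush_request slot was enough.
 * Returns false where the real code would hit BUG_ON(md->flush_request),
 * i.e. when a second flush arrives while one is still in progress.
 */
static bool dm_accept_request(struct mapped_device *md, struct request *rq)
{
	if (rq->cmd_flags & REQ_FLUSH) {
		if (md->flush_request)
			return false;
		md->flush_request = rq;
	}
	return true;
}

static void dm_flush_done(struct mapped_device *md)
{
	md->flush_request = NULL;
}
```

Once the block layer may dispatch a flush alongside other requests, this single-slot scheme is exactly what breaks — which is the rework Kiyoshi is describing.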
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-04 4:57 ` Kiyoshi Ueda @ 2010-08-04 8:54 ` Christoph Hellwig 2010-08-05 2:16 ` Jun'ichi Nomura 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-08-04 8:54 UTC (permalink / raw) To: Kiyoshi Ueda Cc: Christoph Hellwig, Jan Kara, jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel, linux-raid On Wed, Aug 04, 2010 at 01:57:37PM +0900, Kiyoshi Ueda wrote: > > - if (unlikely(dm_rq_is_flush_request(rq))) { > > + if (rq->cmd_flags & REQ_FLUSH) { > > BUG_ON(md->flush_request); > > md->flush_request = rq; > > blk_start_request(rq); > > Current request-based device-mapper's flush code depends on > the block-layer's barrier behavior which dispatches only one request > at a time when flush is needed. > In other words, current request-based device-mapper can't handle > other requests while a flush request is in progress. > > I'll take a look how I can fix the request-based device-mapper to > cope with it. I think it'll take time for carefull investigation. Given that request-based device mapper doesn't even look at the block numbers, from what I can see just removing any special casing for REQ_FLUSH should probably do it.
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-04 8:54 ` Christoph Hellwig @ 2010-08-05 2:16 ` Jun'ichi Nomura 2010-08-26 22:50 ` Mike Snitzer 0 siblings, 1 reply; 155+ messages in thread From: Jun'ichi Nomura @ 2010-08-05 2:16 UTC (permalink / raw) To: Christoph Hellwig Cc: Kiyoshi Ueda, Jan Kara, jaxboe, tj, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel, linux-raid Hi Christoph, (08/04/10 17:54), Christoph Hellwig wrote: > On Wed, Aug 04, 2010 at 01:57:37PM +0900, Kiyoshi Ueda wrote: >>> - if (unlikely(dm_rq_is_flush_request(rq))) { >>> + if (rq->cmd_flags & REQ_FLUSH) { >>> BUG_ON(md->flush_request); >>> md->flush_request = rq; >>> blk_start_request(rq); >> >> Current request-based device-mapper's flush code depends on >> the block-layer's barrier behavior which dispatches only one request >> at a time when flush is needed. >> In other words, current request-based device-mapper can't handle >> other requests while a flush request is in progress. >> >> I'll take a look how I can fix the request-based device-mapper to >> cope with it. I think it'll take time for carefull investigation. > > Given that request based device mapper doesn't even look at the > block numbers from what I can see just removing any special casing > for REQ_FLUSH should probably do it. Special casing is necessary because device-mapper may have to send multiple copies of a REQ_FLUSH request to multiple targets, while a normal request is just sent to a single target. Thanks, -- Jun'ichi Nomura, NEC Corporation
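Jun'ichi's asymmetry can be sketched in user-space C: a flush may fan out into one clone per flush slot of each target (num_flush_requests), while a normal request maps to exactly one target. All names here are hypothetical stand-ins, not the real device-mapper API.

```c
#include <assert.h>

/*
 * Illustrative model (not kernel code) of why flushes need special
 * casing in device-mapper: a flush is cloned once per flush slot of
 * each target, while a normal request goes to a single target.
 */
struct dm_target { unsigned num_flush_requests; };

struct dm_table {
	const struct dm_target *targets;
	unsigned num_targets;
};

static unsigned clones_for_flush(const struct dm_table *t)
{
	unsigned i, n = 0;

	for (i = 0; i < t->num_targets; i++)
		n += t->targets[i].num_flush_requests; /* one clone per slot */
	return n;
}

static unsigned clones_for_normal(const struct dm_table *t)
{
	(void)t;
	return 1; /* a normal request is mapped to exactly one target */
}
```

With multipath's num_flush_requests = 1 and a single target, both paths degenerate to one clone — which is the observation Mike makes later in the thread.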
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-05 2:16 ` Jun'ichi Nomura @ 2010-08-26 22:50 ` Mike Snitzer 2010-08-27 0:40 ` Mike Snitzer 2010-08-27 1:43 ` Jun'ichi Nomura 0 siblings, 2 replies; 155+ messages in thread From: Mike Snitzer @ 2010-08-26 22:50 UTC (permalink / raw) To: Jun'ichi Nomura Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj, tytso, swhiteho, chris.mason, dm-devel On Wed, Aug 04 2010 at 10:16pm -0400, Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote: > Hi Christoph, > > (08/04/10 17:54), Christoph Hellwig wrote: > > On Wed, Aug 04, 2010 at 01:57:37PM +0900, Kiyoshi Ueda wrote: > >>> - if (unlikely(dm_rq_is_flush_request(rq))) { > >>> + if (rq->cmd_flags & REQ_FLUSH) { > >>> BUG_ON(md->flush_request); > >>> md->flush_request = rq; > >>> blk_start_request(rq); > >> > >> Current request-based device-mapper's flush code depends on > >> the block-layer's barrier behavior which dispatches only one request > >> at a time when flush is needed. > >> In other words, current request-based device-mapper can't handle > >> other requests while a flush request is in progress. > >> > >> I'll take a look how I can fix the request-based device-mapper to > >> cope with it. I think it'll take time for carefull investigation. > > > > Given that request based device mapper doesn't even look at the > > block numbers from what I can see just removing any special casing > > for REQ_FLUSH should probably do it. > > Special casing is necessary because device-mapper may have to > send multiple copies of REQ_FLUSH request to multiple > targets, while normal request is just sent to single target. Yes, request-based DM is meant to have all the same capabilities as bio-based DM. So in theory it should support multiple targets but in practice it doesn't. 
DM's multipath target is the only consumer of request-based DM and it only ever clones a single flush request (num_flush_requests = 1). So why not remove all of request-based DM's barrier infrastructure and simply rely on the revised block layer to sequence the FLUSH+WRITE request for request-based DM? Given that we do not have a request-based DM target that requires cloning multiple FLUSH requests, it's unused code that is delaying DM support for the new FLUSH+FUA work (NOTE: bio-based DM obviously still needs work in this area). Once we have a need for using request-based DM for something other than multipath we can take a fresh look at implementing rq-based FLUSH+FUA. Mike p.s. I know how hard NEC worked on request-based DM's barrier support; so I'm not suggesting this lightly. For me it just seems like we're carrying complexity in DM that hasn't ever been required.
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-26 22:50 ` Mike Snitzer @ 2010-08-27 0:40 ` Mike Snitzer 2010-08-27 1:20 ` Jamie Lokier 2010-08-27 1:43 ` Jun'ichi Nomura 1 sibling, 1 reply; 155+ messages in thread From: Mike Snitzer @ 2010-08-27 0:40 UTC (permalink / raw) To: Jun'ichi Nomura Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj, tytso, swhiteho, chris.mason, dm-devel On Thu, Aug 26 2010 at 6:50pm -0400, Mike Snitzer <snitzer@redhat.com> wrote: > Once we have a need for using request-based DM for something other than > multipath we can take a fresh look at implementing rq-based FLUSH+FUA. > > Mike > > p.s. I know how hard NEC worked on request-based DM's barrier support; > so I'm not suggesting this lightly. For me it just seems like we're > carrying complexity in DM that hasn't ever been required. To be clear: the piece that I was saying wasn't required is the need for request-based DM to clone a FLUSH to send to multiple targets (saying as much was just a confusing distraction... please ignore that). Anyway, my previous email's question still stands.
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-27 0:40 ` Mike Snitzer @ 2010-08-27 1:20 ` Jamie Lokier 0 siblings, 0 replies; 155+ messages in thread From: Jamie Lokier @ 2010-08-27 1:20 UTC (permalink / raw) To: Mike Snitzer Cc: Jun'ichi Nomura, Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj, tytso, swhiteho, chris.mason, dm-devel Mike Snitzer wrote: > On Thu, Aug 26 2010 at 6:50pm -0400, > Mike Snitzer <snitzer@redhat.com> wrote: > > > Once we have a need for using request-based DM for something other than > > multipath we can take a fresh look at implementing rq-based FLUSH+FUA. > > > > Mike > > > > p.s. I know how hard NEC worked on request-based DM's barrier support; > > so I'm not suggesting this lightly. For me it just seems like we're > > carrying complexity in DM that hasn't ever been required. > > To be clear: the piece that I was saying wasn't required is the need to > for request-based DM to clone a FLUSH to send to multiple targets > (saying as much was just a confusing distraction.. please ignore that). > > Anyway, my previous email's question still stands. On a slightly related note: DM suggests a reason for the lower layer, or the request queues, to implement the trivial optimisation of discarding FLUSHes if there's been no WRITE since the previous FLUSH. That was mentioned elsewhere in this big thread as not being worth even the small effort - because the filesystem is able to make good decisions anyway. But once you have something like RAID or striping, it's quite common for the filesystem to issue a FLUSH when only a subset of the target devices have received WRITEs through the RAID/striping layer since they last received a FLUSH. -- Jamie
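The optimisation Jamie describes — dropping a FLUSH when the device has seen no WRITE since its last FLUSH — amounts to a per-device dirty bit. A minimal user-space sketch (names are illustrative, not a kernel interface):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Per-device dirty tracking: a FLUSH can be discarded when no WRITE
 * has reached this device since the last FLUSH was issued.
 */
struct dev_state { bool dirty; };

static void note_write(struct dev_state *d)
{
	d->dirty = true;
}

/* Returns true only when the flush must actually be sent downward. */
static bool flush_needed(struct dev_state *d)
{
	if (!d->dirty)
		return false;	/* elide: nothing written since last flush */
	d->dirty = false;	/* the flush we issue leaves the device clean */
	return true;
}
```

In a RAID/striping layer this check would run per member device, so a filesystem-issued FLUSH only propagates to the subset of members that actually received WRITEs — exactly the case Jamie says the filesystem alone cannot see.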
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-26 22:50 ` Mike Snitzer 2010-08-27 0:40 ` Mike Snitzer @ 2010-08-27 1:43 ` Jun'ichi Nomura 2010-08-27 4:08 ` Mike Snitzer 1 sibling, 1 reply; 155+ messages in thread From: Jun'ichi Nomura @ 2010-08-27 1:43 UTC (permalink / raw) To: Mike Snitzer Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj, tytso, swhiteho, chris.mason, dm-devel Hi Mike, (08/27/10 07:50), Mike Snitzer wrote: >> Special casing is necessary because device-mapper may have to >> send multiple copies of REQ_FLUSH request to multiple >> targets, while normal request is just sent to single target. > > Yes, request-based DM is meant to have all the same capabilities as > bio-based DM. So in theory it should support multiple targets but in > practice it doesn't. DM's multipath target is the only consumer of > request-based DM and it only ever clones a single flush request > (num_flush_requests = 1). This is correct. But, > So why not remove all of request-based DM's barrier infrastructure and > simply rely on the revised block layer to sequence the FLUSH+WRITE > request for request-based DM? > > Given that we do not have a request-based DM target that requires > cloning multiple FLUSH requests its unused code that is delaying DM > support for the new FLUSH+FUA work (NOTE: bio-based DM obviously still > needs work in this area). the above mentioned 'special casing' is not a hard part. See the attached patch. The hard part is discerning the error type for flush failure as discussed in the other thread. And as Kiyoshi wrote, that's an existing problem so it can be worked on as a separate issue than the new FLUSH work. Thanks, -- Jun'ichi Nomura, NEC Corporation Cope with new sequencing of flush requests in the block layer. 
Request-based dm used to depend on the barrier sequencer in the block layer in that, when a flush request is dispatched, there are no other requests in-flight. So it reused md->pending counter for checking completion of cloned flush requests. This patch separates the pending counter for flush requests as a preparation for the new FLUSH work, where a flush request can be dispatched while other normal requests are in-flight. Index: linux-2.6.36-rc2/drivers/md/dm.c =================================================================== --- linux-2.6.36-rc2.orig/drivers/md/dm.c +++ linux-2.6.36-rc2/drivers/md/dm.c @@ -162,6 +162,7 @@ struct mapped_device { /* A pointer to the currently processing pre/post flush request */ struct request *flush_request; + atomic_t flush_pending; /* * The current mapping. @@ -777,10 +778,16 @@ static void store_barrier_error(struct m * the md may be freed in dm_put() at the end of this function. * Or do dm_get() before calling this function and dm_put() later. */ -static void rq_completed(struct mapped_device *md, int rw, int run_queue) +static void rq_completed(struct mapped_device *md, int rw, int run_queue, bool is_flush) { atomic_dec(&md->pending[rw]); + if (is_flush) { + atomic_dec(&md->flush_pending); + if (!atomic_read(&md->flush_pending)) + wake_up(&md->wait); + } + /* nudge anyone waiting on suspend queue */ if (!md_in_flight(md)) wake_up(&md->wait); @@ -837,7 +844,7 @@ static void dm_end_request(struct reques } else blk_end_request_all(rq, error); - rq_completed(md, rw, run_queue); + rq_completed(md, rw, run_queue, is_barrier); } static void dm_unprep_request(struct request *rq) @@ -880,7 +887,7 @@ void dm_requeue_unmapped_request(struct blk_requeue_request(q, rq); spin_unlock_irqrestore(q->queue_lock, flags); - rq_completed(md, rw, 0); + rq_completed(md, rw, 0, false); } EXPORT_SYMBOL_GPL(dm_requeue_unmapped_request); @@ -1993,6 +2000,7 @@ static struct mapped_device *alloc_dev(i atomic_set(&md->pending[0], 0);
atomic_set(&md->pending[1], 0); + atomic_set(&md->flush_pending, 0); init_waitqueue_head(&md->wait); INIT_WORK(&md->work, dm_wq_work); INIT_WORK(&md->barrier_work, dm_rq_barrier_work); @@ -2375,7 +2383,7 @@ void dm_put(struct mapped_device *md) } EXPORT_SYMBOL_GPL(dm_put); -static int dm_wait_for_completion(struct mapped_device *md, int interruptible) +static int dm_wait_for_completion(struct mapped_device *md, int interruptible, bool for_flush) { int r = 0; DECLARE_WAITQUEUE(wait, current); @@ -2388,6 +2396,8 @@ static int dm_wait_for_completion(struct set_current_state(interruptible); smp_mb(); + if (for_flush && !atomic_read(&md->flush_pending)) + break; if (!md_in_flight(md)) break; @@ -2408,14 +2418,14 @@ static int dm_wait_for_completion(struct static void dm_flush(struct mapped_device *md) { - dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE); + dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE, false); bio_init(&md->barrier_bio); md->barrier_bio.bi_bdev = md->bdev; md->barrier_bio.bi_rw = WRITE_BARRIER; __split_and_process_bio(md, &md->barrier_bio); - dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE); + dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE, false); } static void process_barrier(struct mapped_device *md, struct bio *bio) @@ -2512,11 +2522,12 @@ static int dm_rq_barrier(struct mapped_d clone = clone_rq(md->flush_request, md, GFP_NOIO); dm_rq_set_target_request_nr(clone, j); atomic_inc(&md->pending[rq_data_dir(clone)]); + atomic_inc(&md->flush_pending); map_request(ti, clone, md); } } - dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE); + dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE, true); dm_table_put(map); return md->barrier_error; @@ -2705,7 +2716,7 @@ int dm_suspend(struct mapped_device *md, * We call dm_wait_for_completion to wait for all existing requests * to finish. 
*/ - r = dm_wait_for_completion(md, TASK_INTERRUPTIBLE); + r = dm_wait_for_completion(md, TASK_INTERRUPTIBLE, false); down_write(&md->io_lock); if (noflush)
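The core idea of the patch above — a dedicated flush_pending counter so flush completion can be waited on while other I/O remains in flight — can be modelled without the kernel's atomics and wait queues. A simplified, single-threaded sketch with plain ints (illustrative only):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the patch: md->pending counts all in-flight
 * requests, while a separate flush_pending counts only flush clones,
 * so flush completion is observable even with other I/O outstanding.
 * Plain ints stand in for atomic_t and the wait-queue machinery.
 */
struct md_counters {
	int pending;		/* all in-flight requests */
	int flush_pending;	/* in-flight flush clones only */
};

static void start_rq(struct md_counters *c, bool is_flush)
{
	c->pending++;
	if (is_flush)
		c->flush_pending++;
}

static void end_rq(struct md_counters *c, bool is_flush)
{
	c->pending--;
	if (is_flush)
		c->flush_pending--;
}

/* True once every flush clone completed, even if other I/O is pending. */
static bool flush_done(const struct md_counters *c)
{
	return c->flush_pending == 0;
}
```

This is why the patch threads a for_flush flag into dm_wait_for_completion(): the flush path waits only for flush_pending to drain, instead of waiting for all of md->pending as the old barrier code did.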
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-27 1:43 ` Jun'ichi Nomura @ 2010-08-27 4:08 ` Mike Snitzer 2010-08-27 5:52 ` Jun'ichi Nomura 0 siblings, 1 reply; 155+ messages in thread From: Mike Snitzer @ 2010-08-27 4:08 UTC (permalink / raw) To: Jun'ichi Nomura Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj, tytso, swhiteho, chris.mason, dm-devel On Thu, Aug 26 2010 at 9:43pm -0400, Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote: > Hi Mike, > > (08/27/10 07:50), Mike Snitzer wrote: > >> Special casing is necessary because device-mapper may have to > >> send multiple copies of REQ_FLUSH request to multiple > >> targets, while normal request is just sent to single target. > > > > Yes, request-based DM is meant to have all the same capabilities as > > bio-based DM. So in theory it should support multiple targets but in > > practice it doesn't. DM's multipath target is the only consumer of > > request-based DM and it only ever clones a single flush request > > (num_flush_requests = 1). > > This is correct. But, > > > So why not remove all of request-based DM's barrier infrastructure and > > simply rely on the revised block layer to sequence the FLUSH+WRITE > > request for request-based DM? > > > > Given that we do not have a request-based DM target that requires > > cloning multiple FLUSH requests its unused code that is delaying DM > > support for the new FLUSH+FUA work (NOTE: bio-based DM obviously still > > needs work in this area). > > the above mentioned 'special casing' is not a hard part. > See the attached patch. Yes, Tejun suggested something like this in one of the threads. Thanks for implementing it. But do you agree that the request-based barrier code (added in commit d0bcb8786) could be reverted given the new FLUSH work? We no longer need waiting now that ordering isn't a concern. Especially so given rq-based doesn't support multiple targets. 
As you know, from dm_table_set_type: /* * Request-based dm supports only tables that have a single target now. * To support multiple targets, request splitting support is needed, * and that needs lots of changes in the block-layer. * (e.g. request completion process for partial completion.) */ I think we need to at least benchmark the performance of dm-mpath without any of this extra, soon to be unnecessary, code. Maybe my concern is overblown... > The hard part is discerning the error type for flush failure > as discussed in the other thread. > And as Kiyoshi wrote, that's an existing problem so it can > be worked on as a separate issue than the new FLUSH work. Right, Mike Christie will be refreshing his patchset that should enable us to resolve that separate issue. Thanks, Mike
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-27 4:08 ` Mike Snitzer @ 2010-08-27 5:52 ` Jun'ichi Nomura 2010-08-27 14:13 ` Mike Snitzer 0 siblings, 1 reply; 155+ messages in thread From: Jun'ichi Nomura @ 2010-08-27 5:52 UTC (permalink / raw) To: Mike Snitzer Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj, tytso, swhiteho, chris.mason, dm-devel Hi Mike, (08/27/10 13:08), Mike Snitzer wrote: >> the above mentioned 'special casing' is not a hard part. >> See the attached patch. > > Yes, Tejun suggested something like this in one of the threads. Thanks > for implementing it. > > But do you agree that the request-based barrier code (added in commit > d0bcb8786) could be reverted given the new FLUSH work? No, it's a separate thing. If we don't need to care about the case where multiple clones of flush request are necessary, the special casing of flush request can be removed regardless of the new FLUSH work. > We no longer need waiting now that ordering isn't a concern. Especially The waiting is not for ordering, but for multiple clones. > so given rq-based doesn't support multiple targets. As you know, from > dm_table_set_type: > > /* > * Request-based dm supports only tables that have a single target now. > * To support multiple targets, request splitting support is needed, > * and that needs lots of changes in the block-layer. > * (e.g. request completion process for partial completion.) > */ This comment is about multiple targets. The special code for barrier is for single target whose num_flush_requests > 1. Different thing. > I think we need to at least benchmark the performance of dm-mpath > without any of this extra, soon to be unnecessary, code. If there will be no need for supporting a request-based target with num_flush_requests > 1, the special handling of flush can be removed. 
And since there is no such target in the current tree, I don't object if you remove that part of code for good reason. Thanks, -- Jun'ichi Nomura, NEC Corporation
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-27 5:52 ` Jun'ichi Nomura @ 2010-08-27 14:13 ` Mike Snitzer 2010-08-30 4:45 ` Jun'ichi Nomura 0 siblings, 1 reply; 155+ messages in thread From: Mike Snitzer @ 2010-08-27 14:13 UTC (permalink / raw) To: Jun'ichi Nomura Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj, tytso, swhiteho, chris.mason, dm-devel On Fri, Aug 27 2010 at 1:52am -0400, Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote: > Hi Mike, > > (08/27/10 13:08), Mike Snitzer wrote: > > But do you agree that the request-based barrier code (added in commit > > d0bcb8786) could be reverted given the new FLUSH work? > > No, it's a separate thing. > If we don't need to care about the case where multiple clones > of flush request are necessary, the special casing of flush > request can be removed regardless of the new FLUSH work. Ah, yes thanks for clarifying. But we've never cared about multiple clones of a flush so it's odd that such elaborate infrastructure was introduced without a need. > > We no longer need waiting now that ordering isn't a concern. Especially > > The waiting is not for ordering, but for multiple clones. > > > so given rq-based doesn't support multiple targets. As you know, from > > dm_table_set_type: > > > > /* > > * Request-based dm supports only tables that have a single target now. > > * To support multiple targets, request splitting support is needed, > > * and that needs lots of changes in the block-layer. > > * (e.g. request completion process for partial completion.) > > */ > > This comment is about multiple targets. > The special code for barrier is for single target whose > num_flush_requests > 1. Different thing. Yes, I need to not send mail just before going to bed... > > I think we need to at least benchmark the performance of dm-mpath > > without any of this extra, soon to be unnecessary, code.
> > If there will be no need for supporting a request-based target > with num_flush_requests > 1, the special handling of flush > can be removed. > > And since there is no such target in the current tree, > I don't object if you remove that part of code for good reason. OK, certainly something to keep in mind. But _really_ knowing the multipath FLUSH+FUA performance difference (extra special-case code vs none) requires a full FLUSH conversion of request-based DM anyway. In general, request-based DM's barrier/flush code does carry a certain maintenance overhead. It is quite a bit of distracting code in the core DM which isn't buying us anything... so we _could_ just remove it and never look back (until we have some specific need for num_flush_requests > 1 in rq-based DM). Mike
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-27 14:13 ` Mike Snitzer @ 2010-08-30 4:45 ` Jun'ichi Nomura 2010-08-30 8:33 ` Tejun Heo 0 siblings, 1 reply; 155+ messages in thread From: Jun'ichi Nomura @ 2010-08-30 4:45 UTC (permalink / raw) To: Mike Snitzer Cc: Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tj, tytso, swhiteho, chris.mason, dm-devel Hi Mike, (08/27/10 23:13), Mike Snitzer wrote: >> If there will be no need for supporting a request-based target >> with num_flush_requests > 1, the special handling of flush >> can be removed. >> >> And since there is no such target in the current tree, >> I don't object if you remove that part of code for good reason. > > OK, certainly something to keep in mind. But _really_ knowing the > multipath FLUSH+FUA performance difference (extra special-case code vs > none) requires a full FLUSH conversion of request-based DM anyway. > > In general, request-based DM's barrier/flush code does carry a certain > maintenance overhead. It is quite a bit of distracting code in the core > DM which isn't buying us anything.. so we _could_ just remove it and > never look back (until we have some specific need for num_flush_requests >> 1 in rq-based DM). So, I'm not objecting to your idea. Could you please create a patch to remove that? Thanks, -- Jun'ichi Nomura, NEC Corporation
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-30 4:45 ` Jun'ichi Nomura @ 2010-08-30 8:33 ` Tejun Heo 2010-08-30 12:43 ` Mike Snitzer 0 siblings, 1 reply; 155+ messages in thread From: Tejun Heo @ 2010-08-30 8:33 UTC (permalink / raw) To: Jun'ichi Nomura Cc: Mike Snitzer, Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tytso, swhiteho, chris.mason, dm-devel On 08/30/2010 06:45 AM, Jun'ichi Nomura wrote: > Hi Mike, > > (08/27/10 23:13), Mike Snitzer wrote: >>> If there will be no need for supporting a request-based target >>> with num_flush_requests > 1, the special handling of flush >>> can be removed. >>> >>> And since there is no such target in the current tree, >>> I don't object if you remove that part of code for good reason. >> >> OK, certainly something to keep in mind. But _really_ knowing the >> multipath FLUSH+FUA performance difference (extra special-case code vs >> none) requires a full FLUSH conversion of request-based DM anyway. >> >> In general, request-based DM's barrier/flush code does carry a certain >> maintenance overhead. It is quite a bit of distracting code in the core >> DM which isn't buying us anything.. so we _could_ just remove it and >> never look back (until we have some specific need for num_flush_requests >>> 1 in rq-based DM). > > So, I'm not objecting to your idea. > Could you please create a patch to remove that? I did that yesterday. Will post the patch soon. Thanks. -- tejun
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-30 8:33 ` Tejun Heo @ 2010-08-30 12:43 ` Mike Snitzer 2010-08-30 12:45 ` Tejun Heo 0 siblings, 1 reply; 155+ messages in thread From: Mike Snitzer @ 2010-08-30 12:43 UTC (permalink / raw) To: Tejun Heo Cc: Jun'ichi Nomura, Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tytso, swhiteho, chris.mason, dm-devel On Mon, Aug 30 2010 at 4:33am -0400, Tejun Heo <tj@kernel.org> wrote: > On 08/30/2010 06:45 AM, Jun'ichi Nomura wrote: > > Hi Mike, > > > > (08/27/10 23:13), Mike Snitzer wrote: > >>> If there will be no need for supporting a request-based target > >>> with num_flush_requests > 1, the special handling of flush > >>> can be removed. > >>> > >>> And since there is no such target in the current tree, > >>> I don't object if you remove that part of code for good reason. > >> > >> OK, certainly something to keep in mind. But _really_ knowing the > >> multipath FLUSH+FUA performance difference (extra special-case code vs > >> none) requires a full FLUSH conversion of request-based DM anyway. > >> > >> In general, request-based DM's barrier/flush code does carry a certain > >> maintenance overhead. It is quite a bit of distracting code in the core > >> DM which isn't buying us anything.. so we _could_ just remove it and > >> never look back (until we have some specific need for num_flush_requests > >>> 1 in rq-based DM). > > > > So, I'm not objecting to your idea. > > Could you please create a patch to remove that? > > I did that yesterday. Will post the patch soon. I did it yesterday also, mine builds on your previous DM patchset... I'll review your recent patchset, from today, to compare and will share my findings. I was hoping we could get the current request-based code working with your new FLUSH+FUA work without removing support for num_flush_requests (yet). 
And then layer in the removal to give us the before and after so we would know the overhead associated with keeping/dropping num_flush_requests. But like I said earlier "we _could_ just remove it and never look back". Thanks, Mike ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [PATCH, RFC 2/2] dm: support REQ_FLUSH directly 2010-08-30 12:43 ` Mike Snitzer @ 2010-08-30 12:45 ` Tejun Heo 0 siblings, 0 replies; 155+ messages in thread From: Tejun Heo @ 2010-08-30 12:45 UTC (permalink / raw) To: Mike Snitzer Cc: Jun'ichi Nomura, Christoph Hellwig, Kiyoshi Ueda, Jan Kara, linux-scsi, jaxboe, linux-raid, linux-fsdevel, James.Bottomley, konishi.ryusuke, tytso, swhiteho, chris.mason, dm-devel Hello, On 08/30/2010 02:43 PM, Mike Snitzer wrote: > I did it yesterday also, mine builds on your previous DM patchset... > > I'll review your recent patchset, from today, to compare and will share > my findings. Thanks. :-) > I was hoping we could get the current request-based code working with > your new FLUSH+FUA work without removing support for num_flush_requests > (yet). And then layer in the removal to give us the before and after so > we would know the overhead associated with keeping/dropping > num_flush_requests. But like I said earlier "we _could_ just remove it > and never look back". I tried but it's not very easy because the original implementation depended on the block layer suppressing other requests while flush sequence is in progress. The painful part was that block layer no longer sorts requeued flush requests in front of other front inserted requests, so explicit queue suppressing can't be implemented simply. Another route would be adding a separate wait/wakeup logic for flushes (someone posted a demo patch for that which was almost there but not fully), but it seemed like a aimless effort to build a new facility to rip it out in the next patch. After all, the whole thing seemed somewhat pointless given that writes can't be routed to multiple targets (if writes can't target multiple devices, flushes won't need to either). Thanks. -- tejun ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [PATCH, RFC] relaxed barriers 2010-08-03 18:49 ` [PATCH, RFC 1/2] relaxed cache flushes Christoph Hellwig 2010-08-03 18:51 ` [PATCH, RFC 2/2] dm: support REQ_FLUSH directly Christoph Hellwig @ 2010-08-06 16:04 ` Tejun Heo 2010-08-06 23:34 ` Christoph Hellwig 2010-08-07 10:13 ` [PATCH REPOST " Tejun Heo 1 sibling, 2 replies; 155+ messages in thread From: Tejun Heo @ 2010-08-06 16:04 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel, linux-raid Hello, So, here's my shot at it. After this patch, a barrier no longer dictates the ordering of other requests. The block layer sequences the barrier request without interfering with other requests (not even elevator draining). Multiple pending barriers are handled by saving them in a separate queue and servicing them one by one. Basically, barrier sequences form a separate FIFO command stream independent of other requests, and all the ordering between the two streams is the filesystem's responsibility. Ordered tag support is dropped as no one seems to be making any meaningful use of it. I'm fairly skeptical about its usefulness anyway. The only thing ordered tag saves is latencies between command completions and issues in barrier sequences, which isn't much to begin with, and it puts additional ordering restrictions compared to ordering in software (ordered tag commands will unnecessarily affect processing of simple tag commands). Lightly tested for all three BAR (!WC), FLUSH and FUA cases. The multiple pending barrier code path isn't tested yet. Christoph, does this look like something the filesystems can use or have I misunderstood something? Thanks. 
NOT_SIGNED_OFF_YET --- block/blk-barrier.c | 253 +++++++++++++++---------------------------- block/blk-core.c | 31 ++--- block/blk.h | 5 block/elevator.c | 80 +------------ drivers/block/brd.c | 2 drivers/block/loop.c | 2 drivers/block/osdblk.c | 2 drivers/block/pktcdvd.c | 1 drivers/block/ps3disk.c | 3 drivers/block/virtio_blk.c | 4 drivers/block/xen-blkfront.c | 2 drivers/ide/ide-disk.c | 4 drivers/md/dm.c | 3 drivers/mmc/card/queue.c | 2 drivers/s390/block/dasd.c | 2 drivers/scsi/sd.c | 8 - include/linux/blkdev.h | 59 +++------- include/linux/elevator.h | 6 - 18 files changed, 154 insertions(+), 315 deletions(-) Index: work/block/blk-barrier.c =================================================================== --- work.orig/block/blk-barrier.c +++ work/block/blk-barrier.c @@ -9,6 +9,8 @@ #include "blk.h" +static struct request *queue_next_ordseq(struct request_queue *q); + /** * blk_queue_ordered - does this queue support ordered writes * @q: the request queue @@ -31,13 +33,8 @@ int blk_queue_ordered(struct request_que return -EINVAL; } - if (ordered != QUEUE_ORDERED_NONE && - ordered != QUEUE_ORDERED_DRAIN && - ordered != QUEUE_ORDERED_DRAIN_FLUSH && - ordered != QUEUE_ORDERED_DRAIN_FUA && - ordered != QUEUE_ORDERED_TAG && - ordered != QUEUE_ORDERED_TAG_FLUSH && - ordered != QUEUE_ORDERED_TAG_FUA) { + if (ordered != QUEUE_ORDERED_NONE && ordered != QUEUE_ORDERED_BAR && + ordered != QUEUE_ORDERED_FLUSH && ordered != QUEUE_ORDERED_FUA) { printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered); return -EINVAL; } @@ -60,38 +57,10 @@ unsigned blk_ordered_cur_seq(struct requ return 1 << ffz(q->ordseq); } -unsigned blk_ordered_req_seq(struct request *rq) +static struct request *blk_ordered_complete_seq(struct request_queue *q, + unsigned seq, int error) { - struct request_queue *q = rq->q; - - BUG_ON(q->ordseq == 0); - - if (rq == &q->pre_flush_rq) - return QUEUE_ORDSEQ_PREFLUSH; - if (rq == &q->bar_rq) - return QUEUE_ORDSEQ_BAR; - if (rq == &q->post_flush_rq) - 
return QUEUE_ORDSEQ_POSTFLUSH; - - /* - * !fs requests don't need to follow barrier ordering. Always - * put them at the front. This fixes the following deadlock. - * - * http://thread.gmane.org/gmane.linux.kernel/537473 - */ - if (!blk_fs_request(rq)) - return QUEUE_ORDSEQ_DRAIN; - - if ((rq->cmd_flags & REQ_ORDERED_COLOR) == - (q->orig_bar_rq->cmd_flags & REQ_ORDERED_COLOR)) - return QUEUE_ORDSEQ_DRAIN; - else - return QUEUE_ORDSEQ_DONE; -} - -bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error) -{ - struct request *rq; + struct request *rq = NULL; if (error && !q->orderr) q->orderr = error; @@ -99,16 +68,22 @@ bool blk_ordered_complete_seq(struct req BUG_ON(q->ordseq & seq); q->ordseq |= seq; - if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) - return false; - - /* - * Okay, sequence complete. - */ - q->ordseq = 0; - rq = q->orig_bar_rq; - __blk_end_request_all(rq, q->orderr); - return true; + if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) { + /* not complete yet, queue the next ordered sequence */ + rq = queue_next_ordseq(q); + } else { + /* complete this barrier request */ + __blk_end_request_all(q->orig_bar_rq, q->orderr); + q->orig_bar_rq = NULL; + q->ordseq = 0; + + /* dispatch the next barrier if there's one */ + if (!list_empty(&q->pending_barriers)) { + rq = list_entry_rq(q->pending_barriers.next); + list_move(&rq->queuelist, &q->queue_head); + } + } + return rq; } static void pre_flush_end_io(struct request *rq, int error) @@ -129,21 +104,10 @@ static void post_flush_end_io(struct req blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error); } -static void queue_flush(struct request_queue *q, unsigned which) +static void queue_flush(struct request_queue *q, struct request *rq, + rq_end_io_fn *end_io) { - struct request *rq; - rq_end_io_fn *end_io; - - if (which == QUEUE_ORDERED_DO_PREFLUSH) { - rq = &q->pre_flush_rq; - end_io = pre_flush_end_io; - } else { - rq = &q->post_flush_rq; - end_io = post_flush_end_io; - } - 
blk_rq_init(q, rq); - rq->cmd_flags = REQ_HARDBARRIER; rq->rq_disk = q->bar_rq.rq_disk; rq->end_io = end_io; q->prepare_flush_fn(q, rq); @@ -151,130 +115,93 @@ static void queue_flush(struct request_q elv_insert(q, rq, ELEVATOR_INSERT_FRONT); } -static inline bool start_ordered(struct request_queue *q, struct request **rqp) +static struct request *queue_next_ordseq(struct request_queue *q) { - struct request *rq = *rqp; - unsigned skip = 0; - - q->orderr = 0; - q->ordered = q->next_ordered; - q->ordseq |= QUEUE_ORDSEQ_STARTED; - - /* - * For an empty barrier, there's no actual BAR request, which - * in turn makes POSTFLUSH unnecessary. Mask them off. - */ - if (!blk_rq_sectors(rq)) { - q->ordered &= ~(QUEUE_ORDERED_DO_BAR | - QUEUE_ORDERED_DO_POSTFLUSH); - /* - * Empty barrier on a write-through device w/ ordered - * tag has no command to issue and without any command - * to issue, ordering by tag can't be used. Drain - * instead. - */ - if ((q->ordered & QUEUE_ORDERED_BY_TAG) && - !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) - q->ordered &= ~QUEUE_ORDERED_BY_TAG; - } - - /* stash away the original request */ - blk_dequeue_request(rq); - q->orig_bar_rq = rq; - rq = NULL; - - /* - * Queue ordered sequence. As we stack them at the head, we - * need to queue in reverse order. Note that we rely on that - * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs - * request gets inbetween ordered sequence. 
- */ - if (q->ordered & QUEUE_ORDERED_DO_POSTFLUSH) { - queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH); - rq = &q->post_flush_rq; - } else - skip |= QUEUE_ORDSEQ_POSTFLUSH; + struct request *rq = &q->bar_rq; - if (q->ordered & QUEUE_ORDERED_DO_BAR) { - rq = &q->bar_rq; + switch (blk_ordered_cur_seq(q)) { + case QUEUE_ORDSEQ_PREFLUSH: + queue_flush(q, rq, pre_flush_end_io); + break; + case QUEUE_ORDSEQ_BAR: /* initialize proxy request and queue it */ blk_rq_init(q, rq); - if (bio_data_dir(q->orig_bar_rq->bio) == WRITE) - rq->cmd_flags |= REQ_RW; + init_request_from_bio(rq, q->orig_bar_rq->bio); + rq->cmd_flags &= ~REQ_HARDBARRIER; if (q->ordered & QUEUE_ORDERED_DO_FUA) rq->cmd_flags |= REQ_FUA; - init_request_from_bio(rq, q->orig_bar_rq->bio); rq->end_io = bar_end_io; elv_insert(q, rq, ELEVATOR_INSERT_FRONT); - } else - skip |= QUEUE_ORDSEQ_BAR; + break; - if (q->ordered & QUEUE_ORDERED_DO_PREFLUSH) { - queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH); - rq = &q->pre_flush_rq; - } else - skip |= QUEUE_ORDSEQ_PREFLUSH; + case QUEUE_ORDSEQ_POSTFLUSH: + queue_flush(q, rq, post_flush_end_io); + break; - if (!(q->ordered & QUEUE_ORDERED_BY_TAG) && queue_in_flight(q)) - rq = NULL; - else - skip |= QUEUE_ORDSEQ_DRAIN; - - *rqp = rq; - - /* - * Complete skipped sequences. If whole sequence is complete, - * return false to tell elevator that this request is gone. - */ - return !blk_ordered_complete_seq(q, skip, 0); + default: + BUG(); + } + return rq; } -bool blk_do_ordered(struct request_queue *q, struct request **rqp) +struct request *blk_do_ordered(struct request_queue *q, struct request *rq) { - struct request *rq = *rqp; - const int is_barrier = blk_fs_request(rq) && blk_barrier_rq(rq); + unsigned skip = 0; - if (!q->ordseq) { - if (!is_barrier) - return true; - - if (q->next_ordered != QUEUE_ORDERED_NONE) - return start_ordered(q, rqp); - else { - /* - * Queue ordering not supported. Terminate - * with prejudice. 
- */ - blk_dequeue_request(rq); - __blk_end_request_all(rq, -EOPNOTSUPP); - *rqp = NULL; - return false; - } + if (!blk_barrier_rq(rq)) + return rq; + + if (q->ordseq) { + /* + * Barrier is already in progress and they can't be + * processed in parallel. Queue for later processing. + */ + list_move_tail(&rq->queuelist, &q->pending_barriers); + return NULL; + } + + if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) { + /* + * Queue ordering not supported. Terminate + * with prejudice. + */ + blk_dequeue_request(rq); + __blk_end_request_all(rq, -EOPNOTSUPP); + return NULL; } /* - * Ordered sequence in progress + * Start a new ordered sequence */ + q->orderr = 0; + q->ordered = q->next_ordered; + q->ordseq |= QUEUE_ORDSEQ_STARTED; - /* Special requests are not subject to ordering rules. */ - if (!blk_fs_request(rq) && - rq != &q->pre_flush_rq && rq != &q->post_flush_rq) - return true; - - if (q->ordered & QUEUE_ORDERED_BY_TAG) { - /* Ordered by tag. Blocking the next barrier is enough. */ - if (is_barrier && rq != &q->bar_rq) - *rqp = NULL; - } else { - /* Ordered by draining. Wait for turn. */ - WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q)); - if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q)) - *rqp = NULL; - } + /* + * For an empty barrier, there's no actual BAR request, which + * in turn makes POSTFLUSH unnecessary. Mask them off. 
+ */ + if (!blk_rq_sectors(rq)) + q->ordered &= ~(QUEUE_ORDERED_DO_BAR | + QUEUE_ORDERED_DO_POSTFLUSH); + + /* stash away the original request */ + blk_dequeue_request(rq); + q->orig_bar_rq = rq; + + if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) + skip |= QUEUE_ORDSEQ_PREFLUSH; + + if (!(q->ordered & QUEUE_ORDERED_DO_BAR)) + skip |= QUEUE_ORDSEQ_BAR; + + if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH)) + skip |= QUEUE_ORDSEQ_POSTFLUSH; - return true; + /* complete skipped sequences and return the first sequence */ + return blk_ordered_complete_seq(q, skip, 0); } static void bio_end_empty_barrier(struct bio *bio, int err) Index: work/include/linux/blkdev.h =================================================================== --- work.orig/include/linux/blkdev.h +++ work/include/linux/blkdev.h @@ -106,7 +106,6 @@ enum rq_flag_bits { __REQ_FAILED, /* set if the request failed */ __REQ_QUIET, /* don't worry about errors */ __REQ_PREEMPT, /* set for "ide_preempt" requests */ - __REQ_ORDERED_COLOR, /* is before or after barrier */ __REQ_RW_SYNC, /* request is sync (sync write or read) */ __REQ_ALLOCED, /* request came from our alloc pool */ __REQ_RW_META, /* metadata io request */ @@ -135,7 +134,6 @@ enum rq_flag_bits { #define REQ_FAILED (1 << __REQ_FAILED) #define REQ_QUIET (1 << __REQ_QUIET) #define REQ_PREEMPT (1 << __REQ_PREEMPT) -#define REQ_ORDERED_COLOR (1 << __REQ_ORDERED_COLOR) #define REQ_RW_SYNC (1 << __REQ_RW_SYNC) #define REQ_ALLOCED (1 << __REQ_ALLOCED) #define REQ_RW_META (1 << __REQ_RW_META) @@ -437,9 +435,10 @@ struct request_queue * reserved for flush operations */ unsigned int ordered, next_ordered, ordseq; - int orderr, ordcolor; - struct request pre_flush_rq, bar_rq, post_flush_rq; - struct request *orig_bar_rq; + int orderr; + struct request bar_rq; + struct request *orig_bar_rq; + struct list_head pending_barriers; struct mutex sysfs_lock; @@ -543,47 +542,33 @@ enum { * Hardbarrier is supported with one of the following methods. 
* * NONE : hardbarrier unsupported - * DRAIN : ordering by draining is enough - * DRAIN_FLUSH : ordering by draining w/ pre and post flushes - * DRAIN_FUA : ordering by draining w/ pre flush and FUA write - * TAG : ordering by tag is enough - * TAG_FLUSH : ordering by tag w/ pre and post flushes - * TAG_FUA : ordering by tag w/ pre flush and FUA write - */ - QUEUE_ORDERED_BY_TAG = 0x02, - QUEUE_ORDERED_DO_PREFLUSH = 0x10, - QUEUE_ORDERED_DO_BAR = 0x20, - QUEUE_ORDERED_DO_POSTFLUSH = 0x40, - QUEUE_ORDERED_DO_FUA = 0x80, + * BAR : writing out barrier is enough + * FLUSH : barrier and surrounding pre and post flushes + * FUA : FUA barrier w/ pre flush + */ + QUEUE_ORDERED_DO_PREFLUSH = 1 << 0, + QUEUE_ORDERED_DO_BAR = 1 << 1, + QUEUE_ORDERED_DO_POSTFLUSH = 1 << 2, + QUEUE_ORDERED_DO_FUA = 1 << 3, - QUEUE_ORDERED_NONE = 0x00, + QUEUE_ORDERED_NONE = 0, - QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_DO_BAR, - QUEUE_ORDERED_DRAIN_FLUSH = QUEUE_ORDERED_DRAIN | + QUEUE_ORDERED_BAR = QUEUE_ORDERED_DO_BAR, + QUEUE_ORDERED_FLUSH = QUEUE_ORDERED_DO_BAR | QUEUE_ORDERED_DO_PREFLUSH | QUEUE_ORDERED_DO_POSTFLUSH, - QUEUE_ORDERED_DRAIN_FUA = QUEUE_ORDERED_DRAIN | - QUEUE_ORDERED_DO_PREFLUSH | - QUEUE_ORDERED_DO_FUA, - - QUEUE_ORDERED_TAG = QUEUE_ORDERED_BY_TAG | - QUEUE_ORDERED_DO_BAR, - QUEUE_ORDERED_TAG_FLUSH = QUEUE_ORDERED_TAG | - QUEUE_ORDERED_DO_PREFLUSH | - QUEUE_ORDERED_DO_POSTFLUSH, - QUEUE_ORDERED_TAG_FUA = QUEUE_ORDERED_TAG | + QUEUE_ORDERED_FUA = QUEUE_ORDERED_DO_BAR | QUEUE_ORDERED_DO_PREFLUSH | QUEUE_ORDERED_DO_FUA, /* * Ordered operation sequence */ - QUEUE_ORDSEQ_STARTED = 0x01, /* flushing in progress */ - QUEUE_ORDSEQ_DRAIN = 0x02, /* waiting for the queue to be drained */ - QUEUE_ORDSEQ_PREFLUSH = 0x04, /* pre-flushing in progress */ - QUEUE_ORDSEQ_BAR = 0x08, /* original barrier req in progress */ - QUEUE_ORDSEQ_POSTFLUSH = 0x10, /* post-flushing in progress */ - QUEUE_ORDSEQ_DONE = 0x20, + QUEUE_ORDSEQ_STARTED = (1 << 0), /* flushing in progress */ + 
QUEUE_ORDSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */ + QUEUE_ORDSEQ_BAR = (1 << 2), /* barrier write in progress */ + QUEUE_ORDSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */ + QUEUE_ORDSEQ_DONE = (1 << 4), }; #define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags) @@ -965,10 +950,8 @@ extern void blk_queue_rq_timed_out(struc extern void blk_queue_rq_timeout(struct request_queue *, unsigned int); extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev); extern int blk_queue_ordered(struct request_queue *, unsigned, prepare_flush_fn *); -extern bool blk_do_ordered(struct request_queue *, struct request **); extern unsigned blk_ordered_cur_seq(struct request_queue *); extern unsigned blk_ordered_req_seq(struct request *); -extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int); extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *); extern void blk_dump_rq_flags(struct request *, char *); Index: work/drivers/block/brd.c =================================================================== --- work.orig/drivers/block/brd.c +++ work/drivers/block/brd.c @@ -479,7 +479,7 @@ static struct brd_device *brd_alloc(int if (!brd->brd_queue) goto out_free_dev; blk_queue_make_request(brd->brd_queue, brd_make_request); - blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_TAG, NULL); + blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_BAR, NULL); blk_queue_max_hw_sectors(brd->brd_queue, 1024); blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY); Index: work/drivers/block/virtio_blk.c =================================================================== --- work.orig/drivers/block/virtio_blk.c +++ work/drivers/block/virtio_blk.c @@ -368,10 +368,10 @@ static int __devinit virtblk_probe(struc /* If barriers are supported, tell block layer that queue is ordered */ if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) - blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, + 
blk_queue_ordered(q, QUEUE_ORDERED_FLUSH, virtblk_prepare_flush); else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER)) - blk_queue_ordered(q, QUEUE_ORDERED_TAG, NULL); + blk_queue_ordered(q, QUEUE_ORDERED_BAR, NULL); /* If disk is read-only in the host, the guest should obey */ if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO)) Index: work/drivers/scsi/sd.c =================================================================== --- work.orig/drivers/scsi/sd.c +++ work/drivers/scsi/sd.c @@ -2103,15 +2103,13 @@ static int sd_revalidate_disk(struct gen /* * We now have all cache related info, determine how we deal - * with ordered requests. Note that as the current SCSI - * dispatch function can alter request order, we cannot use - * QUEUE_ORDERED_TAG_* even when ordered tag is supported. + * with ordered requests. */ if (sdkp->WCE) ordered = sdkp->DPOFUA - ? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH; + ? QUEUE_ORDERED_FUA : QUEUE_ORDERED_FLUSH; else - ordered = QUEUE_ORDERED_DRAIN; + ordered = QUEUE_ORDERED_BAR; blk_queue_ordered(sdkp->disk->queue, ordered, sd_prepare_flush); Index: work/block/blk-core.c =================================================================== --- work.orig/block/blk-core.c +++ work/block/blk-core.c @@ -520,6 +520,7 @@ struct request_queue *blk_alloc_queue_no init_timer(&q->unplug_timer); setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q); INIT_LIST_HEAD(&q->timeout_list); + INIT_LIST_HEAD(&q->pending_barriers); INIT_WORK(&q->unplug_work, blk_unplug_work); kobject_init(&q->kobj, &blk_queue_ktype); @@ -1036,22 +1037,6 @@ void blk_insert_request(struct request_q } EXPORT_SYMBOL(blk_insert_request); -/* - * add-request adds a request to the linked list. - * queue lock is held and interrupts disabled, as we muck with the - * request queue list. 
- */ -static inline void add_request(struct request_queue *q, struct request *req) -{ - drive_stat_acct(req, 1); - - /* - * elevator indicated where it wants this request to be - * inserted at elevator_merge time - */ - __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0); -} - static void part_round_stats_single(int cpu, struct hd_struct *part, unsigned long now) { @@ -1184,6 +1169,7 @@ static int __make_request(struct request const bool sync = bio_rw_flagged(bio, BIO_RW_SYNCIO); const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG); const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK; + int where = ELEVATOR_INSERT_SORT; int rw_flags; if (bio_rw_flagged(bio, BIO_RW_BARRIER) && @@ -1191,6 +1177,7 @@ static int __make_request(struct request bio_endio(bio, -EOPNOTSUPP); return 0; } + /* * low level driver can indicate that it wants pages above a * certain limit bounced to low memory (ie for highmem, or even @@ -1200,7 +1187,12 @@ static int __make_request(struct request spin_lock_irq(q->queue_lock); - if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER)) || elv_queue_empty(q)) + if (bio_rw_flagged(bio, BIO_RW_BARRIER)) { + where = ELEVATOR_INSERT_ORDERED; + goto get_rq; + } + + if (elv_queue_empty(q)) goto get_rq; el_ret = elv_merge(q, &req, bio); @@ -1297,7 +1289,10 @@ get_rq: req->cpu = blk_cpu_to_group(smp_processor_id()); if (queue_should_plug(q) && elv_queue_empty(q)) blk_plug_device(q); - add_request(q, req); + + /* insert the request into the elevator */ + drive_stat_acct(req, 1); + __elv_add_request(q, req, where, 0); out: if (unplug || !queue_should_plug(q)) __generic_unplug_device(q); Index: work/block/elevator.c =================================================================== --- work.orig/block/elevator.c +++ work/block/elevator.c @@ -564,7 +564,7 @@ void elv_requeue_request(struct request_ rq->cmd_flags &= ~REQ_STARTED; - elv_insert(q, rq, ELEVATOR_INSERT_REQUEUE); + elv_insert(q, rq, ELEVATOR_INSERT_FRONT); } void elv_drain_elevator(struct 
request_queue *q) @@ -611,8 +611,6 @@ void elv_quiesce_end(struct request_queu void elv_insert(struct request_queue *q, struct request *rq, int where) { - struct list_head *pos; - unsigned ordseq; int unplug_it = 1; trace_block_rq_insert(q, rq); @@ -622,10 +620,14 @@ void elv_insert(struct request_queue *q, switch (where) { case ELEVATOR_INSERT_FRONT: rq->cmd_flags |= REQ_SOFTBARRIER; - list_add(&rq->queuelist, &q->queue_head); break; + case ELEVATOR_INSERT_ORDERED: + rq->cmd_flags |= REQ_SOFTBARRIER; + list_add_tail(&rq->queuelist, &q->queue_head); + break; + case ELEVATOR_INSERT_BACK: rq->cmd_flags |= REQ_SOFTBARRIER; elv_drain_elevator(q); @@ -661,36 +663,6 @@ void elv_insert(struct request_queue *q, q->elevator->ops->elevator_add_req_fn(q, rq); break; - case ELEVATOR_INSERT_REQUEUE: - /* - * If ordered flush isn't in progress, we do front - * insertion; otherwise, requests should be requeued - * in ordseq order. - */ - rq->cmd_flags |= REQ_SOFTBARRIER; - - /* - * Most requeues happen because of a busy condition, - * don't force unplug of the queue for that case. 
- */ - unplug_it = 0; - - if (q->ordseq == 0) { - list_add(&rq->queuelist, &q->queue_head); - break; - } - - ordseq = blk_ordered_req_seq(rq); - - list_for_each(pos, &q->queue_head) { - struct request *pos_rq = list_entry_rq(pos); - if (ordseq <= blk_ordered_req_seq(pos_rq)) - break; - } - - list_add_tail(&rq->queuelist, pos); - break; - default: printk(KERN_ERR "%s: bad insertion point %d\n", __func__, where); @@ -709,32 +681,14 @@ void elv_insert(struct request_queue *q, void __elv_add_request(struct request_queue *q, struct request *rq, int where, int plug) { - if (q->ordcolor) - rq->cmd_flags |= REQ_ORDERED_COLOR; - if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) { - /* - * toggle ordered color - */ - if (blk_barrier_rq(rq)) - q->ordcolor ^= 1; - - /* - * barriers implicitly indicate back insertion - */ - if (where == ELEVATOR_INSERT_SORT) - where = ELEVATOR_INSERT_BACK; - - /* - * this request is scheduling boundary, update - * end_sector - */ + /* barriers are scheduling boundary, update end_sector */ if (blk_fs_request(rq) || blk_discard_rq(rq)) { q->end_sector = rq_end_sector(rq); q->boundary_rq = rq; } } else if (!(rq->cmd_flags & REQ_ELVPRIV) && - where == ELEVATOR_INSERT_SORT) + where == ELEVATOR_INSERT_SORT) where = ELEVATOR_INSERT_BACK; if (plug) @@ -846,24 +800,6 @@ void elv_completed_request(struct reques if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn) e->ops->elevator_completed_req_fn(q, rq); } - - /* - * Check if the queue is waiting for fs requests to be - * drained for flush sequence. 
- */ - if (unlikely(q->ordseq)) { - struct request *next = NULL; - - if (!list_empty(&q->queue_head)) - next = list_entry_rq(q->queue_head.next); - - if (!queue_in_flight(q) && - blk_ordered_cur_seq(q) == QUEUE_ORDSEQ_DRAIN && - (!next || blk_ordered_req_seq(next) > QUEUE_ORDSEQ_DRAIN)) { - blk_ordered_complete_seq(q, QUEUE_ORDSEQ_DRAIN, 0); - __blk_run_queue(q); - } - } } #define to_elv(atr) container_of((atr), struct elv_fs_entry, attr) Index: work/block/blk.h =================================================================== --- work.orig/block/blk.h +++ work/block/blk.h @@ -51,6 +51,8 @@ static inline void blk_clear_rq_complete */ #define ELV_ON_HASH(rq) (!hlist_unhashed(&(rq)->hash)) +struct request *blk_do_ordered(struct request_queue *q, struct request *rq); + static inline struct request *__elv_next_request(struct request_queue *q) { struct request *rq; @@ -58,7 +60,8 @@ static inline struct request *__elv_next while (1) { while (!list_empty(&q->queue_head)) { rq = list_entry_rq(q->queue_head.next); - if (blk_do_ordered(q, &rq)) + rq = blk_do_ordered(q, rq); + if (rq) return rq; } Index: work/drivers/block/loop.c =================================================================== --- work.orig/drivers/block/loop.c +++ work/drivers/block/loop.c @@ -831,7 +831,7 @@ static int loop_set_fd(struct loop_devic lo->lo_queue->unplug_fn = loop_unplug; if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync) - blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN, NULL); + blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_BAR, NULL); set_capacity(lo->lo_disk, size); bd_set_size(bdev, size << 9); Index: work/drivers/block/osdblk.c =================================================================== --- work.orig/drivers/block/osdblk.c +++ work/drivers/block/osdblk.c @@ -446,7 +446,7 @@ static int osdblk_init_disk(struct osdbl blk_queue_stack_limits(q, osd_request_queue(osdev->osd)); blk_queue_prep_rq(q, blk_queue_start_tag); - blk_queue_ordered(q, 
QUEUE_ORDERED_DRAIN_FLUSH, osdblk_prepare_flush); + blk_queue_ordered(q, QUEUE_ORDERED_FLUSH, osdblk_prepare_flush); disk->queue = q; Index: work/drivers/block/ps3disk.c =================================================================== --- work.orig/drivers/block/ps3disk.c +++ work/drivers/block/ps3disk.c @@ -480,8 +480,7 @@ static int __devinit ps3disk_probe(struc blk_queue_dma_alignment(queue, dev->blk_size-1); blk_queue_logical_block_size(queue, dev->blk_size); - blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH, - ps3disk_prepare_flush); + blk_queue_ordered(queue, QUEUE_ORDERED_FLUSH, ps3disk_prepare_flush); blk_queue_max_segments(queue, -1); blk_queue_max_segment_size(queue, dev->bounce_size); Index: work/drivers/block/xen-blkfront.c =================================================================== --- work.orig/drivers/block/xen-blkfront.c +++ work/drivers/block/xen-blkfront.c @@ -373,7 +373,7 @@ static int xlvbd_barrier(struct blkfront int err; err = blk_queue_ordered(info->rq, - info->feature_barrier ? QUEUE_ORDERED_DRAIN : QUEUE_ORDERED_NONE, + info->feature_barrier ? QUEUE_ORDERED_BAR : QUEUE_ORDERED_NONE, NULL); if (err) Index: work/drivers/ide/ide-disk.c =================================================================== --- work.orig/drivers/ide/ide-disk.c +++ work/drivers/ide/ide-disk.c @@ -537,11 +537,11 @@ static void update_ordered(ide_drive_t * drive->name, barrier ? 
"" : "not "); if (barrier) { - ordered = QUEUE_ORDERED_DRAIN_FLUSH; + ordered = QUEUE_ORDERED_FLUSH; prep_fn = idedisk_prepare_flush; } } else - ordered = QUEUE_ORDERED_DRAIN; + ordered = QUEUE_ORDERED_BAR; blk_queue_ordered(drive->queue, ordered, prep_fn); } Index: work/drivers/md/dm.c =================================================================== --- work.orig/drivers/md/dm.c +++ work/drivers/md/dm.c @@ -1912,8 +1912,7 @@ static struct mapped_device *alloc_dev(i blk_queue_softirq_done(md->queue, dm_softirq_done); blk_queue_prep_rq(md->queue, dm_prep_fn); blk_queue_lld_busy(md->queue, dm_lld_busy); - blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH, - dm_rq_prepare_flush); + blk_queue_ordered(md->queue, QUEUE_ORDERED_FLUSH, dm_rq_prepare_flush); md->disk = alloc_disk(1); if (!md->disk) Index: work/drivers/mmc/card/queue.c =================================================================== --- work.orig/drivers/mmc/card/queue.c +++ work/drivers/mmc/card/queue.c @@ -128,7 +128,7 @@ int mmc_init_queue(struct mmc_queue *mq, mq->req = NULL; blk_queue_prep_rq(mq->queue, mmc_prep_request); - blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN, NULL); + blk_queue_ordered(mq->queue, QUEUE_ORDERED_BAR, NULL); queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue); #ifdef CONFIG_MMC_BLOCK_BOUNCE Index: work/drivers/s390/block/dasd.c =================================================================== --- work.orig/drivers/s390/block/dasd.c +++ work/drivers/s390/block/dasd.c @@ -2196,7 +2196,7 @@ static void dasd_setup_queue(struct dasd */ blk_queue_max_segment_size(block->request_queue, PAGE_SIZE); blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1); - blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN, NULL); + blk_queue_ordered(block->request_queue, QUEUE_ORDERED_BAR, NULL); } /* Index: work/include/linux/elevator.h =================================================================== --- work.orig/include/linux/elevator.h +++ 
work/include/linux/elevator.h @@ -162,9 +162,9 @@ extern struct request *elv_rb_find(struc * Insertion selection */ #define ELEVATOR_INSERT_FRONT 1 -#define ELEVATOR_INSERT_BACK 2 -#define ELEVATOR_INSERT_SORT 3 -#define ELEVATOR_INSERT_REQUEUE 4 +#define ELEVATOR_INSERT_ORDERED 2 +#define ELEVATOR_INSERT_BACK 3 +#define ELEVATOR_INSERT_SORT 4 /* * return values from elevator_may_queue_fn Index: work/drivers/block/pktcdvd.c =================================================================== --- work.orig/drivers/block/pktcdvd.c +++ work/drivers/block/pktcdvd.c @@ -752,7 +752,6 @@ static int pkt_generic_packet(struct pkt rq->timeout = 60*HZ; rq->cmd_type = REQ_TYPE_BLOCK_PC; - rq->cmd_flags |= REQ_HARDBARRIER; if (cgc->quiet) rq->cmd_flags |= REQ_QUIET; ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [PATCH, RFC] relaxed barriers 2010-08-06 16:04 ` [PATCH, RFC] relaxed barriers Tejun Heo @ 2010-08-06 23:34 ` Christoph Hellwig 2010-08-07 10:13 ` [PATCH REPOST " Tejun Heo 1 sibling, 0 replies; 155+ messages in thread From: Christoph Hellwig @ 2010-08-06 23:34 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Hellwig, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel, linux-raid > Christoph, does this look like something the filesystems can use or > have I misunderstood something? This sounds very useful. I'll review and test it once I get a bit of time. ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [PATCH REPOST RFC] relaxed barriers 2010-08-06 16:04 ` [PATCH, RFC] relaxed barriers Tejun Heo 2010-08-06 23:34 ` Christoph Hellwig @ 2010-08-07 10:13 ` Tejun Heo 2010-08-08 14:31 ` Christoph Hellwig 1 sibling, 1 reply; 155+ messages in thread From: Tejun Heo @ 2010-08-07 10:13 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel, linux-raid The patch was on top of v2.6.35 but was generated against dirty tree and wouldn't apply cleanly. Here's the proper one. Thanks. --- block/blk-barrier.c | 255 +++++++++++++++---------------------------- block/blk-core.c | 31 ++--- block/blk.h | 5 block/elevator.c | 80 +------------ drivers/block/brd.c | 2 drivers/block/loop.c | 2 drivers/block/osdblk.c | 2 drivers/block/pktcdvd.c | 1 drivers/block/ps3disk.c | 3 drivers/block/virtio_blk.c | 4 drivers/block/xen-blkfront.c | 2 drivers/ide/ide-disk.c | 4 drivers/md/dm.c | 3 drivers/mmc/card/queue.c | 2 drivers/s390/block/dasd.c | 2 drivers/scsi/sd.c | 8 - include/linux/blkdev.h | 63 +++------- include/linux/elevator.h | 6 - 18 files changed, 155 insertions(+), 320 deletions(-) Index: work/block/blk-barrier.c =================================================================== --- work.orig/block/blk-barrier.c +++ work/block/blk-barrier.c @@ -9,6 +9,8 @@ #include "blk.h" +static struct request *queue_next_ordseq(struct request_queue *q); + /** * blk_queue_ordered - does this queue support ordered writes * @q: the request queue @@ -31,13 +33,8 @@ int blk_queue_ordered(struct request_que return -EINVAL; } - if (ordered != QUEUE_ORDERED_NONE && - ordered != QUEUE_ORDERED_DRAIN && - ordered != QUEUE_ORDERED_DRAIN_FLUSH && - ordered != QUEUE_ORDERED_DRAIN_FUA && - ordered != QUEUE_ORDERED_TAG && - ordered != QUEUE_ORDERED_TAG_FLUSH && - ordered != QUEUE_ORDERED_TAG_FUA) { + if (ordered != QUEUE_ORDERED_NONE && ordered != QUEUE_ORDERED_BAR && + ordered != 
QUEUE_ORDERED_FLUSH && ordered != QUEUE_ORDERED_FUA) { printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered); return -EINVAL; } @@ -60,38 +57,10 @@ unsigned blk_ordered_cur_seq(struct requ return 1 << ffz(q->ordseq); } -unsigned blk_ordered_req_seq(struct request *rq) +static struct request *blk_ordered_complete_seq(struct request_queue *q, + unsigned seq, int error) { - struct request_queue *q = rq->q; - - BUG_ON(q->ordseq == 0); - - if (rq == &q->pre_flush_rq) - return QUEUE_ORDSEQ_PREFLUSH; - if (rq == &q->bar_rq) - return QUEUE_ORDSEQ_BAR; - if (rq == &q->post_flush_rq) - return QUEUE_ORDSEQ_POSTFLUSH; - - /* - * !fs requests don't need to follow barrier ordering. Always - * put them at the front. This fixes the following deadlock. - * - * http://thread.gmane.org/gmane.linux.kernel/537473 - */ - if (!blk_fs_request(rq)) - return QUEUE_ORDSEQ_DRAIN; - - if ((rq->cmd_flags & REQ_ORDERED_COLOR) == - (q->orig_bar_rq->cmd_flags & REQ_ORDERED_COLOR)) - return QUEUE_ORDSEQ_DRAIN; - else - return QUEUE_ORDSEQ_DONE; -} - -bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error) -{ - struct request *rq; + struct request *rq = NULL; if (error && !q->orderr) q->orderr = error; @@ -99,16 +68,22 @@ bool blk_ordered_complete_seq(struct req BUG_ON(q->ordseq & seq); q->ordseq |= seq; - if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) - return false; - - /* - * Okay, sequence complete. 
- */ - q->ordseq = 0; - rq = q->orig_bar_rq; - __blk_end_request_all(rq, q->orderr); - return true; + if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) { + /* not complete yet, queue the next ordered sequence */ + rq = queue_next_ordseq(q); + } else { + /* complete this barrier request */ + __blk_end_request_all(q->orig_bar_rq, q->orderr); + q->orig_bar_rq = NULL; + q->ordseq = 0; + + /* dispatch the next barrier if there's one */ + if (!list_empty(&q->pending_barriers)) { + rq = list_entry_rq(q->pending_barriers.next); + list_move(&rq->queuelist, &q->queue_head); + } + } + return rq; } static void pre_flush_end_io(struct request *rq, int error) @@ -129,21 +104,10 @@ static void post_flush_end_io(struct req blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error); } -static void queue_flush(struct request_queue *q, unsigned which) +static void queue_flush(struct request_queue *q, struct request *rq, + rq_end_io_fn *end_io) { - struct request *rq; - rq_end_io_fn *end_io; - - if (which == QUEUE_ORDERED_DO_PREFLUSH) { - rq = &q->pre_flush_rq; - end_io = pre_flush_end_io; - } else { - rq = &q->post_flush_rq; - end_io = post_flush_end_io; - } - blk_rq_init(q, rq); - rq->cmd_flags = REQ_HARDBARRIER; rq->rq_disk = q->bar_rq.rq_disk; rq->end_io = end_io; q->prepare_flush_fn(q, rq); @@ -151,132 +115,93 @@ static void queue_flush(struct request_q elv_insert(q, rq, ELEVATOR_INSERT_FRONT); } -static inline bool start_ordered(struct request_queue *q, struct request **rqp) +static struct request *queue_next_ordseq(struct request_queue *q) { - struct request *rq = *rqp; - unsigned skip = 0; + struct request *rq = &q->bar_rq; - q->orderr = 0; - q->ordered = q->next_ordered; - q->ordseq |= QUEUE_ORDSEQ_STARTED; - - /* - * For an empty barrier, there's no actual BAR request, which - * in turn makes POSTFLUSH unnecessary. Mask them off. 
- */ - if (!blk_rq_sectors(rq)) { - q->ordered &= ~(QUEUE_ORDERED_DO_BAR | - QUEUE_ORDERED_DO_POSTFLUSH); - /* - * Empty barrier on a write-through device w/ ordered - * tag has no command to issue and without any command - * to issue, ordering by tag can't be used. Drain - * instead. - */ - if ((q->ordered & QUEUE_ORDERED_BY_TAG) && - !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) { - q->ordered &= ~QUEUE_ORDERED_BY_TAG; - q->ordered |= QUEUE_ORDERED_BY_DRAIN; - } - } - - /* stash away the original request */ - blk_dequeue_request(rq); - q->orig_bar_rq = rq; - rq = NULL; - - /* - * Queue ordered sequence. As we stack them at the head, we - * need to queue in reverse order. Note that we rely on that - * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs - * request gets inbetween ordered sequence. - */ - if (q->ordered & QUEUE_ORDERED_DO_POSTFLUSH) { - queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH); - rq = &q->post_flush_rq; - } else - skip |= QUEUE_ORDSEQ_POSTFLUSH; - - if (q->ordered & QUEUE_ORDERED_DO_BAR) { - rq = &q->bar_rq; + switch (blk_ordered_cur_seq(q)) { + case QUEUE_ORDSEQ_PREFLUSH: + queue_flush(q, rq, pre_flush_end_io); + break; + case QUEUE_ORDSEQ_BAR: /* initialize proxy request and queue it */ blk_rq_init(q, rq); - if (bio_data_dir(q->orig_bar_rq->bio) == WRITE) - rq->cmd_flags |= REQ_RW; + init_request_from_bio(rq, q->orig_bar_rq->bio); + rq->cmd_flags &= ~REQ_HARDBARRIER; if (q->ordered & QUEUE_ORDERED_DO_FUA) rq->cmd_flags |= REQ_FUA; - init_request_from_bio(rq, q->orig_bar_rq->bio); rq->end_io = bar_end_io; elv_insert(q, rq, ELEVATOR_INSERT_FRONT); - } else - skip |= QUEUE_ORDSEQ_BAR; + break; - if (q->ordered & QUEUE_ORDERED_DO_PREFLUSH) { - queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH); - rq = &q->pre_flush_rq; - } else - skip |= QUEUE_ORDSEQ_PREFLUSH; + case QUEUE_ORDSEQ_POSTFLUSH: + queue_flush(q, rq, post_flush_end_io); + break; - if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q)) - rq = NULL; - else - skip |= QUEUE_ORDSEQ_DRAIN; - 
- *rqp = rq; - - /* - * Complete skipped sequences. If whole sequence is complete, - * return false to tell elevator that this request is gone. - */ - return !blk_ordered_complete_seq(q, skip, 0); + default: + BUG(); + } + return rq; } -bool blk_do_ordered(struct request_queue *q, struct request **rqp) +struct request *blk_do_ordered(struct request_queue *q, struct request *rq) { - struct request *rq = *rqp; - const int is_barrier = blk_fs_request(rq) && blk_barrier_rq(rq); + unsigned skip = 0; - if (!q->ordseq) { - if (!is_barrier) - return true; - - if (q->next_ordered != QUEUE_ORDERED_NONE) - return start_ordered(q, rqp); - else { - /* - * Queue ordering not supported. Terminate - * with prejudice. - */ - blk_dequeue_request(rq); - __blk_end_request_all(rq, -EOPNOTSUPP); - *rqp = NULL; - return false; - } + if (!blk_barrier_rq(rq)) + return rq; + + if (q->ordseq) { + /* + * Barrier is already in progress and they can't be + * processed in parallel. Queue for later processing. + */ + list_move_tail(&rq->queuelist, &q->pending_barriers); + return NULL; + } + + if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) { + /* + * Queue ordering not supported. Terminate + * with prejudice. + */ + blk_dequeue_request(rq); + __blk_end_request_all(rq, -EOPNOTSUPP); + return NULL; } /* - * Ordered sequence in progress + * Start a new ordered sequence */ + q->orderr = 0; + q->ordered = q->next_ordered; + q->ordseq |= QUEUE_ORDSEQ_STARTED; - /* Special requests are not subject to ordering rules. */ - if (!blk_fs_request(rq) && - rq != &q->pre_flush_rq && rq != &q->post_flush_rq) - return true; - - if (q->ordered & QUEUE_ORDERED_BY_TAG) { - /* Ordered by tag. Blocking the next barrier is enough. */ - if (is_barrier && rq != &q->bar_rq) - *rqp = NULL; - } else { - /* Ordered by draining. Wait for turn. 
*/ - WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q)); - if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q)) - *rqp = NULL; - } + /* + * For an empty barrier, there's no actual BAR request, which + * in turn makes POSTFLUSH unnecessary. Mask them off. + */ + if (!blk_rq_sectors(rq)) + q->ordered &= ~(QUEUE_ORDERED_DO_BAR | + QUEUE_ORDERED_DO_POSTFLUSH); + + /* stash away the original request */ + blk_dequeue_request(rq); + q->orig_bar_rq = rq; + + if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) + skip |= QUEUE_ORDSEQ_PREFLUSH; + + if (!(q->ordered & QUEUE_ORDERED_DO_BAR)) + skip |= QUEUE_ORDSEQ_BAR; + + if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH)) + skip |= QUEUE_ORDSEQ_POSTFLUSH; - return true; + /* complete skipped sequences and return the first sequence */ + return blk_ordered_complete_seq(q, skip, 0); } static void bio_end_empty_barrier(struct bio *bio, int err) Index: work/include/linux/blkdev.h =================================================================== --- work.orig/include/linux/blkdev.h +++ work/include/linux/blkdev.h @@ -106,7 +106,6 @@ enum rq_flag_bits { __REQ_FAILED, /* set if the request failed */ __REQ_QUIET, /* don't worry about errors */ __REQ_PREEMPT, /* set for "ide_preempt" requests */ - __REQ_ORDERED_COLOR, /* is before or after barrier */ __REQ_RW_SYNC, /* request is sync (sync write or read) */ __REQ_ALLOCED, /* request came from our alloc pool */ __REQ_RW_META, /* metadata io request */ @@ -135,7 +134,6 @@ enum rq_flag_bits { #define REQ_FAILED (1 << __REQ_FAILED) #define REQ_QUIET (1 << __REQ_QUIET) #define REQ_PREEMPT (1 << __REQ_PREEMPT) -#define REQ_ORDERED_COLOR (1 << __REQ_ORDERED_COLOR) #define REQ_RW_SYNC (1 << __REQ_RW_SYNC) #define REQ_ALLOCED (1 << __REQ_ALLOCED) #define REQ_RW_META (1 << __REQ_RW_META) @@ -437,9 +435,10 @@ struct request_queue * reserved for flush operations */ unsigned int ordered, next_ordered, ordseq; - int orderr, ordcolor; - struct request pre_flush_rq, bar_rq, post_flush_rq; - struct 
request *orig_bar_rq; + int orderr; + struct request bar_rq; + struct request *orig_bar_rq; + struct list_head pending_barriers; struct mutex sysfs_lock; @@ -543,49 +542,33 @@ enum { * Hardbarrier is supported with one of the following methods. * * NONE : hardbarrier unsupported - * DRAIN : ordering by draining is enough - * DRAIN_FLUSH : ordering by draining w/ pre and post flushes - * DRAIN_FUA : ordering by draining w/ pre flush and FUA write - * TAG : ordering by tag is enough - * TAG_FLUSH : ordering by tag w/ pre and post flushes - * TAG_FUA : ordering by tag w/ pre flush and FUA write - */ - QUEUE_ORDERED_BY_DRAIN = 0x01, - QUEUE_ORDERED_BY_TAG = 0x02, - QUEUE_ORDERED_DO_PREFLUSH = 0x10, - QUEUE_ORDERED_DO_BAR = 0x20, - QUEUE_ORDERED_DO_POSTFLUSH = 0x40, - QUEUE_ORDERED_DO_FUA = 0x80, - - QUEUE_ORDERED_NONE = 0x00, - - QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_BY_DRAIN | - QUEUE_ORDERED_DO_BAR, - QUEUE_ORDERED_DRAIN_FLUSH = QUEUE_ORDERED_DRAIN | - QUEUE_ORDERED_DO_PREFLUSH | - QUEUE_ORDERED_DO_POSTFLUSH, - QUEUE_ORDERED_DRAIN_FUA = QUEUE_ORDERED_DRAIN | - QUEUE_ORDERED_DO_PREFLUSH | - QUEUE_ORDERED_DO_FUA, + * BAR : writing out barrier is enough + * FLUSH : barrier and surrounding pre and post flushes + * FUA : FUA barrier w/ pre flush + */ + QUEUE_ORDERED_DO_PREFLUSH = 1 << 0, + QUEUE_ORDERED_DO_BAR = 1 << 1, + QUEUE_ORDERED_DO_POSTFLUSH = 1 << 2, + QUEUE_ORDERED_DO_FUA = 1 << 3, + + QUEUE_ORDERED_NONE = 0, - QUEUE_ORDERED_TAG = QUEUE_ORDERED_BY_TAG | - QUEUE_ORDERED_DO_BAR, - QUEUE_ORDERED_TAG_FLUSH = QUEUE_ORDERED_TAG | + QUEUE_ORDERED_BAR = QUEUE_ORDERED_DO_BAR, + QUEUE_ORDERED_FLUSH = QUEUE_ORDERED_DO_BAR | QUEUE_ORDERED_DO_PREFLUSH | QUEUE_ORDERED_DO_POSTFLUSH, - QUEUE_ORDERED_TAG_FUA = QUEUE_ORDERED_TAG | + QUEUE_ORDERED_FUA = QUEUE_ORDERED_DO_BAR | QUEUE_ORDERED_DO_PREFLUSH | QUEUE_ORDERED_DO_FUA, /* * Ordered operation sequence */ - QUEUE_ORDSEQ_STARTED = 0x01, /* flushing in progress */ - QUEUE_ORDSEQ_DRAIN = 0x02, /* waiting for the queue to be drained 
*/ - QUEUE_ORDSEQ_PREFLUSH = 0x04, /* pre-flushing in progress */ - QUEUE_ORDSEQ_BAR = 0x08, /* original barrier req in progress */ - QUEUE_ORDSEQ_POSTFLUSH = 0x10, /* post-flushing in progress */ - QUEUE_ORDSEQ_DONE = 0x20, + QUEUE_ORDSEQ_STARTED = (1 << 0), /* flushing in progress */ + QUEUE_ORDSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */ + QUEUE_ORDSEQ_BAR = (1 << 2), /* barrier write in progress */ + QUEUE_ORDSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */ + QUEUE_ORDSEQ_DONE = (1 << 4), }; #define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags) @@ -967,10 +950,8 @@ extern void blk_queue_rq_timed_out(struc extern void blk_queue_rq_timeout(struct request_queue *, unsigned int); extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev); extern int blk_queue_ordered(struct request_queue *, unsigned, prepare_flush_fn *); -extern bool blk_do_ordered(struct request_queue *, struct request **); extern unsigned blk_ordered_cur_seq(struct request_queue *); extern unsigned blk_ordered_req_seq(struct request *); -extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int); extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *); extern void blk_dump_rq_flags(struct request *, char *); Index: work/drivers/block/brd.c =================================================================== --- work.orig/drivers/block/brd.c +++ work/drivers/block/brd.c @@ -479,7 +479,7 @@ static struct brd_device *brd_alloc(int if (!brd->brd_queue) goto out_free_dev; blk_queue_make_request(brd->brd_queue, brd_make_request); - blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_TAG, NULL); + blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_BAR, NULL); blk_queue_max_hw_sectors(brd->brd_queue, 1024); blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY); Index: work/drivers/block/virtio_blk.c =================================================================== --- 
work.orig/drivers/block/virtio_blk.c +++ work/drivers/block/virtio_blk.c @@ -368,10 +368,10 @@ static int __devinit virtblk_probe(struc /* If barriers are supported, tell block layer that queue is ordered */ if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) - blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, + blk_queue_ordered(q, QUEUE_ORDERED_FLUSH, virtblk_prepare_flush); else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER)) - blk_queue_ordered(q, QUEUE_ORDERED_TAG, NULL); + blk_queue_ordered(q, QUEUE_ORDERED_BAR, NULL); /* If disk is read-only in the host, the guest should obey */ if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO)) Index: work/drivers/scsi/sd.c =================================================================== --- work.orig/drivers/scsi/sd.c +++ work/drivers/scsi/sd.c @@ -2103,15 +2103,13 @@ static int sd_revalidate_disk(struct gen /* * We now have all cache related info, determine how we deal - * with ordered requests. Note that as the current SCSI - * dispatch function can alter request order, we cannot use - * QUEUE_ORDERED_TAG_* even when ordered tag is supported. + * with ordered requests. */ if (sdkp->WCE) ordered = sdkp->DPOFUA - ? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH; + ? 
QUEUE_ORDERED_FUA : QUEUE_ORDERED_FLUSH; else - ordered = QUEUE_ORDERED_DRAIN; + ordered = QUEUE_ORDERED_BAR; blk_queue_ordered(sdkp->disk->queue, ordered, sd_prepare_flush); Index: work/block/blk-core.c =================================================================== --- work.orig/block/blk-core.c +++ work/block/blk-core.c @@ -520,6 +520,7 @@ struct request_queue *blk_alloc_queue_no init_timer(&q->unplug_timer); setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q); INIT_LIST_HEAD(&q->timeout_list); + INIT_LIST_HEAD(&q->pending_barriers); INIT_WORK(&q->unplug_work, blk_unplug_work); kobject_init(&q->kobj, &blk_queue_ktype); @@ -1036,22 +1037,6 @@ void blk_insert_request(struct request_q } EXPORT_SYMBOL(blk_insert_request); -/* - * add-request adds a request to the linked list. - * queue lock is held and interrupts disabled, as we muck with the - * request queue list. - */ -static inline void add_request(struct request_queue *q, struct request *req) -{ - drive_stat_acct(req, 1); - - /* - * elevator indicated where it wants this request to be - * inserted at elevator_merge time - */ - __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0); -} - static void part_round_stats_single(int cpu, struct hd_struct *part, unsigned long now) { @@ -1184,6 +1169,7 @@ static int __make_request(struct request const bool sync = bio_rw_flagged(bio, BIO_RW_SYNCIO); const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG); const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK; + int where = ELEVATOR_INSERT_SORT; int rw_flags; if (bio_rw_flagged(bio, BIO_RW_BARRIER) && @@ -1191,6 +1177,7 @@ static int __make_request(struct request bio_endio(bio, -EOPNOTSUPP); return 0; } + /* * low level driver can indicate that it wants pages above a * certain limit bounced to low memory (ie for highmem, or even @@ -1200,7 +1187,12 @@ static int __make_request(struct request spin_lock_irq(q->queue_lock); - if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER)) || elv_queue_empty(q)) + if 
(bio_rw_flagged(bio, BIO_RW_BARRIER)) { + where = ELEVATOR_INSERT_ORDERED; + goto get_rq; + } + + if (elv_queue_empty(q)) goto get_rq; el_ret = elv_merge(q, &req, bio); @@ -1297,7 +1289,10 @@ get_rq: req->cpu = blk_cpu_to_group(smp_processor_id()); if (queue_should_plug(q) && elv_queue_empty(q)) blk_plug_device(q); - add_request(q, req); + + /* insert the request into the elevator */ + drive_stat_acct(req, 1); + __elv_add_request(q, req, where, 0); out: if (unplug || !queue_should_plug(q)) __generic_unplug_device(q); Index: work/block/elevator.c =================================================================== --- work.orig/block/elevator.c +++ work/block/elevator.c @@ -564,7 +564,7 @@ void elv_requeue_request(struct request_ rq->cmd_flags &= ~REQ_STARTED; - elv_insert(q, rq, ELEVATOR_INSERT_REQUEUE); + elv_insert(q, rq, ELEVATOR_INSERT_FRONT); } void elv_drain_elevator(struct request_queue *q) @@ -611,8 +611,6 @@ void elv_quiesce_end(struct request_queu void elv_insert(struct request_queue *q, struct request *rq, int where) { - struct list_head *pos; - unsigned ordseq; int unplug_it = 1; trace_block_rq_insert(q, rq); @@ -622,10 +620,14 @@ void elv_insert(struct request_queue *q, switch (where) { case ELEVATOR_INSERT_FRONT: rq->cmd_flags |= REQ_SOFTBARRIER; - list_add(&rq->queuelist, &q->queue_head); break; + case ELEVATOR_INSERT_ORDERED: + rq->cmd_flags |= REQ_SOFTBARRIER; + list_add_tail(&rq->queuelist, &q->queue_head); + break; + case ELEVATOR_INSERT_BACK: rq->cmd_flags |= REQ_SOFTBARRIER; elv_drain_elevator(q); @@ -661,36 +663,6 @@ void elv_insert(struct request_queue *q, q->elevator->ops->elevator_add_req_fn(q, rq); break; - case ELEVATOR_INSERT_REQUEUE: - /* - * If ordered flush isn't in progress, we do front - * insertion; otherwise, requests should be requeued - * in ordseq order. - */ - rq->cmd_flags |= REQ_SOFTBARRIER; - - /* - * Most requeues happen because of a busy condition, - * don't force unplug of the queue for that case. 
- */ - unplug_it = 0; - - if (q->ordseq == 0) { - list_add(&rq->queuelist, &q->queue_head); - break; - } - - ordseq = blk_ordered_req_seq(rq); - - list_for_each(pos, &q->queue_head) { - struct request *pos_rq = list_entry_rq(pos); - if (ordseq <= blk_ordered_req_seq(pos_rq)) - break; - } - - list_add_tail(&rq->queuelist, pos); - break; - default: printk(KERN_ERR "%s: bad insertion point %d\n", __func__, where); @@ -709,32 +681,14 @@ void elv_insert(struct request_queue *q, void __elv_add_request(struct request_queue *q, struct request *rq, int where, int plug) { - if (q->ordcolor) - rq->cmd_flags |= REQ_ORDERED_COLOR; - if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) { - /* - * toggle ordered color - */ - if (blk_barrier_rq(rq)) - q->ordcolor ^= 1; - - /* - * barriers implicitly indicate back insertion - */ - if (where == ELEVATOR_INSERT_SORT) - where = ELEVATOR_INSERT_BACK; - - /* - * this request is scheduling boundary, update - * end_sector - */ + /* barriers are scheduling boundary, update end_sector */ if (blk_fs_request(rq) || blk_discard_rq(rq)) { q->end_sector = rq_end_sector(rq); q->boundary_rq = rq; } } else if (!(rq->cmd_flags & REQ_ELVPRIV) && - where == ELEVATOR_INSERT_SORT) + where == ELEVATOR_INSERT_SORT) where = ELEVATOR_INSERT_BACK; if (plug) @@ -846,24 +800,6 @@ void elv_completed_request(struct reques if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn) e->ops->elevator_completed_req_fn(q, rq); } - - /* - * Check if the queue is waiting for fs requests to be - * drained for flush sequence. 
- */ - if (unlikely(q->ordseq)) { - struct request *next = NULL; - - if (!list_empty(&q->queue_head)) - next = list_entry_rq(q->queue_head.next); - - if (!queue_in_flight(q) && - blk_ordered_cur_seq(q) == QUEUE_ORDSEQ_DRAIN && - (!next || blk_ordered_req_seq(next) > QUEUE_ORDSEQ_DRAIN)) { - blk_ordered_complete_seq(q, QUEUE_ORDSEQ_DRAIN, 0); - __blk_run_queue(q); - } - } } #define to_elv(atr) container_of((atr), struct elv_fs_entry, attr) Index: work/block/blk.h =================================================================== --- work.orig/block/blk.h +++ work/block/blk.h @@ -51,6 +51,8 @@ static inline void blk_clear_rq_complete */ #define ELV_ON_HASH(rq) (!hlist_unhashed(&(rq)->hash)) +struct request *blk_do_ordered(struct request_queue *q, struct request *rq); + static inline struct request *__elv_next_request(struct request_queue *q) { struct request *rq; @@ -58,7 +60,8 @@ static inline struct request *__elv_next while (1) { while (!list_empty(&q->queue_head)) { rq = list_entry_rq(q->queue_head.next); - if (blk_do_ordered(q, &rq)) + rq = blk_do_ordered(q, rq); + if (rq) return rq; } Index: work/drivers/block/loop.c =================================================================== --- work.orig/drivers/block/loop.c +++ work/drivers/block/loop.c @@ -831,7 +831,7 @@ static int loop_set_fd(struct loop_devic lo->lo_queue->unplug_fn = loop_unplug; if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync) - blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN, NULL); + blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_BAR, NULL); set_capacity(lo->lo_disk, size); bd_set_size(bdev, size << 9); Index: work/drivers/block/osdblk.c =================================================================== --- work.orig/drivers/block/osdblk.c +++ work/drivers/block/osdblk.c @@ -446,7 +446,7 @@ static int osdblk_init_disk(struct osdbl blk_queue_stack_limits(q, osd_request_queue(osdev->osd)); blk_queue_prep_rq(q, blk_queue_start_tag); - blk_queue_ordered(q, 
QUEUE_ORDERED_DRAIN_FLUSH, osdblk_prepare_flush); + blk_queue_ordered(q, QUEUE_ORDERED_FLUSH, osdblk_prepare_flush); disk->queue = q; Index: work/drivers/block/ps3disk.c =================================================================== --- work.orig/drivers/block/ps3disk.c +++ work/drivers/block/ps3disk.c @@ -480,8 +480,7 @@ static int __devinit ps3disk_probe(struc blk_queue_dma_alignment(queue, dev->blk_size-1); blk_queue_logical_block_size(queue, dev->blk_size); - blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH, - ps3disk_prepare_flush); + blk_queue_ordered(queue, QUEUE_ORDERED_FLUSH, ps3disk_prepare_flush); blk_queue_max_segments(queue, -1); blk_queue_max_segment_size(queue, dev->bounce_size); Index: work/drivers/block/xen-blkfront.c =================================================================== --- work.orig/drivers/block/xen-blkfront.c +++ work/drivers/block/xen-blkfront.c @@ -373,7 +373,7 @@ static int xlvbd_barrier(struct blkfront int err; err = blk_queue_ordered(info->rq, - info->feature_barrier ? QUEUE_ORDERED_DRAIN : QUEUE_ORDERED_NONE, + info->feature_barrier ? QUEUE_ORDERED_BAR : QUEUE_ORDERED_NONE, NULL); if (err) Index: work/drivers/ide/ide-disk.c =================================================================== --- work.orig/drivers/ide/ide-disk.c +++ work/drivers/ide/ide-disk.c @@ -537,11 +537,11 @@ static void update_ordered(ide_drive_t * drive->name, barrier ? 
"" : "not "); if (barrier) { - ordered = QUEUE_ORDERED_DRAIN_FLUSH; + ordered = QUEUE_ORDERED_FLUSH; prep_fn = idedisk_prepare_flush; } } else - ordered = QUEUE_ORDERED_DRAIN; + ordered = QUEUE_ORDERED_BAR; blk_queue_ordered(drive->queue, ordered, prep_fn); } Index: work/drivers/md/dm.c =================================================================== --- work.orig/drivers/md/dm.c +++ work/drivers/md/dm.c @@ -1912,8 +1912,7 @@ static struct mapped_device *alloc_dev(i blk_queue_softirq_done(md->queue, dm_softirq_done); blk_queue_prep_rq(md->queue, dm_prep_fn); blk_queue_lld_busy(md->queue, dm_lld_busy); - blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH, - dm_rq_prepare_flush); + blk_queue_ordered(md->queue, QUEUE_ORDERED_FLUSH, dm_rq_prepare_flush); md->disk = alloc_disk(1); if (!md->disk) Index: work/drivers/mmc/card/queue.c =================================================================== --- work.orig/drivers/mmc/card/queue.c +++ work/drivers/mmc/card/queue.c @@ -128,7 +128,7 @@ int mmc_init_queue(struct mmc_queue *mq, mq->req = NULL; blk_queue_prep_rq(mq->queue, mmc_prep_request); - blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN, NULL); + blk_queue_ordered(mq->queue, QUEUE_ORDERED_BAR, NULL); queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue); #ifdef CONFIG_MMC_BLOCK_BOUNCE Index: work/drivers/s390/block/dasd.c =================================================================== --- work.orig/drivers/s390/block/dasd.c +++ work/drivers/s390/block/dasd.c @@ -2196,7 +2196,7 @@ static void dasd_setup_queue(struct dasd */ blk_queue_max_segment_size(block->request_queue, PAGE_SIZE); blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1); - blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN, NULL); + blk_queue_ordered(block->request_queue, QUEUE_ORDERED_BAR, NULL); } /* Index: work/include/linux/elevator.h =================================================================== --- work.orig/include/linux/elevator.h +++ 
work/include/linux/elevator.h @@ -162,9 +162,9 @@ extern struct request *elv_rb_find(struc * Insertion selection */ #define ELEVATOR_INSERT_FRONT 1 -#define ELEVATOR_INSERT_BACK 2 -#define ELEVATOR_INSERT_SORT 3 -#define ELEVATOR_INSERT_REQUEUE 4 +#define ELEVATOR_INSERT_ORDERED 2 +#define ELEVATOR_INSERT_BACK 3 +#define ELEVATOR_INSERT_SORT 4 /* * return values from elevator_may_queue_fn Index: work/drivers/block/pktcdvd.c =================================================================== --- work.orig/drivers/block/pktcdvd.c +++ work/drivers/block/pktcdvd.c @@ -752,7 +752,6 @@ static int pkt_generic_packet(struct pkt rq->timeout = 60*HZ; rq->cmd_type = REQ_TYPE_BLOCK_PC; - rq->cmd_flags |= REQ_HARDBARRIER; if (cgc->quiet) rq->cmd_flags |= REQ_QUIET; ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [PATCH REPOST RFC] relaxed barriers 2010-08-07 10:13 ` [PATCH REPOST " Tejun Heo @ 2010-08-08 14:31 ` Christoph Hellwig 2010-08-09 14:50 ` Tejun Heo 0 siblings, 1 reply; 155+ messages in thread From: Christoph Hellwig @ 2010-08-08 14:31 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Hellwig, Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel, linux-raid On Sat, Aug 07, 2010 at 12:13:06PM +0200, Tejun Heo wrote: > The patch was on top of v2.6.35 but was generated against dirty tree > and wouldn't apply cleanly. Here's the proper one. Here's an updated version: (a) ported to Jens' current block tree (b) optimize barriers on devices not requiring flushes to be no-ops (b) redo the blk_queue_ordered interface to just set QUEUE_HAS_FLUSH and QUEUE_HAS_FUA flags. Index: linux-2.6/block/blk-barrier.c =================================================================== --- linux-2.6.orig/block/blk-barrier.c 2010-08-07 12:53:23.727479189 -0400 +++ linux-2.6/block/blk-barrier.c 2010-08-07 14:52:21.402479191 -0400 @@ -9,37 +9,36 @@ #include "blk.h" +/* + * Ordered operation sequence. + */ +enum { + QUEUE_ORDSEQ_STARTED = (1 << 0), /* flushing in progress */ + QUEUE_ORDSEQ_PREFLUSH = (1 << 1), /* pre-flushing in progress */ + QUEUE_ORDSEQ_BAR = (1 << 2), /* barrier write in progress */ + QUEUE_ORDSEQ_POSTFLUSH = (1 << 3), /* post-flushing in progress */ + QUEUE_ORDSEQ_DONE = (1 << 4), +}; + +static struct request *queue_next_ordseq(struct request_queue *q); + /** - * blk_queue_ordered - does this queue support ordered writes - * @q: the request queue - * @ordered: one of QUEUE_ORDERED_* - * - * Description: - * For journalled file systems, doing ordered writes on a commit - * block instead of explicitly doing wait_on_buffer (which is bad - * for performance) can be a big win. Block drivers supporting this - * feature should call this function and indicate so. 
- * + * blk_queue_cache_features - set the supported cache control features + * @q: the request queue + * @cache_features: the support features **/ -int blk_queue_ordered(struct request_queue *q, unsigned ordered) +int blk_queue_cache_features(struct request_queue *q, unsigned cache_features) { - if (ordered != QUEUE_ORDERED_NONE && - ordered != QUEUE_ORDERED_DRAIN && - ordered != QUEUE_ORDERED_DRAIN_FLUSH && - ordered != QUEUE_ORDERED_DRAIN_FUA && - ordered != QUEUE_ORDERED_TAG && - ordered != QUEUE_ORDERED_TAG_FLUSH && - ordered != QUEUE_ORDERED_TAG_FUA) { - printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered); + if (cache_features & ~(QUEUE_HAS_FLUSH|QUEUE_HAS_FUA)) { + printk(KERN_ERR "blk_queue_cache_features: bad value %d\n", + cache_features); return -EINVAL; } - q->ordered = ordered; - q->next_ordered = ordered; - + q->cache_features = cache_features; return 0; } -EXPORT_SYMBOL(blk_queue_ordered); +EXPORT_SYMBOL(blk_queue_cache_features); /* * Cache flushing for ordered writes handling @@ -51,38 +50,10 @@ unsigned blk_ordered_cur_seq(struct requ return 1 << ffz(q->ordseq); } -unsigned blk_ordered_req_seq(struct request *rq) -{ - struct request_queue *q = rq->q; - - BUG_ON(q->ordseq == 0); - - if (rq == &q->pre_flush_rq) - return QUEUE_ORDSEQ_PREFLUSH; - if (rq == &q->bar_rq) - return QUEUE_ORDSEQ_BAR; - if (rq == &q->post_flush_rq) - return QUEUE_ORDSEQ_POSTFLUSH; - - /* - * !fs requests don't need to follow barrier ordering. Always - * put them at the front. This fixes the following deadlock. 
- * - * http://thread.gmane.org/gmane.linux.kernel/537473 - */ - if (rq->cmd_type != REQ_TYPE_FS) - return QUEUE_ORDSEQ_DRAIN; - - if ((rq->cmd_flags & REQ_ORDERED_COLOR) == - (q->orig_bar_rq->cmd_flags & REQ_ORDERED_COLOR)) - return QUEUE_ORDSEQ_DRAIN; - else - return QUEUE_ORDSEQ_DONE; -} - -bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error) +static struct request *blk_ordered_complete_seq(struct request_queue *q, + unsigned seq, int error) { - struct request *rq; + struct request *rq = NULL; if (error && !q->orderr) q->orderr = error; @@ -90,16 +61,22 @@ bool blk_ordered_complete_seq(struct req BUG_ON(q->ordseq & seq); q->ordseq |= seq; - if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) - return false; - - /* - * Okay, sequence complete. - */ - q->ordseq = 0; - rq = q->orig_bar_rq; - __blk_end_request_all(rq, q->orderr); - return true; + if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) { + /* not complete yet, queue the next ordered sequence */ + rq = queue_next_ordseq(q); + } else { + /* complete this barrier request */ + __blk_end_request_all(q->orig_bar_rq, q->orderr); + q->orig_bar_rq = NULL; + q->ordseq = 0; + + /* dispatch the next barrier if there's one */ + if (!list_empty(&q->pending_barriers)) { + rq = list_entry_rq(q->pending_barriers.next); + list_move(&rq->queuelist, &q->queue_head); + } + } + return rq; } static void pre_flush_end_io(struct request *rq, int error) @@ -120,155 +97,100 @@ static void post_flush_end_io(struct req blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error); } -static void queue_flush(struct request_queue *q, unsigned which) +static void init_flush_request(struct request_queue *q, struct request *rq) { - struct request *rq; - rq_end_io_fn *end_io; + rq->cmd_type = REQ_TYPE_FS; + rq->cmd_flags = REQ_FLUSH; + rq->rq_disk = q->orig_bar_rq->rq_disk; +} - if (which == QUEUE_ORDERED_DO_PREFLUSH) { - rq = &q->pre_flush_rq; - end_io = pre_flush_end_io; - } else { - rq = &q->post_flush_rq; - 
end_io = post_flush_end_io; - } +/* + * Initialize proxy request and queue it. + */ +static struct request *queue_next_ordseq(struct request_queue *q) +{ + struct request *rq = &q->bar_rq; blk_rq_init(q, rq); - rq->cmd_type = REQ_TYPE_FS; - rq->cmd_flags = REQ_HARDBARRIER | REQ_FLUSH; - rq->rq_disk = q->orig_bar_rq->rq_disk; - rq->end_io = end_io; + + switch (blk_ordered_cur_seq(q)) { + case QUEUE_ORDSEQ_PREFLUSH: + init_flush_request(q, rq); + rq->end_io = pre_flush_end_io; + break; + case QUEUE_ORDSEQ_BAR: + init_request_from_bio(rq, q->orig_bar_rq->bio); + rq->cmd_flags &= ~REQ_HARDBARRIER; + if (q->cache_features & QUEUE_HAS_FUA) + rq->cmd_flags |= REQ_FUA; + rq->end_io = bar_end_io; + break; + case QUEUE_ORDSEQ_POSTFLUSH: + init_flush_request(q, rq); + rq->end_io = post_flush_end_io; + break; + default: + BUG(); + } elv_insert(q, rq, ELEVATOR_INSERT_FRONT); + return rq; } -static inline bool start_ordered(struct request_queue *q, struct request **rqp) +struct request *blk_do_ordered(struct request_queue *q, struct request *rq) { - struct request *rq = *rqp; unsigned skip = 0; - q->orderr = 0; - q->ordered = q->next_ordered; - q->ordseq |= QUEUE_ORDSEQ_STARTED; + if (rq->cmd_type != REQ_TYPE_FS) + return rq; + if (!(rq->cmd_flags & REQ_HARDBARRIER)) + return rq; - /* - * For an empty barrier, there's no actual BAR request, which - * in turn makes POSTFLUSH unnecessary. Mask them off. - */ - if (!blk_rq_sectors(rq)) { - q->ordered &= ~(QUEUE_ORDERED_DO_BAR | - QUEUE_ORDERED_DO_POSTFLUSH); + if (!(q->cache_features & QUEUE_HAS_FLUSH)) { /* - * Empty barrier on a write-through device w/ ordered - * tag has no command to issue and without any command - * to issue, ordering by tag can't be used. Drain - * instead. + * No flush required. We can just send on write requests + * and complete cache flush requests ASAP. 
*/ - if ((q->ordered & QUEUE_ORDERED_BY_TAG) && - !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) { - q->ordered &= ~QUEUE_ORDERED_BY_TAG; - q->ordered |= QUEUE_ORDERED_BY_DRAIN; + if (blk_rq_sectors(rq)) { + rq->cmd_flags &= ~REQ_HARDBARRIER; + return rq; } + blk_dequeue_request(rq); + __blk_end_request_all(rq, 0); + return NULL; } - /* stash away the original request */ - blk_dequeue_request(rq); - q->orig_bar_rq = rq; - rq = NULL; - - /* - * Queue ordered sequence. As we stack them at the head, we - * need to queue in reverse order. Note that we rely on that - * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs - * request gets inbetween ordered sequence. - */ - if (q->ordered & QUEUE_ORDERED_DO_POSTFLUSH) { - queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH); - rq = &q->post_flush_rq; - } else - skip |= QUEUE_ORDSEQ_POSTFLUSH; - - if (q->ordered & QUEUE_ORDERED_DO_BAR) { - rq = &q->bar_rq; - - /* initialize proxy request and queue it */ - blk_rq_init(q, rq); - if (bio_data_dir(q->orig_bar_rq->bio) == WRITE) - rq->cmd_flags |= REQ_WRITE; - if (q->ordered & QUEUE_ORDERED_DO_FUA) - rq->cmd_flags |= REQ_FUA; - init_request_from_bio(rq, q->orig_bar_rq->bio); - rq->end_io = bar_end_io; - - elv_insert(q, rq, ELEVATOR_INSERT_FRONT); - } else - skip |= QUEUE_ORDSEQ_BAR; - - if (q->ordered & QUEUE_ORDERED_DO_PREFLUSH) { - queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH); - rq = &q->pre_flush_rq; - } else - skip |= QUEUE_ORDSEQ_PREFLUSH; - - if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q)) - rq = NULL; - else - skip |= QUEUE_ORDSEQ_DRAIN; + if (q->ordseq) { + /* + * Barrier is already in progress and they can't be + * processed in parallel. Queue for later processing. + */ + list_move_tail(&rq->queuelist, &q->pending_barriers); + return NULL; + } - *rqp = rq; /* - * Complete skipped sequences. If whole sequence is complete, - * return false to tell elevator that this request is gone. 
+ * Start a new ordered sequence */ - return !blk_ordered_complete_seq(q, skip, 0); -} - -bool blk_do_ordered(struct request_queue *q, struct request **rqp) -{ - struct request *rq = *rqp; - const int is_barrier = rq->cmd_type == REQ_TYPE_FS && - (rq->cmd_flags & REQ_HARDBARRIER); - - if (!q->ordseq) { - if (!is_barrier) - return true; - - if (q->next_ordered != QUEUE_ORDERED_NONE) - return start_ordered(q, rqp); - else { - /* - * Queue ordering not supported. Terminate - * with prejudice. - */ - blk_dequeue_request(rq); - __blk_end_request_all(rq, -EOPNOTSUPP); - *rqp = NULL; - return false; - } - } + q->orderr = 0; + q->ordseq |= QUEUE_ORDSEQ_STARTED; /* - * Ordered sequence in progress + * For an empty barrier, there's no actual BAR request, which + * in turn makes POSTFLUSH unnecessary. Mask them off. */ + if (!blk_rq_sectors(rq)) + skip |= (QUEUE_ORDSEQ_BAR|QUEUE_ORDSEQ_POSTFLUSH); + else if (q->cache_features & QUEUE_HAS_FUA) + skip |= QUEUE_ORDSEQ_POSTFLUSH; - /* Special requests are not subject to ordering rules. */ - if (rq->cmd_type != REQ_TYPE_FS && - rq != &q->pre_flush_rq && rq != &q->post_flush_rq) - return true; - - if (q->ordered & QUEUE_ORDERED_BY_TAG) { - /* Ordered by tag. Blocking the next barrier is enough. */ - if (is_barrier && rq != &q->bar_rq) - *rqp = NULL; - } else { - /* Ordered by draining. Wait for turn. 
*/ - WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q)); - if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q)) - *rqp = NULL; - } + /* stash away the original request */ + blk_dequeue_request(rq); + q->orig_bar_rq = rq; - return true; + /* complete skipped sequences and return the first sequence */ + return blk_ordered_complete_seq(q, skip, 0); } static void bio_end_empty_barrier(struct bio *bio, int err) Index: linux-2.6/include/linux/blkdev.h =================================================================== --- linux-2.6.orig/include/linux/blkdev.h 2010-08-07 12:53:23.774479189 -0400 +++ linux-2.6/include/linux/blkdev.h 2010-08-07 14:51:42.751479190 -0400 @@ -354,13 +354,20 @@ struct request_queue #ifdef CONFIG_BLK_DEV_IO_TRACE struct blk_trace *blk_trace; #endif + + /* + * Features this queue understands. + */ + unsigned int cache_features; + /* * reserved for flush operations */ - unsigned int ordered, next_ordered, ordseq; - int orderr, ordcolor; - struct request pre_flush_rq, bar_rq, post_flush_rq; - struct request *orig_bar_rq; + unsigned int ordseq; + int orderr; + struct request bar_rq; + struct request *orig_bar_rq; + struct list_head pending_barriers; struct mutex sysfs_lock; @@ -461,54 +468,12 @@ static inline void queue_flag_clear(unsi __clear_bit(flag, &q->queue_flags); } +/* + * Possible features to control a volatile write cache. + */ enum { - /* - * Hardbarrier is supported with one of the following methods. 
- * - * NONE : hardbarrier unsupported - * DRAIN : ordering by draining is enough - * DRAIN_FLUSH : ordering by draining w/ pre and post flushes - * DRAIN_FUA : ordering by draining w/ pre flush and FUA write - * TAG : ordering by tag is enough - * TAG_FLUSH : ordering by tag w/ pre and post flushes - * TAG_FUA : ordering by tag w/ pre flush and FUA write - */ - QUEUE_ORDERED_BY_DRAIN = 0x01, - QUEUE_ORDERED_BY_TAG = 0x02, - QUEUE_ORDERED_DO_PREFLUSH = 0x10, - QUEUE_ORDERED_DO_BAR = 0x20, - QUEUE_ORDERED_DO_POSTFLUSH = 0x40, - QUEUE_ORDERED_DO_FUA = 0x80, - - QUEUE_ORDERED_NONE = 0x00, - - QUEUE_ORDERED_DRAIN = QUEUE_ORDERED_BY_DRAIN | - QUEUE_ORDERED_DO_BAR, - QUEUE_ORDERED_DRAIN_FLUSH = QUEUE_ORDERED_DRAIN | - QUEUE_ORDERED_DO_PREFLUSH | - QUEUE_ORDERED_DO_POSTFLUSH, - QUEUE_ORDERED_DRAIN_FUA = QUEUE_ORDERED_DRAIN | - QUEUE_ORDERED_DO_PREFLUSH | - QUEUE_ORDERED_DO_FUA, - - QUEUE_ORDERED_TAG = QUEUE_ORDERED_BY_TAG | - QUEUE_ORDERED_DO_BAR, - QUEUE_ORDERED_TAG_FLUSH = QUEUE_ORDERED_TAG | - QUEUE_ORDERED_DO_PREFLUSH | - QUEUE_ORDERED_DO_POSTFLUSH, - QUEUE_ORDERED_TAG_FUA = QUEUE_ORDERED_TAG | - QUEUE_ORDERED_DO_PREFLUSH | - QUEUE_ORDERED_DO_FUA, - - /* - * Ordered operation sequence - */ - QUEUE_ORDSEQ_STARTED = 0x01, /* flushing in progress */ - QUEUE_ORDSEQ_DRAIN = 0x02, /* waiting for the queue to be drained */ - QUEUE_ORDSEQ_PREFLUSH = 0x04, /* pre-flushing in progress */ - QUEUE_ORDSEQ_BAR = 0x08, /* original barrier req in progress */ - QUEUE_ORDSEQ_POSTFLUSH = 0x10, /* post-flushing in progress */ - QUEUE_ORDSEQ_DONE = 0x20, + QUEUE_HAS_FLUSH = 1 << 0, /* supports REQ_FLUSH */ + QUEUE_HAS_FUA = 1 << 1, /* supports REQ_FUA */ }; #define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags) @@ -879,11 +844,9 @@ extern void blk_queue_softirq_done(struc extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *); extern void blk_queue_rq_timeout(struct request_queue *, unsigned int); extern struct backing_dev_info 
*blk_get_backing_dev_info(struct block_device *bdev); -extern int blk_queue_ordered(struct request_queue *, unsigned); -extern bool blk_do_ordered(struct request_queue *, struct request **); +extern int blk_queue_cache_features(struct request_queue *, unsigned); extern unsigned blk_ordered_cur_seq(struct request_queue *); extern unsigned blk_ordered_req_seq(struct request *); -extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int); extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *); extern void blk_dump_rq_flags(struct request *, char *); Index: linux-2.6/drivers/block/virtio_blk.c =================================================================== --- linux-2.6.orig/drivers/block/virtio_blk.c 2010-08-07 12:53:23.800479189 -0400 +++ linux-2.6/drivers/block/virtio_blk.c 2010-08-07 14:51:34.198479189 -0400 @@ -388,31 +388,8 @@ static int __devinit virtblk_probe(struc vblk->disk->driverfs_dev = &vdev->dev; index++; - if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) { - /* - * If the FLUSH feature is supported we do have support for - * flushing a volatile write cache on the host. Use that - * to implement write barrier support. - */ - blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH); - } else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER)) { - /* - * If the BARRIER feature is supported the host expects us - * to order request by tags. This implies there is not - * volatile write cache on the host, and that the host - * never re-orders outstanding I/O. This feature is not - * useful for real life scenarious and deprecated. - */ - blk_queue_ordered(q, QUEUE_ORDERED_TAG); - } else { - /* - * If the FLUSH feature is not supported we must assume that - * the host does not perform any kind of volatile write - * caching. We still need to drain the queue to provider - * proper barrier semantics. 
- */ - blk_queue_ordered(q, QUEUE_ORDERED_DRAIN); - } + if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) + blk_queue_cache_features(q, QUEUE_HAS_FLUSH); /* If disk is read-only in the host, the guest should obey */ if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO)) Index: linux-2.6/drivers/scsi/sd.c =================================================================== --- linux-2.6.orig/drivers/scsi/sd.c 2010-08-07 12:53:23.872479189 -0400 +++ linux-2.6/drivers/scsi/sd.c 2010-08-07 14:54:47.812479189 -0400 @@ -2109,7 +2109,7 @@ static int sd_revalidate_disk(struct gen struct scsi_disk *sdkp = scsi_disk(disk); struct scsi_device *sdp = sdkp->device; unsigned char *buffer; - unsigned ordered; + unsigned ordered = 0; SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp, "sd_revalidate_disk\n")); @@ -2151,17 +2151,14 @@ static int sd_revalidate_disk(struct gen /* * We now have all cache related info, determine how we deal - * with ordered requests. Note that as the current SCSI - * dispatch function can alter request order, we cannot use - * QUEUE_ORDERED_TAG_* even when ordered tag is supported. + * with barriers. */ - if (sdkp->WCE) - ordered = sdkp->DPOFUA - ? 
QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH; - else - ordered = QUEUE_ORDERED_DRAIN; - - blk_queue_ordered(sdkp->disk->queue, ordered); + if (sdkp->WCE) { + ordered |= QUEUE_HAS_FLUSH; + if (sdkp->DPOFUA) + ordered |= QUEUE_HAS_FUA; + } + blk_queue_cache_features(sdkp->disk->queue, ordered); set_capacity(disk, sdkp->capacity); kfree(buffer); Index: linux-2.6/block/blk-core.c =================================================================== --- linux-2.6.orig/block/blk-core.c 2010-08-07 12:53:23.744479189 -0400 +++ linux-2.6/block/blk-core.c 2010-08-07 14:56:35.087479189 -0400 @@ -520,6 +520,7 @@ struct request_queue *blk_alloc_queue_no init_timer(&q->unplug_timer); setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q); INIT_LIST_HEAD(&q->timeout_list); + INIT_LIST_HEAD(&q->pending_barriers); INIT_WORK(&q->unplug_work, blk_unplug_work); kobject_init(&q->kobj, &blk_queue_ktype); @@ -1037,22 +1038,6 @@ void blk_insert_request(struct request_q } EXPORT_SYMBOL(blk_insert_request); -/* - * add-request adds a request to the linked list. - * queue lock is held and interrupts disabled, as we muck with the - * request queue list. 
- */ -static inline void add_request(struct request_queue *q, struct request *req) -{ - drive_stat_acct(req, 1); - - /* - * elevator indicated where it wants this request to be - * inserted at elevator_merge time - */ - __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0); -} - static void part_round_stats_single(int cpu, struct hd_struct *part, unsigned long now) { @@ -1201,13 +1186,9 @@ static int __make_request(struct request const bool sync = (bio->bi_rw & REQ_SYNC); const bool unplug = (bio->bi_rw & REQ_UNPLUG); const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK; + int where = ELEVATOR_INSERT_SORT; int rw_flags; - if ((bio->bi_rw & REQ_HARDBARRIER) && - (q->next_ordered == QUEUE_ORDERED_NONE)) { - bio_endio(bio, -EOPNOTSUPP); - return 0; - } /* * low level driver can indicate that it wants pages above a * certain limit bounced to low memory (ie for highmem, or even @@ -1217,7 +1198,12 @@ static int __make_request(struct request spin_lock_irq(q->queue_lock); - if (unlikely((bio->bi_rw & REQ_HARDBARRIER)) || elv_queue_empty(q)) + if (bio->bi_rw & REQ_HARDBARRIER) { + where = ELEVATOR_INSERT_ORDERED; + goto get_rq; + } + + if (elv_queue_empty(q)) goto get_rq; el_ret = elv_merge(q, &req, bio); @@ -1314,7 +1300,10 @@ get_rq: req->cpu = blk_cpu_to_group(smp_processor_id()); if (queue_should_plug(q) && elv_queue_empty(q)) blk_plug_device(q); - add_request(q, req); + + /* insert the request into the elevator */ + drive_stat_acct(req, 1); + __elv_add_request(q, req, where, 0); out: if (unplug || !queue_should_plug(q)) __generic_unplug_device(q); Index: linux-2.6/block/elevator.c =================================================================== --- linux-2.6.orig/block/elevator.c 2010-08-07 12:53:23.752479189 -0400 +++ linux-2.6/block/elevator.c 2010-08-07 12:53:53.162479190 -0400 @@ -564,7 +564,7 @@ void elv_requeue_request(struct request_ rq->cmd_flags &= ~REQ_STARTED; - elv_insert(q, rq, ELEVATOR_INSERT_REQUEUE); + elv_insert(q, rq, ELEVATOR_INSERT_FRONT); } void 
elv_drain_elevator(struct request_queue *q) @@ -611,8 +611,6 @@ void elv_quiesce_end(struct request_queu void elv_insert(struct request_queue *q, struct request *rq, int where) { - struct list_head *pos; - unsigned ordseq; int unplug_it = 1; trace_block_rq_insert(q, rq); @@ -622,10 +620,14 @@ void elv_insert(struct request_queue *q, switch (where) { case ELEVATOR_INSERT_FRONT: rq->cmd_flags |= REQ_SOFTBARRIER; - list_add(&rq->queuelist, &q->queue_head); break; + case ELEVATOR_INSERT_ORDERED: + rq->cmd_flags |= REQ_SOFTBARRIER; + list_add_tail(&rq->queuelist, &q->queue_head); + break; + case ELEVATOR_INSERT_BACK: rq->cmd_flags |= REQ_SOFTBARRIER; elv_drain_elevator(q); @@ -662,36 +664,6 @@ void elv_insert(struct request_queue *q, q->elevator->ops->elevator_add_req_fn(q, rq); break; - case ELEVATOR_INSERT_REQUEUE: - /* - * If ordered flush isn't in progress, we do front - * insertion; otherwise, requests should be requeued - * in ordseq order. - */ - rq->cmd_flags |= REQ_SOFTBARRIER; - - /* - * Most requeues happen because of a busy condition, - * don't force unplug of the queue for that case. 
- */ - unplug_it = 0; - - if (q->ordseq == 0) { - list_add(&rq->queuelist, &q->queue_head); - break; - } - - ordseq = blk_ordered_req_seq(rq); - - list_for_each(pos, &q->queue_head) { - struct request *pos_rq = list_entry_rq(pos); - if (ordseq <= blk_ordered_req_seq(pos_rq)) - break; - } - - list_add_tail(&rq->queuelist, pos); - break; - default: printk(KERN_ERR "%s: bad insertion point %d\n", __func__, where); @@ -710,33 +682,15 @@ void elv_insert(struct request_queue *q, void __elv_add_request(struct request_queue *q, struct request *rq, int where, int plug) { - if (q->ordcolor) - rq->cmd_flags |= REQ_ORDERED_COLOR; - if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) { - /* - * toggle ordered color - */ - if (rq->cmd_flags & REQ_HARDBARRIER) - q->ordcolor ^= 1; - - /* - * barriers implicitly indicate back insertion - */ - if (where == ELEVATOR_INSERT_SORT) - where = ELEVATOR_INSERT_BACK; - - /* - * this request is scheduling boundary, update - * end_sector - */ + /* barriers are scheduling boundary, update end_sector */ if (rq->cmd_type == REQ_TYPE_FS || (rq->cmd_flags & REQ_DISCARD)) { q->end_sector = rq_end_sector(rq); q->boundary_rq = rq; } } else if (!(rq->cmd_flags & REQ_ELVPRIV) && - where == ELEVATOR_INSERT_SORT) + where == ELEVATOR_INSERT_SORT) where = ELEVATOR_INSERT_BACK; if (plug) @@ -849,24 +803,6 @@ void elv_completed_request(struct reques e->ops->elevator_completed_req_fn) e->ops->elevator_completed_req_fn(q, rq); } - - /* - * Check if the queue is waiting for fs requests to be - * drained for flush sequence. 
- */ - if (unlikely(q->ordseq)) { - struct request *next = NULL; - - if (!list_empty(&q->queue_head)) - next = list_entry_rq(q->queue_head.next); - - if (!queue_in_flight(q) && - blk_ordered_cur_seq(q) == QUEUE_ORDSEQ_DRAIN && - (!next || blk_ordered_req_seq(next) > QUEUE_ORDSEQ_DRAIN)) { - blk_ordered_complete_seq(q, QUEUE_ORDSEQ_DRAIN, 0); - __blk_run_queue(q); - } - } } #define to_elv(atr) container_of((atr), struct elv_fs_entry, attr) Index: linux-2.6/block/blk.h =================================================================== --- linux-2.6.orig/block/blk.h 2010-08-07 12:53:23.762479189 -0400 +++ linux-2.6/block/blk.h 2010-08-07 12:53:53.171479190 -0400 @@ -51,6 +51,8 @@ static inline void blk_clear_rq_complete */ #define ELV_ON_HASH(rq) (!hlist_unhashed(&(rq)->hash)) +struct request *blk_do_ordered(struct request_queue *q, struct request *rq); + static inline struct request *__elv_next_request(struct request_queue *q) { struct request *rq; @@ -58,7 +60,8 @@ static inline struct request *__elv_next while (1) { while (!list_empty(&q->queue_head)) { rq = list_entry_rq(q->queue_head.next); - if (blk_do_ordered(q, &rq)) + rq = blk_do_ordered(q, rq); + if (rq) return rq; } Index: linux-2.6/drivers/block/xen-blkfront.c =================================================================== --- linux-2.6.orig/drivers/block/xen-blkfront.c 2010-08-07 12:53:23.807479189 -0400 +++ linux-2.6/drivers/block/xen-blkfront.c 2010-08-07 14:44:39.564479189 -0400 @@ -417,30 +417,6 @@ static int xlvbd_init_blk_queue(struct g return 0; } - -static int xlvbd_barrier(struct blkfront_info *info) -{ - int err; - const char *barrier; - - switch (info->feature_barrier) { - case QUEUE_ORDERED_DRAIN: barrier = "enabled (drain)"; break; - case QUEUE_ORDERED_TAG: barrier = "enabled (tag)"; break; - case QUEUE_ORDERED_NONE: barrier = "disabled"; break; - default: return -EINVAL; - } - - err = blk_queue_ordered(info->rq, info->feature_barrier); - - if (err) - return err; - - printk(KERN_INFO 
"blkfront: %s: barriers %s\n", - info->gd->disk_name, barrier); - return 0; -} - - static int xlvbd_alloc_gendisk(blkif_sector_t capacity, struct blkfront_info *info, u16 vdisk_info, u16 sector_size) @@ -516,8 +492,6 @@ static int xlvbd_alloc_gendisk(blkif_sec info->rq = gd->queue; info->gd = gd; - xlvbd_barrier(info); - if (vdisk_info & VDISK_READONLY) set_disk_ro(gd, 1); @@ -662,8 +636,6 @@ static irqreturn_t blkif_interrupt(int i printk(KERN_WARNING "blkfront: %s: write barrier op failed\n", info->gd->disk_name); error = -EOPNOTSUPP; - info->feature_barrier = QUEUE_ORDERED_NONE; - xlvbd_barrier(info); } /* fall through */ case BLKIF_OP_READ: @@ -1073,24 +1045,6 @@ static void blkfront_connect(struct blkf "feature-barrier", "%lu", &barrier, NULL); - /* - * If there's no "feature-barrier" defined, then it means - * we're dealing with a very old backend which writes - * synchronously; draining will do what needs to get done. - * - * If there are barriers, then we can do full queued writes - * with tagged barriers. - * - * If barriers are not supported, then there's no much we can - * do, so just set ordering to NONE. 
- */ - if (err) - info->feature_barrier = QUEUE_ORDERED_DRAIN; - else if (barrier) - info->feature_barrier = QUEUE_ORDERED_TAG; - else - info->feature_barrier = QUEUE_ORDERED_NONE; - err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size); if (err) { xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s", Index: linux-2.6/drivers/ide/ide-disk.c =================================================================== --- linux-2.6.orig/drivers/ide/ide-disk.c 2010-08-07 12:53:23.889479189 -0400 +++ linux-2.6/drivers/ide/ide-disk.c 2010-08-07 15:00:30.215479189 -0400 @@ -518,12 +518,13 @@ static int ide_do_setfeature(ide_drive_t static void update_ordered(ide_drive_t *drive) { - u16 *id = drive->id; - unsigned ordered = QUEUE_ORDERED_NONE; + unsigned ordered = 0; if (drive->dev_flags & IDE_DFLAG_WCACHE) { + u16 *id = drive->id; unsigned long long capacity; int barrier; + /* * We must avoid issuing commands a drive does not * understand or we may crash it. We check flush cache @@ -543,13 +544,18 @@ static void update_ordered(ide_drive_t * drive->name, barrier ? 
"" : "not "); if (barrier) { - ordered = QUEUE_ORDERED_DRAIN_FLUSH; + printk(KERN_INFO "%s: cache flushes supported\n", + drive->name); blk_queue_prep_rq(drive->queue, idedisk_prep_fn); + ordered |= QUEUE_HAS_FLUSH; + } else { + printk(KERN_INFO + "%s: WARNING: cache flushes not supported\n", + drive->name); } - } else - ordered = QUEUE_ORDERED_DRAIN; + } - blk_queue_ordered(drive->queue, ordered); + blk_queue_cache_features(drive->queue, ordered); } ide_devset_get_flag(wcache, IDE_DFLAG_WCACHE); Index: linux-2.6/drivers/md/dm.c =================================================================== --- linux-2.6.orig/drivers/md/dm.c 2010-08-07 12:53:23.905479189 -0400 +++ linux-2.6/drivers/md/dm.c 2010-08-07 14:51:38.240479189 -0400 @@ -1908,7 +1908,7 @@ static struct mapped_device *alloc_dev(i blk_queue_softirq_done(md->queue, dm_softirq_done); blk_queue_prep_rq(md->queue, dm_prep_fn); blk_queue_lld_busy(md->queue, dm_lld_busy); - blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH); + blk_queue_cache_features(md->queue, QUEUE_HAS_FLUSH); md->disk = alloc_disk(1); if (!md->disk) Index: linux-2.6/drivers/mmc/card/queue.c =================================================================== --- linux-2.6.orig/drivers/mmc/card/queue.c 2010-08-07 12:53:23.927479189 -0400 +++ linux-2.6/drivers/mmc/card/queue.c 2010-08-07 14:30:09.666479189 -0400 @@ -128,7 +128,6 @@ int mmc_init_queue(struct mmc_queue *mq, mq->req = NULL; blk_queue_prep_rq(mq->queue, mmc_prep_request); - blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN); queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue); #ifdef CONFIG_MMC_BLOCK_BOUNCE Index: linux-2.6/drivers/s390/block/dasd.c =================================================================== --- linux-2.6.orig/drivers/s390/block/dasd.c 2010-08-07 12:53:23.939479189 -0400 +++ linux-2.6/drivers/s390/block/dasd.c 2010-08-07 14:30:13.307479189 -0400 @@ -2197,7 +2197,6 @@ static void dasd_setup_queue(struct dasd */ 
blk_queue_max_segment_size(block->request_queue, PAGE_SIZE); blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1); - blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN); } /* Index: linux-2.6/include/linux/elevator.h =================================================================== --- linux-2.6.orig/include/linux/elevator.h 2010-08-07 12:53:23.781479189 -0400 +++ linux-2.6/include/linux/elevator.h 2010-08-07 12:53:53.208479190 -0400 @@ -162,9 +162,9 @@ extern struct request *elv_rb_find(struc * Insertion selection */ #define ELEVATOR_INSERT_FRONT 1 -#define ELEVATOR_INSERT_BACK 2 -#define ELEVATOR_INSERT_SORT 3 -#define ELEVATOR_INSERT_REQUEUE 4 +#define ELEVATOR_INSERT_ORDERED 2 +#define ELEVATOR_INSERT_BACK 3 +#define ELEVATOR_INSERT_SORT 4 /* * return values from elevator_may_queue_fn Index: linux-2.6/drivers/block/pktcdvd.c =================================================================== --- linux-2.6.orig/drivers/block/pktcdvd.c 2010-08-07 12:53:23.815479189 -0400 +++ linux-2.6/drivers/block/pktcdvd.c 2010-08-07 12:53:53.211479190 -0400 @@ -753,7 +753,6 @@ static int pkt_generic_packet(struct pkt rq->timeout = 60*HZ; rq->cmd_type = REQ_TYPE_BLOCK_PC; - rq->cmd_flags |= REQ_HARDBARRIER; if (cgc->quiet) rq->cmd_flags |= REQ_QUIET; Index: linux-2.6/drivers/block/brd.c =================================================================== --- linux-2.6.orig/drivers/block/brd.c 2010-08-07 12:53:23.825479189 -0400 +++ linux-2.6/drivers/block/brd.c 2010-08-07 14:26:12.293479191 -0400 @@ -482,7 +482,6 @@ static struct brd_device *brd_alloc(int if (!brd->brd_queue) goto out_free_dev; blk_queue_make_request(brd->brd_queue, brd_make_request); - blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_TAG); blk_queue_max_hw_sectors(brd->brd_queue, 1024); blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY); Index: linux-2.6/drivers/block/loop.c =================================================================== --- linux-2.6.orig/drivers/block/loop.c 
2010-08-07 12:53:23.836479189 -0400 +++ linux-2.6/drivers/block/loop.c 2010-08-07 14:51:27.937479189 -0400 @@ -831,8 +831,8 @@ static int loop_set_fd(struct loop_devic lo->lo_queue->queuedata = lo; lo->lo_queue->unplug_fn = loop_unplug; - if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync) - blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN); + /* XXX(hch): loop can't properly deal with flush requests currently */ +// blk_queue_cache_features(lo->lo_queue, QUEUE_HAS_FLUSH); set_capacity(lo->lo_disk, size); bd_set_size(bdev, size << 9); Index: linux-2.6/drivers/block/osdblk.c =================================================================== --- linux-2.6.orig/drivers/block/osdblk.c 2010-08-07 12:53:23.843479189 -0400 +++ linux-2.6/drivers/block/osdblk.c 2010-08-07 14:51:30.091479189 -0400 @@ -439,7 +439,7 @@ static int osdblk_init_disk(struct osdbl blk_queue_stack_limits(q, osd_request_queue(osdev->osd)); blk_queue_prep_rq(q, blk_queue_start_tag); - blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH); + blk_queue_cache_features(q, QUEUE_HAS_FLUSH); disk->queue = q; Index: linux-2.6/drivers/block/ps3disk.c =================================================================== --- linux-2.6.orig/drivers/block/ps3disk.c 2010-08-07 12:53:23.859479189 -0400 +++ linux-2.6/drivers/block/ps3disk.c 2010-08-07 14:51:32.204479189 -0400 @@ -468,7 +468,7 @@ static int __devinit ps3disk_probe(struc blk_queue_dma_alignment(queue, dev->blk_size-1); blk_queue_logical_block_size(queue, dev->blk_size); - blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH); + blk_queue_cache_features(queue, QUEUE_HAS_FLUSH); blk_queue_max_segments(queue, -1); blk_queue_max_segment_size(queue, dev->bounce_size); Index: linux-2.6/include/linux/blk_types.h =================================================================== --- linux-2.6.orig/include/linux/blk_types.h 2010-08-07 12:53:23.793479189 -0400 +++ linux-2.6/include/linux/blk_types.h 2010-08-07 12:53:53.243479190 -0400 @@ -141,7 +141,6 @@ enum 
rq_flag_bits { __REQ_FAILED, /* set if the request failed */ __REQ_QUIET, /* don't worry about errors */ __REQ_PREEMPT, /* set for "ide_preempt" requests */ - __REQ_ORDERED_COLOR, /* is before or after barrier */ __REQ_ALLOCED, /* request came from our alloc pool */ __REQ_COPY_USER, /* contains copies of user pages */ __REQ_INTEGRITY, /* integrity metadata has been remapped */ @@ -181,7 +180,6 @@ enum rq_flag_bits { #define REQ_FAILED (1 << __REQ_FAILED) #define REQ_QUIET (1 << __REQ_QUIET) #define REQ_PREEMPT (1 << __REQ_PREEMPT) -#define REQ_ORDERED_COLOR (1 << __REQ_ORDERED_COLOR) #define REQ_ALLOCED (1 << __REQ_ALLOCED) #define REQ_COPY_USER (1 << __REQ_COPY_USER) #define REQ_INTEGRITY (1 << __REQ_INTEGRITY) ^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [PATCH REPOST RFC] relaxed barriers 2010-08-08 14:31 ` Christoph Hellwig @ 2010-08-09 14:50 ` Tejun Heo 0 siblings, 0 replies; 155+ messages in thread From: Tejun Heo @ 2010-08-09 14:50 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, jaxboe, James.Bottomley, linux-fsdevel, linux-scsi, tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel, linux-raid On 08/08/2010 04:31 PM, Christoph Hellwig wrote: > On Sat, Aug 07, 2010 at 12:13:06PM +0200, Tejun Heo wrote: >> The patch was on top of v2.6.35 but was generated against dirty tree >> and wouldn't apply cleanly. Here's the proper one. > > Here's an updated version: > > (a) ported to Jens' current block tree > (b) optimize barriers on devices not requiring flushes to be no-ops > (b) redo the blk_queue_ordered interface to just set QUEUE_HAS_FLUSH > and QUEUE_HAS_FUA flags. Nice. I'm working on a properly split patchset implementing REQ_FLUSH/FUA based interface, which replaces REQ_HARDBARRIER. Empty request w/ REQ_FLUSH just flushes cache but has no other ordering restrictions. REQ_FLUSH + data means preflush + data write. REQ_FUA + data means data would be committed to NV media on completion. REQ_FLUSH + FUA + data means preflush + NV data write. All FLUSH/FUA requests w/ data are ordered only against each other. I think I'll be able to post in several days. Thanks. -- tejun ^ permalink raw reply [flat|nested] 155+ messages in thread
end of thread, other threads:[~2010-08-30 12:45 UTC | newest]

Thread overview: 155+ messages (download: mbox.gz / follow: Atom feed)