SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!

All of lore.kernel.org
 help / color / mirror / Atom feed

* SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!
@ 2010-04-19 11:32 Desai, Kashyap
  2010-04-19 16:20 ` Mike Christie
  0 siblings, 1 reply; 11+ messages in thread
From: Desai, Kashyap @ 2010-04-19 11:32 UTC (permalink / raw)
  To: linux-scsi; +Cc: James.Bottomley

I am facing one issue with scsi stack.
Here is a background of my test.

Mount ext3 file system with journaling support with barrier=1, commit=5
Now, with this setup file system will do submit_bh with  WRITE_BARRIER flag set for interval of 5 seconds. (This is a part of journaling.)
Eventually it will call queue_flush() which will generate SCSI command of CDB: SYNCHRONIZE_CAHCE and insert it into the request queue.
I observed that creation of SYNCHRONIZE_CACHE is a part of sd_prepare_flush(). Here we have timeout set to SD_TIMEOUT but retries are not set.
Because of retries of the request is not set, there is no retries allowed for SYNCHRONIZE_CACHE at mid layer.

Because of zero retries for SYNCHRONIZE_CACHE command at mid-layer, it is creating trouble for file system.
In current situation, Even though LLD send back commands with DID_RESET, SYNCHRONIZE_CACHE will fail immediately without going for any retries, when HBA is in recovery state. Eventually this information goes to File system and it sees SYNCHRONIZE_CAHCE is failed and file system goes to Read only mode.

My question is "Can we add in sd_prepare_flush(), rq->retries = X" some reasonable retries value ?

Thanks,
Kashyap

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!
  2010-04-19 11:32 SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.! Desai, Kashyap
@ 2010-04-19 16:20 ` Mike Christie
  2010-04-19 18:14   ` Bernd Schubert
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Christie @ 2010-04-19 16:20 UTC (permalink / raw)
  To: Desai, Kashyap; +Cc: linux-scsi, James.Bottomley

On 04/19/2010 06:32 AM, Desai, Kashyap wrote:
> I am facing one issue with scsi stack.
> Here is a background of my test.
>
> Mount ext3 file system with journaling support with barrier=1, commit=5
> Now, with this setup file system will do submit_bh with  WRITE_BARRIER flag set for interval of 5 seconds. (This is a part of journaling.)
> Eventually it will call queue_flush() which will generate SCSI command of CDB: SYNCHRONIZE_CAHCE and insert it into the request queue.
> I observed that creation of SYNCHRONIZE_CACHE is a part of sd_prepare_flush(). Here we have timeout set to SD_TIMEOUT but retries are not set.
> Because of retries of the request is not set, there is no retries allowed for SYNCHRONIZE_CACHE at mid layer.
>
> Because of zero retries for SYNCHRONIZE_CACHE command at mid-layer, it is creating trouble for file system.
> In current situation, Even though LLD send back commands with DID_RESET, SYNCHRONIZE_CACHE will fail immediately without going for any retries, when HBA is in recovery state. Eventually this information goes to File system and it sees SYNCHRONIZE_CAHCE is failed and file system goes to Read only mode.
>
> My question is "Can we add in sd_prepare_flush(), rq->retries = X" some reasonable retries value ?
>

I am not sure where we want it, but I think we want to be able to set 
both the retries and timeout. I have seen where a sync cache can take 
longer than the default 30 secs.

Do you think we want to the block layer to manage retries/timeouts for 
all block device flushes or is this more device specific? I was thinking 
that we may want to create a sysfs interface under the block dirs and 
have blk-sysfs.c and blk-barrier.c handle this. queue_flush could set 
the timeout and retries that is set by some new files under 
/sys/block/sdX/queue/ ?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!
  2010-04-19 16:20 ` Mike Christie
@ 2010-04-19 18:14   ` Bernd Schubert
  2010-04-19 18:45     ` James Bottomley
  0 siblings, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2010-04-19 18:14 UTC (permalink / raw)
  To: Mike Christie
  Cc: Desai, Kashyap, linux-scsi, James.Bottomley@HansenPartnership.com

On Monday 19 April 2010, Mike Christie wrote:
> On 04/19/2010 06:32 AM, Desai, Kashyap wrote:
> > I am facing one issue with scsi stack.
> > Here is a background of my test.
> >
> > Mount ext3 file system with journaling support with barrier=1, commit=5
> > Now, with this setup file system will do submit_bh with  WRITE_BARRIER
> > flag set for interval of 5 seconds. (This is a part of journaling.)
> > Eventually it will call queue_flush() which will generate SCSI command of
> > CDB: SYNCHRONIZE_CAHCE and insert it into the request queue. I observed
> > that creation of SYNCHRONIZE_CACHE is a part of sd_prepare_flush(). Here
> > we have timeout set to SD_TIMEOUT but retries are not set. Because of
> > retries of the request is not set, there is no retries allowed for
> > SYNCHRONIZE_CACHE at mid layer.
> >
> > Because of zero retries for SYNCHRONIZE_CACHE command at mid-layer, it is
> > creating trouble for file system. In current situation, Even though LLD
> > send back commands with DID_RESET, SYNCHRONIZE_CACHE will fail
> > immediately without going for any retries, when HBA is in recovery state.
> > Eventually this information goes to File system and it sees
> > SYNCHRONIZE_CAHCE is failed and file system goes to Read only mode.
> >
> > My question is "Can we add in sd_prepare_flush(), rq->retries = X" some
> > reasonable retries value ?
> 
> I am not sure where we want it, but I think we want to be able to set
> both the retries and timeout. I have seen where a sync cache can take
> longer than the default 30 secs.
> 
> Do you think we want to the block layer to manage retries/timeouts for
> all block device flushes or is this more device specific? I was thinking
> that we may want to create a sysfs interface under the block dirs and
> have blk-sysfs.c and blk-barrier.c handle this. queue_flush could set
> the timeout and retries that is set by some new files under
> /sys/block/sdX/queue/ ?


Good that now also other people run into it. 30s is far too small for any 
hardware raid unit with SATA disks. 

http://markmail.org/message/ewicheafcvgwm4p7

I wrote this patch while having trouble with Infortrend Raids, but it also 
comes up with DDN storage if the write back cache is enabled. 
Shall I update the patch, add retries and then resend the entire series? 


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!
  2010-04-19 18:14   ` Bernd Schubert
@ 2010-04-19 18:45     ` James Bottomley
  2010-04-19 19:17       ` Bernd Schubert
  2010-04-29 20:13       ` Ric Wheeler
  0 siblings, 2 replies; 11+ messages in thread
From: James Bottomley @ 2010-04-19 18:45 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Mike Christie, Desai, Kashyap, linux-scsi

On Mon, 2010-04-19 at 20:14 +0200, Bernd Schubert wrote:
> On Monday 19 April 2010, Mike Christie wrote:
> > On 04/19/2010 06:32 AM, Desai, Kashyap wrote:
> > > I am facing one issue with scsi stack.
> > > Here is a background of my test.
> > >
> > > Mount ext3 file system with journaling support with barrier=1, commit=5
> > > Now, with this setup file system will do submit_bh with  WRITE_BARRIER
> > > flag set for interval of 5 seconds. (This is a part of journaling.)
> > > Eventually it will call queue_flush() which will generate SCSI command of
> > > CDB: SYNCHRONIZE_CAHCE and insert it into the request queue. I observed
> > > that creation of SYNCHRONIZE_CACHE is a part of sd_prepare_flush(). Here
> > > we have timeout set to SD_TIMEOUT but retries are not set. Because of
> > > retries of the request is not set, there is no retries allowed for
> > > SYNCHRONIZE_CACHE at mid layer.
> > >
> > > Because of zero retries for SYNCHRONIZE_CACHE command at mid-layer, it is
> > > creating trouble for file system. In current situation, Even though LLD
> > > send back commands with DID_RESET, SYNCHRONIZE_CACHE will fail
> > > immediately without going for any retries, when HBA is in recovery state.
> > > Eventually this information goes to File system and it sees
> > > SYNCHRONIZE_CAHCE is failed and file system goes to Read only mode.
> > >
> > > My question is "Can we add in sd_prepare_flush(), rq->retries = X" some
> > > reasonable retries value ?
> > 
> > I am not sure where we want it, but I think we want to be able to set
> > both the retries and timeout. I have seen where a sync cache can take
> > longer than the default 30 secs.
> > 
> > Do you think we want to the block layer to manage retries/timeouts for
> > all block device flushes or is this more device specific? I was thinking
> > that we may want to create a sysfs interface under the block dirs and
> > have blk-sysfs.c and blk-barrier.c handle this. queue_flush could set
> > the timeout and retries that is set by some new files under
> > /sys/block/sdX/queue/ ?
> 
> 
> Good that now also other people run into it. 30s is far too small for any 
> hardware raid unit with SATA disks. 

It's far too short for just about any HW RAID since they all tend to
have multi-megabytes to gigabytes of cache (some of the high end have
terrabytes).  It has to be said that most arrays with battery backed
caches lie when asked to flush the cache, but we probably need to get
users into the habit of not using flush barriers with external Arrays.

> http://markmail.org/message/ewicheafcvgwm4p7
> 
> I wrote this patch while having trouble with Infortrend Raids, but it also 
> comes up with DDN storage if the write back cache is enabled. 
> Shall I update the patch, add retries and then resend the entire series? 

rq->timeout is the timeout of the request triggering the flush ... it's
likely the wrong value since it's for a fast completing r/w operation,
whereas this is a slow drain operation.

James



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!
  2010-04-19 18:45     ` James Bottomley
@ 2010-04-19 19:17       ` Bernd Schubert
  2010-04-20  5:05         ` Desai, Kashyap
  2010-04-29 20:13       ` Ric Wheeler
  1 sibling, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2010-04-19 19:17 UTC (permalink / raw)
  To: James Bottomley; +Cc: Mike Christie, Desai, Kashyap, linux-scsi, Bernd Schubert

On Monday 19 April 2010, James Bottomley wrote:
> On Mon, 2010-04-19 at 20:14 +0200, Bernd Schubert wrote:
> > On Monday 19 April 2010, Mike Christie wrote:
> > > On 04/19/2010 06:32 AM, Desai, Kashyap wrote:
> > > > I am facing one issue with scsi stack.
> > > > Here is a background of my test.
> > > >
> > > > Mount ext3 file system with journaling support with barrier=1,
> > > > commit=5 Now, with this setup file system will do submit_bh with 
> > > > WRITE_BARRIER flag set for interval of 5 seconds. (This is a part of
> > > > journaling.) Eventually it will call queue_flush() which will
> > > > generate SCSI command of CDB: SYNCHRONIZE_CAHCE and insert it into
> > > > the request queue. I observed that creation of SYNCHRONIZE_CACHE is a
> > > > part of sd_prepare_flush(). Here we have timeout set to SD_TIMEOUT
> > > > but retries are not set. Because of retries of the request is not
> > > > set, there is no retries allowed for SYNCHRONIZE_CACHE at mid layer.
> > > >
> > > > Because of zero retries for SYNCHRONIZE_CACHE command at mid-layer,
> > > > it is creating trouble for file system. In current situation, Even
> > > > though LLD send back commands with DID_RESET, SYNCHRONIZE_CACHE will
> > > > fail immediately without going for any retries, when HBA is in
> > > > recovery state. Eventually this information goes to File system and
> > > > it sees
> > > > SYNCHRONIZE_CAHCE is failed and file system goes to Read only mode.
> > > >
> > > > My question is "Can we add in sd_prepare_flush(), rq->retries = X"
> > > > some reasonable retries value ?
> > >
> > > I am not sure where we want it, but I think we want to be able to set
> > > both the retries and timeout. I have seen where a sync cache can take
> > > longer than the default 30 secs.
> > >
> > > Do you think we want to the block layer to manage retries/timeouts for
> > > all block device flushes or is this more device specific? I was
> > > thinking that we may want to create a sysfs interface under the block
> > > dirs and have blk-sysfs.c and blk-barrier.c handle this. queue_flush
> > > could set the timeout and retries that is set by some new files under
> > > /sys/block/sdX/queue/ ?
> >
> > Good that now also other people run into it. 30s is far too small for any
> > hardware raid unit with SATA disks.
> 
> It's far too short for just about any HW RAID since they all tend to
> have multi-megabytes to gigabytes of cache (some of the high end have
> terrabytes).  It has to be said that most arrays with battery backed

For DDN storage 30s are actually sufficient, unless disk delays come up. But 
then we presently also only have a rather small cache only (2GB) with lots of 
disks. 
Nowadays one can get an UPS protected DDN-9900 controller, but the firmware 
still properly handles the SYNC_CACHE command.

> caches lie when asked to flush the cache, but we probably need to get
> users into the habit of not using flush barriers with external Arrays.
> 
> > http://markmail.org/message/ewicheafcvgwm4p7
> >
> > I wrote this patch while having trouble with Infortrend Raids, but it
> > also comes up with DDN storage if the write back cache is enabled.
> > Shall I update the patch, add retries and then resend the entire series?
> 
> rq->timeout is the timeout of the request triggering the flush ... it's
> likely the wrong value since it's for a fast completing r/w operation,
> whereas this is a slow drain operation.

Hmm, in the past we had scsi_device->timeout, but I thought this was given up 
in favour of scsi_device->request_queue->rq_timeout? (somehwere around 
2.6.27?)


Thanks,
Bernd


-- 
Bernd Schubert
DataDirect Networks

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!
  2010-04-19 19:17       ` Bernd Schubert
@ 2010-04-20  5:05         ` Desai, Kashyap
  2010-04-20 14:54           ` [PATCH] " Bernd Schubert
  0 siblings, 1 reply; 11+ messages in thread
From: Desai, Kashyap @ 2010-04-20  5:05 UTC (permalink / raw)
  To: Bernd Schubert, James Bottomley; +Cc: Mike Christie, linux-scsi, Bernd Schubert



> -----Original Message-----
> From: Bernd Schubert [mailto:bs_lists@aakef.fastmail.fm]
> Sent: Tuesday, April 20, 2010 12:47 AM
> To: James Bottomley
> Cc: Mike Christie; Desai, Kashyap; linux-scsi@vger.kernel.org; Bernd
> Schubert
> Subject: Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have
> retries.!
> 
> On Monday 19 April 2010, James Bottomley wrote:
> > On Mon, 2010-04-19 at 20:14 +0200, Bernd Schubert wrote:
> > > On Monday 19 April 2010, Mike Christie wrote:
> > > > On 04/19/2010 06:32 AM, Desai, Kashyap wrote:
> > > > > I am facing one issue with scsi stack.
> > > > > Here is a background of my test.
> > > > >
> > > > > Mount ext3 file system with journaling support with barrier=1,
> > > > > commit=5 Now, with this setup file system will do submit_bh
> with
> > > > > WRITE_BARRIER flag set for interval of 5 seconds. (This is a
> part of
> > > > > journaling.) Eventually it will call queue_flush() which will
> > > > > generate SCSI command of CDB: SYNCHRONIZE_CAHCE and insert it
> into
> > > > > the request queue. I observed that creation of
> SYNCHRONIZE_CACHE is a
> > > > > part of sd_prepare_flush(). Here we have timeout set to
> SD_TIMEOUT
> > > > > but retries are not set. Because of retries of the request is
> not
> > > > > set, there is no retries allowed for SYNCHRONIZE_CACHE at mid
> layer.
> > > > >
> > > > > Because of zero retries for SYNCHRONIZE_CACHE command at mid-
> layer,
> > > > > it is creating trouble for file system. In current situation,
> Even
> > > > > though LLD send back commands with DID_RESET, SYNCHRONIZE_CACHE
> will
> > > > > fail immediately without going for any retries, when HBA is in
> > > > > recovery state. Eventually this information goes to File system
> and
> > > > > it sees
> > > > > SYNCHRONIZE_CAHCE is failed and file system goes to Read only
> mode.
> > > > >
> > > > > My question is "Can we add in sd_prepare_flush(), rq->retries =
> X"
> > > > > some reasonable retries value ?
> > > >
> > > > I am not sure where we want it, but I think we want to be able to
> set
> > > > both the retries and timeout. I have seen where a sync cache can
> take
> > > > longer than the default 30 secs.
> > > >
> > > > Do you think we want to the block layer to manage
> retries/timeouts for
> > > > all block device flushes or is this more device specific? I was
> > > > thinking that we may want to create a sysfs interface under the
> block
> > > > dirs and have blk-sysfs.c and blk-barrier.c handle this.
> queue_flush
> > > > could set the timeout and retries that is set by some new files
> under
> > > > /sys/block/sdX/queue/ ?
Thanks a lot for your comments.

This is very close to my understanding. I feel this is more close to block layer and I am almost agreeing with your thought.
I tried to understand why upstream does not have retries at queue_flush()/sd_prepare_flush() ??? It looks like there is not specific reason.
If I am wrong can someone explain is there any specific reason not to set rq->retries in sd_prepare_flush? 

Thanks,
Kashyap
> > >
> > > Good that now also other people run into it. 30s is far too small
> for any
> > > hardware raid unit with SATA disks.
> >
> > It's far too short for just about any HW RAID since they all tend to
> > have multi-megabytes to gigabytes of cache (some of the high end have
> > terrabytes).  It has to be said that most arrays with battery backed
> 
> For DDN storage 30s are actually sufficient, unless disk delays come
> up. But
> then we presently also only have a rather small cache only (2GB) with
> lots of
> disks.
> Nowadays one can get an UPS protected DDN-9900 controller, but the
> firmware
> still properly handles the SYNC_CACHE command.
> 
> > caches lie when asked to flush the cache, but we probably need to get
> > users into the habit of not using flush barriers with external
> Arrays.
> >
> > > http://markmail.org/message/ewicheafcvgwm4p7
> > >
> > > I wrote this patch while having trouble with Infortrend Raids, but
> it
> > > also comes up with DDN storage if the write back cache is enabled.
> > > Shall I update the patch, add retries and then resend the entire
> series?
> >
> > rq->timeout is the timeout of the request triggering the flush ...
> it's
> > likely the wrong value since it's for a fast completing r/w
> operation,
> > whereas this is a slow drain operation.
> 
> Hmm, in the past we had scsi_device->timeout, but I thought this was
> given up
> in favour of scsi_device->request_queue->rq_timeout? (somehwere around
> 2.6.27?)
> 
> 
> Thanks,
> Bernd
> 
> 
> --
> Bernd Schubert
> DataDirect Networks

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!
  2010-04-20  5:05         ` Desai, Kashyap
@ 2010-04-20 14:54           ` Bernd Schubert
  2010-04-20 19:32             ` Mike Christie
  0 siblings, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2010-04-20 14:54 UTC (permalink / raw)
  To: Desai, Kashyap; +Cc: James Bottomley, Mike Christie, linux-scsi

On Tuesday 20 April 2010, Desai, Kashyap wrote:
> > > > > all block device flushes or is this more device specific? I was
> > > > > thinking that we may want to create a sysfs interface under the
> >
> > block
> >
> > > > > dirs and have blk-sysfs.c and blk-barrier.c handle this.
> >
> > queue_flush
> >
> > > > > could set the timeout and retries that is set by some new files
> >
> > under
> >
> > > > > /sys/block/sdX/queue/ ?
> 
> Thanks a lot for your comments.
> 
> This is very close to my understanding. I feel this is more close to block
>  layer and I am almost agreeing with your thought. I tried to understand
>  why upstream does not have retries at queue_flush()/sd_prepare_flush() ???
>  It looks like there is not specific reason. If I am wrong can someone
>  explain is there any specific reason not to set rq->retries in
>  sd_prepare_flush?

I don't understand why we should need another queue timeout, when we 
already have a device timeout that is used as queue timeout?
Below is an updated patch. And I think as of commit 
242f9dcb8ba6f68fcd217a119a7648a4f69290e9 ("block: unify request 
timeout handling") using struct request_queue->rq_timeout is correct.

What I really wonder about, if we shouldn't add a scsi sysfs entry 
to set a default timeout? In some environments (e.g. ib_srp with a 
large number of LUNs through an IB switch) scanning the those devices 
can be slow and besides that many of those device scan functions 
use SD_TIMEOUT, we also have not set timeouts using udev already, 
as udev will only be called once the scan is over...



[SCSI] Use user defined timeouts for SYNCHRONIZE_CACHE

Switch to adjustable timeout value for scsi SYNCHRONIZE_CACHE commands.
We have been running into problems with fixed values on troublesome devices.

Update: add SD_MAX_RETRIES in sd_prepare_flush()

Patch originally written for Q-Leap, but DDN needs it as well.


Signed-off-by: Bernd Schubert <bschubert@ddn.com>


---
 drivers/scsi/sd.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux-git/drivers/scsi/sd.c
===================================================================
--- linux-git.orig/drivers/scsi/sd.c
+++ linux-git/drivers/scsi/sd.c
@@ -1020,7 +1020,8 @@ static int sd_sync_cache(struct scsi_dis
 		 * flush everything.
 		 */
 		res = scsi_execute_req(sdp, cmd, DMA_NONE, NULL, 0, &sshdr,
-				       SD_TIMEOUT, SD_MAX_RETRIES, NULL);
+				       sdp->request_queue->rq_timeout,
+				       SD_MAX_RETRIES, NULL);
 		if (res == 0)
 			break;
 	}
@@ -1039,7 +1040,8 @@ static int sd_sync_cache(struct scsi_dis
 static void sd_prepare_flush(struct request_queue *q, struct request *rq)
 {
 	rq->cmd_type = REQ_TYPE_BLOCK_PC;
-	rq->timeout = SD_TIMEOUT;
+	rq->timeout = q->rq_timeout;
+	rq->retries = SD_MAX_RETRIES;
 	rq->cmd[0] = SYNCHRONIZE_CACHE;
 	rq->cmd_len = 10;
 }


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!
  2010-04-20 14:54           ` [PATCH] " Bernd Schubert
@ 2010-04-20 19:32             ` Mike Christie
  2010-04-20 20:38               ` Bernd Schubert
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Christie @ 2010-04-20 19:32 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Desai, Kashyap, James Bottomley, linux-scsi

On 04/20/2010 09:54 AM, Bernd Schubert wrote:
> On Tuesday 20 April 2010, Desai, Kashyap wrote:
>>>>>> all block device flushes or is this more device specific? I was
>>>>>> thinking that we may want to create a sysfs interface under the
>>>
>>> block
>>>
>>>>>> dirs and have blk-sysfs.c and blk-barrier.c handle this.
>>>
>>> queue_flush
>>>
>>>>>> could set the timeout and retries that is set by some new files
>>>
>>> under
>>>
>>>>>> /sys/block/sdX/queue/ ?
>>
>> Thanks a lot for your comments.
>>
>> This is very close to my understanding. I feel this is more close to block
>>   layer and I am almost agreeing with your thought. I tried to understand
>>   why upstream does not have retries at queue_flush()/sd_prepare_flush() ???
>>   It looks like there is not specific reason. If I am wrong can someone
>>   explain is there any specific reason not to set rq->retries in
>>   sd_prepare_flush?
>
> I don't understand why we should need another queue timeout, when we
> already have a device timeout that is used as queue timeout?


I think a possible problem is that the one timeout setting is now being 
used for two different operations. Currently, for disks that timeout is 
used for READ/WRITEs. For the sync cache, if we are hitting a problem 
with it taking longer than the current value of 30 secs, then could we 
want to set the READ/WRITE timeout to a lower value like 60 secs, but 
then the sync cache timeout might have to be multiple minutes.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!
  2010-04-20 19:32             ` Mike Christie
@ 2010-04-20 20:38               ` Bernd Schubert
  2010-04-22 16:23                 ` James Bottomley
  0 siblings, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2010-04-20 20:38 UTC (permalink / raw)
  To: Mike Christie; +Cc: Desai, Kashyap, James Bottomley, linux-scsi

On Tuesday 20 April 2010, Mike Christie wrote:
> On 04/20/2010 09:54 AM, Bernd Schubert wrote:
> > On Tuesday 20 April 2010, Desai, Kashyap wrote:
> >>>>>> all block device flushes or is this more device specific? I was
> >>>>>> thinking that we may want to create a sysfs interface under the
> >>>
> >>> block
> >>>
> >>>>>> dirs and have blk-sysfs.c and blk-barrier.c handle this.
> >>>
> >>> queue_flush
> >>>
> >>>>>> could set the timeout and retries that is set by some new files
> >>>
> >>> under
> >>>
> >>>>>> /sys/block/sdX/queue/ ?
> >>
> >> Thanks a lot for your comments.
> >>
> >> This is very close to my understanding. I feel this is more close to
> >> block layer and I am almost agreeing with your thought. I tried to
> >> understand why upstream does not have retries at
> >> queue_flush()/sd_prepare_flush() ??? It looks like there is not specific
> >> reason. If I am wrong can someone explain is there any specific reason
> >> not to set rq->retries in sd_prepare_flush?
> >
> > I don't understand why we should need another queue timeout, when we
> > already have a device timeout that is used as queue timeout?
> 
> I think a possible problem is that the one timeout setting is now being
> used for two different operations. Currently, for disks that timeout is
> used for READ/WRITEs. For the sync cache, if we are hitting a problem
> with it taking longer than the current value of 30 secs, then could we
> want to set the READ/WRITE timeout to a lower value like 60 secs, but
> then the sync cache timeout might have to be multiple minutes.

Hmm, I don't know storage systems where it might take several minutes, but 
sure, if a system has hundreds of gigabyte of cache with only a few disks, a 
very long timeout might required. Though, I'm not sure if such device 
shouldn't ignore the SYNC_CACHE command at all.
DDN storage recommends to set the host timeout one seconds less than the 
command timeout on the controller (70 to 90 seconds).

I hope I will find some time on Thursday or Friday to improve the patch.

Thanks,
Bernd


-- 
Bernd Schubert
DataDirect Networks

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!
  2010-04-20 20:38               ` Bernd Schubert
@ 2010-04-22 16:23                 ` James Bottomley
  0 siblings, 0 replies; 11+ messages in thread
From: James Bottomley @ 2010-04-22 16:23 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Mike Christie, Desai, Kashyap, linux-scsi

On Tue, 2010-04-20 at 22:38 +0200, Bernd Schubert wrote:
> On Tuesday 20 April 2010, Mike Christie wrote:
> > On 04/20/2010 09:54 AM, Bernd Schubert wrote:
> > > On Tuesday 20 April 2010, Desai, Kashyap wrote:
> > >>>>>> all block device flushes or is this more device specific? I was
> > >>>>>> thinking that we may want to create a sysfs interface under the
> > >>>
> > >>> block
> > >>>
> > >>>>>> dirs and have blk-sysfs.c and blk-barrier.c handle this.
> > >>>
> > >>> queue_flush
> > >>>
> > >>>>>> could set the timeout and retries that is set by some new files
> > >>>
> > >>> under
> > >>>
> > >>>>>> /sys/block/sdX/queue/ ?
> > >>
> > >> Thanks a lot for your comments.
> > >>
> > >> This is very close to my understanding. I feel this is more close to
> > >> block layer and I am almost agreeing with your thought. I tried to
> > >> understand why upstream does not have retries at
> > >> queue_flush()/sd_prepare_flush() ??? It looks like there is not specific
> > >> reason. If I am wrong can someone explain is there any specific reason
> > >> not to set rq->retries in sd_prepare_flush?
> > >
> > > I don't understand why we should need another queue timeout, when we
> > > already have a device timeout that is used as queue timeout?
> > 
> > I think a possible problem is that the one timeout setting is now being
> > used for two different operations. Currently, for disks that timeout is
> > used for READ/WRITEs. For the sync cache, if we are hitting a problem
> > with it taking longer than the current value of 30 secs, then could we
> > want to set the READ/WRITE timeout to a lower value like 60 secs, but
> > then the sync cache timeout might have to be multiple minutes.
> 
> Hmm, I don't know storage systems where it might take several minutes, but 
> sure, if a system has hundreds of gigabyte of cache with only a few disks, a 
> very long timeout might required. Though, I'm not sure if such device 
> shouldn't ignore the SYNC_CACHE command at all.
> DDN storage recommends to set the host timeout one seconds less than the 
> command timeout on the controller (70 to 90 seconds).
> 
> I hope I will find some time on Thursday or Friday to improve the patch.

Just to bring closure to this, although I personally don't think
increasing the timeout for flush to be a good idea, most other people
do, so an additional flush timeout somewhere in block controllable via
sysfs, that we can take the timeout from would likely be the best way to
go.

Thanks,

James



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!
  2010-04-19 18:45     ` James Bottomley
  2010-04-19 19:17       ` Bernd Schubert
@ 2010-04-29 20:13       ` Ric Wheeler
  1 sibling, 0 replies; 11+ messages in thread
From: Ric Wheeler @ 2010-04-29 20:13 UTC (permalink / raw)
  To: James Bottomley; +Cc: Bernd Schubert, Mike Christie, Desai, Kashyap, linux-scsi

On 04/19/2010 02:45 PM, James Bottomley wrote:
> On Mon, 2010-04-19 at 20:14 +0200, Bernd Schubert wrote:
>    
>> On Monday 19 April 2010, Mike Christie wrote:
>>      
>>> On 04/19/2010 06:32 AM, Desai, Kashyap wrote:
>>>        
>>>> I am facing one issue with scsi stack.
>>>> Here is a background of my test.
>>>>
>>>> Mount ext3 file system with journaling support with barrier=1, commit=5
>>>> Now, with this setup file system will do submit_bh with  WRITE_BARRIER
>>>> flag set for interval of 5 seconds. (This is a part of journaling.)
>>>> Eventually it will call queue_flush() which will generate SCSI command of
>>>> CDB: SYNCHRONIZE_CAHCE and insert it into the request queue. I observed
>>>> that creation of SYNCHRONIZE_CACHE is a part of sd_prepare_flush(). Here
>>>> we have timeout set to SD_TIMEOUT but retries are not set. Because of
>>>> retries of the request is not set, there is no retries allowed for
>>>> SYNCHRONIZE_CACHE at mid layer.
>>>>
>>>> Because of zero retries for SYNCHRONIZE_CACHE command at mid-layer, it is
>>>> creating trouble for file system. In current situation, Even though LLD
>>>> send back commands with DID_RESET, SYNCHRONIZE_CACHE will fail
>>>> immediately without going for any retries, when HBA is in recovery state.
>>>> Eventually this information goes to File system and it sees
>>>> SYNCHRONIZE_CAHCE is failed and file system goes to Read only mode.
>>>>
>>>> My question is "Can we add in sd_prepare_flush(), rq->retries = X" some
>>>> reasonable retries value ?
>>>>          
>>> I am not sure where we want it, but I think we want to be able to set
>>> both the retries and timeout. I have seen where a sync cache can take
>>> longer than the default 30 secs.
>>>
>>> Do you think we want to the block layer to manage retries/timeouts for
>>> all block device flushes or is this more device specific? I was thinking
>>> that we may want to create a sysfs interface under the block dirs and
>>> have blk-sysfs.c and blk-barrier.c handle this. queue_flush could set
>>> the timeout and retries that is set by some new files under
>>> /sys/block/sdX/queue/ ?
>>>        
>>
>> Good that now also other people run into it. 30s is far too small for any
>> hardware raid unit with SATA disks.
>>      
> It's far too short for just about any HW RAID since they all tend to
> have multi-megabytes to gigabytes of cache (some of the high end have
> terrabytes).  It has to be said that most arrays with battery backed
> caches lie when asked to flush the cache, but we probably need to get
> users into the habit of not using flush barriers with external Arrays.
>
>    

As you say, a lot of arrays don't do anything with these requests. We 
should be using barriers (and sending down the synch commands) when the 
device advertises a write back cache.

For a simple hardware RAID card, it is not sufficient simply to flush 
the write cache on the HBA. It must also either disable the write cache 
on the downstream devices (S-ATA/SAS disks, etc) or send cache flush 
commands to each device.

In those cases, this could be a quite lengthy process...

>> http://markmail.org/message/ewicheafcvgwm4p7
>>
>> I wrote this patch while having trouble with Infortrend Raids, but it also
>> comes up with DDN storage if the write back cache is enabled.
>> Shall I update the patch, add retries and then resend the entire series?
>>      
> rq->timeout is the timeout of the request triggering the flush ... it's
> likely the wrong value since it's for a fast completing r/w operation,
> whereas this is a slow drain operation.
>
> James
>
>    


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2010-04-30 17:21 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-19 11:32 SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.! Desai, Kashyap
2010-04-19 16:20 ` Mike Christie
2010-04-19 18:14   ` Bernd Schubert
2010-04-19 18:45     ` James Bottomley
2010-04-19 19:17       ` Bernd Schubert
2010-04-20  5:05         ` Desai, Kashyap
2010-04-20 14:54           ` [PATCH] " Bernd Schubert
2010-04-20 19:32             ` Mike Christie
2010-04-20 20:38               ` Bernd Schubert
2010-04-22 16:23                 ` James Bottomley
2010-04-29 20:13       ` Ric Wheeler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.