* [PATCH] blk: fix a wrong accounting of hd_struct->in_flight
@ 2010-10-12  6:38 Yasuaki Ishimatsu
  2010-10-12  8:19 ` Jens Axboe
  2010-10-14  6:07 ` [PATCH] " KOSAKI Motohiro
  0 siblings, 2 replies; 16+ messages in thread
From: Yasuaki Ishimatsu @ 2010-10-12  6:38 UTC (permalink / raw)
  To: axboe, linux-kernel

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

The cause is incorrect accounting of hd_struct->in_flight when a bio is merged
by ELEVATOR_FRONT_MERGE into a request that belongs to a different partition.

The detailed root cause is as follows.

Assume that there are two partitions, sda1 and sda2.

1. A request for sda2 is in the request_queue. Hence sda1's
   hd_struct->in_flight is 0 and sda2's is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belonging to sda1 is issued and is merged into the request mentioned
   in step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request changes
   from the sda2 region to the sda1 region. However, neither partition's
   hd_struct->in_flight is updated.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request finishes and blk_account_io_done() is called. It looks up the
   partition from the request's first sector, so sda1's hd_struct->in_flight,
   not sda2's, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem.
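
For illustration, the three steps can be replayed in a small user-space model.
This is only a sketch, not kernel code: the sector ranges, struct part,
map_sector() (a stand-in for disk_map_sector_rcu()), replay_bug() and as_u32()
are all hypothetical. It also shows that 4294967064 is what -232 looks like as
an unsigned 32-bit value, which would be consistent with sda2 showing 232.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy partition: a sector range plus a signed in-flight counter. */
struct part {
	const char *name;
	unsigned long start;	/* first sector */
	unsigned long end;	/* one past the last sector */
	int in_flight;
};

/* Assumed layout: sda1 = sectors 0..99, sda2 = sectors 100..199. */
static struct part parts[2] = {
	{ "sda1",   0, 100, 0 },
	{ "sda2", 100, 200, 0 },
};

/* Hypothetical stand-in for disk_map_sector_rcu(). */
static struct part *map_sector(unsigned long sector)
{
	for (size_t i = 0; i < 2; i++)
		if (sector >= parts[i].start && sector < parts[i].end)
			return &parts[i];
	return NULL;
}

/* Replays steps 1-3 and returns sda1's final in_flight value. */
static int replay_bug(void)
{
	unsigned long rq_sector = 100;	/* request starts in sda2 */

	/* Step 1: the new request is accounted to sda2. */
	map_sector(rq_sector)->in_flight++;

	/* Step 2: a front merge moves the start sector into sda1;
	 * no counter is adjusted. */
	rq_sector -= 8;

	/* Step 3: completion looks up by the current start sector. */
	map_sector(rq_sector)->in_flight--;

	return parts[0].in_flight;
}

/* A negative counter reinterpreted as an unsigned 32-bit value. */
static uint32_t as_u32(int32_t v)
{
	return (uint32_t)v;
}
```

In this model, replay_bug() leaves sda1 at -1 and sda2 at 1, the same end state
as the table in step 3, and as_u32(-232) yields 4294967064.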

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 block/blk-core.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

Index: linux-2.6.36-rc7/block/blk-core.c
===================================================================
--- linux-2.6.36-rc7.orig/block/blk-core.c	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/block/blk-core.c	2010-10-09 05:53:51.000000000 +0900
@@ -1202,6 +1202,8 @@ static int __make_request(struct request
 	const bool unplug = !!(bio->bi_rw & REQ_UNPLUG);
 	const unsigned long ff = bio->bi_rw & REQ_FAILFAST_MASK;
 	int rw_flags;
+	struct hd_struct *src_part;
+	struct hd_struct *dst_part;

 	if ((bio->bi_rw & REQ_HARDBARRIER) &&
 	    (q->next_ordered == QUEUE_ORDERED_NONE)) {
@@ -1268,7 +1270,17 @@ static int __make_request(struct request
 		 * not touch req->buffer either...
 		 */
 		req->buffer = bio_data(bio);
+		src_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
 		req->__sector = bio->bi_sector;
+		dst_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		if (unlikely(src_part != dst_part)) {
+			int rw = rq_data_dir(req);
+
+			part_stat_lock();
+			part_dec_in_flight(src_part, rw);
+			part_inc_in_flight(dst_part, rw);
+			part_stat_unlock();
+		}
 		req->__data_len += bytes;
 		req->ioprio = ioprio_best(req->ioprio, prio);
 		if (!blk_rq_cpu_valid(req))



* Re: [PATCH] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-12  6:38 [PATCH] blk: fix a wrong accounting of hd_struct->in_flight Yasuaki Ishimatsu
@ 2010-10-12  8:19 ` Jens Axboe
  2010-10-14 12:48   ` [PATCH v2] " Yasuaki Ishimatsu
  2010-10-14  6:07 ` [PATCH] " KOSAKI Motohiro
  1 sibling, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2010-10-12  8:19 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: linux-kernel

On 2010-10-12 08:38, Yasuaki Ishimatsu wrote:
> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> /proc/diskstats would display a strange output as follows.
> 
> $ cat /proc/diskstats |grep sda
>    8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
>    8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
>                                                 ~~~~~~~~~~
>    8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
>    8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
>    8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
>    8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
> 
> The cause is incorrect accounting of hd_struct->in_flight when a bio is merged
> by ELEVATOR_FRONT_MERGE into a request that belongs to a different partition.
> 
> The detailed root cause is as follows.
> 
> Assume that there are two partitions, sda1 and sda2.
> 
> 1. A request for sda2 is in the request_queue. Hence sda1's
>    hd_struct->in_flight is 0 and sda2's is 1.
> 
>         | hd_struct->in_flight
>    ---------------------------
>    sda1 |          0
>    sda2 |          1
>    ---------------------------
> 
> 2. A bio belonging to sda1 is issued and is merged into the request mentioned
>    in step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request changes
>    from the sda2 region to the sda1 region. However, neither partition's
>    hd_struct->in_flight is updated.
> 
>         | hd_struct->in_flight
>    ---------------------------
>    sda1 |          0
>    sda2 |          1
>    ---------------------------
> 
> 3. The request finishes and blk_account_io_done() is called. It looks up the
>    partition from the request's first sector, so sda1's hd_struct->in_flight,
>    not sda2's, is decremented.
> 
>         | hd_struct->in_flight
>    ---------------------------
>    sda1 |         -1
>    sda2 |          1
>    ---------------------------
> 
> The patch fixes the problem.
> 
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> ---
>  block/blk-core.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> Index: linux-2.6.36-rc7/block/blk-core.c
> ===================================================================
> --- linux-2.6.36-rc7.orig/block/blk-core.c	2010-10-07 05:39:52.000000000 +0900
> +++ linux-2.6.36-rc7/block/blk-core.c	2010-10-09 05:53:51.000000000 +0900
> @@ -1202,6 +1202,8 @@ static int __make_request(struct request
>  	const bool unplug = !!(bio->bi_rw & REQ_UNPLUG);
>  	const unsigned long ff = bio->bi_rw & REQ_FAILFAST_MASK;
>  	int rw_flags;
> +	struct hd_struct *src_part;
> +	struct hd_struct *dst_part;
> 
>  	if ((bio->bi_rw & REQ_HARDBARRIER) &&
>  	    (q->next_ordered == QUEUE_ORDERED_NONE)) {
> @@ -1268,7 +1270,17 @@ static int __make_request(struct request
>  		 * not touch req->buffer either...
>  		 */
>  		req->buffer = bio_data(bio);
> +		src_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
>  		req->__sector = bio->bi_sector;
> +		dst_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		if (unlikely(src_part != dst_part)) {
> +			int rw = rq_data_dir(req);
> +
> +			part_stat_lock();
> +			part_dec_in_flight(src_part, rw);
> +			part_inc_in_flight(dst_part, rw);
> +			part_stat_unlock();
> +		}
>  		req->__data_len += bytes;
>  		req->ioprio = ioprio_best(req->ioprio, prio);
>  		if (!blk_rq_cpu_valid(req))

Ugh, this is nasty, and two extra part lookups per IO is definitely
going to hurt.

I would suggest fixing this differently - cache the part lookup in the
request. That has the advantage of being faster for the normal case as
well since it'll reduce the number of lookups, plus we can get rid of
some of the blk_do_io_stat() checks, or at least reduce them to checking
for a valid rq->part.
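
A rough user-space sketch of the caching idea, under stated assumptions: the
partition is looked up once at submission and stored in a hypothetical part
field on the request, and every later accounting step uses that pointer instead
of mapping the sector again. All names here (struct part, map_sector(),
account_start(), front_merge(), account_done(), cached_part_demo()) are
illustrative and are not the kernel's API.

```c
#include <stddef.h>

struct part {
	const char *name;
	unsigned long start;	/* first sector */
	unsigned long end;	/* one past the last sector */
	int in_flight;
};

/* Assumed layout: sda1 = sectors 0..99, sda2 = sectors 100..199. */
static struct part parts[2] = {
	{ "sda1",   0, 100, 0 },
	{ "sda2", 100, 200, 0 },
};

/* Hypothetical stand-in for disk_map_sector_rcu(). */
static struct part *map_sector(unsigned long sector)
{
	for (size_t i = 0; i < 2; i++)
		if (sector >= parts[i].start && sector < parts[i].end)
			return &parts[i];
	return NULL;
}

struct request {
	unsigned long sector;
	struct part *part;	/* cached at submission time */
};

static void account_start(struct request *rq, unsigned long sector)
{
	rq->sector = sector;
	rq->part = map_sector(sector);	/* the only lookup */
	if (rq->part)
		rq->part->in_flight++;
}

/* A front merge moves the start sector but leaves rq->part alone. */
static void front_merge(struct request *rq, unsigned long nr_sectors)
{
	rq->sector -= nr_sectors;
}

static void account_done(struct request *rq)
{
	if (rq->part)
		rq->part->in_flight--;
}

/* Returns 1 if both counters end at zero after a cross-partition merge. */
static int cached_part_demo(void)
{
	struct request rq;

	account_start(&rq, 100);	/* accounted to sda2 */
	front_merge(&rq, 8);		/* start sector now in sda1 */
	account_done(&rq);		/* still decrements sda2 */

	return parts[0].in_flight == 0 && parts[1].in_flight == 0;
}
```

Because the increment and the decrement go through the same pointer, the
counters stay balanced. The trade-off in this sketch is that a merged request
stays attributed to the partition it was first submitted to.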

-- 
Jens Axboe



* Re: [PATCH] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-12  6:38 [PATCH] blk: fix a wrong accounting of hd_struct->in_flight Yasuaki Ishimatsu
  2010-10-12  8:19 ` Jens Axboe
@ 2010-10-14  6:07 ` KOSAKI Motohiro
  2010-10-14 12:44   ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: KOSAKI Motohiro @ 2010-10-14  6:07 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: kosaki.motohiro, axboe, linux-kernel

Hello,

> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> /proc/diskstats would display a strange output as follows.
> 
> $ cat /proc/diskstats |grep sda
>    8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
>    8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
>                                                 ~~~~~~~~~~
>    8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
>    8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
>    8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
>    8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Hm, this is very nasty and crap.



> @@ -1268,7 +1270,17 @@ static int __make_request(struct request
>  		 * not touch req->buffer either...
>  		 */
>  		req->buffer = bio_data(bio);
> +		src_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
>  		req->__sector = bio->bi_sector;
> +		dst_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));

I think this is wrong. disk_map_sector_rcu() requires the RCU read lock
to be held (see the function comment). All other call sites take
part_stat_lock() before calling disk_map_sector_rcu().





* Re: [PATCH] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-14  6:07 ` [PATCH] " KOSAKI Motohiro
@ 2010-10-14 12:44   ` Jens Axboe
  2010-10-14 23:30     ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2010-10-14 12:44 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Yasuaki Ishimatsu, linux-kernel

On 2010-10-14 08:07, KOSAKI Motohiro wrote:
>> @@ -1268,7 +1270,17 @@ static int __make_request(struct request
>>  		 * not touch req->buffer either...
>>  		 */
>>  		req->buffer = bio_data(bio);
>> +		src_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
>>  		req->__sector = bio->bi_sector;
>> +		dst_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> 
> I think this is wrong. disk_map_sector_rcu() requires the RCU read lock
> to be held (see the function comment). All other call sites take
> part_stat_lock() before calling disk_map_sector_rcu().

It's called under the queue lock with irqs disabled, which implies a
rcu critical section.

-- 
Jens Axboe



* [PATCH v2] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-12  8:19 ` Jens Axboe
@ 2010-10-14 12:48   ` Yasuaki Ishimatsu
  2010-10-14 12:55     ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Yasuaki Ishimatsu @ 2010-10-14 12:48 UTC (permalink / raw)
  To: Jens Axboe, kosaki.motohiro; +Cc: linux-kernel

Hi, Jens, Kosaki,

Thank you for your comments.
I have updated the patch. What do you think?

Thanks,
Yasuaki Ishimatsu
===

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

The cause is incorrect accounting of hd_struct->in_flight when a bio is merged
by ELEVATOR_FRONT_MERGE into a request that belongs to a different partition.

The detailed root cause is as follows.

Assume that there are two partitions, sda1 and sda2.

1. A request for sda2 is in the request_queue. Hence sda1's
   hd_struct->in_flight is 0 and sda2's is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belonging to sda1 is issued and is merged into the request mentioned
   in step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request changes
   from the sda2 region to the sda1 region. However, neither partition's
   hd_struct->in_flight is updated.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request finishes and blk_account_io_done() is called. It looks up the
   partition from the request's first sector, so sda1's hd_struct->in_flight,
   not sda2's, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 block/blk-core.c       |   11 +++++++++--
 include/linux/blkdev.h |    1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

Index: linux-2.6.36-rc7/include/linux/blkdev.h
===================================================================
--- linux-2.6.36-rc7.orig/include/linux/blkdev.h	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/include/linux/blkdev.h	2010-10-14 17:37:33.000000000 +0900
@@ -115,6 +115,7 @@ struct request {
 	void *elevator_private3;

 	struct gendisk *rq_disk;
+	struct hd_struct *part;
 	unsigned long start_time;
 #ifdef CONFIG_BLK_CGROUP
 	unsigned long long start_time_ns;
Index: linux-2.6.36-rc7/block/blk-core.c
===================================================================
--- linux-2.6.36-rc7.orig/block/blk-core.c	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/block/blk-core.c	2010-10-14 17:25:43.000000000 +0900
@@ -66,9 +66,15 @@ static void drive_stat_acct(struct reque
 	cpu = part_stat_lock();
 	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));

-	if (!new_io)
+	if (!new_io) {
+		if (unlikely(rq->part != part)) {
+			part_dec_in_flight(rq->part, rw);
+			part_inc_in_flight(part, rw);
+			rq->part = part;
+		}
 		part_stat_inc(cpu, part, merges[rw]);
-	else {
+	} else {
+		rq->part = part;
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
 	}
@@ -128,6 +134,7 @@ void blk_rq_init(struct request_queue *q
 	rq->ref_count = 1;
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
+	rq->part = NULL;
 }
 EXPORT_SYMBOL(blk_rq_init);




* Re: [PATCH v2] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-14 12:48   ` [PATCH v2] " Yasuaki Ishimatsu
@ 2010-10-14 12:55     ` Jens Axboe
  2010-10-15  8:39       ` [PATCH v3] " Yasuaki Ishimatsu
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2010-10-14 12:55 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: kosaki.motohiro, linux-kernel

On 2010-10-14 14:48, Yasuaki Ishimatsu wrote:
> Index: linux-2.6.36-rc7/block/blk-core.c
> ===================================================================
> --- linux-2.6.36-rc7.orig/block/blk-core.c	2010-10-07 05:39:52.000000000 +0900
> +++ linux-2.6.36-rc7/block/blk-core.c	2010-10-14 17:25:43.000000000 +0900
> @@ -66,9 +66,15 @@ static void drive_stat_acct(struct reque
>  	cpu = part_stat_lock();
>  	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> 
> -	if (!new_io)
> +	if (!new_io) {
> +		if (unlikely(rq->part != part)) {
> +			part_dec_in_flight(rq->part, rw);
> +			part_inc_in_flight(part, rw);
> +			rq->part = part;
> +		}
>  		part_stat_inc(cpu, part, merges[rw]);
> -	else {
> +	} else {
> +		rq->part = part;
>  		part_round_stats(cpu, part);
>  		part_inc_in_flight(part, rw);
>  	}

I was thinking that we'd always skip the lookup if ->part was
already set. It will probably require quiescing IO on partition
table reload, though.

-- 
Jens Axboe



* Re: [PATCH] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-14 12:44   ` Jens Axboe
@ 2010-10-14 23:30     ` Paul E. McKenney
  2010-10-15  7:30       ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2010-10-14 23:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: KOSAKI Motohiro, Yasuaki Ishimatsu, linux-kernel

On Thu, Oct 14, 2010 at 02:44:32PM +0200, Jens Axboe wrote:
> On 2010-10-14 08:07, KOSAKI Motohiro wrote:
> >> @@ -1268,7 +1270,17 @@ static int __make_request(struct request
> >>  		 * not touch req->buffer either...
> >>  		 */
> >>  		req->buffer = bio_data(bio);
> >> +		src_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> >>  		req->__sector = bio->bi_sector;
> >> +		dst_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> > 
> > I think this is wrong. disk_map_sector_rcu() requires the RCU read lock
> > to be held (see the function comment). All other call sites take
> > part_stat_lock() before calling disk_map_sector_rcu().
> 
> It's called under the queue lock with irqs disabled, which implies a
> rcu critical section.

Having irqs disabled does imply an rcu_read_lock_sched() or an
rcu_read_lock_bh(), but not an rcu_read_lock(), especially if
CONFIG_PREEMPT_RCU.

So an explicit rcu_read_lock() does seem to be needed here.

							Thanx, Paul


* Re: [PATCH] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-14 23:30     ` Paul E. McKenney
@ 2010-10-15  7:30       ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2010-10-15  7:30 UTC (permalink / raw)
  To: paulmck; +Cc: KOSAKI Motohiro, Yasuaki Ishimatsu, linux-kernel

On 2010-10-15 01:30, Paul E. McKenney wrote:
> On Thu, Oct 14, 2010 at 02:44:32PM +0200, Jens Axboe wrote:
>> On 2010-10-14 08:07, KOSAKI Motohiro wrote:
>>>> @@ -1268,7 +1270,17 @@ static int __make_request(struct request
>>>>  		 * not touch req->buffer either...
>>>>  		 */
>>>>  		req->buffer = bio_data(bio);
>>>> +		src_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
>>>>  		req->__sector = bio->bi_sector;
>>>> +		dst_part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
>>>
>>> I think this is wrong. disk_map_sector_rcu() requires the RCU read lock
>>> to be held (see the function comment). All other call sites take
>>> part_stat_lock() before calling disk_map_sector_rcu().
>>
>> It's called under the queue lock with irqs disabled, which implies a
>> rcu critical section.
> 
> Having irqs disabled does imply an rcu_read_lock_sched() or an
> rcu_read_lock_bh(), but not an rcu_read_lock(), especially if
> CONFIG_PREEMPT_RCU.
> 
> So an explicit rcu_read_lock() does seem to be needed here.

Thanks Paul, I stand corrected. The final patch will be vastly
different, but it's surely worth keeping in mind.


-- 
Jens Axboe



* [PATCH v3] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-14 12:55     ` Jens Axboe
@ 2010-10-15  8:39       ` Yasuaki Ishimatsu
  2010-10-15 10:04         ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Yasuaki Ishimatsu @ 2010-10-15  8:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: kosaki.motohiro, linux-kernel

Hi Jens,

Jens Axboe wrote:
> On 2010-10-14 14:48, Yasuaki Ishimatsu wrote:
>> Index: linux-2.6.36-rc7/block/blk-core.c
>> ===================================================================
>> --- linux-2.6.36-rc7.orig/block/blk-core.c	2010-10-07 05:39:52.000000000 +0900
>> +++ linux-2.6.36-rc7/block/blk-core.c	2010-10-14 17:25:43.000000000 +0900
>> @@ -66,9 +66,15 @@ static void drive_stat_acct(struct reque
>>  	cpu = part_stat_lock();
>>  	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>>
>> -	if (!new_io)
>> +	if (!new_io) {
>> +		if (unlikely(rq->part != part)) {
>> +			part_dec_in_flight(rq->part, rw);
>> +			part_inc_in_flight(part, rw);
>> +			rq->part = part;
>> +		}
>>  		part_stat_inc(cpu, part, merges[rw]);
>> -	else {
>> +	} else {
>> +		rq->part = part;
>>  		part_round_stats(cpu, part);
>>  		part_inc_in_flight(part, rw);
>>  	}
> 
> I was thinking that we'd always skip the lookup if ->part was
> already set. It will probably require quiescing IO on partition
> table reload, though.

O.K.
I removed the extra part lookups. The following patch also fixes the wrong
accounting of hd_struct->in_flight. But I could not work out how to stop IO
while the partition table is reloaded. Do you have any ideas?

Thanks,
Yasuaki Ishimatsu
===

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

The cause is incorrect accounting of hd_struct->in_flight when a bio is merged
by ELEVATOR_FRONT_MERGE into a request that belongs to a different partition.

The detailed root cause is as follows.

Assume that there are two partitions, sda1 and sda2.

1. A request for sda2 is in the request_queue. Hence sda1's
   hd_struct->in_flight is 0 and sda2's is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belonging to sda1 is issued and is merged into the request mentioned
   in step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request changes
   from the sda2 region to the sda1 region. However, neither partition's
   hd_struct->in_flight is updated.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request finishes and blk_account_io_done() is called. It looks up the
   partition from the request's first sector, so sda1's hd_struct->in_flight,
   not sda2's, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 block/blk-core.c       |   13 ++++++++-----
 block/blk-merge.c      |    2 +-
 include/linux/blkdev.h |    1 +
 3 files changed, 10 insertions(+), 6 deletions(-)

Index: linux-2.6.36-rc7/block/blk-core.c
===================================================================
--- linux-2.6.36-rc7.orig/block/blk-core.c	2010-10-15 09:21:37.000000000 +0900
+++ linux-2.6.36-rc7/block/blk-core.c	2010-10-15 09:44:23.000000000 +0900
@@ -64,13 +64,15 @@ static void drive_stat_acct(struct reque
 		return;

 	cpu = part_stat_lock();
-	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));

-	if (!new_io)
+	if (!new_io) {
+		part = rq->part;
 		part_stat_inc(cpu, part, merges[rw]);
-	else {
+	} else {
+		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
+		rq->part = part;
 	}

 	part_stat_unlock();
@@ -128,6 +130,7 @@ void blk_rq_init(struct request_queue *q
 	rq->ref_count = 1;
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
+	rq->part = NULL;
 }
 EXPORT_SYMBOL(blk_rq_init);

@@ -1759,7 +1762,7 @@ static void blk_account_io_completion(st
 		int cpu;

 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
 		part_stat_unlock();
 	}
@@ -1779,7 +1782,7 @@ static void blk_account_io_done(struct r
 		int cpu;

 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;

 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
Index: linux-2.6.36-rc7/block/blk-merge.c
===================================================================
--- linux-2.6.36-rc7.orig/block/blk-merge.c	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/block/blk-merge.c	2010-10-15 09:38:45.000000000 +0900
@@ -343,7 +343,7 @@ static void blk_account_io_merge(struct
 		int cpu;

 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;

 		part_round_stats(cpu, part);
 		part_dec_in_flight(part, rq_data_dir(req));
Index: linux-2.6.36-rc7/include/linux/blkdev.h
===================================================================
--- linux-2.6.36-rc7.orig/include/linux/blkdev.h	2010-10-15 09:21:37.000000000 +0900
+++ linux-2.6.36-rc7/include/linux/blkdev.h	2010-10-15 09:26:22.000000000 +0900
@@ -115,6 +115,7 @@ struct request {
 	void *elevator_private3;

 	struct gendisk *rq_disk;
+	struct hd_struct *part;
 	unsigned long start_time;
 #ifdef CONFIG_BLK_CGROUP
 	unsigned long long start_time_ns;




* Re: [PATCH v3] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-15  8:39       ` [PATCH v3] " Yasuaki Ishimatsu
@ 2010-10-15 10:04         ` Jens Axboe
  2010-10-18  8:28           ` [PATCH v4] " Yasuaki Ishimatsu
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2010-10-15 10:04 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: kosaki.motohiro, linux-kernel

On 2010-10-15 10:39, Yasuaki Ishimatsu wrote:
> Hi Jens,
> 
> Jens Axboe wrote:
>> On 2010-10-14 14:48, Yasuaki Ishimatsu wrote:
>>> Index: linux-2.6.36-rc7/block/blk-core.c
>>> ===================================================================
>>> --- linux-2.6.36-rc7.orig/block/blk-core.c	2010-10-07 05:39:52.000000000 +0900
>>> +++ linux-2.6.36-rc7/block/blk-core.c	2010-10-14 17:25:43.000000000 +0900
>>> @@ -66,9 +66,15 @@ static void drive_stat_acct(struct reque
>>>  	cpu = part_stat_lock();
>>>  	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>>>
>>> -	if (!new_io)
>>> +	if (!new_io) {
>>> +		if (unlikely(rq->part != part)) {
>>> +			part_dec_in_flight(rq->part, rw);
>>> +			part_inc_in_flight(part, rw);
>>> +			rq->part = part;
>>> +		}
>>>  		part_stat_inc(cpu, part, merges[rw]);
>>> -	else {
>>> +	} else {
>>> +		rq->part = part;
>>>  		part_round_stats(cpu, part);
>>>  		part_inc_in_flight(part, rw);
>>>  	}
>>
>> I was thinking that we'd always skip the lookup if ->part was
>> already set. It will probably require quiescing IO on partition
>> table reload, though.
> 
> O.K.
> I removed the extra part lookups. The following patch also fixes the wrong
> accounting of hd_struct->in_flight. But I could not work out how to stop IO
> while the partition table is reloaded. Do you have any ideas?

This looks good! To quiesce the queue, something like the below.
Completely untested.

diff --git a/block/blk-core.c b/block/blk-core.c
index 32a1c12..dce2f68 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -796,11 +796,16 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	rl->starved[is_sync] = 0;
 
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
+	if (priv) {
 		rl->elvpriv++;
 
-	if (blk_queue_io_stat(q))
-		rw_flags |= REQ_IO_STAT;
+		/*
+		 * Don't do stats for non-priv requests
+		 */
+		if (blk_queue_io_stat(q))
+			rw_flags |= REQ_IO_STAT;
+	}
+
 	spin_unlock_irq(q->queue_lock);
 
 	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
diff --git a/block/genhd.c b/block/genhd.c
index 59a2db6..2ecbe7d 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -925,8 +925,10 @@ static void disk_free_ptbl_rcu_cb(struct rcu_head *head)
 {
 	struct disk_part_tbl *ptbl =
 		container_of(head, struct disk_part_tbl, rcu_head);
+	struct gendisk *disk = ptbl->disk;
 
 	kfree(ptbl);
+	elv_quiesce_end(disk->queue);
 }
 
 /**
@@ -949,6 +951,7 @@ static void disk_replace_part_tbl(struct gendisk *disk,
 
 	if (old_ptbl) {
 		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
+		elv_quiesce_start(disk->queue);
 		call_rcu(&old_ptbl->rcu_head, disk_free_ptbl_rcu_cb);
 	}
 }
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 5f2f4c4..69b21bb 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -130,6 +130,7 @@ struct disk_part_tbl {
 	struct rcu_head rcu_head;
 	int len;
 	struct hd_struct *last_lookup;
+	struct gendisk *disk;
 	struct hd_struct *part[];
 };
 

-- 
Jens Axboe



* [PATCH v4] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-15 10:04         ` Jens Axboe
@ 2010-10-18  8:28           ` Yasuaki Ishimatsu
  2010-10-18  8:34             ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Yasuaki Ishimatsu @ 2010-10-18  8:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: kosaki.motohiro, linux-kernel

Hi Jens,

> This looks good! To quiesce the queue, something like the below.
> Completely untested.

Thank you for your advice.
I applied your idea to the patch.

Regards,
Yasuaki Ishimatsu
===

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

The cause is incorrect accounting of hd_struct->in_flight when a bio is merged
by ELEVATOR_FRONT_MERGE into a request that belongs to a different partition.

The detailed root cause is as follows.

Assume that there are two partitions, sda1 and sda2.

1. A request for sda2 is in the request_queue. Hence sda1's
   hd_struct->in_flight is 0 and sda2's is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belonging to sda1 is issued and is merged into the request mentioned
   in step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request changes
   from the sda2 region to the sda1 region. However, neither partition's
   hd_struct->in_flight is updated.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request finishes and blk_account_io_done() is called. It looks up the
   partition from the request's first sector, so sda1's hd_struct->in_flight,
   not sda2's, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 block/blk-core.c         |   24 ++++++++++++++++--------
 block/blk-merge.c        |    2 +-
 block/blk.h              |    4 ----
 block/genhd.c            |    5 +++++
 fs/partitions/check.c    |    5 +++++
 include/linux/blkdev.h   |    1 +
 include/linux/elevator.h |    2 ++
 7 files changed, 30 insertions(+), 13 deletions(-)

Index: linux-2.6.36-rc7/block/blk-core.c
===================================================================
--- linux-2.6.36-rc7.orig/block/blk-core.c	2010-10-15 09:21:37.000000000 +0900
+++ linux-2.6.36-rc7/block/blk-core.c	2010-10-18 14:45:19.000000000 +0900
@@ -64,13 +64,15 @@ static void drive_stat_acct(struct reque
 		return;

 	cpu = part_stat_lock();
-	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));

-	if (!new_io)
+	if (!new_io) {
+		part = rq->part;
 		part_stat_inc(cpu, part, merges[rw]);
-	else {
+	} else {
+		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
+		rq->part = part;
 	}

 	part_stat_unlock();
@@ -128,6 +130,7 @@ void blk_rq_init(struct request_queue *q
 	rq->ref_count = 1;
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
+	rq->part = NULL;
 }
 EXPORT_SYMBOL(blk_rq_init);

@@ -796,11 +799,16 @@ static struct request *get_request(struc
 	rl->starved[is_sync] = 0;

 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
+	if (priv) {
 		rl->elvpriv++;

-	if (blk_queue_io_stat(q))
-		rw_flags |= REQ_IO_STAT;
+		/*
+		 * Don't do stats for non-priv requests
+		 */
+		if (blk_queue_io_stat(q))
+			rw_flags |= REQ_IO_STAT;
+	}
+
 	spin_unlock_irq(q->queue_lock);

 	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
@@ -1759,7 +1767,7 @@ static void blk_account_io_completion(st
 		int cpu;

 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
 		part_stat_unlock();
 	}
@@ -1779,7 +1787,7 @@ static void blk_account_io_done(struct r
 		int cpu;

 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;

 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
Index: linux-2.6.36-rc7/block/blk-merge.c
===================================================================
--- linux-2.6.36-rc7.orig/block/blk-merge.c	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/block/blk-merge.c	2010-10-18 14:41:03.000000000 +0900
@@ -343,7 +343,7 @@ static void blk_account_io_merge(struct
 		int cpu;

 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;

 		part_round_stats(cpu, part);
 		part_dec_in_flight(part, rq_data_dir(req));
Index: linux-2.6.36-rc7/include/linux/blkdev.h
===================================================================
--- linux-2.6.36-rc7.orig/include/linux/blkdev.h	2010-10-15 09:21:37.000000000 +0900
+++ linux-2.6.36-rc7/include/linux/blkdev.h	2010-10-15 09:26:22.000000000 +0900
@@ -115,6 +115,7 @@ struct request {
 	void *elevator_private3;

 	struct gendisk *rq_disk;
+	struct hd_struct *part;
 	unsigned long start_time;
 #ifdef CONFIG_BLK_CGROUP
 	unsigned long long start_time_ns;
Index: linux-2.6.36-rc7/block/genhd.c
===================================================================
--- linux-2.6.36-rc7.orig/block/genhd.c	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/block/genhd.c	2010-10-18 14:38:04.000000000 +0900
@@ -944,12 +944,17 @@ static void disk_replace_part_tbl(struct
 				  struct disk_part_tbl *new_ptbl)
 {
 	struct disk_part_tbl *old_ptbl = disk->part_tbl;
+	struct request_queue *q = disk->queue;

 	rcu_assign_pointer(disk->part_tbl, new_ptbl);

 	if (old_ptbl) {
 		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
+		spin_lock_irq(q->queue_lock);
+		elv_quiesce_start(q);
 		call_rcu(&old_ptbl->rcu_head, disk_free_ptbl_rcu_cb);
+		elv_quiesce_end(q);
+		spin_unlock_irq(q->queue_lock);
 	}
 }

Index: linux-2.6.36-rc7/fs/partitions/check.c
===================================================================
--- linux-2.6.36-rc7.orig/fs/partitions/check.c	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/fs/partitions/check.c	2010-10-18 16:19:58.000000000 +0900
@@ -375,6 +375,7 @@ void delete_partition(struct gendisk *di
 {
 	struct disk_part_tbl *ptbl = disk->part_tbl;
 	struct hd_struct *part;
+	struct request_queue *q = disk->queue;

 	if (partno >= ptbl->len)
 		return;
@@ -389,7 +390,11 @@ void delete_partition(struct gendisk *di
 	kobject_put(part->holder_dir);
 	device_del(part_to_dev(part));

+	spin_lock_irq(q->queue_lock);
+	elv_quiesce_start(disk->queue);
 	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
+	elv_quiesce_end(disk->queue);
+	spin_unlock_irq(q->queue_lock);
 }

 static ssize_t whole_disk_show(struct device *dev,
Index: linux-2.6.36-rc7/block/blk.h
===================================================================
--- linux-2.6.36-rc7.orig/block/blk.h	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/block/blk.h	2010-10-18 16:22:47.000000000 +0900
@@ -110,10 +110,6 @@ void blk_queue_congestion_threshold(stru

 int blk_dev_init(void);

-void elv_quiesce_start(struct request_queue *q);
-void elv_quiesce_end(struct request_queue *q);
-
-
 /*
  * Return the threshold (number of used requests) at which the queue is
  * considered to be congested.  It include a little hysteresis to keep the
Index: linux-2.6.36-rc7/include/linux/elevator.h
===================================================================
--- linux-2.6.36-rc7.orig/include/linux/elevator.h	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/include/linux/elevator.h	2010-10-18 17:09:58.000000000 +0900
@@ -121,6 +121,8 @@ extern void elv_completed_request(struct
 extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
+extern void elv_quiesce_start(struct request_queue *);
+extern void elv_quiesce_end(struct request_queue *);

 /*
  * io scheduler registration



* Re: [PATCH v4] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-18  8:28           ` [PATCH v4] " Yasuaki Ishimatsu
@ 2010-10-18  8:34             ` Jens Axboe
  2010-10-18 12:19               ` [PATCH v5] " Yasuaki Ishimatsu
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2010-10-18  8:34 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: kosaki.motohiro, linux-kernel

On 2010-10-18 10:28, Yasuaki Ishimatsu wrote:
> Hi Jens,
> 
>> This looks good! To quiesce the queue, something like the below.
>> Completely untested.
> 
> Thank you for your advice.
> I applied your idea to the patch.

But you changed it, though:

>  	if (old_ptbl) {
>  		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
> +		spin_lock_irq(q->queue_lock);
> +		elv_quiesce_start(q);
>  		call_rcu(&old_ptbl->rcu_head, disk_free_ptbl_rcu_cb);
> +		elv_quiesce_end(q);
> +		spin_unlock_irq(q->queue_lock);
>  	}
>  }

That is not going to work. The point is to start the drain period
before, then end it when the callback has gone through. By placing it
just after the call_rcu() call, there's no guarantee that the RCU grace
period has elapsed. That is why I placed it inside the rcu callback. Why
did you move it?

For the above to work, you'd need to use synchronize_rcu() instead.

-- 
Jens Axboe



* [PATCH v5] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-18  8:34             ` Jens Axboe
@ 2010-10-18 12:19               ` Yasuaki Ishimatsu
  2010-10-18 12:21                 ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Yasuaki Ishimatsu @ 2010-10-18 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: kosaki.motohiro, linux-kernel

Hi Jens,

Jens Axboe wrote:
> On 2010-10-18 10:28, Yasuaki Ishimatsu wrote:
>> Hi Jens,
>>
>>> This looks good! To quiesce the queue, something like the below.
>>> Completely untested.
>> Thank you for your advice.
>> I applied your idea to the patch.
> 
> But you changed it, though:
> 
>>  	if (old_ptbl) {
>>  		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
>> +		spin_lock_irq(q->queue_lock);
>> +		elv_quiesce_start(q);
>>  		call_rcu(&old_ptbl->rcu_head, disk_free_ptbl_rcu_cb);
>> +		elv_quiesce_end(q);
>> +		spin_unlock_irq(q->queue_lock);
>>  	}
>>  }
> 
> That is not going to work. The point is to start the drain period
> before, then end it when the callback has gone through. By placing it
> just after the call_rcu() call, there's no guarantee that the RCU grace
> period has elapsed. That is why I placed it inside the rcu callback. Why
> did you move it?

Ah...
I misunderstood the purpose of the call_rcu().
I moved elv_quiesce_end() to the rcu callback.

Regards,
Yasuaki Ishimatsu.
===

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

The cause is wrong accounting of hd_struct->in_flight, which happens when a
bio is merged by ELEVATOR_FRONT_MERGE into a request that belongs to a
different partition.

The detailed root cause is as follows.

Assume that there are two partitions, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio that belongs to sda1 is issued and is merged into the request from
   step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request moves
   from the sda2 region to the sda1 region. However, neither partition's
   hd_struct->in_flight is changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda1's hd_struct->in_flight, not sda2's, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

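The patch fixes the problem by recording the partition in rq->part when the
request is first accounted and using that pointer for all later accounting.
It also quiesces the queue while a partition or partition table is freed.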

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 block/blk-core.c         |   24 ++++++++++++++++--------
 block/blk-merge.c        |    2 +-
 block/blk.h              |    4 ----
 block/genhd.c            |   14 ++++++++++++++
 fs/partitions/check.c    |   12 ++++++++++++
 include/linux/blkdev.h   |    1 +
 include/linux/elevator.h |    2 ++
 include/linux/genhd.h    |    1 +
 8 files changed, 47 insertions(+), 13 deletions(-)

Index: linux-2.6.36-rc7/block/blk-core.c
===================================================================
--- linux-2.6.36-rc7.orig/block/blk-core.c	2010-10-15 09:21:37.000000000 +0900
+++ linux-2.6.36-rc7/block/blk-core.c	2010-10-18 14:45:19.000000000 +0900
@@ -64,13 +64,15 @@ static void drive_stat_acct(struct reque
 		return;

 	cpu = part_stat_lock();
-	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));

-	if (!new_io)
+	if (!new_io) {
+		part = rq->part;
 		part_stat_inc(cpu, part, merges[rw]);
-	else {
+	} else {
+		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
+		rq->part = part;
 	}

 	part_stat_unlock();
@@ -128,6 +130,7 @@ void blk_rq_init(struct request_queue *q
 	rq->ref_count = 1;
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
+	rq->part = NULL;
 }
 EXPORT_SYMBOL(blk_rq_init);

@@ -796,11 +799,16 @@ static struct request *get_request(struc
 	rl->starved[is_sync] = 0;

 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
+	if (priv) {
 		rl->elvpriv++;

-	if (blk_queue_io_stat(q))
-		rw_flags |= REQ_IO_STAT;
+		/*
+		 * Don't do stats for non-priv requests
+		 */
+		if (blk_queue_io_stat(q))
+			rw_flags |= REQ_IO_STAT;
+	}
+
 	spin_unlock_irq(q->queue_lock);

 	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
@@ -1759,7 +1767,7 @@ static void blk_account_io_completion(st
 		int cpu;

 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
 		part_stat_unlock();
 	}
@@ -1779,7 +1787,7 @@ static void blk_account_io_done(struct r
 		int cpu;

 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;

 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
Index: linux-2.6.36-rc7/block/blk-merge.c
===================================================================
--- linux-2.6.36-rc7.orig/block/blk-merge.c	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/block/blk-merge.c	2010-10-18 14:41:03.000000000 +0900
@@ -343,7 +343,7 @@ static void blk_account_io_merge(struct
 		int cpu;

 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;

 		part_round_stats(cpu, part);
 		part_dec_in_flight(part, rq_data_dir(req));
Index: linux-2.6.36-rc7/include/linux/blkdev.h
===================================================================
--- linux-2.6.36-rc7.orig/include/linux/blkdev.h	2010-10-15 09:21:37.000000000 +0900
+++ linux-2.6.36-rc7/include/linux/blkdev.h	2010-10-15 09:26:22.000000000 +0900
@@ -115,6 +115,7 @@ struct request {
 	void *elevator_private3;

 	struct gendisk *rq_disk;
+	struct hd_struct *part;
 	unsigned long start_time;
 #ifdef CONFIG_BLK_CGROUP
 	unsigned long long start_time_ns;
Index: linux-2.6.36-rc7/block/genhd.c
===================================================================
--- linux-2.6.36-rc7.orig/block/genhd.c	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/block/genhd.c	2010-10-18 20:50:18.000000000 +0900
@@ -925,8 +925,15 @@ static void disk_free_ptbl_rcu_cb(struct
 {
 	struct disk_part_tbl *ptbl =
 		container_of(head, struct disk_part_tbl, rcu_head);
+	struct gendisk *disk = ptbl->disk;
+	struct request_queue *q = disk->queue;
+	unsigned long flags;

 	kfree(ptbl);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	elv_quiesce_end(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }

 /**
@@ -944,11 +951,17 @@ static void disk_replace_part_tbl(struct
 				  struct disk_part_tbl *new_ptbl)
 {
 	struct disk_part_tbl *old_ptbl = disk->part_tbl;
+	struct request_queue *q = disk->queue;

 	rcu_assign_pointer(disk->part_tbl, new_ptbl);

 	if (old_ptbl) {
 		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
+
+		spin_lock_irq(q->queue_lock);
+		elv_quiesce_start(q);
+		spin_unlock_irq(q->queue_lock);
+
 		call_rcu(&old_ptbl->rcu_head, disk_free_ptbl_rcu_cb);
 	}
 }
@@ -989,6 +1002,7 @@ int disk_expand_part_tbl(struct gendisk
 		return -ENOMEM;

 	new_ptbl->len = target;
+	new_ptbl->disk = disk;

 	for (i = 0; i < len; i++)
 		rcu_assign_pointer(new_ptbl->part[i], old_ptbl->part[i]);
Index: linux-2.6.36-rc7/fs/partitions/check.c
===================================================================
--- linux-2.6.36-rc7.orig/fs/partitions/check.c	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/fs/partitions/check.c	2010-10-18 20:20:43.000000000 +0900
@@ -364,17 +364,25 @@ struct device_type part_type = {
 static void delete_partition_rcu_cb(struct rcu_head *head)
 {
 	struct hd_struct *part = container_of(head, struct hd_struct, rcu_head);
+	struct gendisk *disk = part_to_disk(part);
+	struct request_queue *q = disk->queue;
+	unsigned long flags;

 	part->start_sect = 0;
 	part->nr_sects = 0;
 	part_stat_set_all(part, 0);
 	put_device(part_to_dev(part));
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	elv_quiesce_end(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }

 void delete_partition(struct gendisk *disk, int partno)
 {
 	struct disk_part_tbl *ptbl = disk->part_tbl;
 	struct hd_struct *part;
+	struct request_queue *q = disk->queue;

 	if (partno >= ptbl->len)
 		return;
@@ -389,6 +397,10 @@ void delete_partition(struct gendisk *di
 	kobject_put(part->holder_dir);
 	device_del(part_to_dev(part));

+	spin_lock_irq(q->queue_lock);
+	elv_quiesce_start(q);
+	spin_unlock_irq(q->queue_lock);
+
 	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
 }

Index: linux-2.6.36-rc7/block/blk.h
===================================================================
--- linux-2.6.36-rc7.orig/block/blk.h	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/block/blk.h	2010-10-18 16:22:47.000000000 +0900
@@ -110,10 +110,6 @@ void blk_queue_congestion_threshold(stru

 int blk_dev_init(void);

-void elv_quiesce_start(struct request_queue *q);
-void elv_quiesce_end(struct request_queue *q);
-
-
 /*
  * Return the threshold (number of used requests) at which the queue is
  * considered to be congested.  It include a little hysteresis to keep the
Index: linux-2.6.36-rc7/include/linux/elevator.h
===================================================================
--- linux-2.6.36-rc7.orig/include/linux/elevator.h	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/include/linux/elevator.h	2010-10-18 17:09:58.000000000 +0900
@@ -121,6 +121,8 @@ extern void elv_completed_request(struct
 extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
+extern void elv_quiesce_start(struct request_queue *);
+extern void elv_quiesce_end(struct request_queue *);

 /*
  * io scheduler registration
Index: linux-2.6.36-rc7/include/linux/genhd.h
===================================================================
--- linux-2.6.36-rc7.orig/include/linux/genhd.h	2010-10-07 05:39:52.000000000 +0900
+++ linux-2.6.36-rc7/include/linux/genhd.h	2010-10-18 19:57:36.000000000 +0900
@@ -130,6 +130,7 @@ struct disk_part_tbl {
 	struct rcu_head rcu_head;
 	int len;
 	struct hd_struct *last_lookup;
+	struct gendisk *disk;
 	struct hd_struct *part[];
 };




* Re: [PATCH v5] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-18 12:19               ` [PATCH v5] " Yasuaki Ishimatsu
@ 2010-10-18 12:21                 ` Jens Axboe
  2010-10-19  2:22                   ` Yasuaki Ishimatsu
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2010-10-18 12:21 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: kosaki.motohiro, linux-kernel

On 2010-10-18 14:19, Yasuaki Ishimatsu wrote:
> Hi Jens,
> 
> Jens Axboe wrote:
>> On 2010-10-18 10:28, Yasuaki Ishimatsu wrote:
>>> Hi Jens,
>>>
>>>> This looks good! To quiesce the queue, something like the below.
>>>> Completely untested.
>>> Thank you for your advice.
>>> I applied your idea to the patch.
>>
>> But you changed it, though:
>>
>>>  	if (old_ptbl) {
>>>  		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
>>> +		spin_lock_irq(q->queue_lock);
>>> +		elv_quiesce_start(q);
>>>  		call_rcu(&old_ptbl->rcu_head, disk_free_ptbl_rcu_cb);
>>> +		elv_quiesce_end(q);
>>> +		spin_unlock_irq(q->queue_lock);
>>>  	}
>>>  }
>>
>> That is not going to work. The point is to start the drain period
>> before, then end it when the callback has gone through. By placing it
>> just after the call_rcu() call, there's no guarantee that the RCU grace
>> period has elapsed. That is why I placed it inside the rcu callback. Why
>> did you move it?
> 
> Ah...
> I misunderstood the purpose of the call_rcu().
> I moved elv_quiesce_end() to the rcu callback.

This version looks good, thanks for following through on this. What kind
of testing did you do?

-- 
Jens Axboe



* Re: [PATCH v5] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-18 12:21                 ` Jens Axboe
@ 2010-10-19  2:22                   ` Yasuaki Ishimatsu
  2010-10-19 10:02                     ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Yasuaki Ishimatsu @ 2010-10-19  2:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: kosaki.motohiro, linux-kernel

Hi Jens,

> This version looks good, thanks for following through on this. What kind
> of testing did you do?

I ran three kinds of tests.

1. Run a reproducer.
   I confirmed that the problem was fixed using the attached reproducer.

2. Remove and create a partition while I/O is running.
   I confirmed that no kernel panic or hang occurred using the following steps.

     # dd if=/dev/sda of=null &
     # while true
     > do
     > /sbin/parted /dev/sda rm 3 > /dev/null
     > /sbin/parted /dev/sda mkpart primary ext2 42.5GB 52.5GB > /dev/null
     > sleep 2
     > done

3. Reload the partition table while I/O is running.
   I confirmed that no kernel panic or hang occurred using the following steps.

     # dd if=/dev/sda of=null &
     # while true
     > do
     > /sbin/hdparm -z /dev/sda
     > sleep 2
     > done

--- <reproducer>
/*
 *  This test program repeatedly reads 512 bytes from the given sector of the
 *  given block device.
 *
 * 	Usage : read <sec_no> <blk_dev>
 * 		@sec_no  : sector number
 * 		@blk_dev : block device
 *
 *	ex) If you want to read 512 bytes data from sector number 100 of
 *	    /dev/sda
 *
 *	    # ./read 100 /dev/sda
 *
 *	How to build:
 *
 *	    $ gcc -D _GNU_SOURCE -o read read.c
 *
 *	How to reproduce the wrong accounting of hd_struct->in_flight:
 *
 *	    1. confirm a start sector of a partition
 *	       # cat /sys/block/sda/sda2/start
 *	       1044225
 *	    2. run the test program against sector no.1044225 of /dev/sda
 *	       # ./read 1044225 /dev/sda &
 *	    3. run the test program against sector no.1044224 of /dev/sda
 *	       # ./read 1044224 /dev/sda &
 *	    4. confirm the /proc/diskstats
 *	       # cat /proc/diskstats |grep sda
 *	          8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
 *	          8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
 *	          8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
 *	          8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
 *	          8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
 *	          8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
 */

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>

#include <sys/types.h>
#include <sys/stat.h>

#define SECTOR 512  // sector size
#define ALIGN  512  // for direct io

int main(int argc, char *argv[])
{
	int fd;
	int err;
	int sector_n;
	off_t offset;
	ssize_t nread;
	void *buf;

	if (argc != 3) {
		printf("invalid argument\n");
		exit(1);
	}

	sector_n = atoi(argv[1]);
	printf("read 512 bytes data from sector no.%d\n", sector_n);

	/* posix_memalign() returns an error number and does not set errno. */
	err = posix_memalign(&buf, ALIGN, SECTOR);
	if (err != 0) {
		errno = err;
		perror("posix_memalign");
		exit(1);
	}

	fd = open(argv[2], O_RDONLY|O_DIRECT);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	while (1) {
		/* Seek to the byte offset of the requested sector. */
		offset = lseek(fd, (off_t)sector_n * SECTOR, SEEK_SET);
		if (offset == (off_t)-1) {
			perror("lseek");
			exit(1);
		}

		nread = read(fd, buf, SECTOR);
		if (nread != SECTOR) {
			perror("read");
			exit(1);
		}
	}

	/* Not reached: the loop above only ends through exit(). */
	free(buf);
	close(fd);
	return 0;
}
---

Regards,
Yasuaki Ishimatsu





* Re: [PATCH v5] blk: fix a wrong accounting of hd_struct->in_flight
  2010-10-19  2:22                   ` Yasuaki Ishimatsu
@ 2010-10-19 10:02                     ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2010-10-19 10:02 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: kosaki.motohiro, linux-kernel

On 2010-10-19 04:22, Yasuaki Ishimatsu wrote:
> Hi Jens,
> 
>> This version looks good, thanks for following through on this. What kind
>> of testing did you do?
> 
> I ran three kinds of tests.
> 
> 1. Run a reproducer.
>    I confirmed that the problem was fixed using the attached reproducer.
> 
> 2. Remove and create a partition while I/O is running.
>    I confirmed that no kernel panic or hang occurred using the following steps.
> 
>      # dd if=/dev/sda of=null &
>      # while true
>      > do
>      > /sbin/parted /dev/sda rm 3 > /dev/null
>      > /sbin/parted /dev/sda mkpart primary ext2 42.5GB 52.5GB > /dev/null
>      > sleep 2
>      > done
> 
> 3. Reload the partition table while I/O is running.
>    I confirmed that no kernel panic or hang occurred using the following steps.
> 
>      # dd if=/dev/sda of=null &
>      # while true
>      > do
>      > /sbin/hdparm -z /dev/sda
>      > sleep 2
>      > done

Thanks, sounds good! I have queued it up for 2.6.37 and marked as a
stable backport candidate.

-- 
Jens Axboe



Thread overview: 16+ messages
2010-10-12  6:38 [PATCH] blk: fix a wrong accounting of hd_struct->in_flight Yasuaki Ishimatsu
2010-10-12  8:19 ` Jens Axboe
2010-10-14 12:48   ` [PATCH v2] " Yasuaki Ishimatsu
2010-10-14 12:55     ` Jens Axboe
2010-10-15  8:39       ` [PATCH v3] " Yasuaki Ishimatsu
2010-10-15 10:04         ` Jens Axboe
2010-10-18  8:28           ` [PATCH v4] " Yasuaki Ishimatsu
2010-10-18  8:34             ` Jens Axboe
2010-10-18 12:19               ` [PATCH v5] " Yasuaki Ishimatsu
2010-10-18 12:21                 ` Jens Axboe
2010-10-19  2:22                   ` Yasuaki Ishimatsu
2010-10-19 10:02                     ` Jens Axboe
2010-10-14  6:07 ` [PATCH] " KOSAKI Motohiro
2010-10-14 12:44   ` Jens Axboe
2010-10-14 23:30     ` Paul E. McKenney
2010-10-15  7:30       ` Jens Axboe