All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] Don't merge different partition's IOs
@ 2010-12-06  9:44 Yasuaki Ishimatsu
  2010-12-06 16:08 ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Yasuaki Ishimatsu @ 2010-12-06  9:44 UTC (permalink / raw)
  To: jaxboe, vgoyal, jmarchan, torvalds, linux-kernel

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

PROBLEM:

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

HOW TO FIX:

The patch fixes the problem by changing a merging rule.

The problem is caused by merging different partition's I/Os. So the patch
check whether a merging bio or request is a same partition as a request or not
by using a partition's start sector and size.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

---
 block/blk-core.c       |    2 ++
 block/blk-merge.c      |   11 +++++++++++
 include/linux/blkdev.h |   14 ++++++++++++++
 3 files changed, 27 insertions(+)

Index: linux-2.6.37-rc3/block/blk-core.c
===================================================================
--- linux-2.6.37-rc3.orig/block/blk-core.c	2010-11-30 13:33:02.000000000 +0900
+++ linux-2.6.37-rc3/block/blk-core.c	2010-12-03 17:15:50.000000000 +0900
@@ -71,6 +71,8 @@ static void drive_stat_acct(struct reque
 	else {
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
+		rq->__part_start_sect = part->start_sect;
+		rq->__part_size = part->nr_sects;
 	}

 	part_stat_unlock();
Index: linux-2.6.37-rc3/block/blk-merge.c
===================================================================
--- linux-2.6.37-rc3.orig/block/blk-merge.c	2010-11-30 13:33:02.000000000 +0900
+++ linux-2.6.37-rc3/block/blk-merge.c	2010-12-03 17:15:50.000000000 +0900
@@ -235,6 +235,9 @@ int ll_back_merge_fn(struct request_queu
 	else
 		max_sectors = queue_max_sectors(q);

+	if (blk_rq_part_sector(req) + blk_rq_part_size(req) <= bio->bi_sector)
+		return 0;
+
 	if (blk_rq_sectors(req) + bio_sectors(bio) > max_sectors) {
 		req->cmd_flags |= REQ_NOMERGE;
 		if (req == q->last_merge)
@@ -259,6 +262,8 @@ int ll_front_merge_fn(struct request_que
 	else
 		max_sectors = queue_max_sectors(q);

+	if (bio->bi_sector < blk_rq_part_sector(req))
+		return 0;

 	if (blk_rq_sectors(req) + bio_sectors(bio) > max_sectors) {
 		req->cmd_flags |= REQ_NOMERGE;
@@ -382,6 +387,12 @@ static int attempt_merge(struct request_
 		return 0;

 	/*
+	 * Don't merge different partition's request
+	 */
+	if (blk_rq_part_sector(req) != blk_rq_part_sector(next))
+		return 0;
+
+	/*
 	 * not contiguous
 	 */
 	if (blk_rq_pos(req) + blk_rq_sectors(req) != blk_rq_pos(next))
Index: linux-2.6.37-rc3/include/linux/blkdev.h
===================================================================
--- linux-2.6.37-rc3.orig/include/linux/blkdev.h	2010-11-30 13:33:02.000000000 +0900
+++ linux-2.6.37-rc3/include/linux/blkdev.h	2010-11-30 16:46:19.000000000 +0900
@@ -91,6 +91,8 @@ struct request {
 	/* the following two fields are internal, NEVER access directly */
 	unsigned int __data_len;	/* total data len */
 	sector_t __sector;		/* sector cursor */
+	sector_t __part_start_sect;	/* partition's start sector*/
+	sector_t __part_size;		/* partition's size*/

 	struct bio *bio;
 	struct bio *biotail;
@@ -724,6 +726,8 @@ static inline struct request_queue *bdev
  * blk_rq_err_bytes()		: bytes left till the next error boundary
  * blk_rq_sectors()		: sectors left in the entire request
  * blk_rq_cur_sectors()		: sectors left in the current segment
+ * blk_rq_part_sector()		: partition's start sector
+ * blk_rq_part_size()		: partition's size
  */
 static inline sector_t blk_rq_pos(const struct request *rq)
 {
@@ -752,6 +756,16 @@ static inline unsigned int blk_rq_cur_se
 	return blk_rq_cur_bytes(rq) >> 9;
 }

+static inline sector_t blk_rq_part_sector(const struct request *rq)
+{
+	return rq->__part_start_sect;
+}
+
+static inline sector_t blk_rq_part_size(const struct request *rq)
+{
+	return rq->__part_size;
+}
+
 /*
  * Request issue related functions.
  */


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-06  9:44 [PATCH 1/2] Don't merge different partition's IOs Yasuaki Ishimatsu
@ 2010-12-06 16:08 ` Linus Torvalds
  2010-12-07  7:18   ` Satoru Takeuchi
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2010-12-06 16:08 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: jaxboe, vgoyal, jmarchan, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1035 bytes --]

2010/12/6 Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>:
>
> The problem is caused by merging different partition's I/Os. So the patch
> check whether a merging bio or request is a same partition as a request or not
> by using a partition's start sector and size.

I really think this is wrong.

We should just carry the partition information around in the req and
the bio, and just compare the pointers, rather than compare the range.
No need to even dereference the pointers, you should be able to just
do

   /* don't merge if not on the same partition */
   if (bio->part != req->part)
      return 0;

or something.

This is doubly true since the accounting already does that horrible
partition lookup: rather than look it up, we should just _set_ it in
__generic_make_request(), where I think we already know it since we do
that whole blk_partition_remap().

So just something like the appended (TOTALLY UNTESTED) perhaps?

Note that this should get it right even for overlapping partitions etc.

                     Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2679 bytes --]

 block/blk-core.c          |    2 ++
 block/blk-merge.c         |    9 +++++++++
 include/linux/blk_types.h |    2 +-
 include/linux/blkdev.h    |    1 +
 4 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..a0c13a9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1323,6 +1323,7 @@ static inline void blk_partition_remap(struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
 
+	bio->bi_part = bdev;
 	if (bio_sectors(bio) && bdev != bdev->bd_contains) {
 		struct hd_struct *p = bdev->bd_part;
 
@@ -2437,6 +2438,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 	rq->__data_len = bio->bi_size;
 	rq->bio = rq->biotail = bio;
 
+	rq->rq_part = bio->bi_part;
 	if (bio->bi_bdev)
 		rq->rq_disk = bio->bi_bdev->bd_disk;
 }
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 77b7c26..3913f5f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -230,6 +230,9 @@ int ll_back_merge_fn(struct request_queue *q, struct request *req,
 {
 	unsigned short max_sectors;
 
+	if (req->rq_part != bio->bi_part)
+		return 0;
+
 	if (unlikely(req->cmd_type == REQ_TYPE_BLOCK_PC))
 		max_sectors = queue_max_hw_sectors(q);
 	else
@@ -254,6 +257,9 @@ int ll_front_merge_fn(struct request_queue *q, struct request *req,
 {
 	unsigned short max_sectors;
 
+	if (bio->bi_part != req->rq_part)
+		return 0;
+
 	if (unlikely(req->cmd_type == REQ_TYPE_BLOCK_PC))
 		max_sectors = queue_max_hw_sectors(q);
 	else
@@ -392,6 +398,9 @@ static int attempt_merge(struct request_queue *q, struct request *req,
 	    || next->special)
 		return 0;
 
+	if (req->rq_part != next->rq_part)
+		return 0;
+
 	/*
 	 * If we are allowed to merge, then append bio list
 	 * from next to rq and release next. merge_requests_fn
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 46ad519..0940480 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -34,7 +34,7 @@ struct bio {
 	sector_t		bi_sector;	/* device address in 512 byte
 						   sectors */
 	struct bio		*bi_next;	/* request queue link */
-	struct block_device	*bi_bdev;
+	struct block_device	*bi_bdev, *bi_part;
 	unsigned long		bi_flags;	/* status, command, etc */
 	unsigned long		bi_rw;		/* bottom bits READ/WRITE,
 						 * top bits priority
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aae86fd..f31bd4f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -114,6 +114,7 @@ struct request {
 	void *elevator_private2;
 	void *elevator_private3;
 
+	struct block_device *rq_part;
 	struct gendisk *rq_disk;
 	unsigned long start_time;
 #ifdef CONFIG_BLK_CGROUP

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-06 16:08 ` Linus Torvalds
@ 2010-12-07  7:18   ` Satoru Takeuchi
  2010-12-07 18:39     ` Vivek Goyal
  2010-12-08  7:33     ` Jens Axboe
  0 siblings, 2 replies; 47+ messages in thread
From: Satoru Takeuchi @ 2010-12-07  7:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Yasuaki Ishimatsu, jaxboe, vgoyal, jmarchan, linux-kernel

Hi Linus, Yasuaki,  and Jens

(2010/12/07 1:08), Linus Torvalds wrote:
> 2010/12/6 Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>:
>>
>> The problem is caused by merging different partition's I/Os. So the patch
>> check whether a merging bio or request is a same partition as a request or not
>> by using a partition's start sector and size.
>
> I really think this is wrong.
>
> We should just carry the partition information around in the req and
> the bio, and just compare the pointers, rather than compare the range.
> No need to even dereference the pointers, you should be able to just
> do
>
>     /* don't merge if not on the same partition */
>     if (bio->part != req->part)
>        return 0;
>
> or something.
>
> This is doubly true since the accounting already does that horrible
> partition lookup: rather than look it up, we should just _set_ it in
> __generic_make_request(), where I think we already know it since we do
> that whole blk_partition_remap().
>
> So just something like the appended (TOTALLY UNTESTED) perhaps?
>
> Note that this should get it right even for overlapping partitions etc.
>
>                       Linus

The problem can occur even if your patches are applied. Think about a case
like the following.

  1) There are 2 partition, sda1 and sda2, on sda.
  2) Open sda and issue an IO to sda2's first sector. Then sda2's in_flight
     is incremented though you open not sda2 but sda. It is because of
     partition lookup method. It is based on which partition rq->__sector
     sector belongs to.
  3) Issue an IO to sda1's last sector and it merged to the IO issued in
     step (2) because their part are both sda. In addition, rq->__sector
     is modified to the sda1's region.
  4) After completing the IO, sda1's in_flight is decremented and diskstat
     is corrupted here.

I think fixing this case is difficult and would cause more complexity.

I hit on another approach. Although it doesn'tprevent any merge as Linus
preferred, it can fix the problem anyway. In this idea, in_flight is
incremented and decremented for the partition which the request belonged
to in its creation. It has the following merits.

  - It can fix the problem which Yasuaki reported, including the cases which
    I mentioned above.
  - It only append one extra field to request.

Although it would causes a bit gap, it doesn't have most influences because
merging requests beyond partitions is the rare case.

I confirmed the attached patch can be applied to 2.6.37-rc4 and succeeded
to compile. If you can accept this idea, I'll test it soon.

---
  block/blk-core.c       |   12 +++++++-----
  block/blk-merge.c      |    2 +-
  include/linux/blkdev.h |    6 ++++++
  3 files changed, 14 insertions(+), 6 deletions(-)

Index: linux-2.6.37-rc4/block/blk-core.c
===================================================================
--- linux-2.6.37-rc4.orig/block/blk-core.c	2010-11-30 13:42:04.000000000 +0900
+++ linux-2.6.37-rc4/block/blk-core.c	2010-12-07 14:31:55.000000000 +0900
@@ -64,11 +64,13 @@ static void drive_stat_acct(struct reque
  		return;
  
  	cpu = part_stat_lock();
-	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
  
-	if (!new_io)
+	if (!new_io) {
+		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_init_pos(rq));
  		part_stat_inc(cpu, part, merges[rw]);
-	else {
+	} else {
+		rq->__initial_sector = rq->__sector;
+		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_init_pos(rq));
  		part_round_stats(cpu, part);
  		part_inc_in_flight(part, rw);
  	}
@@ -1776,7 +1778,7 @@ static void blk_account_io_completion(st
  		int cpu;
  
  		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = disk_map_sector_rcu(req->rq_disk, blk_rq_init_pos(req));
  		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
  		part_stat_unlock();
  	}
@@ -1796,7 +1798,7 @@ static void blk_account_io_done(struct r
  		int cpu;
  
  		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = disk_map_sector_rcu(req->rq_disk, blk_rq_init_pos(req));
  
  		part_stat_inc(cpu, part, ios[rw]);
  		part_stat_add(cpu, part, ticks[rw], duration);
Index: linux-2.6.37-rc4/block/blk-merge.c
===================================================================
--- linux-2.6.37-rc4.orig/block/blk-merge.c	2010-11-30 13:42:04.000000000 +0900
+++ linux-2.6.37-rc4/block/blk-merge.c	2010-12-07 14:14:55.000000000 +0900
@@ -351,7 +351,7 @@ static void blk_account_io_merge(struct
  		int cpu;
  
  		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = disk_map_sector_rcu(req->rq_disk, blk_rq_init_pos(req));
  
  		part_round_stats(cpu, part);
  		part_dec_in_flight(part, rq_data_dir(req));
Index: linux-2.6.37-rc4/include/linux/blkdev.h
===================================================================
--- linux-2.6.37-rc4.orig/include/linux/blkdev.h	2010-11-30 13:42:04.000000000 +0900
+++ linux-2.6.37-rc4/include/linux/blkdev.h	2010-12-07 14:13:11.000000000 +0900
@@ -91,6 +91,7 @@ struct request {
  	/* the following two fields are internal, NEVER access directly */
  	unsigned int __data_len;	/* total data len */
  	sector_t __sector;		/* sector cursor */
+	sector_t __initial_sector;
  
  	struct bio *bio;
  	struct bio *biotail;
@@ -730,6 +731,11 @@ static inline sector_t blk_rq_pos(const
  	return rq->__sector;
  }
  
+static inline sector_t blk_rq_init_pos(const struct request *rq)
+{
+	return rq->__initial_sector;
+}
+
  static inline unsigned int blk_rq_bytes(const struct request *rq)
  {
  	return rq->__data_len;


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-07  7:18   ` Satoru Takeuchi
@ 2010-12-07 18:39     ` Vivek Goyal
  2010-12-08  7:33     ` Jens Axboe
  1 sibling, 0 replies; 47+ messages in thread
From: Vivek Goyal @ 2010-12-07 18:39 UTC (permalink / raw)
  To: Satoru Takeuchi
  Cc: Linus Torvalds, Yasuaki Ishimatsu, jaxboe, jmarchan, linux-kernel

On Tue, Dec 07, 2010 at 04:18:27PM +0900, Satoru Takeuchi wrote:
> Hi Linus, Yasuaki,  and Jens
> 
> (2010/12/07 1:08), Linus Torvalds wrote:
> >2010/12/6 Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>:
> >>
> >>The problem is caused by merging different partition's I/Os. So the patch
> >>check whether a merging bio or request is a same partition as a request or not
> >>by using a partition's start sector and size.
> >
> >I really think this is wrong.
> >
> >We should just carry the partition information around in the req and
> >the bio, and just compare the pointers, rather than compare the range.
> >No need to even dereference the pointers, you should be able to just
> >do
> >
> >    /* don't merge if not on the same partition */
> >    if (bio->part != req->part)
> >       return 0;
> >
> >or something.
> >
> >This is doubly true since the accounting already does that horrible
> >partition lookup: rather than look it up, we should just _set_ it in
> >__generic_make_request(), where I think we already know it since we do
> >that whole blk_partition_remap().
> >
> >So just something like the appended (TOTALLY UNTESTED) perhaps?
> >
> >Note that this should get it right even for overlapping partitions etc.
> >
> >                      Linus
> 
> The problem can occur even if your patches are applied. Think about a case
> like the following.
> 
>  1) There are 2 partition, sda1 and sda2, on sda.
>  2) Open sda and issue an IO to sda2's first sector. Then sda2's in_flight
>     is incremented though you open not sda2 but sda. It is because of
>     partition lookup method. It is based on which partition rq->__sector
>     sector belongs to.
>  3) Issue an IO to sda1's last sector and it merged to the IO issued in
>     step (2) because their part are both sda. In addition, rq->__sector
>     is modified to the sda1's region.
>  4) After completing the IO, sda1's in_flight is decremented and diskstat
>     is corrupted here.
> 
> I think fixing this case is difficult and would cause more complexity.
> 
> I hit on another approach. Although it doesn'tprevent any merge as Linus
> preferred, it can fix the problem anyway. In this idea, in_flight is
> incremented and decremented for the partition which the request belonged
> to in its creation. It has the following merits.
> 
>  - It can fix the problem which Yasuaki reported, including the cases which
>    I mentioned above.
>  - It only append one extra field to request.
> 
> Although it would causes a bit gap, it doesn't have most influences because
> merging requests beyond partitions is the rare case.
> 
> I confirmed the attached patch can be applied to 2.6.37-rc4 and succeeded
> to compile. If you can accept this idea, I'll test it soon.
> 
> ---
>  block/blk-core.c       |   12 +++++++-----
>  block/blk-merge.c      |    2 +-
>  include/linux/blkdev.h |    6 ++++++
>  3 files changed, 14 insertions(+), 6 deletions(-)
> 
> Index: linux-2.6.37-rc4/block/blk-core.c
> ===================================================================
> --- linux-2.6.37-rc4.orig/block/blk-core.c	2010-11-30 13:42:04.000000000 +0900
> +++ linux-2.6.37-rc4/block/blk-core.c	2010-12-07 14:31:55.000000000 +0900
> @@ -64,11 +64,13 @@ static void drive_stat_acct(struct reque
>  		return;
>  	cpu = part_stat_lock();
> -	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> -	if (!new_io)
> +	if (!new_io) {
> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_init_pos(rq));
>  		part_stat_inc(cpu, part, merges[rw]);
> -	else {
> +	} else {
> +		rq->__initial_sector = rq->__sector;
> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_init_pos(rq));
>  		part_round_stats(cpu, part);
>  		part_inc_in_flight(part, rw);

Ok, so idea seems to be that lets keep track of the sector number against
which we do the accounting. Even if we are doing merging later, accounting
sector of the request will not change so that in_flight will not go out
of the sync.

The only thing is that by allowing merging across partitions, one request
can have IO from multiple partitions and all of it being accounted to
only one partition. So it is more of little accounting error. Though I am
not sure how big a issue that is.

This sounds reasonable to me.

>  	}
> @@ -1776,7 +1778,7 @@ static void blk_account_io_completion(st
>  		int cpu;
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = disk_map_sector_rcu(req->rq_disk, blk_rq_init_pos(req));
>  		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
>  		part_stat_unlock();
>  	}
> @@ -1796,7 +1798,7 @@ static void blk_account_io_done(struct r
>  		int cpu;
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = disk_map_sector_rcu(req->rq_disk, blk_rq_init_pos(req));
>  		part_stat_inc(cpu, part, ios[rw]);
>  		part_stat_add(cpu, part, ticks[rw], duration);
> Index: linux-2.6.37-rc4/block/blk-merge.c
> ===================================================================
> --- linux-2.6.37-rc4.orig/block/blk-merge.c	2010-11-30 13:42:04.000000000 +0900
> +++ linux-2.6.37-rc4/block/blk-merge.c	2010-12-07 14:14:55.000000000 +0900
> @@ -351,7 +351,7 @@ static void blk_account_io_merge(struct
>  		int cpu;
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = disk_map_sector_rcu(req->rq_disk, blk_rq_init_pos(req));
>  		part_round_stats(cpu, part);
>  		part_dec_in_flight(part, rq_data_dir(req));
> Index: linux-2.6.37-rc4/include/linux/blkdev.h
> ===================================================================
> --- linux-2.6.37-rc4.orig/include/linux/blkdev.h	2010-11-30 13:42:04.000000000 +0900
> +++ linux-2.6.37-rc4/include/linux/blkdev.h	2010-12-07 14:13:11.000000000 +0900
> @@ -91,6 +91,7 @@ struct request {
>  	/* the following two fields are internal, NEVER access directly */
>  	unsigned int __data_len;	/* total data len */
>  	sector_t __sector;		/* sector cursor */
> +	sector_t __initial_sector;

Would "acct_sector" be a better name. It just means this is the sector
which we will be using for accounting purposes of this rq.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-07  7:18   ` Satoru Takeuchi
  2010-12-07 18:39     ` Vivek Goyal
@ 2010-12-08  7:33     ` Jens Axboe
  2010-12-08  7:59       ` Satoru Takeuchi
  1 sibling, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2010-12-08  7:33 UTC (permalink / raw)
  To: Satoru Takeuchi
  Cc: Linus Torvalds, Yasuaki Ishimatsu, vgoyal, jmarchan, linux-kernel

On 2010-12-07 15:18, Satoru Takeuchi wrote:
> Hi Linus, Yasuaki,  and Jens
> 
> (2010/12/07 1:08), Linus Torvalds wrote:
>> 2010/12/6 Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>:
>>>
>>> The problem is caused by merging different partition's I/Os. So the patch
>>> check whether a merging bio or request is a same partition as a request or not
>>> by using a partition's start sector and size.
>>
>> I really think this is wrong.
>>
>> We should just carry the partition information around in the req and
>> the bio, and just compare the pointers, rather than compare the range.
>> No need to even dereference the pointers, you should be able to just
>> do
>>
>>     /* don't merge if not on the same partition */
>>     if (bio->part != req->part)
>>        return 0;
>>
>> or something.
>>
>> This is doubly true since the accounting already does that horrible
>> partition lookup: rather than look it up, we should just _set_ it in
>> __generic_make_request(), where I think we already know it since we do
>> that whole blk_partition_remap().
>>
>> So just something like the appended (TOTALLY UNTESTED) perhaps?
>>
>> Note that this should get it right even for overlapping partitions etc.
>>
>>                       Linus
> 
> The problem can occur even if your patches are applied. Think about a case
> like the following.
> 
>   1) There are 2 partition, sda1 and sda2, on sda.
>   2) Open sda and issue an IO to sda2's first sector. Then sda2's in_flight
>      is incremented though you open not sda2 but sda. It is because of
>      partition lookup method. It is based on which partition rq->__sector
>      sector belongs to.
>   3) Issue an IO to sda1's last sector and it merged to the IO issued in
>      step (2) because their part are both sda. In addition, rq->__sector
>      is modified to the sda1's region.
>   4) After completing the IO, sda1's in_flight is decremented and diskstat
>      is corrupted here.
> 
> I think fixing this case is difficult and would cause more complexity.
> 
> I hit on another approach. Although it doesn'tprevent any merge as Linus
> preferred, it can fix the problem anyway. In this idea, in_flight is
> incremented and decremented for the partition which the request belonged
> to in its creation. It has the following merits.

I really would prefer if we fixed up the patchset we ended up reverting.
At least that had a purpose with growing struct request, since we saved
on doing the partition lookups.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-08  7:33     ` Jens Axboe
@ 2010-12-08  7:59       ` Satoru Takeuchi
  2010-12-08  8:06         ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Satoru Takeuchi @ 2010-12-08  7:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Yasuaki Ishimatsu, vgoyal, jmarchan, linux-kernel

Hi Jens,

(2010/12/08 16:33), Jens Axboe wrote:
> On 2010-12-07 15:18, Satoru Takeuchi wrote:

>> I hit on another approach. Although it doesn'tprevent any merge as Linus
>> preferred, it can fix the problem anyway. In this idea, in_flight is
>> incremented and decremented for the partition which the request belonged
>> to in its creation. It has the following merits.

Revert is already finished. 2.6.37-rc-5 and latest stable kernel doesn't
contain Yasuaki's former logic.

https://lkml.org/lkml/2010/10/24/118

>
> I really would prefer if we fixed up the patchset we ended up reverting.
> At least that had a purpose with growing struct request, since we saved
> on doing the partition lookups.
>

Thanks,
Satoru


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-08  7:59       ` Satoru Takeuchi
@ 2010-12-08  8:06         ` Jens Axboe
  2010-12-08  8:11           ` Satoru Takeuchi
  0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2010-12-08  8:06 UTC (permalink / raw)
  To: Satoru Takeuchi
  Cc: Linus Torvalds, Yasuaki Ishimatsu, vgoyal, jmarchan, linux-kernel

On 2010-12-08 15:59, Satoru Takeuchi wrote:
> Hi Jens,
> 
> (2010/12/08 16:33), Jens Axboe wrote:
>> On 2010-12-07 15:18, Satoru Takeuchi wrote:
> 
>>> I hit on another approach. Although it doesn'tprevent any merge as Linus
>>> preferred, it can fix the problem anyway. In this idea, in_flight is
>>> incremented and decremented for the partition which the request belonged
>>> to in its creation. It has the following merits.
> 
> Revert is already finished. 2.6.37-rc-5 and latest stable kernel doesn't
> contain Yasuaki's former logic.
> 
> https://lkml.org/lkml/2010/10/24/118

Yes I know, that is why I said:

>> I really would prefer if we fixed up the patchset we ended up reverting.
>> At least that had a purpose with growing struct request, since we saved
>> on doing the partition lookups.

That I prefer we fix that code up, since I think it's the best solution
to the problem.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-08  8:06         ` Jens Axboe
@ 2010-12-08  8:11           ` Satoru Takeuchi
  2010-12-08 14:46             ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Satoru Takeuchi @ 2010-12-08  8:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Yasuaki Ishimatsu, vgoyal, jmarchan, linux-kernel

Hi Jens,

(2010/12/08 17:06), Jens Axboe wrote:
>>>> I hit on another approach. Although it doesn'tprevent any merge as Linus
>>>> preferred, it can fix the problem anyway. In this idea, in_flight is
>>>> incremented and decremented for the partition which the request belonged
>>>> to in its creation. It has the following merits.
>>
>> Revert is already finished. 2.6.37-rc-5 and latest stable kernel doesn't
>> contain Yasuaki's former logic.
>>
>> https://lkml.org/lkml/2010/10/24/118
>
> Yes I know, that is why I said:
>
>>> I really would prefer if we fixed up the patchset we ended up reverting.
>>> At least that had a purpose with growing struct request, since we saved
>>> on doing the partition lookups.
>
> That I prefer we fix that code up, since I think it's the best solution
> to the problem.
>

I already postedit.

https://lkml.org/lkml/2010/12/8/12

I think it is OK without mail subject :-)

Thanks,
Satoru


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-08  8:11           ` Satoru Takeuchi
@ 2010-12-08 14:46             ` Jens Axboe
  2010-12-08 15:51               ` Vivek Goyal
  2010-12-10 16:12               ` Jerome Marchand
  0 siblings, 2 replies; 47+ messages in thread
From: Jens Axboe @ 2010-12-08 14:46 UTC (permalink / raw)
  To: Satoru Takeuchi
  Cc: Linus Torvalds, Yasuaki Ishimatsu, vgoyal, jmarchan, linux-kernel

On 2010-12-08 16:11, Satoru Takeuchi wrote:
> Hi Jens,
> 
> (2010/12/08 17:06), Jens Axboe wrote:
>>>>> I hit on another approach. Although it doesn'tprevent any merge as Linus
>>>>> preferred, it can fix the problem anyway. In this idea, in_flight is
>>>>> incremented and decremented for the partition which the request belonged
>>>>> to in its creation. It has the following merits.
>>>
>>> Revert is already finished. 2.6.37-rc-5 and latest stable kernel doesn't
>>> contain Yasuaki's former logic.
>>>
>>> https://lkml.org/lkml/2010/10/24/118
>>
>> Yes I know, that is why I said:
>>
>>>> I really would prefer if we fixed up the patchset we ended up reverting..
>>>> At least that had a purpose with growing struct request, since we saved
>>>> on doing the partition lookups.
>>
>> That I prefer we fix that code up, since I think it's the best solution
>> to the problem.
>>
> 
> I already postedit.
> 
> https://lkml.org/lkml/2010/12/8/12
> 
> I think it is OK without mail subject :-)

No, that's not it at all. What I mean (and what was reverted) was
caching the partition lookup, and using that for the stats. The problem
with that approach turned out to be the elevator queiscing logic not
being fully correct. One easier way to fix that would be to reference
count the part stats, instead of having to drain the queue.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-08 14:46             ` Jens Axboe
@ 2010-12-08 15:51               ` Vivek Goyal
  2010-12-08 15:58                 ` Vivek Goyal
  2010-12-10 16:12               ` Jerome Marchand
  1 sibling, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2010-12-08 15:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu, jmarchan,
	linux-kernel

On Wed, Dec 08, 2010 at 10:46:04PM +0800, Jens Axboe wrote:
> On 2010-12-08 16:11, Satoru Takeuchi wrote:
> > Hi Jens,
> > 
> > (2010/12/08 17:06), Jens Axboe wrote:
> >>>>> I hit on another approach. Although it doesn'tprevent any merge as Linus
> >>>>> preferred, it can fix the problem anyway. In this idea, in_flight is
> >>>>> incremented and decremented for the partition which the request belonged
> >>>>> to in its creation. It has the following merits.
> >>>
> >>> Revert is already finished. 2.6.37-rc-5 and latest stable kernel doesn't
> >>> contain Yasuaki's former logic.
> >>>
> >>> https://lkml.org/lkml/2010/10/24/118
> >>
> >> Yes I know, that is why I said:
> >>
> >>>> I really would prefer if we fixed up the patchset we ended up reverting..
> >>>> At least that had a purpose with growing struct request, since we saved
> >>>> on doing the partition lookups.
> >>
> >> That I prefer we fix that code up, since I think it's the best solution
> >> to the problem.
> >>
> > 
> > I already postedit.
> > 
> > https://lkml.org/lkml/2010/12/8/12
> > 
> > I think it is OK without mail subject :-)
> 
> No, that's not it at all. What I mean (and what was reverted) was
> caching the partition lookup, and using that for the stats. The problem
> with that approach turned out to be the elevator queiscing logic not
> being fully correct. One easier way to fix that would be to reference
> count the part stats, instead of having to drain the queue.

Taking reference to hd_struct and storing it in rq, will definitely save
us 1 lookup while doing accounting on completion path. It does not save
on rq size though.

IIUC, current patch does not increase the number of existing lookups. So 
current situation does not deteriorate with the patch.

But storing a reference in rq and avoiding 1 lookup in completion path
definitely sounds better.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-08 15:51               ` Vivek Goyal
@ 2010-12-08 15:58                 ` Vivek Goyal
  2010-12-10 11:22                   ` Jerome Marchand
  0 siblings, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2010-12-08 15:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu, jmarchan,
	linux-kernel

On Wed, Dec 08, 2010 at 10:51:37AM -0500, Vivek Goyal wrote:
> On Wed, Dec 08, 2010 at 10:46:04PM +0800, Jens Axboe wrote:
> > On 2010-12-08 16:11, Satoru Takeuchi wrote:
> > > Hi Jens,
> > > 
> > > (2010/12/08 17:06), Jens Axboe wrote:
> > >>>>> I hit on another approach. Although it doesn'tprevent any merge as Linus
> > >>>>> preferred, it can fix the problem anyway. In this idea, in_flight is
> > >>>>> incremented and decremented for the partition which the request belonged
> > >>>>> to in its creation. It has the following merits.
> > >>>
> > >>> Revert is already finished. 2.6.37-rc-5 and latest stable kernel doesn't
> > >>> contain Yasuaki's former logic.
> > >>>
> > >>> https://lkml.org/lkml/2010/10/24/118
> > >>
> > >> Yes I know, that is why I said:
> > >>
> > >>>> I really would prefer if we fixed up the patchset we ended up reverting..
> > >>>> At least that had a purpose with growing struct request, since we saved
> > >>>> on doing the partition lookups.
> > >>
> > >> That I prefer we fix that code up, since I think it's the best solution
> > >> to the problem.
> > >>
> > > 
> > > I already postedit.
> > > 
> > > https://lkml.org/lkml/2010/12/8/12
> > > 
> > > I think it is OK without mail subject :-)
> > 
> > No, that's not it at all. What I mean (and what was reverted) was
> > caching the partition lookup, and using that for the stats. The problem
> > with that approach turned out to be the elevator queiscing logic not
> > being fully correct. One easier way to fix that would be to reference
> > count the part stats, instead of having to drain the queue.
> 
> Taking reference to hd_struct and storing it in rq, will definitely save
> us 1 lookup while doing accounting on completion path. It does not save
> on rq size though.
> 
> IIUC, current patch does not increase the number of existing lookups. So 
> current situation does not deteriorate with the patch.
> 
> But storing a reference in rq and avoiding 1 lookup in completion path
> definitely sounds better.
> 

Storing a pointer to partition in rq also got the advantage that we can
easily not allow merging of requests across partitions for better
accounting.

Satoru, so yes, if you can implement what jens is suggesting, would be
good.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-08 15:58                 ` Vivek Goyal
@ 2010-12-10 11:22                   ` Jerome Marchand
  0 siblings, 0 replies; 47+ messages in thread
From: Jerome Marchand @ 2010-12-10 11:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel

On 12/08/2010 04:58 PM, Vivek Goyal wrote:
> 
> Storing a pointer to partition in rq also got the advantage that we can
> easily not allow merging of requests across partitions for better
> accounting.

I don't think it really matters (as long as we keep in_flight consistent
of course). The requests that got merged across partition are very likely 
rare enough to be neglectable.

Jerome

> 
> Satoru, so yes, if you can implement what jens is suggesting, would be
> good.
> 
> Thanks
> Vivek


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-08 14:46             ` Jens Axboe
  2010-12-08 15:51               ` Vivek Goyal
@ 2010-12-10 16:12               ` Jerome Marchand
  2010-12-10 16:55                 ` Vivek Goyal
  1 sibling, 1 reply; 47+ messages in thread
From: Jerome Marchand @ 2010-12-10 16:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu, vgoyal, linux-kernel

On 12/08/2010 03:46 PM, Jens Axboe wrote:
> 
> No, that's not it at all. What I mean (and what was reverted) was
> caching the partition lookup, and using that for the stats. The problem
> with that approach turned out to be the elevator queiscing logic not
> being fully correct. One easier way to fix that would be to reference
> count the part stats, instead of having to drain the queue.
> 

The partition is already deleted in a rcu callback function. If I
understand RCU correctly, we just need to use rcu_dereference() each time
we use rq->part. Something like the following *untested* patch.

Jerome

---

diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..70d23f9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -64,13 +64,15 @@ static void drive_stat_acct(struct request *rq, int new_io)
 		return;
 
 	cpu = part_stat_lock();
-	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 
-	if (!new_io)
+	if (!new_io) {
+		part = blk_rq_get_part(rq);
 		part_stat_inc(cpu, part, merges[rw]);
-	else {
+	} else {
+		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
+		rq->part = part;
 	}
 
 	part_stat_unlock();
@@ -128,6 +130,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->ref_count = 1;
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
+	rq->part = NULL;
 }
 EXPORT_SYMBOL(blk_rq_init);
 
@@ -1776,7 +1784,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = blk_rq_get_part(req);
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
 		part_stat_unlock();
 	}
@@ -1796,7 +1804,7 @@ static void blk_account_io_done(struct request *req)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = blk_rq_get_part(req);
 
 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 77b7c26..0ea5416 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -351,7 +351,7 @@ static void blk_account_io_merge(struct request *req)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = blk_rq_get_part(req);
 
 		part_round_stats(cpu, part);
 		part_dec_in_flight(part, rq_data_dir(req));
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aae86fd..00a3b93 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -115,6 +115,7 @@ struct request {
 	void *elevator_private3;
 
 	struct gendisk *rq_disk;
+	struct hd_struct __rcu *part;
 	unsigned long start_time;
 #ifdef CONFIG_BLK_CGROUP
 	unsigned long long start_time_ns;
@@ -752,6 +753,12 @@ static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
 	return blk_rq_cur_bytes(rq) >> 9;
 }
 
+static inline struct hd_struct *blk_rq_get_part(const struct request *rq)
+{
+	struct hd_struct *part = rcu_dereference(rq->part);
+	return part;
+}
+
 /*
  * Request issue related functions.
  */

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-10 16:12               ` Jerome Marchand
@ 2010-12-10 16:55                 ` Vivek Goyal
  2010-12-14 20:25                   ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2010-12-10 16:55 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Jens Axboe, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel

On Fri, Dec 10, 2010 at 05:12:04PM +0100, Jerome Marchand wrote:
> On 12/08/2010 03:46 PM, Jens Axboe wrote:
> > 
> > No, that's not it at all. What I mean (and what was reverted) was
> > caching the partition lookup, and using that for the stats. The problem
> > with that approach turned out to be the elevator queiscing logic not
> > being fully correct. One easier way to fix that would be to reference
> > count the part stats, instead of having to drain the queue.
> > 
> 
> The partition is already deleted in a rcu callback function. If I
> understand RCU correctly, we just need to use rcu_dereference() each time
> we use rq->part. Something like the following *untested* patch.

Jerome,

I don't think that rcu_dereference() is going to solve the problem. The
partition table will be freed as soon as one rcu period is over. So to
make sure partition table is not freed one has to be holding
rcu_read_lock(). It is not a good idea to keep rcu period going till
request finishes so a better idea will to to reference count it.

Thanks
Vivek

> 
> Jerome
> 
> ---
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4ce953f..70d23f9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -64,13 +64,15 @@ static void drive_stat_acct(struct request *rq, int new_io)
>  		return;
>  
>  	cpu = part_stat_lock();
> -	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>  
> -	if (!new_io)
> +	if (!new_io) {
> +		part = blk_rq_get_part(rq);
>  		part_stat_inc(cpu, part, merges[rw]);
> -	else {
> +	} else {
> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>  		part_round_stats(cpu, part);
>  		part_inc_in_flight(part, rw);
> +		rq->part = part;
>  	}
>  
>  	part_stat_unlock();
> @@ -128,6 +130,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->ref_count = 1;
>  	rq->start_time = jiffies;
>  	set_start_time_ns(rq);
> +	rq->part = NULL;
>  }
>  EXPORT_SYMBOL(blk_rq_init);
>  
> @@ -1776,7 +1784,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
>  		int cpu;
>  
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = blk_rq_get_part(req);
>  		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
>  		part_stat_unlock();
>  	}
> @@ -1796,7 +1804,7 @@ static void blk_account_io_done(struct request *req)
>  		int cpu;
>  
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = blk_rq_get_part(req);
>  
>  		part_stat_inc(cpu, part, ios[rw]);
>  		part_stat_add(cpu, part, ticks[rw], duration);
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 77b7c26..0ea5416 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -351,7 +351,7 @@ static void blk_account_io_merge(struct request *req)
>  		int cpu;
>  
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = blk_rq_get_part(req);
>  
>  		part_round_stats(cpu, part);
>  		part_dec_in_flight(part, rq_data_dir(req));
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index aae86fd..00a3b93 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -115,6 +115,7 @@ struct request {
>  	void *elevator_private3;
>  
>  	struct gendisk *rq_disk;
> +	struct hd_struct __rcu *part;
>  	unsigned long start_time;
>  #ifdef CONFIG_BLK_CGROUP
>  	unsigned long long start_time_ns;
> @@ -752,6 +753,12 @@ static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
>  	return blk_rq_cur_bytes(rq) >> 9;
>  }
>  
> +static inline struct hd_struct *blk_rq_get_part(const struct request *rq)
> +{
> +	struct hd_struct *part = rcu_dereference(rq->part);
> +	return part;
> +}
> +
>  /*
>   * Request issue related functions.
>   */

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] Don't merge different partition's IOs
  2010-12-10 16:55                 ` Vivek Goyal
@ 2010-12-14 20:25                   ` Jens Axboe
  2010-12-17 13:42                     ` [PATCH] block: fix accounting bug on cross partition merges Jerome Marchand
  0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2010-12-14 20:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jerome Marchand, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On 2010-12-10 17:55, Vivek Goyal wrote:
> On Fri, Dec 10, 2010 at 05:12:04PM +0100, Jerome Marchand wrote:
>> On 12/08/2010 03:46 PM, Jens Axboe wrote:
>>>
>>> No, that's not it at all. What I mean (and what was reverted) was
>>> caching the partition lookup, and using that for the stats. The problem
>>> with that approach turned out to be the elevator queiscing logic not
>>> being fully correct. One easier way to fix that would be to reference
>>> count the part stats, instead of having to drain the queue.
>>>
>>
>> The partition is already deleted in a rcu callback function. If I
>> understand RCU correctly, we just need to use rcu_dereference() each time
>> we use rq->part. Something like the following *untested* patch.
> 
> Jerome,
> 
> I don't think that rcu_dereference() is going to solve the problem. The
> partition table will be freed as soon as one rcu period is over. So to
> make sure partition table is not freed one has to be holding
> rcu_read_lock(). It is not a good idea to keep rcu period going till
> request finishes so a better idea will to to reference count it.

Exactly. The only change you would do to partition handling is ensure
that each io grabs a reference to it and drops it at the end. You need
not even do that in the core bits outside each IO, we just need to
ensure that the partition struct persists in memory even if it is no
longer the valid partition table. The rcu call to free the memory would
happen when the ref drops to zero.

What Vivek says it completely correct, and rcu_dereference() would only
help if you also had rcu_read_lock() over each IO. That is not feasible
at all.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH] block: fix accounting bug on cross partition merges
  2010-12-14 20:25                   ` Jens Axboe
@ 2010-12-17 13:42                     ` Jerome Marchand
  2010-12-17 19:06                       ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Jerome Marchand @ 2010-12-17 13:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vivek Goyal, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel


/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 block/blk-core.c       |   15 ++++++++++-----
 block/blk-merge.c      |    3 ++-
 block/genhd.c          |    1 +
 fs/partitions/check.c  |   11 ++++++++++-
 include/linux/blkdev.h |    1 +
 include/linux/genhd.h  |    2 ++
 6 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..064921d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -64,13 +64,16 @@ static void drive_stat_acct(struct request *rq, int new_io)
 		return;
 
 	cpu = part_stat_lock();
-	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 
-	if (!new_io)
+	if (!new_io) {
+		part = rq->part;
 		part_stat_inc(cpu, part, merges[rw]);
-	else {
+	} else {
+		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
+		kref_get(&part->ref);
+		rq->part = part;
 	}
 
 	part_stat_unlock();
@@ -128,6 +131,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->ref_count = 1;
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
+	rq->part = NULL;
 }
 EXPORT_SYMBOL(blk_rq_init);
 
@@ -1776,7 +1780,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
 		part_stat_unlock();
 	}
@@ -1796,13 +1800,14 @@ static void blk_account_io_done(struct request *req)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 
 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
 		part_round_stats(cpu, part);
 		part_dec_in_flight(part, rw);
 
+		kref_put(&part->ref, __delete_partition);
 		part_stat_unlock();
 	}
 }
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 77b7c26..b06b83b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -351,11 +351,12 @@ static void blk_account_io_merge(struct request *req)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 
 		part_round_stats(cpu, part);
 		part_dec_in_flight(part, rq_data_dir(req));
 
+		kref_put(&part->ref, __delete_partition);
 		part_stat_unlock();
 	}
 }
diff --git a/block/genhd.c b/block/genhd.c
index 5fa2b44..0c55eae 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1192,6 +1192,7 @@ struct gendisk *alloc_disk_node(int minors, int node_id)
 			return NULL;
 		}
 		disk->part_tbl->part[0] = &disk->part0;
+		kref_init(&disk->part0.ref);
 
 		disk->minors = minors;
 		rand_initialize_disk(disk);
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index 0a8b0ad..4b5aa9d 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -372,6 +372,14 @@ static void delete_partition_rcu_cb(struct rcu_head *head)
 	put_device(part_to_dev(part));
 }
 
+void __delete_partition(struct kref *ref)
+{
+	struct hd_struct *part =
+		container_of(ref, struct hd_struct, ref);
+
+	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
+}
+
 void delete_partition(struct gendisk *disk, int partno)
 {
 	struct disk_part_tbl *ptbl = disk->part_tbl;
@@ -390,7 +398,7 @@ void delete_partition(struct gendisk *disk, int partno)
 	kobject_put(part->holder_dir);
 	device_del(part_to_dev(part));
 
-	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
+	kref_put(&part->ref, __delete_partition);
 }
 
 static ssize_t whole_disk_show(struct device *dev,
@@ -489,6 +497,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	if (!dev_get_uevent_suppress(ddev))
 		kobject_uevent(&pdev->kobj, KOBJ_ADD);
 
+	kref_init(&p->ref);
 	return p;
 
 out_free_info:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aae86fd..482a7fd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -115,6 +115,7 @@ struct request {
 	void *elevator_private3;
 
 	struct gendisk *rq_disk;
+	struct hd_struct *part;
 	unsigned long start_time;
 #ifdef CONFIG_BLK_CGROUP
 	unsigned long long start_time_ns;
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 7a7b9c1..2ba2792 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -116,6 +116,7 @@ struct hd_struct {
 	struct disk_stats dkstats;
 #endif
 	struct rcu_head rcu_head;
+	struct kref ref;
 };
 
 #define GENHD_FL_REMOVABLE			1
@@ -583,6 +584,7 @@ extern struct hd_struct * __must_check add_partition(struct gendisk *disk,
 						     sector_t len, int flags,
 						     struct partition_meta_info
 						       *info);
+extern void __delete_partition(struct kref *ref);
 extern void delete_partition(struct gendisk *, int);
 extern void printk_all_partitions(void);
 

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] block: fix accounting bug on cross partition merges
  2010-12-17 13:42                     ` [PATCH] block: fix accounting bug on cross partition merges Jerome Marchand
@ 2010-12-17 19:06                       ` Jens Axboe
  2010-12-17 22:32                         ` Vivek Goyal
  2010-12-23 15:10                         ` Jerome Marchand
  0 siblings, 2 replies; 47+ messages in thread
From: Jens Axboe @ 2010-12-17 19:06 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Vivek Goyal, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel

On 2010-12-17 14:42, Jerome Marchand wrote:
> 
> /proc/diskstats would display a strange output as follows.

[snip]

This looks a lot better! One comment:

> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4ce953f..064921d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -64,13 +64,16 @@ static void drive_stat_acct(struct request *rq, int new_io)
>  		return;
>  
>  	cpu = part_stat_lock();
> -	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>  
> -	if (!new_io)
> +	if (!new_io) {
> +		part = rq->part;
>  		part_stat_inc(cpu, part, merges[rw]);
> -	else {
> +	} else {
> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>  		part_round_stats(cpu, part);
>  		part_inc_in_flight(part, rw);
> +		kref_get(&part->ref);
> +		rq->part = part;
>  	}
>  
>  	part_stat_unlock();

I don't think this is completely safe. The rcu lock is held due to the
part_stat_lock(), but that only prevents the __delete_partition()
callback from happening. Lets say you have this:

CPU0                                         CPU1
part = disk_map_sector_rcu()
                                             kref_put(part); <- now 0
part_stat_unlock()
                                             __delete_partition();
                                             ...
                                             delete_partition_rcu_cb();
merge, or endio, boom

Now rq has ->part pointing to freed memory, later merges or end
accounting will touch freed memory.

I think we can fix this by just having delete_partition_rcu_rb() check
the reference count and return if non-zero. Since someone holds a
reference to the table, they will drop it and we'll re-schedule the rcu
callback.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] block: fix accounting bug on cross partition merges
  2010-12-17 19:06                       ` Jens Axboe
@ 2010-12-17 22:32                         ` Vivek Goyal
  2010-12-23 15:10                         ` Jerome Marchand
  1 sibling, 0 replies; 47+ messages in thread
From: Vivek Goyal @ 2010-12-17 22:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jerome Marchand, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On Fri, Dec 17, 2010 at 08:06:09PM +0100, Jens Axboe wrote:
> On 2010-12-17 14:42, Jerome Marchand wrote:
> > 
> > /proc/diskstats would display a strange output as follows.
> 
> [snip]
> 
> This looks a lot better! One comment:
> 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 4ce953f..064921d 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -64,13 +64,16 @@ static void drive_stat_acct(struct request *rq, int new_io)
> >  		return;
> >  
> >  	cpu = part_stat_lock();
> > -	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >  
> > -	if (!new_io)
> > +	if (!new_io) {
> > +		part = rq->part;
> >  		part_stat_inc(cpu, part, merges[rw]);
> > -	else {
> > +	} else {
> > +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >  		part_round_stats(cpu, part);
> >  		part_inc_in_flight(part, rw);
> > +		kref_get(&part->ref);
> > +		rq->part = part;
> >  	}
> >  
> >  	part_stat_unlock();
> 
> I don't think this is completely safe. The rcu lock is held due to the
> part_stat_lock(), but that only prevents the __delete_partition()
> callback from happening. Lets say you have this:
> 
> CPU0                                         CPU1
> part = disk_map_sector_rcu()
>                                              kref_put(part); <- now 0
> part_stat_unlock()
>                                              __delete_partition();
>                                              ...
>                                              delete_partition_rcu_cb();
> merge, or endio, boom
> 
> Now rq has ->part pointing to freed memory, later merges or end
> accounting will touch freed memory.
> 
> I think we can fix this by just having delete_partition_rcu_rb() check
> the reference count and return if non-zero. Since someone holds a
> reference to the table, they will drop it and we'll re-schedule the rcu
> callback.

This is interesting. Using RCU with kref(). So even if somebody has done
a kref_put() and this is last reference, but rcu period is not over, somebody
can still go and take reference again and set it to 1 again and then
partition will not be freed as delete_partition_rcu_cb() will find it set.

I guess read shall have to be atomic_read() and struct kref is opaque so
one might have to introduce kref_read() or something like that and
possibly update Documentation/kref.txt for this usage of with RCU. I would
also recommend it to get it reviewed from  Paul McKenney to make sure this
usage of RCU is fine.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] block: fix accounting bug on cross partition merges
  2010-12-17 19:06                       ` Jens Axboe
  2010-12-17 22:32                         ` Vivek Goyal
@ 2010-12-23 15:10                         ` Jerome Marchand
  2010-12-23 15:39                           ` Vivek Goyal
  1 sibling, 1 reply; 47+ messages in thread
From: Jerome Marchand @ 2010-12-23 15:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vivek Goyal, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel

On 12/17/2010 08:06 PM, Jens Axboe wrote:
> On 2010-12-17 14:42, Jerome Marchand wrote:
>>
>> /proc/diskstats would display a strange output as follows.
> 
> [snip]
> 
> This looks a lot better! One comment:
> 
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 4ce953f..064921d 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -64,13 +64,16 @@ static void drive_stat_acct(struct request *rq, int new_io)
>>  		return;
>>  
>>  	cpu = part_stat_lock();
>> -	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>>  
>> -	if (!new_io)
>> +	if (!new_io) {
>> +		part = rq->part;
>>  		part_stat_inc(cpu, part, merges[rw]);
>> -	else {
>> +	} else {
>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>>  		part_round_stats(cpu, part);
>>  		part_inc_in_flight(part, rw);
>> +		kref_get(&part->ref);
>> +		rq->part = part;
>>  	}
>>  
>>  	part_stat_unlock();
> 
> I don't think this is completely safe. The rcu lock is held due to the
> part_stat_lock(), but that only prevents the __delete_partition()
> callback from happening. Lets say you have this:
> 
> CPU0                                         CPU1
> part = disk_map_sector_rcu()
>                                              kref_put(part); <- now 0

A kref_get(part) happens here, or am I missing something?
If we use an atomic kref_test_and_get() function, which only increment the
refcount if it's not zero, we would be safe. If kref_test_and_get() fails,
we can cache rq->rq_disk->part0 instead. See my drafty patch below.

> part_stat_unlock()
>                                              __delete_partition();
>                                              ...
>                                              delete_partition_rcu_cb();
> merge, or endio, boom
> 
> Now rq has ->part pointing to freed memory, later merges or end
> accounting will touch freed memory.
> 
> I think we can fix this by just having delete_partition_rcu_rb() check
> the reference count and return if non-zero. Since someone holds a
> reference to the table, they will drop it and we'll re-schedule the rcu
> callback.
> 
> 


---

diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..72d12d2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -64,13 +64,21 @@ static void drive_stat_acct(struct request *rq, int new_io)
 		return;
 
 	cpu = part_stat_lock();
-	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 
-	if (!new_io)
+	if (!new_io) {
+		part = rq->part;
 		part_stat_inc(cpu, part, merges[rw]);
-	else {
+	} else {
+		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
+		if(part->partno && !kref_test_and_get(&part->ref))
+			/*
+			 * The partition is already being removed,
+			 * the request will be accounted on the disk only
+			 */
+			part = &rq->rq_disk->part0;
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
+		rq->part = part;
 	}
 
 	part_stat_unlock();
@@ -128,6 +136,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->ref_count = 1;
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
+	rq->part = NULL;
 }
 EXPORT_SYMBOL(blk_rq_init);
 
@@ -1776,7 +1785,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
 		part_stat_unlock();
 	}
@@ -1796,13 +1805,15 @@ static void blk_account_io_done(struct request *req)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 
 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
 		part_round_stats(cpu, part);
 		part_dec_in_flight(part, rw);
 
+		if (part->partno)
+			kref_put(&part->ref, __delete_partition);
 		part_stat_unlock();
 	}
 }
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 77b7c26..3334578 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -351,11 +351,13 @@ static void blk_account_io_merge(struct request *req)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 
 		part_round_stats(cpu, part);
 		part_dec_in_flight(part, rq_data_dir(req));
 
+		if (part->partno)
+			kref_put(&part->ref, __delete_partition);
 		part_stat_unlock();
 	}
 }
diff --git a/block/genhd.c b/block/genhd.c
index 5fa2b44..0c55eae 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1192,6 +1192,7 @@ struct gendisk *alloc_disk_node(int minors, int node_id)
 			return NULL;
 		}
 		disk->part_tbl->part[0] = &disk->part0;
+		kref_init(&disk->part0.ref);
 
 		disk->minors = minors;
 		rand_initialize_disk(disk);
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index 0a8b0ad..0123717 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -372,6 +372,13 @@ static void delete_partition_rcu_cb(struct rcu_head *head)
 	put_device(part_to_dev(part));
 }
 
+void __delete_partition(struct kref *ref)
+{
+	struct hd_struct *part = container_of(ref, struct hd_struct, ref);
+
+	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
+}
+
 void delete_partition(struct gendisk *disk, int partno)
 {
 	struct disk_part_tbl *ptbl = disk->part_tbl;
@@ -390,7 +397,7 @@ void delete_partition(struct gendisk *disk, int partno)
 	kobject_put(part->holder_dir);
 	device_del(part_to_dev(part));
 
-	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
+	kref_put(&part->ref, __delete_partition);
 }
 
 static ssize_t whole_disk_show(struct device *dev,
@@ -489,6 +496,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	if (!dev_get_uevent_suppress(ddev))
 		kobject_uevent(&pdev->kobj, KOBJ_ADD);
 
+	kref_init(&p->ref);
 	return p;
 
 out_free_info:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aae86fd..482a7fd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -115,6 +115,7 @@ struct request {
 	void *elevator_private3;
 
 	struct gendisk *rq_disk;
+	struct hd_struct *part;
 	unsigned long start_time;
 #ifdef CONFIG_BLK_CGROUP
 	unsigned long long start_time_ns;
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 7a7b9c1..2ba2792 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -116,6 +116,7 @@ struct hd_struct {
 	struct disk_stats dkstats;
 #endif
 	struct rcu_head rcu_head;
+	struct kref ref;
 };
 
 #define GENHD_FL_REMOVABLE			1
@@ -583,6 +584,7 @@ extern struct hd_struct * __must_check add_partition(struct gendisk *disk,
 						     sector_t len, int flags,
 						     struct partition_meta_info
 						       *info);
+extern void __delete_partition(struct kref *ref);
 extern void delete_partition(struct gendisk *, int);
 extern void printk_all_partitions(void);
 
diff --git a/include/linux/kref.h b/include/linux/kref.h
index 6cc38fc..90b9e44 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -23,6 +23,7 @@ struct kref {
 
 void kref_init(struct kref *kref);
 void kref_get(struct kref *kref);
+int kref_test_and_get(struct kref *kref);
 int kref_put(struct kref *kref, void (*release) (struct kref *kref));
 
 #endif /* _KREF_H_ */
diff --git a/lib/kref.c b/lib/kref.c
index d3d227a..5f663b9 100644
--- a/lib/kref.c
+++ b/lib/kref.c
@@ -37,6 +37,22 @@ void kref_get(struct kref *kref)
 }
 
 /**
+ * kref_test_and_get - increment refcount for object only if refcount is not
+ * zero.
+ * @kref: object.
+ *
+ * Return non-zero if the refcount was incremented, 0 otherwise
+ */
+int kref_test_and_get(struct kref *kref)
+{
+	int ret;
+	smp_mb__before_atomic_inc();
+	ret = atomic_inc_not_zero(&kref->refcount);
+	smp_mb__after_atomic_inc();
+	return ret;
+}
+
+/**
  * kref_put - decrement refcount for object.
  * @kref: object.
  * @release: pointer to the function that will clean up the object when the

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] block: fix accounting bug on cross partition merges
  2010-12-23 15:10                         ` Jerome Marchand
@ 2010-12-23 15:39                           ` Vivek Goyal
  2010-12-23 17:04                             ` Jerome Marchand
  0 siblings, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2010-12-23 15:39 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Jens Axboe, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel

On Thu, Dec 23, 2010 at 04:10:04PM +0100, Jerome Marchand wrote:
> On 12/17/2010 08:06 PM, Jens Axboe wrote:
> > On 2010-12-17 14:42, Jerome Marchand wrote:
> >>
> >> /proc/diskstats would display a strange output as follows.
> > 
> > [snip]
> > 
> > This looks a lot better! One comment:
> > 
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 4ce953f..064921d 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -64,13 +64,16 @@ static void drive_stat_acct(struct request *rq, int new_io)
> >>  		return;
> >>  
> >>  	cpu = part_stat_lock();
> >> -	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >>  
> >> -	if (!new_io)
> >> +	if (!new_io) {
> >> +		part = rq->part;
> >>  		part_stat_inc(cpu, part, merges[rw]);
> >> -	else {
> >> +	} else {
> >> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >>  		part_round_stats(cpu, part);
> >>  		part_inc_in_flight(part, rw);
> >> +		kref_get(&part->ref);
> >> +		rq->part = part;
> >>  	}
> >>  
> >>  	part_stat_unlock();
> > 
> > I don't think this is completely safe. The rcu lock is held due to the
> > part_stat_lock(), but that only prevents the __delete_partition()
> > callback from happening. Lets say you have this:
> > 
> > CPU0                                         CPU1
> > part = disk_map_sector_rcu()
> >                                              kref_put(part); <- now 0
> 
> A kref_get(part) happens here, or am I missing something?

It might happen that kref_get(part) is called after kref_put() has been
called and reference has reached 0 and and delete_parition() call has
been scheduled as soon as rcu grace period is over. So if you do
kref_get() after that, it is not going to help.

> If we use an atomic kref_test_and_get() function, which only increment the
> refcount if it's not zero, we would be safe. If kref_test_and_get() fails,
> we can cache rq->rq_disk->part0 instead. See my drafty patch below.

Conceptually it kind of makes sense to me. So if even if we get the
pointer to partition under rcu_read_lock(), we will not account the IO
to partition if it is going away.

> 
> > part_stat_unlock()
> >                                              __delete_partition();
> >                                              ...
> >                                              delete_partition_rcu_cb();
> > merge, or endio, boom
> > 
> > Now rq has ->part pointing to freed memory, later merges or end
> > accounting will touch freed memory.
> > 
> > I think we can fix this by just having delete_partition_rcu_rb() check
> > the reference count and return if non-zero. Since someone holds a
> > reference to the table, they will drop it and we'll re-schedule the rcu
> > callback.
> > 
> > 
> 
> 
> ---
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4ce953f..72d12d2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -64,13 +64,21 @@ static void drive_stat_acct(struct request *rq, int new_io)
>  		return;
>  
>  	cpu = part_stat_lock();
> -	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>  
> -	if (!new_io)
> +	if (!new_io) {
> +		part = rq->part;
>  		part_stat_inc(cpu, part, merges[rw]);
> -	else {
> +	} else {
> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> +		if(part->partno && !kref_test_and_get(&part->ref))
> +			/*

Do we have to check this part->partno both while taking and releasing
reference. Can't we take one extra reference for disk->part0, at
alloc_disk_node() time so that it is never freed and only freed when
disk is going away and gendisk is being freed.

That way, you don't have to differentiate between disk->part0 and rest
of the partitions while taking or dropping references.

Thanks
Vivek

> +			 * The partition is already being removed,
> +			 * the request will be accounted on the disk only
> +			 */
> +			part = &rq->rq_disk->part0;
>  		part_round_stats(cpu, part);
>  		part_inc_in_flight(part, rw);
> +		rq->part = part;
>  	}
>  
>  	part_stat_unlock();
> @@ -128,6 +136,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->ref_count = 1;
>  	rq->start_time = jiffies;
>  	set_start_time_ns(rq);
> +	rq->part = NULL;
>  }
>  EXPORT_SYMBOL(blk_rq_init);
>  
> @@ -1776,7 +1785,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
>  		int cpu;
>  
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = req->part;
>  		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
>  		part_stat_unlock();
>  	}
> @@ -1796,13 +1805,15 @@ static void blk_account_io_done(struct request *req)
>  		int cpu;
>  
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = req->part;
>  
>  		part_stat_inc(cpu, part, ios[rw]);
>  		part_stat_add(cpu, part, ticks[rw], duration);
>  		part_round_stats(cpu, part);
>  		part_dec_in_flight(part, rw);
>  
> +		if (part->partno)
> +			kref_put(&part->ref, __delete_partition);
>  		part_stat_unlock();
>  	}
>  }
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 77b7c26..3334578 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -351,11 +351,13 @@ static void blk_account_io_merge(struct request *req)
>  		int cpu;
>  
>  		cpu = part_stat_lock();
> -		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
> +		part = req->part;
>  
>  		part_round_stats(cpu, part);
>  		part_dec_in_flight(part, rq_data_dir(req));
>  
> +		if (part->partno)
> +			kref_put(&part->ref, __delete_partition);
>  		part_stat_unlock();
>  	}
>  }
> diff --git a/block/genhd.c b/block/genhd.c
> index 5fa2b44..0c55eae 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1192,6 +1192,7 @@ struct gendisk *alloc_disk_node(int minors, int node_id)
>  			return NULL;
>  		}
>  		disk->part_tbl->part[0] = &disk->part0;
> +		kref_init(&disk->part0.ref);
>  
>  		disk->minors = minors;
>  		rand_initialize_disk(disk);
> diff --git a/fs/partitions/check.c b/fs/partitions/check.c
> index 0a8b0ad..0123717 100644
> --- a/fs/partitions/check.c
> +++ b/fs/partitions/check.c
> @@ -372,6 +372,13 @@ static void delete_partition_rcu_cb(struct rcu_head *head)
>  	put_device(part_to_dev(part));
>  }
>  
> +void __delete_partition(struct kref *ref)
> +{
> +	struct hd_struct *part = container_of(ref, struct hd_struct, ref);
> +
> +	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
> +}
> +
>  void delete_partition(struct gendisk *disk, int partno)
>  {
>  	struct disk_part_tbl *ptbl = disk->part_tbl;
> @@ -390,7 +397,7 @@ void delete_partition(struct gendisk *disk, int partno)
>  	kobject_put(part->holder_dir);
>  	device_del(part_to_dev(part));
>  
> -	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
> +	kref_put(&part->ref, __delete_partition);
>  }
>  
>  static ssize_t whole_disk_show(struct device *dev,
> @@ -489,6 +496,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
>  	if (!dev_get_uevent_suppress(ddev))
>  		kobject_uevent(&pdev->kobj, KOBJ_ADD);
>  
> +	kref_init(&p->ref);
>  	return p;
>  
>  out_free_info:
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index aae86fd..482a7fd 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -115,6 +115,7 @@ struct request {
>  	void *elevator_private3;
>  
>  	struct gendisk *rq_disk;
> +	struct hd_struct *part;
>  	unsigned long start_time;
>  #ifdef CONFIG_BLK_CGROUP
>  	unsigned long long start_time_ns;
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 7a7b9c1..2ba2792 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -116,6 +116,7 @@ struct hd_struct {
>  	struct disk_stats dkstats;
>  #endif
>  	struct rcu_head rcu_head;
> +	struct kref ref;
>  };
>  
>  #define GENHD_FL_REMOVABLE			1
> @@ -583,6 +584,7 @@ extern struct hd_struct * __must_check add_partition(struct gendisk *disk,
>  						     sector_t len, int flags,
>  						     struct partition_meta_info
>  						       *info);
> +extern void __delete_partition(struct kref *ref);
>  extern void delete_partition(struct gendisk *, int);
>  extern void printk_all_partitions(void);
>  
> diff --git a/include/linux/kref.h b/include/linux/kref.h
> index 6cc38fc..90b9e44 100644
> --- a/include/linux/kref.h
> +++ b/include/linux/kref.h
> @@ -23,6 +23,7 @@ struct kref {
>  
>  void kref_init(struct kref *kref);
>  void kref_get(struct kref *kref);
> +int kref_test_and_get(struct kref *kref);
>  int kref_put(struct kref *kref, void (*release) (struct kref *kref));
>  
>  #endif /* _KREF_H_ */
> diff --git a/lib/kref.c b/lib/kref.c
> index d3d227a..5f663b9 100644
> --- a/lib/kref.c
> +++ b/lib/kref.c
> @@ -37,6 +37,22 @@ void kref_get(struct kref *kref)
>  }
>  
>  /**
> + * kref_test_and_get - increment refcount for object only if refcount is not
> + * zero.
> + * @kref: object.
> + *
> + * Return non-zero if the refcount was incremented, 0 otherwise
> + */
> +int kref_test_and_get(struct kref *kref)
> +{
> +	int ret;
> +	smp_mb__before_atomic_inc();
> +	ret = atomic_inc_not_zero(&kref->refcount);
> +	smp_mb__after_atomic_inc();
> +	return ret;
> +}
> +
> +/**
>   * kref_put - decrement refcount for object.
>   * @kref: object.
>   * @release: pointer to the function that will clean up the object when the

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] block: fix accounting bug on cross partition merges
  2010-12-23 15:39                           ` Vivek Goyal
@ 2010-12-23 17:04                             ` Jerome Marchand
  2010-12-24 19:29                               ` Vivek Goyal
  0 siblings, 1 reply; 47+ messages in thread
From: Jerome Marchand @ 2010-12-23 17:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel

On 12/23/2010 04:39 PM, Vivek Goyal wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 4ce953f..72d12d2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -64,13 +64,21 @@ static void drive_stat_acct(struct request *rq, int new_io)
>>  		return;
>>  
>>  	cpu = part_stat_lock();
>> -	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>>  
>> -	if (!new_io)
>> +	if (!new_io) {
>> +		part = rq->part;
>>  		part_stat_inc(cpu, part, merges[rw]);
>> -	else {
>> +	} else {
>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>> +		if(part->partno && !kref_test_and_get(&part->ref))
>> +			/*
> 
> Do we have to check this part->partno both while taking and releasing
> reference. Can't we take one extra reference for disk->part0, at
> alloc_disk_node() time so that it is never freed and only freed when
> disk is going away and gendisk is being freed.
> 
> That way, you don't have to differentiate between disk->part0 and rest
> of the partitions while taking or dropping references.

We could do it in any way, as long as we don't end up trying to free
disk->part0. I choose not to touch part0->ref at all, but we could also
drop all the part->partno test, and get a reference on part0 when we use
it as a backup. I have no strong opinion about a way or an other.

Jerome

> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] block: fix accounting bug on cross partition merges
  2010-12-23 17:04                             ` Jerome Marchand
@ 2010-12-24 19:29                               ` Vivek Goyal
  2011-01-04 15:52                                 ` [PATCH 1/2] kref: add kref_test_and_get Jerome Marchand
  0 siblings, 1 reply; 47+ messages in thread
From: Vivek Goyal @ 2010-12-24 19:29 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Jens Axboe, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel

On Thu, Dec 23, 2010 at 06:04:11PM +0100, Jerome Marchand wrote:
> On 12/23/2010 04:39 PM, Vivek Goyal wrote:
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 4ce953f..72d12d2 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -64,13 +64,21 @@ static void drive_stat_acct(struct request *rq, int new_io)
> >>  		return;
> >>  
> >>  	cpu = part_stat_lock();
> >> -	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >>  
> >> -	if (!new_io)
> >> +	if (!new_io) {
> >> +		part = rq->part;
> >>  		part_stat_inc(cpu, part, merges[rw]);
> >> -	else {
> >> +	} else {
> >> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >> +		if(part->partno && !kref_test_and_get(&part->ref))
> >> +			/*
> > 
> > Do we have to check this part->partno both while taking and releasing
> > reference. Can't we take one extra reference for disk->part0, at
> > alloc_disk_node() time so that it is never freed and only freed when
> > disk is going away and gendisk is being freed.
> > 
> > That way, you don't have to differentiate between disk->part0 and rest
> > of the partitions while taking or dropping references.
> 
> We could do it in any way, as long as we don't end up trying to free
> disk->part0. I choose not to touch part0->ref at all, but we could also
> drop all the part->partno test, and get a reference on part0 when we use
> it as a backup. I have no strong opinion about a way or an other.

Ok, I personally like taking an extra reference to disk->part0 and put
a proper comment there so that at rest of the places we don't try to
differentiate between part0 and other partitions. 

Having said that I am not too concerned about it. It is just one of the
minor details. So post the new version of patch after testing.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/2] kref: add kref_test_and_get
  2010-12-24 19:29                               ` Vivek Goyal
@ 2011-01-04 15:52                                 ` Jerome Marchand
  2011-01-04 15:55                                   ` [PATCH 2/2] block: fix accounting bug on cross partition merges Jerome Marchand
                                                     ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Jerome Marchand @ 2011-01-04 15:52 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel, Greg Kroah-Hartman


Add kref_test_and_get() function, which atomically add a reference only if
refcount is not zero. This prevent to add a reference to an object that is
already being removed.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
---
 include/linux/kref.h |    1 +
 lib/kref.c           |   16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/kref.h b/include/linux/kref.h
index 6cc38fc..90b9e44 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -23,6 +23,7 @@ struct kref {
 
 void kref_init(struct kref *kref);
 void kref_get(struct kref *kref);
+int kref_test_and_get(struct kref *kref);
 int kref_put(struct kref *kref, void (*release) (struct kref *kref));
 
 #endif /* _KREF_H_ */
diff --git a/lib/kref.c b/lib/kref.c
index d3d227a..5f663b9 100644
--- a/lib/kref.c
+++ b/lib/kref.c
@@ -37,6 +37,22 @@ void kref_get(struct kref *kref)
 }
 
 /**
+ * kref_test_and_get - increment refcount for object only if refcount is not
+ * zero.
+ * @kref: object.
+ *
+ * Return non-zero if the refcount was incremented, 0 otherwise
+ */
+int kref_test_and_get(struct kref *kref)
+{
+	int ret;
+	smp_mb__before_atomic_inc();
+	ret = atomic_inc_not_zero(&kref->refcount);
+	smp_mb__after_atomic_inc();
+	return ret;
+}
+
+/**
  * kref_put - decrement refcount for object.
  * @kref: object.
  * @release: pointer to the function that will clean up the object when the

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-04 15:52                                 ` [PATCH 1/2] kref: add kref_test_and_get Jerome Marchand
@ 2011-01-04 15:55                                   ` Jerome Marchand
  2011-01-04 21:00                                     ` Greg KH
  2011-01-05 14:00                                     ` Jens Axboe
  2011-01-04 16:05                                   ` [PATCH 1/2] kref: add kref_test_and_get Eric Dumazet
  2011-01-04 20:57                                   ` [PATCH 1/2] " Greg KH
  2 siblings, 2 replies; 47+ messages in thread
From: Jerome Marchand @ 2011-01-04 15:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel, Greg Kroah-Hartman


/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist. We use kref_test_and_get() to ensure
we don't add a reference to a partition which is going away.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 block/blk-core.c       |   26 +++++++++++++++++++++-----
 block/blk-merge.c      |    3 ++-
 block/genhd.c          |    1 +
 fs/partitions/check.c  |   10 +++++++++-
 include/linux/blkdev.h |    1 +
 include/linux/genhd.h  |    2 ++
 6 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..409ad36 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -64,13 +64,27 @@ static void drive_stat_acct(struct request *rq, int new_io)
 		return;
 
 	cpu = part_stat_lock();
-	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 
-	if (!new_io)
+	if (!new_io) {
+		part = rq->part;
 		part_stat_inc(cpu, part, merges[rw]);
-	else {
+	} else {
+		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
+		if (!kref_test_and_get(&part->ref)) {
+			/*
+			 * The partition is already being removed,
+			 * the request will be accounted on the disk only
+			 *
+			 * We take a reference on disk->part0 although that
+			 * partition will never be deleted, so we can treat
+			 * it as any other partition.
+			 */
+			part = &rq->rq_disk->part0;
+			kref_get(&part->ref);
+		}
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
+		rq->part = part;
 	}
 
 	part_stat_unlock();
@@ -128,6 +142,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->ref_count = 1;
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
+	rq->part = NULL;
 }
 EXPORT_SYMBOL(blk_rq_init);
 
@@ -1776,7 +1791,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
 		part_stat_unlock();
 	}
@@ -1796,13 +1811,14 @@ static void blk_account_io_done(struct request *req)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 
 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
 		part_round_stats(cpu, part);
 		part_dec_in_flight(part, rw);
 
+		kref_put(&part->ref, __delete_partition);
 		part_stat_unlock();
 	}
 }
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 74bc4a7..23ea74b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -351,11 +351,12 @@ static void blk_account_io_merge(struct request *req)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 
 		part_round_stats(cpu, part);
 		part_dec_in_flight(part, rq_data_dir(req));
 
+		kref_put(&part->ref, __delete_partition);
 		part_stat_unlock();
 	}
 }
diff --git a/block/genhd.c b/block/genhd.c
index 5fa2b44..0c55eae 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1192,6 +1192,7 @@ struct gendisk *alloc_disk_node(int minors, int node_id)
 			return NULL;
 		}
 		disk->part_tbl->part[0] = &disk->part0;
+		kref_init(&disk->part0.ref);
 
 		disk->minors = minors;
 		rand_initialize_disk(disk);
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index 0a8b0ad..0123717 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -372,6 +372,13 @@ static void delete_partition_rcu_cb(struct rcu_head *head)
 	put_device(part_to_dev(part));
 }
 
+void __delete_partition(struct kref *ref)
+{
+	struct hd_struct *part = container_of(ref, struct hd_struct, ref);
+
+	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
+}
+
 void delete_partition(struct gendisk *disk, int partno)
 {
 	struct disk_part_tbl *ptbl = disk->part_tbl;
@@ -390,7 +397,7 @@ void delete_partition(struct gendisk *disk, int partno)
 	kobject_put(part->holder_dir);
 	device_del(part_to_dev(part));
 
-	call_rcu(&part->rcu_head, delete_partition_rcu_cb);
+	kref_put(&part->ref, __delete_partition);
 }
 
 static ssize_t whole_disk_show(struct device *dev,
@@ -489,6 +496,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	if (!dev_get_uevent_suppress(ddev))
 		kobject_uevent(&pdev->kobj, KOBJ_ADD);
 
+	kref_init(&p->ref);
 	return p;
 
 out_free_info:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 36ab42c..7572b19 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -115,6 +115,7 @@ struct request {
 	void *elevator_private3;
 
 	struct gendisk *rq_disk;
+	struct hd_struct *part;
 	unsigned long start_time;
 #ifdef CONFIG_BLK_CGROUP
 	unsigned long long start_time_ns;
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 7a7b9c1..2ba2792 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -116,6 +116,7 @@ struct hd_struct {
 	struct disk_stats dkstats;
 #endif
 	struct rcu_head rcu_head;
+	struct kref ref;
 };
 
 #define GENHD_FL_REMOVABLE			1
@@ -583,6 +584,7 @@ extern struct hd_struct * __must_check add_partition(struct gendisk *disk,
 						     sector_t len, int flags,
 						     struct partition_meta_info
 						       *info);
+extern void __delete_partition(struct kref *ref);
 extern void delete_partition(struct gendisk *, int);
 extern void printk_all_partitions(void);
 


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] kref: add kref_test_and_get
  2011-01-04 15:52                                 ` [PATCH 1/2] kref: add kref_test_and_get Jerome Marchand
  2011-01-04 15:55                                   ` [PATCH 2/2] block: fix accounting bug on cross partition merges Jerome Marchand
@ 2011-01-04 16:05                                   ` Eric Dumazet
  2011-01-05 15:02                                     ` [PATCH 1/2 v2] " Jerome Marchand
  2011-01-04 20:57                                   ` [PATCH 1/2] " Greg KH
  2 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2011-01-04 16:05 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Vivek Goyal, Jens Axboe, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel, Greg Kroah-Hartman

Le mardi 04 janvier 2011 à 16:52 +0100, Jerome Marchand a écrit :
> Add kref_test_and_get() function, which atomically add a reference only if
> refcount is not zero. This prevent to add a reference to an object that is
> already being removed.
> 
> Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

> +int kref_test_and_get(struct kref *kref)
> +{
> +	int ret;
> +	smp_mb__before_atomic_inc();
> +	ret = atomic_inc_not_zero(&kref->refcount);
> +	smp_mb__after_atomic_inc();
> +	return ret;
> +}
> +
> +/**

atomic_inc_not_zero() has full memory barrier semantic already, you dont
need smp_mb__before_atomic_inc() or smp_mb__after_atomic_inc();




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] kref: add kref_test_and_get
  2011-01-04 15:52                                 ` [PATCH 1/2] kref: add kref_test_and_get Jerome Marchand
  2011-01-04 15:55                                   ` [PATCH 2/2] block: fix accounting bug on cross partition merges Jerome Marchand
  2011-01-04 16:05                                   ` [PATCH 1/2] kref: add kref_test_and_get Eric Dumazet
@ 2011-01-04 20:57                                   ` Greg KH
  2011-01-05 13:35                                     ` Jerome Marchand
  2 siblings, 1 reply; 47+ messages in thread
From: Greg KH @ 2011-01-04 20:57 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Vivek Goyal, Jens Axboe, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On Tue, Jan 04, 2011 at 04:52:26PM +0100, Jerome Marchand wrote:
> 
> Add kref_test_and_get() function, which atomically add a reference only if
> refcount is not zero. This prevent to add a reference to an object that is
> already being removed.

We just removed a function like this recently as it really isn't the
solution for what you need here at all.

Please use a lock that protects the desctruction of the kref, like the
rest of the kernel, to keep someone from grabing a reference while the
device could be going away.  That's the only way to safely do this.

See the archives for why this overall, isn't a good idea for a kref to
have this type of function.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-04 15:55                                   ` [PATCH 2/2] block: fix accounting bug on cross partition merges Jerome Marchand
@ 2011-01-04 21:00                                     ` Greg KH
  2011-01-05 13:51                                       ` Jerome Marchand
  2011-01-05 13:55                                       ` Jens Axboe
  2011-01-05 14:00                                     ` Jens Axboe
  1 sibling, 2 replies; 47+ messages in thread
From: Greg KH @ 2011-01-04 21:00 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Vivek Goyal, Jens Axboe, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
> Also add a refcount to struct hd_struct to keep the partition in
> memory as long as users exist. We use kref_test_and_get() to ensure
> we don't add a reference to a partition which is going away.

No, don't do this, use a kref correctly and no such function should be
needed.

> +	} else {
> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));

That is the function that should properly increment the reference count
on the object.  If the object is "being removed", then it will return
NULL and you need to check that.  Do that and you do not need to add:

> +		if (!kref_test_and_get(&part->ref)) {

At all.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] kref: add kref_test_and_get
  2011-01-04 20:57                                   ` [PATCH 1/2] " Greg KH
@ 2011-01-05 13:35                                     ` Jerome Marchand
  2011-01-05 15:55                                       ` Greg KH
  0 siblings, 1 reply; 47+ messages in thread
From: Jerome Marchand @ 2011-01-05 13:35 UTC (permalink / raw)
  To: Greg KH
  Cc: Vivek Goyal, Jens Axboe, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On 01/04/2011 09:57 PM, Greg KH wrote:
> On Tue, Jan 04, 2011 at 04:52:26PM +0100, Jerome Marchand wrote:
>>
>> Add kref_test_and_get() function, which atomically add a reference only if
>> refcount is not zero. This prevent to add a reference to an object that is
>> already being removed.
> 
> We just removed a function like this recently as it really isn't the
> solution for what you need here at all.

What function are you talking about? kref_set? If yes, I don't see the
similarity between this function and mine.

Regards
Jerome

> 
> Please use a lock that protects the desctruction of the kref, like the
> rest of the kernel, to keep someone from grabing a reference while the
> device could be going away.  That's the only way to safely do this.
> 
> See the archives for why this overall, isn't a good idea for a kref to
> have this type of function.
> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-04 21:00                                     ` Greg KH
@ 2011-01-05 13:51                                       ` Jerome Marchand
  2011-01-05 16:00                                         ` Greg KH
  2011-01-05 13:55                                       ` Jens Axboe
  1 sibling, 1 reply; 47+ messages in thread
From: Jerome Marchand @ 2011-01-05 13:51 UTC (permalink / raw)
  To: Greg KH
  Cc: Vivek Goyal, Jens Axboe, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On 01/04/2011 10:00 PM, Greg KH wrote:
> On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
>> Also add a refcount to struct hd_struct to keep the partition in
>> memory as long as users exist. We use kref_test_and_get() to ensure
>> we don't add a reference to a partition which is going away.
> 
> No, don't do this, use a kref correctly and no such function should be
> needed.
> 
>> +	} else {
>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> 
> That is the function that should properly increment the reference count
> on the object.

Agreed.

>  If the object is "being removed", then it will return
> NULL and you need to check that.  Do that and you do not need to add:

The object is actually removed in a rcu callback function. We could
certainly add a flag to hd_struct, set by the release function, to
indicate disk_map_sector_rcu() that the partition is being removed, but
why not use the refcount instead?

Thanks,
Jerome

> 
>> +		if (!kref_test_and_get(&part->ref)) {
> 
> At all.
> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-04 21:00                                     ` Greg KH
  2011-01-05 13:51                                       ` Jerome Marchand
@ 2011-01-05 13:55                                       ` Jens Axboe
  2011-01-05 15:58                                         ` Greg KH
  1 sibling, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2011-01-05 13:55 UTC (permalink / raw)
  To: Greg KH
  Cc: Jerome Marchand, Vivek Goyal, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On 2011-01-04 22:00, Greg KH wrote:
> On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
>> Also add a refcount to struct hd_struct to keep the partition in
>> memory as long as users exist. We use kref_test_and_get() to ensure
>> we don't add a reference to a partition which is going away.
> 
> No, don't do this, use a kref correctly and no such function should be
> needed.
> 
>> +	} else {
>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> 
> That is the function that should properly increment the reference count
> on the object.  If the object is "being removed", then it will return
> NULL and you need to check that.  Do that and you do not need to add:

It doesn't matter if you do it in there of after the fact, since the
"lock" (RCU) is being held across the call. See my original suggestion
here:

https://lkml.org/lkml/2010/12/17/275

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-04 15:55                                   ` [PATCH 2/2] block: fix accounting bug on cross partition merges Jerome Marchand
  2011-01-04 21:00                                     ` Greg KH
@ 2011-01-05 14:00                                     ` Jens Axboe
  2011-01-05 14:09                                       ` Jerome Marchand
  1 sibling, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2011-01-05 14:00 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Vivek Goyal, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel, Greg Kroah-Hartman

On 2011-01-04 16:55, Jerome Marchand wrote:
> +	} else {
> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> +		if (!kref_test_and_get(&part->ref)) {
> +			/*
> +			 * The partition is already being removed,
> +			 * the request will be accounted on the disk only
> +			 *
> +			 * We take a reference on disk->part0 although that
> +			 * partition will never be deleted, so we can treat
> +			 * it as any other partition.
> +			 */
> +			part = &rq->rq_disk->part0;
> +			kref_get(&part->ref);
> +		}

This still doesn't work. So you are inside the if {} block, you know
what someone has dropped the last reference and the call_rcu() is being
scheduled. Adding a reference now will not prevent 'part' from going
away as soon as you do part_stat_unlock(). As per my last email, you
need a check in the rcu callback to ensure that the ref is 0.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-05 14:00                                     ` Jens Axboe
@ 2011-01-05 14:09                                       ` Jerome Marchand
  2011-01-05 14:17                                         ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Jerome Marchand @ 2011-01-05 14:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vivek Goyal, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel, Greg Kroah-Hartman

On 01/05/2011 03:00 PM, Jens Axboe wrote:
> On 2011-01-04 16:55, Jerome Marchand wrote:
>> +	} else {
>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>> +		if (!kref_test_and_get(&part->ref)) {
>> +			/*
>> +			 * The partition is already being removed,
>> +			 * the request will be accounted on the disk only
>> +			 *
>> +			 * We take a reference on disk->part0 although that
>> +			 * partition will never be deleted, so we can treat
>> +			 * it as any other partition.
>> +			 */
>> +			part = &rq->rq_disk->part0;
>> +			kref_get(&part->ref);
>> +		}
> 
> This still doesn't work. So you are inside the if {} block, you know
> what someone has dropped the last reference and the call_rcu() is being
> scheduled. Adding a reference now will not prevent 'part' from going
> away as soon as you do part_stat_unlock().

And what is the problem with that since we don't use 'part' (as returned
by disk_map_sector_rcu()), but disk->part0 instead?

> As per my last email, you
> need a check in the rcu callback to ensure that the ref is 0.
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-05 14:09                                       ` Jerome Marchand
@ 2011-01-05 14:17                                         ` Jens Axboe
  0 siblings, 0 replies; 47+ messages in thread
From: Jens Axboe @ 2011-01-05 14:17 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Vivek Goyal, Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu,
	linux-kernel, Greg Kroah-Hartman

On 2011-01-05 15:09, Jerome Marchand wrote:
> On 01/05/2011 03:00 PM, Jens Axboe wrote:
>> On 2011-01-04 16:55, Jerome Marchand wrote:
>>> +	} else {
>>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>>> +		if (!kref_test_and_get(&part->ref)) {
>>> +			/*
>>> +			 * The partition is already being removed,
>>> +			 * the request will be accounted on the disk only
>>> +			 *
>>> +			 * We take a reference on disk->part0 although that
>>> +			 * partition will never be deleted, so we can treat
>>> +			 * it as any other partition.
>>> +			 */
>>> +			part = &rq->rq_disk->part0;
>>> +			kref_get(&part->ref);
>>> +		}
>>
>> This still doesn't work. So you are inside the if {} block, you know
>> what someone has dropped the last reference and the call_rcu() is being
>> scheduled. Adding a reference now will not prevent 'part' from going
>> away as soon as you do part_stat_unlock().
> 
> And what is the problem with that since we don't use 'part' (as returned
> by disk_map_sector_rcu()), but disk->part0 instead?

Ugh, I'm the one who's blind now. part0 is indeed fine, I didn't read
that carefully enough.

So I think your patch looks safe now, I don't see any holes in it.
Whether we move the kref to inside the lookup or not, that doesn't
change anything wrt using an atomic_inc_not_zero(). Can you resend 1/2
with the manual barriers removed?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/2 v2] kref: add kref_test_and_get
  2011-01-04 16:05                                   ` [PATCH 1/2] kref: add kref_test_and_get Eric Dumazet
@ 2011-01-05 15:02                                     ` Jerome Marchand
  2011-01-05 15:43                                       ` Alexey Dobriyan
  2011-01-05 15:56                                       ` Greg KH
  0 siblings, 2 replies; 47+ messages in thread
From: Jerome Marchand @ 2011-01-05 15:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Vivek Goyal, Jens Axboe, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel, Greg Kroah-Hartman


Add kref_test_and_get() function, which atomically add a reference only if
refcount is not zero. This prevent to add a reference to an object that is
already being removed.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
---
 include/linux/kref.h |    1 +
 lib/kref.c           |   12 ++++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/kref.h b/include/linux/kref.h
index 6cc38fc..90b9e44 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -23,6 +23,7 @@ struct kref {
 
 void kref_init(struct kref *kref);
 void kref_get(struct kref *kref);
+int kref_test_and_get(struct kref *kref);
 int kref_put(struct kref *kref, void (*release) (struct kref *kref));
 
 #endif /* _KREF_H_ */
diff --git a/lib/kref.c b/lib/kref.c
index d3d227a..e7a6e10 100644
--- a/lib/kref.c
+++ b/lib/kref.c
@@ -37,6 +37,18 @@ void kref_get(struct kref *kref)
 }
 
 /**
+ * kref_test_and_get - increment refcount for object only if refcount is not
+ * zero.
+ * @kref: object.
+ *
+ * Return non-zero if the refcount was incremented, 0 otherwise
+ */
+int kref_test_and_get(struct kref *kref)
+{
+	return atomic_inc_not_zero(&kref->refcount);
+}
+
+/**
  * kref_put - decrement refcount for object.
  * @kref: object.
  * @release: pointer to the function that will clean up the object when the

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2 v2] kref: add kref_test_and_get
  2011-01-05 15:02                                     ` [PATCH 1/2 v2] " Jerome Marchand
@ 2011-01-05 15:43                                       ` Alexey Dobriyan
  2011-01-05 15:57                                         ` Greg KH
  2011-01-05 15:56                                       ` Greg KH
  1 sibling, 1 reply; 47+ messages in thread
From: Alexey Dobriyan @ 2011-01-05 15:43 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Eric Dumazet, Vivek Goyal, Jens Axboe, Satoru Takeuchi,
	Linus Torvalds, Yasuaki Ishimatsu, linux-kernel,
	Greg Kroah-Hartman

On Wed, Jan 5, 2011 at 5:02 PM, Jerome Marchand <jmarchan@redhat.com> wrote:
> Add kref_test_and_get() function, which atomically add a reference only if
> refcount is not zero. This prevent to add a reference to an object that is
> already being removed.

Is kref supposed to be used only for "toy" structures where you aren't
expected to do get possibly dead object?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2] kref: add kref_test_and_get
  2011-01-05 13:35                                     ` Jerome Marchand
@ 2011-01-05 15:55                                       ` Greg KH
  0 siblings, 0 replies; 47+ messages in thread
From: Greg KH @ 2011-01-05 15:55 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Vivek Goyal, Jens Axboe, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On Wed, Jan 05, 2011 at 02:35:04PM +0100, Jerome Marchand wrote:
> On 01/04/2011 09:57 PM, Greg KH wrote:
> > On Tue, Jan 04, 2011 at 04:52:26PM +0100, Jerome Marchand wrote:
> >>
> >> Add kref_test_and_get() function, which atomically add a reference only if
> >> refcount is not zero. This prevent to add a reference to an object that is
> >> already being removed.
> > 
> > We just removed a function like this recently as it really isn't the
> > solution for what you need here at all.
> 
> What function are you talking about? kref_set? If yes, I don't see the
> similarity between this function and mine.

Sorry, yes, kref_set was recently removed.  I think this function was
added, and removed, even way before that.  Or at the least, it gets
proposed every few years and shot down as being incorrect.  This isn't
the first time :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2 v2] kref: add kref_test_and_get
  2011-01-05 15:02                                     ` [PATCH 1/2 v2] " Jerome Marchand
  2011-01-05 15:43                                       ` Alexey Dobriyan
@ 2011-01-05 15:56                                       ` Greg KH
  1 sibling, 0 replies; 47+ messages in thread
From: Greg KH @ 2011-01-05 15:56 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Eric Dumazet, Vivek Goyal, Jens Axboe, Satoru Takeuchi,
	Linus Torvalds, Yasuaki Ishimatsu, linux-kernel

On Wed, Jan 05, 2011 at 04:02:17PM +0100, Jerome Marchand wrote:
> 
> Add kref_test_and_get() function, which atomically add a reference only if
> refcount is not zero. This prevent to add a reference to an object that is
> already being removed.

Again, no, I do not like this and will not accept this, sorry.

Please read the kref documentation for how to properly handle this,
without this function.

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/2 v2] kref: add kref_test_and_get
  2011-01-05 15:43                                       ` Alexey Dobriyan
@ 2011-01-05 15:57                                         ` Greg KH
  0 siblings, 0 replies; 47+ messages in thread
From: Greg KH @ 2011-01-05 15:57 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Jerome Marchand, Eric Dumazet, Vivek Goyal, Jens Axboe,
	Satoru Takeuchi, Linus Torvalds, Yasuaki Ishimatsu, linux-kernel

On Wed, Jan 05, 2011 at 05:43:22PM +0200, Alexey Dobriyan wrote:
> On Wed, Jan 5, 2011 at 5:02 PM, Jerome Marchand <jmarchan@redhat.com> wrote:
> > Add kref_test_and_get() function, which atomically add a reference only if
> > refcount is not zero. This prevent to add a reference to an object that is
> > already being removed.
> 
> Is kref supposed to be used only for "toy" structures where you aren't
> expected to do get possibly dead object?

Of course not, have you read the kref documentation?

{sigh}

Why do we even write these things if no one reads them...

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-05 13:55                                       ` Jens Axboe
@ 2011-01-05 15:58                                         ` Greg KH
  2011-01-05 18:46                                           ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Greg KH @ 2011-01-05 15:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jerome Marchand, Vivek Goyal, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On Wed, Jan 05, 2011 at 02:55:51PM +0100, Jens Axboe wrote:
> On 2011-01-04 22:00, Greg KH wrote:
> > On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
> >> Also add a refcount to struct hd_struct to keep the partition in
> >> memory as long as users exist. We use kref_test_and_get() to ensure
> >> we don't add a reference to a partition which is going away.
> > 
> > No, don't do this, use a kref correctly and no such function should be
> > needed.
> > 
> >> +	} else {
> >> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> > 
> > That is the function that should properly increment the reference count
> > on the object.  If the object is "being removed", then it will return
> > NULL and you need to check that.  Do that and you do not need to add:
> 
> It doesn't matter if you do it in there of after the fact, since the
> "lock" (RCU) is being held across the call. See my original suggestion
> here:
> 
> https://lkml.org/lkml/2010/12/17/275

Ok, that's fine, just do it without adding that kref function and I have
no objection :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-05 13:51                                       ` Jerome Marchand
@ 2011-01-05 16:00                                         ` Greg KH
  2011-01-05 16:19                                           ` Jerome Marchand
  0 siblings, 1 reply; 47+ messages in thread
From: Greg KH @ 2011-01-05 16:00 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Vivek Goyal, Jens Axboe, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On Wed, Jan 05, 2011 at 02:51:28PM +0100, Jerome Marchand wrote:
> On 01/04/2011 10:00 PM, Greg KH wrote:
> > On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
> >> Also add a refcount to struct hd_struct to keep the partition in
> >> memory as long as users exist. We use kref_test_and_get() to ensure
> >> we don't add a reference to a partition which is going away.
> > 
> > No, don't do this, use a kref correctly and no such function should be
> > needed.
> > 
> >> +	} else {
> >> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> > 
> > That is the function that should properly increment the reference count
> > on the object.
> 
> Agreed.
> 
> >  If the object is "being removed", then it will return
> > NULL and you need to check that.  Do that and you do not need to add:
> 
> The object is actually removed in a rcu callback function. We could
> certainly add a flag to hd_struct, set by the release function, to
> indicate disk_map_sector_rcu() that the partition is being removed, but
> why not use the refcount instead?

Because you have to properly serialize the grabbing of a kref if you
don't have a valid pointer in the first place, otherwise it will not
work properly at all.  Your new function still does not properly handle
the race condition of dropping the last reference and then having the
kref be cleaned up.  You are giving false hope to the user of the api
that what they are doing is correct.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-05 16:00                                         ` Greg KH
@ 2011-01-05 16:19                                           ` Jerome Marchand
  2011-01-05 16:27                                             ` Greg KH
  0 siblings, 1 reply; 47+ messages in thread
From: Jerome Marchand @ 2011-01-05 16:19 UTC (permalink / raw)
  To: Greg KH
  Cc: Vivek Goyal, Jens Axboe, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On 01/05/2011 05:00 PM, Greg KH wrote:
> On Wed, Jan 05, 2011 at 02:51:28PM +0100, Jerome Marchand wrote:
>> On 01/04/2011 10:00 PM, Greg KH wrote:
>>> On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
>>>> Also add a refcount to struct hd_struct to keep the partition in
>>>> memory as long as users exist. We use kref_test_and_get() to ensure
>>>> we don't add a reference to a partition which is going away.
>>>
>>> No, don't do this, use a kref correctly and no such function should be
>>> needed.
>>>
>>>> +	} else {
>>>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>>>
>>> That is the function that should properly increment the reference count
>>> on the object.
>>
>> Agreed.
>>
>>>  If the object is "being removed", then it will return
>>> NULL and you need to check that.  Do that and you do not need to add:
>>
>> The object is actually removed in a rcu callback function. We could
>> certainly add a flag to hd_struct, set by the release function, to
>> indicate disk_map_sector_rcu() that the partition is being removed, but
>> why not use the refcount instead?
> 
> Because you have to properly serialize the grabbing of a kref if you
> don't have a valid pointer in the first place, otherwise it will not
> work properly at all.  Your new function still does not properly handle
> the race condition of dropping the last reference and then having the
> kref be cleaned up.  You are giving false hope to the user of the api
> that what they are doing is correct.
> 

For clarification, is your objection only about not adding that misleading
function to kref api (I understand that), or is my code actually racy?

thanks,
Jerome


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-05 16:19                                           ` Jerome Marchand
@ 2011-01-05 16:27                                             ` Greg KH
  0 siblings, 0 replies; 47+ messages in thread
From: Greg KH @ 2011-01-05 16:27 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Vivek Goyal, Jens Axboe, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On Wed, Jan 05, 2011 at 05:19:19PM +0100, Jerome Marchand wrote:
> On 01/05/2011 05:00 PM, Greg KH wrote:
> > On Wed, Jan 05, 2011 at 02:51:28PM +0100, Jerome Marchand wrote:
> >> On 01/04/2011 10:00 PM, Greg KH wrote:
> >>> On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
> >>>> Also add a refcount to struct hd_struct to keep the partition in
> >>>> memory as long as users exist. We use kref_test_and_get() to ensure
> >>>> we don't add a reference to a partition which is going away.
> >>>
> >>> No, don't do this, use a kref correctly and no such function should be
> >>> needed.
> >>>
> >>>> +	} else {
> >>>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >>>
> >>> That is the function that should properly increment the reference count
> >>> on the object.
> >>
> >> Agreed.
> >>
> >>>  If the object is "being removed", then it will return
> >>> NULL and you need to check that.  Do that and you do not need to add:
> >>
> >> The object is actually removed in a rcu callback function. We could
> >> certainly add a flag to hd_struct, set by the release function, to
> >> indicate disk_map_sector_rcu() that the partition is being removed, but
> >> why not use the refcount instead?
> > 
> > Because you have to properly serialize the grabbing of a kref if you
> > don't have a valid pointer in the first place, otherwise it will not
> > work properly at all.  Your new function still does not properly handle
> > the race condition of dropping the last reference and then having the
> > kref be cleaned up.  You are giving false hope to the user of the api
> > that what they are doing is correct.
> > 
> 
> For clarification, is your objection only about not adding that misleading
> function to kref api (I understand that), or is my code actually racy?

As you are adding a misleading function to the kref api, and by using
it, causing a racy implementation, I would say both :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-05 15:58                                         ` Greg KH
@ 2011-01-05 18:46                                           ` Jens Axboe
  2011-01-05 20:08                                             ` Greg KH
  0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2011-01-05 18:46 UTC (permalink / raw)
  To: Greg KH
  Cc: Jerome Marchand, Vivek Goyal, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On 2011-01-05 16:58, Greg KH wrote:
> On Wed, Jan 05, 2011 at 02:55:51PM +0100, Jens Axboe wrote:
>> On 2011-01-04 22:00, Greg KH wrote:
>>> On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
>>>> Also add a refcount to struct hd_struct to keep the partition in
>>>> memory as long as users exist. We use kref_test_and_get() to ensure
>>>> we don't add a reference to a partition which is going away.
>>>
>>> No, don't do this, use a kref correctly and no such function should be
>>> needed.
>>>
>>>> +	} else {
>>>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>>>
>>> That is the function that should properly increment the reference count
>>> on the object.  If the object is "being removed", then it will return
>>> NULL and you need to check that.  Do that and you do not need to add:
>>
>> It doesn't matter if you do it in there of after the fact, since the
>> "lock" (RCU) is being held across the call. See my original suggestion
>> here:
>>
>> https://lkml.org/lkml/2010/12/17/275
> 
> Ok, that's fine, just do it without adding that kref function and I have
> no objection :)

Why? The code is perfectly fine. I originally objected to making an API
like this for simple reference counting - seems I was right. Please
actually look at the code and use. Alexey asked whether this was a toy
API or a real one, I'd like to know that as well. If this is meant just
for very basic get/put references, fine, then document that. But then
what's the point of having this API in the first place?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-05 18:46                                           ` Jens Axboe
@ 2011-01-05 20:08                                             ` Greg KH
  2011-01-05 21:38                                               ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Greg KH @ 2011-01-05 20:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jerome Marchand, Vivek Goyal, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On Wed, Jan 05, 2011 at 07:46:32PM +0100, Jens Axboe wrote:
> On 2011-01-05 16:58, Greg KH wrote:
> > On Wed, Jan 05, 2011 at 02:55:51PM +0100, Jens Axboe wrote:
> >> On 2011-01-04 22:00, Greg KH wrote:
> >>> On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
> >>>> Also add a refcount to struct hd_struct to keep the partition in
> >>>> memory as long as users exist. We use kref_test_and_get() to ensure
> >>>> we don't add a reference to a partition which is going away.
> >>>
> >>> No, don't do this, use a kref correctly and no such function should be
> >>> needed.
> >>>
> >>>> +	} else {
> >>>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >>>
> >>> That is the function that should properly increment the reference count
> >>> on the object.  If the object is "being removed", then it will return
> >>> NULL and you need to check that.  Do that and you do not need to add:
> >>
> >> It doesn't matter if you do it in there of after the fact, since the
> >> "lock" (RCU) is being held across the call. See my original suggestion
> >> here:
> >>
> >> https://lkml.org/lkml/2010/12/17/275
> > 
> > Ok, that's fine, just do it without adding that kref function and I have
> > no objection :)
> 
> Why? The code is perfectly fine. I originally objected to making an API
> like this for simple reference counting - seems I was right. Please
> actually look at the code and use. Alexey asked whether this was a toy
> API or a real one, I'd like to know that as well. If this is meant just
> for very basic get/put references, fine, then document that. But then
> what's the point of having this API in the first place?

The point is that you shouldn't have to roll your own reference count
code all over the place, 99% of the time, you should just use the
debugged, and documented, interface that the kernel provides with the
kref interface.

As for it being a "toy", it properly handles a very large majority of
the kernel reference counting logic today, in a race-free manner, so I
would not call that a "toy" at all.

Just use it properly.  As this patch series points out, adding this type
of function to the api is not a good idea, as it will be incorrect when
used.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-05 20:08                                             ` Greg KH
@ 2011-01-05 21:38                                               ` Jens Axboe
  2011-01-05 22:16                                                 ` Greg KH
  0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2011-01-05 21:38 UTC (permalink / raw)
  To: Greg KH
  Cc: Jerome Marchand, Vivek Goyal, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On 2011-01-05 21:08, Greg KH wrote:
> On Wed, Jan 05, 2011 at 07:46:32PM +0100, Jens Axboe wrote:
>> On 2011-01-05 16:58, Greg KH wrote:
>>> On Wed, Jan 05, 2011 at 02:55:51PM +0100, Jens Axboe wrote:
>>>> On 2011-01-04 22:00, Greg KH wrote:
>>>>> On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
>>>>>> Also add a refcount to struct hd_struct to keep the partition in
>>>>>> memory as long as users exist. We use kref_test_and_get() to ensure
>>>>>> we don't add a reference to a partition which is going away.
>>>>>
>>>>> No, don't do this, use a kref correctly and no such function should be
>>>>> needed.
>>>>>
>>>>>> +	} else {
>>>>>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>>>>>
>>>>> That is the function that should properly increment the reference count
>>>>> on the object.  If the object is "being removed", then it will return
>>>>> NULL and you need to check that.  Do that and you do not need to add:
>>>>
>>>> It doesn't matter if you do it in there of after the fact, since the
>>>> "lock" (RCU) is being held across the call. See my original suggestion
>>>> here:
>>>>
>>>> https://lkml.org/lkml/2010/12/17/275
>>>
>>> Ok, that's fine, just do it without adding that kref function and I have
>>> no objection :)
>>
>> Why? The code is perfectly fine. I originally objected to making an API
>> like this for simple reference counting - seems I was right. Please
>> actually look at the code and use. Alexey asked whether this was a toy
>> API or a real one, I'd like to know that as well. If this is meant just
>> for very basic get/put references, fine, then document that. But then
>> what's the point of having this API in the first place?
> 
> The point is that you shouldn't have to roll your own reference count
> code all over the place, 99% of the time, you should just use the
> debugged, and documented, interface that the kernel provides with the
> kref interface.
> 
> As for it being a "toy", it properly handles a very large majority of
> the kernel reference counting logic today, in a race-free manner, so I
> would not call that a "toy" at all.

Dunno, then perhaps pointless. It's not like the API is saving lots of
typing or easier to use than just atomics, imho.

> Just use it properly.  As this patch series points out, adding this type

By adding pointless locks? Your suggestion of doing the referencing
inside the function being called is moot, since RCU is held off over the
call. The point of the addition to the API is to _not_ grab a reference
if someone has done the final put. We know that the RCU grace period has
not ended, so the kref is valid. But if it is going away _in the future_
after we drop the part lock, then we don't want a reference to it.

So to "use it properly", I would have to slow down a fast path. No
thanks.

> of function to the api is not a good idea, as it will be incorrect when
> used.

The code is fine, the use is fine. I think the only thing we have
established here is that Jerome made the mistake of using the kref API
for this. I'll rewrite that part to handle its own references.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-05 21:38                                               ` Jens Axboe
@ 2011-01-05 22:16                                                 ` Greg KH
  2011-01-06  9:46                                                   ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Greg KH @ 2011-01-05 22:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jerome Marchand, Vivek Goyal, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On Wed, Jan 05, 2011 at 10:38:15PM +0100, Jens Axboe wrote:
> On 2011-01-05 21:08, Greg KH wrote:
> > On Wed, Jan 05, 2011 at 07:46:32PM +0100, Jens Axboe wrote:
> >> On 2011-01-05 16:58, Greg KH wrote:
> >>> On Wed, Jan 05, 2011 at 02:55:51PM +0100, Jens Axboe wrote:
> >>>> On 2011-01-04 22:00, Greg KH wrote:
> >>>>> On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
> >>>>>> Also add a refcount to struct hd_struct to keep the partition in
> >>>>>> memory as long as users exist. We use kref_test_and_get() to ensure
> >>>>>> we don't add a reference to a partition which is going away.
> >>>>>
> >>>>> No, don't do this, use a kref correctly and no such function should be
> >>>>> needed.
> >>>>>
> >>>>>> +	} else {
> >>>>>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >>>>>
> >>>>> That is the function that should properly increment the reference count
> >>>>> on the object.  If the object is "being removed", then it will return
> >>>>> NULL and you need to check that.  Do that and you do not need to add:
> >>>>
> >>>> It doesn't matter if you do it in there of after the fact, since the
> >>>> "lock" (RCU) is being held across the call. See my original suggestion
> >>>> here:
> >>>>
> >>>> https://lkml.org/lkml/2010/12/17/275
> >>>
> >>> Ok, that's fine, just do it without adding that kref function and I have
> >>> no objection :)
> >>
> >> Why? The code is perfectly fine. I originally objected to making an API
> >> like this for simple reference counting - seems I was right. Please
> >> actually look at the code and use. Alexey asked whether this was a toy
> >> API or a real one, I'd like to know that as well. If this is meant just
> >> for very basic get/put references, fine, then document that. But then
> >> what's the point of having this API in the first place?
> > 
> > The point is that you shouldn't have to roll your own reference count
> > code all over the place, 99% of the time, you should just use the
> > debugged, and documented, interface that the kernel provides with the
> > kref interface.
> > 
> > As for it being a "toy", it properly handles a very large majority of
> > the kernel reference counting logic today, in a race-free manner, so I
> > would not call that a "toy" at all.
> 
> Dunno, then perhaps pointless. It's not like the API is saving lots of
> typing or easier to use than just atomics, imho.

Remember what Andrew said when you complained about this last time?
I'll paraphrase:
	When we see someone use an atomic value for a reference count,
	we then need to go audit the code to make sure they got the
	flushing right, and all the other little stuff you need to do in
	order to ensure the code works properly.

	If you use a kref, then we know that all is correct, and we can
	then focus on how the kref is used (very easy to audit) and
	worry about the rest of the code.  It saves us time as
	reviewers and maintainers also.

> > Just use it properly.  As this patch series points out, adding this type
> 
> By adding pointless locks?

No.

> Your suggestion of doing the referencing
> inside the function being called is moot, since RCU is held off over the
> call. The point of the addition to the API is to _not_ grab a reference
> if someone has done the final put. We know that the RCU grace period has
> not ended, so the kref is valid. But if it is going away _in the future_
> after we drop the part lock, then we don't want a reference to it.
> 
> So to "use it properly", I would have to slow down a fast path. No
> thanks.

Then don't use it.  a kref is NOT for a fast path, use RCU for that.
Heck, an atomic value is also not good for a fast path also, as it
causes major stalls, so you might want to reconsider not even using that
if you are thinking you should roll your own.

> > of function to the api is not a good idea, as it will be incorrect when
> > used.
> 
> The code is fine, the use is fine. I think the only thing we have
> established here is that Jerome made the mistake of using the kref API
> for this. I'll rewrite that part to handle its own references.

Yes, that is true.  If you are already using RCU, then by all means,
don't use a kref.  There are lots of reference counts in the kernel that
don't use a kref, nor should they.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2] block: fix accounting bug on cross partition merges
  2011-01-05 22:16                                                 ` Greg KH
@ 2011-01-06  9:46                                                   ` Jens Axboe
  0 siblings, 0 replies; 47+ messages in thread
From: Jens Axboe @ 2011-01-06  9:46 UTC (permalink / raw)
  To: Greg KH
  Cc: Jerome Marchand, Vivek Goyal, Satoru Takeuchi, Linus Torvalds,
	Yasuaki Ishimatsu, linux-kernel

On 2011-01-05 23:16, Greg KH wrote:
> On Wed, Jan 05, 2011 at 10:38:15PM +0100, Jens Axboe wrote:
>> On 2011-01-05 21:08, Greg KH wrote:
>>> On Wed, Jan 05, 2011 at 07:46:32PM +0100, Jens Axboe wrote:
>>>> On 2011-01-05 16:58, Greg KH wrote:
>>>>> On Wed, Jan 05, 2011 at 02:55:51PM +0100, Jens Axboe wrote:
>>>>>> On 2011-01-04 22:00, Greg KH wrote:
>>>>>>> On Tue, Jan 04, 2011 at 04:55:13PM +0100, Jerome Marchand wrote:
>>>>>>>> Also add a refcount to struct hd_struct to keep the partition in
>>>>>>>> memory as long as users exist. We use kref_test_and_get() to ensure
>>>>>>>> we don't add a reference to a partition which is going away.
>>>>>>>
>>>>>>> No, don't do this, use a kref correctly and no such function should be
>>>>>>> needed.
>>>>>>>
>>>>>>>> +	} else {
>>>>>>>> +		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>>>>>>>
>>>>>>> That is the function that should properly increment the reference count
>>>>>>> on the object.  If the object is "being removed", then it will return
>>>>>>> NULL and you need to check that.  Do that and you do not need to add:
>>>>>>
>>>>>> It doesn't matter if you do it in there of after the fact, since the
>>>>>> "lock" (RCU) is being held across the call. See my original suggestion
>>>>>> here:
>>>>>>
>>>>>> https://lkml.org/lkml/2010/12/17/275
>>>>>
>>>>> Ok, that's fine, just do it without adding that kref function and I have
>>>>> no objection :)
>>>>
>>>> Why? The code is perfectly fine. I originally objected to making an API
>>>> like this for simple reference counting - seems I was right. Please
>>>> actually look at the code and use. Alexey asked whether this was a toy
>>>> API or a real one, I'd like to know that as well. If this is meant just
>>>> for very basic get/put references, fine, then document that. But then
>>>> what's the point of having this API in the first place?
>>>
>>> The point is that you shouldn't have to roll your own reference count
>>> code all over the place, 99% of the time, you should just use the
>>> debugged, and documented, interface that the kernel provides with the
>>> kref interface.
>>>
>>> As for it being a "toy", it properly handles a very large majority of
>>> the kernel reference counting logic today, in a race-free manner, so I
>>> would not call that a "toy" at all.
>>
>> Dunno, then perhaps pointless. It's not like the API is saving lots of
>> typing or easier to use than just atomics, imho.
> 
> Remember what Andrew said when you complained about this last time?
> I'll paraphrase:
> 	When we see someone use an atomic value for a reference count,
> 	we then need to go audit the code to make sure they got the
> 	flushing right, and all the other little stuff you need to do in
> 	order to ensure the code works properly.
> 
> 	If you use a kref, then we know that all is correct, and we can
> 	then focus on how the kref is used (very easy to audit) and
> 	worry about the rest of the code.  It saves us time as
> 	reviewers and maintainers also.

And I agree with that. Which is even more reason to add advanced
functionality to it, instead of just bog basic stuff. It's not like
basic reference counts are black magic by any stretch of the
imagination. If you just document with kerneldoc, there's a chance that
people will actually read the documentation.

>>> Just use it properly.  As this patch series points out, adding this type
>>
>> By adding pointless locks?
> 
> No.

Then how would you suggest to use it properly, giving the context of
this thread? Did you read more of the code than just the single kref
function that was added?

>> Your suggestion of doing the referencing
>> inside the function being called is moot, since RCU is held off over the
>> call. The point of the addition to the API is to _not_ grab a reference
>> if someone has done the final put. We know that the RCU grace period has
>> not ended, so the kref is valid. But if it is going away _in the future_
>> after we drop the part lock, then we don't want a reference to it.
>>
>> So to "use it properly", I would have to slow down a fast path. No
>> thanks.
> 
> Then don't use it.  a kref is NOT for a fast path, use RCU for that.
> Heck, an atomic value is also not good for a fast path also, as it
> causes major stalls, so you might want to reconsider not even using that
> if you are thinking you should roll your own.

There's a major difference between needing a reference count in a fast
path, or needing a reference count and screwing locking just to be able
to do that.

>>> of function to the api is not a good idea, as it will be incorrect when
>>> used.
>>
>> The code is fine, the use is fine. I think the only thing we have
>> established here is that Jerome made the mistake of using the kref API
>> for this. I'll rewrite that part to handle its own references.
> 
> Yes, that is true.  If you are already using RCU, then by all means,
> don't use a kref.  There are lots of reference counts in the kernel that
> don't use a kref, nor should they.

I'll do that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2011-01-06  9:44 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-06  9:44 [PATCH 1/2] Don't merge different partition's IOs Yasuaki Ishimatsu
2010-12-06 16:08 ` Linus Torvalds
2010-12-07  7:18   ` Satoru Takeuchi
2010-12-07 18:39     ` Vivek Goyal
2010-12-08  7:33     ` Jens Axboe
2010-12-08  7:59       ` Satoru Takeuchi
2010-12-08  8:06         ` Jens Axboe
2010-12-08  8:11           ` Satoru Takeuchi
2010-12-08 14:46             ` Jens Axboe
2010-12-08 15:51               ` Vivek Goyal
2010-12-08 15:58                 ` Vivek Goyal
2010-12-10 11:22                   ` Jerome Marchand
2010-12-10 16:12               ` Jerome Marchand
2010-12-10 16:55                 ` Vivek Goyal
2010-12-14 20:25                   ` Jens Axboe
2010-12-17 13:42                     ` [PATCH] block: fix accounting bug on cross partition merges Jerome Marchand
2010-12-17 19:06                       ` Jens Axboe
2010-12-17 22:32                         ` Vivek Goyal
2010-12-23 15:10                         ` Jerome Marchand
2010-12-23 15:39                           ` Vivek Goyal
2010-12-23 17:04                             ` Jerome Marchand
2010-12-24 19:29                               ` Vivek Goyal
2011-01-04 15:52                                 ` [PATCH 1/2] kref: add kref_test_and_get Jerome Marchand
2011-01-04 15:55                                   ` [PATCH 2/2] block: fix accounting bug on cross partition merges Jerome Marchand
2011-01-04 21:00                                     ` Greg KH
2011-01-05 13:51                                       ` Jerome Marchand
2011-01-05 16:00                                         ` Greg KH
2011-01-05 16:19                                           ` Jerome Marchand
2011-01-05 16:27                                             ` Greg KH
2011-01-05 13:55                                       ` Jens Axboe
2011-01-05 15:58                                         ` Greg KH
2011-01-05 18:46                                           ` Jens Axboe
2011-01-05 20:08                                             ` Greg KH
2011-01-05 21:38                                               ` Jens Axboe
2011-01-05 22:16                                                 ` Greg KH
2011-01-06  9:46                                                   ` Jens Axboe
2011-01-05 14:00                                     ` Jens Axboe
2011-01-05 14:09                                       ` Jerome Marchand
2011-01-05 14:17                                         ` Jens Axboe
2011-01-04 16:05                                   ` [PATCH 1/2] kref: add kref_test_and_get Eric Dumazet
2011-01-05 15:02                                     ` [PATCH 1/2 v2] " Jerome Marchand
2011-01-05 15:43                                       ` Alexey Dobriyan
2011-01-05 15:57                                         ` Greg KH
2011-01-05 15:56                                       ` Greg KH
2011-01-04 20:57                                   ` [PATCH 1/2] " Greg KH
2011-01-05 13:35                                     ` Jerome Marchand
2011-01-05 15:55                                       ` Greg KH

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.