linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
@ 2008-10-02 14:29 Nikanth Karthikesan
  2008-10-02 15:03 ` James Bottomley
  2008-10-02 15:05 ` Nikanth Karthikesan
  0 siblings, 2 replies; 12+ messages in thread
From: Nikanth Karthikesan @ 2008-10-02 14:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi, FUJITA Tomonori

This is a follow-up to my earlier mail http://lkml.org/lkml/2008/9/23/294 
([PATCH] BUG: ll_merge_requests_fn() updates req->nr_phys_segments wrongly)

It is possible for the merging code to create lesser no of phys segments than 
hw segments, but every hw segment needs atleast one new phys segment. This 
triggers the BUG() on scsi_init_sgtable() as blk_rq_map_sg() returns more no 
of segments than rq->nr_phys_segments

The following blktrace shows a sequence of bio's to trigger such condition on 
my machine with max_sectors_kb=512 & max_hw_sectors_kb=32767.

  8,0    0       12     1.943680621  2261  Q   R 6184075 + 55 [bash]
  8,0    0       13     1.943684671  2261  G   R 6184075 + 55 [bash]
  8,0    0       14     1.943690189  2261  P   N [bash]
  8,0    0       15     1.943692075  2261  I   R 6184075 + 55 [bash]
  8,0    0       16     1.943712119  2261  D   R 6184075 + 55 [bash]
  8,0    0       17     1.943718684  2261  Q   R 6184130 + 55 [bash]
  8,0    0       18     1.943721897  2261  G   R 6184130 + 55 [bash]
  8,0    0       19     1.943726576  2261  P   N [bash]
  8,0    0       20     1.943727763  2261  I   R 6184130 + 55 [bash]
  8,0    0       21     1.943731675  2261  Q   R 6184241 + 56 [bash]
  8,0    0       22     1.943735167  2261  G   R 6184241 + 56 [bash]
  8,0    0       23     1.943739497  2261  I   R 6184241 + 56 [bash]
  8,0    0       24     1.943742081  2261  Q   R 6184185 + 56 [bash]
  8,0    0       25     1.943744875  2261  M   R 6184185 + 56 [bash]
  8,0    0       26     1.943753535  2261  Q   R 6184352 + 55 [bash]
  8,0    0       27     1.943756538  2261  G   R 6184352 + 55 [bash]
  8,0    0       28     1.943760868  2261  I   R 6184352 + 55 [bash]
  8,0    0       29     1.943763522  2261  Q   R 6184407 + 55 [bash]
  8,0    0       30     1.943766316  2261  M   R 6184407 + 55 [bash]
  8,0    0       31     1.943770506  2261  Q   R 6184297 + 55 [bash]
  8,0    0       32     1.943772951  2261  F   R 6184297 + 55 [bash]
  8,0    0       33     1.943780843  2261  Q   R 6184462 + 55 [bash]
  8,0    0       34     1.943783776  2261  M   R 6184462 + 55 [bash]
  8,0    0       35     1.944444684     0 UT   N [swapper] 2
  8,0    0       36     1.944453624    10  U   N [kblockd/0] 2
  8,0    0       37     1.944470595    10  D   R 6184130 + 387 [kblockd/0]
  8,0    0       38     1.970340853     0  C   R 6184075 + 55 [0]
  8,0    0       39     1.973576739     0  C   R 6184130 + 387 [0]

Patches against Jens git to fix this.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..44cc1d7 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -395,9 +395,6 @@ static int ll_merge_requests_fn(struct request_queue *q, 
struct request *req,
 	if (blk_phys_contig_segment(q, req->biotail, next->bio))
 		total_phys_segments--;
 
-	if (total_phys_segments > q->max_phys_segments)
-		return 0;
-
 	total_hw_segments = req->nr_hw_segments + next->nr_hw_segments;
 	if (blk_hw_contig_segment(q, req->biotail, next->bio)) {
 		int len = req->biotail->bi_hw_back_size +
@@ -415,6 +412,15 @@ static int ll_merge_requests_fn(struct request_queue *q, 
struct request *req,
 	if (total_hw_segments > q->max_hw_segments)
 		return 0;
 
+	/*
+	 * FIXME: if 2 phys segments is used for a single hw segment?
+	 */
+	if (total_phys_segments < total_hw_segments)
+		total_phys_segments = total_hw_segments;
+
+	if (total_phys_segments > q->max_phys_segments)
+		return 0;
+
 	/* Merge is OK... */
 	req->nr_phys_segments = total_phys_segments;
 	req->nr_hw_segments = total_hw_segments;

But this may not be the complete fix? I am not sure whether the condition in 
FIXME comment can be triggered. The following patch may be a better fix.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..b2c37f4 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -467,6 +467,8 @@ static int attempt_merge(struct request_queue *q, struct 
request *req,
 
 	req->nr_sectors = req->hard_nr_sectors += next->hard_nr_sectors;
 
+	blk_recalc_rq_segments(req);
+
 	elv_merge_requests(q, req, next);
 
 	if (req->rq_disk) {

But still wouldn't it be incomplete fix as blk_recalc_rq_segments() can 
produce nr_phys_segments > q->max_phys_segments? Do we need something like 
this.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..e4431db 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -427,6 +427,8 @@ static int ll_merge_requests_fn(struct request_queue *q, 
struct request *req,
 static int attempt_merge(struct request_queue *q, struct request *req,
 			  struct request *next)
 {
+	struct bio *req_biotail, *req_biotail_bi_next;
+
 	if (!rq_mergeable(req) || !rq_mergeable(next))
 		return 0;
 
@@ -462,11 +464,21 @@ static int attempt_merge(struct request_queue *q, struct 
request *req,
 	if (time_after(req->start_time, next->start_time))
 		req->start_time = next->start_time;
 
+	req_biotail = req->biotail;
+	req_biotail_bi_next = req->biotail->bi_next;
 	req->biotail->bi_next = next->bio;
 	req->biotail = next->biotail;
 
 	req->nr_sectors = req->hard_nr_sectors += next->hard_nr_sectors;
 
+	blk_recalc_rq_segments(req);
+	if (req->nr_phys_segments > q->max_phys_segments) {
+		req->biotail = req_biotail;
+		req->biotail->bi_next = req_biotail_bi_next;
+		blk_recalc_rq_segments(req);
+		return 0;
+	}
+
 	elv_merge_requests(q, req, next);
 
 	if (req->rq_disk) {

But blk_recalc_rq_segments() cannot increase nr_phys_segments by more than 
one. So checking for one lesser limit earlier should also work? But we would 
be mostly merging 1 lesser segment.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..d9b5289 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -256,7 +256,7 @@ static inline int ll_new_mergeable(struct request_queue 
*q,
 {
 	int nr_phys_segs = bio_phys_segments(q, bio);
 
-	if (req->nr_phys_segments + nr_phys_segs > q->max_phys_segments) {
+	if (req->nr_phys_segments + nr_phys_segs > q->max_phys_segments - 1) {
 		req->cmd_flags |= REQ_NOMERGE;
 		if (req == q->last_merge)
 			q->last_merge = NULL;
@@ -395,7 +395,7 @@ static int ll_merge_requests_fn(struct request_queue *q, 
struct request *req,
 	if (blk_phys_contig_segment(q, req->biotail, next->bio))
 		total_phys_segments--;
 
-	if (total_phys_segments > q->max_phys_segments)
+	if (total_phys_segments > q->max_phys_segments - 1)
 		return 0;
 
 	total_hw_segments = req->nr_hw_segments + next->nr_hw_segments;
@@ -467,6 +467,8 @@ static int attempt_merge(struct request_queue *q, struct 
request *req,
 
 	req->nr_sectors = req->hard_nr_sectors += next->hard_nr_sectors;
 
+	blk_recalc_rq_segments(req);
+
 	elv_merge_requests(q, req, next);
 
 	if (req->rq_disk) {


Thanks
Nikanth Karthikesan



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
  2008-10-02 14:29 [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments Nikanth Karthikesan
@ 2008-10-02 15:03 ` James Bottomley
  2008-10-02 16:58   ` Jens Axboe
  2008-10-02 15:05 ` Nikanth Karthikesan
  1 sibling, 1 reply; 12+ messages in thread
From: James Bottomley @ 2008-10-02 15:03 UTC (permalink / raw)
  To: Nikanth Karthikesan; +Cc: Jens Axboe, linux-kernel, linux-scsi, FUJITA Tomonori

On Thu, 2008-10-02 at 19:59 +0530, Nikanth Karthikesan wrote:
> This is a follow-up to my earlier mail http://lkml.org/lkml/2008/9/23/294 
> ([PATCH] BUG: ll_merge_requests_fn() updates req->nr_phys_segments wrongly)
> 
> It is possible for the merging code to create lesser no of phys segments than 
> hw segments, but every hw segment needs atleast one new phys segment. This 
> triggers the BUG() on scsi_init_sgtable() as blk_rq_map_sg() returns more no 
> of segments than rq->nr_phys_segments
> 
> The following blktrace shows a sequence of bio's to trigger such condition on 
> my machine with max_sectors_kb=512 & max_hw_sectors_kb=32767.

Um, don't you mean this the other way around?  I can see this problem
occurring if the block layer gets tricked into doing a physical merge
where sector limits forbid a virtual merge.

The bug would appear to be that we sometimes only look at q->max_sectors
when deciding on mergability.  Either we have to insist on max_sectors
<= hw_max_sectors, or we have to start using min(q->max_sectors,
q->max_hw_sectors) for this.

James



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
  2008-10-02 14:29 [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments Nikanth Karthikesan
  2008-10-02 15:03 ` James Bottomley
@ 2008-10-02 15:05 ` Nikanth Karthikesan
  1 sibling, 0 replies; 12+ messages in thread
From: Nikanth Karthikesan @ 2008-10-02 15:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi, FUJITA Tomonori

On Thursday 02 October 2008 19:59:33 Nikanth Karthikesan wrote:
> This is a follow-up to my earlier mail http://lkml.org/lkml/2008/9/23/294
> ([PATCH] BUG: ll_merge_requests_fn() updates req->nr_phys_segments wrongly)
>
> It is possible for the merging code to create lesser no of phys segments
> than hw segments, but every hw segment needs atleast one new phys segment.
> This triggers the BUG() on scsi_init_sgtable() as blk_rq_map_sg() returns
> more no of segments than rq->nr_phys_segments
>
> The following blktrace shows a sequence of bio's to trigger such condition
> on my machine with max_sectors_kb=512 & max_hw_sectors_kb=32767.
>
>   8,0    0       12     1.943680621  2261  Q   R 6184075 + 55 [bash]
>   8,0    0       13     1.943684671  2261  G   R 6184075 + 55 [bash]
>   8,0    0       14     1.943690189  2261  P   N [bash]
>   8,0    0       15     1.943692075  2261  I   R 6184075 + 55 [bash]
>   8,0    0       16     1.943712119  2261  D   R 6184075 + 55 [bash]
>   8,0    0       17     1.943718684  2261  Q   R 6184130 + 55 [bash]
>   8,0    0       18     1.943721897  2261  G   R 6184130 + 55 [bash]
>   8,0    0       19     1.943726576  2261  P   N [bash]
>   8,0    0       20     1.943727763  2261  I   R 6184130 + 55 [bash]
>   8,0    0       21     1.943731675  2261  Q   R 6184241 + 56 [bash]
>   8,0    0       22     1.943735167  2261  G   R 6184241 + 56 [bash]
>   8,0    0       23     1.943739497  2261  I   R 6184241 + 56 [bash]
>   8,0    0       24     1.943742081  2261  Q   R 6184185 + 56 [bash]
>   8,0    0       25     1.943744875  2261  M   R 6184185 + 56 [bash]
>   8,0    0       26     1.943753535  2261  Q   R 6184352 + 55 [bash]
>   8,0    0       27     1.943756538  2261  G   R 6184352 + 55 [bash]
>   8,0    0       28     1.943760868  2261  I   R 6184352 + 55 [bash]
>   8,0    0       29     1.943763522  2261  Q   R 6184407 + 55 [bash]
>   8,0    0       30     1.943766316  2261  M   R 6184407 + 55 [bash]
>   8,0    0       31     1.943770506  2261  Q   R 6184297 + 55 [bash]
>   8,0    0       32     1.943772951  2261  F   R 6184297 + 55 [bash]
>   8,0    0       33     1.943780843  2261  Q   R 6184462 + 55 [bash]
>   8,0    0       34     1.943783776  2261  M   R 6184462 + 55 [bash]
>   8,0    0       35     1.944444684     0 UT   N [swapper] 2
>   8,0    0       36     1.944453624    10  U   N [kblockd/0] 2
>   8,0    0       37     1.944470595    10  D   R 6184130 + 387 [kblockd/0]
>   8,0    0       38     1.970340853     0  C   R 6184075 + 55 [0]
>   8,0    0       39     1.973576739     0  C   R 6184130 + 387 [0]
>
> 
Oops, that was incomplete information to reproduce the issue.

typedef struct {
	int	IoSize;
	char	*Mem;
} DB_MEM_LIST, *pDB_MEM_LIST;

DB_MEM_LIST	dbMemList[NUM_BIO] = {
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x7000,	NULL },
	{ 0x7000,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
};

		
// Allocate Memory.
char *Mem0 = kmalloc(dbMemList[0].IoSize, GFP_KERNEL);
char *Mem1 = kmalloc(dbMemList[1].IoSize, GFP_KERNEL);
char *Mem2 = kmalloc((dbMemList[2].IoSize + dbMemList[3].IoSize + 
dbMemList[4].IoSize), GFP_KERNEL);
char *Mem5 = kmalloc(dbMemList[5].IoSize, GFP_KERNEL);
char *Mem6 = kmalloc(dbMemList[6].IoSize, GFP_KERNEL);
char *Mem7 = kmalloc(dbMemList[7].IoSize, GFP_KERNEL);

dbMemList[0].Mem = Mem0;
dbMemList[1].Mem = Mem1;
dbMemList[2].Mem = Mem2;
dbMemList[3].Mem = dbMemList[2].Mem + dbMemList[2].IoSize;
dbMemList[4].Mem = dbMemList[3].Mem + dbMemList[3].IoSize;
dbMemList[5].Mem = Mem5;
dbMemList[6].Mem = Mem6;
dbMemList[7].Mem = Mem7;


int			Loop;
struct bio		*BioList[NUM_BIO]	= { NULL };
struct bio		*Bio			= NULL;
long			Sector;


/*
** Allocate BIOs.
*/
for(Loop = 0; Loop < NUM_BIO; Loop++)
	BioList[Loop] = bio_alloc(GFP_KERNEL, NUMBER_OF_VECS);

for(Loop = 0; Loop < NUM_BIO; Loop++) {
	Bio = BioList[Loop];
	Bio->bi_sector			= Sector;
	Bio->bi_bdev			= Bdev;
	Bio->bi_end_io			= db_end_io;
	if (Loop == 0) {
		if (!bio_add_page(BioList[Loop],
				virt_to_page(dbMemList[Loop].Mem),
				0x3000,
				offset_in_page(dbMemList[Loop].Mem))) {
			printk("NIK: bio add page failed- %dA\n", Loop);
		}
		if (!bio_add_page(BioList[Loop],
				virt_to_page(dbMemList[7].Mem),
				0x3e00,
				offset_in_page(dbMemList[7].Mem))) {
			printk("NIK: bio add page failed- %dB\n", Loop);
		}
	} else if (Loop == 7) {
		if (!bio_add_page(BioList[Loop],
				virt_to_page(dbMemList[Loop].Mem),
				0x2000,
				offset_in_page(dbMemList[Loop].Mem))) {
			printk("NIK: bio add page failed- %dA\n", Loop);
		}
		if (!bio_add_page(BioList[Loop],
				virt_to_page(dbMemList[7].Mem),
				0x4e00,
				offset_in_page(dbMemList[7].Mem))) {
			printk("NIK: bio add page failed- %dB\n", Loop);
		}
	} else if (!bio_add_page(BioList[Loop],
				virt_to_page(dbMemList[Loop].Mem),
				dbMemList[Loop].IoSize,
				offset_in_page(dbMemList[Loop].Mem))) {
		printk("NIK: bio add page failed %d\n", Loop);
	}
	Sector += (Bio->bi_size / 512);
}

submit_bio(READ, BioList[0]);	// 1
submit_bio(READ, BioList[1]);	// 2
submit_bio(READ, BioList[3]);	// 4
submit_bio(READ, BioList[2]);	// 3
submit_bio(READ, BioList[5]);	// 6
submit_bio(READ, BioList[6]);	// 7
submit_bio(READ, BioList[4]);	// 5
submit_bio(READ, BioList[7]);	// 8

Thanks
Nikanth Karthikesan

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
  2008-10-02 15:03 ` James Bottomley
@ 2008-10-02 16:58   ` Jens Axboe
  2008-10-02 17:12     ` James Bottomley
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2008-10-02 16:58 UTC (permalink / raw)
  To: James Bottomley
  Cc: Nikanth Karthikesan, linux-kernel, linux-scsi, FUJITA Tomonori

On Thu, Oct 02 2008, James Bottomley wrote:
> The bug would appear to be that we sometimes only look at q->max_sectors
> when deciding on mergability.  Either we have to insist on max_sectors
> <= hw_max_sectors, or we have to start using min(q->max_sectors,
> q->max_hw_sectors) for this.

q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
be sending down requests that are too large for the device to handle. So
that condition would be a big bug. The sysfs interface checks for this,
and blk_queue_max_sectors() makes sure that is true as well.

The fixes proposed still look weird. There is no phys vs hw segment
constraints, the request must adhere to the limits set by both. It's
mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
anyway.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
  2008-10-02 16:58   ` Jens Axboe
@ 2008-10-02 17:12     ` James Bottomley
  2008-10-02 17:13       ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: James Bottomley @ 2008-10-02 17:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Nikanth Karthikesan, linux-kernel, linux-scsi, FUJITA Tomonori

On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> On Thu, Oct 02 2008, James Bottomley wrote:
> > The bug would appear to be that we sometimes only look at q->max_sectors
> > when deciding on mergability.  Either we have to insist on max_sectors
> > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > q->max_hw_sectors) for this.
> 
> q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> be sending down requests that are too large for the device to handle. So
> that condition would be a big bug. The sysfs interface checks for this,
> and blk_queue_max_sectors() makes sure that is true as well.

Yes, that seems always to be enforced.  Perhaps there are other ways of
tripping this problem ... I'm still sure if it occurs it's because we do
a physical merge where a virtual merge is forbidden.

> The fixes proposed still look weird. There is no phys vs hw segment
> constraints, the request must adhere to the limits set by both. It's
> mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> anyway.

Agree all three proposed fixes look wrong.  However, if it's what I
think, just getting rid of hw accounting won't fix the problem because
we'll still have the case where a physical merge is forbidden by iommu
constraints ... this still needs to be accounted for.

What we really need is a demonstration of what actually is going
wrong ...

James



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
  2008-10-02 17:12     ` James Bottomley
@ 2008-10-02 17:13       ` Jens Axboe
  2008-10-03  5:28         ` Nikanth Karthikesan
  2008-10-06 17:24         ` FUJITA Tomonori
  0 siblings, 2 replies; 12+ messages in thread
From: Jens Axboe @ 2008-10-02 17:13 UTC (permalink / raw)
  To: James Bottomley
  Cc: Nikanth Karthikesan, linux-kernel, linux-scsi, FUJITA Tomonori

On Thu, Oct 02 2008, James Bottomley wrote:
> On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > On Thu, Oct 02 2008, James Bottomley wrote:
> > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > when deciding on mergability.  Either we have to insist on max_sectors
> > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > q->max_hw_sectors) for this.
> > 
> > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > be sending down requests that are too large for the device to handle. So
> > that condition would be a big bug. The sysfs interface checks for this,
> > and blk_queue_max_sectors() makes sure that is true as well.
> 
> Yes, that seems always to be enforced.  Perhaps there are other ways of
> tripping this problem ... I'm still sure if it occurs it's because we do
> a physical merge where a virtual merge is forbidden.
> 
> > The fixes proposed still look weird. There is no phys vs hw segment
> > constraints, the request must adhere to the limits set by both. It's
> > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > anyway.
> 
> Agree all three proposed fixes look wrong.  However, if it's what I
> think, just getting rid of hw accounting won't fix the problem because
> we'll still have the case where a physical merge is forbidden by iommu
> constraints ... this still needs to be accounted for.
> 
> What we really need is a demonstration of what actually is going
> wrong ...

Yep, IIRC we both asked for that the last time as well... Nikanth?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
  2008-10-02 17:13       ` Jens Axboe
@ 2008-10-03  5:28         ` Nikanth Karthikesan
  2008-10-06 17:24         ` FUJITA Tomonori
  1 sibling, 0 replies; 12+ messages in thread
From: Nikanth Karthikesan @ 2008-10-03  5:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: James Bottomley, linux-kernel, linux-scsi, FUJITA Tomonori

On Thursday 02 October 2008 22:43:57 Jens Axboe wrote:
> On Thu, Oct 02 2008, James Bottomley wrote:
> > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > The bug would appear to be that we sometimes only look at
> > > > q->max_sectors when deciding on mergability.  Either we have to
> > > > insist on max_sectors <= hw_max_sectors, or we have to start using
> > > > min(q->max_sectors, q->max_hw_sectors) for this.
> > >
> > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > be sending down requests that are too large for the device to handle.
> > > So that condition would be a big bug. The sysfs interface checks for
> > > this, and blk_queue_max_sectors() makes sure that is true as well.
> >
> > Yes, that seems always to be enforced.  Perhaps there are other ways of
> > tripping this problem ... I'm still sure if it occurs it's because we do
> > a physical merge where a virtual merge is forbidden.
> >
> > > The fixes proposed still look weird. There is no phys vs hw segment
> > > constraints, the request must adhere to the limits set by both. It's
> > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > anyway.
> >
> > Agree all three proposed fixes look wrong.  However, if it's what I
> > think, just getting rid of hw accounting won't fix the problem because
> > we'll still have the case where a physical merge is forbidden by iommu
> > constraints ... this still needs to be accounted for.
> >

Yes. This patch fixes it.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..9c2fe49 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -392,7 +392,11 @@ static int ll_merge_requests_fn(struct request_queue *q, 
struct request *req,
 		return 0;
 
 	total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
-	if (blk_phys_contig_segment(q, req->biotail, next->bio))
+	if (blk_phys_contig_segment(q, req->biotail, next->bio) &&
+	    BIOVEC_VIRT_MERGEABLE(__BVEC_END(req->biotail),
+					__BVEC_START(next->bio)) &&
+	    !BIOVEC_VIRT_OVERSIZE(req->biotail->bi_hw_back_size
+					+ next->bio->bi_hw_front_size))
 		total_phys_segments--;
 
 	if (total_phys_segments > q->max_phys_segments)


> > What we really need is a demonstration of what actually is going
> > wrong ...
>
> Yep, IIRC we both asked for that the last time as well... Nikanth?

Sorry, I had already sent the blktrace and some code snippet that would 
reproduce the condition. Here is the module and user-space program that can 
trigger this condition.

diskbiomod.c
---

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/blkdev.h>
#include <linux/miscdevice.h>
/*
** Shared between the driver module and application.
*/
#define MOD_NODE        "/dev/DiskBio"
#define MOD_NAME        "DiskBio"

/*
** IOCTLs
*/
#define DB_IOCTL_DISK           0x00000001
#define DB_IOCTL_GENDISK        0x00000002

/*
** MODULE Defines.
*/
MODULE_DESCRIPTION("DiskBio - Initiate Disk I/O Traffic");
MODULE_LICENSE("GPL");


/*
** Module Defines.
*/
#define MOD_VERSION	"09-11-2008-1"
#define NUM_BIO		8
#define NUMBER_OF_VECS	16
#define START_LBA	0x5e5c8b


/*
** Module Globals.
*/
static struct gendisk	*(*get_GenDisk)(dev_t, int *)	= NULL;
static int		ErrorsOccurred			= 0;


typedef struct {
	int	IoSize;
	char	*Mem;
} DB_MEM_LIST, *pDB_MEM_LIST;

DB_MEM_LIST	dbMemList[NUM_BIO] = {
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x7000,	NULL },
	{ 0x7000,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
};


/*
** Wait Queue.
*/
DECLARE_WAIT_QUEUE_HEAD(db_wait_queue);


/*
** db_end_io()
*/
static void/*int*/
db_end_io(struct bio *Bio,/* unsigned int Bytes_Done,*/ int Error)
{
	static int Completions = 0;

	if(Error) {
		++ErrorsOccurred;
		printk("%s: End_Io Error=0x%x\n", MOD_NAME, Error);
	}
	if(Bio->bi_size) {
		++ErrorsOccurred;
		printk("%s: End_Io bi_size=0x%x\n", MOD_NAME, Bio->bi_size);
	}

	if(++Completions < NUM_BIO) {
		return;//(0);
	}
	Completions = 0;

	wake_up(&db_wait_queue);
	return;//(0);
} /* end - db_end_io() */


/*
** db_do_io()
*/
static int
db_do_io(struct gendisk *GD)
{
	int			Loop;
	int			RC			= 0;
	struct bio		*BioList[NUM_BIO]	= { NULL };
	struct bio		*Bio			= NULL;
	struct block_device	*Bdev;
	long			Sector;

	/*
	** Get block device structure.
	*/
	Bdev = bdget_disk(GD, 0);
	if(!Bdev) {
		RC = -ENOMEM;
		goto OutOfHere;
	}

	/*
	** Allocate BIOs.
	*/
	for(Loop = 0; Loop < NUM_BIO; Loop++) {
		BioList[Loop] = bio_alloc(GFP_KERNEL, NUMBER_OF_VECS);
		if(!BioList[Loop]) {
			RC = -ENOMEM;
			goto OutOfHere;
		}
	}

	/*
	** Initialize the I/O.
	*/
	Sector = START_LBA;
	for(Loop = 0; Loop < NUM_BIO; Loop++) {
		Bio = BioList[Loop];
		Bio->bi_sector			= Sector;
		Bio->bi_bdev			= Bdev;
		Bio->bi_end_io			= db_end_io;
		if (Loop == 0) {
			if (!bio_add_page(BioList[Loop],
					virt_to_page(dbMemList[Loop].Mem),
					0x3000,
					offset_in_page(dbMemList[Loop].Mem))) {
				printk("bio add page failed- %dA\n", Loop);
			}
			if (!bio_add_page(BioList[Loop],
					virt_to_page(dbMemList[7].Mem),
					0x3e00,
					offset_in_page(dbMemList[7].Mem))) {
				printk("bio add page failed- %dB\n", Loop);
			}
		} else if (Loop == 7) {
			if (!bio_add_page(BioList[Loop],
					virt_to_page(dbMemList[Loop].Mem),
					0x2000,
					offset_in_page(dbMemList[Loop].Mem))) {
				printk("bio add page failed- %dA\n", Loop);
			}
			if (!bio_add_page(BioList[Loop],
					virt_to_page(dbMemList[7].Mem),
					0x4e00,
					offset_in_page(dbMemList[7].Mem))) {
				printk("bio add page failed- %dB\n", Loop);
			}
		} else if (!bio_add_page(BioList[Loop],
					virt_to_page(dbMemList[Loop].Mem),
					dbMemList[Loop].IoSize,
					offset_in_page(dbMemList[Loop].Mem))) {
			printk("bio add page failed %d\n", Loop);
		}
		Sector += (Bio->bi_size / 512);
	}
	ErrorsOccurred = 0;

	/*
	** Issue the I/Os.
	*/
	submit_bio(READ, BioList[0]);	// 1
	submit_bio(READ, BioList[1]);	// 2
	submit_bio(READ, BioList[3]);	// 4
	submit_bio(READ, BioList[2]);	// 3
	submit_bio(READ, BioList[5]);	// 6
	submit_bio(READ, BioList[6]);	// 7
	submit_bio(READ, BioList[4]);	// 5
	submit_bio(READ, BioList[7]);	// 8

	/*
	** Wait for the completion.
	*/
	sleep_on(&db_wait_queue);

	/*
	** Release the resources.
	*/
OutOfHere:
	for(Loop = 0; Loop < NUM_BIO; Loop++) {
		if(BioList[Loop]) bio_put(BioList[Loop]);
	}
	if(ErrorsOccurred)
		RC = -EIO;
	
	return(RC);
} /* end - db_do_io() */


/*
** db_ioctl
*/
static int
db_ioctl(struct inode *inode, struct file *file, uint Command, ulong Arg)
{
	int		Status	= 0;
	struct gendisk	*GenDisk;
	int		Part;
	char		*Mem0	= NULL;
	char		*Mem1	= NULL;
	char		*Mem2	= NULL;
	char		*Mem5	= NULL;
	char		*Mem6	= NULL;
	char		*Mem7	= NULL;

	switch(Command) {
	case DB_IOCTL_GENDISK:
		get_GenDisk = (void *)Arg;
		break;

	case DB_IOCTL_DISK:
		Arg = new_decode_dev(Arg);
		if(!get_GenDisk) {
			Status = -EADDRNOTAVAIL;
			break;
		}
		GenDisk = get_GenDisk(Arg, &Part);
		if(!GenDisk) {
			Status = -ENODEV;
			break;
		}

		/*
		** Allocate Memory.
		*/
		Mem0 = kmalloc(dbMemList[0].IoSize, GFP_KERNEL);
		Mem1 = kmalloc(dbMemList[1].IoSize, GFP_KERNEL);
		Mem2 = kmalloc((dbMemList[2].IoSize + dbMemList[3].IoSize + 
dbMemList[4].IoSize), GFP_KERNEL);
		Mem5 = kmalloc(dbMemList[5].IoSize, GFP_KERNEL);
		Mem6 = kmalloc(dbMemList[6].IoSize, GFP_KERNEL);
		Mem7 = kmalloc(dbMemList[7].IoSize, GFP_KERNEL);

		if(!Mem0 || !Mem1 || !Mem2 || !Mem5 || !Mem6 || !Mem7) {
			Status = -ENOMEM;
			goto OutOfHere2;
		}

		dbMemList[0].Mem = Mem0;
		dbMemList[1].Mem = Mem1;
		dbMemList[2].Mem = Mem2;
		dbMemList[3].Mem = dbMemList[2].Mem + dbMemList[2].IoSize;
		dbMemList[4].Mem = dbMemList[3].Mem + dbMemList[3].IoSize;
		dbMemList[5].Mem = Mem5;
		dbMemList[6].Mem = Mem6;
		dbMemList[7].Mem = Mem7;

		Status = db_do_io(GenDisk);

OutOfHere2:
		if(Mem0) kfree(Mem0);
		if(Mem1) kfree(Mem1);
		if(Mem2) kfree(Mem2);
		if(Mem5) kfree(Mem5);
		if(Mem6) kfree(Mem6);
		if(Mem7) kfree(Mem7);
		break;

	default:
		printk("%s: Unknown Ioctl Command (0x%x)\n.", MOD_NAME, Command);
		Status = -EINVAL;
		break;
	}

	return(Status);
} /* end db_ioctl() */


/*
** File operations structure.
*/
static struct file_operations	db_fops = {
	.owner		= THIS_MODULE,
	.ioctl		= db_ioctl,
};


/*
** Misc Device structure.
*/
static struct miscdevice db_miscdev = {
	MISC_DYNAMIC_MINOR,
	MOD_NAME,
	&db_fops,
};


/*
** db_init()
*/
int __init
db_init(void)
{
	int	Error = 0;

	printk("%s: Init: ENTER.\n", MOD_NAME);
	printk("%s: Version %s.\n", MOD_NAME, MOD_VERSION);

	Error = misc_register(&db_miscdev);
	if(Error)
		printk("%s: Misc Register Failure!\n", MOD_NAME);
	
	printk("%s: Init: DONE.\n", MOD_NAME);
	return(Error);
} /* end - db_init() */


/*
** db_exit()
*/
void __exit
db_exit(void)
{
	printk("%s: Exit: ENTER.\n", MOD_NAME);

	misc_deregister(&db_miscdev);
		
	printk("%s: Exit: DONE.\n", MOD_NAME);
} /* end - db_exit() */


/*
** Module Entry Points.
*/
module_init(db_init);
module_exit(db_exit);

---
The Userspace utility to send ioctl to the module.
diskbio.c
---

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <libgen.h>
/*
** Shared between the driver module and application.
*/
#define MOD_NODE        "/dev/DiskBio"
#define MOD_NAME        "DiskBio"

/*
** IOCTLs
*/
#define DB_IOCTL_DISK           0x00000001
#define DB_IOCTL_GENDISK        0x00000002

#define	DEVICE_BUF_SIZE	100
#define GET_GENDISK	"get_gendisk"

int
OpenNode(void)
{
	int		IoPtr;
	FILE		*Fd;
	char		Buf[DEVICE_BUF_SIZE];
	unsigned long	Symbol_get_gendisk	= 0;

	/*
	** Get the symbols we need so they can be sent to the driver.
	*/
	if((Fd = fopen("/proc/kallsyms", "r")) == NULL) {
		perror("open");
		fprintf(stderr, "Can Not Get Symbol Values!\n");
		return(-1);
	}

	/*
	** Scan the /proc/kallsyms file for the symbols and addresses.
	*/
	while(fgets(Buf, DEVICE_BUF_SIZE, Fd)) {
		unsigned long	SymbolAddress;
		char		Tag[DEVICE_BUF_SIZE];
		char		Name[DEVICE_BUF_SIZE];

		if(sscanf(Buf, "%lx %s %s", &SymbolAddress, Tag, Name) != 3)
			continue;

		/*
		** Check for a symbol match.
		*/
		if(!strcmp(Name, GET_GENDISK)) {
			Symbol_get_gendisk = SymbolAddress;
			break;
		}
	}

	/*
	** Done with this file.
	*/
	fclose(Fd);

	/*
	** Make sure we found all of the symbols.
	*/
	if(!Symbol_get_gendisk) {
		fprintf(stderr, "%s Symbol NOT Found!\n", GET_GENDISK);
		return(-1);
	}

	/*
	** Device node should be present. Try and open.
	*/
	if((IoPtr = open(MOD_NODE, O_RDWR)) == -1) {
		perror("open");
		fprintf(stderr, "Can Not Open Module Device '%s'!\n", MOD_NODE);
		fprintf(stderr, "Unload and Reload the Module.\n");
		return(-1);
	}

	/*
	** Send the symbol addresses to the module.
	*/
	if(ioctl(IoPtr, DB_IOCTL_GENDISK, Symbol_get_gendisk)) {
		perror("ioctl");
		fprintf(stderr, "Unable to Send GENDISK Symbol to Module!\n");
		close(IoPtr);
		return(-1);
	}

	return(IoPtr);
}

int
main(int argc, char *argv[])
{
	int		RC		= 0;
	int		IoPtr		= 0;
	int		DevPtr		= 0;
	unsigned long	DeviceNumber;
	struct stat	StatBuf;

	/*
	** Check Arguments.
	*/
	if(argc < 2) {
		fprintf(stderr, "Usage: %s /dev/sdX\n", basename(argv[0]));
		exit(1);
	}

	/*
	** Open the device node to the driver interface.
	*/
	if((IoPtr = OpenNode()) == -1) {
		exit(2);
	}

	/*
	** Open the device.
	*/
	DevPtr = open(argv[1], O_RDWR);
	if(DevPtr == -1) {
		perror("open");
		fprintf(stderr, "Open Failed for Device '%s'\n", argv[1]);
		RC = 3;
		goto out;
	}

	/*
	** Make sure the device is a block device.
	*/
	if(fstat(DevPtr, &StatBuf)) {
		perror("fstat");
		fprintf(stderr, "Unable to stat device '%s'.\n", argv[1]);
		RC=4;
		goto out;
	}
	if(!S_ISBLK(StatBuf.st_mode)) {
		fprintf(stderr, "Device '%s' not a block device.\n", argv[1]);
		RC = 5;
		goto out;
	}

	/*
	** Get Device Number.
	*/
	DeviceNumber = StatBuf.st_rdev;

	/*
	** Send the device number to the driver
	** so it can do an I/O for us.
	*/
	RC = ioctl(IoPtr, DB_IOCTL_DISK, DeviceNumber);
	if(RC) {
		perror("ioctl");
		fprintf(stderr, "Ioctl Failed (%d).\n", RC);
	}

out:
	if(DevPtr > -1)	close(DevPtr);
	if(IoPtr > -1)	close(IoPtr);
	exit(RC);
}


Thanks
Nikanth Karthikesan







^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
  2008-10-02 17:13       ` Jens Axboe
  2008-10-03  5:28         ` Nikanth Karthikesan
@ 2008-10-06 17:24         ` FUJITA Tomonori
  2008-10-10 12:03           ` Jens Axboe
  1 sibling, 1 reply; 12+ messages in thread
From: FUJITA Tomonori @ 2008-10-06 17:24 UTC (permalink / raw)
  To: jens.axboe
  Cc: James.Bottomley, knikanth, linux-kernel, linux-scsi, fujita.tomonori

On Thu, 2 Oct 2008 19:13:57 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> On Thu, Oct 02 2008, James Bottomley wrote:
> > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > > when deciding on mergability.  Either we have to insist on max_sectors
> > > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > > q->max_hw_sectors) for this.
> > > 
> > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > be sending down requests that are too large for the device to handle. So
> > > that condition would be a big bug. The sysfs interface checks for this,
> > > and blk_queue_max_sectors() makes sure that is true as well.
> > 
> > Yes, that seems always to be enforced.  Perhaps there are other ways of
> > tripping this problem ... I'm still sure if it occurs it's because we do
> > a physical merge where a virtual merge is forbidden.
> > 
> > > The fixes proposed still look weird. There is no phys vs hw segment
> > > constraints, the request must adhere to the limits set by both. It's
> > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > anyway.
> > 
> > Agree all three proposed fixes look wrong.  However, if it's what I
> > think, just getting rid of hw accounting won't fix the problem because
> > we'll still have the case where a physical merge is forbidden by iommu
> > constraints ... this still needs to be accounted for.
> > 
> > What we really need is a demonstration of what actually is going
> > wrong ...
> 
> Yep, IIRC we both asked for that the last time as well... Nikanth?

Possibly, blk_phys_contig_segment might miscalculate
q->max_segment_size?

blk_phys_contig_segment does:

req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size;

But it's possible that req->biotail and the previous bio are supposed
be merged into one segment? Then we could create too large segment
here.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
  2008-10-06 17:24         ` FUJITA Tomonori
@ 2008-10-10 12:03           ` Jens Axboe
  2008-10-10 12:32             ` FUJITA Tomonori
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2008-10-10 12:03 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: James.Bottomley, knikanth, linux-kernel, linux-scsi

On Tue, Oct 07 2008, FUJITA Tomonori wrote:
> On Thu, 2 Oct 2008 19:13:57 +0200
> Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Thu, Oct 02 2008, James Bottomley wrote:
> > > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > > > when deciding on mergability.  Either we have to insist on max_sectors
> > > > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > > > q->max_hw_sectors) for this.
> > > > 
> > > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > > be sending down requests that are too large for the device to handle. So
> > > > that condition would be a big bug. The sysfs interface checks for this,
> > > > and blk_queue_max_sectors() makes sure that is true as well.
> > > 
> > > Yes, that seems always to be enforced.  Perhaps there are other ways of
> > > tripping this problem ... I'm still sure if it occurs it's because we do
> > > a physical merge where a virtual merge is forbidden.
> > > 
> > > > The fixes proposed still look weird. There is no phys vs hw segment
> > > > constraints, the request must adhere to the limits set by both. It's
> > > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > > anyway.
> > > 
> > > Agree all three proposed fixes look wrong.  However, if it's what I
> > > think, just getting rid of hw accounting won't fix the problem because
> > > we'll still have the case where a physical merge is forbidden by iommu
> > > constraints ... this still needs to be accounted for.
> > > 
> > > What we really need is a demonstration of what actually is going
> > > wrong ...
> > 
> > Yep, IIRC we both asked for that the last time as well... Nikanth?
> 
> Possibly, blk_phys_contig_segment might miscalculate
> q->max_segment_size?
> 
> blk_phys_contig_segment does:
> 
> req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size;
> 
> But it's possible that req->biotail and the previous bio are supposed
> be merged into one segment? Then we could create too large segment
> here.

Hmm yes, that looks like it could indeed be a problem! We could fix this
with similar logic to what we used to do for the hw merging, keep track
of the current segment size that this bio belongs to, so it would end up
ala

        if (blk_phys_contig_segment(q, req->biotail, next->bio) &&
            rq->biotail->bi_seg_size + next->bio->bi_size <= q->max_segment_size) {
                total_phys_segments--;
                next->bio->bi_seg_size = rq->biotail->bi_seg_size + next->bio->bi_size;
        }

for the merge part.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
  2008-10-10 12:03           ` Jens Axboe
@ 2008-10-10 12:32             ` FUJITA Tomonori
  2008-10-10 12:37               ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: FUJITA Tomonori @ 2008-10-10 12:32 UTC (permalink / raw)
  To: jens.axboe
  Cc: fujita.tomonori, James.Bottomley, knikanth, linux-kernel, linux-scsi

On Fri, 10 Oct 2008 14:03:44 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> On Tue, Oct 07 2008, FUJITA Tomonori wrote:
> > On Thu, 2 Oct 2008 19:13:57 +0200
> > Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > > > > when deciding on mergability.  Either we have to insist on max_sectors
> > > > > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > > > > q->max_hw_sectors) for this.
> > > > > 
> > > > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > > > be sending down requests that are too large for the device to handle. So
> > > > > that condition would be a big bug. The sysfs interface checks for this,
> > > > > and blk_queue_max_sectors() makes sure that is true as well.
> > > > 
> > > > Yes, that seems always to be enforced.  Perhaps there are other ways of
> > > > tripping this problem ... I'm still sure if it occurs it's because we do
> > > > a physical merge where a virtual merge is forbidden.
> > > > 
> > > > > The fixes proposed still look weird. There is no phys vs hw segment
> > > > > constraints, the request must adhere to the limits set by both. It's
> > > > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > > > anyway.
> > > > 
> > > > Agree all three proposed fixes look wrong.  However, if it's what I
> > > > think, just getting rid of hw accounting won't fix the problem because
> > > > we'll still have the case where a physical merge is forbidden by iommu
> > > > constraints ... this still needs to be accounted for.
> > > > 
> > > > What we really need is a demonstration of what actually is going
> > > > wrong ...
> > > 
> > > Yep, IIRC we both asked for that the last time as well... Nikanth?
> > 
> > Possibly, blk_phys_contig_segment might miscalculate
> > q->max_segment_size?
> > 
> > blk_phys_contig_segment does:
> > 
> > req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size;
> > 
> > But it's possible that req->biotail and the previous bio are supposed
> > be merged into one segment? Then we could create too large segment
> > here.
> 
> Hmm yes, that looks like it could indeed be a problem!

I think so.


> We could fix this
> with similar logic to what we used to do for the hw merging, keep track
> of the current segment size that this bio belongs to, so it would end up
> ala

Yeah, exactly.

You want a fix for this 2.6.28? Or disable this feature for 2.6.28?


>         if (blk_phys_contig_segment(q, req->biotail, next->bio) &&
>             rq->biotail->bi_seg_size + next->bio->bi_size <= q->max_segment_size) {
>                 total_phys_segments--;
>                 next->bio->bi_seg_size = rq->biotail->bi_seg_size + next->bio->bi_size;
>         }
> 
> for the merge part.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
  2008-10-10 12:32             ` FUJITA Tomonori
@ 2008-10-10 12:37               ` Jens Axboe
  2008-10-10 12:49                 ` FUJITA Tomonori
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2008-10-10 12:37 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: James.Bottomley, knikanth, linux-kernel, linux-scsi

On Fri, Oct 10 2008, FUJITA Tomonori wrote:
> On Fri, 10 Oct 2008 14:03:44 +0200
> Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Tue, Oct 07 2008, FUJITA Tomonori wrote:
> > > On Thu, 2 Oct 2008 19:13:57 +0200
> > > Jens Axboe <jens.axboe@oracle.com> wrote:
> > > 
> > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > > > > > when deciding on mergability.  Either we have to insist on max_sectors
> > > > > > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > > > > > q->max_hw_sectors) for this.
> > > > > > 
> > > > > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > > > > be sending down requests that are too large for the device to handle. So
> > > > > > that condition would be a big bug. The sysfs interface checks for this,
> > > > > > and blk_queue_max_sectors() makes sure that is true as well.
> > > > > 
> > > > > Yes, that seems always to be enforced.  Perhaps there are other ways of
> > > > > tripping this problem ... I'm still sure if it occurs it's because we do
> > > > > a physical merge where a virtual merge is forbidden.
> > > > > 
> > > > > > The fixes proposed still look weird. There is no phys vs hw segment
> > > > > > constraints, the request must adhere to the limits set by both. It's
> > > > > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > > > > anyway.
> > > > > 
> > > > > Agree all three proposed fixes look wrong.  However, if it's what I
> > > > > think, just getting rid of hw accounting won't fix the problem because
> > > > > we'll still have the case where a physical merge is forbidden by iommu
> > > > > constraints ... this still needs to be accounted for.
> > > > > 
> > > > > What we really need is a demonstration of what actually is going
> > > > > wrong ...
> > > > 
> > > > Yep, IIRC we both asked for that the last time as well... Nikanth?
> > > 
> > > Possibly, blk_phys_contig_segment might miscalculate
> > > q->max_segment_size?
> > > 
> > > blk_phys_contig_segment does:
> > > 
> > > req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size;
> > > 
> > > But it's possible that req->biotail and the previous bio are supposed
> > > be merged into one segment? Then we could create too large segment
> > > here.
> > 
> > Hmm yes, that looks like it could indeed be a problem!
> 
> I think so.
> 
> 
> > We could fix this
> > with similar logic to what we used to do for the hw merging, keep track
> > of the current segment size that this bio belongs to, so it would end up
> > ala
> 
> Yeah, exactly.
> 
> You want a fix for this 2.6.28? Or disable this feature for 2.6.28?

Lets fix it. It wont be part of the initial merge, since it'll need some
dedicated testing, but we can get it there for 2.6.28. Shall I interpret
your message as willingness to write up the fix? :)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments
  2008-10-10 12:37               ` Jens Axboe
@ 2008-10-10 12:49                 ` FUJITA Tomonori
  0 siblings, 0 replies; 12+ messages in thread
From: FUJITA Tomonori @ 2008-10-10 12:49 UTC (permalink / raw)
  To: jens.axboe
  Cc: fujita.tomonori, James.Bottomley, knikanth, linux-kernel, linux-scsi

On Fri, 10 Oct 2008 14:37:19 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> On Fri, Oct 10 2008, FUJITA Tomonori wrote:
> > On Fri, 10 Oct 2008 14:03:44 +0200
> > Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > > On Tue, Oct 07 2008, FUJITA Tomonori wrote:
> > > > On Thu, 2 Oct 2008 19:13:57 +0200
> > > > Jens Axboe <jens.axboe@oracle.com> wrote:
> > > > 
> > > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > > > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > > > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > > > > > > when deciding on mergability.  Either we have to insist on max_sectors
> > > > > > > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > > > > > > q->max_hw_sectors) for this.
> > > > > > > 
> > > > > > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > > > > > be sending down requests that are too large for the device to handle. So
> > > > > > > that condition would be a big bug. The sysfs interface checks for this,
> > > > > > > and blk_queue_max_sectors() makes sure that is true as well.
> > > > > > 
> > > > > > Yes, that seems always to be enforced.  Perhaps there are other ways of
> > > > > > tripping this problem ... I'm still sure if it occurs it's because we do
> > > > > > a physical merge where a virtual merge is forbidden.
> > > > > > 
> > > > > > > The fixes proposed still look weird. There is no phys vs hw segment
> > > > > > > constraints, the request must adhere to the limits set by both. It's
> > > > > > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > > > > > anyway.
> > > > > > 
> > > > > > Agree all three proposed fixes look wrong.  However, if it's what I
> > > > > > think, just getting rid of hw accounting won't fix the problem because
> > > > > > we'll still have the case where a physical merge is forbidden by iommu
> > > > > > constraints ... this still needs to be accounted for.
> > > > > > 
> > > > > > What we really need is a demonstration of what actually is going
> > > > > > wrong ...
> > > > > 
> > > > > Yep, IIRC we both asked for that the last time as well... Nikanth?
> > > > 
> > > > Possibly, blk_phys_contig_segment might miscalculate
> > > > q->max_segment_size?
> > > > 
> > > > blk_phys_contig_segment does:
> > > > 
> > > > req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size;
> > > > 
> > > > But it's possible that req->biotail and the previous bio are supposed
> > > > be merged into one segment? Then we could create too large segment
> > > > here.
> > > 
> > > Hmm yes, that looks like it could indeed be a problem!
> > 
> > I think so.
> > 
> > 
> > > We could fix this
> > > with similar logic to what we used to do for the hw merging, keep track
> > > of the current segment size that this bio belongs to, so it would end up
> > > ala
> > 
> > Yeah, exactly.
> > 
> > You want a fix for this 2.6.28? Or disable this feature for 2.6.28?
> 
> Lets fix it. It wont be part of the initial merge, since it'll need some
> dedicated testing, but we can get it there for 2.6.28. Shall I interpret
> your message as willingness to write up the fix? :)

Yeah, it's on this weekend todo list. :) I want to look at the code
again and make sure I correctly understand the problem.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2008-10-10 12:57 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-02 14:29 [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments Nikanth Karthikesan
2008-10-02 15:03 ` James Bottomley
2008-10-02 16:58   ` Jens Axboe
2008-10-02 17:12     ` James Bottomley
2008-10-02 17:13       ` Jens Axboe
2008-10-03  5:28         ` Nikanth Karthikesan
2008-10-06 17:24         ` FUJITA Tomonori
2008-10-10 12:03           ` Jens Axboe
2008-10-10 12:32             ` FUJITA Tomonori
2008-10-10 12:37               ` Jens Axboe
2008-10-10 12:49                 ` FUJITA Tomonori
2008-10-02 15:05 ` Nikanth Karthikesan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).