linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* PATCH for Corrupted IO on all block devices
@ 2001-07-18  1:37 David J. Picard
  2001-07-18  1:43 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: David J. Picard @ 2001-07-18  1:37 UTC (permalink / raw)
  To: axboe; +Cc: torvalds, linux-kernel

Basically, what is happening is the read requests are being pushed to
the front of the IO queue - before the preceding write for the same
sector. This is VERY BAD in that applications that perform repeated IO
to the same sector sequentially - like databases - are not guaranteed
the IO will be performed in the order the operations were requested.

The patch follows, then a test utility for the presence of the defect,
and last some background, detail regarding suspected impact of the patch
and next steps.

/* ---- START OVERLAPPING_IO_CORRUPTION PATCH --- */
--- linux/drivers/block/elevator.c	Tue Jul 17 20:56:09 2001
+++ linux-2.4.6-unchanged/drivers/block/elevator.c	Tue Jul 17 20:55:41
2001
@@ -18,6 +18,11 @@
  * Removed tests for max-bomb-segments, which was breaking elvtune
  *  when run without -bN
  *
+ * 17072001 Dave Picard <dave@psind.com> :
+ * Modified the default merge function to stop pushing an IO request
towards
+ *  the front of the queue when it bumps up against an overlapping IO
+ *  request. This was corrupting apps with high IO to the same file,
like
+ *  high traffic databases.
  */
 
 #include <linux/fs.h>
@@ -74,6 +79,39 @@
 	return 0;
 }
 
+/**
+ * bh and rq may be for overlapping regions on the disk. If this is the
case,
+ * it's imperative the order of the io operations be maintained in
order to
+ * guarantee the results be consistent with the order of the requests.
+ */
+inline int bh_rq_in_block(struct buffer_head *bh, unsigned int count,
+			  struct request *rq, struct list_head *head)
+{
+	struct list_head *next;
+	struct request *next_rq;
+
+	/*
+	 * if the io overlaps with this request, consider it overlapping
+	 */
+	if ( ((rq->sector + rq->nr_sectors) > bh->b_rsector) &&
+	     (rq->sector < (bh->b_rsector + count)) )
+	        return 1;
+
+	next = rq->queue.next;
+	if (next == head)
+		return 0;
+
+	/*
+	 * if the io overlaps with the next request, consider it overlapping
+	 */
+	next_rq = blkdev_entry_to_request(next);
+	if ( ((next_rq->sector + next_rq->nr_sectors) > bh->b_rsector) &&
+	     (next_rq->sector < (bh->b_rsector + count)) )
+	        return 1;
+
+	return 0;
+}
+
 
 int elevator_linus_merge(request_queue_t *q, struct request **req,
 			 struct list_head * head,
@@ -98,6 +136,10 @@
 			continue;
 		if (!*req && bh_rq_in_between(bh, __rq, &q->queue_head))
 			*req = __rq;
+		if (bh_rq_in_block(bh, count, __rq, &q->queue_head)) {
+			*req = __rq;
+			break;
+		}
 		if (__rq->cmd != rw)
 			continue;
 		if (__rq->nr_sectors + count > max_sectors)
/* ---- END OVERLAPPING_IO_CORRUPTION PATCH --- */

The following code for a small test application demonstrates the data
corruption with the IO for block devices in the 2.4.6 kernel. Test code
follows, then more detail on next steps:

/* START TEST APPLICATION to validate defect is present or repaired */

#include <stdio.h>
#include <assert.h>

#define TEST_SZ 25000000
#define RD_BUFF_SZ 5000
int main(int argc, const char **argv, const char **env)
{
    FILE* fp;

    if(argc > 1) fp = fopen(argv[1], "r+");
    else fp =tmpfile();
    if(NULL != fp) {
        int j = -1;
	off_t o;
        while(1) {
            if(++j != TEST_SZ) {
	        if (j == (TEST_SZ - RD_BUFF_SZ) ) o = ftello(fp);
                fwrite(&j, sizeof(int), 1, fp);
            } else {
                int i, buffer[RD_BUFF_SZ];
		fflush(fp);
                fseek(fp, o, SEEK_SET);
                fread(buffer, sizeof(int), sizeof(buffer), fp);
                printf("Validating end of file writes\n");
                for(i = (RD_BUFF_SZ - 1); i >= 0; i--) {
                    assert(buffer[i] == --j) ;
                }
                rewind(fp);
		j = -1;
            }
        }
        return 1;
    }
    return 0;
}
/* END TEST APPLICATION */


This defect appeared to come about in an attempt to optimize Linux as a
web / file server by pushing all the read requests to the front of the
pending IO queue. I suspect this patch will have a slightly adverse
impact on that performance, but I believe the impact will be limited to
only those occurences where the IO is overlapping. This will force all
IO operations for the same overlapping sector to occur concurrently. 

There is now an opportunity for further optimization by checking for the
next request before a write operation is actually executed from the
queue and performing the next operation against the pending write buffer
if it is for one or many overlapping segment(s). This of course may
require the extension of the pending write buffer. This will minimize
the thrashing of the disk to update concurrent sections, but I am unsure
of the latencies involved with the execution of the queue.

Another question is whether or not sound devices (or any other devices
not related to storage) are considered to be block devices, in which
case this optimization would likely lose the intervening writes
resulting in some very poor sound quality. 

This optimization should not be performed at the cost of stability.

-- 
David J. Picard
  dave@psind.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: PATCH for Corrupted IO on all block devices
  2001-07-18  1:37 PATCH for Corrupted IO on all block devices David J. Picard
@ 2001-07-18  1:43 ` Linus Torvalds
  2001-07-18  2:37   ` Aaron Sethman
  2001-07-18  2:05 ` David S. Miller
  2001-07-18  2:16 ` Alexander Viro
  2 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2001-07-18  1:43 UTC (permalink / raw)
  To: dpicard; +Cc: axboe, linux-kernel


On Tue, 17 Jul 2001, David J. Picard wrote:
>
> Basically, what is happening is the read requests are being pushed to
> the front of the IO queue - before the preceding write for the same
> sector.

This is a bug in the USER, not in the code.

The locking is NOT supposed to be done at the elevator level (or, indeed
at ANY _io_ level), but must be done by upper layers.

If upper layers do not do this locking, then THAT is the bug.

What filesystem do you see the bug with?

		Linus


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: PATCH for Corrupted IO on all block devices
  2001-07-18  1:37 PATCH for Corrupted IO on all block devices David J. Picard
  2001-07-18  1:43 ` Linus Torvalds
@ 2001-07-18  2:05 ` David S. Miller
  2001-07-18  2:08   ` Linus Torvalds
  2001-07-18  2:35   ` Andrea Arcangeli
  2001-07-18  2:16 ` Alexander Viro
  2 siblings, 2 replies; 11+ messages in thread
From: David S. Miller @ 2001-07-18  2:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dpicard, axboe, linux-kernel


Linus Torvalds writes:
 > What filesystem do you see the bug with?

His report did specifically mention "databases".  But my initial
impression was the same as yours, that this is a bug in the user.

Later,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: PATCH for Corrupted IO on all block devices
  2001-07-18  2:05 ` David S. Miller
@ 2001-07-18  2:08   ` Linus Torvalds
  2001-07-18  2:35   ` Andrea Arcangeli
  1 sibling, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2001-07-18  2:08 UTC (permalink / raw)
  To: David S. Miller; +Cc: dpicard, axboe, linux-kernel


On Tue, 17 Jul 2001, David S. Miller wrote:
>
> Linus Torvalds writes:
>  > What filesystem do you see the bug with?
>
> His report did specifically mention "databases".  But my initial
> impression was the same as yours, that this is a bug in the user.

More detailed report indicates that it is actually on ext2. Which would be
really really bad. It doesn't make the patch correct, but the patch might
be a starting point for some debugging session (ie instead of refusing to
merge them, print out the state of the overlapping buffers to see if there
is some pattern to it..)

		Linus


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: PATCH for Corrupted IO on all block devices
  2001-07-18  2:37   ` Aaron Sethman
@ 2001-07-18  2:14     ` David J. Picard
  2001-07-18  2:51       ` Alexander Viro
  2001-07-18  3:03       ` David S. Miller
  2001-07-18  2:41     ` Aaron Sethman
  1 sibling, 2 replies; 11+ messages in thread
From: David J. Picard @ 2001-07-18  2:14 UTC (permalink / raw)
  To: Aaron Sethman; +Cc: linux-kernel

The issue was in a stack overflow in the test program, as Alexander
pointed out. Is the stack order different on Solaris et.al v. Linux?
Could this be why it worked so well on the other OS's? The correct
version of the program is as follows:

#include <stdio.h>
#include <assert.h>

#define TEST_SZ 25000000
#define RD_BUFF_SZ 5000
int main(int argc, const char **argv, const char **env)
{
    FILE* fp;

    if(argc > 1) fp = fopen(argv[1], "r+");
    else fp =tmpfile();
    if(NULL != fp) {
        int j = -1;
	off_t o;
        while(1) {
            if(++j != TEST_SZ) {
	        if (j == (TEST_SZ - RD_BUFF_SZ) ) o = ftello(fp);
                fwrite(&j, sizeof(int), 1, fp);
            } else {
                int i, buffer[RD_BUFF_SZ];
		fflush(fp);
                fseek(fp, o, SEEK_SET);
                fread(buffer, sizeof(int), RD_BUFF_SZ, fp);
                printf("Validating end of file writes (%d %d)\n",
sizeof(int),
		       RD_BUFF_SZ);
                for(i = (RD_BUFF_SZ - 1); i >= 0; i--) {
                    assert(buffer[i] == --j) ;
                }
                rewind(fp);
		j = -1;
            }
        }
        return 1;
    }
    return 0;
}


Aaron Sethman wrote:
> 
> I'd just like to add that the test program bombs on a reiserfs filesystem
> as well.  So if their is some sort of issue, its not just related to ext2.
> 
> Regards,
> 
> Aaron
> 
> On Tue, 17 Jul 2001, Linus Torvalds wrote:
> 
> >
> > On Tue, 17 Jul 2001, David J. Picard wrote:
> > >
> > > Basically, what is happening is the read requests are being pushed to
> > > the front of the IO queue - before the preceding write for the same
> > > sector.
> >
> > This is a bug in the USER, not in the code.
> >
> > The locking is NOT supposed to be done at the elevator level (or, indeed
> > at ANY _io_ level), but must be done by upper layers.
> >
> > If upper layers do not do this locking, then THAT is the bug.
> >
> > What filesystem do you see the bug with?
> >
> >               Linus
> >
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
> >

-- 
David J. Picard
  dave@psind.com

If you can keep your head when all about you are losing theirs,
  then you clearly don't understand the situation.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: PATCH for Corrupted IO on all block devices
  2001-07-18  1:37 PATCH for Corrupted IO on all block devices David J. Picard
  2001-07-18  1:43 ` Linus Torvalds
  2001-07-18  2:05 ` David S. Miller
@ 2001-07-18  2:16 ` Alexander Viro
  2 siblings, 0 replies; 11+ messages in thread
From: Alexander Viro @ 2001-07-18  2:16 UTC (permalink / raw)
  To: dpicard; +Cc: axboe, torvalds, linux-kernel



On Tue, 17 Jul 2001, David J. Picard wrote:

>                 int i, buffer[RD_BUFF_SZ];
> 		fflush(fp);
>                 fseek(fp, o, SEEK_SET);
>                 fread(buffer, sizeof(int), sizeof(buffer), fp);
				^^^^^^^^^^^^^^^^^^^^^^^^^^^

You've just smashed the stack. Big way. Basically, all your local variables
are junk after that point.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: PATCH for Corrupted IO on all block devices
  2001-07-18  2:05 ` David S. Miller
  2001-07-18  2:08   ` Linus Torvalds
@ 2001-07-18  2:35   ` Andrea Arcangeli
  1 sibling, 0 replies; 11+ messages in thread
From: Andrea Arcangeli @ 2001-07-18  2:35 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linus Torvalds, dpicard, axboe, linux-kernel

On Tue, Jul 17, 2001 at 07:05:32PM -0700, David S. Miller wrote:
> 
> Linus Torvalds writes:
>  > What filesystem do you see the bug with?
> 
> His report did specifically mention "databases".  But my initial
> impression was the same as yours, that this is a bug in the user.

it is:

--- corruption.c.orig	Wed Jul 18 04:14:15 2001
+++ corruption.c	Wed Jul 18 04:31:29 2001
@@ -22,7 +22,7 @@
                 int i, buffer[RD_BUFF_SZ];
 		fflush(fp);
                 fseek(fp, o, SEEK_SET);
-                fread(buffer, sizeof(int), sizeof(buffer), fp);
+                fread(buffer, 1, sizeof(buffer), fp);
                 printf("Validating end of file writes\n");
                 for(i = (RD_BUFF_SZ - 1); i >= 0; i--) {
                     assert(buffer[i] == --j) ;

Andrea

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: PATCH for Corrupted IO on all block devices
  2001-07-18  1:43 ` Linus Torvalds
@ 2001-07-18  2:37   ` Aaron Sethman
  2001-07-18  2:14     ` David J. Picard
  2001-07-18  2:41     ` Aaron Sethman
  0 siblings, 2 replies; 11+ messages in thread
From: Aaron Sethman @ 2001-07-18  2:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dpicard, axboe, linux-kernel

I'd just like to add that the test program bombs on a reiserfs filesystem
as well.  So if their is some sort of issue, its not just related to ext2.

Regards,

Aaron

On Tue, 17 Jul 2001, Linus Torvalds wrote:

>
> On Tue, 17 Jul 2001, David J. Picard wrote:
> >
> > Basically, what is happening is the read requests are being pushed to
> > the front of the IO queue - before the preceding write for the same
> > sector.
>
> This is a bug in the USER, not in the code.
>
> The locking is NOT supposed to be done at the elevator level (or, indeed
> at ANY _io_ level), but must be done by upper layers.
>
> If upper layers do not do this locking, then THAT is the bug.
>
> What filesystem do you see the bug with?
>
> 		Linus
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: PATCH for Corrupted IO on all block devices
  2001-07-18  2:37   ` Aaron Sethman
  2001-07-18  2:14     ` David J. Picard
@ 2001-07-18  2:41     ` Aaron Sethman
  1 sibling, 0 replies; 11+ messages in thread
From: Aaron Sethman @ 2001-07-18  2:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dpicard, axboe, linux-kernel

Ignore this one. It seems with Andrea's fix to the test program it runs
fine on reiserfs but bombs on an ext2fs.

Aaron

On Tue, 17 Jul 2001, Aaron Sethman wrote:

> I'd just like to add that the test program bombs on a reiserfs filesystem
> as well.  So if their is some sort of issue, its not just related to ext2.
>
> Regards,
>
> Aaron
>
> On Tue, 17 Jul 2001, Linus Torvalds wrote:
>
> >
> > On Tue, 17 Jul 2001, David J. Picard wrote:
> > >
> > > Basically, what is happening is the read requests are being pushed to
> > > the front of the IO queue - before the preceding write for the same
> > > sector.
> >
> > This is a bug in the USER, not in the code.
> >
> > The locking is NOT supposed to be done at the elevator level (or, indeed
> > at ANY _io_ level), but must be done by upper layers.
> >
> > If upper layers do not do this locking, then THAT is the bug.
> >
> > What filesystem do you see the bug with?
> >
> > 		Linus
> >
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
> >
>
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: PATCH for Corrupted IO on all block devices
  2001-07-18  2:14     ` David J. Picard
@ 2001-07-18  2:51       ` Alexander Viro
  2001-07-18  3:03       ` David S. Miller
  1 sibling, 0 replies; 11+ messages in thread
From: Alexander Viro @ 2001-07-18  2:51 UTC (permalink / raw)
  To: dpicard; +Cc: Aaron Sethman, linux-kernel



On Tue, 17 Jul 2001, David J. Picard wrote:

> The issue was in a stack overflow in the test program, as Alexander
> pointed out. Is the stack order different on Solaris et.al v. Linux?
> Could this be why it worked so well on the other OS's?

Stack on Sparc grows up. On x86 - down. Besides, on a different CPU/platform/
compiler you might get different register allocation and thus have a local
variable overwritten in one case happily survive in another (where it just
happened to live in a register).


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: PATCH for Corrupted IO on all block devices
  2001-07-18  2:14     ` David J. Picard
  2001-07-18  2:51       ` Alexander Viro
@ 2001-07-18  3:03       ` David S. Miller
  1 sibling, 0 replies; 11+ messages in thread
From: David S. Miller @ 2001-07-18  3:03 UTC (permalink / raw)
  To: Alexander Viro; +Cc: dpicard, Aaron Sethman, linux-kernel


Alexander Viro writes:
 > Stack on Sparc grows up.

nope, grows down on both platforms

The HPPA/Linux folks wouldn't have had so many problems if Sparc's
stack grew up rather than down :-)

Later,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2001-07-18  3:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-07-18  1:37 PATCH for Corrupted IO on all block devices David J. Picard
2001-07-18  1:43 ` Linus Torvalds
2001-07-18  2:37   ` Aaron Sethman
2001-07-18  2:14     ` David J. Picard
2001-07-18  2:51       ` Alexander Viro
2001-07-18  3:03       ` David S. Miller
2001-07-18  2:41     ` Aaron Sethman
2001-07-18  2:05 ` David S. Miller
2001-07-18  2:08   ` Linus Torvalds
2001-07-18  2:35   ` Andrea Arcangeli
2001-07-18  2:16 ` Alexander Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).