All of lore.kernel.org
 help / color / mirror / Atom feed
* writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
@ 2009-11-30 20:55 James Y Knight
  2009-12-01  0:48 ` James Y Knight
  0 siblings, 1 reply; 16+ messages in thread
From: James Y Knight @ 2009-11-30 20:55 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2053 bytes --]

This test case fails in 2.6.23-2.6.25, because of the bug fixed in 864f24395c72b6a6c48d13f409f986dc71a5cf4a, and now again in at least 2.6.31 and 2.6.32pre8 because of a *different* bug. This test *does not* fail 2.6.26. I have not tested anything between 2.6.26 and 2.6.31.

The bug in 2.6.31 is definitely not the same bug as 2.6.23's. This time, the zero'd area of the file doesn't show up immediately upon writing the file. Instead, the kernel waits to mangle the file until it has to flush the buffer to disk. *THEN* it zeros out parts of the file.

So, after writing out the new file with writev, and checking the md5sum (which is correct), this test case asks the kernel to flush the cache for that file, and then checks the md5sum again. ONLY THEN is the file corrupted. That is, I won't hesitate to say *incredibly evil* behavior: it took me quite some time to figure out WTH was going wrong with my program before determining it was a kernel bug.

This test case is distilled from an actual application which doesn't even intentionally use writev: it just uses C++'s ofstream class to write data to a file. Unfortunately, that class smart and uses writev under the covers. Unfortunately, I guess nobody ever tests linux writev behavior, since it's broken _so_much_of_the_time_. I really am quite astounded to see such a bad track record for such a fundamental core system call....

My /tmp is an ext3 filesystem, in case that matters.

Here is the output I get from running the program on a broken kernel:
Compiling test program
Making original file /tmp/writevtest.yzafRmFCOR/test.in
..checking original file's md5sum.
Running test to copy to /tmp/writevtest.yzafRmFCOR/test.out
..checking new file's md5sum.
Attempting to drop the page cache for this file...
..checking new file's md5sum again.
MD5SUM MISMATCH(/tmp/writevtest.yzafRmFCOR/test.out):
  wanted 2fdd6851b32ae931637d4845c037b550
  got    67e5e2d6d4435e8095335d86a3d3e993


(please CC responses to me, I'm not subscribed to this list).

Thanks,
James


[-- Attachment #2: run-writev-test.sh --]
[-- Type: application/octet-stream, Size: 974 bytes --]

#!/bin/sh

set -e

MYDIR="$(dirname $0)"

cd "$MYDIR"
echo "Compiling test program"
gcc -o writev-test writev-test.c

test_md5 () {
    local MD5
    MD5=$(md5sum $1|cut -d" " -f1)
    if [[ "$MD5" != $2 ]]; then
	printf "MD5SUM MISMATCH($1):\n  wanted $2\n  got    $MD5\n"
	exit 1
    fi
}

EXPECTED_MD5=2fdd6851b32ae931637d4845c037b550
DIR=$(mktemp -d -t writevtest.XXXXXXXXXX)
echo "Making original file $DIR/test.in"
dd if=/dev/zero bs=1k count=1k  2>/dev/null | tr '\000' '\377' > $DIR/test.in
echo "..checking original file's md5sum."
test_md5 $DIR/test.in $EXPECTED_MD5
echo "Running test to copy to $DIR/test.out"
./writev-test copy $DIR/test.in $DIR/test.out
echo "..checking new file's md5sum."
test_md5 $DIR/test.out $EXPECTED_MD5

echo "Attempting to drop the page cache for this file..."
./writev-test drop $DIR/test.out
#sync; sudo /bin/sh -c "echo 3 > /proc/sys/vm/drop_caches"

echo "..checking new file's md5sum again."
test_md5 $DIR/test.out $EXPECTED_MD5

[-- Attachment #3: writev-test.c --]
[-- Type: application/octet-stream, Size: 1853 bytes --]

#include <sys/uio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

void exit_error(char *str) {
    fputs(str, stderr);
    fputc('\n', stderr);
    exit(1);
}

void exit_perror(char *str) {
    perror(str);
    exit(1);
}

int main(int argc, char **argv) {
    int in_fd, out_fd;
    struct stat info;
    void *base_addr;
    char buf[600];
    struct iovec iov[2];
    int mode = 0;

    if (argc < 3)
        exit_error("Usage: writev-test copy infile outfile\n"
                   "       writev-test drop infile\n");

    if (!strcmp(argv[1], "copy")) {
        if (argc < 4)
            exit_error("Missing outfile argument\n");
        mode = 1;
    }
    else if (!strcmp(argv[1], "drop"))
        mode = 0;
    else
        exit_error("Unknown mode\n");

    in_fd = open(argv[2], O_RDONLY);
    if (in_fd < 0)
        exit_perror("open input");

    if (fstat(in_fd, &info) < 0)
        exit_perror("fstat");

    if (mode == 1) {

        base_addr = mmap(0, info.st_size, PROT_READ, MAP_SHARED, in_fd, 0);
        if (base_addr == MAP_FAILED)
            perror("mmap");

        if (read(in_fd, buf, 600) < 600)
            exit_perror("read");

        out_fd = open(argv[3], O_WRONLY|O_TRUNC|O_CREAT, 0666);
        if (out_fd < 0)
            exit_perror("open output");

        iov[0].iov_base = buf;
        iov[0].iov_len = 600;
        iov[1].iov_base = base_addr + 600;
        iov[1].iov_len = info.st_size - 600;
        if (writev(out_fd, iov, 2) < info.st_size)
            exit_perror("writev");

    } else {
        if (fsync(in_fd) < 0)
            perror("fsync");
        if (posix_fadvise(in_fd, 0, info.st_size, POSIX_FADV_DONTNEED) != 0)
            perror("posix_fadvise");
    }
    return 0;
}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
  2009-11-30 20:55 writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64 James Y Knight
@ 2009-12-01  0:48 ` James Y Knight
       [not found]   ` <1259667765.9614.19.camel@marge.simson.net>
  0 siblings, 1 reply; 16+ messages in thread
From: James Y Knight @ 2009-12-01  0:48 UTC (permalink / raw)
  To: linux-kernel


On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:

> This test case fails in 2.6.23-2.6.25, because of the bug fixed in 864f24395c72b6a6c48d13f409f986dc71a5cf4a, and now again in at least 2.6.31 and 2.6.32pre8 because of a *different* bug. This test *does not* fail 2.6.26. I have not tested anything between 2.6.26 and 2.6.31.
> 
> The bug in 2.6.31 is definitely not the same bug as 2.6.23's. This time, the zero'd area of the file doesn't show up immediately upon writing the file. Instead, the kernel waits to mangle the file until it has to flush the buffer to disk. *THEN* it zeros out parts of the file.
> 
> So, after writing out the new file with writev, and checking the md5sum (which is correct), this test case asks the kernel to flush the cache for that file, and then checks the md5sum again. ONLY THEN is the file corrupted. That is, I won't hesitate to say *incredibly evil* behavior: it took me quite some time to figure out WTH was going wrong with my program before determining it was a kernel bug.
> 
> This test case is distilled from an actual application which doesn't even intentionally use writev: it just uses C++'s ofstream class to write data to a file. Unfortunately, that class smart and uses writev under the covers. Unfortunately, I guess nobody ever tests linux writev behavior, since it's broken _so_much_of_the_time_. I really am quite astounded to see such a bad track record for such a fundamental core system call....
> 
> My /tmp is an ext3 filesystem, in case that matters.

Further testing shows that the filesystem type *does* matter. The bug does not exhibit when the test is run on ext2, ext4, vfat, btrfs, jfs, or xfs (and probably all the others too). Only, so far as I can determine, on ext3.

Thanks,
James

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
       [not found]   ` <1259667765.9614.19.camel@marge.simson.net>
@ 2009-12-01 12:59     ` Jan Kara
  2009-12-01 14:35     ` Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2009-12-01 12:59 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: James Y Knight, LKML

On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> > On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> > 
> > > This test case fails in 2.6.23-2.6.25, because of the bug fixed in 864f24395c72b6a6c48d13f409f986dc71a5cf4a, and now again in at least 2.6.31 and 2.6.32pre8 because of a *different* bug. This test *does not* fail 2.6.26. I have not tested anything between 2.6.26 and 2.6.31.
> > > 
> > > The bug in 2.6.31 is definitely not the same bug as 2.6.23's. This time, the zero'd area of the file doesn't show up immediately upon writing the file. Instead, the kernel waits to mangle the file until it has to flush the buffer to disk. *THEN* it zeros out parts of the file.
> > > 
> > > So, after writing out the new file with writev, and checking the md5sum (which is correct), this test case asks the kernel to flush the cache for that file, and then checks the md5sum again. ONLY THEN is the file corrupted. That is, I won't hesitate to say *incredibly evil* behavior: it took me quite some time to figure out WTH was going wrong with my program before determining it was a kernel bug.
> > > 
> > > This test case is distilled from an actual application which doesn't even intentionally use writev: it just uses C++'s ofstream class to write data to a file. Unfortunately, that class smart and uses writev under the covers. Unfortunately, I guess nobody ever tests linux writev behavior, since it's broken _so_much_of_the_time_. I really am quite astounded to see such a bad track record for such a fundamental core system call....
> > > 
> > > My /tmp is an ext3 filesystem, in case that matters.
> > 
> > Further testing shows that the filesystem type *does* matter. The bug does not exhibit when the test is run on ext2, ext4, vfat, btrfs, jfs, or xfs (and probably all the others too). Only, so far as I can determine, on ext3.
> 
> I bisected it this morning.  Bisected cleanly to...
> 
> 9eaaa2d5759837402ec5eee13b2a97921808c3eb is the first bad commit
> commit 9eaaa2d5759837402ec5eee13b2a97921808c3eb
> Author: Jan Kara <jack@suse.cz>
> Date:   Mon Jul 13 20:26:52 2009 +0200
> 
>     ext3: Fix truncation of symlinks after failed write
> 
>     Contents of long symlinks is written via standard write methods. So when the
>     write fails, we add inode to orphan list. But symlinks don't have .truncate
>     method defined so nobody properly removes them from the orphan list (both on
>     disk and in memory).
> 
>     Fix this by calling ext3_truncate() directly instead of calling vmtruncate()
>     (which is saner anyway since we don't need anything vmtruncate() does except
>     from calling .truncate in these paths).  We also add inode to orphan list only
>     if ext3_can_truncate() is true (currently, it can be false for symlinks when
>     there are no blocks allocated) - otherwise orphan list processing will complain
>     and ext3_truncate() will not remove inode from on-disk orphan list.
> 
>     Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Reverting that in 31.6 (two revert/apply cycles) cured it (which doesn't
> look right at a glance at changelog, but.. shrug).  Doing the same
> in .git does not cure it, so either there's a part two, or something
> went wonky.  I'll probably try to bisect part two, but would appreciate
> a verification before maybe wasting more time.
  Huh, I don't see how that's connected either but OTOH it's touching write
path so it's probably some strange interaction. Anyway, I see it on my
machine as well so I'm investigating. Thanks for CCing me.

									Honza

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
       [not found]   ` <1259667765.9614.19.camel@marge.simson.net>
  2009-12-01 12:59     ` Jan Kara
@ 2009-12-01 14:35     ` Jan Kara
  2009-12-01 16:03       ` Jan Kara
                         ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Jan Kara @ 2009-12-01 14:35 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: James Y Knight, Jan Kara, LKML, linux-ext4, npiggin

On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> > On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> > 
> > > This test case fails in 2.6.23-2.6.25, because of the bug fixed in 864f24395c72b6a6c48d13f409f986dc71a5cf4a, and now again in at least 2.6.31 and 2.6.32pre8 because of a *different* bug. This test *does not* fail 2.6.26. I have not tested anything between 2.6.26 and 2.6.31.
> > > 
> > > The bug in 2.6.31 is definitely not the same bug as 2.6.23's. This time, the zero'd area of the file doesn't show up immediately upon writing the file. Instead, the kernel waits to mangle the file until it has to flush the buffer to disk. *THEN* it zeros out parts of the file.
> > > 
> > > So, after writing out the new file with writev, and checking the md5sum (which is correct), this test case asks the kernel to flush the cache for that file, and then checks the md5sum again. ONLY THEN is the file corrupted. That is, I won't hesitate to say *incredibly evil* behavior: it took me quite some time to figure out WTH was going wrong with my program before determining it was a kernel bug.
> > > 
> > > This test case is distilled from an actual application which doesn't even intentionally use writev: it just uses C++'s ofstream class to write data to a file. Unfortunately, that class smart and uses writev under the covers. Unfortunately, I guess nobody ever tests linux writev behavior, since it's broken _so_much_of_the_time_. I really am quite astounded to see such a bad track record for such a fundamental core system call....
> > > 
> > > My /tmp is an ext3 filesystem, in case that matters.
> > 
> > Further testing shows that the filesystem type *does* matter. The bug does not exhibit when the test is run on ext2, ext4, vfat, btrfs, jfs, or xfs (and probably all the others too). Only, so far as I can determine, on ext3.
> 
> I bisected it this morning.  Bisected cleanly to...
> 
> 9eaaa2d5759837402ec5eee13b2a97921808c3eb is the first bad commit
  OK, I've debugged it. This commit is really at fault. The problem is
following:
  When using writev, the page we copy from is not paged in (while when we
use ordinary write, it is paged in). This difference might be worth
investigation on its own (as it is likely to heavily impact performance of
writev) but is irrelevant for us now - we should handle this without data
corruption anyway. Because the source page is not available, we pass 0 as
the number of copied bytes to write_end and thus ext3_write_end decides to
truncate the file to original size. This is perfectly fine. The problem is
that we do this by ext3_truncate() which just frees corresponding block but
does not unmap buffers. So we leave mapped buffers beyond i_size (they
actually never were inside i_size) but the blocks they are mapped to are
already free. The write is then retried (after mapping the page),
block_write_begin() sees the buffer is mapped (although it is beyond
i_size) and thus it does not call ext3_get_block() anymore. So as a result,
data is written to a block that is no longer allocated to the file. Bummer
- welcome filesystem corruption.
  Ext4 also has this problem but delayed allocation mitigates the effect to
an error in accounting of blocks reserved for delayed allocation and thus
under normal circumstances nothing bad happens.
  The question is how to solve this in the cleanest way. We can call
vmtruncate() instead of ext3_truncate() as we used to do but Nick wants to
get rid of that (that's why I originally changed the code to what it is
now). So probably we could just manually call truncate_pagecache() instead.
Nick, I think your truncate calling sequence patch set needs similar fix
for all filesystems as well.

								Honza

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
  2009-12-01 14:35     ` Jan Kara
@ 2009-12-01 16:03       ` Jan Kara
  2009-12-01 16:47         ` Mike Galbraith
                           ` (2 more replies)
  2009-12-02 19:04       ` Jan Kara
  2009-12-03  5:22       ` Nick Piggin
  2 siblings, 3 replies; 16+ messages in thread
From: Jan Kara @ 2009-12-01 16:03 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: James Y Knight, Jan Kara, LKML, linux-ext4, npiggin

On Tue 01-12-09 15:35:59, Jan Kara wrote:
> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> > On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> > > On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> > > 
> > > > This test case fails in 2.6.23-2.6.25, because of the bug fixed in 864f24395c72b6a6c48d13f409f986dc71a5cf4a, and now again in at least 2.6.31 and 2.6.32pre8 because of a *different* bug. This test *does not* fail 2.6.26. I have not tested anything between 2.6.26 and 2.6.31.
> > > > 
> > > > The bug in 2.6.31 is definitely not the same bug as 2.6.23's. This time, the zero'd area of the file doesn't show up immediately upon writing the file. Instead, the kernel waits to mangle the file until it has to flush the buffer to disk. *THEN* it zeros out parts of the file.
> > > > 
> > > > So, after writing out the new file with writev, and checking the md5sum (which is correct), this test case asks the kernel to flush the cache for that file, and then checks the md5sum again. ONLY THEN is the file corrupted. That is, I won't hesitate to say *incredibly evil* behavior: it took me quite some time to figure out WTH was going wrong with my program before determining it was a kernel bug.
> > > > 
> > > > This test case is distilled from an actual application which doesn't even intentionally use writev: it just uses C++'s ofstream class to write data to a file. Unfortunately, that class smart and uses writev under the covers. Unfortunately, I guess nobody ever tests linux writev behavior, since it's broken _so_much_of_the_time_. I really am quite astounded to see such a bad track record for such a fundamental core system call....
> > > > 
> > > > My /tmp is an ext3 filesystem, in case that matters.
> > > 
> > > Further testing shows that the filesystem type *does* matter. The bug does not exhibit when the test is run on ext2, ext4, vfat, btrfs, jfs, or xfs (and probably all the others too). Only, so far as I can determine, on ext3.
> > 
> > I bisected it this morning.  Bisected cleanly to...
> > 
> > 9eaaa2d5759837402ec5eee13b2a97921808c3eb is the first bad commit
>   OK, I've debugged it. This commit is really at fault. The problem is
> following:
>   When using writev, the page we copy from is not paged in (while when we
> use ordinary write, it is paged in). This difference might be worth
> investigation on its own (as it is likely to heavily impact performance of
> writev) but is irrelevant for us now - we should handle this without data
> corruption anyway. Because the source page is not available, we pass 0 as
> the number of copied bytes to write_end and thus ext3_write_end decides to
> truncate the file to original size. This is perfectly fine. The problem is
> that we do this by ext3_truncate() which just frees corresponding block but
> does not unmap buffers. So we leave mapped buffers beyond i_size (they
> actually never were inside i_size) but the blocks they are mapped to are
> already free. The write is then retried (after mapping the page),
> block_write_begin() sees the buffer is mapped (although it is beyond
> i_size) and thus it does not call ext3_get_block() anymore. So as a result,
> data is written to a block that is no longer allocated to the file. Bummer
> - welcome filesystem corruption.
>   Ext4 also has this problem but delayed allocation mitigates the effect to
> an error in accounting of blocks reserved for delayed allocation and thus
> under normal circumstances nothing bad happens.
>   The question is how to solve this in the cleanest way. We can call
> vmtruncate() instead of ext3_truncate() as we used to do but Nick wants to
> get rid of that (that's why I originally changed the code to what it is
> now). So probably we could just manually call truncate_pagecache() instead.
> Nick, I think your truncate calling sequence patch set needs similar fix
> for all filesystems as well.
  The patch below fixes the issue for me...

									Honza

>From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 1 Dec 2009 16:53:06 +0100
Subject: [PATCH] ext3: Fix data / filesystem corruption when write fails to copy data

When ext3_write_begin fails after allocating some blocks or
generic_perform_write fails to copy data to write, we truncate blocks already
instantiated beyond i_size. Although these blocks were never inside i_size, we
have to truncate pagecache of these blocks so that corresponding buffers get
unmapped. Otherwise subsequent __block_prepare_write (called because we are
retrying the write) will find the buffers mapped, not call ->get_block, and
thus the page will be backed by already freed blocks leading to filesystem and
data corruption.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c |   18 ++++++++++++++----
 1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 354ed3b..f9d6937 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1151,6 +1151,16 @@ static int do_journal_get_write_access(handle_t *handle,
 	return ext3_journal_get_write_access(handle, bh);
 }
 
+/*
+ * Truncate blocks that were not used by write. We have to truncate the
+ * pagecache as well so that corresponding buffers get properly unmapped.
+ */
+static void ext3_truncate_failed_write(struct inode *inode)
+{
+	truncate_inode_pages(inode->i_mapping, inode->i_size);
+	ext3_truncate(inode);
+}
+
 static int ext3_write_begin(struct file *file, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
 				struct page **pagep, void **fsdata)
@@ -1209,7 +1219,7 @@ write_begin_failed:
 		unlock_page(page);
 		page_cache_release(page);
 		if (pos + len > inode->i_size)
-			ext3_truncate(inode);
+			ext3_truncate_failed_write(inode);
 	}
 	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
@@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file *file,
 	page_cache_release(page);
 
 	if (pos + len > inode->i_size)
-		ext3_truncate(inode);
+		ext3_truncate_failed_write(inode);
 	return ret ? ret : copied;
 }
 
@@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct file *file,
 	page_cache_release(page);
 
 	if (pos + len > inode->i_size)
-		ext3_truncate(inode);
+		ext3_truncate_failed_write(inode);
 	return ret ? ret : copied;
 }
 
@@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct file *file,
 	page_cache_release(page);
 
 	if (pos + len > inode->i_size)
-		ext3_truncate(inode);
+		ext3_truncate_failed_write(inode);
 	return ret ? ret : copied;
 }
 
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
  2009-12-01 16:03       ` Jan Kara
@ 2009-12-01 16:47         ` Mike Galbraith
  2009-12-01 22:02         ` Andreas Dilger
  2009-12-02 21:24         ` James Y Knight
  2 siblings, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2009-12-01 16:47 UTC (permalink / raw)
  To: Jan Kara; +Cc: James Y Knight, LKML, linux-ext4, npiggin

On Tue, 2009-12-01 at 17:03 +0100, Jan Kara wrote:

>   The patch below fixes the issue for me...

Ditto.

	-Mike


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
  2009-12-01 16:03       ` Jan Kara
  2009-12-01 16:47         ` Mike Galbraith
@ 2009-12-01 22:02         ` Andreas Dilger
  2009-12-02 18:53             ` [LTP] " Randy Dunlap
  2009-12-02 21:24         ` James Y Knight
  2 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2009-12-01 22:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: Mike Galbraith, James Y Knight, ext4 development, npiggin

On 2009-12-01, at 09:03, Jan Kara wrote:
> On Tue 01-12-09 15:35:59, Jan Kara wrote:
>> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
>>> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
>>>> On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
>>>>> This test case is distilled from an actual application which  
>>>>> doesn't even intentionally use writev: it just uses C++'s  
>>>>> ofstream class to write data to a file. Unfortunately, that  
>>>>> class smart and uses writev under the covers. Unfortunately, I  
>>>>> guess nobody ever tests linux writev behavior, since it's broken  
>>>>> _so_much_of_the_time_. I really am quite astounded to see such a  
>>>>> bad track record for such a fundamental core system call....

I suspect an excellent way of exposing problems with the writev()  
interface would be to wire it into fsx, which is commonly run as a  
stress test for Linux.  I don't know if it would have caught this  
case, but it definitely couldn't hurt to get more testing cycles for it.

>>  Ext4 also has this problem but delayed allocation mitigates the  
>> effect to an error in accounting of blocks reserved for delayed  
>> allocation and thus under normal circumstances nothing bad happens.

It looks like ext4 might still hit this problem, if delalloc is  
disabled.  Could you please submit a similar patch for ext4 also.

>  The patch below fixes the issue for me...
>
> 									Honza
>
> From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Tue, 1 Dec 2009 16:53:06 +0100
> Subject: [PATCH] ext3: Fix data / filesystem corruption when write  
> fails to copy data
>
> When ext3_write_begin fails after allocating some blocks or  
> generic_perform_write fails to copy data to write, we truncate  
> blocks already instantiated beyond i_size. Although these blocks  
> were never inside i_size, we have to truncate pagecache of these  
> blocks so that corresponding buffers get unmapped. Otherwise  
> subsequent __block_prepare_write (called because we are retrying the  
> write) will find the buffers mapped, not call ->get_block, and thus  
> the page will be backed by already freed blocks leading to  
> filesystem and data corruption.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/ext3/inode.c |   18 ++++++++++++++----
> 1 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 354ed3b..f9d6937 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1151,6 +1151,16 @@ static int  
> do_journal_get_write_access(handle_t *handle,
> 	return ext3_journal_get_write_access(handle, bh);
> }
>
> +/*
> + * Truncate blocks that were not used by write. We have to truncate  
> the
> + * pagecache as well so that corresponding buffers get properly  
> unmapped.
> + */
> +static void ext3_truncate_failed_write(struct inode *inode)
> +{
> +	truncate_inode_pages(inode->i_mapping, inode->i_size);
> +	ext3_truncate(inode);
> +}
> +
> static int ext3_write_begin(struct file *file, struct address_space  
> *mapping,
> 				loff_t pos, unsigned len, unsigned flags,
> 				struct page **pagep, void **fsdata)
> @@ -1209,7 +1219,7 @@ write_begin_failed:
> 		unlock_page(page);
> 		page_cache_release(page);
> 		if (pos + len > inode->i_size)
> -			ext3_truncate(inode);
> +			ext3_truncate_failed_write(inode);
> 	}
> 	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> 		goto retry;
> @@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file  
> *file,
> 	page_cache_release(page);
>
> 	if (pos + len > inode->i_size)
> -		ext3_truncate(inode);
> +		ext3_truncate_failed_write(inode);
> 	return ret ? ret : copied;
> }
>
> @@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct  
> file *file,
> 	page_cache_release(page);
>
> 	if (pos + len > inode->i_size)
> -		ext3_truncate(inode);
> +		ext3_truncate_failed_write(inode);
> 	return ret ? ret : copied;
> }
>
> @@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct  
> file *file,
> 	page_cache_release(page);
>
> 	if (pos + len > inode->i_size)
> -		ext3_truncate(inode);
> +		ext3_truncate_failed_write(inode);
> 	return ret ? ret : copied;
> }
>
> -- 
> 1.6.4.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux- 
> ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
  2009-12-01 22:02         ` Andreas Dilger
@ 2009-12-02 18:53             ` Randy Dunlap
  0 siblings, 0 replies; 16+ messages in thread
From: Randy Dunlap @ 2009-12-02 18:53 UTC (permalink / raw)
  To: Andreas Dilger, ltp-list
  Cc: Jan Kara, Mike Galbraith, James Y Knight, ext4 development, npiggin

On Tue, 01 Dec 2009 15:02:04 -0700 Andreas Dilger wrote:

> On 2009-12-01, at 09:03, Jan Kara wrote:
> > On Tue 01-12-09 15:35:59, Jan Kara wrote:
> >> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> >>> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> >>>> On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> >>>>> This test case is distilled from an actual application which  
> >>>>> doesn't even intentionally use writev: it just uses C++'s  
> >>>>> ofstream class to write data to a file. Unfortunately, that  
> >>>>> class smart and uses writev under the covers. Unfortunately, I  
> >>>>> guess nobody ever tests linux writev behavior, since it's broken  
> >>>>> _so_much_of_the_time_. I really am quite astounded to see such a  
> >>>>> bad track record for such a fundamental core system call....
> 
> I suspect an excellent way of exposing problems with the writev()  
> interface would be to wire it into fsx, which is commonly run as a  
> stress test for Linux.  I don't know if it would have caught this  
> case, but it definitely couldn't hurt to get more testing cycles for it.

Maybe someone from LTP would be interested in adding this test functionality
to fsx-linux ?

Source/test program is available at
http://marc.info/?l=linux-kernel&m=125961612418323&w=2


> >>  Ext4 also has this problem but delayed allocation mitigates the  
> >> effect to an error in accounting of blocks reserved for delayed  
> >> allocation and thus under normal circumstances nothing bad happens.
> 
> It looks like ext4 might still hit this problem, if delalloc is  
> disabled.  Could you please submit a similar patch for ext4 also.
> 
> >  The patch below fixes the issue for me...
> >
> > 									Honza
> >
> > From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001
> > From: Jan Kara <jack@suse.cz>
> > Date: Tue, 1 Dec 2009 16:53:06 +0100
> > Subject: [PATCH] ext3: Fix data / filesystem corruption when write  
> > fails to copy data
> >
> > When ext3_write_begin fails after allocating some blocks or  
> > generic_perform_write fails to copy data to write, we truncate  
> > blocks already instantiated beyond i_size. Although these blocks  
> > were never inside i_size, we have to truncate pagecache of these  
> > blocks so that corresponding buffers get unmapped. Otherwise  
> > subsequent __block_prepare_write (called because we are retrying the  
> > write) will find the buffers mapped, not call ->get_block, and thus  
> > the page will be backed by already freed blocks leading to  
> > filesystem and data corruption.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > fs/ext3/inode.c |   18 ++++++++++++++----
> > 1 files changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > index 354ed3b..f9d6937 100644
> > --- a/fs/ext3/inode.c
> > +++ b/fs/ext3/inode.c
> > @@ -1151,6 +1151,16 @@ static int  
> > do_journal_get_write_access(handle_t *handle,
> > 	return ext3_journal_get_write_access(handle, bh);
> > }
> >
> > +/*
> > + * Truncate blocks that were not used by write. We have to truncate  
> > the
> > + * pagecache as well so that corresponding buffers get properly  
> > unmapped.
> > + */
> > +static void ext3_truncate_failed_write(struct inode *inode)
> > +{
> > +	truncate_inode_pages(inode->i_mapping, inode->i_size);
> > +	ext3_truncate(inode);
> > +}
> > +
> > static int ext3_write_begin(struct file *file, struct address_space  
> > *mapping,
> > 				loff_t pos, unsigned len, unsigned flags,
> > 				struct page **pagep, void **fsdata)
> > @@ -1209,7 +1219,7 @@ write_begin_failed:
> > 		unlock_page(page);
> > 		page_cache_release(page);
> > 		if (pos + len > inode->i_size)
> > -			ext3_truncate(inode);
> > +			ext3_truncate_failed_write(inode);
> > 	}
> > 	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> > 		goto retry;
> > @@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file  
> > *file,
> > 	page_cache_release(page);
> >
> > 	if (pos + len > inode->i_size)
> > -		ext3_truncate(inode);
> > +		ext3_truncate_failed_write(inode);
> > 	return ret ? ret : copied;
> > }
> >
> > @@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct  
> > file *file,
> > 	page_cache_release(page);
> >
> > 	if (pos + len > inode->i_size)
> > -		ext3_truncate(inode);
> > +		ext3_truncate_failed_write(inode);
> > 	return ret ? ret : copied;
> > }
> >
> > @@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct  
> > file *file,
> > 	page_cache_release(page);
> >
> > 	if (pos + len > inode->i_size)
> > -		ext3_truncate(inode);
> > +		ext3_truncate_failed_write(inode);
> > 	return ret ? ret : copied;
> > }
> >
> > -- 
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.


---
~Randy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LTP] writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
@ 2009-12-02 18:53             ` Randy Dunlap
  0 siblings, 0 replies; 16+ messages in thread
From: Randy Dunlap @ 2009-12-02 18:53 UTC (permalink / raw)
  To: Andreas Dilger, ltp-list
  Cc: Mike Galbraith, ext4 development, James Y Knight, Jan Kara, npiggin

On Tue, 01 Dec 2009 15:02:04 -0700 Andreas Dilger wrote:

> On 2009-12-01, at 09:03, Jan Kara wrote:
> > On Tue 01-12-09 15:35:59, Jan Kara wrote:
> >> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> >>> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> >>>> On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> >>>>> This test case is distilled from an actual application which  
> >>>>> doesn't even intentionally use writev: it just uses C++'s  
> >>>>> ofstream class to write data to a file. Unfortunately, that  
> >>>>> class smart and uses writev under the covers. Unfortunately, I  
> >>>>> guess nobody ever tests linux writev behavior, since it's broken  
> >>>>> _so_much_of_the_time_. I really am quite astounded to see such a  
> >>>>> bad track record for such a fundamental core system call....
> 
> I suspect an excellent way of exposing problems with the writev()  
> interface would be to wire it into fsx, which is commonly run as a  
> stress test for Linux.  I don't know if it would have caught this  
> case, but it definitely couldn't hurt to get more testing cycles for it.

Maybe someone from LTP would be interested in adding this test functionality
to fsx-linux ?

Source/test program is available at
http://marc.info/?l=linux-kernel&m=125961612418323&w=2


> >>  Ext4 also has this problem but delayed allocation mitigates the  
> >> effect to an error in accounting of blocks reserved for delayed  
> >> allocation and thus under normal circumstances nothing bad happens.
> 
> It looks like ext4 might still hit this problem, if delalloc is  
> disabled.  Could you please submit a similar patch for ext4 also.
> 
> >  The patch below fixes the issue for me...
> >
> > 									Honza
> >
> > From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001
> > From: Jan Kara <jack@suse.cz>
> > Date: Tue, 1 Dec 2009 16:53:06 +0100
> > Subject: [PATCH] ext3: Fix data / filesystem corruption when write  
> > fails to copy data
> >
> > When ext3_write_begin fails after allocating some blocks or  
> > generic_perform_write fails to copy data to write, we truncate  
> > blocks already instantiated beyond i_size. Although these blocks  
> > were never inside i_size, we have to truncate pagecache of these  
> > blocks so that corresponding buffers get unmapped. Otherwise  
> > subsequent __block_prepare_write (called because we are retrying the  
> > write) will find the buffers mapped, not call ->get_block, and thus  
> > the page will be backed by already freed blocks leading to  
> > filesystem and data corruption.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > fs/ext3/inode.c |   18 ++++++++++++++----
> > 1 files changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > index 354ed3b..f9d6937 100644
> > --- a/fs/ext3/inode.c
> > +++ b/fs/ext3/inode.c
> > @@ -1151,6 +1151,16 @@ static int  
> > do_journal_get_write_access(handle_t *handle,
> > 	return ext3_journal_get_write_access(handle, bh);
> > }
> >
> > +/*
> > + * Truncate blocks that were not used by write. We have to truncate  
> > the
> > + * pagecache as well so that corresponding buffers get properly  
> > unmapped.
> > + */
> > +static void ext3_truncate_failed_write(struct inode *inode)
> > +{
> > +	truncate_inode_pages(inode->i_mapping, inode->i_size);
> > +	ext3_truncate(inode);
> > +}
> > +
> > static int ext3_write_begin(struct file *file, struct address_space  
> > *mapping,
> > 				loff_t pos, unsigned len, unsigned flags,
> > 				struct page **pagep, void **fsdata)
> > @@ -1209,7 +1219,7 @@ write_begin_failed:
> > 		unlock_page(page);
> > 		page_cache_release(page);
> > 		if (pos + len > inode->i_size)
> > -			ext3_truncate(inode);
> > +			ext3_truncate_failed_write(inode);
> > 	}
> > 	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> > 		goto retry;
> > @@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file  
> > *file,
> > 	page_cache_release(page);
> >
> > 	if (pos + len > inode->i_size)
> > -		ext3_truncate(inode);
> > +		ext3_truncate_failed_write(inode);
> > 	return ret ? ret : copied;
> > }
> >
> > @@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct  
> > file *file,
> > 	page_cache_release(page);
> >
> > 	if (pos + len > inode->i_size)
> > -		ext3_truncate(inode);
> > +		ext3_truncate_failed_write(inode);
> > 	return ret ? ret : copied;
> > }
> >
> > @@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct  
> > file *file,
> > 	page_cache_release(page);
> >
> > 	if (pos + len > inode->i_size)
> > -		ext3_truncate(inode);
> > +		ext3_truncate_failed_write(inode);
> > 	return ret ? ret : copied;
> > }
> >
> > -- 
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.


---
~Randy

------------------------------------------------------------------------------
Join us December 9, 2009 for the Red Hat Virtual Experience,
a free event focused on virtualization and cloud computing. 
Attend in-depth sessions from your desk. Your couch. Anywhere.
http://p.sf.net/sfu/redhat-sfdev2dev
_______________________________________________
Ltp-list mailing list
Ltp-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ltp-list

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
  2009-12-01 14:35     ` Jan Kara
  2009-12-01 16:03       ` Jan Kara
@ 2009-12-02 19:04       ` Jan Kara
  2009-12-03  5:28         ` Nick Piggin
  2009-12-03  5:22       ` Nick Piggin
  2 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2009-12-02 19:04 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: James Y Knight, Jan Kara, LKML, linux-ext4, npiggin

On Tue 01-12-09 15:35:59, Jan Kara wrote:
> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> > On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> > > On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> > > 
> > > > This test case fails in 2.6.23-2.6.25, because of the bug fixed in 864f24395c72b6a6c48d13f409f986dc71a5cf4a, and now again in at least 2.6.31 and 2.6.32pre8 because of a *different* bug. This test *does not* fail 2.6.26. I have not tested anything between 2.6.26 and 2.6.31.
> > > > 
> > > > The bug in 2.6.31 is definitely not the same bug as 2.6.23's. This time, the zero'd area of the file doesn't show up immediately upon writing the file. Instead, the kernel waits to mangle the file until it has to flush the buffer to disk. *THEN* it zeros out parts of the file.
> > > > 
> > > > So, after writing out the new file with writev, and checking the md5sum (which is correct), this test case asks the kernel to flush the cache for that file, and then checks the md5sum again. ONLY THEN is the file corrupted. That is, I won't hesitate to say *incredibly evil* behavior: it took me quite some time to figure out WTH was going wrong with my program before determining it was a kernel bug.
> > > > 
> > > > This test case is distilled from an actual application which doesn't even intentionally use writev: it just uses C++'s ofstream class to write data to a file. Unfortunately, that class smart and uses writev under the covers. Unfortunately, I guess nobody ever tests linux writev behavior, since it's broken _so_much_of_the_time_. I really am quite astounded to see such a bad track record for such a fundamental core system call....
> > > > 
> > > > My /tmp is an ext3 filesystem, in case that matters.
> > > 
> > > Further testing shows that the filesystem type *does* matter. The bug does not exhibit when the test is run on ext2, ext4, vfat, btrfs, jfs, or xfs (and probably all the others too). Only, so far as I can determine, on ext3.
> > 
> > I bisected it this morning.  Bisected cleanly to...
> > 
> > 9eaaa2d5759837402ec5eee13b2a97921808c3eb is the first bad commit
>   OK, I've debugged it. This commit is really at fault. The problem is
> following:
>   When using writev, the page we copy from is not paged in (while when we
> use ordinary write, it is paged in). This difference might be worth
> investigation on its own (as it is likely to heavily impact performance of
> writev) but is irrelevant for us now - we should handle this without data
> corruption anyway.
  I've looked into why writev fails reliably the writes. The reason is that
iov_iter_fault_in_readable() faults in only the first IO buffer. Because
this is just 600 bytes big, following iov_iter_copy_from_user_atomic copies
only 600 bytes and block_write_end sets number of copied bytes to 0. Thus
we restart the write and do it one iov per iteration which succeeds. So
everything works as designed only it gets inefficient in this particular
case.

								Honza

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
  2009-12-01 16:03       ` Jan Kara
  2009-12-01 16:47         ` Mike Galbraith
  2009-12-01 22:02         ` Andreas Dilger
@ 2009-12-02 21:24         ` James Y Knight
  2 siblings, 0 replies; 16+ messages in thread
From: James Y Knight @ 2009-12-02 21:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: Mike Galbraith, LKML, linux-ext4, npiggin

On Dec 1, 2009, at 11:03 AM, Jan Kara wrote:
> On Tue 01-12-09 15:35:59, Jan Kara wrote:
>> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
>>> I bisected it this morning.  Bisected cleanly to...
>>> 
>>> 9eaaa2d5759837402ec5eee13b2a97921808c3eb is the first bad commit
>>  OK, I've debugged it. This commit is really at fault. The problem is
>> following:
>>  When using writev, the page we copy from is not paged in (while when we
>> use ordinary write, it is paged in). This difference might be worth
>> investigation on its own (as it is likely to heavily impact performance of
>> writev) but is irrelevant for us now - we should handle this without data
>> corruption anyway. Because the source page is not available, we pass 0 as
>> the number of copied bytes to write_end and thus ext3_write_end decides to
>> truncate the file to original size. This is perfectly fine. The problem is
>> that we do this by ext3_truncate() which just frees corresponding block but
>> does not unmap buffers. So we leave mapped buffers beyond i_size (they
>> actually never were inside i_size) but the blocks they are mapped to are
>> already free. The write is then retried (after mapping the page),
>> block_write_begin() sees the buffer is mapped (although it is beyond
>> i_size) and thus it does not call ext3_get_block() anymore. So as a result,
>> data is written to a block that is no longer allocated to the file. Bummer
>> - welcome filesystem corruption.
>>  Ext4 also has this problem but delayed allocation mitigates the effect to
>> an error in accounting of blocks reserved for delayed allocation and thus
>> under normal circumstances nothing bad happens.
>>  The question is how to solve this in the cleanest way. We can call
>> vmtruncate() instead of ext3_truncate() as we used to do but Nick wants to
>> get rid of that (that's why I originally changed the code to what it is
>> now). So probably we could just manually call truncate_pagecache() instead.
>> Nick, I think your truncate calling sequence patch set needs similar fix
>> for all filesystems as well.
>  The patch below fixes the issue for me...

Thank you! I can confirm that the patch fixes the issue in my real application as well.

James

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
  2009-12-01 14:35     ` Jan Kara
  2009-12-01 16:03       ` Jan Kara
  2009-12-02 19:04       ` Jan Kara
@ 2009-12-03  5:22       ` Nick Piggin
  2 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2009-12-03  5:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Mike Galbraith, James Y Knight, LKML, linux-ext4

On Tue, Dec 01, 2009 at 03:35:59PM +0100, Jan Kara wrote:
> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> > On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> > > On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> > > 
> > > > This test case fails in 2.6.23-2.6.25, because of the bug fixed in 864f24395c72b6a6c48d13f409f986dc71a5cf4a, and now again in at least 2.6.31 and 2.6.32pre8 because of a *different* bug. This test *does not* fail 2.6.26. I have not tested anything between 2.6.26 and 2.6.31.
> > > > 
> > > > The bug in 2.6.31 is definitely not the same bug as 2.6.23's. This time, the zero'd area of the file doesn't show up immediately upon writing the file. Instead, the kernel waits to mangle the file until it has to flush the buffer to disk. *THEN* it zeros out parts of the file.
> > > > 
> > > > So, after writing out the new file with writev, and checking the md5sum (which is correct), this test case asks the kernel to flush the cache for that file, and then checks the md5sum again. ONLY THEN is the file corrupted. That is, I won't hesitate to say *incredibly evil* behavior: it took me quite some time to figure out WTH was going wrong with my program before determining it was a kernel bug.
> > > > 
> > > > This test case is distilled from an actual application which doesn't even intentionally use writev: it just uses C++'s ofstream class to write data to a file. Unfortunately, that class smart and uses writev under the covers. Unfortunately, I guess nobody ever tests linux writev behavior, since it's broken _so_much_of_the_time_. I really am quite astounded to see such a bad track record for such a fundamental core system call....
> > > > 
> > > > My /tmp is an ext3 filesystem, in case that matters.
> > > 
> > > Further testing shows that the filesystem type *does* matter. The bug does not exhibit when the test is run on ext2, ext4, vfat, btrfs, jfs, or xfs (and probably all the others too). Only, so far as I can determine, on ext3.
> > 
> > I bisected it this morning.  Bisected cleanly to...
> > 
> > 9eaaa2d5759837402ec5eee13b2a97921808c3eb is the first bad commit
>   OK, I've debugged it. This commit is really at fault. The problem is
> following:
>   When using writev, the page we copy from is not paged in (while when we
> use ordinary write, it is paged in). This difference might be worth
> investigation on its own (as it is likely to heavily impact performance of
> writev) but is irrelevant for us now - we should handle this without data
> corruption anyway. Because the source page is not available, we pass 0 as
> the number of copied bytes to write_end and thus ext3_write_end decides to
> truncate the file to original size. This is perfectly fine. The problem is
> that we do this by ext3_truncate() which just frees corresponding block but
> does not unmap buffers. So we leave mapped buffers beyond i_size (they
> actually never were inside i_size) but the blocks they are mapped to are
> already free. The write is then retried (after mapping the page),
> block_write_begin() sees the buffer is mapped (although it is beyond
> i_size) and thus it does not call ext3_get_block() anymore. So as a result,
> data is written to a block that is no longer allocated to the file. Bummer
> - welcome filesystem corruption.
>   Ext4 also has this problem but delayed allocation mitigates the effect to
> an error in accounting of blocks reserved for delayed allocation and thus
> under normal circumstances nothing bad happens.
>   The question is how to solve this in the cleanest way. We can call
> vmtruncate() instead of ext3_truncate() as we used to do but Nick wants to
> get rid of that (that's why I originally changed the code to what it is
> now). So probably we could just manually call truncate_pagecache() instead.
> Nick, I think your truncate calling sequence patch set needs similar fix
> for all filesystems as well.

Thanks Jan, good analysis and yes I believe I will need to do a similar
fix there.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
  2009-12-02 19:04       ` Jan Kara
@ 2009-12-03  5:28         ` Nick Piggin
  2009-12-03 10:32           ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2009-12-03  5:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: Mike Galbraith, James Y Knight, LKML, linux-ext4

On Wed, Dec 02, 2009 at 08:04:26PM +0100, Jan Kara wrote:
> >   When using writev, the page we copy from is not paged in (while when we
> > use ordinary write, it is paged in). This difference might be worth
> > investigation on its own (as it is likely to heavily impact performance of
> > writev) but is irrelevant for us now - we should handle this without data
> > corruption anyway.
>   I've looked into why writev fails reliably the writes. The reason is that
> iov_iter_fault_in_readable() faults in only the first IO buffer. Because
> this is just 600 bytes big, following iov_iter_copy_from_user_atomic copies
> only 600 bytes and block_write_end sets number of copied bytes to 0. Thus
> we restart the write and do it one iov per iteration which succeeds. So
> everything works as designed only it gets inefficient in this particular
> case.

Yep, this would be right. We could actually do more prefaulting; I
think I was being a little over conservative and worried about earlier
pages being unmapped before we were able to consume them... but I
think being too worried about that case is optimizing an unusual case
that is probably performing badly anyway at the expense of more common
patterns.

Anyway, what I was doing to test this code when I wrote it was to
inject random failures into user copy functions. I guess this could
be useful to merge in the error injection framework?

Thanks,
Nick


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
  2009-12-03  5:28         ` Nick Piggin
@ 2009-12-03 10:32           ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2009-12-03 10:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jan Kara, Mike Galbraith, James Y Knight, LKML, linux-ext4

On Thu 03-12-09 06:28:25, Nick Piggin wrote:
> On Wed, Dec 02, 2009 at 08:04:26PM +0100, Jan Kara wrote:
> > >   When using writev, the page we copy from is not paged in (while when we
> > > use ordinary write, it is paged in). This difference might be worth
> > > investigation on its own (as it is likely to heavily impact performance of
> > > writev) but is irrelevant for us now - we should handle this without data
> > > corruption anyway.
> >   I've looked into why writev fails reliably the writes. The reason is that
> > iov_iter_fault_in_readable() faults in only the first IO buffer. Because
> > this is just 600 bytes big, following iov_iter_copy_from_user_atomic copies
> > only 600 bytes and block_write_end sets number of copied bytes to 0. Thus
> > we restart the write and do it one iov per iteration which succeeds. So
> > everything works as designed only it gets inefficient in this particular
> > case.
> Yep, this would be right. We could actually do more prefaulting; I
> think I was being a little over conservative and worried about earlier
> pages being unmapped before we were able to consume them... but I
> think being too worried about that case is optimizing an unusual case
> that is probably performing badly anyway at the expense of more common
> patterns.
  Yeah, IMHO optimal would be to fault in enough buffers so that we can
fill one page (although we may pose some upper bound on the number of pages
we are willing to fault in - like 1 MB of data or so).

> Anyway, what I was doing to test this code when I wrote it was to
> inject random failures into user copy functions. I guess this could
> be useful to merge in the error injection framework?
  Yes, that would be definitely useful. This was exceptionally easy to
track down because it was easily reproducible. But otherwise this path is
almost never taken and bugs in there are hard to debug so it would get more
testing coverage. I've spent like a month debugging a bug in reiserfs
causing data corruption in this path - mainly because it took a few days to
reproduce it and I didn't know what could be possibly triggering it...

								Honza

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LTP] writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
  2009-12-02 18:53             ` [LTP] " Randy Dunlap
@ 2009-12-03 15:11               ` Subrata Modak
  -1 siblings, 0 replies; 16+ messages in thread
From: Subrata Modak @ 2009-12-03 15:11 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Andreas Dilger, ltp-list, Mike Galbraith, ext4 development,
	James Y Knight, Jan Kara, npiggin

On Wed, 2009-12-02 at 10:53 -0800, Randy Dunlap wrote: 
> On Tue, 01 Dec 2009 15:02:04 -0700 Andreas Dilger wrote:
> 
> > On 2009-12-01, at 09:03, Jan Kara wrote:
> > > On Tue 01-12-09 15:35:59, Jan Kara wrote:
> > >> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> > >>> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> > >>>> On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> > >>>>> This test case is distilled from an actual application which  
> > >>>>> doesn't even intentionally use writev: it just uses C++'s  
> > >>>>> ofstream class to write data to a file. Unfortunately, that  
> > >>>>> class smart and uses writev under the covers. Unfortunately, I  
> > >>>>> guess nobody ever tests linux writev behavior, since it's broken  
> > >>>>> _so_much_of_the_time_. I really am quite astounded to see such a  
> > >>>>> bad track record for such a fundamental core system call....
> > 
> > I suspect an excellent way of exposing problems with the writev()  
> > interface would be to wire it into fsx, which is commonly run as a  
> > stress test for Linux.  I don't know if it would have caught this  
> > case, but it definitely couldn't hurt to get more testing cycles for it.
> 
> Maybe someone from LTP would be interested in adding this test functionality
> to fsx-linux ?
> 

Sure Randy,

We will do it.

Regards--
Subrata

> Source/test program is available at
> http://marc.info/?l=linux-kernel&m=125961612418323&w=2
> 
> 
> > >>  Ext4 also has this problem but delayed allocation mitigates the  
> > >> effect to an error in accounting of blocks reserved for delayed  
> > >> allocation and thus under normal circumstances nothing bad happens.
> > 
> > It looks like ext4 might still hit this problem, if delalloc is  
> > disabled.  Could you please submit a similar patch for ext4 also.
> > 
> > >  The patch below fixes the issue for me...
> > >
> > > 									Honza
> > >
> > > From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001
> > > From: Jan Kara <jack@suse.cz>
> > > Date: Tue, 1 Dec 2009 16:53:06 +0100
> > > Subject: [PATCH] ext3: Fix data / filesystem corruption when write  
> > > fails to copy data
> > >
> > > When ext3_write_begin fails after allocating some blocks or  
> > > generic_perform_write fails to copy data to write, we truncate  
> > > blocks already instantiated beyond i_size. Although these blocks  
> > > were never inside i_size, we have to truncate pagecache of these  
> > > blocks so that corresponding buffers get unmapped. Otherwise  
> > > subsequent __block_prepare_write (called because we are retrying the  
> > > write) will find the buffers mapped, not call ->get_block, and thus  
> > > the page will be backed by already freed blocks leading to  
> > > filesystem and data corruption.
> > >
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > > fs/ext3/inode.c |   18 ++++++++++++++----
> > > 1 files changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > > index 354ed3b..f9d6937 100644
> > > --- a/fs/ext3/inode.c
> > > +++ b/fs/ext3/inode.c
> > > @@ -1151,6 +1151,16 @@ static int  
> > > do_journal_get_write_access(handle_t *handle,
> > > 	return ext3_journal_get_write_access(handle, bh);
> > > }
> > >
> > > +/*
> > > + * Truncate blocks that were not used by write. We have to truncate  
> > > the
> > > + * pagecache as well so that corresponding buffers get properly  
> > > unmapped.
> > > + */
> > > +static void ext3_truncate_failed_write(struct inode *inode)
> > > +{
> > > +	truncate_inode_pages(inode->i_mapping, inode->i_size);
> > > +	ext3_truncate(inode);
> > > +}
> > > +
> > > static int ext3_write_begin(struct file *file, struct address_space  
> > > *mapping,
> > > 				loff_t pos, unsigned len, unsigned flags,
> > > 				struct page **pagep, void **fsdata)
> > > @@ -1209,7 +1219,7 @@ write_begin_failed:
> > > 		unlock_page(page);
> > > 		page_cache_release(page);
> > > 		if (pos + len > inode->i_size)
> > > -			ext3_truncate(inode);
> > > +			ext3_truncate_failed_write(inode);
> > > 	}
> > > 	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> > > 		goto retry;
> > > @@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file  
> > > *file,
> > > 	page_cache_release(page);
> > >
> > > 	if (pos + len > inode->i_size)
> > > -		ext3_truncate(inode);
> > > +		ext3_truncate_failed_write(inode);
> > > 	return ret ? ret : copied;
> > > }
> > >
> > > @@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct  
> > > file *file,
> > > 	page_cache_release(page);
> > >
> > > 	if (pos + len > inode->i_size)
> > > -		ext3_truncate(inode);
> > > +		ext3_truncate_failed_write(inode);
> > > 	return ret ? ret : copied;
> > > }
> > >
> > > @@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct  
> > > file *file,
> > > 	page_cache_release(page);
> > >
> > > 	if (pos + len > inode->i_size)
> > > -		ext3_truncate(inode);
> > > +		ext3_truncate_failed_write(inode);
> > > 	return ret ? ret : copied;
> > > }
> > >
> > > -- 
> > 
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.
> 
> 
> ---
> ~Randy
> 
> ------------------------------------------------------------------------------
> Join us December 9, 2009 for the Red Hat Virtual Experience,
> a free event focused on virtualization and cloud computing. 
> Attend in-depth sessions from your desk. Your couch. Anywhere.
> http://p.sf.net/sfu/redhat-sfdev2dev
> _______________________________________________
> Ltp-list mailing list
> Ltp-list@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ltp-list


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LTP] writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64
@ 2009-12-03 15:11               ` Subrata Modak
  0 siblings, 0 replies; 16+ messages in thread
From: Subrata Modak @ 2009-12-03 15:11 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: npiggin, ltp-list, Andreas Dilger, Jan Kara, Mike Galbraith,
	James Y Knight, ext4 development

On Wed, 2009-12-02 at 10:53 -0800, Randy Dunlap wrote: 
> On Tue, 01 Dec 2009 15:02:04 -0700 Andreas Dilger wrote:
> 
> > On 2009-12-01, at 09:03, Jan Kara wrote:
> > > On Tue 01-12-09 15:35:59, Jan Kara wrote:
> > >> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
> > >>> On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
> > >>>> On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
> > >>>>> This test case is distilled from an actual application which  
> > >>>>> doesn't even intentionally use writev: it just uses C++'s  
> > >>>>> ofstream class to write data to a file. Unfortunately, that  
> > >>>>> class smart and uses writev under the covers. Unfortunately, I  
> > >>>>> guess nobody ever tests linux writev behavior, since it's broken  
> > >>>>> _so_much_of_the_time_. I really am quite astounded to see such a  
> > >>>>> bad track record for such a fundamental core system call....
> > 
> > I suspect an excellent way of exposing problems with the writev()  
> > interface would be to wire it into fsx, which is commonly run as a  
> > stress test for Linux.  I don't know if it would have caught this  
> > case, but it definitely couldn't hurt to get more testing cycles for it.
> 
> Maybe someone from LTP would be interested in adding this test functionality
> to fsx-linux ?
> 

Sure Randy,

We will do it.

Regards--
Subrata

> Source/test program is available at
> http://marc.info/?l=linux-kernel&m=125961612418323&w=2
> 
> 
> > >>  Ext4 also has this problem but delayed allocation mitigates the  
> > >> effect to an error in accounting of blocks reserved for delayed  
> > >> allocation and thus under normal circumstances nothing bad happens.
> > 
> > It looks like ext4 might still hit this problem, if delalloc is  
> > disabled.  Could you please submit a similar patch for ext4 also.
> > 
> > >  The patch below fixes the issue for me...
> > >
> > > 									Honza
> > >
> > > From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001
> > > From: Jan Kara <jack@suse.cz>
> > > Date: Tue, 1 Dec 2009 16:53:06 +0100
> > > Subject: [PATCH] ext3: Fix data / filesystem corruption when write  
> > > fails to copy data
> > >
> > > When ext3_write_begin fails after allocating some blocks or  
> > > generic_perform_write fails to copy data to write, we truncate  
> > > blocks already instantiated beyond i_size. Although these blocks  
> > > were never inside i_size, we have to truncate pagecache of these  
> > > blocks so that corresponding buffers get unmapped. Otherwise  
> > > subsequent __block_prepare_write (called because we are retrying the  
> > > write) will find the buffers mapped, not call ->get_block, and thus  
> > > the page will be backed by already freed blocks leading to  
> > > filesystem and data corruption.
> > >
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > > fs/ext3/inode.c |   18 ++++++++++++++----
> > > 1 files changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > > index 354ed3b..f9d6937 100644
> > > --- a/fs/ext3/inode.c
> > > +++ b/fs/ext3/inode.c
> > > @@ -1151,6 +1151,16 @@ static int  
> > > do_journal_get_write_access(handle_t *handle,
> > > 	return ext3_journal_get_write_access(handle, bh);
> > > }
> > >
> > > +/*
> > > + * Truncate blocks that were not used by write. We have to truncate  
> > > the
> > > + * pagecache as well so that corresponding buffers get properly  
> > > unmapped.
> > > + */
> > > +static void ext3_truncate_failed_write(struct inode *inode)
> > > +{
> > > +	truncate_inode_pages(inode->i_mapping, inode->i_size);
> > > +	ext3_truncate(inode);
> > > +}
> > > +
> > > static int ext3_write_begin(struct file *file, struct address_space  
> > > *mapping,
> > > 				loff_t pos, unsigned len, unsigned flags,
> > > 				struct page **pagep, void **fsdata)
> > > @@ -1209,7 +1219,7 @@ write_begin_failed:
> > > 		unlock_page(page);
> > > 		page_cache_release(page);
> > > 		if (pos + len > inode->i_size)
> > > -			ext3_truncate(inode);
> > > +			ext3_truncate_failed_write(inode);
> > > 	}
> > > 	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> > > 		goto retry;
> > > @@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file  
> > > *file,
> > > 	page_cache_release(page);
> > >
> > > 	if (pos + len > inode->i_size)
> > > -		ext3_truncate(inode);
> > > +		ext3_truncate_failed_write(inode);
> > > 	return ret ? ret : copied;
> > > }
> > >
> > > @@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct  
> > > file *file,
> > > 	page_cache_release(page);
> > >
> > > 	if (pos + len > inode->i_size)
> > > -		ext3_truncate(inode);
> > > +		ext3_truncate_failed_write(inode);
> > > 	return ret ? ret : copied;
> > > }
> > >
> > > @@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct  
> > > file *file,
> > > 	page_cache_release(page);
> > >
> > > 	if (pos + len > inode->i_size)
> > > -		ext3_truncate(inode);
> > > +		ext3_truncate_failed_write(inode);
> > > 	return ret ? ret : copied;
> > > }
> > >
> > > -- 
> > 
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.
> 
> 
> ---
> ~Randy
> 
> ------------------------------------------------------------------------------
> Join us December 9, 2009 for the Red Hat Virtual Experience,
> a free event focused on virtualization and cloud computing. 
> Attend in-depth sessions from your desk. Your couch. Anywhere.
> http://p.sf.net/sfu/redhat-sfdev2dev
> _______________________________________________
> Ltp-list mailing list
> Ltp-list@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ltp-list


------------------------------------------------------------------------------
Join us December 9, 2009 for the Red Hat Virtual Experience,
a free event focused on virtualization and cloud computing. 
Attend in-depth sessions from your desk. Your couch. Anywhere.
http://p.sf.net/sfu/redhat-sfdev2dev
_______________________________________________
Ltp-list mailing list
Ltp-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ltp-list

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2009-12-03 15:30 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-30 20:55 writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64 James Y Knight
2009-12-01  0:48 ` James Y Knight
     [not found]   ` <1259667765.9614.19.camel@marge.simson.net>
2009-12-01 12:59     ` Jan Kara
2009-12-01 14:35     ` Jan Kara
2009-12-01 16:03       ` Jan Kara
2009-12-01 16:47         ` Mike Galbraith
2009-12-01 22:02         ` Andreas Dilger
2009-12-02 18:53           ` Randy Dunlap
2009-12-02 18:53             ` [LTP] " Randy Dunlap
2009-12-03 15:11             ` Subrata Modak
2009-12-03 15:11               ` Subrata Modak
2009-12-02 21:24         ` James Y Knight
2009-12-02 19:04       ` Jan Kara
2009-12-03  5:28         ` Nick Piggin
2009-12-03 10:32           ` Jan Kara
2009-12-03  5:22       ` Nick Piggin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.