* [PATCH] fix corruption case for block size < page size
@ 2008-12-13  7:07 Eric Sandeen
  2008-12-13 17:48 ` Eric Sandeen
  2008-12-13 18:20 ` Eric Sandeen
  0 siblings, 2 replies; 15+ messages in thread
From: Eric Sandeen @ 2008-12-13  7:07 UTC (permalink / raw)
  To: xfs-oss

On a 4k page system and 512-byte blocksize, this:

xfs_io \
-c "pwrite -S 0x11 -b 4096 0 4096" \
-c "mmap -r 0 512" -c "mread 0 512" -c "munmap" \
-c "truncate 256" \
-c "truncate 513" \
-c "pwrite -S 0x22 -b 512 2048 512" \
-t -d -f testfile

leads to this in the resulting file:

# 00000000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
# *
# 00000100  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
# *
# 00000400  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................| <- BAD
# *
# 00000800  22 22 22 22 22 22 22 22  22 22 22 22 22 22 22 22  |""""""""""""""""|
# *
# 00000a00

laid out like this:

 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..4]:          63..67            0 (63..67)             5

which is wrong.  :)  The 2nd series of 1's should be a hole; it is
stale data left over from the original write, which gets re-mapped
in xfs_page_state_convert().  This is largely because on the previous
truncate down, discard_buffer() leaves the discarded bh uptodate -
due to page vs. bh uptodate rules, I think.
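
For reference, discard_buffer() in fs/buffer.c looks roughly like this
(a sketch from memory of the current tree, not an exact quote) - note
that it clears nearly every bit of buffer state *except* uptodate:

static void discard_buffer(struct buffer_head *bh)
{
	lock_buffer(bh);
	clear_buffer_dirty(bh);
	bh->b_bdev = NULL;
	clear_buffer_mapped(bh);
	clear_buffer_req(bh);
	clear_buffer_new(bh);
	clear_buffer_delay(bh);
	clear_buffer_unwritten(bh);
	/* BH_Uptodate is deliberately left alone */
	unlock_buffer(bh);
}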

This all got turned up by xfsqa 091 on ppc with 64k pages & 4k blocks;
see also http://oss.sgi.com/bugzilla/show_bug.cgi?id=801.
This would hit 64k page ia64 as well.

This could probably use a bit more investigation - why, for example,
is the mmap read needed above? - but the patch below fixes the problem
for me, by recognizing an uptodate but not dirty or mapped buffer at
this stage as a hole.  At any rate such a buffer certainly doesn't
need to be written (it's not dirty)...

I previously submitted an xfsqa testcase for this too.

I've run it through qa but more soak time would probably be good.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
---

Index: xfs/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_aops.c
+++ xfs/fs/xfs/linux-2.6/xfs_aops.c
@@ -1042,6 +1042,13 @@ xfs_page_state_convert(
 			continue;
 		}
 
+		/* This means its a hole (discard_buffer leaves uptodate set) */
+		if (!buffer_dirty(bh) && !buffer_mapped(bh) &&
+		    buffer_uptodate(bh)) {
+			iomap_valid = 0;
+			continue;
+		}
+
 		if (iomap_valid)
 			iomap_valid = xfs_iomap_valid(&iomap, offset);
 


* Re: [PATCH] fix corruption case for block size < page size
  2008-12-13  7:07 [PATCH] fix corruption case for block size < page size Eric Sandeen
@ 2008-12-13 17:48 ` Eric Sandeen
  2008-12-13 18:20 ` Eric Sandeen
  1 sibling, 0 replies; 15+ messages in thread
From: Eric Sandeen @ 2008-12-13 17:48 UTC (permalink / raw)
  To: xfs-oss

Eric Sandeen wrote:

> Index: xfs/fs/xfs/linux-2.6/xfs_aops.c
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/xfs_aops.c
> +++ xfs/fs/xfs/linux-2.6/xfs_aops.c
> @@ -1042,6 +1042,13 @@ xfs_page_state_convert(
>  			continue;
>  		}
>  
> +		/* This means its a hole (discard_buffer leaves uptodate set) */
>   
I've offended cw's sense of grammar, and I did cringe when I removed
the apostrophe in the name of an 80-char line, so how about this :)

Index: xfs/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_aops.c
+++ xfs/fs/xfs/linux-2.6/xfs_aops.c
@@ -1042,6 +1042,13 @@ xfs_page_state_convert(
 			continue;
 		}
 
+		/* It's a hole. (discard_buffer leaves uptodate set) */
+		if (!buffer_dirty(bh) && !buffer_mapped(bh) &&
+		    buffer_uptodate(bh)) {
+			iomap_valid = 0;
+			continue;
+		}
+
 		if (iomap_valid)
 			iomap_valid = xfs_iomap_valid(&iomap, offset);
 


* Re: [PATCH] fix corruption case for block size < page size
  2008-12-13  7:07 [PATCH] fix corruption case for block size < page size Eric Sandeen
  2008-12-13 17:48 ` Eric Sandeen
@ 2008-12-13 18:20 ` Eric Sandeen
  2008-12-16  5:00   ` Lachlan McIlroy
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Sandeen @ 2008-12-13 18:20 UTC (permalink / raw)
  To: xfs-oss

Eric Sandeen wrote:
> On a 4k page system and 512-byte blocksize, this:
> 
> xfs_io \
> -c "pwrite -S 0x11 -b 4096 0 4096" \
> -c "mmap -r 0 512" -c "mread 0 512" -c "munmap" \
> -c "truncate 256" \
> -c "truncate 513" \
> -c "pwrite -S 0x22 -b 512 2048 512" \
> -t -d -f testfile

Not to keep belaboring the point, but if anyone reviews this, here's
a bit more info.

If I blktrace the testcase it looks like this:

8,16  0   1   0.000000000  4222  C   W 166979666 + 8 [0] 4k wr
8,16  0   2   0.000367043  4222  C   R 166979666 + 8 [0] 4k map rd
8,16  0   3   0.002923548  4222  C   N (35 00 ..) [0]
8,16  0   4   0.003108924  4222  C   W 200708307 + 9 [0] Log?(trunc)
8,16  0   5   0.020357902  4222  C   N (35 00 ..) [0]
8,16  0   6   0.020361434  4222  C   W 200708307 + 9 [0] Log?(trunc)
8,16  0   7   0.020745509  4222  C   W 166979666 + 1 [0] 512 wr @0
8,16  0   8   0.020940005  4222  C   W 166979667 + 1 [0] 512 wr @1
8,16  0   9   0.021172749  4222  C   W 166979670 + 1 [0] 512 wr @4

and a detailed look at the data on disk is this:

00000000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  Block 0(OK)
*
00000100  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  Block 0...
*
00000200  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  Block 1(OK)
*
00000400  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  Block 2(BAD)
*
00000600  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  Block 3(BAD)
*
00000800  22 22 22 22 22 22 22 22  22 22 22 22 22 22 22 22  Block 4(OK)
*
00000a00

And the bmap information is this:

 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..4]:          56..60            0 (56..60)             5

So the bad data in blocks 2 and 3 was never rewritten; the buffer
heads probably were fine (containing 0's, but I should check) and we
simply re-mapped blocks 2 and 3 back into existence, along with their
stale data, it seems.

So I think this was just a bad mapping decision, and not a buffer head
state/zeroing problem...?

-Eric


* Re: [PATCH] fix corruption case for block size < page size
  2008-12-13 18:20 ` Eric Sandeen
@ 2008-12-16  5:00   ` Lachlan McIlroy
  2008-12-16  5:40     ` Eric Sandeen
  0 siblings, 1 reply; 15+ messages in thread
From: Lachlan McIlroy @ 2008-12-16  5:00 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs-oss

Eric Sandeen wrote:
> Eric Sandeen wrote:
>> On a 4k page system and 512-byte blocksize, this:
>>
>> xfs_io \
>> -c "pwrite -S 0x11 -b 4096 0 4096" \
>> -c "mmap -r 0 512" -c "mread 0 512" -c "munmap" \
>> -c "truncate 256" \
>> -c "truncate 513" \
>> -c "pwrite -S 0x22 -b 512 2048 512" \
>> -t -d -f testfile
> 
> Not to keep belaboring the point, but if anyone reviews this here's a
> bit more info.
> 
> If I blktrace the testcase it looks like this:
> 
> 8,16  0   1   0.000000000  4222  C   W 166979666 + 8 [0] 4k wr
> 8,16  0   2   0.000367043  4222  C   R 166979666 + 8 [0] 4k map rd
> 8,16  0   3   0.002923548  4222  C   N (35 00 ..) [0]
> 8,16  0   4   0.003108924  4222  C   W 200708307 + 9 [0] Log?(trunc)
> 8,16  0   5   0.020357902  4222  C   N (35 00 ..) [0]
> 8,16  0   6   0.020361434  4222  C   W 200708307 + 9 [0] Log?(trunc)
> 8,16  0   7   0.020745509  4222  C   W 166979666 + 1 [0] 512 wr @0
> 8,16  0   8   0.020940005  4222  C   W 166979667 + 1 [0] 512 wr @1
> 8,16  0   9   0.021172749  4222  C   W 166979670 + 1 [0] 512 wr @4
> 
> and a detailed look at the data on disk is this:
> 
> 00000000  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  Block 0(OK)
> *
> 00000100  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  Block 0...
> *
> 00000200  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  Block 1(OK)
> *
> 00000400  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  Block 2(BAD)
> *
> 00000600  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  Block 3(BAD)
> *
> 00000800  22 22 22 22 22 22 22 22  22 22 22 22 22 22 22 22  Block 4(OK)
> *
> 00000a00
> 
> And the bmap information is this:
> 
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>    0: [0..4]:          56..60            0 (56..60)             5
> 
> So the bad data in blocks 2 and 3 were never rewritten; the buffer heads
> probably were fine (containing 0's, but I should check) and we simply
> re-mapped blocks 2 and 3 back into existence, along with their stale
> data, it seems.
> 
> So I think this was just a bad mapping decision, and not a buffer head
> state/zeroing problem...?
I'm still working through this, Eric, so I don't fully understand
what's going on.

It looks to me like the region was never zeroed at all.  In
xfs_zero_last_block() we only zero up to the end of the last block
(hence the name), but if the last page extends beyond that last
block we won't zero that extra space in the page.  If that remaining
space in the page sits over a hole then xfs_zero_eof() won't zero it
either.

In your example above the last write extends the file size from 513
bytes to 2048 bytes.  In xfs_zero_last_block() we'll only zero from
513 up to 1024 bytes (ie up to the end of the last block) but leave
the rest of the page untouched.  Because of the truncate to 256 bytes
only the first block is allocated and everything beyond 512 bytes is
a hole.  More specifically there is a hole under the remainder of the
page so xfs_zero_eof() will skip that region and not zero anything.
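
Roughly, the relevant part of xfs_zero_last_block() goes like this
(a simplified sketch, names approximate, not the exact source):

	/* offset of the old EOF within its fs block; zero means the
	 * size is already block aligned and nothing needs zeroing */
	zero_offset = XFS_B_FSB_OFFSET(mp, isize);
	if (zero_offset == 0)
		return 0;

	/* ... look up the last block; if it's a hole or an unwritten
	 * extent there is nothing to zero, so return ... */

	/* zero only to the end of the *block*, not the page */
	zero_len = mp->m_sb.sb_blocksize - zero_offset;
	error = xfs_iozero(ip, isize, zero_len);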

> 
> -Eric
> 

* Re: [PATCH] fix corruption case for block size < page size
  2008-12-16  5:00   ` Lachlan McIlroy
@ 2008-12-16  5:40     ` Eric Sandeen
  2008-12-16  6:05       ` Lachlan McIlroy
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Sandeen @ 2008-12-16  5:40 UTC (permalink / raw)
  To: lachlan; +Cc: xfs-oss

Lachlan McIlroy wrote:
> I'm still working through this Eric so I don't fully understand what's 
> going on.
> 
> It looks to me like the region was never zeroed at all.  In
> xfs_zero_last_block() we only zero up to the end of the last block
> (hence the name), but if the last page extends beyond that last
> block we won't zero that extra space in the page.  If that remaining
> space in the page sits over a hole then xfs_zero_eof() won't zero it
> either.

So the testcase steps are (8 blocks per page here).

1: |1111|1111|1111|1111|1111|1111|1111|1111|     write 1's

2: |RRRR|1111|1111|1111|1111|1111|1111|1111|     mapread first block

     |<--------trunc-----------------------
3: |1100|                                        trunc down to 1/2 block

 trunc-->|
4: |1100|                                        trunc up to block+1byte

5: |    |    |    |    |2222|                    write 2's (extending)

And we get:

# |1100|0000|1111|1111|2222|----|----|----|

But we should get:

# |1100|HHHH|HHHH|HHHH|2222|----|----|----|

(where "H" means hole)

So no, the data for blocks 1,2,3 is never zeroed, nor should it be, I
think - we never partially write those blocks.  And the stale 1's coming
through in blocks 2 & 3 are not the result of non-zero buffer heads
getting written to disk in the last stage; they are simply stale disk
blocks from the first step mapped back into the file.  (They were
originally written as 1's in the first step.)

> In your example above the last write extends the file size from 513
> bytes to 2048 bytes.  In xfs_zero_last_block() we'll only zero from
> 513 up to 1024 bytes (ie up to the end of the last block) but leave
> the rest of the page untouched.  

Actually, after the truncate down step (3) we should have:

     |<--------trunc-----------------------
3: |11??|                                       trunc down to 1/2 block
     ^
     |
    EOF

Hm, but does the end of this block get zeroed now or only when we
subsequently extend the size?  The latter I think...?

So I think in the next step:

 trunc-->|
4: |1100|                                        trunc up to block+1byte
      ^^
  now || this part of the block gets zeroed, right, by xfs_zero_eof?

> Because of the truncate to 256 bytes
> only the first block is allocated and everything beyond 512 bytes is
> a hole.  

Yep, up until the last write anyway.

> More specifically there is a hole under the remainder of the
> page so xfs_zero_eof() will skip that region and not zero anything.

Well, the last write (step 5) is still completely within the page...

Right, that's what it *should* be doing; but in page_state_convert (and
I'll admit to not having this 100% nailed down) we write block 1 and map
blocks 2 & 3 back into the file, and get:

# |1100|0000|1111|1111|2222|----|----|----|
             ^^^^ ^^^^
where these  |||| |||| blocks are stale data, and block 1 is written
(but at least zeroed).  How block 1 got zeroed I'm not quite certain
yet.  But it does not appear that blocks 2 and 3 get *written* at any
time other than step 1; blktrace seems to confirm this.  Block 1 does
get written, and 0s are written.  (But I don't think this block should
ever get written either; EOF landed there, but only via truncate, not
a write.)

Crap, now you've got me slightly confused again, and I'll need to look a
bit more to be sure I'm 100% clear on what's getting zeroed and when vs.
what's getting mapped and why.  :)

-Eric



* Re: [PATCH] fix corruption case for block size < page size
  2008-12-16  5:40     ` Eric Sandeen
@ 2008-12-16  6:05       ` Lachlan McIlroy
  2008-12-16  6:10         ` Eric Sandeen
  0 siblings, 1 reply; 15+ messages in thread
From: Lachlan McIlroy @ 2008-12-16  6:05 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs-oss

Eric Sandeen wrote:
> Lachlan McIlroy wrote:
>> I'm still working through this Eric so I don't fully understand what's 
>> going on.
>>
>> It looks to me like the region was never zeroed at all.  In
>> xfs_zero_last_block() we only zero up to the end of the last block
>> (hence the name), but if the last page extends beyond that last
>> block we won't zero that extra space in the page.  If that remaining
>> space in the page sits over a hole then xfs_zero_eof() won't zero it
>> either.
> 
> So the testcase steps are (8 blocks per page here).
> 
> 1: |1111|1111|1111|1111|1111|1111|1111|1111|     write 1's
> 
> 2: |RRRR|1111|1111|1111|1111|1111|1111|1111|     mapread first block
> 
>      |<--------trunc-----------------------
> 3: |1100|                                        trunc down to 1/2 block
> 
>  trunc-->|
> 4: |1100|                                        trunc up to block+1byte
> 
> 5: |    |    |    |    |2222|                    write 2's (extending)
> 
> And we get:
> 
> # |1100|0000|1111|1111|2222|----|----|----|
> 
> But we should get:
> 
> # |1100|HHHH|HHHH|HHHH|2222|----|----|----|
> 
> (where "H" means hole)
> 
> So no, the data for blocks 1,2,3 is never zeroed, nor should it be, I
> think - we never partially write those blocks.  And the stale 1's coming
> through in blocks 2 & 3 are not the result of non-zero buffer heads
> getting written to disk in the last stage; they are simply stale disk
> blocks from the first step mapped back into the file.  (They were
> originally written as 1's in the first step.)
Okay, it's making sense now.  I've actually fixed this same problem on
IRIX and our CXFS clients on Solaris, OS X and Windows, but in all
those cases when a page is dirty the whole page is dirty, so the whole
page needs to be zeroed.  With the individual states of the buffer
heads we can remap parts of pages that now cover holes.

> 
>> In your example above the last write extends the file size from 513
>> bytes to 2048 bytes.  In xfs_zero_last_block() we'll only zero from
>> 513 up to 1024 bytes (ie up to the end of the last block) but leave
>> the rest of the page untouched.  
> 
> Actually; after the truncate down step (3) we should have:
> 
>      |<--------trunc-----------------------
> 3: |11??|                                       trunc down to 1/2 block
>      ^
>      |
>     EOF
> 
> Hm, but does the end of this block get zeroed now or only when we
> subsequently extend the size?  The latter I think...?
Only when extending the file size.

> 
> So I think in the next step:
> 
>  trunc-->|
> 4: |1100|                                        trunc up to block+1byte
>       ^^
>   now || this part of the block gets zeroed, right, by xfs_zero_eof?
Yes (by xfs_zero_last_block()).

> 
>> Because of the truncate to 256 bytes
>> only the first block is allocated and everything beyond 512 bytes is
>> a hole.  
> 
> Yep, up until the last write anyway.
> 
>> More specifically there is a hole under the remainder of the
>> page so xfs_zero_eof() will skip that region and not zero anything.
> 
> Well, the last write (step 5) is still completely within the page...
> 
> Right, that's what it *should* be doing; but in page_state_convert (and
> I'll admit to not having this 100% nailed down) we write block 1 and map
> blocks 2 & 3 back into the file, and get:
> 
> # |1100|0000|1111|1111|2222|----|----|----|
>              ^^^^ ^^^^
> where these  |||| |||| blocks are stale data, and block 1 is written
> (but at least zeroed).  How block 1 got zeroed I guess I'm not quite
I think block 1 got zeroed during the last write because the file size
was extended from 513 to 2048.  Byte 513 is just inside block 1.  But
that block should have been a hole and xfs_zero_last_block() should
have skipped it.

> certain yet.  But it does not appear that blocks 2 and 3 get *written*
> any time other than step 1; blktrace seems to confirm this.  block 1
> does get written, and 0s are written.  (But I don't think this block
> ever should get written either; EOF landed there but only via truncate,
> not a write).
Agree.

> 
> Crap, now you've got me slightly confused again, and I'll need to look a
> bit more to be sure I'm 100% clear on what's getting zeroed and when vs.
> what's getting mapped and why.  :)
That makes two.

Something else to consider is that there may be allocated blocks
entirely beyond eof due to speculative allocation.  This means that just
because a block within a page is beyond eof does not mean it covers a
hole.  This is why xfs_zero_eof() looks for blocks to zero between the
old eof and the new eof.
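
i.e. the loop in xfs_zero_eof() does something like this for each
mapping between the old and new eof (simplified sketch, not the
exact source):

	while (start_zero_fsb <= end_zero_fsb) {
		nimaps = 1;
		error = xfs_bmapi(NULL, ip, start_zero_fsb,
				  zero_count_fsb, 0, NULL, 0,
				  &imap, &nimaps, NULL, NULL);
		...
		/* holes and unwritten extents need no zeroing, but a
		 * speculatively allocated block beyond eof does */
		if (imap.br_startblock == HOLESTARTBLOCK ||
		    imap.br_state == XFS_EXT_UNWRITTEN) {
			start_zero_fsb = imap.br_startoff +
					 imap.br_blockcount;
			continue;
		}
		/* ... zero this block's range via xfs_iozero() ... */
	}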


* Re: [PATCH] fix corruption case for block size < page size
  2008-12-16  6:05       ` Lachlan McIlroy
@ 2008-12-16  6:10         ` Eric Sandeen
  2008-12-16  6:21           ` Eric Sandeen
  2008-12-16  7:54           ` Lachlan McIlroy
  0 siblings, 2 replies; 15+ messages in thread
From: Eric Sandeen @ 2008-12-16  6:10 UTC (permalink / raw)
  To: lachlan; +Cc: xfs-oss

Lachlan McIlroy wrote:
> Eric Sandeen wrote:
>> Actually; after the truncate down step (3) we should have:
>>
>>      |<--------trunc-----------------------
>> 3: |11??|                                       trunc down to 1/2 block
>>      ^
>>      |
>>     EOF
>>
>> Hm, but does the end of this block get zeroed now or only when we
>> subsequently extend the size?  The latter I think...?
> Only when extending the file size.

Right.

>> So I think in the next step:
>>
>>  trunc-->|
>> 4: |1100|                                        trunc up to block+1byte
>>       ^^
>>   now || this part of the block gets zeroed, right, by xfs_zero_eof?
> Yes (by xfs_zero_last_block()).

Right.  :)  But I *think* that after this step we are actually zeroing
into block 1 (2nd block) and causing it to get zeroed/mapped.  Off by
one maybe?

>>> Because of the truncate to 256 bytes
>>> only the first block is allocated and everything beyond 512 bytes is
>>> a hole.  
>> Yep, up until the last write anyway.
>>
>>> More specifically there is a hole under the remainder of the
>>> page so xfs_zero_eof() will skip that region and not zero anything.
>> Well, the last write (step 5) is still completely within the page...
>>
>> Right, that's what it *should* be doing; but in page_state_convert (and
>> I'll admit to not having this 100% nailed down) we write block 1 and map
>> blocks 2 & 3 back into the file, and get:
>>
>> # |1100|0000|1111|1111|2222|----|----|----|
>>              ^^^^ ^^^^
>> where these  |||| |||| blocks are stale data, and block 1 is written
>> (but at least zeroed).  How block 1 got zeroed I guess I'm not quite
> I think block 1 got zeroed during the last write because the file size
> was extended from 513 to 2048.  Byte 513 is just inside block 1.  But
> that block should have been a hole and xfs_zero_last_block() should
> have skipped it.

I think the 2nd extending write does skip it but from a bit more looking
the first extending truncate might step into it by one... still looking
into that.

>> certain yet.  But it does not appear that blocks 2 and 3 get *written*
>> any time other than step 1; blktrace seems to confirm this.  block 1
>> does get written, and 0s are written.  (But I don't think this block
>> ever should get written either; EOF landed there but only via truncate,
>> not a write).
> Agree.
> 
>> Crap, now you've got me slightly confused again, and I'll need to look a
>> bit more to be sure I'm 100% clear on what's getting zeroed and when vs.
>> what's getting mapped and why.  :)
> That makes two.

:)

> Something else to consider is that there may be allocated blocks
> entirely beyond eof due to speculative allocation.  This means that just
> because a block within a page is beyond eof does not mean it covers a
> hole.  This is why xfs_zero_eof() looks for blocks to zero between the
> old eof and the new eof.

true... yeah, my test may yet be a bit naive.

-Eric


* Re: [PATCH] fix corruption case for block size < page size
  2008-12-16  6:10         ` Eric Sandeen
@ 2008-12-16  6:21           ` Eric Sandeen
  2008-12-16  6:51             ` Eric Sandeen
  2008-12-16  7:54           ` Lachlan McIlroy
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Sandeen @ 2008-12-16  6:21 UTC (permalink / raw)
  To: lachlan; +Cc: xfs-oss

Eric Sandeen wrote:
> Lachlan McIlroy wrote:
>> Eric Sandeen wrote:
>>> Actually; after the truncate down step (3) we should have:
>>>
>>>      |<--------trunc-----------------------
>>> 3: |11??|                                       trunc down to 1/2 block
>>>      ^
>>>      |
>>>     EOF
>>>
>>> Hm, but does the end of this block get zeroed now or only when we
>>> subsequently extend the size?  The latter I think...?
>> Only when extending the file size.
> 
> Right.
> 
>>> So I think in the next step:
>>>
>>>  trunc-->|
>>> 4: |1100|                                        trunc up to block+1byte
>>>       ^^
>>>   now || this part of the block gets zeroed, right, by xfs_zero_eof?
>> Yes (by xfs_zero_last_block()).
> 
> Right.  :)  But I *think* that after this step we are actually zeroing
> into block 1 (2nd block) and causing it to get zeroed/mapped.  Off by
> one maybe?
> 
>>>> Because of the truncate to 256 bytes
>>>> only the first block is allocated and everything beyond 512 bytes is
>>>> a hole.  
>>> Yep, up until the last write anyway.
>>>
>>>> More specifically there is a hole under the remainder of the
>>>> page so xfs_zero_eof() will skip that region and not zero anything.
>>> Well, the last write (step 5) is still completely within the page...
>>>
>>> Right, that's what it *should* be doing; but in page_state_convert (and
>>> I'll admit to not having this 100% nailed down) we write block 1 and map
>>> blocks 2 & 3 back into the file, and get:
>>>
>>> # |1100|0000|1111|1111|2222|----|----|----|
>>>              ^^^^ ^^^^
>>> where these  |||| |||| blocks are stale data, and block 1 is written
>>> (but at least zeroed).  How block 1 got zeroed I guess I'm not quite
>> I think block 1 got zeroed during the last write because the file size
>> was extended from 513 to 2048.  Byte 513 is just inside block 1.  But
>> that block should have been a hole and xfs_zero_last_block() should
>> have skipped it.
> 
> I think the 2nd extending write does skip it but from a bit more looking
> the first extending truncate might step into it by one... still looking
> into that.

Gah; or not.  What is going on here...  Doing just steps 1, 2, 3, 4
(ending on the extending truncate):

# xfs_io -c "pwrite -S 0x11 -b 4096 0 4096" -c "mmap -r 0 512" -c "mread
0 512" -c "munmap" -c "truncate 256" -c "truncate 514" -t -d -f
/mnt/scratch/testfile

# xfs_bmap -v /mnt/scratch/testfile
/mnt/scratch/testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..0]:          63..63            0 (63..63)             1
   1: [1..1]:          hole                                     1

It looks like what I expect, at this point.  But then:

# sync
# xfs_bmap -v /mnt/scratch/testfile
/mnt/scratch/testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..1]:          63..64            0 (63..64)             2

Um, why'd that last block get mapped in?  mmap vs. direct IO I'm
guessing... w/o the mmap read this does not happen.

-Eric (heading to bed now...)


* Re: [PATCH] fix corruption case for block size < page size
  2008-12-16  6:21           ` Eric Sandeen
@ 2008-12-16  6:51             ` Eric Sandeen
  2009-01-07  5:23               ` Lachlan McIlroy
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Sandeen @ 2008-12-16  6:51 UTC (permalink / raw)
  To: lachlan; +Cc: xfs-oss

Eric Sandeen wrote:

> Gah; or not.  what is going on here...  Doing just steps 1, 2, 3, 4
> (ending on the extending truncate):
> 
> # xfs_io -c "pwrite -S 0x11 -b 4096 0 4096" -c "mmap -r 0 512" -c "mread
> 0 512" -c "munmap" -c "truncate 256" -c "truncate 514" -t -d -f
> /mnt/scratch/testfile
> 
> # xfs_bmap -v /mnt/scratch/testfile
> /mnt/scratch/testfile:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>    0: [0..0]:          63..63            0 (63..63)             1
>    1: [1..1]:          hole                                     1
> 
> It looks like what I expect, at this point.  But then:
> 
> # sync
> # xfs_bmap -v /mnt/scratch/testfile
> /mnt/scratch/testfile:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>    0: [0..1]:          63..64            0 (63..64)             2
> 
> Um, why'd that last block get mapped in?  mmap vs. direct IO I'm
> guessing... w/o the mmap read this does not happen.

Replying to myself twice?  I really need to go to bed.

So this all does seem to come back to page_state_convert.

Both the extending write in the original case and the sync above find
their way there; but esp. in the sync test above, why do we have *any*
work to do?

With a little instrumentation I see that for the "truncate out; sync"
test above we get to xfs_vm_writepage() for a page which is *not*
dirty, and yet we call page_state_convert on it and map in that 2nd
block... Is that right!?  I guess it is; ->write_cache_pages() clears
dirty before calling writepage.  Still, why would this page be found
dirty on this path?  Bah.  Bedtime.
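
(The relevant bit of write_cache_pages() is roughly this, paraphrased
from mm/page-writeback.c, so treat it as a sketch:

	lock_page(page);
	...
	/* the dirty bit is cleared here, *before* ->writepage runs,
	 * so xfs_vm_writepage() always sees a !PageDirty page */
	if (PageWriteback(page) ||
	    !clear_page_dirty_for_io(page)) {
		unlock_page(page);
		continue;
	}
	ret = (*writepage)(page, wbc, data);

so the real question is why the page was considered dirty in the
first place.)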

-Eric


* Re: [PATCH] fix corruption case for block size < page size
  2008-12-16  6:10         ` Eric Sandeen
  2008-12-16  6:21           ` Eric Sandeen
@ 2008-12-16  7:54           ` Lachlan McIlroy
  1 sibling, 0 replies; 15+ messages in thread
From: Lachlan McIlroy @ 2008-12-16  7:54 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs-oss

Eric Sandeen wrote:
> Lachlan McIlroy wrote:
>> Eric Sandeen wrote:
>>> Actually; after the truncate down step (3) we should have:
>>>
>>>      |<--------trunc-----------------------
>>> 3: |11??|                                       trunc down to 1/2 block
>>>      ^
>>>      |
>>>     EOF
>>>
>>> Hm, but does the end of this block get zeroed now or only when we
>>> subsequently extend the size?  The latter I think...?
>> Only when extending the file size.
> 
> Right.
> 
>>> So I think in the next step:
>>>
>>>  trunc-->|
>>> 4: |1100|                                        trunc up to block+1byte
>>>       ^^
>>>   now || this part of the block gets zeroed, right, by xfs_zero_eof?
>> Yes (by xfs_zero_last_block()).
> 
> Right.  :)  But I *think* that after this step we are actually zeroing
> into block 1 (2nd block) and causing it to get zeroed/mapped.  Off by
> one maybe?
I assumed that the zeroing would stop at the new file size.  This bit of 
code in xfs_zero_eof() should ensure that:

		if ((zero_off + zero_len) > offset)
			zero_len = offset - zero_off;

The truncate down eventually calls truncate_inode_pages_range(), which
will zero the remainder of a page beyond eof.  This could be how block
1 is becoming dirty.  If I remember correctly there are checks in fsx
to ensure that, when mmapping a file whose size is not page aligned,
the space in the last page beyond eof is all zeroes (a POSIX
requirement I think, though it only holds true on initial mapping
since the area can be modified by a user).  This has to be done
somewhere, and may well be in the truncate down.
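
From memory, the partial-page handling in mm/truncate.c looks
something like this (sketch, not the exact source):

static inline void truncate_partial_page(struct page *page,
					 unsigned partial)
{
	/* zero the tail of the page beyond the new eof ... */
	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
	/* ... and invalidate the buffers under it; via the aops this
	 * ends up in block_invalidatepage() -> discard_buffer(),
	 * which clears dirty/mapped but leaves the bh uptodate */
	if (page_has_buffers(page))
		do_invalidatepage(page, partial);
}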

> 
>>>> Because of the truncate to 256 bytes
>>>> only the first block is allocated and everything beyond 512 bytes is
>>>> a hole.  
>>> Yep, up until the last write anyway.
>>>
>>>> More specifically there is a hole under the remainder of the
>>>> page so xfs_zero_eof() will skip that region and not zero anything.
>>> Well, the last write (step 5) is still completely within the page...
>>>
>>> Right, that's what it *should* be doing; but in page_state_convert (and
>>> I'll admit to not having this 100% nailed down) we write block 1 and map
>>> blocks 2 & 3 back into the file, and get:
>>>
>>> # |1100|0000|1111|1111|2222|----|----|----|
>>>              ^^^^ ^^^^
>>> where these  |||| |||| blocks are stale data, and block 1 is written
>>> (but at least zeroed).  How block 1 got zeroed I guess I'm not quite
>> I think block 1 got zeroed during the last write because the file size
>> was extended from 513 to 2048.  Byte 513 is just inside block 1.  But
>> that block should have been a hole and xfs_zero_last_block() should
>> have skipped it.
> 
> I think the 2nd extending write does skip it but from a bit more looking
> the first extending truncate might step into it by one... still looking
> into that.
Yes, true - since the new file size (513 bytes) is not fsb aligned,
the first byte in block 1 needs to be zeroed.  But this will only
happen if block 1 is allocated.  If it's a hole or an unwritten extent
then it will be skipped.

> 
>>> certain yet.  But it does not appear that blocks 2 and 3 get *written*
>>> any time other than step 1; blktrace seems to confirm this.  block 1
>>> does get written, and 0s are written.  (But I don't think this block
>>> ever should get written either; EOF landed there but only via truncate,
>>> not a write).
>> Agree.
>>
>>> Crap, now you've got me slightly confused again, and I'll need to look a
>>> bit more to be sure I'm 100% clear on what's getting zeroed and when vs.
>>> what's getting mapped and why.  :)
>> That makes two.
> 
> :)
> 
>> Something else to consider is that there may be allocated blocks
>> entirely beyond eof due to speculative allocation.  This means that just
>> because a block within a page is beyond eof does not mean it covers a
>> hole.  This is why xfs_zero_eof() looks for blocks to zero between the
>> old eof and the new eof.
> 
> true... yeah, my test may yet be a bit naive.

I used this command to trigger the problem:
fsx -r 512 -w 512 -W -Z -p 1000 file

It mixes direct I/O with mmap'ed reads.

1000 mapread    0x3800 thru     0x609c  (0x289d bytes)
2000 mapread    0x7600 thru     0x16d0e (0xf70f bytes)
READ BAD DATA: offset = 0x9a00, size = 0xa600, fname = file
OFFSET  GOOD    BAD     RANGE
0x aa00 0x0000  0x43fb  0x    0
operation# (mod 256) for the bad data may be 67
...
2126(78 mod 256): SKIPPED (no operation)
2127(79 mod 256): TRUNCATE UP   from 0xc580 to 0x26629
2128(80 mod 256): WRITE 0x2bc00 thru 0x35fff    (0xa400 bytes) HOLE
2129(81 mod 256): TRUNCATE DOWN from 0x36000 to 0x231be
2130(82 mod 256): MAPREAD       0x9e00 thru 0xd6f6      (0x38f7 bytes) ***RRRR***
2131(83 mod 256): READ  0x9a00 thru 0x13fff     (0xa600 bytes)  ***RRRR***
Correct content saved for comparison
(maybe hexdump "file" vs "file.fsxgood")

At operation 2130 the data at offset 0xaa00 is okay but on the next
operation the data at that offset is now bad.

Your patch stops this failure but frankly I don't know why.  Zeroing to
the end of the page in xfs_zero_last_block() also prevents the failure
but now I'm not so sure that makes sense.

The '-f' option to fsx (if you're using the fsx from the XFSQA suite) is
supposed to flush and invalidate pages after write operations and force
them to be read in from disk.  This is very handy for finding bugs in the
read path that would otherwise be disguised by the page cache.  But it
looks like msync(MS_INVALIDATE) doesn't work anymore - I'm not seeing 
any reads, weird.


* Re: [PATCH] fix corruption case for block size < page size
  2008-12-16  6:51             ` Eric Sandeen
@ 2009-01-07  5:23               ` Lachlan McIlroy
  2009-01-07  5:53                 ` Eric Sandeen
  0 siblings, 1 reply; 15+ messages in thread
From: Lachlan McIlroy @ 2009-01-07  5:23 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs-oss

Eric Sandeen wrote:
> Eric Sandeen wrote:
> 
>> Gah; or not.  what is going on here...  Doing just steps 1, 2, 3, 4
>> (ending on the extending truncate):
>>
>> # xfs_io -c "pwrite -S 0x11 -b 4096 0 4096" -c "mmap -r 0 512" -c "mread
>> 0 512" -c "munmap" -c "truncate 256" -c "truncate 514" -t -d -f
>> /mnt/scratch/testfile
>>
>> # xfs_bmap -v /mnt/scratch/testfile
>> /mnt/scratch/testfile:
>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>>    0: [0..0]:          63..63            0 (63..63)             1
>>    1: [1..1]:          hole                                     1
>>
>> It looks like what I expect, at this point.  But then:
>>
>> # sync
>> # xfs_bmap -v /mnt/scratch/testfile
>> /mnt/scratch/testfile:
>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>>    0: [0..1]:          63..64            0 (63..64)             2
>>
>> Um, why'd that last block get mapped in?  mmap vs. direct IO I'm
>> guessing... w/o the mmap read this does not happen.
> 
> Replying to myself twice?  I really need to go to bed.
> 
> So this all does seem to come back to page_state_convert.
> 
> Both the extending write in the original case and the sync above find
> their way there; but esp. in the sync test above, why do we have *any*
> work to do?
Eric, did you find out why sync was allocating that second block?

> 
> With a little instrumentation I see that for the truncate out; sync test
> above we get to xfs_vm_writepage() for a page which is *not* dirty, and
> yet we call page_state_convert on it and map in that 2nd block... Is
> that right!?  I guess it is; ->write_cache_pages() clears dirty before
> calling writepage.  Still why would this page be found dirty on this
> path.  Bah.  Bedtime.
> 
> -Eric


* Re: [PATCH] fix corruption case for block size < page size
  2009-01-07  5:23               ` Lachlan McIlroy
@ 2009-01-07  5:53                 ` Eric Sandeen
  2009-01-07  6:32                   ` Lachlan McIlroy
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Sandeen @ 2009-01-07  5:53 UTC (permalink / raw)
  To: lachlan; +Cc: xfs-oss

Lachlan McIlroy wrote:
> Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>
>>> Gah; or not.  what is going on here...  Doing just steps 1, 2, 3, 4
>>> (ending on the extending truncate):
>>>
>>> # xfs_io -c "pwrite -S 0x11 -b 4096 0 4096" -c "mmap -r 0 512" -c "mread
>>> 0 512" -c "munmap" -c "truncate 256" -c "truncate 514" -t -d -f
>>> /mnt/scratch/testfile
>>>
>>> # xfs_bmap -v /mnt/scratch/testfile
>>> /mnt/scratch/testfile:
>>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>>>    0: [0..0]:          63..63            0 (63..63)             1
>>>    1: [1..1]:          hole                                     1
>>>
>>> It looks like what I expect, at this point.  But then:
>>>
>>> # sync
>>> # xfs_bmap -v /mnt/scratch/testfile
>>> /mnt/scratch/testfile:
>>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>>>    0: [0..1]:          63..64            0 (63..64)             2
>>>
>>> Um, why'd that last block get mapped in?  mmap vs. direct IO I'm
>>> guessing... w/o the mmap read this does not happen.
>> Replying to myself twice?  I really need to go to bed.
>>
>> So this all does seem to come back to page_state_convert.
>>
>> Both the extending write in the original case and the sync above find
>> their way there; but esp. in the sync test above, why do we have *any*
>> work to do?
> Eric, did you find out why sync was allocating that second block?

I'm afraid this has been on the back burner (or maybe further back) for
a while... so... either "no" or "I don't remember" :)

-Eric


* Re: [PATCH] fix corruption case for block size < page size
  2009-01-07  5:53                 ` Eric Sandeen
@ 2009-01-07  6:32                   ` Lachlan McIlroy
  2009-01-07 21:42                     ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Lachlan McIlroy @ 2009-01-07  6:32 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs-oss

Eric Sandeen wrote:
> Lachlan McIlroy wrote:
>> Eric Sandeen wrote:
>>> Eric Sandeen wrote:
>>>
>>>> Gah; or not.  what is going on here...  Doing just steps 1, 2, 3, 4
>>>> (ending on the extending truncate):
>>>>
>>>> # xfs_io -c "pwrite -S 0x11 -b 4096 0 4096" -c "mmap -r 0 512" -c "mread
>>>> 0 512" -c "munmap" -c "truncate 256" -c "truncate 514" -t -d -f
>>>> /mnt/scratch/testfile
>>>>
>>>> # xfs_bmap -v /mnt/scratch/testfile
>>>> /mnt/scratch/testfile:
>>>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>>>>    0: [0..0]:          63..63            0 (63..63)             1
>>>>    1: [1..1]:          hole                                     1
>>>>
>>>> It looks like what I expect, at this point.  But then:
>>>>
>>>> # sync
>>>> # xfs_bmap -v /mnt/scratch/testfile
>>>> /mnt/scratch/testfile:
>>>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>>>>    0: [0..1]:          63..64            0 (63..64)             2
>>>>
>>>> Um, why'd that last block get mapped in?  mmap vs. direct IO I'm
>>>> guessing... w/o the mmap read this does not happen.
>>> Replying to myself twice?  I really need to go to bed.
>>>
>>> So this all does seem to come back to page_state_convert.
>>>
>>> Both the extending write in the original case and the sync above find
>>> their way there; but esp. in the sync test above, why do we have *any*
>>> work to do?
>> Eric, did you find out why sync was allocating that second block?
> 
> I'm afraid this has been on the back burner (or maybe further back) for
> a while... so... either "no" or "I don't remember" :)

Just trying your test case.  It's not related to direct I/O or mmap I/O
since I can reproduce it without those.

# xfs_io -f -c "pwrite -S 0x11 -b 513 0 513" -c "truncate 1" -c "truncate 513" file
wrote 513/513 bytes at offset 0
513.000000 bytes, 1 ops; 0.0000 sec (8.895 MiB/sec and 18181.8182 ops/sec)
# xfs_bmap -vvp file; sync; xfs_bmap -vvp file
file:
  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
    0: [0..0]:          48..48            0 (48..48)             1 00000
    1: [1..1]:          hole                                     1
  FLAG Values:
     010000 Unwritten preallocated extent
     001000 Doesn't begin on stripe unit
     000100 Doesn't end   on stripe unit
     000010 Doesn't begin on stripe width
     000001 Doesn't end   on stripe width
file:
  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
    0: [0..1]:          48..49            0 (48..49)             2 00000
  FLAG Values:
     010000 Unwritten preallocated extent
     001000 Doesn't begin on stripe unit
     000100 Doesn't end   on stripe unit
     000010 Doesn't begin on stripe width
     000001 Doesn't end   on stripe width

xfs_bmap will cause the file to be flushed, so there should be no dirty
data to be flushed during the sync.  Strange.


* Re: [PATCH] fix corruption case for block size < page size
  2009-01-07  6:32                   ` Lachlan McIlroy
@ 2009-01-07 21:42                     ` Dave Chinner
  2009-01-09  0:18                       ` Lachlan McIlroy
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2009-01-07 21:42 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: Eric Sandeen, xfs-oss

On Wed, Jan 07, 2009 at 05:32:09PM +1100, Lachlan McIlroy wrote:
> Eric Sandeen wrote:
> > Lachlan McIlroy wrote:
> >> Eric Sandeen wrote:
> >>> Eric Sandeen wrote:
> >>>
> >>>> Gah; or not.  what is going on here...  Doing just steps 1, 2, 3, 4
> >>>> (ending on the extending truncate):
> >>>>
> >>>> # xfs_io -c "pwrite -S 0x11 -b 4096 0 4096" -c "mmap -r 0 512" -c "mread
> >>>> 0 512" -c "munmap" -c "truncate 256" -c "truncate 514" -t -d -f
> >>>> /mnt/scratch/testfile
> >>>>
> >>>> # xfs_bmap -v /mnt/scratch/testfile
> >>>> /mnt/scratch/testfile:
> >>>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
> >>>>    0: [0..0]:          63..63            0 (63..63)             1
> >>>>    1: [1..1]:          hole                                     1
> >>>>
> >>>> It looks like what I expect, at this point.  But then:
> >>>>
> >>>> # sync
> >>>> # xfs_bmap -v /mnt/scratch/testfile
> >>>> /mnt/scratch/testfile:
> >>>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
> >>>>    0: [0..1]:          63..64            0 (63..64)             2
> >>>>
> >>>> Um, why'd that last block get mapped in?  mmap vs. direct IO I'm
> >>>> guessing... w/o the mmap read this does not happen.
> >>> Replying to myself twice?  I really need to go to bed.
> >>>
> >>> So this all does seem to come back to page_state_convert.
> >>>
> >>> Both the extending write in the original case and the sync above find
> >>> their way there; but esp. in the sync test above, why do we have *any*
> >>> work to do?
> >> Eric, did you find out why sync was allocating that second block?
> > 
> > I'm afraid this has been on the back burner (or maybe further back) for
> > a while... so... either "no" or "I don't remember" :)
> 
> Just trying your test case.  It's not related to direct I/O or mmap I/O
> since I can reproduce it without those.
> 
> # xfs_io -f -c "pwrite -S 0x11 -b 513 0 513" -c "truncate 1" -c "truncate 513" file
> wrote 513/513 bytes at offset 0
> 513.000000 bytes, 1 ops; 0.0000 sec (8.895 MiB/sec and 18181.8182 ops/sec)
> # xfs_bmap -vvp file; sync; xfs_bmap -vvp file
> file:
>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>     0: [0..0]:          48..48            0 (48..48)             1 00000
>     1: [1..1]:          hole                                     1
....
> file:
>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>     0: [0..1]:          48..49            0 (48..49)             2 00000
....
> xfs_bmap will cause the file to be flushed so there should be no dirty
> data to be flushed during the sync.  Strange.

Yeah, I noticed that as well. AFAICT it is because the state on
the second buffer has been trashed by the discard but it has been
left uptodate. The result is that xfs_flushpages() doesn't think
that there is anything to write out, and the bmap code finds no
extent on that range so it is considered a hole.

However, when the VFS comes along and writes the dirty inode it
thinks that the page is dirty and writes it.
xfs_page_state_convert() then sees the second buffer as uptodate but
unmapped and within EOF so it allocates the block and writes it out.

IIRC, the code in xfs_page_state_convert() did this allocation to
catch mmap() writes into holes. We have ->page_mkwrite to catch this
now and turn them into delalloc writes, so perhaps this code path is
no longer needed?
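
The check in question looks something like this (paraphrasing
xfs_page_state_convert(), not quoting it exactly):

	if (buffer_unwritten(bh) || buffer_delay(bh) ||
	    ((buffer_uptodate(bh) || PageUptodate(page)) &&
	     !buffer_mapped(bh))) {
		/*
		 * Treated as a write into a hole: allocate real
		 * blocks and write them out - which is exactly what
		 * maps the stale blocks back into the file here.
		 */
		...
	}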

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fix corruption case for block size < page size
  2009-01-07 21:42                     ` Dave Chinner
@ 2009-01-09  0:18                       ` Lachlan McIlroy
  0 siblings, 0 replies; 15+ messages in thread
From: Lachlan McIlroy @ 2009-01-09  0:18 UTC (permalink / raw)
  To: Lachlan McIlroy, Eric Sandeen, xfs-oss

Dave Chinner wrote:
> On Wed, Jan 07, 2009 at 05:32:09PM +1100, Lachlan McIlroy wrote:
>> Eric Sandeen wrote:
>>> Lachlan McIlroy wrote:
>>>> Eric Sandeen wrote:
>>>>> Eric Sandeen wrote:
>>>>>
>>>>>> Gah; or not.  what is going on here...  Doing just steps 1, 2, 3, 4
>>>>>> (ending on the extending truncate):
>>>>>>
>>>>>> # xfs_io -c "pwrite -S 0x11 -b 4096 0 4096" -c "mmap -r 0 512" -c "mread
>>>>>> 0 512" -c "munmap" -c "truncate 256" -c "truncate 514" -t -d -f
>>>>>> /mnt/scratch/testfile
>>>>>>
>>>>>> # xfs_bmap -v /mnt/scratch/testfile
>>>>>> /mnt/scratch/testfile:
>>>>>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>>>>>>    0: [0..0]:          63..63            0 (63..63)             1
>>>>>>    1: [1..1]:          hole                                     1
>>>>>>
>>>>>> It looks like what I expect, at this point.  But then:
>>>>>>
>>>>>> # sync
>>>>>> # xfs_bmap -v /mnt/scratch/testfile
>>>>>> /mnt/scratch/testfile:
>>>>>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
>>>>>>    0: [0..1]:          63..64            0 (63..64)             2
>>>>>>
>>>>>> Um, why'd that last block get mapped in?  mmap vs. direct IO I'm
>>>>>> guessing... w/o the mmap read this does not happen.
>>>>> Replying to myself twice?  I really need to go to bed.
>>>>>
>>>>> So this all does seem to come back to page_state_convert.
>>>>>
>>>>> Both the extending write in the original case and the sync above find
>>>>> their way there; but esp. in the sync test above, why do we have *any*
>>>>> work to do?
>>>> Eric, did you find out why sync was allocating that second block?
>>> I'm afraid this has been on the back burner (or maybe further back) for
>>> a while... so... either "no" or "I don't remember" :)
>> Just trying your test case.  It's not related to direct I/O or mmap I/O
>> since I can reproduce it without those.
>>
>> # xfs_io -f -c "pwrite -S 0x11 -b 513 0 513" -c "truncate 1" -c "truncate 513" file
>> wrote 513/513 bytes at offset 0
>> 513.000000 bytes, 1 ops; 0.0000 sec (8.895 MiB/sec and 18181.8182 ops/sec)
>> # xfs_bmap -vvp file; sync; xfs_bmap -vvp file
>> file:
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>     0: [0..0]:          48..48            0 (48..48)             1 00000
>>     1: [1..1]:          hole                                     1
> ....
>> file:
>>   EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>     0: [0..1]:          48..49            0 (48..49)             2 00000
> ....
>> xfs_bmap will cause the file to be flushed so there should be no dirty
>> data to be flushed during the sync.  Strange.
> 
> Yeah, I noticed that as well. AFAICT it is because the state on
> on the second buffer has been trashed by the discard but it has been
> left uptodate. The result is that xfs_flushpages() doesn't think
> that there is anything to write out and the bmap code finds no
> extent on that range so іt is considered a hole.
> 
> However, when the VFS comes along and writes the dirty inode it
> thinks that the page is dirty and writes it.
> xfs_page_state_convert() then sees the second buffer as uptodate but
> unmapped and within EOF so it allocates the block and writes it out.
> 
> IIRC, the code in xfs_page_state_convert() did this allocation to
> catch mmap() writes into holes. We have ->page_mkwrite to catch this
> now and turn them into delalloc writes, so perhaps this code path is
> no longer needed anymore?

Thanks Dave.

Eric, I tried your patch again and it fixed this problem, but from the
sound of Dave's description it may be a case of removing code rather
than adding another check.

